PostgreSQL#

Note

The Deploy documentation covers many topics relating to PostgreSQL, including: configuration, maintenance, and usage by clients. This page addresses topics relating to software development.

Connect to a database#

Connect to the database using a connection string stored in the DATABASE_URL environment variable.

In Python, connect to the database using dj-database-url if using Django, or psycopg2.connect() otherwise.

To set the search path for a PostgreSQL connection, append to the connection string:

psycopg2#
?options=-csearch_path%3Dmyschema,public
dj-database-url#
?currentSchema=myschema,public

Identify the client#

Database administrators need to identify the sources of queries, in order to notify developers of inefficient queries or alert users whose queries will be interrupted by maintenance. For example:

  • Django applications set the application_name query string parameter in the PostgreSQL connection URI, or use a service-specific user

  • Kingfisher Summarize uses the psycopg2 package, and adds /* kingfisher-summarize {identifier} */ as a comment to expensive queries

  • Kingfisher Colab uses the ipython-sql package, and adds the Google Colaboratory notebook URL as a comment to all queries

Construct SQL statements#

Follow best practices to avoid accidental errors and SQL injection. The code samples below use the psycopg2 Python package.

  • Pass parameters to SQL queries, using the second argument to the execute method. This adapts the Python value’s type (like bool, int, str) to the correct SQL representation:

    cur.execute("SELECT * FROM data WHERE data->>'date' > %(date)s", {'date': '2020-01-01'})
    

    DO NOT use string interpolation (%):

    cur.execute("SELECT * FROM data WHERE data->>'date' > '%(date)s'" % {'date': '2020-01-01'})  # WRONG
    

    DO NOT use string concatenation (+):

    cur.execute("SELECT * FROM data WHERE data->>'date' > '" + '2020-01-01' + "'")  # WRONG
    

    AVOID using literal values:

    cur.execute("SELECT * FROM data WHERE data->>'date' > '2020-01-01'")  # AVOID
    

    For example, if you forget that dates are represented as strings in SQL, you might do the following, which evaluates 2020-12-31 to 1977, which will match everything in the database:

    cur.execute("SELECT * FROM data WHERE data->>'date' > 2020-12-31")  # BROKEN
    
  • Use named placeholders like %(collection_id)s. This allows you to use the same placeholder multiple times in the query, while only having to pass a single parameter, and to edit and re-order your query without re-ordering your parameters.

    cur.execute("""
        SELECT * FROM release WHERE collection_id = %(collection_id)s
        UNION
        SELECT * FROM record WHERE collection_id = %(collection_id)s AND ocid = %(ocid)s
    """, {'collection_id': 1, 'ocid': 'ocds-213czf-1'})
    

    AVOID use anonymous placeholders (%s):

    cur.execute("""
        SELECT * FROM release WHERE collection_id = %s
        UNION
        SELECT * FROM record WHERE collection_id = %s AND ocid = %s
    """, (1, 1, 'ocds-213czf-1'))  # AVOID
    
  • If you are writing a query template in which you want to substitute column names or table names, use the format method and the SQL and Identifier classes (documentation):

    from psycopg2.sql import SQL, Identifier
    
    cur.execute(SQL("SELECT * FROM {table}").format(table=Identifier('collection')))
    

    You can use this together with passing parameters:

    cur.execute(SQL("SELECT * FROM {table} WHERE id = %(id)s").format(
        table=Identifier('collection')), {'id': 1})
    

    Remember to format the SQL() object. DO NOT format the string itself:

    cur.execute(SQL("SELECT * FROM {table} WHERE id = %(id)s".format(
        table='collection'), {'id': 1})  # WRONG
    

    DO NOT use string interpolation (%):

    cur.execute("SELECT * FROM %s" % 'collection')  # WRONG
    

    DO NOT use string concatenation (+):

    cur.execute("SELECT * FROM " + 'collection')  # WRONG
    

    AVOID using anonymous placeholders:

    cur.execute(SQL("SELECT * FROM {}".format('collection'))  # AVOID
    

Generate Entity Relationship Diagram#

  1. Install SchemaSpy

  2. Download the PostgreSQL JDBC Driver

  3. Rename the JAR files to schemaspy.jar and postgresql.jar

  4. Move the JAR files to a preferred location

Run SchemaSpy, using appropriate values for the -db (database name), -s (schema, optional), -u (user) and -p (password, optional) arguments:

java -jar schemaspy.jar -t pgsql -dp postgresql.jar -host localhost -db DATABASE -s SCHEMA -u USER -p PASSWORD -o schemaspy -norows

Use either the schemaspy/diagrams/summary/relationships.real.compact.png or schemaspy/diagrams/summary/relationships.real.large.png file and check the schemaspy/diagrams/orphans/ directory.