PostgreSQL

Note

The Deploy documentation covers many topics relating to PostgreSQL, including configuration, maintenance, and usage by clients. This page addresses topics relating to software development.

Connect to a database

Connect to the database using a connection string stored in the DATABASE_URL environment variable.

In Python, connect to the database using dj-database-url if using Django, or psycopg2 otherwise.

To set the search path for a PostgreSQL connection, append to the connection string:

  • psycopg2: ?options=-csearch_path%3Dmyschema,public

  • dj-database-url: ?currentSchema=myschema,public
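
For example, a minimal sketch of connecting from Python with psycopg2 (the credentials, database name and schema in the URL are placeholders):

import os

import psycopg2

# Hypothetical value, with the search path appended as shown above:
# DATABASE_URL=postgresql://user:password@localhost:5432/mydb?options=-csearch_path%3Dmyschema,public
connection = psycopg2.connect(os.environ["DATABASE_URL"])

# In a Django project, dj-database-url reads the same environment variable:
# import dj_database_url
# DATABASES = {"default": dj_database_url.config()}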

Identify the client

Database administrators need to identify the sources of queries, in order to notify developers of inefficient queries or alert users whose queries will be interrupted by maintenance. For example:

  • Django applications set the application_name query string parameter in the PostgreSQL connection URI, or use a service-specific user

  • Kingfisher Summarize uses the psycopg2 package, and adds /* kingfisher-summarize {identifier} */ as a comment to expensive queries

  • Kingfisher Colab uses the ipython-sql package, and adds the Google Colaboratory notebook URL as a comment to all queries
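
A minimal sketch of both approaches with psycopg2 (the application name, comment and query are hypothetical):

import os

import psycopg2

# Identify the client in pg_stat_activity via application_name.
connection = psycopg2.connect(os.environ["DATABASE_URL"], application_name="myapp")

with connection.cursor() as cur:
    # Annotate an expensive query with a comment, so that administrators can
    # trace it back to its source.
    cur.execute("/* myapp collection-1 */ SELECT count(*) FROM release")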

Define tables

  • In PostgreSQL, use TEXT instead of other character types, as there is no performance difference.

  • Use NOT NULL with character types, as recommended by Django.

  • Use NOT NULL with JSON types, and set the default to an empty object, array or string.

  • In Python, do not set default values to {} or []. In Django, use default=dict and default=list. In Pydantic (including SQLModel), use default_factory=dict and default_factory=list.

  • JSON data often exceeds 2 kB, and is therefore TOASTed. If the application needs to SELECT a value from the JSON data, it is faster to extract that value to a column: for example, release_date.

See also

Django models

Name conventions

  • Timestamp columns: created_at, updated_at and deleted_at. (Some projects use created and modified.)
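
A minimal Django model sketch illustrating the conventions above (the model and field names are hypothetical):

from django.db import models


class Release(models.Model):
    # TextField maps to TEXT and is NOT NULL by default (null=False).
    ocid = models.TextField()
    # NOT NULL JSON column with an empty-object default (default=dict, not default={}).
    data = models.JSONField(default=dict)
    # A value that queries often SELECT, extracted from the (possibly TOASTed) JSON data.
    release_date = models.TextField(blank=True, default="")
    # Timestamp column name conventions.
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)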

Define indexes

  • Add an index for every foreign key with a corresponding JOIN query. If the JOIN and/or WHERE clauses use multiple columns of the same table, create a multi-column index, with the most used column as the index’s first column.

  • Use EXPLAIN to figure out why a query is slow. It could be due to a missing index (sequential scan), an unused index, or a slower plan (for example, an index scan instead of a bitmap index scan). See the sketch after the note below.

    Note

    When using a tool like Dalibo or pgMustard, follow the tool’s instructions to get the query plan. Otherwise, if you don’t know how slow the query is, omit the ANALYZE and BUFFERS options, to only plan (and not execute) the query.
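
A minimal sketch of both points with psycopg2 (the table, column and index names are hypothetical):

import os

import psycopg2

connection = psycopg2.connect(os.environ["DATABASE_URL"])
with connection, connection.cursor() as cur:
    # Multi-column index for a foreign key that is joined and filtered together
    # with another column, with the most used column first.
    cur.execute("CREATE INDEX idx_release_collection_date ON release (collection_id, release_date)")

    # Plan the query without executing it (add the ANALYZE and BUFFERS options
    # only if executing the query is acceptable).
    cur.execute("EXPLAIN SELECT * FROM release WHERE collection_id = %(collection_id)s", {"collection_id": 1})
    for (line,) in cur.fetchall():
        print(line)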

Load (or dump) data

Use the psql \copy meta-command instead of the COPY command, so that file accessibility and privileges are those of the user, not the server – such that no SQL superuser privileges are required.
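
From Python, a similar client-side load can be sketched with psycopg2’s copy_expert, which streams a local file to the server over COPY ... FROM STDIN (the table and file names are hypothetical):

import os

import psycopg2

connection = psycopg2.connect(os.environ["DATABASE_URL"])
with connection, connection.cursor() as cur, open("releases.csv") as f:
    # The file is read by the client and streamed to the server, so only the
    # client's file permissions matter - as with \copy.
    cur.copy_expert("COPY release (ocid, release_date) FROM STDIN WITH CSV HEADER", f)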

Construct SQL statements

Tip

Add SQLFluff to pre-commit if a project contains SQL files.

Follow best practices to avoid accidental errors and SQL injection. The code samples below use the psycopg2 Python package.

  • Pass parameters to SQL queries, using the second argument to the execute method. This adapts the Python value’s type (like bool, int, str) to the correct SQL representation:

    cur.execute("SELECT * FROM release WHERE release_date > %(date)s", {'date': '2020-01-01'})
    

    DO NOT use string interpolation (%):

    cur.execute("SELECT * FROM release WHERE release_date > '%(date)s'" % {'date': '2020-01-01'})  # WRONG
    

    DO NOT use string concatenation (+):

    cur.execute("SELECT * FROM release WHERE release_date > '" + '2020-01-01' + "'")  # WRONG
    

    AVOID using literal values:

    cur.execute("SELECT * FROM release WHERE release_date > '2020-01-01'")  # AVOID
    

    For example, if you forget that dates are represented as strings in SQL, you might write the following, in which 2020-12-31 is evaluated as integer arithmetic to 1977, so the condition matches everything in the database:

    cur.execute("SELECT * FROM release WHERE release_date > 2020-12-31")  # BROKEN
    
  • Use named placeholders like %(collection_id)s. This allows you to use the same placeholder multiple times in the query, while only having to pass a single parameter, and to edit and re-order your query without re-ordering your parameters.

    cur.execute("""
        SELECT * FROM release WHERE collection_id = %(collection_id)s
        UNION
        SELECT * FROM record WHERE collection_id = %(collection_id)s AND ocid = %(ocid)s
    """, {'collection_id': 1, 'ocid': 'ocds-213czf-1'})
    

    AVOID using anonymous placeholders (%s):

    cur.execute("""
        SELECT * FROM release WHERE collection_id = %s
        UNION
        SELECT * FROM record WHERE collection_id = %s AND ocid = %s
    """, (1, 1, 'ocds-213czf-1'))  # AVOID
    
  • If you are writing a query template in which you want to substitute column names or table names, use the format method and the SQL and Identifier classes (see the psycopg2.sql documentation):

    from psycopg2.sql import SQL, Identifier
    
    cur.execute(SQL("SELECT * FROM {table}").format(table=Identifier('collection')))
    

    You can use this together with passing parameters:

    cur.execute(SQL("SELECT * FROM {table} WHERE id = %(id)s").format(
        table=Identifier('collection')), {'id': 1})
    

    Remember to format the SQL() object. DO NOT format the string itself:

    cur.execute(SQL("SELECT * FROM {table} WHERE id = %(id)s".format(
        table='collection')), {'id': 1})  # WRONG
    

    DO NOT use string interpolation (%):

    cur.execute("SELECT * FROM %s" % 'collection')  # WRONG
    

    DO NOT use string concatenation (+):

    cur.execute("SELECT * FROM " + 'collection')  # WRONG
    

    AVOID using anonymous placeholders:

    cur.execute(SQL("SELECT * FROM {}").format(Identifier('collection')))  # AVOID
    

Paginate rows

Do not use LIMIT with OFFSET. OFFSET becomes more inefficient as its value increases. Instead, filter on the table’s primary key, which has near-constant performance. For example:

SELECT id, mycolumn
FROM mytable
WHERE
    id > %s
    AND myfilter = %s
ORDER BY id
LIMIT 1000
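
A minimal sketch of paginating through all rows with this pattern, using psycopg2 (the table, columns and filter value are hypothetical):

import os

import psycopg2

connection = psycopg2.connect(os.environ["DATABASE_URL"])
last_id = 0
while True:
    with connection.cursor() as cur:
        cur.execute("""
            SELECT id, mycolumn
            FROM mytable
            WHERE id > %(last_id)s AND myfilter = %(myfilter)s
            ORDER BY id
            LIMIT 1000
        """, {"last_id": last_id, "myfilter": "value"})
        rows = cur.fetchall()
    if not rows:
        break
    for row in rows:
        ...  # process the row
    last_id = rows[-1][0]  # the largest id seen so far becomes the next cursor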

Format code

Format SQL files with SQLFluff.

pyproject.toml
[tool.sqlfluff.core]
dialect = "postgres"
exclude_rules = ["ST07"]

[tool.sqlfluff.rules.capitalisation.keywords]
capitalisation_policy = "upper"

[tool.sqlfluff.rules.capitalisation.literals]
capitalisation_policy = "upper"

[tool.sqlfluff.rules.capitalisation.functions]
extended_capitalisation_policy = "lower"

[tool.sqlfluff.rules.capitalisation.identifiers]
extended_capitalisation_policy = "lower"

[tool.sqlfluff.rules.capitalisation.types]
extended_capitalisation_policy = "lower"

[tool.sqlfluff.rules.convention.casting_style]
preferred_type_casting_style = "shorthand"

[tool.sqlfluff.rules.convention.not_equal]
preferred_not_equal_style = "ansi"

Generate entity relationship diagram

  1. Install SchemaSpy

  2. Download the PostgreSQL JDBC Driver

  3. Rename the JAR files to schemaspy.jar and postgresql.jar

  4. Move the JAR files to a preferred location

Run SchemaSpy, using appropriate values for the -db (database name), -s (schema, optional), -u (user) and -p (password, optional) arguments:

java -jar schemaspy.jar -t pgsql -dp postgresql.jar -o schemaspy -norows -host localhost -db DATABASE -s SCHEMA -u USER -p PASSWORD

Use either the schemaspy/diagrams/summary/relationships.real.compact.png or schemaspy/diagrams/summary/relationships.real.large.png file and check the schemaspy/diagrams/orphans/ directory.
