PostgreSQL

Note

The Deploy documentation covers many topics relating to PostgreSQL, including configuration, maintenance, and usage by clients. This page addresses topics relating to software development.

Connect to a database

Connect to the database using a connection string stored in the DATABASE_URL environment variable.

In Python, connect to the database using dj-database-url if using Django, or psycopg2 otherwise.

To set the search path for a PostgreSQL connection, append to the connection string:

  • psycopg2: ?options=-csearch_path%3Dmyschema,public

  • dj-database-url: ?currentSchema=myschema,public
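
For example, a minimal sketch of connecting from Python with psycopg2 (the credentials, database name and schema in the URL are placeholders):

import os

import psycopg2

# Hypothetical value, with the search path appended as shown above:
# DATABASE_URL=postgresql://user:password@localhost:5432/mydb?options=-csearch_path%3Dmyschema,public
connection = psycopg2.connect(os.environ["DATABASE_URL"])

# In a Django project, dj-database-url reads the same environment variable:
# import dj_database_url
# DATABASES = {"default": dj_database_url.config()}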

Identify the client

Database administrators need to identify the sources of queries, in order to notify developers of inefficient queries or alert users whose queries will be interrupted by maintenance. For example:

  • Django applications set the application_name query string parameter in the PostgreSQL connection URI, or use a service-specific user

  • Kingfisher Summarize uses the psycopg2 package, and adds /* kingfisher-summarize {identifier} */ as a comment to expensive queries

  • Kingfisher Colab uses the ipython-sql package, and adds the Google Colaboratory notebook URL as a comment to all queries
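
A minimal sketch of both approaches with psycopg2 (the application name, comment and query are hypothetical):

import os

import psycopg2

# Identify the client in pg_stat_activity via application_name.
connection = psycopg2.connect(os.environ["DATABASE_URL"], application_name="myapp")

with connection.cursor() as cur:
    # Annotate an expensive query with a comment, so that administrators can
    # trace it back to its source.
    cur.execute("/* myapp collection-1 */ SELECT count(*) FROM release")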

Define tables

  • In PostgreSQL, use TEXT instead of other character types, as there is no performance difference.

  • Use NOT NULL with character types, as recommended by Django.

  • Use NOT NULL with JSON types, and set the default to an empty object, array or string.

  • In Python, do not set default values to {} or []. In Django, use default=dict and default=list. In Pydantic (including SQLModel), use default_factory=dict and default_factory=list.

  • JSON data often exceeds 2 kB, and is therefore TOASTed. If the application needs to SELECT a value from the JSON data, it is faster to extract that value to a column: for example, release_date.

See also

Django models

Name conventions

  • Timestamp columns: created_at, updated_at and deleted_at. (Some projects use created and modified.)
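
A minimal Django model sketch illustrating the conventions above (the model and field names are hypothetical):

from django.db import models


class Release(models.Model):
    # TextField maps to TEXT and is NOT NULL by default (null=False).
    ocid = models.TextField()
    # NOT NULL JSON column with an empty-object default (default=dict, not default={}).
    data = models.JSONField(default=dict)
    # A value that queries often SELECT, extracted from the (possibly TOASTed) JSON data.
    release_date = models.TextField(blank=True, default="")
    # Timestamp column name conventions.
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)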

Define indexes

  • Add an index for every foreign key with a corresponding JOIN query. If the JOIN and/or WHERE clauses use multiple columns of the same table, create a multi-column index, with the most used column as the index’s first column.

  • Use EXPLAIN to figure out why a query is slow. It could be due to a missing index (sequential scan), an unused index, or a slower plan (for example, an index scan instead of a bitmap index scan). See the sketch after the note below.

    Note

    When using a tool like Dalibo or pgMustard, follow the tool’s instructions to get the query plan. Otherwise, if you don’t know how slow the query is, omit the ANALYZE and BUFFERS options, to only plan (and not execute) the query.
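
A minimal sketch of both points with psycopg2 (the table, column and index names are hypothetical):

import os

import psycopg2

connection = psycopg2.connect(os.environ["DATABASE_URL"])
with connection, connection.cursor() as cur:
    # Multi-column index for a foreign key that is joined and filtered together
    # with another column, with the most used column first.
    cur.execute("CREATE INDEX idx_release_collection_date ON release (collection_id, release_date)")

    # Plan the query without executing it (add the ANALYZE and BUFFERS options
    # only if executing the query is acceptable).
    cur.execute("EXPLAIN SELECT * FROM release WHERE collection_id = %(collection_id)s", {"collection_id": 1})
    for (line,) in cur.fetchall():
        print(line)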

Load (or dump) data

Use the psql \copy meta-command instead of the COPY command, so that file accessibility and privileges are those of the user, not the server – such that no SQL superuser privileges are required.
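
From Python, a similar client-side load can be sketched with psycopg2’s copy_expert, which streams a local file to the server over COPY ... FROM STDIN (the table and file names are hypothetical):

import os

import psycopg2

connection = psycopg2.connect(os.environ["DATABASE_URL"])
with connection, connection.cursor() as cur, open("releases.csv") as f:
    # The file is read by the client and streamed to the server, so only the
    # client's file permissions matter - as with \copy.
    cur.copy_expert("COPY release (ocid, release_date) FROM STDIN WITH CSV HEADER", f)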

Construct SQL statements

Tip

Add SQLFluff to pre-commit if a project contains SQL files.

Follow best practices to avoid accidental errors and SQL injection. The code samples below use the psycopg2 Python package.

  • Pass parameters to SQL queries, using the second argument to the execute method. This adapts the Python value’s type (like bool, int, str) to the correct SQL representation:

    cur.execute("SELECT * FROM release WHERE release_date > %(date)s", {'date': '2020-01-01'})
    

    DO NOT use string interpolation (%):

    cur.execute("SELECT * FROM release WHERE release_date > '%(date)s'" % {'date': '2020-01-01'})  # WRONG
    

    DO NOT use string concatenation (+):

    cur.execute("SELECT * FROM release WHERE release_date > '" + '2020-01-01' + "'")  # WRONG
    

    AVOID using literal values:

    cur.execute("SELECT * FROM release WHERE release_date > '2020-01-01'")  # AVOID
    

    For example, if you forget that dates are represented as strings in SQL, you might write the following, in which 2020-12-31 is evaluated as integer arithmetic to 1977, so the condition matches everything in the database:

    cur.execute("SELECT * FROM release WHERE release_date > 2020-12-31")  # BROKEN
    
  • Use named placeholders like %(collection_id)s. This allows you to use the same placeholder multiple times in the query, while only having to pass a single parameter, and to edit and re-order your query without re-ordering your parameters.

    cur.execute("""
        SELECT * FROM release WHERE collection_id = %(collection_id)s
        UNION
        SELECT * FROM record WHERE collection_id = %(collection_id)s AND ocid = %(ocid)s
    """, {'collection_id': 1, 'ocid': 'ocds-213czf-1'})
    

    AVOID using anonymous placeholders (%s):

    cur.execute("""
        SELECT * FROM release WHERE collection_id = %s
        UNION
        SELECT * FROM record WHERE collection_id = %s AND ocid = %s
    """, (1, 1, 'ocds-213czf-1'))  # AVOID
    
  • If you are writing a query template in which you want to substitute column names or table names, use the format method and the SQL and Identifier classes (see the psycopg2.sql documentation):

    from psycopg2.sql import SQL, Identifier
    
    cur.execute(SQL("SELECT * FROM {table}").format(table=Identifier('collection')))
    

    You can use this together with passing parameters:

    cur.execute(SQL("SELECT * FROM {table} WHERE id = %(id)s").format(
        table=Identifier('collection')), {'id': 1})
    

    Remember to format the SQL() object. DO NOT format the string itself:

    cur.execute(SQL("SELECT * FROM {table} WHERE id = %(id)s".format(
        table='collection')), {'id': 1})  # WRONG
    

    DO NOT use string interpolation (%):

    cur.execute("SELECT * FROM %s" % 'collection')  # WRONG
    

    DO NOT use string concatenation (+):

    cur.execute("SELECT * FROM " + 'collection')  # WRONG
    

    AVOID using anonymous placeholders:

    cur.execute(SQL("SELECT * FROM {}").format(Identifier('collection')))  # AVOID
    

Paginate rows

Do not use LIMIT with OFFSET. OFFSET becomes more inefficient as its value increases. Instead, filter on the table’s primary key, which has near-constant performance. For example:

SELECT id, mycolumn
FROM mytable
WHERE
    id > %s
    AND myfilter = %s
ORDER BY id
LIMIT 1000
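
A minimal sketch of paginating through all rows with this pattern, using psycopg2 (the table, columns and filter value are hypothetical):

import os

import psycopg2

connection = psycopg2.connect(os.environ["DATABASE_URL"])
last_id = 0
while True:
    with connection.cursor() as cur:
        cur.execute("""
            SELECT id, mycolumn
            FROM mytable
            WHERE id > %(last_id)s AND myfilter = %(myfilter)s
            ORDER BY id
            LIMIT 1000
        """, {"last_id": last_id, "myfilter": "value"})
        rows = cur.fetchall()
    if not rows:
        break
    for row in rows:
        ...  # process the row
    last_id = rows[-1][0]  # the largest id seen so far becomes the next cursor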

Format code

Format SQL files with SQLFluff.

pyproject.toml
[tool.sqlfluff.core]
dialect = "postgres"
exclude_rules = ["ST07"]

[tool.sqlfluff.rules.capitalisation.keywords]
capitalisation_policy = "upper"

[tool.sqlfluff.rules.capitalisation.literals]
capitalisation_policy = "upper"

[tool.sqlfluff.rules.capitalisation.functions]
extended_capitalisation_policy = "lower"

[tool.sqlfluff.rules.capitalisation.identifiers]
extended_capitalisation_policy = "lower"

[tool.sqlfluff.rules.capitalisation.types]
extended_capitalisation_policy = "lower"

[tool.sqlfluff.rules.convention.casting_style]
preferred_type_casting_style = "shorthand"

[tool.sqlfluff.rules.convention.not_equal]
preferred_not_equal_style = "ansi"

Generate entity relationship diagram

  1. Install SchemaSpy

  2. Download the PostgreSQL JDBC Driver

  3. Rename the JAR files to schemaspy.jar and postgresql.jar

  4. Move the JAR files to a preferred location

Run SchemaSpy, using appropriate values for the -db (database name), -s (schema, optional), -u (user) and -p (password, optional) arguments:

java -jar schemaspy.jar -t pgsql -dp postgresql.jar -o schemaspy -norows -host localhost -db DATABASE -s SCHEMA -u USER -p PASSWORD

Use either the schemaspy/diagrams/summary/relationships.real.compact.png or schemaspy/diagrams/summary/relationships.real.large.png file and check the schemaspy/diagrams/orphans/ directory.
