PostgreSQL¶
Note
The Deploy documentation covers many topics relating to PostgreSQL, including: configuration, maintenance, and usage by clients. This page addresses topics relating to software development.
Connect to a database¶
Connect to the database using a connection string stored in the DATABASE_URL environment variable.
In Python, connect to the database using dj-database-url for Django projects, or psycopg2 otherwise.
To set the search path for a PostgreSQL connection, append to the connection string:

- psycopg2 (libpq-style URI):

  ?options=-csearch_path%3Dmyschema,public

- JDBC:

  ?currentSchema=myschema,public
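The psycopg2 case can be sketched in Python. This is a minimal illustration, not a prescribed helper: the fallback connection string is hypothetical, and in practice DATABASE_URL is set by the environment.

```python
import os
from urllib.parse import urlencode

# Hypothetical fallback; in practice, DATABASE_URL is set by the environment.
database_url = os.environ.get(
    "DATABASE_URL", "postgresql://user:pass@localhost:5432/mydb"
)

# Append the libpq options parameter to set the search path.
# safe="," keeps the schema list readable, matching the documented form.
separator = "&" if "?" in database_url else "?"
params = urlencode({"options": "-csearch_path=myschema,public"}, safe=",")
database_url = f"{database_url}{separator}{params}"

print(database_url)
```

The resulting string can be passed directly to psycopg2.connect().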
Identify the client¶
Database administrators need to identify the sources of queries, in order to notify developers of inefficient queries or alert users whose queries will be interrupted by maintenance. For example:
- Django applications set the application_name query string parameter in the PostgreSQL connection URI, or use a service-specific user.
- Kingfisher Summarize uses the psycopg2 package, and adds /* kingfisher-summarize {identifier} */ as a comment to expensive queries.
- Kingfisher Colab uses the ipython-sql package, and adds the Google Colaboratory notebook URL as a comment to all queries.
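The comment approach can be sketched as follows. The function name, identifier and query are illustrative, not part of any of the projects above:

```python
# A minimal sketch of tagging a query with an identifying comment, in the
# style Kingfisher Summarize uses. Names here are hypothetical.
def tag_query(sql: str, identifier: str) -> str:
    """Prefix a SQL statement with a comment naming the query's source."""
    return f"/* kingfisher-summarize {identifier} */ {sql}"

tagged = tag_query("SELECT count(*) FROM release", "field-counts")
print(tagged)
```

The comment survives into pg_stat_activity and server logs, which is what lets administrators attribute the query to its source.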
Define tables¶
- In PostgreSQL, use TEXT instead of other character types, as there is no performance difference.
- Use NOT NULL with character types, as recommended by Django.
- Use NOT NULL with JSON types, and set the default to an empty object, array or string.
- In Python, do not set default values to {} or []. In Django, use default=dict and default=list. In Pydantic (including SQLModel), use default_factory=dict and default_factory=list.
- JSON data often exceeds 2kb, and is therefore TOASTed. If the application needs to SELECT a value from the JSON data, it is faster to extract that value to a column: for example, release_date.
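The reason to avoid {} and [] as defaults can be shown with standard-library dataclasses, whose default_factory behaves like Django's default=dict and Pydantic's default_factory=dict. The model and field names below are illustrative:

```python
from dataclasses import dataclass, field

# Python evaluates a default value once, so default={} would make every
# instance share the same dict. default_factory creates a new object per
# instance, which is the behavior you want.
@dataclass
class Release:  # hypothetical model
    data: dict = field(default_factory=dict)

a = Release()
b = Release()
a.data["ocid"] = "ocds-213czf-1"

print(b.data)  # b is unaffected, because each instance gets its own dict
```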
Name conventions¶
- Timestamp columns: created_at, updated_at and deleted_at. (Some projects use created and modified.)
Define indexes¶
- Add an index for every foreign key with a corresponding JOIN query. If the JOIN and/or WHERE clauses use multiple columns of the same table, create a multi-column index, with the most used column as the index's first column.
- Use EXPLAIN to figure out why a query is slow. It could be due to a missing index (sequential scan), an unused index, or a slower plan (for example, using an index scan instead of a bitmap index scan).
Note
When using a tool like Dalibo or pgMustard, follow these instructions to get the query plan. Otherwise, if you don't know how slow the query is, omit the ANALYZE and BUFFERS options, to only plan and not execute the query.
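As an illustration (the table and column are hypothetical), the two forms of the command look like:

```sql
-- Executes the query and reports actual timings and buffer usage:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM release WHERE collection_id = 1;

-- Only plans the query, without executing it:
EXPLAIN SELECT * FROM release WHERE collection_id = 1;
```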
Load (or dump) data¶
Use the \copy meta-command instead of the COPY command, so that file accessibility and privileges are those of the user, not the server, such that no SQL superuser privileges are required.
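For example, in psql (the table and file names are hypothetical):

```
\copy release FROM 'release.csv' WITH (FORMAT csv, HEADER)
\copy release TO 'release.csv' WITH (FORMAT csv, HEADER)
```

Because \copy is a psql meta-command, the file is read or written by the psql client with the invoking user's permissions, whereas COPY reads or writes on the server.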
Construct SQL statements¶
Tip
Add SQLFluff to pre-commit if a project contains SQL files.
Follow best practices to avoid accidental errors and SQL injection. The code samples below use the psycopg2 Python package.
Pass parameters to SQL queries, using the second argument to the execute method. This adapts the Python value's type (like bool, int, str) to the correct SQL representation:

cur.execute("SELECT * FROM release WHERE release_date > %(date)s", {'date': '2020-01-01'})
DO NOT use string interpolation (%):

cur.execute("SELECT * FROM release WHERE release_date > '%(date)s'" % {'date': '2020-01-01'})  # WRONG
DO NOT use string concatenation (+):

cur.execute("SELECT * FROM release WHERE release_date > '" + '2020-01-01' + "'")  # WRONG
AVOID using literal values:
cur.execute("SELECT * FROM release WHERE release_date > '2020-01-01'") # AVOID
For example, if you forget that dates are represented as strings in SQL, you might do the following, which evaluates 2020-12-31 to 1977, which will match everything in the database:

cur.execute("SELECT * FROM release WHERE release_date > 2020-12-31")  # BROKEN
Use named placeholders like %(collection_id)s. This allows you to use the same placeholder multiple times in the query, while only having to pass a single parameter, and to edit and re-order your query without re-ordering your parameters.

cur.execute("""
    SELECT * FROM release WHERE collection_id = %(collection_id)s
    UNION
    SELECT * FROM record WHERE collection_id = %(collection_id)s AND ocid = %(ocid)s
""", {'collection_id': 1, 'ocid': 'ocds-213czf-1'})
AVOID using anonymous placeholders (%s):

cur.execute("""
    SELECT * FROM release WHERE collection_id = %s
    UNION
    SELECT * FROM record WHERE collection_id = %s AND ocid = %s
""", (1, 1, 'ocds-213czf-1'))  # AVOID
If you are writing a query template in which you want to substitute column names or table names, use the format method and the SQL and Identifier classes (documentation):

from psycopg2.sql import SQL, Identifier

cur.execute(SQL("SELECT * FROM {table}").format(table=Identifier('collection')))
You can use this together with passing parameters:

cur.execute(SQL("SELECT * FROM {table} WHERE id = %(id)s").format(table=Identifier('collection')), {'id': 1})
Remember to format the SQL() object. DO NOT format the string itself:

cur.execute(SQL("SELECT * FROM {table} WHERE id = %(id)s".format(table='collection')), {'id': 1})  # WRONG
DO NOT use string interpolation (%):

cur.execute("SELECT * FROM %s" % 'collection')  # WRONG
DO NOT use string concatenation (+):

cur.execute("SELECT * FROM " + 'collection')  # WRONG
AVOID using anonymous placeholders:

cur.execute(SQL("SELECT * FROM {}").format(Identifier('collection')))  # AVOID
Paginate rows¶
Do not use LIMIT with OFFSET. OFFSET becomes more inefficient as its value increases. Instead, filter on the table's primary key, which has near-constant performance. For example:
SELECT id, mycolumn
FROM mytable
WHERE
id > %s
AND myfilter = %s
ORDER BY id
LIMIT 1000
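The pagination loop above can be sketched in Python. This sketch uses the standard-library sqlite3 module in place of PostgreSQL so it is self-contained (sqlite uses ? placeholders where psycopg2 uses %s); the table, filter and row counts are illustrative:

```python
import sqlite3

# Build a throwaway table: 3000 rows, half matching the filter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, myfilter TEXT)")
conn.executemany(
    "INSERT INTO mytable (id, myfilter) VALUES (?, ?)",
    [(i, "yes" if i % 2 else "no") for i in range(1, 3001)],
)

page_size = 1000
last_id = 0  # start before the first row
pages = []
while True:
    # Keyset pagination: resume after the last seen primary key.
    rows = conn.execute(
        "SELECT id FROM mytable WHERE id > ? AND myfilter = ? ORDER BY id LIMIT ?",
        (last_id, "yes", page_size),
    ).fetchall()
    if not rows:
        break
    pages.append(rows)
    last_id = rows[-1][0]

print([len(page) for page in pages])  # → [1000, 500]
```

Each page's WHERE id > ? condition is satisfied by an index scan on the primary key, so page 100 costs about the same as page 1, unlike OFFSET.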
Format code¶
Format SQL files with SQLFluff, using a configuration like the following in pyproject.toml:
[tool.sqlfluff.core]
dialect = "postgres"
exclude_rules = ["ST07"]
[tool.sqlfluff.rules.capitalisation.keywords]
capitalisation_policy = "upper"
[tool.sqlfluff.rules.capitalisation.literals]
capitalisation_policy = "upper"
[tool.sqlfluff.rules.capitalisation.functions]
extended_capitalisation_policy = "lower"
[tool.sqlfluff.rules.capitalisation.identifiers]
extended_capitalisation_policy = "lower"
[tool.sqlfluff.rules.capitalisation.types]
extended_capitalisation_policy = "lower"
[tool.sqlfluff.rules.convention.casting_style]
preferred_type_casting_style = "shorthand"
[tool.sqlfluff.rules.convention.not_equal]
preferred_not_equal_style = "ansi"
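With that configuration in place, SQLFluff's lint and fix subcommands can be run from the project root (the path is illustrative):

```
sqlfluff lint queries/   # report rule violations
sqlfluff fix queries/    # apply safe fixes in place
```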
Generate entity relationship diagram¶
1. Install SchemaSpy
2. Download the PostgreSQL JDBC Driver
3. Rename the JAR files to schemaspy.jar and postgresql.jar
4. Move the JAR files to a preferred location
5. Run SchemaSpy, using appropriate values for the -db (database name), -s (schema, optional), -u (user) and -p (password, optional) arguments:

java -jar schemaspy.jar -t pgsql -dp postgresql.jar -o schemaspy -norows -host localhost -db DATABASE -s SCHEMA -u USER -p PASSWORD
Use either the schemaspy/diagrams/summary/relationships.real.compact.png or schemaspy/diagrams/summary/relationships.real.large.png file, and check the schemaspy/diagrams/orphans/ directory.