PostgreSQL
==========
.. note::
The `Deploy `__ documentation covers many topics relating to PostgreSQL, including: `configuration `__, `maintenance `__, and `usage by clients `__. This page addresses topics relating to software development.
Connect to a database
---------------------
Connect to the database using a connection string stored in the ``DATABASE_URL`` environment variable.
In Python, connect to the database using `dj-database-url `__ if using :doc:`Django<../python/django>`, or `psycopg2 `__ otherwise.
To set the search path for a PostgreSQL connection, append to the connection string:
.. code-block:: none
:caption: psycopg2
?options=-csearch_path%3Dmyschema,public
.. code-block:: none
:caption: dj-database-url
?currentSchema=myschema,public
Identify the client
-------------------
Database administrators need to identify the sources of queries, in order to notify developers of inefficient queries or alert users whose queries will be interrupted by maintenance. For example:
- Django applications set the `application_name `__ query string parameter in the PostgreSQL `connection URI `__, or use a service-specific user
- Kingfisher Summarize uses the `psycopg2 `__ package, and adds ``/* kingfisher-summarize {identifier} */`` as a comment to expensive queries
- Kingfisher Colab uses the `ipython-sql `__ package, and adds the Google Colaboratory notebook URL as a comment to all queries
Define tables
-------------
- In PostgreSQL, use ``TEXT`` instead of other character types, as there is `no performance difference `__.
- Use ``NOT NULL`` with character types, `as recommended by Django `__.
- Use ``NOT NULL`` with JSON types, and set the default to an empty object, array or string.
- In Python, do not set default values to ``{}`` or ``[]``. In Django, use ``default=dict`` and ``default=list``. In Pydantic (including SQLModel), use ``default_factory=dict`` and ``default_factory=list``.
- JSON data often exceeds 2kb, and is therefore `TOASTed `__. If the application needs to SELECT a value from the JSON data, it is faster to extract that value to a column: for example, ``release_date``.
.. seealso::
:ref:`Django models`
Name conventions
~~~~~~~~~~~~~~~~
- Timestamp columns: ``created_at``, ``updated_at`` and ``deleted_at``. (Some projects use ``created`` and ``modified``.)
Define indexes
--------------
- Add an index for every foreign key with a corresponding ``JOIN`` query. If the ``JOIN`` and/or ``WHERE`` clauses use multiple columns of the same table, create a multi-column index, with the most used column as the index's first column.
- Use `EXPLAIN `__ to figure out why a query is slow. It could be due to a missing index (sequential scan), an unused index, or a slower plan (for example, using index scan instead of bitmap index scan)
.. note::
When using a tool like `Dalibo `__ or `pgMustard `__, follow these `instructions `__ to get the query plan. Otherwise, if you don't know how slow the query is, omit the ``ANALYZE`` and ``BUFFERS`` options, to only plan and not execute the query.
Load (or dump) data
-------------------
Use the `\copy `__ meta-command instead of the `COPY `__ command, so that file accessibility and privileges are those of the user, not the server – such that no SQL superuser privileges are required.
.. _sql-statements:
Construct SQL statements
------------------------
.. tip::
Add `SQLFluff `__ to `pre-commit `__ if a project contains SQL files.
.. seealso::
`GitLab SQL Style Guide `__
Follow `best practices `__ to avoid accidental errors and `SQL injection `__. The code samples below use the psycopg2 Python package.
- `Pass parameters to SQL queries `__, using the second argument to the ``execute`` method. This adapts the Python value's type (like ``bool``, ``int``, ``str``) to the correct SQL representation:
.. code-block:: python
cur.execute("SELECT * FROM release WHERE release_date > %(date)s", {'date': '2020-01-01'})
**DO NOT** use string interpolation (``%``):
.. code-block:: python
cur.execute("SELECT * FROM release WHERE release_date > '%(date)s'" % {'date': '2020-01-01'}) # WRONG
**DO NOT** use string concatenation (``+``):
.. code-block:: python
cur.execute("SELECT * FROM release WHERE release_date > '" + '2020-01-01' + "'") # WRONG
**AVOID** using literal values:
.. code-block:: python
cur.execute("SELECT * FROM release WHERE release_date > '2020-01-01'") # AVOID
For example, if you forget that dates are represented as strings in SQL, you might do the following, which evaluates ``2020-12-31`` to ``1977``, which will match everything in the database:
.. code-block:: python
cur.execute("SELECT * FROM release WHERE release_date > 2020-12-31") # BROKEN
- Use named placeholders like ``%(collection_id)s``. This allows you to use the same placeholder multiple times in the query, while only having to pass a single parameter, and to edit and re-order your query without re-ordering your parameters.
.. code-block:: python
cur.execute("""
SELECT * FROM release WHERE collection_id = %(collection_id)s
UNION
SELECT * FROM record WHERE collection_id = %(collection_id)s AND ocid = %(ocid)s
""", {'collection_id': 1, 'ocid': 'ocds-213czf-1'})
**AVOID** use anonymous placeholders (``%s``):
.. code-block:: python
cur.execute("""
SELECT * FROM release WHERE collection_id = %s
UNION
SELECT * FROM record WHERE collection_id = %s AND ocid = %s
""", (1, 1, 'ocds-213czf-1')) # AVOID
- If you are writing a query template in which you want to substitute column names or table names, use the ``format`` method and the ``SQL`` and ``Identifier`` classes (`documentation `__):
.. code-block:: python
from psycopg2.sql import SQL, Identifier
cur.execute(SQL("SELECT * FROM {table}").format(table=Identifier('collection')))
You can use this together with passing parameters:
.. code-block:: python
cur.execute(SQL("SELECT * FROM {table} WHERE id = %(id)s").format(
table=Identifier('collection')), {'id': 1})
Remember to format the ``SQL()`` object. **DO NOT** format the string itself:
.. code-block:: python
cur.execute(SQL("SELECT * FROM {table} WHERE id = %(id)s".format(
table='collection'), {'id': 1}) # WRONG
**DO NOT** use string interpolation (``%``):
.. code-block:: python
cur.execute("SELECT * FROM %s" % 'collection') # WRONG
**DO NOT** use string concatenation (``+``):
.. code-block:: python
cur.execute("SELECT * FROM " + 'collection') # WRONG
**AVOID** using anonymous placeholders:
.. code-block:: python
cur.execute(SQL("SELECT * FROM {}".format('collection')) # AVOID
Paginate rows
~~~~~~~~~~~~~
Do not use ``LIMIT`` with ``OFFSET``. ``OFFSET`` becomes more inefficient as its value increases. Instead, filter on the table's primary key, which has near-constant performance. For example:
.. code-block:: sql
SELECT id, mycolumn
FROM mytable
WHERE
id > %s
AND myfilter = %s
ORDER BY id
LIMIT 1000
Format code
-----------
Format SQL files with `SQLFluff `__.
.. code-block:: toml
:caption: pyproject.toml
[tool.sqlfluff.core]
dialect = "postgres"
exclude_rules = ["ST07"]
[tool.sqlfluff.rules.capitalisation.keywords]
capitalisation_policy = "upper"
[tool.sqlfluff.rules.capitalisation.literals]
capitalisation_policy = "upper"
[tool.sqlfluff.rules.capitalisation.functions]
extended_capitalisation_policy = "lower"
[tool.sqlfluff.rules.capitalisation.identifiers]
extended_capitalisation_policy = "lower"
[tool.sqlfluff.rules.capitalisation.types]
extended_capitalisation_policy = "lower"
[tool.sqlfluff.rules.convention.casting_style]
preferred_type_casting_style = "shorthand"
[tool.sqlfluff.rules.convention.not_equal]
preferred_not_equal_style = "ansi"
.. _postgresql-erd:
Generate entity relationship diagram
------------------------------------
#. Install `SchemaSpy `__
#. Download the `PostgreSQL JDBC Driver `__
#. Rename the JAR files to ``schemaspy.jar`` and ``postgresql.jar``
#. Move the JAR files to a preferred location
Run SchemaSpy, using appropriate values for the ``-db`` (database name), ``-s`` (schema, optional), ``-u`` (user) and ``-p`` (password, optional) arguments:
.. code-block:: bash
java -jar schemaspy.jar -t pgsql -dp postgresql.jar -o schemaspy -norows -host localhost -db DATABASE -s SCHEMA -u USER -p PASSWORD
Use either the ``schemaspy/diagrams/summary/relationships.real.compact.png`` or ``schemaspy/diagrams/summary/relationships.real.large.png`` file and check the ``schemaspy/diagrams/orphans/`` directory.
Reference
---------
- `Improve slow queries `__