File formats

We read and write a lot of CSV and JSON files. Their format should be consistent.

JSON

Input

In most cases, simply use the standard library.

import json

with open(path) as f:
    data = json.load(f)

If (and only if) the code must support Python 3.5 or earlier, where dict does not preserve insertion order, use:

import json
from collections import OrderedDict

with open(path) as f:
    data = json.load(f, object_pairs_hook=OrderedDict)

For critical paths involving small files, use orjson.

Note

We can switch to the Python bindings for simdjson: either pysimdjson or libpy_simdjson. For JSON documents with known structures, JSON Link is fastest, but the files relevant to us have unknown structures.

For large files, use the same techniques as OCDS Kit: stream input using ijson, stream output using iterencode, and postpone evaluation using iterators. See its brief tutorial on streaming, and reuse its default method.

Note

ijson uses Yajl. simdjson is limited to files smaller than 4 GB and has no streaming API.
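
The streaming output technique can be sketched with the standard library alone. This is a minimal illustration, not OCDS Kit's actual code: `StreamedList` and `write_json_stream` are names made up here, and the approach assumes that `JSONEncoder.iterencode` uses the pure-Python encoder, which checks that a list is non-empty and then iterates over it.

```python
import io
import itertools
import json


class StreamedList(list):
    """Hypothetical helper: wraps an iterator so json's encoder treats it as a list."""

    def __init__(self, iterable):
        super().__init__()
        iterator = iter(iterable)
        try:
            # Peek at the first item, so that an empty iterator encodes as [].
            self._head = [next(iterator)]
            # Storing the iterator makes the list non-empty, which the encoder checks.
            self.append(iterator)
        except StopIteration:
            self._head = []

    def __iter__(self):
        # Yield the peeked item, then the rest of the iterator, lazily.
        return itertools.chain(self._head, *self[:1])


def write_json_stream(items, f):
    """Write an iterable of JSON-serializable items as a JSON array, chunk by chunk."""
    encoder = json.JSONEncoder(ensure_ascii=False, indent=2)
    for chunk in encoder.iterencode(StreamedList(items)):
        f.write(chunk)
    f.write('\n')


# Usage: the generator is consumed one item at a time while writing.
buffer = io.StringIO()
write_json_stream(({'id': i} for i in range(3)), buffer)
```

Because the items come from a generator, the full list never has to exist in memory. In real code, `buffer` would be a file opened for writing.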

Output

Indent with 2 spaces, use UTF-8 characters instead of ASCII escape sequences, and preserve the order of object pairs. Example:

import json

with open(path, 'w') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
    f.write('\n')
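
To see what these settings produce, the following sketch serializes to a string instead of a file. The sample data is made up for illustration.

```python
import json

# Hypothetical sample data, for illustration only.
data = {"name": "Café", "items": [1, 2]}

text = json.dumps(data, ensure_ascii=False, indent=2) + '\n'
print(text, end='')
# {
#   "name": "Café",
#   "items": [
#     1,
#     2
#   ]
# }

# With the default ensure_ascii=True, non-ASCII characters are escaped instead.
escaped = json.dumps(data)
print(escaped)
# {"name": "Caf\u00e9", "items": [1, 2]}
```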

CSV

Input

import csv

# Open with newline='' so the csv module handles line endings itself.
with open(path, newline='') as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames
    rows = list(reader)

Output

Use LF (\n) as the line terminator. Example:

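import csv

# newline='' prevents the platform from translating \n (e.g. to \r\n on Windows).
with open(path, 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)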
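
The lineterminator argument matters because the csv module writes CRLF (\r\n) by default. The sketch below compares the two outputs in memory. The `to_csv` helper and the sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical sample data, for illustration only.
fieldnames = ['id', 'name']
rows = [{'id': '1', 'name': 'Café'}, {'id': '2', 'name': 'a,b'}]


def to_csv(rows, fieldnames, **kwargs):
    """Serialize rows to a CSV string, passing extra options to DictWriter."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames, **kwargs)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


default_output = to_csv(rows, fieldnames)
lf_output = to_csv(rows, fieldnames, lineterminator='\n')

print(repr(default_output))
# 'id,name\r\n1,Café\r\n2,"a,b"\r\n'
print(repr(lf_output))
# 'id,name\n1,Café\n2,"a,b"\n'

# Reading the LF output back yields the original rows.
round_trip = list(csv.DictReader(io.StringIO(lf_output)))
```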