Don't let dicts spoil your code


How often do your simple prototypes or ad-hoc scripts turn into fully-fledged applications? For me, it happens all the time.

Growing code organically in Python is easy, but this simplicity has a flip side: if you are not careful, the code becomes hard to maintain down the road. One of the most apparent signs of trouble is the proliferation of dicts as primary data structures. I admit that I have used dicts instead of proper data structures far too often myself.

What’s wrong with dicts?

Dicts are opaque
Functions that accept dicts are a nightmare to extend and modify. Usually, to change the function that takes a dictionary, you need to manually trace the calls back to the roots, where this dict was created. Consider yourself lucky if you have only one call path. Quite often, there are more, and if a program grows without a plan, you’ll likely have discrepancies in the dict structures.

Dicts are mutable
It is tempting to change dict values to fit a specific workflow, and programmers often abuse this flexibility. In-place mutations go by many names: pre-processing, populating, enriching, data massage, etc. The result is the same: the manipulation obscures the structure of your data and couples it to the workflow of your application.

Not only do dicts allow you to change their data, but they also allow you to change the very structure of objects. You can add or delete fields or change their types at will. Resorting to this is the worst felony you can commit against your data.
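To illustrate (a hypothetical sketch; the function and field names are my own invention), an innocent-looking "enrichment" step can silently change the shape of a dict:

```python
def enrich(user: dict) -> None:
    # "Enrichment" quietly changes the structure: one key disappears,
    # a new one appears, and an existing field changes type from str to list.
    user["is_admin"] = user.pop("role", "user") == "admin"
    user["tags"] = user.get("tags", "").split(",")

user = {"name": "alice", "role": "admin", "tags": "dev,ops"}
enrich(user)
# Downstream code now sees a different shape than the one that was created.
```

Any function called after `enrich()` depends on whether this mutation has already happened, which is exactly the workflow coupling described above.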

For data structures inside the application, use Data Classes. Treat dicts as the wire format

A common source of dicts in the code is deserializing from JSON, for example, from a third-party API response.

>>> requests.get("https://api.github.com/repos/imankulov/empty").json()
{'id': 297081773,
 'node_id': 'MDEwOlJlcG9zaXRvcnkyOTcwODE3NzM=',
 'name': 'empty',
 'full_name': 'imankulov/empty',
 'private': False,
...
}

A dict, returned from the API.

Make a habit of treating dicts as a “wire format” and convert them immediately to data structures providing semantics. Starting from version 3.7, Python provides Data Classes out of the box.

On top of semantic clarity, Data Classes provide a natural layer that decouples the exterior architecture from your application’s business logic. In Domain-Driven Design (DDD), this pattern is known as the anti-corruption layer.

Example

Two implementations of a function retrieving repository info from GitHub:

🚫 Returning a dict

import requests

def get_repo(repo_name: str):
    """Return repository info by its name."""
    return requests.get(f"https://api.github.com/repos/{repo_name}").json()

The output of the function is opaque and needlessly verbose. The format is defined outside of your code.

>>> get_repo("imankulov/empty")
{'id': 297081773,
 'node_id': 'MDEwOlJlcG9zaXRvcnkyOTcwODE3NzM=',
 'name': 'empty',
 'full_name': 'imankulov/empty',
 'private': False,
 # Dozens of lines with unnecessary attributes, URLs, etc.
 # ...
}

✅ Returning a Data Class

from dataclasses import dataclass

import requests

@dataclass(frozen=True)
class GitHubRepo:
    """GitHub repository."""
    owner: str
    name: str
    description: str

    def full_name(self) -> str:
        """Get the full repository name."""
        return f"{self.owner}/{self.name}"

def get_repo(repo_name: str) -> GitHubRepo:
    """Return repository info by its name."""
    data = requests.get(f"https://api.github.com/repos/{repo_name}").json()
    return GitHubRepo(data["owner"]["login"], data["name"], data["description"])
>>> get_repo("imankulov/empty")
GitHubRepo(owner='imankulov', name='empty', description='An empty repository')

While this example has more code, I argue that this solution is better than the previous one if we plan to keep maintaining and extending the codebase.

Let’s see what the differences are.

  • The data structure is clearly defined, and we can document it with as many details as necessary.
  • The class also has a method full_name() implementing some class-specific business logic. Unlike dicts, Data Classes act as natural wrappers, allowing you to co-locate code and data.
  • All attributes are read-only, which brings peace of mind to developers reading and maintaining your code.
  • The model has an automatic, readable, and compact __repr__().

More importantly, the GitHub API’s dependency is isolated in the function get_repo(), providing an anti-corruption layer and separation of concerns. The GitHubRepo object doesn’t need to know anything about the external API and how objects are created. This way, you can modify the deserializer independently from the model or add new ways of creating objects: from pytest fixtures, the GraphQL API, the local cache, etc.

In many cases, you can and should ignore most of the fields coming from the API, adding only the fields that the application uses. Not only is duplicating the fields a waste of time, but it also makes the class structure rigid, making it hard to adapt to changes in the business logic or to support a new version of the API. From the point of view of testing, fewer fields mean fewer headaches in instantiating the objects.
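For example, with only three fields, a test helper stays trivial (make_repo is a hypothetical fixture, not part of the code above):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GitHubRepo:
    """GitHub repository (same shape as in the example above)."""
    owner: str
    name: str
    description: str

def make_repo(**overrides) -> GitHubRepo:
    """Build a GitHubRepo with sensible defaults, overridable per test."""
    fields = {"owner": "imankulov", "name": "empty", "description": "An empty repository"}
    fields.update(overrides)
    return GitHubRepo(**fields)

repo = make_repo(name="demo")
```

Had the class mirrored every API field, each test would have to fabricate dozens of irrelevant values.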

I almost always define my Data Classes as frozen. Instead of modifying an object, I create a new instance with dataclasses.replace().
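A minimal sketch of that pattern, reusing the GitHubRepo shape from the example above:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class GitHubRepo:
    owner: str
    name: str
    description: str

repo = GitHubRepo("imankulov", "empty", "An empty repository")

# replace() builds a new instance with the changed field;
# the original object stays untouched.
renamed = replace(repo, name="not-empty")
```

Attempting `repo.name = "not-empty"` instead would raise a FrozenInstanceError, which is the point: accidental mutation becomes impossible.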

For key-value stores, annotate dicts as mappings

A legitimate use-case of dict is a key-value store where all the values have the same type, and keys are used to look up the value by key.

colors = {
    "red": "#FF0000",
    "pink": "#FFC0CB",
    "purple": "#800080",
}

A dict, used as a mapping.

When instantiating such a dict or passing it to a function, consider hiding the implementation details by annotating the variable type as Mapping or MutableMapping. On the one hand, it may sound like overkill: dict is the default and by far the most common implementation of a MutableMapping. On the other hand, by annotating a variable as a mapping, you specify the types of its keys and values. Besides, with a Mapping annotation, you send a clear message that the object is supposed to be immutable.

Example

I defined a color mapping. Note that I imported Mapping from typing. If you get the error “TypeError: ‘ABCMeta’ object is not subscriptable”, you probably accidentally imported it from collections.abc instead.

Then I annotated a function. Notice how the function uses the operation that is allowed for dicts but disallowed for Mapping instances.

# file: colors.py
from typing import Mapping

colors: Mapping[str, str] = {
    "red": "#FF0000",
    "pink": "#FFC0CB",
    "purple": "#800080",
}

def add_yellow(colors: Mapping[str, str]):
    colors["yellow"] = "#FFFF00"

if __name__ == "__main__":
    add_yellow(colors)
    print(colors)

Despite the wrong types, there are no issues at runtime.

1
2
$ python colors.py
{'red': '#FF0000', 'pink': '#FFC0CB', 'purple': '#800080', 'yellow': '#FFFF00'}

To check the validity, I can use mypy, which raises an error.

1
2
3
$ mypy colors.py
colors.py:11: error: Unsupported target for indexed assignment ("Mapping[str, str]")
Found 1 error in 1 file (checked 1 source file)

Automate serialization and validation with helper libraries

When working on serialization, you can save yourself quite a bit of headache by adopting one of the helpers. At the moment, the two most prominent helper libraries are marshmallow and pydantic. Choosing one or the other is a matter of taste.

You can also consider adopting attrs, which is like Data Classes but more powerful. Notably, it includes converters and validators that can make your deserialization process more comfortable and more bullet-proof.
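As a stdlib-only sketch of what these libraries automate declaratively (the Color class and regex are my own illustration, not an attrs or pydantic API), here is a frozen Data Class that converts and validates a field on construction:

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Color:
    """A named color; the hex code is normalized and validated on construction."""
    name: str
    code: str

    def __post_init__(self) -> None:
        # Convert (normalize case), then validate -- attrs converters and
        # validators, or pydantic field types, express this declaratively.
        object.__setattr__(self, "code", self.code.upper())
        if not re.fullmatch(r"#[0-9A-F]{6}", self.code):
            raise ValueError(f"invalid hex color: {self.code!r}")
```

With a helper library, the same conversion and validation logic moves out of hand-written `__post_init__` boilerplate into reusable, composable declarations.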

Keep an eye on your dicts. Don’t let them take control over your application. As with every piece of technical debt, the further you postpone introducing proper data structures, the harder the transition becomes.