05 May 2024

Temporary vs Permanent Errors

When working on external integrations, we often implement only basic error handling. Most of the time, we just call resp.raise_for_status() and leave the rest for our future selves to handle.

Quite often, we don’t handle errors because we genuinely don’t know how the external system will behave and what types of errors to expect from it. Indeed, it can be overwhelming to consider all the possible corner cases and provide appropriate reactions to them. What should I do if the server returns a 503 error? What if I am rate-limited? What if there’s a connection timeout, and so on? It involves a long list of exceptional cases and handlers that need to be implemented, documented, and tested.

However, we can do better than raising the same error regardless of circumstances. I found that it’s natural to introduce two broad types of errors. First, there are temporary errors: errors that will eventually disappear if you repeat the operation enough times. Second, there are permanent errors: cases where there’s no way to convince the third-party system to execute your request, no matter how many times you ask. Categorizing all errors into these two groups makes it much easier to handle them.

The reaction to a permanent error is to abort the operation immediately. You can either show the error to the user (if the user input is invalid) or, if it’s an unexpected error that should never occur, record it internally and show the user a generic message. Which of the two to do is up to the caller, but aborting and reporting the error is the only option in this situation.

A temporary error means you should try again later. If you’re using a job execution system like Celery, you’d typically schedule the job to retry the task later using exponential backoff and a limit on attempts. For example, you might set it to run the task 5 more times, and on the sixth attempt, you’d consider the temporary error as a permanent one. In that case, you’d probably notify the developer.

Usually, in your integration, you only need to identify the broad type of error (permanent or temporary) and then let the caller decide what to do with the result. Sometimes, for the caller’s convenience, you may choose to never raise temporary errors and instead repeat the requests to the third-party systems on your own.
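The retry policy described above (exponential backoff, a limit on attempts, and promoting the error to a permanent one when the limit is reached) can also live inside the integration itself. Here is a minimal sketch in plain Python, without Celery. The call_with_retries helper and its parameters are hypothetical, and the exception classes are defined here only to keep the block self-contained.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TemporaryAppError(Exception):
    """The operation may succeed if repeated later."""


class PermanentAppError(Exception):
    """Repeating the operation will not help."""


def call_with_retries(
    operation: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation(), retrying temporary errors with exponential backoff.

    Hypothetical helper: the caller never sees TemporaryAppError, only the
    result or a PermanentAppError.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except TemporaryAppError as error:
            if attempt >= max_retries:
                # Out of retries: treat the temporary error as a permanent one.
                raise PermanentAppError(
                    f"gave up after {attempt + 1} attempts"
                ) from error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2**attempt)
            attempt += 1
```

The sleep function is passed in as a parameter so that the delays can be observed or skipped in tests. Real code would probably also add jitter to the delays.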

This is more or less how most network systems work. Network protocols have distinct codes for permanent and temporary errors, and network clients implement the appropriate logic.

For instance, when you send an email, the receiving SMTP server can return an error from the 4xx range (temporary error) or the 5xx range (permanent error). In HTTP, the ranges mean roughly the opposite: 4xx codes are reserved for client errors (mostly permanent, with exceptions such as 429 Too Many Requests), while 5xx codes are for server errors (usually temporary).
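To make the difference between the two protocols concrete, here is a small sketch of the classification. The function names are hypothetical, and treating HTTP 429 as temporary is an assumption that follows the rule used in the integration example of this post.

```python
def is_temporary_smtp_error(code: int) -> bool:
    """SMTP: 4xx replies are transient failures, 5xx replies are permanent."""
    return 400 <= code < 500


def is_temporary_http_error(status: int) -> bool:
    """HTTP: 5xx means trouble on the server side, so retrying may help.

    429 Too Many Requests is a 4xx code, but it is treated as temporary
    here because waiting usually resolves it (an assumption of this sketch).
    """
    return status == 429 or 500 <= status < 600


print(is_temporary_smtp_error(451))  # SMTP 4xx: temporary
print(is_temporary_http_error(404))  # HTTP 4xx: permanent
```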

Integration Example

This example is based on a real scenario of a system sending emails. It’s similar to the one I previously explored in the Interface-mock-live pattern, but this time I will be focusing on a different pattern within the same system.

First, let’s define three types of exceptions:

class AppError(Exception):
    pass

class TemporaryAppError(AppError):
    pass

class PermanentAppError(AppError):
    pass

Then, you carefully read the API spec to understand the possible errors it might generate and how to respond to them. Connection errors are typically considered temporary, as well as 429 Too Many Requests. As for the rest, you should handle 5xx errors as temporary (indicating an issue on the remote side) and 4xx errors as permanent (highlighting an issue with your request).

The result may look like this.

import requests

def send_email(from_: str, to: str, subject: str, text: str) -> None:
    try:
        resp = requests.post(
            f"https://api.mailgun.net/v3/{MAILGUN_DOMAIN_NAME}/messages",
            auth=("api", MAILGUN_API_KEY),
            data={
                "from": from_,
                "to": to,
                "subject": subject,
                "text": text,
            },
        )
    except requests.RequestException as error:
        # Connection problems are worth retrying
        raise TemporaryAppError(str(error)) from error
    # Rate limiting and server-side failures are temporary
    if resp.status_code == 429 or resp.status_code >= 500:
        raise TemporaryAppError(resp.content)
    # Any other 4xx means the request itself is wrong
    if resp.status_code >= 400:
        raise PermanentAppError(resp.content)

As always, abstracting errors is a good practice. The function’s contract is that it raises only PermanentAppError or TemporaryAppError, so the caller doesn’t need to worry about implementation details or other exceptions.
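On the caller’s side, this contract keeps the handling short. The sketch below is only an illustration: deliver() and its return values are hypothetical, and the exception classes are repeated so that the block runs on its own.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


class AppError(Exception):
    pass


class TemporaryAppError(AppError):
    pass


class PermanentAppError(AppError):
    pass


def deliver(send: Callable[[], None]) -> str:
    """Call send() and translate the two error types into an outcome.

    Hypothetical caller: `send` stands for something like
    lambda: send_email(...) from the integration above.
    """
    try:
        send()
    except TemporaryAppError as error:
        # Worth trying again later, e.g. by rescheduling a background job.
        logger.warning("Temporary failure, will retry later: %s", error)
        return "retry_later"
    except PermanentAppError:
        # Retrying will not help: record the details and abort.
        logger.exception("Permanent failure, giving up")
        return "failed"
    return "sent"
```

What "retry_later" and "failed" turn into (a rescheduled job, a message to the user, an alert) stays the caller’s decision.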

You could go a step further and offer the caller a task that automatically reschedules itself upon encountering a temporary error. Here’s an example using Celery.

@app.task(autoretry_for=(TemporaryAppError,), retry_backoff=True)
def send_email(from_: str, to: str, subject: str, text: str) -> None:
    # The function code is the same as above
    ...

Finally, the code will likely evolve over time as you discover more cases that are specific to your application or the third-party system you’re integrating with. For example, you may realize that you need a specific error type if you run out of your monthly quota. In addition to the regular actions for temporary errors, you may want to alert administrators as soon as possible.
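As a sketch of what such an extension could look like, the quota case can be modeled as a subclass of the temporary error, so existing handlers and retry rules keep working while the caller gets a hook for alerting. QuotaExceededError, send_with_quota_alert and alert_admins are hypothetical names, not part of any library.

```python
from typing import Callable


class AppError(Exception):
    pass


class TemporaryAppError(AppError):
    pass


class QuotaExceededError(TemporaryAppError):
    """Hypothetical: the monthly sending quota is used up."""


def send_with_quota_alert(
    send: Callable[[], None],
    alert_admins: Callable[[str], None],
) -> None:
    """Call send(); on a quota error, alert administrators and re-raise."""
    try:
        send()
    except QuotaExceededError as error:
        alert_admins(f"Email quota exhausted: {error}")
        # Re-raise so the regular temporary-error handling still applies.
        raise
```

Because QuotaExceededError is still a TemporaryAppError, code that catches or auto-retries on TemporaryAppError does not need to change.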

Feel free to add exception types to handle these cases if necessary, but only after it becomes clear that your case truly requires a specific exception.

Roman Imankulov
