Django API upgrade post mortem

Sébastien Mesili
6 min read · Jan 17, 2021

Our Django 2 API was holding strong, but it was aging, and some of the new features introduced in Django 3 and the recent Django REST Framework releases looked very interesting.
We also wanted to stay current with bug and security patches.

So yeaaah 😨 … Upgrade Time!

TL;DR

— 📜 Doing lots of planning and having tons of tests helps.
— 🔌 Having a proactive devops team is your safety net.
— 💣 Failing quickly is better than failing in production.

Now, let’s get started!

Versions

- Django 2.0 -> 3.0
- DRF 3.7 -> 3.12
- Python 3.6 -> 3.8

Note: Django 3.1 was incompatible with pandas and some other dependencies at the time of the upgrade, so we decided to only go up to 3.0.

Other upgrades

To keep all our third-party dependencies happy we had to upgrade many other packages, with some trial and error.

Shortly before we started the upgrade, pip 20.3 came out with the new dependency resolver, which made the process easier.

The most important ones were:

  • django-cacheops
  • django-redis
  • django-storages

Upgrade Plan

Get a rough idea of the work scope

  • Read the documentation (and the source code)
  • Take notes of required patches
  • Test some features / breaking changes locally (django shell, interactive python shell)
  • Share with other developers so they can add their feedback
  • Define a schedule

Do the actual work

  • Apply the patches noted during the research phase
  • Fix errors that come up in the test suite
  • Fix dependencies
  • Fix unexpected behavior
  • Repeat

Test on staging servers

  • Deploy on a dedicated staging server so as not to interfere with other devs
  • QA run + monitoring + DBA monitoring
  • Fix errors that arise
  • Repeat

Production deployment

  • Put a Django 3 instance in parallel in all regions
  • Monitor
  • Put all Django 3 instances in parallel
  • Shut down Django 2 instances

Side notes on testing

Testing was undoubtedly the most important part of this project.

Thankfully our API has tests covering all the main features and utilities.
Our coverage is far from 100%, but the key parts are covered.

I documented each step of the refactoring along with the test results, so that other devs could see what was happening almost in real time.
I also wanted the process to be easily repeatable if something went wrong in any of the next steps, or even if the project was halted.
Relying on my own memory for such matters is tiring and has proven not to be that reliable 😅.

As the upgrade was planned not to interfere with the regular releases, I rebased the branch on a daily basis. It was a bit repetitive, but it also helped me really remember every bit and document it better.
Thanks to vim I just kept the refactoring bits in registers, so they were ready to use at any time without even risking a typo.

Between major rounds of refactoring / testing, I wiped the packages and reinstalled them to be sure I had not introduced any incompatibility and that all packages built / installed without errors.

It also meant that I was juggling between branches a bit too much for my poor muscles, so I made aliases like:

git checkout upgrade/django3 &&
git pull origin upgrade/django3 &&
python3.8 -m pip freeze | xargs python3.8 -m pip uninstall -y &&
python3.8 -m pip install -r requirements.txt &&
python3.8 -m pytest > /tmp/pytest_django3

and

git checkout develop &&
git pull origin develop &&
python3.6 -m pip freeze | xargs python3.6 -m pip uninstall -y &&
python3.6 -m pip install -r requirements.txt &&
python3.6 -m pytest > /tmp/pytest_develop

(line breaks added for readability)

Have you ever wondered what the difference is between python -m pytest and pytest? In this case it matters: invoking pytest directly runs it under whichever interpreter its console script was installed for, so either the 3.6 or the 3.8 version could be used, which could cause some headaches. In my case the virtualenv was set up with 3.6 at the beginning of the project and 3.8 at the end, so pinning the interpreter with python3.8 -m pytest removed the ambiguity.
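To make that failure mode loud instead of silent, a small guard in conftest.py can abort the run when the interpreter is not the expected one. This is a hypothetical sketch, not something taken from our codebase; the expected version tuple is an assumption you would flip per branch:

# conftest.py — hypothetical guard: fail fast if the test suite is collected
# under an unexpected interpreter.
import sys

EXPECTED = (3, 8)  # assumption: flip to (3, 6) on the develop branch


def pytest_configure(config):
    if sys.version_info[:2] != EXPECTED:
        raise RuntimeError(
            f"pytest is running under Python {sys.version_info[0]}.{sys.version_info[1]} "
            f"({sys.executable}), expected {EXPECTED}"
        )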

About aliases: it may look innocent, but using aliases saves a lot of time when you do the math.
If you don't already do it, I would recommend starting right now.

😎 What had been anticipated

Based on my research there were not that many things to change between our current and the target versions, mainly on the DRF side (a small before/after sketch follows the list):

  • base_name in the router has been replaced by basename (DRF)
  • The @list_route & @detail_route decorators have been replaced by @action (DRF)
  • Change some reverse() calls to reverse_action() (for complex routes) (DRF)
  • The @action decorator call must include url_name in some cases for reverse_action (used mostly in tests) to work properly (DRF)
  • Python 3.7 changed many things around regular expressions.
    For example, the escaping rules were modified, so some manual escaping is now required.
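To make the DRF items above a bit more concrete, here is a hedged before/after sketch; ReportViewSet, the "reports" route and the export action are made-up names, not our actual code:

# Hypothetical ViewSet and router, only to illustrate the renamed arguments.
from rest_framework import viewsets
from rest_framework.decorators import action
from rest_framework.response import Response
from rest_framework.routers import DefaultRouter


class ReportViewSet(viewsets.ViewSet):
    # DRF 3.7: @list_route(methods=["get"]) / @detail_route(...)
    # DRF 3.12: a single @action decorator with detail=True/False.
    # url_name is what reverse_action() relies on to build "reports-export".
    @action(detail=False, methods=["get"], url_name="export")
    def export(self, request):
        return Response({"status": "ok"})

    @action(detail=False, methods=["get"])
    def export_link(self, request):
        # reverse("reports-export") would also work, but reverse_action()
        # prefixes the basename for us, which is handy for complex routes.
        return Response({"url": self.reverse_action("export")})


router = DefaultRouter()
# DRF 3.7 used base_name=..., DRF 3.12 expects basename=...
router.register(r"reports", ReportViewSet, basename="reports")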

And we upgraded our Redis client, which added the following changes (sketched right after the list):

  • redis-py no longer casts None to a string, so those cases need to be handled explicitly.
  • redis-py changed the zadd signature: it now takes a dict mapping members to scores, like {"member": score}, instead of positional score/member pairs.
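A minimal sketch of those two redis-py changes, with made-up key names (leaderboard and report:123 are placeholders):

# Hypothetical client and keys, only to illustrate the breaking changes.
import redis

client = redis.Redis(host="localhost", port=6379, db=0)

# Old redis-py style (no longer valid): client.zadd("leaderboard", 42, "alice")
# New style: a mapping of member -> score.
client.zadd("leaderboard", {"alice": 42})

# None is no longer silently cast to the string "None", so it has to be
# handled before reaching the client.
value = None
client.set("report:123", value if value is not None else "")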

🤯 What went wrong

Redis server / client compatibility

Upgrading packages involved breaking changes and version conflicts.
Local development with Docker had gone smoothly, so I naively deployed to our staging servers hoping everything would work the first time…
The first installation was fine but the first start failed: the Redis server version was not supported by the new django-redis package.
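In hindsight, a tiny pre-deploy check could have caught this before the first start. A hedged sketch, where the URL and the minimum version are assumptions (the real floor comes from the django-redis changelog), not part of our actual pipeline:

# Hypothetical pre-deploy sanity check.
import redis

MIN_REDIS = (3, 0, 0)  # assumption: adjust to the client stack's real requirement


def check_redis_version(url="redis://localhost:6379/0"):
    version = redis.Redis.from_url(url).info()["redis_version"]
    parsed = tuple(int(part) for part in version.split("."))
    if parsed < MIN_REDIS:
        raise RuntimeError(f"Redis {version} is older than the required {MIN_REDIS}")


if __name__ == "__main__":
    check_redis_version()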

Production <> Pre-production side effect

When everything was ready to deploy to production we used our usual pipeline and started with pre-production, which usually allows us to make one last check before the big jump.

But as much as I wanted to make a perfect zero-error deployment, I missed one little thing in the refactoring. That thing is pickle.
For those who don’t know, when serializing and de-serializing objects in Python, the protocol used is called pickle and the process is called “pickling”.
By default, the django-redis cache backend pickles with the latest protocol version available to the running interpreter.

And…

Protocol version 5 was added in Python 3.8.

On Python 3.6, that means pickle protocol 4 is the highest available; on 3.8, it is protocol 5.
Our pre-production environment was on Python 3.8, our production one was on 3.6.
So our production environment started throwing errors about pickle protocol 5 being unsupported. Oops!
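The mismatch is easy to reproduce outside of Django; a minimal sketch with a made-up payload:

# On Python 3.8, protocol -1 resolves to the highest available protocol, i.e. 5.
import pickle

payload = {"report_id": 123}  # hypothetical cached value
data = pickle.dumps(payload, protocol=-1)  # protocol 5 on 3.8, protocol 4 on 3.6

# A Python 3.6 reader loading protocol-5 data fails with:
#   ValueError: unsupported pickle protocol: 5
restored = pickle.loads(data)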

We use pickle mostly for caching, via two libraries: django-cacheops and django-redis.

Fortunately, the only side effect for us was that our customer service team could not get real-time Slack notifications about user reports on the platform.
A bit annoying, but nothing critical, as the reports were still available as usual.
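One way to avoid this class of problem during a mixed-interpreter rollout is to pin the pickle protocol in the cache settings until every environment runs the same Python. A hedged sketch assuming the default django-redis PickleSerializer (the Redis URL is a placeholder, not our actual configuration):

# settings.py — hypothetical excerpt.
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://redis-host:6379/1",  # placeholder URL
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
            # Pin to protocol 4 so both Python 3.6 and 3.8 readers can
            # unpickle cache entries; drop the pin once everything is on 3.8.
            "PICKLE_VERSION": 4,
        },
    }
}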

After monitoring for a fair bit, we deployed the new version in a 50/50 fashion, then shut down the old Django 2 / Python 3.6 instances.

Side notes about monitoring

While pushing any refactoring to production you have to closely follow the logs and other events to catch any strange behavior, especially around asynchronous actions and third-party dependencies.

At Spoon we use many different tools, but I relied a lot on Sentry and New Relic, which are both good at surfacing release-specific errors.

📌 Conclusion

In retrospect, the upgrade was deployed relatively painlessly thanks to good coordination and communication between devs, devops and QA.

The day after the global deployment the regular release cycle resumed, and the only noticeable change during the deployment was the use of an alternative git flow, so as not to interfere with ongoing work in case something went wrong with the upgrade.

We are usually wary of upgrades because things can break, but planning ahead, and having a team and infrastructure that provide a good environment for testing and deployment, is the key.
