I recently had the need to write an indexer that processes a lot of text files and creates a huge number of database rows (19 million+). This is a background script for an internal django website so I ended up using the standard django psycopg2 library, but tried to minimize the overhead by executing queries directly using the db cursor. Even after this consideration and special ‘del’ statements to clean things out after regular intervals, to my surprise, the process was still taking up a lot of memory. So, I tweeted about this experience. Daniele Varrazzo, one of the authors of psycopg2, tweeted back with more info.

Daniele’s message basically pointed me to this script that he had written to monitor memory usage. After reading through it and learning about the ‘gc’ python module I incorporated a few of the pieces in my code and noticed that in fact the object count, specifically the ‘dict’ object count, kept increasing as the program executed. This meant that the problem was not in psycopg2, but at a higher layer.

A quick look at one of these dicts and I saw that some code was keeping track of the SQL query execution time and the actual query for each query that was executed. Another run and I noticed this list of dictionaries continued to grow. A quick search through the 1.3 django source revealed exactly where this was happening. It was the django debug cursor wrapper which was keeping the list of SQL queries and the accompanying execution time for monitoring.

Since this was a debug wrapper, the first thing that came to mind was to change settings.DEBUG = False and rerun. Guess what? The dict object count stabilized and the process is now sticking to an almost constant ~210MB (might still seem high, but that is expected because of the read buffer that I am specifying).

Activity monitor showing the constant ~210MB memory that the process now takes.

I did not want to set settings.DEBUG = False because it affects the rest of the site and I didn’t care about the debug wrapper in this specific case. So, I spent a few more minutes looking through code and noticed the use_debug_cursor variable on the db object. That is exactly what I needed. After setting that to False in my code I no longer have to worry about messing with the rest fo the site.

1
2
from django.db import connection
connection.use_debug_cursor = False

So, if you are executing a lot of queries while settings.DEBUG = True then make sure that you are aware of this additional overhead.

Back to blog...