Memory leaks are difficult to track down. A nice description of finding one is "A server memory leak", a story about tracing a memory leak in a Django application.
I'm trying to work out why a server process is slowly using more and more memory. The server parses a stream of text and shoves stuff into a database. It's written in Python, and about the only external code it uses is SQLAlchemy.
I found a combination of Meliae and objgraph useful. Pympler also seems useful (the tracker class particularly), and Dowser looks like it might have made things a bit easier.
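I didn't end up using Pympler here, but for reference its tracker can be used something like this (a minimal sketch, assuming the SummaryTracker API):

# a minimal sketch of Pympler's tracker (assuming the SummaryTracker API);
# print_diff() shows which object types have appeared since the last call
from pympler import tracker

tr = tracker.SummaryTracker()

# ... let the process do some work here ...

tr.print_diff()   # prints a summary of new objects by type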
So, first we need to get some memory-analysing code into the server. You could use pyrasite, which injects code into an existing process using gdb (it even has helpers for Meliae).
However, it's usually easier and more reliable to modify the code a bit.
The simplest and tidiest way is to use a signal handler which starts pdb. To steal an idea from Bazaar, SIGQUIT is perfect for this.
In the code's main() function, add something like this:
def start_pdb(signal, trace):
    import pdb
    pdb.set_trace()

import signal
signal.signal(signal.SIGQUIT, start_pdb)
Then, when the code is running in a terminal, pressing ctrl+\ will start pdb. You can then run any Python code you feel like. Afterwards, you can either continue running the program with "c" (or "continue"), or kill it with "q" ("quit" or ctrl+d):
$ python sigquit_test.py
^\--Return--
> sigquit_test.py(10)start_pdb()->None
-> pdb.set_trace()
(Pdb) c
^\--Return--
> sigquit_test.py(10)start_pdb()->None
-> pdb.set_trace()
(Pdb) q
Traceback (most recent call last):
  [...]
bdb.BdbQuit
$
Now, we can run the server, and press ctrl+\ at any time to start up a Python shell in that environment.
Firstly, we can use objgraph to see what the most common types of objects in memory are:
(Pdb) import objgraph
(Pdb) objgraph.show_most_common_types()
dict            150951
tuple           82757
InstanceState   65303
[...]
Perhaps unsurprisingly, there are a lot of dict and tuple objects. More interestingly, there are a lot of InstanceState objects.
Next, we can use Meliae to show how much memory each data type is using:
(Pdb) import meliae.scanner
(Pdb) meliae.scanner.dump_all_objects("/tmp/memorydump.json")
(Pdb)
Then, in another Python shell we can analyse this dump:
>>> import meliae.loader
>>> md = meliae.loader.load("/tmp/memorydump.json")
loaded line 775946, 775947 objs, 118.7 / 118.7 MiB read in 21.3s
checked   775946 /   775947 collapsed    74972
set parents   700974 /   700975
collapsed in 11.1s
>>> md.summarize()
Total 700975 objects, 874 types, Total size = 276.9MiB (290327578 bytes)
 Index   Count   %      Size   % Cum     Max Kind
     0  183823  26 176767720  60  60  786712 dict
     1   65303   9  72616936  25  85    1112 InstanceState
     2  151981  21  10535744   3  89     520 tuple
So that InstanceState is definitely not helping the memory usage. Does it relate to those dictionaries?
>>> md.get_all('InstanceState')[0].p
[WeakInstanceDict(4333095200 6291752B 3856refs 1par),
 dict(4760925744 3352B 2refs 1par),
 instancemethod(4763880816 80B 3refs 1par),
 dict(4759467200 3352B 90refs 1par)]
The get_all method returns a list of all the InstanceState objects, ordered by size - so the first index is the largest one. Its .p attribute lists its parents, i.e. the objects that reference it: two dicts, a weak-ref dictionary and an instance method.
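To get a rough feel for what kinds of objects are holding these, something like the following could work (a sketch only; it assumes the proxies returned by .p expose a type_str attribute, which I haven't verified):

# a rough sketch, run in the same analysis shell as md above: tally what
# kinds of objects are parents of the InstanceStates, assuming the proxies
# returned by .p have a type_str attribute
from collections import defaultdict

parent_types = defaultdict(int)
for obj in md.get_all('InstanceState')[:1000]:   # sample the 1000 largest
    for parent in obj.p:
        parent_types[parent.type_str] += 1

for type_name, count in sorted(parent_types.items(), key=lambda kv: -kv[1]):
    print type_name, count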
For a better explanation, I recommend reading "Memory debugging with Meliae" and "Step-by-step Meliae", written by the author of Meliae.
This info is useful, but maybe a little unintuitive. Luckily, objgraph exists. So, back to the pdb prompt in our "live" server:
(Pdb) import objgraph
(Pdb) objgraph.count("InstanceState")
65303
(Pdb) objgraph.show_backrefs(objgraph.by_type("InstanceState")[0:50])
Graph written to objects.dot (316 nodes)
Image generated as objects.png
objgraph is nice in that it is pure Python, and produces pretty reference graphs (by writing .dot files, which are rendered using graphviz):
I'd recommend reading "Hunting memory leaks in Python" and "Python object graphs", written by the author of objgraph.
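Another objgraph helper worth knowing about (a sketch; I didn't use it above) is find_backref_chain, which finds a single chain of references from a suspect object back to a module-level name, paired with show_chain to render just that chain:

# a sketch, run at the same pdb prompt: render one chain of references from
# a suspect InstanceState back to a module-level name, which is often easier
# to read than a 50-object back-reference graph
import random
import objgraph

obj = random.choice(objgraph.by_type("InstanceState"))
chain = objgraph.find_backref_chain(obj, objgraph.is_proper_module)
objgraph.show_chain(chain)   # writes a .dot/.png like show_backrefs does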
Now, this description so far makes the problem seem a bit... easy and linear. Instrument the code, look at what is using the most memory, and... there's the problem! Not quite.
After a bit of prodding around, the InstanceStates seemed to be mostly referenced by a Least Recently Used (LRU) cache in the code. I wasn't sure if this memory usage was legitimate caching, or if the tree pictured above was causing leaky reference cycles.
The code commits to the database approximately every 30 seconds, so I added some memory usage logging code into this.
Firstly, I created a function based on objgraph's show_most_common_types, which writes the most common object counts to a log file:
def debug_log_common_objects():
    print "Logging common objects"
    import gc
    print "garbage collector got", gc.collect()
    import objgraph
    # Reimplement show_most_common_types, but write to a file
    import operator
    import time
    stats = sorted(objgraph.typestats().items(),
                   key=operator.itemgetter(1), reverse=True)
    limit = 30
    if limit:
        stats = stats[:limit]
    width = max(len(name) for name, count in stats)
    out = ["#START %s" % time.time()] + \
          ["%s %s" % (name.ljust(width), count)
           for name, count in stats[:limit]]
    f = open("most_common_object.txt", "a")
    f.write("\n".join(out) + "\n")
    f.close()
    print "Logging common objects - done"
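For what it's worth, hooking it in looks roughly like this (a sketch; commit_to_database and the loop are hypothetical stand-ins for the real server code):

# a sketch of where the logging hooks in; commit_to_database() and this loop
# are hypothetical stand-ins for the real server code
import time

def commit_to_database():
    pass   # placeholder for the real commit step

while True:
    commit_to_database()          # the real code commits roughly every 30s
    debug_log_common_objects()    # log object counts alongside each commit
    time.sleep(30)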
Then I left the process running for a while. This function does exactly what we first did at the pdb prompt (calling "objgraph.show_most_common_types()"), but with a timestamp:
#START 1320397336.95
tuple     30340
dict      16511
function  7662
list      4792
set       4174
[...]
#START 1320397377.14
tuple     32176
dict      20643
function  7662
list      4901
set       4185
[...]
Then, I wrote a simple, horribly crappy script to parse and graph these, so I could see which objects were increasing over time:
#!/usr/bin/env python

f = open("most_common_object.txt")

alltimings = {}
things = {}
start = None
for line in f.readlines():
    if line.startswith("#START"):
        alltimings[start] = things
        _, _, thetime = line.strip().partition(" ")
        start = float(thetime)
        things = {}
    else:
        key, _, value = line.partition(" ")
        things[key] = int(value.strip())

times = {}
values = {}

import datetime
for time, things in sorted(alltimings.items()):
    for t, howmany in things.items():
        times.setdefault(t, []).append(datetime.datetime.fromtimestamp(time))
        values.setdefault(t, []).append(howmany)

import matplotlib.pyplot as plt

thing = plt.figure(figsize=(15, 200))
thing.subplots_adjust(hspace=.2)

for i, key in enumerate(times):
    ct = times[key]
    cv = values[key]
    ax = thing.add_subplot(len(times.keys()), 1, i + 1)
    ax.set_title("%s" % key)
    ax.set_ylabel("Number of objects")
    ax.plot(ct, cv)

plt.savefig("mem.pdf", figsize=(200, 200), ymargin=100)
This produces a PDF full of graphs, some of which look like these:
So while there are a lot of dict objects, the number is fluctuating reasonably. Same with InstanceState, so that wasn't the leak (if there even is one).
I later confirmed this suspicion on StackOverflow - the graph presented above will not leak, as the garbage collector gained cycle detection in Python 2.2. There is one important caveat: if any object in the cycle has a __del__ method, the cycle will NOT be garbage collected.
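The caveat is easy to demonstrate in isolation (a small standalone Python 2 sketch, nothing to do with the server code):

# standalone demo of the __del__ caveat (Python 2): a reference cycle whose
# objects define __del__ is not freed; the objects end up in gc.garbage
import gc

class Plain(object):
    pass

class WithDel(object):
    def __del__(self):
        pass

# a cycle of Plain objects is collected normally
a, b = Plain(), Plain()
a.other, b.other = b, a
del a, b
print "unreachable:", gc.collect(), "uncollectable:", len(gc.garbage)

# a cycle of WithDel objects is found but left alive, parked in gc.garbage
c, d = WithDel(), WithDel()
c.other, d.other = d, c
del c, d
print "unreachable:", gc.collect(), "uncollectable:", len(gc.garbage)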
However, the number of CacheEntities was growing suspiciously, as were a few other types:
...so I think that's what I'll investigate next.
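One easy next step is a sketch using objgraph's show_growth helper, which does interactively much the same delta-tracking as the logging function above:

# a sketch of a quicker, interactive version of the delta-logging above:
# objgraph.show_growth() prints only the types whose counts increased since
# the previous call
import objgraph

objgraph.show_growth()   # first call establishes a baseline
# ... let the server run through a few commit cycles ...
objgraph.show_growth()   # prints the types that grew, with the deltas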
As I said, memory leaks are difficult to track down - I haven't got to the bottom of the problem yet, but hopefully these techniques might help others.