Pandas: Cyclic GC issues

Created on 8 Jan 2013  ·  14 Comments  ·  Source: pandas-dev/pandas

A mystery to be debugged soon:

import pandas as pd
import numpy as np

arr = np.random.randn(100000, 5)

def leak():
    for i in xrange(10000):
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        # result = df.ix[5000]

if __name__ == '__main__':
    leak()
Bug

All 14 comments

Ok, this is, in a word, f*cked up. If I add a gc.collect() call to that for loop, it stops leaking memory:

import pandas as pd
import numpy as np
import gc

arr = np.random.randn(100000, 5)

def leak():
    pd.util.testing.set_trace()
    for i in xrange(10000):
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        gc.collect()
        # result = df.ix[5000]

if __name__ == '__main__':
    leak()

There are objects here that only get garbage collected when the cyclic GC runs. What's the solution here: break cycles explicitly in __del__ so the Python memory allocator stops screwing us?
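The cyclic-garbage hypothesis can be checked in isolation. A minimal sketch (no pandas involved, just the stdlib gc module) showing that objects caught in reference cycles survive refcounting and are only reclaimed when the cyclic collector runs:

```python
import gc

class Node:
    """Each instance holds a reference to itself, forming a cycle."""
    def __init__(self):
        self.ref = self  # refcounting alone can never free this object

gc.disable()  # simulate the window between automatic GC runs
baseline = len(gc.get_objects())
for _ in range(1000):
    Node()  # unreferenced immediately, but kept alive by its own cycle
pending = len(gc.get_objects()) - baseline  # cyclic garbage still tracked
collected = gc.collect()  # the cyclic collector reclaims it all at once
gc.enable()
```

Until `gc.collect()` runs, all 1000 instances (plus their `__dict__`s) stay alive, which is why inserting `gc.collect()` inside the loop above hides the growth.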

Can you try this:

from ctypes import cdll, CDLL

import pandas as pd
import numpy as np

arr = np.random.randn(100000, 5)

cdll.LoadLibrary("libc.so.6")
libc = CDLL("libc.so.6")

def leak():
    for i in xrange(10000):
        libc.malloc_trim(0)
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        # result = df.ix[5000]

if __name__ == '__main__':
    leak()

I suspect this has nothing to do with python, but that would confirm it.

Yeah, that seemed to do the trick. Memory usage 450MB after running that in IPython, then malloc_trim freed 400MB. Very pernicious

Following the malloc_trim lead upstream, this looks like a glibc optimization gone awry.
xref:
http://sourceware.org/bugzilla/show_bug.cgi?id=14827

see "fastbins" comment.

In [1]: from ctypes import Structure, c_int, cdll, CDLL
   ...: class MallInfo(Structure):
   ...:     _fields_ = [
   ...:         ('arena',    c_int),  # Non-mmapped space allocated (bytes)
   ...:         ('ordblks',  c_int),  # Number of free chunks
   ...:         ('smblks',   c_int),  # Number of free fastbin blocks
   ...:         ('hblks',    c_int),  # Number of mmapped regions
   ...:         ('hblkhd',   c_int),  # Space allocated in mmapped regions (bytes)
   ...:         ('usmblks',  c_int),  # Maximum total allocated space (bytes)
   ...:         ('fsmblks',  c_int),  # Space in freed fastbin blocks (bytes)
   ...:         ('uordblks', c_int),  # Total allocated space (bytes)
   ...:         ('fordblks', c_int),  # Total free space (bytes)
   ...:         ('keepcost', c_int),  # Top-most, releasable space (bytes)
   ...:     ]
   ...:     def __repr__(self):
   ...:         return "\n".join("%s:%d" % (k, getattr(self, k)) for k, _ in self._fields_)
   ...:
   ...: cdll.LoadLibrary("libc.so.6")
   ...: libc = CDLL("libc.so.6")
   ...: mallinfo = libc.mallinfo
   ...: mallinfo.restype = MallInfo
   ...: libc.malloc_trim(0)
   ...: mallinfo().fsmblks
Out[1]: 0

In [2]: import numpy as np
   ...: import pandas as pd
   ...: arr = np.random.randn(100000, 5)
   ...: def leak():
   ...:     for i in xrange(10000):
   ...:         df = pd.DataFrame(arr.copy())
   ...:         result = df.xs(1000)
   ...: leak()
   ...: mallinfo().fsmblks
Out[2]: 128

In [3]: libc.malloc_trim(0)
   ...: mallinfo().fsmblks
Out[3]: 0

Won't fix, then. Maybe we should add some helper functions to pandas someday to do the malloc trimming.
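Such a helper might look like this. A sketch only: `trim_memory` is a hypothetical name, not a real pandas API, and it degrades to a no-op on platforms whose libc lacks malloc_trim (musl, macOS, Windows):

```python
import ctypes
import ctypes.util

def trim_memory() -> int:
    """Ask the allocator to return free heap pages to the OS.

    On glibc this calls malloc_trim(0), which returns 1 if memory was
    actually released and 0 otherwise. On platforms without glibc (or
    without the symbol), this is a harmless no-op returning 0.
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return 0  # no resolvable C library (e.g. some Windows setups)
    libc = ctypes.CDLL(libc_name)
    if not hasattr(libc, "malloc_trim"):
        return 0  # e.g. musl or macOS libSystem: no malloc_trim symbol
    return libc.malloc_trim(0)
```

A caller would invoke `trim_memory()` after dropping large DataFrames, mirroring what the monkeypatch further down this thread does per-object in `__del__`.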

Entry in FAQ, maybe?

For the record, we ([email protected]) have been using this in production for some time, and it is working very well:

# monkeypatches.py

# Solving memory leak problem in pandas
# https://github.com/pandas-dev/pandas/issues/2659#issuecomment-12021083
import sys

import pandas as pd
from ctypes import cdll, CDLL
try:
    cdll.LoadLibrary("libc.so.6")
    libc = CDLL("libc.so.6")
    libc.malloc_trim(0)
except (OSError, AttributeError):
    libc = None

__old_del = getattr(pd.DataFrame, '__del__', None)

def __new_del(self):
    if __old_del:
        __old_del(self)
    libc.malloc_trim(0)

if libc:
    print('Applying monkeypatch for pd.DataFrame.__del__', file=sys.stderr)
    pd.DataFrame.__del__ = __new_del
else:
    print('Skipping monkeypatch for pd.DataFrame.__del__: libc or malloc_trim() not found', file=sys.stderr)

@alanjds thanks very much!

But there are other affected operations :-(

It's VERY strange that the issue above (the glibc issue) has gotten no reaction at all. It affects ALL Linux PCs and servers. And... nothing!!!

I know, you'll tell me: OK, then write a patch! I'll do it (UPD: though that will be odd, since I know nothing about the glibc code). But apparently nobody else knows it either.

Everybody says: KDE leaks. Does anyone know why?! Nobody!

Open source? For shame! Sorry, but it's true in this situation.

P.S. http://sourceware.org/bugzilla/show_bug.cgi?id=14827

I do believe in you. 2 years and no movement on that side :/

I say we fix it on this side and leave a big comment of blame, because forking glibc looks unfeasible.

@alanjds Your code fixed a problem for me that was causing a major headache. Would you be willing to explain what the default pandas behavior is and how your code fixes it?

You can also work around this issue by switching to jemalloc as your default allocator. Instead of python script.py, run LD_PRELOAD=/usr/lib/libjemalloc.so python script.py. Note that the path to libjemalloc.so may be different on your system and that you first need to install it with your package manager.

@tchristensenowlet The problem seems to be in the malloc code of glibc. Apparently, the free implementation there does not respect a flag that should issue malloc_trim after a certain threshold, as you can see in @ghost's link. Therefore, malloc_trim is never called and memory leaks. What we did was just to manually call malloc_trim if the lib is available in the system. We call it in the __del__() method, that is executed when the object is garbage collected.

The glibc.malloc.mxfast tunable has since been introduced in glibc (https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html).
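Assuming a glibc new enough to expose this tunable (the GLIBC_TUNABLES framework landed in glibc 2.25; the mxfast knob arrived in a later release), the fastbin caching behavior discussed above can then be disabled per-process without any code change; a sketch:

```shell
# Set the fastbin size cap to 0 so freed small chunks are consolidated
# immediately instead of being cached (and never trimmed) in fastbins.
GLIBC_TUNABLES=glibc.malloc.mxfast=0 python script.py
```

Unrecognized tunables are silently ignored by glibc, so exporting this on older systems is safe, just ineffective.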

I think this might be the culprit in one of our projects, but our users are running Windows with the default Python 3.8 (from the official website) and all dependencies installed via pip. Does this problem also occur on Windows? If so, what would be the equivalent of cdll.LoadLibrary("libc.so.6")?

Edit: I ran the tests described here, and the garbage collector did its job properly every time:
https://github.com/pandas-dev/pandas/issues/21353
System: Windows 10
Python: 3.8.5
Pandas: 1.1.0
