A mystery to be debugged soon:
import pandas as pd
import numpy as np
arr = np.random.randn(100000, 5)
def leak():
    for i in xrange(10000):
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        # result = df.ix[5000]

if __name__ == '__main__':
    leak()
Ok, this is, in a word, f*cked up. If I add a gc.collect() call to that for loop, it stops leaking memory:
import pandas as pd
import numpy as np
import gc
arr = np.random.randn(100000, 5)
def leak():
    pd.util.testing.set_trace()
    for i in xrange(10000):
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        gc.collect()
        # result = df.ix[5000]

if __name__ == '__main__':
    leak()
There are objects here that only get garbage collected when the cyclic GC runs. What's the solution here: break the cycle explicitly in __del__ so the Python memory allocator stops screwing us?
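The behavior described above can be reproduced in isolation: with automatic collection disabled, a reference cycle survives until gc.collect() runs, which is why calling it inside the loop changes things. A minimal sketch (the Node class is an illustration, not pandas internals):

```python
import gc
import weakref

class Node:
    """Plain object that will participate in a reference cycle."""
    pass

def make_cycle():
    a, b = Node(), Node()
    a.partner = b          # a -> b
    b.partner = a          # b -> a: a reference cycle
    return weakref.ref(a)  # weak ref lets us observe collection

gc.disable()               # rule out automatic cyclic GC
ref = make_cycle()
print(ref() is not None)   # True: refcounting alone cannot free the cycle
gc.collect()               # the cyclic collector breaks the cycle
print(ref() is None)       # True: the object is gone only now
gc.enable()
```

Objects trapped in such cycles sit on the heap between collector runs, which is exactly the window in which glibc's allocator can decline to return their pages to the OS.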
Can you try this:
from ctypes import cdll, CDLL
import pandas as pd
import numpy as np
arr = np.random.randn(100000, 5)
cdll.LoadLibrary("libc.so.6")
libc = CDLL("libc.so.6")
def leak():
    for i in xrange(10000):
        libc.malloc_trim(0)
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        # result = df.ix[5000]

if __name__ == '__main__':
    leak()
I suspect this has nothing to do with python, but that would confirm it.
Yeah, that seemed to do the trick. Memory usage was 450MB after running that in IPython, and malloc_trim then freed 400MB. Very pernicious.
Following the malloc_trim lead upstream, this looks like a glibc optimization gone awry.
xref: http://sourceware.org/bugzilla/show_bug.cgi?id=14827 (see the "fastbins" comment).
In [1]: from ctypes import Structure, c_int, cdll, CDLL
   ...: class MallInfo(Structure):
   ...:     _fields_ = [
   ...:         ('arena',    c_int),  # Non-mmapped space allocated (bytes)
   ...:         ('ordblks',  c_int),  # Number of free chunks
   ...:         ('smblks',   c_int),  # Number of free fastbin blocks
   ...:         ('hblks',    c_int),  # Number of mmapped regions
   ...:         ('hblkhd',   c_int),  # Space allocated in mmapped regions (bytes)
   ...:         ('usmblks',  c_int),  # Maximum total allocated space (bytes)
   ...:         ('fsmblks',  c_int),  # Space in freed fastbin blocks (bytes)
   ...:         ('uordblks', c_int),  # Total allocated space (bytes)
   ...:         ('fordblks', c_int),  # Total free space (bytes)
   ...:         ('keepcost', c_int),  # Top-most, releasable space (bytes)
   ...:     ]
   ...:     def __repr__(self):
   ...:         return "\n".join(["%s:%d" % (k, getattr(self, k)) for k, v in self._fields_])
   ...:
   ...: cdll.LoadLibrary("libc.so.6")
   ...: libc = CDLL("libc.so.6")
   ...: mallinfo = libc.mallinfo
   ...: mallinfo.restype = MallInfo
   ...: libc.malloc_trim(0)
   ...: mallinfo().fsmblks
Out[1]: 0

In [2]: import numpy as np
   ...: import pandas as pd
   ...: arr = np.random.randn(100000, 5)
   ...: def leak():
   ...:     for i in xrange(10000):
   ...:         df = pd.DataFrame(arr.copy())
   ...:         result = df.xs(1000)
   ...: leak()
   ...: mallinfo().fsmblks
Out[2]: 128

In [3]: libc.malloc_trim(0)
   ...: mallinfo().fsmblks
Out[3]: 0
Won't fix, then. Maybe we should someday add some helper functions to pandas to do the malloc trimming.
An entry in the FAQ, maybe?
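The helper-function idea might look something like the sketch below. The name trim_memory is hypothetical, not an actual pandas API; it just wraps the malloc_trim call shown earlier in a way that degrades gracefully on non-glibc platforms:

```python
import ctypes
import ctypes.util

def trim_memory() -> bool:
    """Ask glibc to return free heap pages to the OS.

    Returns True if malloc_trim was found and called, False on
    platforms without it (macOS, Windows, musl), where this is a no-op.
    `trim_memory` is a hypothetical helper name, not a pandas function.
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return False
    try:
        libc = ctypes.CDLL(libc_name)
        libc.malloc_trim(0)
    except (OSError, AttributeError):
        return False
    return True
```

Callers could invoke it after dropping large DataFrames, getting the trim on glibc systems and a harmless False elsewhere.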
For the record, we (+@sbneto) have been using this in production for a while, and it is doing very well:
# monkeypatches.py

# Solving memory leak problem in pandas
# https://github.com/pandas-dev/pandas/issues/2659#issuecomment-12021083

import sys
import pandas as pd
from ctypes import cdll, CDLL

try:
    cdll.LoadLibrary("libc.so.6")
    libc = CDLL("libc.so.6")
    libc.malloc_trim(0)
except (OSError, AttributeError):
    libc = None

__old_del = getattr(pd.DataFrame, '__del__', None)

def __new_del(self):
    if __old_del:
        __old_del(self)
    libc.malloc_trim(0)

if libc:
    print('Applying monkeypatch for pd.DataFrame.__del__', file=sys.stderr)
    pd.DataFrame.__del__ = __new_del
else:
    print('Skipping monkeypatch for pd.DataFrame.__del__: libc or malloc_trim() not found', file=sys.stderr)
@alanjds thanks very much!
But there are other affected operations :-(
It's VERY strange that the issue above (the glibc issue) hasn't gotten any reaction. It affects ALL Linux PC and server environments. And... nothing!!!
I know, you'll tell me: ok, write a patch! I'll do it (UPD: though it'll be rough, since I know nothing about the glibc code). But nobody else knows it either.
Everybody says: KDE leaks. Who knows why?! Nobody!
Open source? For shame! Sorry, but it's true in this situation.
I believe you. Two years and no movement on that side :/
I say we fix it on this side and put in a huge comment of blame, because forking glibc looks unfeasible.
@alanjds Your code fixed a problem for me that was causing a major headache. Would you be willing to explain what the default pandas behavior is and how your code fixes it?
You can also work around this issue by switching to jemalloc as your default allocator. Instead of python script.py, run LD_PRELOAD=/usr/lib/libjemalloc.so python script.py. Note that the path to libjemalloc.so may be different on your system and that you first need to install it with your package manager.
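As a quick sanity check that the preload actually took effect, one can inspect which allocator libraries are mapped into the running process. A Linux-only sketch (the helper name allocator_libs is made up here) that reads /proc/self/maps:

```python
def allocator_libs():
    """Return libc/jemalloc shared objects mapped into this process (Linux only)."""
    libs = set()
    with open("/proc/self/maps") as f:
        for line in f:
            parts = line.split()
            # Lines with a backing file have the pathname as the last column.
            path = parts[-1] if len(parts) >= 6 else ""
            name = path.rsplit("/", 1)[-1]
            if name.startswith(("libc.", "libc-", "libjemalloc")):
                libs.add(path)
    return libs

print(sorted(allocator_libs()))  # a libjemalloc entry appears when preloaded
```

When the LD_PRELOAD trick worked, a libjemalloc path shows up alongside (or instead of) the glibc entry.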
@tchristensenowlet The problem seems to be in the malloc code of glibc. Apparently, the free implementation there does not respect a flag that should issue malloc_trim after a certain threshold, as you can see in @ghost's link. Therefore, malloc_trim is never called and memory leaks. What we did was just to manually call malloc_trim if the lib is available on the system. We call it in the __del__() method, which is executed when the object is garbage collected.
The glibc.malloc.mxfast tunable has been introduced in glibc (https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html).
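A sketch of how that tunable could be applied: glibc reads the GLIBC_TUNABLES environment variable at process startup, so it must be set before the interpreter launches, e.g. in the environment of a child process. Setting glibc.malloc.mxfast=0 should disable fastbin caching entirely (unknown tunables are silently ignored by older glibc, so this is safe to try). The child command here is a hypothetical stand-in for the leaking pandas loop:

```python
import os
import subprocess
import sys

# Copy the parent environment and add the tunable; it only takes
# effect for processes started with it already set.
env = dict(os.environ)
env["GLIBC_TUNABLES"] = "glibc.malloc.mxfast=0"  # 0: no chunks served from fastbins

# Hypothetical child workload standing in for the leaking pandas loop.
result = subprocess.run(
    [sys.executable, "-c", "print('child ran')"],
    env=env, check=True, capture_output=True, text=True,
)
print(result.stdout.strip())
```

The same effect can be had from a shell with GLIBC_TUNABLES=glibc.malloc.mxfast=0 python script.py, mirroring the LD_PRELOAD workaround above.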
I think this might be the culprit in one of our projects, but our users are running Windows with the default Python 3.8 (from the official website) and with all dependencies installed via pip. Would this problem also occur on Windows? If so, what would be the cdll.LoadLibrary("libc.so.6") equivalent?
Edit: I ran the tests described here, and the garbage collector did its job properly every time:
https://github.com/pandas-dev/pandas/issues/21353
System: Windows 10
Python: 3.8.5
Pandas: 1.1.0