A mystery to be debugged soon:
import pandas as pd
import numpy as np
arr = np.random.randn(100000, 5)
def leak():
    for i in xrange(10000):
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        # result = df.ix[5000]

if __name__ == '__main__':
    leak()
Ok, this is, in a word, f*cked up. If I add a gc.collect() call to that for loop, it stops leaking memory:
import pandas as pd
import numpy as np
import gc
arr = np.random.randn(100000, 5)
def leak():
    pd.util.testing.set_trace()
    for i in xrange(10000):
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        gc.collect()
        # result = df.ix[5000]

if __name__ == '__main__':
    leak()
There are objects here that only get garbage collected when the cyclic GC runs. What's the solution here: break the cycle explicitly in __del__ so the Python memory allocator stops screwing us?
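The behavior described above can be reproduced in isolation: with automatic collection disabled, a reference cycle survives until gc.collect() runs, which is why calling it inside the loop changes things. A minimal sketch (the Node class is an illustration, not pandas internals):

```python
import gc
import weakref

class Node:
    """Plain object that will participate in a reference cycle."""
    pass

def make_cycle():
    a, b = Node(), Node()
    a.partner = b          # a -> b
    b.partner = a          # b -> a: a reference cycle
    return weakref.ref(a)  # weak ref lets us observe collection

gc.disable()               # rule out automatic cyclic GC
ref = make_cycle()
print(ref() is not None)   # True: refcounting alone cannot free the cycle
gc.collect()               # the cyclic collector breaks the cycle
print(ref() is None)       # True: the object is gone only now
gc.enable()
```

Objects trapped in such cycles sit on the heap between collector runs, which is exactly the window in which glibc's allocator can decline to return their pages to the OS.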
Can you try this:
from ctypes import cdll, CDLL
import pandas as pd
import numpy as np
arr = np.random.randn(100000, 5)
cdll.LoadLibrary("libc.so.6")
libc = CDLL("libc.so.6")
def leak():
    for i in xrange(10000):
        libc.malloc_trim(0)
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        # result = df.ix[5000]

if __name__ == '__main__':
    leak()
I suspect this has nothing to do with python, but that would confirm it.
Yeah, that seemed to do the trick. Memory usage was 450MB after running that in IPython, and malloc_trim then freed 400MB. Very pernicious.
Following the malloc_trim lead upstream, this looks like a glibc optimization gone awry.
xref: http://sourceware.org/bugzilla/show_bug.cgi?id=14827 (see the "fastbins" comment).
In [1]: from ctypes import Structure, c_int, cdll, CDLL
   ...: class MallInfo(Structure):
   ...:     _fields_ = [
   ...:         ('arena',    c_int),  # Non-mmapped space allocated (bytes)
   ...:         ('ordblks',  c_int),  # Number of free chunks
   ...:         ('smblks',   c_int),  # Number of free fastbin blocks
   ...:         ('hblks',    c_int),  # Number of mmapped regions
   ...:         ('hblkhd',   c_int),  # Space allocated in mmapped regions (bytes)
   ...:         ('usmblks',  c_int),  # Maximum total allocated space (bytes)
   ...:         ('fsmblks',  c_int),  # Space in freed fastbin blocks (bytes)
   ...:         ('uordblks', c_int),  # Total allocated space (bytes)
   ...:         ('fordblks', c_int),  # Total free space (bytes)
   ...:         ('keepcost', c_int),  # Top-most, releasable space (bytes)
   ...:     ]
   ...:     def __repr__(self):
   ...:         return "\n".join(["%s:%d" % (k, getattr(self, k)) for k, v in self._fields_])
   ...:
   ...: cdll.LoadLibrary("libc.so.6")
   ...: libc = CDLL("libc.so.6")
   ...: mallinfo = libc.mallinfo
   ...: mallinfo.restype = MallInfo
   ...: libc.malloc_trim(0)
   ...: mallinfo().fsmblks
Out[1]: 0

In [2]: import numpy as np
   ...: import pandas as pd
   ...: arr = np.random.randn(100000, 5)
   ...: def leak():
   ...:     for i in xrange(10000):
   ...:         df = pd.DataFrame(arr.copy())
   ...:         result = df.xs(1000)
   ...: leak()
   ...: mallinfo().fsmblks
Out[2]: 128

In [3]: libc.malloc_trim(0)
   ...: mallinfo().fsmblks
Out[3]: 0
Won't fix, then. Maybe we should someday add some helper functions to pandas to do the malloc trimming.
An entry in the FAQ, maybe?
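The helper-function idea might look something like the sketch below. The name trim_memory is hypothetical, not an actual pandas API; it just wraps the malloc_trim call shown earlier in a way that degrades gracefully on non-glibc platforms:

```python
import ctypes
import ctypes.util

def trim_memory() -> bool:
    """Ask glibc to return free heap pages to the OS.

    Returns True if malloc_trim was found and called, False on
    platforms without it (macOS, Windows, musl), where this is a no-op.
    `trim_memory` is a hypothetical helper name, not a pandas function.
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return False
    try:
        libc = ctypes.CDLL(libc_name)
        libc.malloc_trim(0)
    except (OSError, AttributeError):
        return False
    return True
```

Callers could invoke it after dropping large DataFrames, getting the trim on glibc systems and a harmless False elsewhere.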
For the record, we (+@sbneto) have been using this in production for a while, and it is doing very well:
# monkeypatches.py

# Solving memory leak problem in pandas
# https://github.com/pandas-dev/pandas/issues/2659#issuecomment-12021083

import sys
import pandas as pd
from ctypes import cdll, CDLL

try:
    cdll.LoadLibrary("libc.so.6")
    libc = CDLL("libc.so.6")
    libc.malloc_trim(0)
except (OSError, AttributeError):
    libc = None

__old_del = getattr(pd.DataFrame, '__del__', None)

def __new_del(self):
    if __old_del:
        __old_del(self)
    libc.malloc_trim(0)

if libc:
    print('Applying monkeypatch for pd.DataFrame.__del__', file=sys.stderr)
    pd.DataFrame.__del__ = __new_del
else:
    print('Skipping monkeypatch for pd.DataFrame.__del__: libc or malloc_trim() not found', file=sys.stderr)
@alanjds thanks very much!
But there are other affected operations :-(
It's VERY strange that the issue above (the glibc issue) hasn't gotten any reaction. It affects ALL Linux PC and server environments. And... nothing!!!
I know, you'll tell me: ok, write a patch! I'll do it (UPD: though it'll be rough, since I know nothing about the glibc code). But nobody else knows it either.
Everybody says: KDE leaks. Who knows why?! Nobody!
Open source? For shame! Sorry, but it's true in this situation.
I believe you. Two years and no movement on that side :/
I say we fix it on this side and put in a huge comment of blame, because forking glibc looks unfeasible.
@alanjds Your code fixed a problem for me that was causing a major headache. Would you be willing to explain what the default pandas behavior is and how your code fixes it?
You can also work around this issue by switching to jemalloc as your default allocator. Instead of python script.py, run LD_PRELOAD=/usr/lib/libjemalloc.so python script.py. Note that the path to libjemalloc.so may be different on your system and that you first need to install it with your package manager.
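As a quick sanity check that the preload actually took effect, one can inspect which allocator libraries are mapped into the running process. A Linux-only sketch (the helper name allocator_libs is made up here) that reads /proc/self/maps:

```python
def allocator_libs():
    """Return libc/jemalloc shared objects mapped into this process (Linux only)."""
    libs = set()
    with open("/proc/self/maps") as f:
        for line in f:
            parts = line.split()
            # Lines with a backing file have the pathname as the last column.
            path = parts[-1] if len(parts) >= 6 else ""
            name = path.rsplit("/", 1)[-1]
            if name.startswith(("libc.", "libc-", "libjemalloc")):
                libs.add(path)
    return libs

print(sorted(allocator_libs()))  # a libjemalloc entry appears when preloaded
```

When the LD_PRELOAD trick worked, a libjemalloc path shows up alongside (or instead of) the glibc entry.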
@tchristensenowlet The problem seems to be in the malloc code of glibc. Apparently, the free implementation there does not respect a flag that should issue malloc_trim after a certain threshold, as you can see in @ghost's link. Therefore, malloc_trim is never called and memory leaks. What we did was just to manually call malloc_trim if the lib is available on the system. We call it in the __del__() method, which is executed when the object is garbage collected.
The glibc.malloc.mxfast tunable has been introduced in glibc (https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html).
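A sketch of how that tunable could be applied: glibc reads the GLIBC_TUNABLES environment variable at process startup, so it must be set before the interpreter launches, e.g. in the environment of a child process. Setting glibc.malloc.mxfast=0 should disable fastbin caching entirely (unknown tunables are silently ignored by older glibc, so this is safe to try). The child command here is a hypothetical stand-in for the leaking pandas loop:

```python
import os
import subprocess
import sys

# Copy the parent environment and add the tunable; it only takes
# effect for processes started with it already set.
env = dict(os.environ)
env["GLIBC_TUNABLES"] = "glibc.malloc.mxfast=0"  # 0: no chunks served from fastbins

# Hypothetical child workload standing in for the leaking pandas loop.
result = subprocess.run(
    [sys.executable, "-c", "print('child ran')"],
    env=env, check=True, capture_output=True, text=True,
)
print(result.stdout.strip())
```

The same effect can be had from a shell with GLIBC_TUNABLES=glibc.malloc.mxfast=0 python script.py, mirroring the LD_PRELOAD workaround above.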
I think this might be the culprit in one of our projects, but our users are running Windows with the default Python 3.8 (from the official website) and with all dependencies installed via pip. Would this problem also occur on Windows? If so, what would be the cdll.LoadLibrary("libc.so.6") equivalent?
Edit: I ran the tests described here, and the garbage collector did its job properly every time:
https://github.com/pandas-dev/pandas/issues/21353
System: Windows 10
Python: 3.8.5
Pandas: 1.1.0