Pandas: Circular GC issue

에 λ§Œλ“  2013λ…„ 01μ›” 08일  Β·  14μ½”λ©˜νŠΈ  Β·  좜처: pandas-dev/pandas

곧 디버깅 될 λ―ΈμŠ€ν„°λ¦¬ :

import pandas as pd
import numpy as np

arr = np.random.randn(100000, 5)

def leak():
    for i in xrange(10000):
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        # result = df.ix[5000]

if __name__ == '__main__':
    leak()

All 14 comments

OK, this is, in a word, screwed up. If I add gc.collect to that for loop, it stops leaking memory:

import pandas as pd
import numpy as np
import gc

arr = np.random.randn(100000, 5)

def leak():
    pd.util.testing.set_trace()
    for i in xrange(10000):
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        gc.collect()
        # result = df.ix[5000]

if __name__ == '__main__':
    leak()

There are objects here that only get garbage collected when the cyclic GC runs. What is the solution here: break the cycle explicitly in __del__ so that the Python memory allocator stops screwing us?
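The claim above, that some objects are reclaimed only when the cyclic collector runs, can be seen in isolation without pandas. This is a minimal sketch, assuming CPython; the Node class and the two helper functions are hypothetical names used only for illustration and are not part of pandas.

```python
import gc
import weakref


class Node:
    """Hypothetical object that can hold a reference to another Node."""


def make_cycle():
    # Two objects that reference each other form a reference cycle.
    a, b = Node(), Node()
    a.other, b.other = b, a
    # Return only a weak reference so nothing outside keeps the cycle alive.
    return weakref.ref(a)


def cycle_survives_until_collect():
    was_enabled = gc.isenabled()
    gc.disable()  # keep the automatic cyclic collector out of the demo
    try:
        ref = make_cycle()
        alive_before = ref() is not None  # refcounting alone cannot free a cycle
        gc.collect()                      # the cyclic collector finds and frees it
        alive_after = ref() is not None
    finally:
        if was_enabled:
            gc.enable()
    return alive_before, alive_after


print(cycle_survives_until_collect())
```

On CPython this should print (True, False): the cycle outlives the function that created it and disappears only after gc.collect().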

Can you try this:

from ctypes import cdll, CDLL

import pandas as pd
import numpy as np

arr = np.random.randn(100000, 5)

cdll.LoadLibrary("libc.so.6")
libc = CDLL("libc.so.6")

def leak():
    for i in xrange(10000):
        libc.malloc_trim(0)
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        # result = df.ix[5000]

if __name__ == '__main__':
    leak()

I suspect this has nothing to do with Python, but that would confirm it.

Yeah, that seemed to do the trick. Memory usage was 450MB after running that in IPython, then malloc_trim freed 400MB. Very pernicious.
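To reproduce a before/after measurement like the one above, the resident set size of the process can be read before and after the trim. This is a rough sketch, assuming Linux with /proc available; current_rss_bytes is a hypothetical helper name, and it returns None on platforms where the file does not exist.

```python
import os


def current_rss_bytes():
    """Return this process's resident set size in bytes, or None if unknown."""
    try:
        # /proc/self/statm lists sizes in pages; the second field is resident pages.
        with open("/proc/self/statm") as fh:
            resident_pages = int(fh.read().split()[1])
    except (OSError, IndexError, ValueError):
        return None  # not Linux, or /proc is not mounted
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


print(current_rss_bytes())
```

Calling it once after leak() and once more after libc.malloc_trim(0) gives the two numbers to compare.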

Following the malloc_trim lead upstream, this looks like a glibc optimization gone awry.
xref:
http://sourceware.org/bugzilla/show_bug.cgi?id=14827

See the "fastbins" comment.

In [1]: from ctypes import Structure,c_int,cdll,CDLL
   ...: class MallInfo(Structure):
   ...:     _fields_ = [
   ...:         ('arena', c_int),     # Non-mmapped space allocated (bytes)
   ...:         ('ordblks', c_int),   # Number of free chunks
   ...:         ('smblks', c_int),    # Number of free fastbin blocks
   ...:         ('hblks', c_int),     # Number of mmapped regions
   ...:         ('hblkhd', c_int),    # Space allocated in mmapped regions (bytes)
   ...:         ('usmblks', c_int),   # Maximum total allocated space (bytes)
   ...:         ('fsmblks', c_int),   # Space in freed fastbin blocks (bytes)
   ...:         ('uordblks', c_int),  # Total allocated space (bytes)
   ...:         ('fordblks', c_int),  # Total free space (bytes)
   ...:         ('keepcost', c_int),  # Top-most, releasable space (bytes)
   ...:     ]
   ...:     def __repr__(self):
   ...:         return "\n".join(["%s:%d" % (k,getattr(self,k)) for k,v in self._fields_])
   ...: 
   ...: cdll.LoadLibrary("libc.so.6")
   ...: libc = CDLL("libc.so.6")
   ...: mallinfo=libc.mallinfo
   ...: mallinfo.restype=MallInfo
   ...: libc.malloc_trim(0)
   ...: mallinfo().fsmblks
Out[1]: 0

In [2]: import numpy as np
   ...: import pandas as pd
   ...: arr = np.random.randn(100000, 5)
   ...: def leak():
   ...:     for i in xrange(10000):
   ...:         df = pd.DataFrame(arr.copy())
   ...:         result = df.xs(1000)
   ...: leak()
   ...: mallinfo().fsmblks
Out[2]: 128

In [3]: libc.malloc_trim(0)
   ...: mallinfo().fsmblks
Out[3]: 0

Won't fix, then. Maybe we should add some helper functions to pandas someday to do the malloc trimming.
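Such a helper does not appear in this thread, so the following is only a sketch of what one might look like, assuming glibc on Linux. The name trim_memory is hypothetical and is not a pandas API; on systems whose C library lacks malloc_trim the function simply reports that nothing was done.

```python
import ctypes
import ctypes.util


def trim_memory():
    """Ask glibc to return free heap memory to the OS.

    Returns True if malloc_trim() was called, False if it is unavailable.
    """
    libname = ctypes.util.find_library("c")
    if libname is None:
        return False  # no C library found by name on this platform
    try:
        libc = ctypes.CDLL(libname)
        malloc_trim = libc.malloc_trim  # raises AttributeError outside glibc
    except (OSError, AttributeError):
        return False
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    malloc_trim(0)  # 0 = keep no extra padding at the top of the heap
    return True


print(trim_memory())
```

Calling it at a few well-chosen points, for example after a large batch of DataFrames has been dropped, avoids paying for a trim on every object.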

Entry in the FAQ, maybe?

For the record, we (+@sbneto) have been using this in production for some time, and it is working very well:

# monkeypatches.py

# Solving memory leak problem in pandas
# https://github.com/pandas-dev/pandas/issues/2659#issuecomment-12021083
import sys

import pandas as pd
from ctypes import cdll, CDLL
try:
    cdll.LoadLibrary("libc.so.6")
    libc = CDLL("libc.so.6")
    libc.malloc_trim(0)
except (OSError, AttributeError):
    # libc.so.6 or malloc_trim() is missing (for example, not glibc)
    libc = None

__old_del = getattr(pd.DataFrame, '__del__', None)

def __new_del(self):
    if __old_del:
        __old_del(self)
    # Return freed heap memory to the OS each time a DataFrame is destroyed
    libc.malloc_trim(0)

if libc:
    print('Applying monkeypatch for pd.DataFrame.__del__', file=sys.stderr)
    pd.DataFrame.__del__ = __new_del
else:
    print('Skipping monkeypatch for pd.DataFrame.__del__: libc or malloc_trim() not found', file=sys.stderr)

@alanjds thanks a lot!

But there are other affected operations.

It is VERY strange that the issue above (the glibc issue) has had no reaction. It affects EVERYTHING in the environment of Linux PCs and servers. And nothing!!!

I know, you will tell me: OK, write a patch! I will do it (UPD: but it will be strange, because I know nothing about the glibc code). But nobody even knows about it.

Everybody says: KDE leaks. Who knows why?! Nobody!

Open source? For shame! Sorry, but it is true for this situation.

PS http://sourceware.org/bugzilla/show_bug.cgi?id=14827

I do believe you. Two years and no movement on that side :/

I say fix it on this side and add a huge comment assigning the blame, because forking over there looks unfeasible.

@alanjds your code fixed a problem for me that was causing a major headache. Would you be willing to explain what the default pandas behavior is and how your code fixes it?

You can also work around this issue by switching to jemalloc as your default allocator. Instead of python script.py, run LD_PRELOAD=/usr/lib/libjemalloc.so python script.py. Note that the path to libjemalloc.so may be different on your system and that you first need to install it with your package manager.

@tchristensenowlet The problem seems to be in the malloc code of glibc. Apparently, the free implementation there does not respect a flag that should trigger malloc_trim after a certain threshold, as you can see in @ghost's link. Therefore malloc_trim is never called and memory leaks. What we did was simply to call malloc_trim manually if the library is available on the system. We call it in the __del__() method, which is executed when the object is garbage collected.

The glibc.malloc.mxfast tunable has been introduced in glibc (https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html).
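glibc reads its tunables from the GLIBC_TUNABLES environment variable at process start, so the tunable has to be set before the interpreter launches. Here is a minimal sketch of starting a child Python process with it set; run_with_tunables is a hypothetical helper name, and the value mxfast=0 (which, as far as I understand the glibc documentation, turns fastbins off) is an assumption that nobody in this thread has verified against the leak.

```python
import os
import subprocess
import sys


def run_with_tunables(args, tunables="glibc.malloc.mxfast=0"):
    """Run a child Python interpreter with GLIBC_TUNABLES set in its environment."""
    # Copy the current environment and add the tunable; other C libraries ignore it.
    env = dict(os.environ, GLIBC_TUNABLES=tunables)
    return subprocess.run(
        [sys.executable, *args],
        env=env,
        capture_output=True,
        text=True,
    )


result = run_with_tunables(["-c", "import os; print(os.environ['GLIBC_TUNABLES'])"])
print(result.stdout.strip())
```

In practice the arguments would be the path to the real script, for example run_with_tunables(["script.py"]).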

I think this is the culprit in one of our projects, but our users are running Windows with the default Python 3.8 (from the official website) and with all dependencies installed via pip. Does this problem also occur on Windows? If so, what would be the equivalent of cdll.LoadLibrary("libc.so.6")?

Edit: I ran the test described here, and garbage collection did its job properly every time:
https://github.com/pandas-dev/pandas/issues/21353
System: Windows 10
Python: 3.8.5
pandas: 1.1.0
