Requests: Possible Memory Leak

Created on 17 Oct 2013  ·  53 Comments  ·  Source: psf/requests

I have a very simple program that periodically retrieves an image from an IP camera. I've noticed that the working set of this program grows monotonically. I've written a small program that reproduces the issue.

import requests
from memory_profiler import profile


@profile
def lol():
    print "sending request"
    r = requests.get('http://cachefly.cachefly.net/10mb.test')
    print "reading.."
    with open("test.dat", "wb") as f:
        f.write(r.content)
    print "Finished..."

if __name__=="__main__":
    for i in xrange(100):
        print "Iteration", i
        lol()

The memory usage is printed at the end of each iteration. This is the sample output.
**Iteration 0**

Iteration 0
sending request
reading..
Finished...
Filename: test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     12.5 MiB      0.0 MiB   @profile
     6                             def lol():
     7     12.5 MiB      0.0 MiB       print "sending request"
     8     35.6 MiB     23.1 MiB       r = requests.get('http://cachefly.cachefly.net/10mb.test')
     9     35.6 MiB      0.0 MiB       print "reading.."
    10     35.6 MiB      0.0 MiB       with open("test.dat", "wb") as f:
    11     35.6 MiB      0.0 MiB        f.write(r.content)
    12     35.6 MiB      0.0 MiB       print "Finished..."

**Iteration 1**

Iteration 1
sending request
reading..
Finished...
Filename: test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     35.6 MiB      0.0 MiB   @profile
     6                             def lol():
     7     35.6 MiB      0.0 MiB       print "sending request"
     8     36.3 MiB      0.7 MiB       r = requests.get('http://cachefly.cachefly.net/10mb.test')
     9     36.3 MiB      0.0 MiB       print "reading.."
    10     36.3 MiB      0.0 MiB       with open("test.dat", "wb") as f:
    11     36.3 MiB      0.0 MiB        f.write(r.content)
    12     36.3 MiB      0.0 MiB       print "Finished..."

The memory usage does not grow on every iteration, but it does continue to creep up, with requests.get being the line responsible for the increase.

By **Iteration 99**, this is what the memory profile looks like.

Iteration 99
sending request
reading..
Finished...
Filename: test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     40.7 MiB      0.0 MiB   @profile
     6                             def lol():
     7     40.7 MiB      0.0 MiB       print "sending request"
     8     40.7 MiB      0.0 MiB       r = requests.get('http://cachefly.cachefly.net/10mb.test')
     9     40.7 MiB      0.0 MiB       print "reading.."
    10     40.7 MiB      0.0 MiB       with open("test.dat", "wb") as f:
    11     40.7 MiB      0.0 MiB        f.write(r.content)
    12     40.7 MiB      0.0 MiB       print "Finished..."

Memory usage doesn't drop unless the program is terminated.

Is there a bug or is it user error?
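For what it's worth, a streaming variant of the same download keeps the per-request footprint small regardless of whether this turns out to be a leak. A rough sketch using requests' stream=True and iter_content (not the original test program):

import requests

def fetch(url, path):
    # Stream the body in 64 KiB chunks so the full 10 MB payload is
    # never held in memory as a single string (unlike r.content).
    r = requests.get(url, stream=True)
    try:
        with open(path, "wb") as f:
            for chunk in r.iter_content(chunk_size=65536):
                if chunk:
                    f.write(chunk)
    finally:
        r.close()  # release the connection back to the pool

if __name__ == "__main__":
    for i in xrange(100):
        fetch('http://cachefly.cachefly.net/10mb.test', 'test.dat')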

Bug

Most helpful comment

There have been no further complaints of this arising, and I think we've done our best on this. I'm happy to reopen it and reinvestigate if necessary.

All 53 comments

Thanks for raising this and providing so much detail!

Tell me, do you ever see the memory usage go down at any point?

I've not seen memory usage go down. I was wondering if it had to do with Python's garbage collector and perhaps it just hasn't had an opportunity to kick in, so I added a call to gc.collect() after each download. That made no difference.
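For reference, the variant described here (forcing a collection after each download) looks roughly like this; a sketch of what was tried, not the exact script:

import gc
import requests

def lol():
    r = requests.get('http://cachefly.cachefly.net/10mb.test')
    with open("test.dat", "wb") as f:
        f.write(r.content)
    del r
    gc.collect()  # force a full collection after each download; reportedly made no difference

if __name__ == "__main__":
    for i in xrange(100):
        lol()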

May I ask why this issue has been closed?

My company has experienced this same issue, which is even more pronounced when using PyPy. We spent several days tracing the origin of this issue in our code base to python-requests.

Just to highlight the seriousness of this issue, here's a screenshot of what one of our server processes looked like when running a memory profiler:
http://cl.ly/image/3X3G2y3Y191h

The issue is still present with regular CPython, but it is less noticeable. Perhaps this is why it has gone unreported, despite the grave consequences it has for those using this library in long-running processes.

At this point, we are desperate enough to consider using curl with a subprocess.

Please let me know what you think and whether this will ever be investigated thoroughly. Otherwise, I hold the opinion that python-requests is too dangerous to use for mission-critical applications (e.g., health-care-related services).

Thanks,
-Matt

It was closed due to inactivity. If you believe you can provide useful diagnostics to point us in the right direction, we'd be happy to reopen.

Well, allow me to help then.

I created a small git repo to help facilitate the examination of this issue.
https://github.com/mhjohnson/memory-profiling-requests

Here's a screenshot of the graph it generates:
http://cl.ly/image/453h1y3a2p1r

Hope this helps! Let me know if I did anything incorrectly.

-Matt

Thanks Matt! I'm going to start looking into this now. The first few times I've run the script (and the variations I've tried) have shown this is easily reproducible. I'm going to have to start playing with this now.

So this grows at about 0.1 MB/request. I tried sticking the profile decorator on lower-level methods, but they're all too long for the output to be remotely useful, and using an interval higher than 0.1 seems only to track the overall usage, not the per-line usage. Are there better tools than mprof?

So I decided instead to pipe its output to | ag '.*0\.[1-9]+ MiB.*' to get the lines where memory is added, and moved the profile decorator to Session#send. Unsurprisingly, most of it is coming from the call to HTTPAdapter#send. Down the rabbit hole I go.
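For anyone reproducing this, moving the per-line profiling onto a library method can be done by wrapping it at import time. A rough sketch of that approach (the exact harness used here isn't shown in the thread):

from memory_profiler import profile

import requests

# Re-bind Session.send to a profiled wrapper so memory_profiler reports
# line-by-line memory usage inside requests itself.
requests.sessions.Session.send = profile(requests.sessions.Session.send)

for i in xrange(20):
    requests.get('http://cachefly.cachefly.net/10mb.test')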

And now it's all coming from the call to conn.urlopen on L355 and HTTPAdapter#get_connection. If you decorate get_connection, there are 7 times it allocates memory when it calls PoolManager#connection_from_url. Now given the majority are being triggered by HTTPResponses returned from urllib3, I'm going to see if there's something we _should_ be doing with them that we aren't to ensure that memory is released after the fact. If I can't find a good way to handle that, I'll start digging into urllib3.

@sigmavirus24 Wow. Great work! It looks like you may have pinpointed the hot spot in the code.
As for tracing which object is responsible for the memory leak, you might get some extra hints by using objgraph, like so:

import gc
import objgraph
# garbage collect first
gc.collect()  
# print most common python types
objgraph.show_most_common_types()

Lemme know if I can help in any way.

-Matt

My first guess as to the culprit would be socket objects. That would explain why it's worse on PyPy...

I'm sitting in an airport right now and I'll be on a plane for several hours soon. I probably won't be able to get to this tonight, or potentially until later this week (if not next weekend/week). So far, though, I tried using release_conn on the HTTPResponse we receive back. I checked with gc.get_referents what the Response object has that may be failing to be GC'd. It has the original httplib HTTPResponse (stored as _original_response), and that (from what get_referents reported) only has an email Message (for the headers); everything else is either a string or a dictionary (or maybe lists). If it is sockets, I don't see where they wouldn't be garbage collected.

Also, using Session#close (I made the code use sessions instead of the functional API first) doesn't help (and that should clear the PoolManagers which clear the connection pools). So the other thing that was interesting was that PoolManager#connection_from_url would add ~0.8 MB (give or take 0.1) the first few times it was called. So that adds ~3MB but the rest of it comes from conn.urlopen in HTTPAdapter#send. The bizarre thing is that gc.garbage has some odd elements if you use gc.set_debug(gc.DEBUG_LEAK). It has something like [[[...], [...], [...], None], [[...], [...], [...], None], [[...], [...], [...], None], [[...], [...], [...], None]] and as you'd expect gc.garbage[0] is gc.garbage[0][0] so that information is absolutely useless. I'll have to experiment with objgraph when I get the chance.

So I dug into urllib3 and followed the rabbit hole further earlier this morning. I profiled ConnectionPool#urlopen, which led me to ConnectionPool#_make_request. At this point, there's a lot of memory allocated from lines 306 and 333 in urllib3/connectionpool.py. L306 is self._validate_conn(conn) and L333 is conn.getresponse(buffering=True). getresponse is the httplib method on an HTTPConnection. Profiling further into that will not be easy. If we look at _validate_conn, the line there that causes this is conn.connect(), which is another HTTPConnection method. connect is almost certainly where the socket is being created. If I disable the memory profiling and stick a print(old_pool) in HTTPConnectionPool#close, it never prints anything. It would seem we're not actually closing pools as the session gets destroyed. My guess is this is the cause of the memory leak.

Would love to help debug this, I'll be in/out of IRC today and tomorrow.

So tracing this further, if you open up python with _make_request still decorated (with profile), and you create a session, then make requests every 10 or 20 seconds (to the same URL even), you'll see the conn has been considered dropped, so the VerifiedHTTPSConnection is closed and then reused. This means the connection class is reused, not the underlying socket. The close method is the one that lives on httplib.HTTPConnection (L798). This closes the socket object, then sets it to None. Then it closes (and sets to None) the most recent httplib.HTTPResponse. If you also profile VerifiedHTTPSConnection#connect, all of the memory created/leaked happens in urllib3.util.ssl_.ssl_wrap_socket.

So looking at this, what memory_profiler is using to report memory usage is the process' resident set size (rss). This is the size of the process in RAM (the vms, or virtual memory size, has to do with mallocs), so I'm looking to see if we're leaking virtual memory, or if we're just having pages allocated for memory that we're not losing.

So given that all of the URLs we'd been using thus far were verified HTTPS, I switched to using http://google.com, and while there's still a consistent increase in memory, it seems to consume ~11-14 MiB less on the whole. It still all comes back to the conn.getresponse line (and, to a lesser degree now, conn.request).

The interesting thing is that VMS doesn't seem to grow much when I'm examining it in the repl. I have yet to modify mprof to return that value instead of the RSS value. A steadily increasing VMS would certainly point to a memory leak, while RSS could simply reflect a large number of mallocs (which is possible). Most operating systems (if I understand correctly) don't reclaim RSS eagerly, so until another application page faults and there's nowhere else to assign pages from, RSS will never shrink (even if it could). That said, if we're consistently increasing without reaching a steady state, I can't be certain whether that's requests/urllib3 or just the interpreter.
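A quick way to watch RSS and VMS side by side without modifying mprof is psutil, which exposes both on the current process. A sketch (psutil is an extra dependency and not something the thread's scripts used; assumes a reasonably recent psutil with memory_info()):

import psutil
import requests

proc = psutil.Process()  # the current process

for i in xrange(50):
    requests.get('http://cachefly.cachefly.net/10mb.test')
    mem = proc.memory_info()  # named tuple with .rss and .vms in bytes
    print "iteration %d: rss=%.1f MiB vms=%.1f MiB" % (
        i, mem.rss / 1048576.0, mem.vms / 1048576.0)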

I'm also going to see what happens when we use urllib2/httplib directly because I'm starting to think that this isn't our problem. As far as I can tell, Session#close properly closes all sockets and removes references to them to allow them to be GC'd. Further, if a socket needs to be replaced by the Connection Pool, the same happens. Even SSLSockets seem to properly handle being garbage collected.

So urllib2 consistently seems to flatline around 13.3MiB. The difference is I had to wrap it in a try/except because it would consistently crash with a URLError after a short while. So perhaps it's not actually doing anything after a while.

@sigmavirus24 You're crushing it! :)

Hmm... Python only releases memory to be reused by itself again, and the system doesn't get the memory back until the process terminates. So I would think the flatline you are seeing at 13.3 MiB is probably an indication that there is no memory leak present with urllib2, unlike with urllib3.

It would be nice to confirm that the problem can be isolated to urllib3. Can you share the scripts you're using to test with urllib2?

So I'm starting to wonder if this doesn't have something to do with the HTTPConnection objects. If you do

import sys
import requests

s = requests.Session()
r = s.get('https://httpbin.org/get')
print('Number of response refs: ', sys.getrefcount(r) - 1)
print('Number of session refs: ', sys.getrefcount(s) - 1)
print('Number of raw refs: ', sys.getrefcount(r.raw) - 1)
print('Number of original response refs: ', sys.getrefcount(r.raw._original_response) - 1)

The first three should print 1, the last one 3. [1] I already identified that an HTTPConnection has _HTTPConnection__response, which is a reference to _original_response. So I was expecting that number to be 3. What I cannot figure out is what is holding the reference to the 3rd copy.

For further entertainment, add the following

import gc
gc.set_debug(gc.DEBUG_STATS | gc.DEBUG_UNCOLLECTABLE)

to the beginning of the script. There are 2 unreachable objects after making the call to requests, which is interesting, but nothing was uncollectable. If you add this to the script @mhjohnson provided and filter the output for the lines containing unreachable, you'll see that there are plenty of times where there are well over 300 unreachable objects. I don't yet know what the significance of unreachable objects is, though. As always, I'll keep y'all posted.

@mhjohnson to test with urllib2, just replace your call to requests.get with urllib2.urlopen (also, I should probably have been doing r.read(), but I wasn't).
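A sketch of that urllib2 comparison loop (the try/except matches what was described a few comments up; the read() call is included here even though the original run reportedly skipped it):

import urllib2
from memory_profiler import profile

@profile
def lol():
    try:
        r = urllib2.urlopen('http://cachefly.cachefly.net/10mb.test')
        r.read()  # the original comparison reportedly skipped this read
    except urllib2.URLError:
        # plain urllib2 reportedly started raising URLError after a while
        pass

if __name__ == "__main__":
    for i in xrange(100):
        lol()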

So I took @mhjohnson's previous suggestion and used objgraph to figure out where the other reference was, but objgraph can't seem to find it. I added:

objgraph.show_backrefs([r.raw._original_response], filename='requests.png')

in the script 2 comments above and got the following:

[requests.png back-reference graph]

It only shows that there would be 2 references to it. I wonder if there's something about how sys.getrefcount works that makes it unreliable.

So that's a red herring. A urllib3.response.HTTPResponse has both _original_response and _fp. That, combined with _HTTPConnection__response, gives us three refs.

So, urllib3.response.HTTPResponse has a _pool attribute, which is also referenced by the PoolManager. Likewise, the HTTPAdapter used to make the request has a reference on the Response that requests returns. Maybe someone else can identify something from here:

[requests.png back-reference graph]

The code that generates that is: https://gist.github.com/sigmavirus24/bc0e1fdc5f248ba1201d

@sigmavirus24
Yeah, I got a little lost with that last graphic, probably because I don't know the code base very well, nor am I very seasoned at debugging memory leaks.

Do you know which object this is that I am pointing at with the red arrow in this screenshot of your graphic?
http://cl.ly/image/3l3g410p3r1C

I was able to get the code to show the same slowly increasing memory usage on Python 3 by replacing urllib3/requests with urllib.request.urlopen.

Modified code here: https://gist.github.com/kevinburke/f99053641fab0e2259f0

Kevin Burke
phone: 925.271.7005 | twentymilliseconds.com


As far as I can tell, making requests to a website that returns a Connection: close header (for example https://api.twilio.com/2010-04-01.json) does not increase the memory usage by a significant amount. The caveat is that there are multiple different factors, and I'm just assuming it's a socket-related issue.


@mhjohnson that seems to be the number of references to the metatype type by object, which is of type type. In other words, I think that's all the references to either object or type, but I'm not quite sure. Either way, if I try to exclude those, the graph shrinks to something like 2 nodes.

I am also very concerned about this memory leak problem because we use Requests in our web crawling system in which a process usually runs for several days. Is there any progress on this issue?

After spending some time on this together with @mhjohnson, I can confirm @kevinburke's theory about the way the GC treats sockets on PyPy.

Commit 3c0b94047c1ccfca4ac4f2fe32afef0ae314094e is an interesting one, specifically this line: https://github.com/kennethreitz/requests/blob/master/requests/models.py#L736

Calling self.raw.release_conn() before returning content significantly reduced the memory used on PyPy, though there's still room for improvement.

Also, I think it would be better if we documented the .close() calls on the session and response classes, as also mentioned by @sigmavirus24. Requests users should be aware of those methods, because in most cases they are not called implicitly.

I also have a question and a suggestion related to the QA of this project. May I ask the maintainers why we don't use a CI to ensure the integrity of our tests? Having a CI would also allow us to write benchmark test cases where we can profile and keep track of any performance/memory regressions.

A good example of such an approach can be found in the pq project:
https://github.com/malthe/pq/blob/master/pq/tests.py#L287

Thanks to everyone who jumped on this and decided to help!
We will keep investigating other theories causing this.

@stas I want to address one thing:

Requests users should be aware of those methods, because in most of the cases the methods are not called implicitly.

Leaving PyPy aside for a moment, those methods shouldn't _need_ to be called explicitly. If the socket objects become unreachable in CPython they will get auto gc'd, which includes closing the file handles. This is not an argument to not-document those methods, but it is a warning to not focus overmuch on them.

We are meant to use a CI, but it appears to be unwell at the moment, and only @kennethreitz is in a position to fix it. He'll get to it when he has time. Note, however, that benchmark tests are extremely difficult to get right in a way that doesn't make them extremely noisy.

Leaving PyPy aside for a moment, those methods shouldn't need to be called explicitly. If the socket objects become unreachable in CPython they will get auto gc'd, which includes closing the file handles. This is not an argument to not-document those methods, but it is a warning to not focus overmuch on them.

I kind of agree with what you say, except that we're discussing Python here. I don't want to start an argument, but going by _The Zen of Python_, the pythonic way would be to follow the _explicit is better than implicit_ approach. I'm also not familiar with this project's philosophy, so please ignore my thoughts if this does not apply to requests.

I would be happy to help with the CI or benchmark tests whenever there's an opportunity! Thanks for explaining the current situation.

So, I think I've found the cause of the problem when using the functional API. If you do

import requests
r = requests.get('https://httpbin.org/get')
print(r.raw._pool.pool.queue[-1].sock)

The socket appears to still be open. The reason I say _appears_ is that it still has a _sock attribute; if you do

r.raw._pool.pool.queue[-1].close()
print(repr(r.raw._pool.pool.queue[-1].sock))

You'll see None printed. So what is happening is that urllib3 includes on every HTTPResponse an attribute that points to the Connection Pool it came from. The connection pool has the Connection in the queue which has the unclosed socket. The problem, for the functional API, would be fixed if in requests/api.py we do:

def request(...):
    """..."""
    s = Session()
    response = s.request(...)
    s.close()
    return response

Then r.raw._pool will still be the connection pool but r.raw._pool.pool will be None.

The tricky part becomes what happens when people are using sessions. Having them close the session after every request is nonsensical and defeats the purpose of the session. In reality, if you use a Session (without threads) and make 100 requests to the same domain (and the same scheme, e.g., https), the memory leak is much harder to see, unless you wait about 30 seconds for a new socket to be created. The problem is that, as we've already seen, r.raw._pool is a very mutable object. It's a reference to the connection pool that is managed by the PoolManager in requests. So when the socket is replaced, it is replaced with references to it from every response that is still reachable (in scope). What I need to do more of is figure out whether anything still holds references to the sockets after we close the connection pools. If I can find something that is holding on to references, I think we'll find the _real_ memory leak.
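In user code, that boils down to explicitly closing the session once a batch of requests is done. A minimal sketch (newer requests versions also let you scope it with a with block, assuming a version where Session implements the context-manager protocol):

import requests

session = requests.Session()
try:
    for i in xrange(100):
        r = session.get('http://cachefly.cachefly.net/10mb.test')
        r.content  # consume the body so the connection can be reused
finally:
    session.close()  # clears the PoolManager and its connection pools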

So one idea that I had was to use objgraph to figure out what actually references a SSLSocket after a call to requests.get and I got this:

[socket.png back-reference graph]

The interesting thing is that there are apparently 7 references to the SSLSocket, but only two back references that objgraph could find. I think 1 of the references is the one passed to objgraph and another is the binding I make in the script that generates this, but that still leaves 3 or 4 unaccounted-for references, and I'm not sure where they're coming from.

Here's my script to generate this:

import objgraph
import requests

r = requests.get('https://httpbin.org/get')
s = r.raw._pool.pool.queue[-1].sock
objgraph.show_backrefs(s, filename='socket.png', max_depth=15, refcounts=True)

Using

import objgraph
import requests

r = requests.get('https://httpbin.org/get')
s = r.raw._pool.pool.queue[-1].sock
objgraph.show_backrefs(s, filename='socket-before.png', max_depth=15,
                       refcounts=True)
r.raw._pool.close()
objgraph.show_backrefs(s, filename='socket-after.png', max_depth=15,
                       refcounts=True)

The socket-after.png shows this:

[socket-after.png back-reference graph]

So we eliminate one reference to the ssl socket. That said, when I look at s._sock the underlying socket.socket is closed.

After running a bunch of long-running benchmarks, here's what we've found:

  • calling close() explicitly helps!
  • users running multiple requests should use Session and properly close it after they're done. Please merge #2326
  • PyPy users are better without JIT! Or they should call gc.collect() explicitly!

TL;DR: requests looks good; below you will find a couple of runs of this snippet:

import requests
from memory_profiler import profile

@profile
def get(session, i):
    return session.get('http://stas.nerd.ro/?{0}'.format(i))

@profile
def multi_get(session, count):
    for x in xrange(count):
        resp = get(session, count+1)
        print resp, len(resp.content), x
        resp.close()

@profile
def run():
    session = requests.Session()
    print 'Starting...'
    multi_get(session, 3000)
    print("Finished first round...")
    session.close()
    print 'Done.'

if __name__ == '__main__':
    run()

CPython:

Line #    Mem usage    Increment   Line Contents
================================================
    15      9.1 MiB      0.0 MiB   @profile
    16                             def run():
    17      9.1 MiB      0.0 MiB       session = requests.Session()
    18      9.1 MiB      0.0 MiB       print 'Starting...'
    19      9.7 MiB      0.6 MiB       multi_get(session, 3000)
    20      9.7 MiB      0.0 MiB       print("Finished first round...")
    21      9.7 MiB      0.0 MiB       session.close()
    22      9.7 MiB      0.0 MiB       print 'Done.'

PyPy without JIT:

Line #    Mem usage    Increment   Line Contents
================================================
    15     15.0 MiB      0.0 MiB   @profile
    16                             def run():
    17     15.4 MiB      0.5 MiB       session = requests.Session()
    18     15.5 MiB      0.0 MiB       print 'Starting...'
    19     31.0 MiB     15.5 MiB       multi_get(session, 3000)
    20     31.0 MiB      0.0 MiB       print("Finished first round...")
    21     31.0 MiB      0.0 MiB       session.close()
    22     31.0 MiB      0.0 MiB       print 'Done.'

PyPy with JIT:

Line #    Mem usage    Increment   Line Contents
================================================
    15     22.0 MiB      0.0 MiB   @profile
    16                             def run():
    17     22.5 MiB      0.5 MiB       session = requests.Session()
    18     22.5 MiB      0.0 MiB       print 'Starting...'
    19    219.0 MiB    196.5 MiB       multi_get(session, 3000)
    20    219.0 MiB      0.0 MiB       print("Finished first round...")
    21    219.0 MiB      0.0 MiB       session.close()
    22    219.0 MiB      0.0 MiB       print 'Done.'

I believe one of the reasons we all got confused initially is that running the benchmarks requires a bigger sample in order to factor out the way the GC behaves from one implementation to another.

Also, running the requests in a threaded environment requires a larger set of calls due to the way threads work (we didn't see any major variation in memory usage after running multiple thread pools).

Regarding PyPy with JIT, calling gc.collect() for the same number of calls saved ~30% of memory. That is why I believe the JIT results should be excluded from this discussion, since they come down to how everyone tweaks the VM and optimizes their code for the JIT.
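For PyPy users who keep the JIT enabled, the workaround described above amounts to forcing collections periodically. A sketch (the URL and the every-100-requests threshold are arbitrary placeholders, not values from the benchmark):

import gc
import requests

session = requests.Session()
for i in xrange(3000):
    resp = session.get('http://example.org/?{0}'.format(i))
    resp.close()
    if i % 100 == 0:
        gc.collect()  # nudge PyPy to reclaim socket/SSL objects sooner
session.close()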

Alright, so the problem appears to be specifically with the way our memory handling interacts with the PyPy JIT. It might be a good idea to summon a PyPy expert: @alex?

I really can't imagine what requests (and company) could possibly be doing that would cause anything like this. Can you run your test with PYPYLOG=jit-summary:- in the env and paste the results? (That will print some stuff out when the process ends.)

Hope this helps:

Line #    Mem usage    Increment   Line Contents
================================================
    15     23.7 MiB      0.0 MiB   @profile
    16                             def run():
    17     24.1 MiB      0.4 MiB       session = requests.Session()
    18     24.1 MiB      0.0 MiB       print 'Starting...'
    19    215.1 MiB    191.0 MiB       multi_get(session, 3000)
    20    215.1 MiB      0.0 MiB       print("Finished first round...")
    21    215.1 MiB      0.0 MiB       session.close()
    22    215.1 MiB      0.0 MiB       print 'Done.'


[2cbb7c1bbbb8] {jit-summary
Tracing:        41  0.290082
Backend:        30  0.029096
TOTAL:              1612.933400
ops:                79116
recorded ops:       23091
  calls:            2567
guards:             7081
opt ops:            5530
opt guards:         1400
forcings:           198
abort: trace too long:  2
abort: compiling:   0
abort: vable escape:    9
abort: bad loop:    0
abort: force quasi-immut:   0
nvirtuals:          9318
nvholes:            1113
nvreused:           6666
Total # of loops:   23
Total # of bridges: 8
Freed # of loops:   0
Freed # of bridges: 0
[2cbb7c242e8b] jit-summary}

I'm on trusty 32bit using latest PyPy from https://launchpad.net/~pypy/+archive/ubuntu/ppa

31 compiled paths does not explain 200+ MB of RAM in use.

Can you put a call to gc.dump_rpy_heap('filename.txt') in your program while it's at a very high memory usage? (You just need to run it once; this will generate a dump of all the memory the GC knows about.)

Then, with a checkout of the PyPy source tree, run ./pypy/tool/gcdump.py filename.txt and show us the results.

Thanks!


Log:

Line #    Mem usage    Increment   Line Contents
================================================
    16     22.0 MiB      0.0 MiB   @profile
    17                             def run():
    18     22.5 MiB      0.5 MiB       session = requests.Session()
    19     22.5 MiB      0.0 MiB       print 'Starting...'
    20    217.2 MiB    194.7 MiB       multi_get(session, 3000)
    21    217.2 MiB      0.0 MiB       print("Finished first round...")
    22    217.2 MiB      0.0 MiB       session.close()
    23    217.2 MiB      0.0 MiB       print 'Done.'
    24    221.0 MiB      3.8 MiB       gc.dump_rpy_heap('bench.txt')


[3fd7569b13c5] {jit-summary
Tracing:        41  0.293192
Backend:        30  0.026873
TOTAL:              1615.665337
ops:                79116
recorded ops:       23091
  calls:            2567
guards:             7081
opt ops:            5530
opt guards:         1400
forcings:           198
abort: trace too long:  2
abort: compiling:   0
abort: vable escape:    9
abort: bad loop:    0
abort: force quasi-immut:   0
nvirtuals:          9318
nvholes:            1113
nvreused:           6637
Total # of loops:   23
Total # of bridges: 8
Freed # of loops:   0
Freed # of bridges: 0
[3fd756c29302] jit-summary}

The dump here: https://gist.github.com/stas/ad597c87ccc4b563211a

Thanks for taking your time to help with it!

So this accounts for maybe 100MB of the usage. There are two places the rest of it can be: in "spare memory" the GC keeps around for various things, and in non-GC allocations, meaning things like OpenSSL's internal allocations. I wonder if there's a good way to see whether OpenSSL structures are being leaked. Is the thing being tested here using TLS? If yes, can you try with a non-TLS site and see if it reproduces?


@alex,

I believe that @stas used a regular HTTP (non-SSL/TLS) connection for this benchmark. Just in case, I also used @stas's benchmark script and performed the same run on my Mac (OS X 10.9.5, 2.5 GHz i5, 8 GB 1600 MHz DDR3) with a regular HTTP connection.

If it helps, here are my results to compare (using your instructions):
https://gist.github.com/mhjohnson/a13f6403c8c3a3d49b8d

Let me know what you think.

Thanks,

-Matt

GitHub's regular expression is too loose. I'm reopening this because I don't think it's entirely fixed.

Hello, maybe I can help by pointing out that the issue still exists. I have a crawler that uses requests and spawns workers with multiprocessing.Process. What's happening is that more than one instance is receiving the same result. Maybe there is some leak in the result buffer or the socket itself.

Let me know if I can send a sample of the code, or how to generate the reference tree to identify which part of the information is being "shared" (leaked).

Thanks

@barroca that's a different issue. You're likely using a Session across threads and using stream=True. If you're closing a response before you've finished reading it, the socket is placed back into the connection pool with that data still in it (if I remember correctly). If that's not happening it's also plausible that you're picking up the most recent connection and receiving a cached response from the server. Either way, this is not an indication of a memory leak.
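If requests do need to be issued from multiple threads, a common pattern is one Session per thread, for example via threading.local. A sketch (not code from this thread):

import threading
import requests

_local = threading.local()

def get_session():
    # Lazily create one Session per thread instead of sharing a single
    # Session (and its connection pool) across threads.
    if not hasattr(_local, 'session'):
        _local.session = requests.Session()
    return _local.session

def fetch(url):
    return get_session().get(url)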

@sigmavirus24 Thanks Ian, it was some misuse of the Session across threads, as you've mentioned. Thanks for the explanation, and sorry for updating the wrong issue.

No worries @barroca :)

There have been no further complaints of this arising, and I think we've done our best on this. I'm happy to reopen it and reinvestigate if necessary.

So, what is the solution to this issue?

@Makecodeeasy I wanna know that too

So far my issue with requests is that it's not thread-safe; it's best to use a separate session for each thread.

My ongoing work walking through millions of URLs to validate cache responses led me here, as I discovered that memory usage grows beyond reason when requests interacts with ThreadPoolExecutor or threading. In the end I just use multiprocessing.Process to isolate the workers and give each one an independent session.

@AndCycle then your issue is not here. There was a PR merged to fix this particular memory leak case. It has not regressed, as there are tests around it, and your issue sounds completely different.
