Celery: Continuous memory leak

Created on 23 Jun 2018  ·  129 Comments  ·  Source: celery/celery

There is a memory leak in the parent process of Celery's worker.
It is not in the child processes that execute tasks.
It starts suddenly every few days.
Unless you stop Celery, it consumes the server's memory within tens of hours.

This problem happens at least in Celery 4.1, and it also occurs in Celery 4.2.
Celery is running on Ubuntu 16 and brokers use RabbitMQ.

[memory usage graph]

Labels: Prefork Workers Pool, RabbitMQ Broker, Bug Report, Critical, Needs Testcase ✘, Needs Verification ✘, memory leak

Most helpful comment

Why was this issue closed?

All 129 comments

Are you using Canvas workflows? Maybe #4839 is related.

Also I assume you are using prefork pool for worker concurrency?

Thanks georgepsarakis.

I am not using Canvas workflows.
I use the prefork pool with concurrency 1 on a single server.

The increase rate seems quite linear, which is quite weird. Is the worker processing tasks during this time period? Also, can you add a note with the complete command you are using to start the worker?

Yes. The worker continues to process tasks normally.

The worker is started with the following command.

/xxxxxxxx/bin/celery worker --app=xxxxxxxx --loglevel=INFO --pidfile=/var/run/xxxxxxxx.pid

This problem is occurring in both the production environment and the test environment.
I can add memory profiling and test output from the test environment.
If there is anything I can do, please let me know.

We need to understand what the worker is running during the time that the memory increase is observed. Any information and details you can possibly provide would definitely help. It is also good that you can reproduce this.

Although this case occurred at a different time from the graph above, the following log was output at the moment the memory leak started.

[2018-02-24 07:50:52,953: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/xxxxxxxx/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 320, in start
blueprint.start(self)
File "/xxxxxxxx/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/xxxxxxxx/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 596, in start
c.loop(*c.loop_args())
File "/xxxxxxxx/lib/python3.5/site-packages/celery/worker/loops.py", line 88, in asynloop
next(loop)
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/async/hub.py", line 293, in create_loop
poll_timeout = fire_timers(propagate=propagate) if scheduled else 1
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/async/hub.py", line 136, in fire_timers
entry()
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/async/timer.py", line 68, in __call__
return self.fun(*self.args, **self.kwargs)
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/async/timer.py", line 127, in _reschedules
return fun(*args, **kwargs)
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/connection.py", line 290, in heartbeat_check
return self.transport.heartbeat_check(self.connection, rate=rate)
File "/xxxxxxxx/lib/python3.5/site-packages/kombu/transport/pyamqp.py", line 149, in heartbeat_check
return connection.heartbeat_tick(rate=rate)
File "/xxxxxxxx/lib/python3.5/site-packages/amqp/connection.py", line 696, in heartbeat_tick
self.send_heartbeat()
File "/xxxxxxxx/lib/python3.5/site-packages/amqp/connection.py", line 647, in send_heartbeat
self.frame_writer(8, 0, None, None, None)
File "/xxxxxxxx/lib/python3.5/site-packages/amqp/method_framing.py", line 166, in write_frame
write(view[:offset])
File "/xxxxxxxx/lib/python3.5/site-packages/amqp/transport.py", line 258, in write
self._write(s)
ConnectionResetError: [Errno 104] Connection reset by peer
[2018-02-24 08:49:12,016: INFO/MainProcess] Connected to amqp://xxxxxxxx:**@xxx.xxx.xxx.xxx:5672/xxxxxxxx

It seems that it occurred when the connection with RabbitMQ was temporarily cut off.

@marvelph so it occurs during RabbitMQ reconnections? Perhaps these issues are related:

Yes.
It seems that reconnection triggers it.

It looks like I'm having the same issue... It is so hard for me to find out what triggers it and why there is a memory leak. It has annoyed me for at least a month. I fell back to Celery 3 and everything is fine.

For the memory leak issue, I'm using ubuntu 16, celery 4.1.0 with rabbitmq. I deployed it via docker.

The memory leak is in MainProcess, not ForkPoolWorker. The memory usage of ForkPoolWorker is normal, but the memory usage of MainProcess is always increasing. Around 0.1 MB of memory is leaked every five seconds. The memory leak doesn't start immediately after the worker starts, but maybe after one or two days.

I used gdb and pyrasite to inject into the running process and try gc.collect(), but nothing was collected.
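
(For reference, a payload along these lines can be injected with pyrasite to run the collection inside the worker's MainProcess; the log path and file name are just examples.)

```python
# payload.py - injected into the running worker, e.g. "pyrasite <pid> payload.py".
# Forces a full garbage collection and appends the result to a log file so it
# can be inspected afterwards. The log path is only an example.
import gc
import os

collected = gc.collect()
with open('/tmp/celery-gc-debug.log', 'a') as fh:
    fh.write('pid=%d collected=%d uncollectable=%d\n'
             % (os.getpid(), collected, len(gc.garbage)))
```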

I checked the log; the consumer: Connection to broker lost. Trying to re-establish the connection... message did happen, but for now I'm not sure this is when the memory leak happens.

Any hints for debugging this issue and to find out what really happens? Thanks.

Since @marvelph mentioned it may be related to rabbitmq reconnection, I tried stopping my rabbitmq server. The memory usage did increase after each reconnection; the log follows. So I can confirm the https://github.com/celery/kombu/issues/843 issue.

But after the connection is re-established, the memory usage stops increasing. So I'm not sure this is the reason for the memory leak.

I will try using redis to figure out whether this memory leak issue is related to rabbitmq or not.

[2018-06-25 02:43:33,456: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 316, in start
    blueprint.start(self)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 592, in start
    c.loop(*c.loop_args())
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/loops.py", line 91, in asynloop
    next(loop)
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 354, in create_loop
    cb(*cbargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/transport/base.py", line 236, in on_readable
    reader(loop)
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/transport/base.py", line 218, in _read
    drain_events(timeout=0)
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/connection.py", line 491, in drain_events
    while not self.blocking_read(timeout):
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/connection.py", line 496, in blocking_read
    frame = self.transport.read_frame()
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/transport.py", line 243, in read_frame
    frame_header = read(7, True)
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/transport.py", line 418, in _read
    s = recv(n - len(rbuf))
ConnectionResetError: [Errno 104] Connection reset by peer
[2018-06-25 02:43:33,497: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 2.00 seconds...

[2018-06-25 02:43:35,526: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 4.00 seconds...

[2018-06-25 02:43:39,560: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 6.00 seconds...

[2018-06-25 02:43:45,599: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 8.00 seconds...

[2018-06-25 02:43:53,639: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 10.00 seconds...

[2018-06-25 02:44:03,680: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 12.00 seconds...

[2018-06-25 02:44:15,743: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 14.00 seconds...

[2018-06-25 02:44:29,790: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 16.00 seconds...

[2018-06-25 02:44:45,839: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 18.00 seconds...

[2018-06-25 02:45:03,890: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 20.00 seconds...

[2018-06-25 02:45:23,943: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 22.00 seconds...

[2018-06-25 02:45:46,002: ERROR/MainProcess] consumer: Cannot connect to amqp://***:**@***:***/***: [Errno 111] Connection refused.
Trying again in 24.00 seconds...

[2018-06-25 02:46:10,109: INFO/MainProcess] Connected to amqp://***:**@***:***/***
[2018-06-25 02:46:10,212: INFO/MainProcess] mingle: searching for neighbors
[2018-06-25 02:46:10,291: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 316, in start
    blueprint.start(self)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/mingle.py", line 40, in start
    self.sync(c)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/mingle.py", line 44, in sync
    replies = self.send_hello(c)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/worker/consumer/mingle.py", line 57, in send_hello
    replies = inspect.hello(c.hostname, our_revoked._data) or {}
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/app/control.py", line 132, in hello
    return self._request('hello', from_node=from_node, revoked=revoked)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/app/control.py", line 84, in _request
    timeout=self.timeout, reply=True,
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/app/control.py", line 439, in broadcast
    limit, callback, channel=channel,
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/pidbox.py", line 315, in _broadcast
    serializer=serializer)
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/pidbox.py", line 290, in _publish
    serializer=serializer,
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py", line 181, in publish
    exchange_name, declare,
  File "/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py", line 203, in _publish
    mandatory=mandatory, immediate=immediate,
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/channel.py", line 1732, in _basic_publish
    (0, exchange, routing_key, mandatory, immediate), msg
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/abstract_channel.py", line 50, in send_method
    conn.frame_writer(1, self.channel_id, sig, args, content)
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/method_framing.py", line 166, in write_frame
    write(view[:offset])
  File "/app/.heroku/python/lib/python3.6/site-packages/amqp/transport.py", line 275, in write
    self._write(s)
ConnectionResetError: [Errno 104] Connection reset by peer
[2018-06-25 02:46:10,375: INFO/MainProcess] Connected to amqp://***:**@***:***/***
[2018-06-25 02:46:10,526: INFO/MainProcess] mingle: searching for neighbors
[2018-06-25 02:46:11,764: INFO/MainProcess] mingle: all alone

When I checked the logs, I found a reconnection log entry at the time of one memory leak, but there was also a case where a memory leak started when no reconnection occurred.
I agree with jxltom's idea.

Also, when I was using Celery 3.x, I did not encounter such a problem.

same problem here
screenshot 2018-06-25 11 09 22
Every few days I have to restart workers due to this problem.
There are no significant clues in the logs, but I suspect reconnects play a role, since I have reconnect log entries around the time memory starts constantly growing.
My conf is Ubuntu 17, 1 server, 1 worker with concurrency 3; rabbit as broker and redis as backend; all packages are the latest versions.

@marvelph @dmitry-kostin could you please provide your exact configuration (omitting sensitive information of course) and possibly a task, or sample, that reproduces the issue? Also, do you have any estimate of the average uptime interval that the worker memory increase starts appearing?

The config is close to default:

imports = ('app.tasks',)
result_persistent = True
task_ignore_result = False
task_acks_late = True
worker_concurrency = 3
worker_prefetch_multiplier = 4
enable_utc = True
timezone = 'Europe/Moscow'
broker_transport_options = {'visibility_timeout': 3600, 'confirm_publish': True, 'fanout_prefix': True, 'fanout_patterns': True}

screenshot 2018-06-25 11 35 17
Basically this is a freshly deployed node; it was deployed on 06/21 at 18:50, started to grow on 06/23 around 05:00, and finally crashed on 06/23 around 23:00.

The task is pretty simple and there is no super-logic in it. I think I can reproduce the whole situation in a clean temp project, but I have no free time for now; if I'm lucky I will try to do a full example on the weekend.

UPD
As you can see, the task itself consumes some memory (visible as spikes on the graph), but at the time the memory started to leak there were no tasks produced or any other activity.

@marvelph @dmitry-kostin @jxltom I noticed you use Python3. Would you mind enabling tracemalloc for the process? You may need to patch the worker process though to log memory allocation traces, let me know if you need help with that.

@georgepsarakis You mean enabling tracemalloc in the worker and logging stats, such as the top 10 files by memory usage, at a specific interval such as 5 minutes?

@jxltom I think something like that would help locate the part of code that is responsible. What do you think?
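
A rough sketch of such a patch might look like the following; this is not an official Celery feature, and the worker_ready signal, the 5-minute interval, and the frame depth are assumptions to adjust for your setup.

```python
# Sketch: start tracemalloc in the worker's MainProcess and periodically log
# the ten biggest allocation diffs. Interval and frame depth are arbitrary.
import threading
import time
import tracemalloc

from celery.signals import worker_ready
from celery.utils.log import get_logger

logger = get_logger(__name__)
INTERVAL = 300  # seconds between snapshots


def _log_top_allocations():
    tracemalloc.start(25)
    previous = tracemalloc.take_snapshot()
    while True:
        time.sleep(INTERVAL)
        current = tracemalloc.take_snapshot()
        for stat in current.compare_to(previous, 'lineno')[:10]:
            logger.warning('tracemalloc: %s', stat)
        previous = current


@worker_ready.connect
def start_memory_tracing(**kwargs):
    # Daemon thread so it never blocks worker shutdown.
    threading.Thread(target=_log_top_allocations, daemon=True).start()
```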

@georgepsarakis I've tried using gdb and https://github.com/lmacken/pyrasite to inject into the leaking process and start debugging via tracemalloc. Here are the top 10 files with the highest memory usage.

I use resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 and the memory usage is gradually increasing indeed.

>>> import tracemalloc
>>> 
>>> tracemalloc.start()
>>> snapshot = tracemalloc.take_snapshot()
>>> top_stats = snapshot.statistics('lineno')
>>> for stat in top_stats[:10]:
...     print(stat)
... 
/app/.heroku/python/lib/python3.6/site-packages/kombu/utils/eventio.py:84: size=12.0 KiB, count=1, average=12.0 KiB
/app/.heroku/python/lib/python3.6/site-packages/celery/worker/heartbeat.py:47: size=3520 B, count=8, average=440 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/method_framing.py:166: size=3264 B, count=12, average=272 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:142: size=3060 B, count=10, average=306 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:157: size=2912 B, count=8, average=364 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/abstract_channel.py:50: size=2912 B, count=8, average=364 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py:181: size=2816 B, count=12, average=235 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py:203: size=2816 B, count=8, average=352 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:199: size=2672 B, count=6, average=445 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/channel.py:1734: size=2592 B, count=8, average=324 B

Here is the difference between two snapshots after around 5 minutes.

>>> snapshot2 = tracemalloc.take_snapshot()
>>> top_stats = snapshot2.compare_to(snapshot, 'lineno')
>>> print("[ Top 10 differences ]")
[ Top 10 differences ]

>>> for stat in top_stats[:10]:
...     print(stat)
... 
/app/.heroku/python/lib/python3.6/site-packages/celery/worker/heartbeat.py:47: size=220 KiB (+216 KiB), count=513 (+505), average=439 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:142: size=211 KiB (+208 KiB), count=758 (+748), average=285 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/method_framing.py:166: size=210 KiB (+206 KiB), count=789 (+777), average=272 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:157: size=190 KiB (+187 KiB), count=530 (+522), average=366 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/abstract_channel.py:50: size=186 KiB (+183 KiB), count=524 (+516), average=363 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:199: size=185 KiB (+182 KiB), count=490 (+484), average=386 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py:203: size=182 KiB (+179 KiB), count=528 (+520), average=353 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py:181: size=179 KiB (+176 KiB), count=786 (+774), average=233 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/channel.py:1734: size=165 KiB (+163 KiB), count=525 (+517), average=323 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/async/hub.py:293: size=157 KiB (+155 KiB), count=255 (+251), average=632 B

Any suggestions for how to continue debugging this? I have no clue how to proceed. Thanks.

@georgepsarakis

I need a little time to extract a project for reproduction.

Here is the Celery configuration.

BROKER_URL = [
    'amqp://xxxxxxxx:[email protected]:5672/zzzzzzzz'
]
BROKER_TRANSPORT_OPTIONS = {}

The scheduler has the following settings.

CELERYBEAT_SCHEDULE = {
    'aaaaaaaa_bbbbbbbb': {
        'task': 'aaaa.bbbbbbbb_cccccccc',
        'schedule': celery.schedules.crontab(minute=0),
    },
    'dddddddd_eeeeeeee': {
        'task': 'dddd.eeeeeeee_ffffffff',
        'schedule': celery.schedules.crontab(minute=0),
    },
}

On EC2, I am using supervisord to operate it.

@georgepsarakis
Since my test environment can tolerate performance degradation, tracemalloc can be used there.
Can you make a patched Celery that dumps memory usage?

@jxltom I bet tracemalloc with 5 minutes won't help locate the problem.
For example, I have 5 nodes and only 3 of them had this problem over the last 4 days, while 2 worked fine all this time, so it will be very tricky to locate the problem...
I feel like there is some toggle that switches on and then memory starts to grow; until that switch, memory consumption looks fine.

I tried to find out whether similar problems occurred in other running systems.
The frequency of occurrence varies, but a memory leak has occurred on three systems using Celery 4.x, and it has not happened on one system.
The systems that have the memory leak run Python 3.5.x, and the system with no memory leak runs Python 2.7.x.

@dmitry-kostin What's the difference from the other two normal nodes? Are they both using the same rabbitmq as broker?

Since our discussion mentioned it may be related to rabbitmq, I started another new node with the same configuration except using redis instead. So far, this node has no memory leak after running for 24 hours. I will post here if it leaks later.

@marvelph So do you mean that the three systems with the memory leak are using Python 3 while the one which is fine is using Python 2?

@jxltom no difference at all, and yes they are on Python 3 & rabbit as broker and redis as backend.
I made a test example to reproduce this; if it succeeds in a couple of days I will give credentials for these servers to somebody who knows how to locate this bug.

@jxltom
Yes.
As far as my environment is concerned, problems do not occur in Python 2.

I tracked the memory leak via tracemalloc over a longer period.

The starting memory usage reported by the resource module is 146.58 MB; after 3.5 hours, it reports 224.21 MB.

The following is the snapshot difference reported by tracemalloc:

>>> snapshot2 = tracemalloc.take_snapshot(); top_stats = snapshot2.compare_to(snapshot, 'lineno')
>>> for stat in top_stats[:10]:
...     print(stat)
... 
/app/.heroku/python/lib/python3.6/site-packages/celery/worker/heartbeat.py:47: size=3619 KiB (+3614 KiB), count=8436 (+8426), average=439 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:142: size=3470 KiB (+3466 KiB), count=12529 (+12514), average=284 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/method_framing.py:166: size=3418 KiB (+3414 KiB), count=12920 (+12905), average=271 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:157: size=3149 KiB (+3145 KiB), count=8762 (+8752), average=368 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/abstract_channel.py:50: size=3099 KiB (+3096 KiB), count=8685 (+8676), average=365 B
/app/.heroku/python/lib/python3.6/site-packages/celery/events/dispatcher.py:199: size=3077 KiB (+3074 KiB), count=8354 (+8345), average=377 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py:203: size=3020 KiB (+3017 KiB), count=8723 (+8713), average=355 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/messaging.py:181: size=2962 KiB (+2959 KiB), count=12952 (+12937), average=234 B
/app/.heroku/python/lib/python3.6/site-packages/amqp/channel.py:1734: size=2722 KiB (+2718 KiB), count=8623 (+8613), average=323 B
/app/.heroku/python/lib/python3.6/site-packages/kombu/async/hub.py:293: size=2588 KiB (+2585 KiB), count=4193 (+4188), average=632 B

Any ideas? It looks like it is not a single file that is leaking.

I also imported gc, and gc.collect() returns 0...

@georgepsarakis I was able to reproduce this, ping me for access creds

Update: I have switched the broker from rabbitmq to redis by updating the broker URL environment variable, keeping the Docker image and code completely the same. It has been running for 4 days and there is no memory leak.

So I believe this issue is related to the rabbitmq broker.

If possible please try running the benchmark command, mentioned here: https://github.com/celery/celery/issues/2927#issuecomment-171455414

This system runs workers on 20 servers.
A memory leak occurred yesterday, and it is occurring on almost all servers at the same time.
[memory usage graph]

Don't know if it's related, leaving it here in case it helps.

I have a different issue with celery and rabbitmq (celery loses connection and starts reconnecting loads of times per second, cpu goes 100% on 1 core, beat can't send new tasks, need to restart celery).

The reason I am reporting this here is the trigger: after days of monitoring I think I located the start of the issue and it appears to be rabbitmq moving some messages from memory to disk. At that time celery starts trying to reconnect as fast as it can and rabbitmq logs show tens of connection/disconnection operations per second, in batches of ~10 or so at a time. Restarting rabbitmq doesn't fix the issue, restarting celery fixes it right away. I do not have a proper fix but as an example, setting an expire policy allowing messages to always stay in memory works around the issue and I haven't seen it since.

Given that some details of this issue match what I saw (swapping rabbitmq with redis fixes it, there's not a clear starting point, it happens on more than one worker/server at the same time), I guess there might be a common trigger, and it might be the same one I spotted.

The test suite has moved from https://github.com/celery/celery/tree/master/funtests/stress to https://github.com/celery/cyanide, and it only supports Python 2.

So I ran it under Python 2 with rabbitmq as the broker. It raised !join: connection lost: error(104, 'Connection reset by peer'). Is this related to the memory leak issue?

Here is the log for the test suite.

➜  cyanide git:(master) pipenv run python -m cyanide.bin.cyanide
Loading .env environment variables…
Cyanide v1.3.0 [celery 4.2.0 (windowlicker)]

Linux-4.13.0-45-generic-x86_64-with-debian-stretch-sid

[config]
.> app:    cyanide:0x7fb097f31710
.> broker: amqp://**:**@**:**/cyanide
.> suite: cyanide.suites.default:Default

[toc: 12 tests total]
.> 1) manyshort,
.> 2) always_timeout,
.> 3) termbysig,
.> 4) timelimits,
.> 5) timelimits_soft,
.> 6) alwayskilled,
.> 7) alwaysexits,
.> 8) bigtasksbigvalue,
.> 9) bigtasks,
.> 10) smalltasks,
.> 11) revoketermfast,
.> 12) revoketermslow

+enable worker task events...
+suite start (repetition 1)
[[[manyshort(50)]]]
 1: manyshort                            OK (1/50) rep#1 runtime: 15.00 seconds/15.01 seconds
 1: manyshort                            OK (2/50) rep#1 runtime: 13.16 seconds/28.17 seconds
 1: manyshort                            OK (3/50) rep#1 runtime: 13.29 seconds/41.46 seconds
 1: manyshort                            OK (4/50) rep#1 runtime: 13.70 seconds/55.16 seconds
 1: manyshort                            OK (5/50) rep#1 runtime: 13.77 seconds/1.15 minutes
 1: manyshort                            OK (6/50) rep#1 runtime: 13.91 seconds/1.38 minutes
!join: connection lost: error(104, 'Connection reset by peer')
!Still waiting for 1000/1000: [1624cd7a-3cc0-474a-b957-b0484f6b4937, 2b436525-73de-4062-bd6b-924cbd11ba74, ce04cb5e-a99e-41e2-95dc-e9bc351e606d,...]: TimeoutError(u'The operation timed out.',)
!Still waiting for 1000/1000: [1624cd7a-3cc0-474a-b957-b0484f6b4937, 2b436525-73de-4062-bd6b-924cbd11ba74, ce04cb5e-a99e-41e2-95dc-e9bc351e606d,...]: TimeoutError(u'The operation timed out.',)
!Still waiting for 475/1000: [1624cd7a-3cc0-474a-b957-b0484f6b4937, 2b436525-73de-4062-bd6b-924cbd11ba74, ce04cb5e-a99e-41e2-95dc-e9bc351e606d,...]: TimeoutError(u'The operation timed out.',)
!Still waiting for 1000/1000: [1624cd7a-3cc0-474a-b957-b0484f6b4937, 2b436525-73de-4062-bd6b-924cbd11ba74, ce04cb5e-a99e-41e2-95dc-e9bc351e606d,...]: TimeoutError(u'The operation timed out.',)
!Still waiting for 1000/1000: [1624cd7a-3cc0-474a-b957-b0484f6b4937, 2b436525-73de-4062-bd6b-924cbd11ba74, ce04cb5e-a99e-41e2-95dc-e9bc351e606d,...]: TimeoutError(u'The operation timed out.',)
!Still waiting for 475/1000: [1624cd7a-3cc0-474a-b957-b0484f6b4937, 2b436525-73de-4062-bd6b-924cbd11ba74, ce04cb5e-a99e-41e2-95dc-e9bc351e606d,...]: TimeoutError(u'The operation timed out.',)
!Still waiting for 1000/1000: [1624cd7a-3cc0-474a-b957-b0484f6b4937, 2b436525-73de-4062-bd6b-924cbd11ba74, ce04cb5e-a99e-41e2-95dc-e9bc351e606d,...]: TimeoutError(u'The operation timed out.',)
!Still waiting for 1000/1000: [1624cd7a-3cc0-474a-b957-b0484f6b4937, 2b436525-73de-4062-bd6b-924cbd11ba74, ce04cb5e-a99e-41e2-95dc-e9bc351e606d,...]: TimeoutError(u'The operation timed out.',)
!join: connection lost: error(104, 'Connection reset by peer')
failed after 7 iterations in 3.12 minutes
Traceback (most recent call last):
  File "/home/***/.pyenv/versions/2.7.15/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/home/***/.pyenv/versions/2.7.15/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/***/Documents/Python-Dev/cyanide/cyanide/bin/cyanide.py", line 62, in <module>
    main()
  File "/home/***/Documents/Python-Dev/cyanide/cyanide/bin/cyanide.py", line 58, in main
    return cyanide().execute_from_commandline(argv=argv)
  File "/home/***/.local/share/virtualenvs/cyanide-Vy3PQPTU/lib/python2.7/site-packages/celery/bin/base.py", line 275, in execute_from_commandline
    return self.handle_argv(self.prog_name, argv[1:])
  File "/home/***/.local/share/virtualenvs/cyanide-Vy3PQPTU/lib/python2.7/site-packages/celery/bin/base.py", line 363, in handle_argv
    return self(*args, **options)
  File "/home/***/.local/share/virtualenvs/cyanide-Vy3PQPTU/lib/python2.7/site-packages/celery/bin/base.py", line 238, in __call__
    ret = self.run(*args, **kwargs)
  File "/home/***/Documents/Python-Dev/cyanide/cyanide/bin/cyanide.py", line 20, in run
    return self.run_suite(names, **options)
  File "/home/***/Documents/Python-Dev/cyanide/cyanide/bin/cyanide.py", line 30, in run_suite
    ).run(names, **options)
  File "cyanide/suite.py", line 366, in run
    self.runtest(test, iterations, j + 1, i + 1)
  File "cyanide/suite.py", line 426, in runtest
    self.execute_test(fun)
  File "cyanide/suite.py", line 447, in execute_test
    fun()
  File "cyanide/suites/default.py", line 22, in manyshort
    timeout=10, propagate=True)
  File "cyanide/suite.py", line 246, in join
    raise self.TaskPredicate('Test failed: Missing task results')
cyanide.suite.StopSuite: Test failed: Missing task results

Here is the log for the worker.

➜  cyanide git:(master) pipenv run celery -A cyanide worker -c 1
Loading .env environment variables…

 -------------- celery@** v4.2.0 (windowlicker)
---- **** ----- 
--- * ***  * -- Linux-4.13.0-45-generic-x86_64-with-debian-stretch-sid 2018-07-03 12:59:28
-- * - **** --- 
- ** ---------- [config]
- ** ---------- .> app:         cyanide:0x7fdc988b4e90
- ** ---------- .> transport:   amqp://**:**@**:**/cyanide
- ** ---------- .> results:     rpc://
- *** --- * --- .> concurrency: 1 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** ----- 
 -------------- [queues]
                .> c.stress         exchange=c.stress(direct) key=c.stress


[2018-07-03 12:59:29,883: WARNING/ForkPoolWorker-1] ! Still waiting for 1000/1000: [e6e71bed-8e58-4e7e-96c5-f56b583a37af, 42fd4f97-4ff5-4e0e-b874-89e7b3f0ff22, 3de45eeb-9b89-41bc-8284-95a4c07aa34a,...]: TimeoutError('The operation timed out.',) !
[2018-07-03 12:59:29,886: WARNING/ForkPoolWorker-1] ! Still waiting for 1000/1000: [e6e71bed-8e58-4e7e-96c5-f56b583a37af, 42fd4f97-4ff5-4e0e-b874-89e7b3f0ff22, 3de45eeb-9b89-41bc-8284-95a4c07aa34a,...]: TimeoutError('The operation timed out.',) !
[2018-07-03 12:59:30,964: WARNING/ForkPoolWorker-1] + suite start (repetition 1) +
[2018-07-03 12:59:30,975: WARNING/ForkPoolWorker-1] ---  1: manyshort                             (0/50) rep#1 runtime: 0.0000/0.0000 ---
[2018-07-03 13:01:07,835: WARNING/ForkPoolWorker-1] ! join: connection lost: error(104, 'Connection reset by peer') !
[2018-07-03 13:01:17,918: WARNING/ForkPoolWorker-1] ! Still waiting for 1000/1000: [1624cd7a-3cc0-474a-b957-b0484f6b4937, 2b436525-73de-4062-bd6b-924cbd11ba74, ce04cb5e-a99e-41e2-95dc-e9bc351e606d,...]: TimeoutError(u'The operation timed out.',) !
[2018-07-03 13:01:27,951: WARNING/ForkPoolWorker-1] ! Still waiting for 1000/1000: [1624cd7a-3cc0-474a-b957-b0484f6b4937, 2b436525-73de-4062-bd6b-924cbd11ba74, ce04cb5e-a99e-41e2-95dc-e9bc351e606d,...]: TimeoutError(u'The operation timed out.',) !
[2018-07-03 13:01:38,902: WARNING/ForkPoolWorker-1] ! Still waiting for 475/1000: [1624cd7a-3cc0-474a-b957-b0484f6b4937, 2b436525-73de-4062-bd6b-924cbd11ba74, ce04cb5e-a99e-41e2-95dc-e9bc351e606d,...]: TimeoutError(u'The operation timed out.',) !
[2018-07-03 13:01:48,934: WARNING/ForkPoolWorker-1] ! Still waiting for 1000/1000: [1624cd7a-3cc0-474a-b957-b0484f6b4937, 2b436525-73de-4062-bd6b-924cbd11ba74, ce04cb5e-a99e-41e2-95dc-e9bc351e606d,...]: TimeoutError(u'The operation timed out.',) !
[2018-07-03 13:01:58,961: WARNING/ForkPoolWorker-1] ! Still waiting for 1000/1000: [1624cd7a-3cc0-474a-b957-b0484f6b4937, 2b436525-73de-4062-bd6b-924cbd11ba74, ce04cb5e-a99e-41e2-95dc-e9bc351e606d,...]: TimeoutError(u'The operation timed out.',) !
[2018-07-03 13:02:09,906: WARNING/ForkPoolWorker-1] ! Still waiting for 475/1000: [1624cd7a-3cc0-474a-b957-b0484f6b4937, 2b436525-73de-4062-bd6b-924cbd11ba74, ce04cb5e-a99e-41e2-95dc-e9bc351e606d,...]: TimeoutError(u'The operation timed out.',) !
[2018-07-03 13:02:19,934: WARNING/ForkPoolWorker-1] ! Still waiting for 1000/1000: [1624cd7a-3cc0-474a-b957-b0484f6b4937, 2b436525-73de-4062-bd6b-924cbd11ba74, ce04cb5e-a99e-41e2-95dc-e9bc351e606d,...]: TimeoutError(u'The operation timed out.',) !
[2018-07-03 13:02:29,964: WARNING/ForkPoolWorker-1] ! Still waiting for 1000/1000: [1624cd7a-3cc0-474a-b957-b0484f6b4937, 2b436525-73de-4062-bd6b-924cbd11ba74, ce04cb5e-a99e-41e2-95dc-e9bc351e606d,...]: TimeoutError(u'The operation timed out.',) !
[2018-07-03 13:02:37,900: WARNING/ForkPoolWorker-1] ! join: connection lost: error(104, 'Connection reset by peer') !

I have updated to celery 3.1.25 with the same stress test suite, and everything is fine.

BTW, for everybody looking for a fast fix: replacing rabbit with redis solves the problem, as @jxltom suggested. I have had more than a week of stable work with redis-only now.
So the problem is definitely somewhere near the rabbit<->celery border.

@dieeasy we have experienced the same issue. I assume you are using RPC result backend. If so, try switching to DB result backend and see if that helps. The issue that causes this is: https://github.com/celery/kombu/pull/779 and is explained here: https://github.com/celery/kombu/pull/779#discussion_r134961611
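
For anyone wanting to try that switch, a minimal sketch is below; the PostgreSQL URL is only a placeholder, and the database backend needs SQLAlchemy plus a suitable driver installed.

```python
# Sketch: use the database (SQLAlchemy) result backend instead of rpc://.
from celery import Celery

app = Celery('proj', broker='amqp://guest@localhost//')
app.conf.result_backend = 'db+postgresql://user:password@localhost/celery_results'
```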

I have the same memory leak problem.
Memory
image

Version
Python 3.6.5, Celery 4.2.1, backend: redis, broker: rabbitmq

Config

[celery]
broker_url=amqp://taunt:[email protected]:5672/%2ftaunt
celery_result_backend=redis://xx.xx.xx.xx:6379
# 7days
celery_task_result_expires=604800
celery_task_serializer=msgpack
celery_result_serializer=json
celery_accept_content=json,msgpack
celery_timezone=Asia/Shanghai
celery_enable_utc=True

[cmd]
worker=True
proj=app.worker.celery
log_level=INFO
name=send_im%%h
queue=im
autoscale=10,3
concurrency=10

```python
# -*- coding: utf-8 -*-

from kombu import Queue, Exchange
from oslo_log import log as logging

from app.conf import CONF

LOG = logging.getLogger(__name__)

celery_queues = (
    Queue('im', exchange=Exchange('sender'), routing_key='im'),
    Queue('sms', exchange=Exchange('sender'), routing_key='sms'),
    Queue('mail', exchange=Exchange('sender'), routing_key='mail'),
    Queue('ivr', exchange=Exchange('sender'), routing_key='ivr'),
)

celery_routes = {
    'sender.im': {'queue': 'im', 'routing_key': 'im'},
    'sender.sms': {'queue': 'sms', 'routing_key': 'sms'},
    'sender.mail': {'queue': 'mail', 'routing_key': 'mail'},
    'sender.ivr': {'queue': 'ivr', 'routing_key': 'ivr'},
}

config = {
    'BROKER_URL': CONF.celery.broker_url,
    'CELERY_RESULT_BACKEND': CONF.celery.celery_result_backend,
    'CELERY_TASK_RESULT_EXPIRES': CONF.celery.celery_task_result_expires,
    'CELERY_TASK_SERIALIZER': CONF.celery.celery_task_serializer,
    'CELERY_RESULT_SERIALIZER': CONF.celery.celery_result_serializer,
    'CELERY_ACCEPT_CONTENT': CONF.celery.celery_accept_content.split(','),
    'CELERY_TIMEZONE': CONF.celery.celery_timezone,
    'CELERY_ENABLE_UTC': CONF.celery.celery_enable_utc,
    'CELERY_QUEUES': celery_queues,
    'CELERY_ROUTES': celery_routes,
}
```

**Startup**
```python

def make_command() -> list:
    log_path = f'{CONF.log_dir}{os.sep}{CONF.log_file}'
    command_name = f'{sys.path[0]}{os.sep}celery'
    command = [command_name, 'worker', '-A', CONF.cmd.proj, '-E']
    if CONF.cmd.log_level:
        command.extend(['-l', CONF.cmd.log_level])
    if CONF.cmd.queue:
        command.extend(['-Q', CONF.cmd.queue])
    if CONF.cmd.name:
        command.extend(['-n', CONF.cmd.name])
    # if CONF.cmd.autoscale:
    #     command.extend(['--autoscale', CONF.cmd.autoscale])
    if CONF.cmd.concurrency:
        command.extend(['--concurrency', CONF.cmd.concurrency])
    command.extend(['-f', log_path]) 
    return command


if CONF.cmd.worker:
    LOG.info(make_command())
    entrypoint = celery.start(argv=make_command())
```

I can provide more information if needed.

For what it's worth, I am having this issue and can reproduce it consistently by opening the rabbitmq management console, going to connections, and closing connections with traffic from celery to rabbitmq.

I've tested with celery 4.1 and 4.2 and rabbitmq 3.7.7-1
EDIT: also Python 3.6.5 and Ubuntu 16.04 (AWS EC2 image).

I'm having a memory leak with celery 4.2.1 and the redis broker. The memory grows from 100 MiB to 500 MiB (limited) in 3 hours, and the workers are marked as offline in flower. Both the prefork pool and gevent show the same issue.

@yifeikong this may not be the same issue, but for your case could you please try the solution proposed https://github.com/celery/celery/pull/4839#issuecomment-447739820 ?

@georgepsarakis I'm using Python 3.6.5, so I'm not affected by this bug. I will use tracemalloc to do some research. If it was a celery bug, I'll open a new issue. Thanks

Maybe the same cause as #5047; it seems this bug can lead to different phenomena.

We are facing the same memory leak running Celery 4.2.1, Kombu 4.2.2 and python3.6 with RabbitMQ as broker.

$ celery --app=eventr.celery_app report

software -> celery:4.2.1 (windowlicker) kombu:4.2.2-post1 py:3.6.8
            billiard:3.5.0.5 py-amqp:2.4.0
platform -> system:Linux arch:64bit imp:CPython

I can say we have tried many things that other people mentioned as possible workarounds (redis as broker, using jemalloc, libamqp, monkey patching __del__ on AsyncResult) but we always ended up with leaked memory.

By analysing our log we noticed that we had a lot of messages related to missed heartbeats from gossip.

{"asctime": "2019-01-25 13:40:06,486", "levelname": "INFO", "name": "celery.worker.consumer.gossip", "funcName": "on_node_lost", "lineno": 147, "message": "missed heartbeat from celery@******"}

One last thing that we tried was disabling gossip by running the workers with --without-gossip; surprisingly, disabling gossip had an immediate effect.

You can see it here:
[graph: celery memory usage over 14 days]

Since we deactivated gossip in the two projects running celery workers, memory consumption has improved.

If you pay attention, before we were having similar memory spikes as described here https://github.com/celery/celery/issues/4843#issuecomment-399833781

One thing that I've been trying to fully understand is what the implications of completely disabling gossip are, since it's only described as worker <-> worker communication; if anyone could shed some light on this I would be very grateful.

Hope this helps and thanks for the hard work.

Why was this issue closed?

There is active feedback and interest in this issue, so I am reopening.

Well @georgepsarakis since we diagnosed my leak as not being #4839, and you suspected that it was #4843, I'll flip over to this leak thread at least for now. I'm not sure #4843 is my leak either. According to the initial issue on this thread:

This problem happens at least in Celery 4.1, and it also occurs in Celery 4.2.
Celery is running on Ubuntu 16 and brokers use RabbitMQ.

I'm currently on:

python 2.7.12
Ubuntu 16.04.1 amd64
RabbitMQ 3.7.5

using:

Celery 4.1.1
librabbitmq 2.0.0
amqp 2.4.0
vine 1.1.4
billiard 3.5.0.5
kombu 4.2.2.post1
gevent 1.2.2

However, Celery 4.1.1 + gevent 1.2.2 doesn't leak for me (nor does Celery 3.1.25 + gevent 1.2.2 AFAICT); Celery 4.2.1 + gevent 1.3.7 does. Unfortunately, gevent 1.3.7 and gevent 1.2.2 are not interchangeable to demonstrate (or exclude) a gevent library as a possible source of the problem.

EDIT: Hmm...there seems to be a gevent patch (022f447dd) that looks like it could fix the error I encountered. I'll try and get that to work.

I applied 022f447 to Celery 4.1.1 and installed gevent 1.3.7. That Celery + gevent combination ran...and produced memory usage patterns consistent with the leak I've been experiencing. I'll install Celery 4.2.1 + gevent 1.2.2 (with the reverse patch) and see if I get the usual memory usage pattern.

I notice gevent 1.4.0 is out. Maybe I should give that a whirl as well to see how that behaves.

Celery 4.2.1 + gevent 1.2.2 + the reverse patch for gevent 1.2.2 doesn't seem to produce the leak the way Celery 4.2.1 + gevent 1.3.7 does.

Celery 4.2.1 + gevent 1.4.0 does seem to leak at approximately the same rate as gevent 1.3.7 AFAICT.

https://github.com/celery/celery/blob/9f0a554dc2d28c630caf9d192873d040043b7346/celery/events/dispatcher.py

    def _publish(self, event, producer, routing_key, retry=False,
                 retry_policy=None, utcoffset=utcoffset):
        exchange = self.exchange
        try:
            producer.publish(...)
        except Exception as exc:  # pylint: disable=broad-except
            if not self.buffer_while_offline:  # <-- False by default
                raise
            self._outbound_buffer.append((event, routing_key, exc))  # <---- Always buffered

    def send(self, type, blind=False, utcoffset=utcoffset, retry=False,
            ...
            if group in self.buffer_group:   # <--- Never true for eventlet & gevent
                ...
                if len(buf) >= self.buffer_limit:
                    self.flush()     #  <---- Never flushed even when grows above limit
                ...
            else:
                return self.publish(type, fields, self.producer, blind=blind,
                                    Event=Event, retry=retry,

https://github.com/celery/celery/blob/b2668607c909c61becd151905b4525190c19ff4a/celery/worker/consumer/events.py

    def start(self, c):
        # flush events sent while connection was down.
        prev = self._close(c)
        dis = c.event_dispatcher = c.app.events.Dispatcher(
            ...
            # we currently only buffer events when the event loop is enabled
            # XXX This excludes eventlet/gevent, which should actually buffer.
            buffer_group=['task'] if c.hub else None,
            on_send_buffered=c.on_send_event_buffered if c.hub else None,
        )
        if prev:
            dis.extend_buffer(prev)
            dis.flush()    # <---- The only (!) chance to flush on [g]event[let] is on reconnect.

Now, if I understand correctly what AMQP does under the hood, it has its own heartbeat, and when it detects a broken connection it goes ahead and reconnects under the hood. Depending on the types of events that are enabled (gossip, heartbeat), this can leak pretty fast.
This should be true for any version of eventlet & gevent, but some could exhibit connection issues that make things worse/more noticeable.
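
To make that failure mode concrete, here is a small self-contained toy model (not Celery's actual classes) of an unbounded buffer that is appended to on every failed publish and never flushed, which is roughly what the annotated snippets above describe:

```python
# Toy model of the buffering behaviour described above: every failed publish
# appends to an unbounded deque, and with the eventlet/gevent pools nothing
# ever calls flush(), so the buffer only grows.
from collections import deque


class ToyDispatcher:
    buffer_while_offline = True

    def __init__(self):
        self._outbound_buffer = deque()

    def publish(self, event):
        # Simulate a broker connection that is down.
        raise ConnectionResetError(104, 'Connection reset by peer')

    def send(self, event):
        try:
            self.publish(event)
        except Exception as exc:
            if not self.buffer_while_offline:
                raise
            self._outbound_buffer.append((event, exc))  # never flushed


dispatcher = ToyDispatcher()
for i in range(10000):  # e.g. heartbeat/gossip events while the broker is down
    dispatcher.send({'seq': i})
print(len(dispatcher._outbound_buffer))  # 10000 and still growing
```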

Hi,

I suspect that we are having the same issue.
Our configuration is below. Can I either confirm or rule out that this is the same issue discussed here?

Python: 2.7
Celery: 4.2.1
OS: CentOS release 6.10
Redis as broker

In the attached image you can see:

  1. Memory consumption increasing constantly and dropping on restart.
  2. On January 13 we upgraded from celery 3.1.25 to 4.2.1. The pace of memory consumption growth increased.

image

UPDATE

Regardless of this issue, we upgraded to Python 3.6, and since then it seems like the leak does not happen anymore.

image
(the upgrade was on February 19)

@georgepsarakis

Not sure how relevant this is, but I'm having my 2GB of SWAP space exhausted by celery in production. Stopping Flower didn't clear the memory, but stopping Celery did.

could anyone try celery 4.3rc1?

@auvipy I installed Celery 4.3.0rc1 + gevent 1.4.0. pip upgraded billiard to 3.6.0.0 and kombu to 4.3.0.

Kind of puzzled that vine 1.2.0 wasn't also required by the rc1 package, given that #4839 is fixed by that upgrade.

Anyway, Celery 4.3.0 rc1 seems to run OK.

@ldav1s thanks a lot for the feedback. So, vine is declared as a dependency in py-amqp actually. In new installations the latest vine version will be installed but this might not happen in existing ones.

@thedrow perhaps we should declare the dependency in Celery requirements too?

Let's open an issue about it and discuss it there.

Celery 4.3.0rc1 + gevent 1.4.0 has been running a couple of days now. Looks like it's leaking in the same fashion as Celery 4.2.1 + gevent 1.4.0.

image

Having the same leak with celery 4.2.1, python 3.6

Any updates on this?

having same problem here

Greetings,

I'm experiencing a similar issue, but I'm not sure it is the same.

After I migrated our celery app to a different environment/network, celery workers started to leak. Previously the celery application and the rabbitmq instance were in the same environment/network.

My configuration is on Python 3.6.5:

amqp (2.4.2)
billiard (3.5.0.5)
celery (4.1.1)
eventlet (0.22.0)
greenlet (0.4.15)
kombu (4.2.1)
vine (1.3.0)

celeryconfig

broker_url = rabbitmq
result_backend = mongodb
task_acks_late = True
result_expires = 0
task_default_rate_limit = 2000
task_soft_time_limit = 120
task_reject_on_worker_lost = True
loglevel = 'INFO'
worker_pool_restarts = True
broker_heartbeat = 0
broker_pool_limit = None

The application is composed of several workers with the eventlet pool, started via commands in supervisord:

[program:worker1]
command={{ celery_path }} worker -A celery_app --workdir {{ env_path }} -l info -E -P eventlet -c 250 -n worker1@{{ hostname }} -Q queue1,queue2

The memory leak behaviour looks like this: every ~10 hours, usually 1 worker (max 2) starts leaking:
image

So I created a broadcast message to be executed on each worker in order to use tracemalloc. This is the result of the top command on the machine; there is only 1 worker leaking, at 1464m:

217m   1%   2   0% /usr/bin/python3 -m celery worker -A celery_app --workdir   379
189m   1%   0   0% /usr/bin/python3 -m celery worker -A celery_app --workdir   377     
1464m   9%   1   0% /usr/bin/python3 -m celery worker -A celery_app --workdir   378
218m   1%   0   0% /usr/bin/python3 -m celery worker -A celery_app --workdir   376 
217m   1%   2   0% /usr/bin/python3 -m celery worker -A celery_app --workdir   375
217m   1%   3   0% /usr/bin/python3 -m celery worker -A celery_app --workdir   394
163m   1%   0   0% /usr/bin/python3 -m celery beat -A celery_app --workdir /app

tracemalloc TOP 10 results on the leaking worker

[2019-03-29 07:18:03,809: WARNING/MainProcess] [ Top 10: worker5@hostname ]

[2019-03-29 07:18:03,809: WARNING/MainProcess] /usr/lib/python3.6/site-packages/eventlet/greenio/base.py:207: size=17.7 MiB, count=26389, average=702 B

[2019-03-29 07:18:03,810: WARNING/MainProcess] /usr/lib/python3.6/site-packages/kombu/messaging.py:203: size=16.3 MiB, count=44422, average=385 B

[2019-03-29 07:18:03,811: WARNING/MainProcess] /usr/lib/python3.6/site-packages/celery/worker/heartbeat.py:49: size=15.7 MiB, count=39431, average=418 B

[2019-03-29 07:18:03,812: WARNING/MainProcess] /usr/lib/python3.6/site-packages/celery/events/dispatcher.py:156: size=13.0 MiB, count=40760, average=334 B

[2019-03-29 07:18:03,812: WARNING/MainProcess] /usr/lib/python3.6/site-packages/eventlet/greenio/base.py:363: size=12.9 MiB, count=19507, average=695 B

[2019-03-29 07:18:03,813: WARNING/MainProcess] /usr/lib/python3.6/site-packages/amqp/transport.py:256: size=12.7 MiB, count=40443, average=328 B

[2019-03-29 07:18:03,814: WARNING/MainProcess] /usr/lib/python3.6/site-packages/celery/events/dispatcher.py:138: size=12.4 MiB, count=24189, average=539 B

[2019-03-29 07:18:03,814: WARNING/MainProcess] /usr/lib/python3.6/site-packages/amqp/transport.py:256: size=12.3 MiB, count=19771, average=655 B

[2019-03-29 07:18:03,815: WARNING/MainProcess] /usr/lib/python3.6/site-packages/amqp/connection.py:505: size=11.9 MiB, count=39514, average=317 B

[2019-03-29 07:18:03,816: WARNING/MainProcess] /usr/lib/python3.6/site-packages/kombu/messaging.py:181: size=11.8 MiB, count=61362, average=201 B

TOP 1 with 25 frames

TOP 1

[2019-03-29 07:33:05,787: WARNING/MainProcess] [ TOP 1: worker5@hostname ]

[2019-03-29 07:33:05,787: WARNING/MainProcess] 26938 memory blocks: 18457.2 KiB

[2019-03-29 07:33:05,788: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/eventlet/greenio/base.py", line 207

[2019-03-29 07:33:05,788: WARNING/MainProcess] mark_as_closed=self._mark_as_closed)

[2019-03-29 07:33:05,789: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/eventlet/greenio/base.py", line 328

[2019-03-29 07:33:05,789: WARNING/MainProcess] timeout_exc=socket_timeout('timed out'))

[2019-03-29 07:33:05,790: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/eventlet/greenio/base.py", line 357

[2019-03-29 07:33:05,790: WARNING/MainProcess] self._read_trampoline()

[2019-03-29 07:33:05,790: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/eventlet/greenio/base.py", line 363

[2019-03-29 07:33:05,791: WARNING/MainProcess] return self._recv_loop(self.fd.recv, b'', bufsize, flags)

[2019-03-29 07:33:05,791: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/amqp/transport.py", line 440

[2019-03-29 07:33:05,791: WARNING/MainProcess] s = recv(n - len(rbuf))

[2019-03-29 07:33:05,792: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/amqp/transport.py", line 256

[2019-03-29 07:33:05,792: WARNING/MainProcess] frame_header = read(7, True)

[2019-03-29 07:33:05,792: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/amqp/connection.py", line 505

[2019-03-29 07:33:05,793: WARNING/MainProcess] frame = self.transport.read_frame()

[2019-03-29 07:33:05,793: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/amqp/connection.py", line 500

[2019-03-29 07:33:05,793: WARNING/MainProcess] while not self.blocking_read(timeout):

[2019-03-29 07:33:05,793: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/kombu/transport/pyamqp.py", line 103

[2019-03-29 07:33:05,794: WARNING/MainProcess] return connection.drain_events(**kwargs)

[2019-03-29 07:33:05,794: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/kombu/connection.py", line 301

[2019-03-29 07:33:05,794: WARNING/MainProcess] return self.transport.drain_events(self.connection, **kwargs)

[2019-03-29 07:33:05,795: WARNING/MainProcess] File "/usr/lib/python3.6/site-packages/celery/worker/pidbox.py", line 120

[2019-03-29 07:33:05,795: WARNING/MainProcess] connection.drain_events(timeout=1.0)

I hope this helps. There are no errors in the logs, other than the missed heartbeats between the workers. Now I'm trying to use the exact versions of the libs we were using in the old env.

UPDATE: Using the exact same dependency versions and a broker heartbeat every 5 minutes, the application looked stable for a longer time: more than 2 days, then it leaked again.

There were small spikes lasting ~1 hour from time to time, but they were "absorbed/collected"... the last one looks like it started the ramp.

After the 1st spike (1st ramp), I restarted the leaking worker... as you can see, another worker started to leak after it, or probably it was already leaking (2nd ramp).

image

I'm going to test without heartbeat.

UPDATE: without heartbeat it leaked again after 2 days, same behaviour.

440m   3%   1   0% /usr/bin/python3 -m celery worker -A celery_app --without-heartbeat --workdir /app -l info -E -P eventlet -c 250 -Ofair -n worker1@ -Q p_1_queue,p_2_queue
176m   1%   0   0% /usr/bin/python3 -m celery worker -A celery_app --without-heartbeat --workdir /app -l info -E -P eventlet -c 250 -Ofair -n worker2@ -Q p_1_queue,p_2_queue
176m   1%   2   0% /usr/bin/python3 -m celery worker -A celery_app --without-heartbeat --workdir /app -l info -E -P eventlet -c 250 -Ofair -n worker5@ -Q p_1_queue,p_2_queue
176m   1%   1   0% /usr/bin/python3 -m celery worker -A celery_app --without-heartbeat --workdir /app -l info -E -P eventlet -c 250 -Ofair -n worker3@ -Q p_1_queue,p_2_queue
176m   1%   1   0% /usr/bin/python3 -m celery worker -A celery_app --without-heartbeat --workdir /app -l info -E -P eventlet -c 250 -Ofair -n worker4@ -Q p_1_queue,p_2_queue
171m   1%   1   0% /usr/bin/python3 -m celery worker -A celery_app --without-heartbeat --workdir /app -l info -E -P eventlet -c 20 -n worker_p_root@ -Q p_root_queue
157m   1%   0   0% /usr/bin/python3 -m celery beat -A celery_app --workdir /app --schedule /app/beat.db -l info

image

UPDATE:
Using celery 4.3.0, it seems the problem is resolved and it has been stable for a week.
image

amqp (2.4.2)
billiard (3.6.0.0)
celery (4.3.0)
eventlet (0.24.1)
greenlet (0.4.15)
kombu (4.5.0)
vine (1.3.0)

Please let me know if I can help somehow, e.g. by instrumenting the code. If necessary, please provide links and an example.

Thank you

I'm also having a memory leak. It looks like I've managed to find the cause.
https://github.com/celery/celery/blob/master/celery/events/dispatcher.py#L75
I can see that this buffer starts to grow after connection issues with rabbit. I don't understand why it fails to clear events eventually; it continues to grow over time and consumes more and more RAM. Passing buffer_while_offline=False here https://github.com/celery/celery/blob/master/celery/worker/consumer/events.py#L43 seems to fix the leak for me. Can someone please check if this is related?
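
For anyone who wants to try this without patching Celery's source, a hedged workaround sketch is to force the keyword when the dispatcher is constructed; this assumes your Celery version's EventDispatcher.__init__ still accepts a buffer_while_offline keyword, so verify before using it.

```python
# Workaround sketch (not an official setting): force every EventDispatcher to
# be created with buffer_while_offline=False so failed event publishes are not
# buffered forever. Assumes the keyword exists in your Celery version.
import functools

from celery.events.dispatcher import EventDispatcher

_original_init = EventDispatcher.__init__


@functools.wraps(_original_init)
def _patched_init(self, *args, **kwargs):
    kwargs['buffer_while_offline'] = False
    _original_init(self, *args, **kwargs)


EventDispatcher.__init__ = _patched_init
```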

@yevhen-m thank you a lot! That helped us solve the memory leak!

It's good that we have a workaround, but can we please find a proper fix?

Continuing to follow this memory leak issue.

image

[screenshot: celery pod memory usage over the last week]

I'm using celery in a production environment, and I'm deploying it via docker.
As the screenshot shows, we are having the same problem.
Our production config is shown below.

Docker parent image: python 3.6.8-buster
Celery version: 4.2.0
Command Options:

  • concurrency 4
  • prefetch-multiplier 8
  • No result_backend
  • acks_late and reject_on_worker_lost

I wonder if upgrading celery's version to 4.3.0 solves the memory leak issue.

Thank you!

celery 4.4.0 is the latest stable

Team, is there any update about this issue? Was this addressed and fixed in celery 4.4.0?

Team, is there any update about this issue? Was this addressed and fixed in celery 4.4.0?

Unfortunately, no. It has now been addressed, though.

Team, is there any update about this issue? Was this addressed and fixed in celery 4.4.0?

it will be available at 4.4.1

it will be available at 4.4.1

is it fixed in current version 4.4.1?

@auvipy The problem is still present in Celery 4.4.2 and 4.4.6. We see the same memory leaks across all workers.

BROKER_POOL_LIMIT = None
CELERY_ACKS_LATE = False
CELERY_TRACK_STARTED = True
CELERYD_MAX_TASKS_PER_CHILD = 1
CELERYD_PREFETCH_MULTIPLIER = 1
BROKER_TRANSPORT_OPTIONS = {
    'fanout_prefix': True,
    'fanout_patterns': True,
    'visibility_timeout': 43200,
    'health_check_interval': 180,
    'socket_keepalive': True,
    'retry_on_timeout': True,
}

The Celery worker is started with the -O fair --without-heartbeat --without-gossip -c 1 -l flags. We also use the -n and -Q flags to set the worker name and queues. Running in prefork mode. Redis is configured as both broker and result store.

image

~We see many missed heartbeats on long running tasks. So the problem reported in linked issues still persists.~

It's the same with disabled heartbeats.

@jsynowiec When I faced this issue the only thing that worked for me was running the workers with gossip disabled, I mentioned something about it here https://github.com/celery/celery/issues/4843#issuecomment-459789086

We are experiencing the same issue with celery 4.4.2 and redis as a broker. Over the timespan of 48 hours celery consumes up to 60 GB of RAM until finally running out of memory.
None of the solutions named here had any effect to this behaviour.

We are experiencing the same issue with celery 4.4.2 and redis as a broker. Over the timespan of 48 hours celery consumes up to 60 GB of RAM until finally running out of memory.
None of the solutions named here had any effect to this behaviour.

Did you try our latest patch version?
Do you have the same conditions as the OP?

Memory leaks are still present on v4.4.6. We run workers with the settings listed in an earlier comment. OP uses RabbitMQ; we use Redis as a broker.

image

+1, noticing memory usage gradually increase over 24 hours even with minimal work being done. I think this issue should be re-opened.

can you profile and find out the root of your memory leak?

Memory leaks are still present on v4.4.6. We run workers with the settings listed in an earlier comment. OP uses RabbitMQ; we use Redis as a broker.

image

It seems like this is a different issue or that our fix wasn't correct.
Since this solved the OP's problem it is probably a different issue, right?

[2020-07-31 10:51:53,176: WARNING/MainProcess] /usr/local/lib/python3.8/site-packages/redis/client.py:90: size=19.2 KiB (+19.2 KiB), count=180 (+180), average=109 B
[2020-07-31 10:53:53,271: WARNING/MainProcess] /usr/local/lib/python3.8/site-packages/redis/client.py:90: size=230 KiB (+211 KiB), count=2160 (+1980), average=109 B
[2020-07-31 10:54:53,364: WARNING/MainProcess] /usr/local/lib/python3.8/site-packages/redis/client.py:90: size=250 KiB (+19.2 KiB), count=2340 (+180), average=109 B

....

[2020-07-31 12:24:10,633: WARNING/MainProcess] /usr/local/lib/python3.8/site-packages/redis/client.py:90: size=49.9 MiB (+76.8 KiB), count=478620 (+720), average=109 B
[2020-07-31 12:25:14,528: WARNING/MainProcess] /usr/local/lib/python3.8/site-packages/redis/client.py:90: size=49.9 MiB (+19.2 KiB), count=478800 (+180), average=109 B
[2020-07-31 12:27:22,346: WARNING/MainProcess] /usr/local/lib/python3.8/site-packages/redis/client.py:90: size=49.9 MiB (+57.6 KiB), count=479340 (+540), average=109 B
[2020-07-31 12:28:26,265: WARNING/MainProcess] /usr/local/lib/python3.8/site-packages/redis/client.py:90: size=50.2 MiB (+269 KiB), count=481860 (+2520), average=109 B

CELERY_RESULT_BACKEND = False
CELERY_IGNORE_RESULT = True
CELERY_MAX_TASKS_PER_CHILD = 1
CELERY_WORKER_PREFETCH_MULTIPLIER = 1
CELERY_TASK_RESULT_EXPIRES = 10
CELERY_BROKER_POOL_LIMIT = 70
CELERY_REDIS_MAX_CONNECTIONS = 100
app.conf.broker_transport_options = {'visibility_timeout': 43200}

celery -A proj worker --concurrency=70 --prefetch-multiplier=1 -Ofair --pool=gevent -n --without-gossip --without-mingle

Redis client leaking memory? I'm using celery v4.4.6 with gevent, redis as broker and no result backend.

Maybe that's an issue too. Maybe it's in gevent?
CC @jamadden @andymccurdy
Can you please help us put this issue to rest and ensure no memory is leaking on your end?

Maybe it's in gevent?

We're not using gevent. Workers are started with concurrency=1 and prefork.

Hi guys, not sure why this issue is closed. We have been having this issue for 2 years now, updating to the latest version of Celery every time, and still having big servers (64-128 GB of RAM) constantly running out of RAM because of these memory leak issues.

Is there any workaround without downgrading to Celery 3 or replacing RabbitMQ?

This makes Celery completely unstable in production environments. I hope it can be fixed; we can't downgrade to Celery 3, so we are planning on moving to another solution (maybe Dramatiq) in order to stop worrying about Celery eating the whole server's RAM in production every 2 days.

@arielcamino - I've been using the setting worker_max_tasks_per_child to replace worker instances after ~100 tasks, which has helped maintain memory usage, at least for my servers. I'm running tiny instances of 512 MB and this helped (previously they would exhaust my RAM), so maybe it will help you.

@Skowt wow, that's super helpful, thanks a lot! Will try right now.

@arielcamino - I've been using the setting worker_max_tasks_per_child to replace worker instances after ~100 tasks, which has helped maintain memory usage, at least for my servers. I'm running tiny instances of 512 MB and this helped (previously they would exhaust my RAM), so maybe it will help you.

Thanks for sharing your workaround. This did not help here - we are using redis though.

@thedrow I'm not aware of any memory leaks in redis-py. If redis-py had a leak I assume that someone would have encountered it outside of the Celery environment and reported it to the redis-py issue tracker.

Happy to help where I can (I use Celery w/ Redis as a broker on several projects), but I haven't encountered this issue in my deployments.

I am not aware of any memory leaks in current versions of gevent. I assume (hope) someone would have said something if they encountered that (it has happened once or twice before). My current deployments of gevent have multiple workers (web and background) up for weeks at a time heavily using gevent and we haven't encountered memory leaks.

Hi guys, not sure why this issue is closed. We have been having this issue for 2 years now, updating to the latest version of Celery every time, and still having big servers (64-128 GB of RAM) constantly running out of RAM because of these memory leak issues.

Is there any workaround without downgrading to Celery 3 or replacing RabbitMQ?

This makes Celery completely unstable in production environments. I hope it can be fixed; we can't downgrade to Celery 3, so we are planning on moving to another solution (maybe Dramatiq) in order to stop worrying about Celery eating the whole server's RAM in production every 2 days.

How many workers do you have? How many tasks do you run? How often do you run these tasks and how long do they usually take to finish?

The reason RabbitMQ/Celery starts to use a lot of RAM can be related to the number of tasks queued. If you queue tasks faster than the workers can complete them, the queue grows and uses more and more RAM, and eventually it will consume all the RAM available. I believe this problem might happen with Redis too.

I have another theory, but first I want to know if this might be the reason for your problem.
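In case it helps to check that theory, here is a rough way to read the current queue depth from Python. This is only a sketch: the broker URL, the app name and the default "celery" queue name are assumptions for illustration, not details taken from this thread.

# Hypothetical helper: count messages waiting in a queue on a RabbitMQ broker.
from celery import Celery

app = Celery('proj', broker='amqp://guest@localhost//')

def queued_message_count(queue_name='celery'):
    with app.connection_or_acquire() as conn:
        # passive=True only inspects the queue; it raises if the queue does not exist
        return conn.default_channel.queue_declare(queue=queue_name, passive=True).message_count

print(queued_message_count())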

@ardilom sorry, I've just realized we are not sending RabbitMQ data to Datadog, but I will try to clarify our situation. This is how free RAM on some servers drops every 2 days:
memory-leaks-1

We always check the number of tasks pending, and normally it is around 0 (this data is from some days ago):

memory-leaks-2

We run around 250,000 tasks per day, we have around 10 workers, each with a concurrency between 4 and 10, and the average runtime is around 5 seconds, depending on the kind of task.

We always check messages_ready to make sure there are not too many tasks queued (this is what you see in the second image). Do you think it's OK to measure messages_ready? We have occasional peaks, but normally it is close to 0.

For solving the issue I just restart the Celery worker manually and the RAM usage gets normal again.

Let me know if you need anything else. I've just changed the worker_max_tasks_per_child setting on one of the task servers, to see if there is any difference from the rest of them after applying the configuration.

Thanks!

Hi guys, this is to confirm that changing worker_max_tasks_per_child to 1000 fixed the issue in my case 🎉 Thanks again @Skowt.

Something I didn't mention yesterday: I'm using the "prefork" mode, so maybe moving to gevent is another way to resolve the issue.
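For reference, this is roughly what that workaround looks like as a Celery 4.x (lowercase) setting; the threshold is illustrative, not a recommendation from this thread.

# celeryconfig.py (sketch) -- recycle each prefork pool process after it has run 1000 tasks,
# so any memory accumulated in the child is released when the process is replaced.
worker_max_tasks_per_child = 1000

# Equivalent command-line flag:
# celery worker --app=proj --max-tasks-per-child=1000

Note that this only recycles the pool's child processes; it does not help when the parent process itself is the one leaking, as reported later in this thread.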

@arielcamino This issue was closed since we resolved a specific memory leak. We have yet to find another cause for the memory leak. We know there is a problem but we don't know how to reproduce it.
We need someone with access to a production environment where the bug reproduces to debug the problem.
If we don't have one, we'll have to determine that this issue is not actionable.

Hello, can we reopen this issue? We are experiencing similar leaks. Using celery==4.4.7 (with RabbitMQ), the worker runs stable for a couple of hours, sometimes much more, and then all of a sudden starts slowly leaking and ends up using all memory.

We are currently using prefork with --concurrency=1 and the flag --max-tasks-per-child=100, which doesn't seem to help since the parent process appears to be the one leaking.

celery_leak

I can provide more information to help debug this issue.

Re-opening the issue is not a big deal; what matters is someone facing this in production having the interest to help track it down and contribute a fix, or at least find the root cause of the leak in production.

I can definitely help, but I kind of ran out of ideas on what to do. I ran a couple of tools but couldn't identify much about the issue. The only thing that kind of narrows it down is the tracemalloc snapshots I took, which show the memory increase in the same places every couple of minutes or so. This is the top 10, comparing two snapshots:

/usr/local/lib/python3.8/site-packages/celery/events/dispatcher.py:148: size=259 KiB (+218 KiB), count=1026 (+867), average=259 B
/usr/local/lib/python3.8/site-packages/kombu/messaging.py:178: size=231 KiB (+194 KiB), count=1056 (+888), average=224 B
/usr/local/lib/python3.8/site-packages/amqp/connection.py:513: size=217 KiB (+182 KiB), count=703 (+591), average=316 B
/usr/local/lib/python3.8/site-packages/celery/events/dispatcher.py:214: size=207 KiB (+174 KiB), count=704 (+592), average=302 B
/usr/local/lib/python3.8/site-packages/kombu/messaging.py:200: size=204 KiB (+171 KiB), count=704 (+592), average=296 B
/usr/local/lib/python3.8/site-packages/amqp/transport.py:253: size=203 KiB (+171 KiB), count=703 (+591), average=296 B
/usr/local/lib/python3.8/site-packages/amqp/connection.py:508: size=184 KiB (+154 KiB), count=703 (+591), average=268 B
/usr/local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py:445: size=182 KiB (+153 KiB), count=352 (+296), average=528 B
/usr/local/lib/python3.8/site-packages/amqp/channel.py:1758: size=169 KiB (+143 KiB), count=703 (+593), average=247 B
/usr/local/lib/python3.8/site-packages/kombu/asynchronous/hub.py:301: size=167 KiB (+140 KiB), count=351 (+295), average=486 B
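For anyone who wants to collect similar data, below is a rough sketch of how output like the above can be produced inside the worker's parent process. The snapshot interval, frame depth and logger are illustrative choices, not necessarily what was used here.

# Sketch: periodically log the top allocation differences inside the worker process.
import logging
import threading
import tracemalloc

from celery.signals import worker_ready

logger = logging.getLogger(__name__)

@worker_ready.connect
def start_memory_tracing(**kwargs):
    tracemalloc.start(25)  # keep up to 25 frames per allocation
    state = {'previous': tracemalloc.take_snapshot()}

    def dump():
        current = tracemalloc.take_snapshot()
        # log the 10 biggest differences since the previous snapshot
        for stat in current.compare_to(state['previous'], 'lineno')[:10]:
            logger.warning(stat)
        state['previous'] = current
        threading.Timer(120, dump).start()  # schedule the next dump in two minutes

    threading.Timer(120, dump).start()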

The issue still exists.
It happens when a Celery task accesses the app context to perform some functionality;
the context isn't released or disposed after the task completes.

--max-tasks-per-child=

wasn't helpful

For me, adding --max-tasks-per-child works.
For example, with the arguments --autoscale=5,2 --max-tasks-per-child=40 the result looks like this:

Screenshot 2020-08-13 at 2 26 13 PM

While I believe a recent Celery upgrade introduced the memory leak, I can't be totally confident. I will share that the following settings solved the leak.

I can't tell which settings are valid from the documentation, so I am setting all of these values in my Django settings file.

CELERY_CONCURRENCY = CELERY_WORKER_CONCURRENCY = 1
CELERY_MAX_TASKS_PER_CHILD = CELERY_WORKER_MAX_TASKS_PER_CHILD = 1

This does not solve the leak we are seeing, as it also happens in the gevent pool. What I noticed is that our celeryev queue is quite busy. Because tracemalloc showed the event dispatcher as one of the possible sources of the leak, I explicitly disabled task events and turned our Flower instance off. For now it appears that the leak is not happening anymore; I will let it run through the weekend and share the results here.

possible sources of the leak, I explicitly disabled task events and turned our Flower instance off

Anecdotal datapoint from someone who has been watching this issue silently since early on (and has never experienced it directly): I'm aware of one other project (with a not-insubstantial workload for celery) where doing the above had the same outcome of stopping a memory leak. Having only second hand information, I _obviously_ cannot confirm that it was even the same underlying issue (AFAIK it was rabbitmq, no idea about gevent etc), but it's interesting that it correlates.

I suspect it has something to do with the RabbitMQ connection somehow. The stack where we've been observing this leak:

  • celery (latest version): either the prefork or gevent pool; both show the same leak pattern.
  • rabbitmq (cloudamqp SaaS)
  • flower

We've checked all of our tasks for leaks and couldn't find any, which is why I suspect something on the Celery side.

One interesting fact is that we currently have many workers running, and I noticed that once one starts leaking, it also shows up in Flower as offline.

As I ran out of ideas where to look I disabled flower and task events and will keep monitoring if the leak will come back or not.

I'm open to believing it is another part of my stack that is leaking memory at this point. Celery may have had serendipitous behavior in the past that contributed to controlling memory leaks, but all of us together don't seem to be having similar enough issues to confirm that. I know a lot of us are either running ...

  • A huge number of nested tasks at once, or
  • A few monolithic tasks that kick off multi-core processing within the worker

In these cases, we just need to be smart about allowing or disallowing a certain level of concurrency, task queueing, and tasks per child worker. Additionally, all of us should be using built-in safeguards that can kill memory-hungry tasks before they have the opportunity to crash our servers.

An increasing number of people are running heavy-handed CPU-bound and memory-bound processes in celery, which it wasn't really made for, so I think the quickstart documentation should include more detail on this.
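For completeness, a minimal sketch of the built-in safeguards referred to above, using the lowercase Celery 4.x setting names; the numbers are placeholders that need tuning per workload.

# celeryconfig.py (sketch)
worker_max_memory_per_child = 200_000  # resident memory in kilobytes; the pool process is
                                       # replaced after the current task once this is exceeded
task_time_limit = 600                  # hard-kill a task after 10 minutes
task_soft_time_limit = 540             # raise SoftTimeLimitExceeded slightly earlier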

As already mentioned in my previous comments, we have been running workers with both max-tasks-per-child and concurrency set to 1 for a long time now. It doesn't do anything about the leaking memory. Moreover, we are using Redis as both broker and results backend.

From my observations, when RabbitMQ is used as a broker, if setting max-tasks-per-child to 1 "solves" the memory leak, it is most likely a problem with the task implementation, not Celery.

What we are observing and reporting is different. Even if we leave the worker idle for several days, without processing a single task, it still leaks memory to the point where it hits the memory limit and is killed by the supervisor. You can find more details and memory charts in earlier comments.

With the worker processing a single task on a schedule, the memory chart should look more or less like a square wave, but you can clearly see that overall memory usage only rises.
Screenshot 2020-08-14 at 20 42 24

I've managed to put profiling of celery workers on our roadmap. I'll share memory dumps and more details when we start working on this.

I can confirm that turning flower off (and explicitly disabling task events through settings) fixed the leak.

As I mentioned before, at the moment the worker started leaking I noticed in Flower that it would go offline, and the celeryev queue always looked quite busy, so I took the easy route and turned Flower off.

Unfortunately I couldn't find the piece of code that causes the leak. But at least there is this workaround.
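For anyone wanting to try the same workaround, this is roughly what disabling task events looks like with the lowercase Celery 4.x settings; Flower (or any other event consumer) stops receiving task data once these are off.

# celeryconfig.py (sketch)
worker_send_task_events = False  # workers stop publishing task events (and don't pass -E)
task_send_sent_event = False     # clients stop publishing task-sent events

Flower may also re-enable events through remote control while it is running, so shutting Flower down, as described above, is part of the workaround.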

Then probably this is not a Celery issue but a Flower one?

@auvipy flower triggers the issue, but the leak definitely happens on the worker (celery)

fair enough. thanks for sharing.

I'm using Celery with Redis and Flower, and I have to say I'm not currently seeing any memory issues. Is there anything you want from me with regards to data?

@auvipy not using Flower. Workers are started with events disabled.

Please try to debug and find out the root of the memory leak. It could be Celery or your code. It would be best if you can share unit and integration tests.

As mentioned here, we see OOMs due to the Celery worker leaking memory even if no tasks are processed by the worker.
We can't share unit or integration tests as this would expose the company's IP, sorry. But I've managed to add a task for capturing memory dumps on production workers to our internal roadmap. I will share counters and refs for a few scenarios when it's done.

@jsynowiec If you can make it before 5.0.0 GA (follow #6266 for updates) that would be awesome.

Once a bugfix lands in master it will be backported to 4.x as well.

@thedrow When is GA of 5.0 planned? Unfortunately, we have some legacy code that is still due to be migrated to Py3 😞 so we are stuck with Celery 4 for the time being.

We have one release blocker and some documentation to complete.
The answer is very soon.

I can confirm that turning Flower off stops the leak. We've been running without a leak for almost a month now.

So there's still a bug somewhere in our events publishing mechanism.
Does anyone have an idea what it could be?

We don't use Flower and workers are started without --events, yet we experience continuous memory leaks.

The answer is very soon.

I've managed to assign high priority to getting memory dumps and object counters from production workers. I should be able to post some data in the following weeks. We've also raised the priority of finalising the py2->py3 port, so everything should run, and be profiled, using Python 3.

What I'm worried about is that we're talking about two different issues here.

Apparently. One is related to events and maybe Flower, maybe also to using RabbitMQ as a broker. According to issues reported here on GitHub, it has surfaced here and there for a few years. The other one (the one that affects my project) involves different components and is most likely related to using Redis as a broker. Or maybe, at the root, they are the same issue originating in the same code, who knows 🤷🏼. Like the one where trail keeps track of subtasks and leaks instances of AsyncResult 😉

@thedrow @auvipy Just letting you know that we're now moving to memory profiling of workers.

Also, while finalising the Python 3 migration we hit another issue that seems related to https://github.com/celery/celery/issues/4470 or https://github.com/celery/celery/issues/5359. Under certain conditions on Linux systems, while using Redis as a broker, calls to join_native hang indefinitely despite all tasks within the group being already done. A quick strace points to it literally hanging on a read, which might indicate some low-level kernel/library issue. For now we switched to the plain, polling join as we focus on the memory leaks.
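For reference, a sketch of that workaround: waiting on the group with the plain, polling join() instead of join_native(). The task and module names below are hypothetical placeholders.

# Hypothetical example: poll the result backend instead of using the backend-native blocking wait.
from celery import group
from proj.tasks import some_task  # placeholder task

res = group(some_task.s(i) for i in range(10)).apply_async()
values = res.join(interval=0.5, timeout=300)  # join() polls the backend, unlike join_native()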

Hello everyone - finally _some_ data: celery-memtrace-1.tar.xz, signature, my key.

The archive contains tracemalloc logs from 8 workers after ~16 days, a memory usage graph for the period and some version information (including Celery startup banner).

Honestly, I haven't spent any significant time analyzing any of this, but a) our code was never in the list, and b) it may well be some weird interaction with SQLAlchemy, which we also use everywhere, so it's not impossible that the problem is elsewhere, or that it's a combination/interaction problem.

If any other details would be useful, please do ask. We also continue to run those 8 workers with this memory usage logging, so perhaps we'll be able to collect more/better data.

EDIT: Also this comment from this thread is related - we still use the same settings.

I hope you'll find the root cause for this leak.
I'll try to make some time to dig into this myself.

I wonder if this could help mitigate the issue.
https://reliability.substack.com/p/run-python-servers-more-efficiently

We are investigating the possibility that the memory leak's origin is within the requests library, not Celery itself. Is anyone else who is experiencing memory leaks in Celery using requests in their tasks?

@ErrorInPersona Yes, we are registering OOMs in workers with and without requests alike.

@drbig Any luck?

Screenshot_2020-11-17_12-56-28

Well, look at the "green one": the floor is rising, slowly but surely... So apart from a quick confirmation of "yep, it's still an issue", there's not much to add from my side, unfortunately.

However, I've skimmed the link that @thedrow provided, and my to-do list (a bottomless pit) now includes trying to run some workers with jemalloc forced in, so I'll get to that, _eventually_.
