Celery: Celery hangs with 100% CPU usage

Created on 21 Dec 2016 · 29 comments · Source: celery/celery

I am running Celery with Amazon SQS. Inside the Celery task, the task sends a PUT request to a server using Python's requests library. After the task runs successfully for the first request, Celery hangs with 100% CPU usage. I don't know what's going on.
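For context, the kind of task described above looks roughly like the following minimal sketch; the app name, broker URL, and task are hypothetical, not the reporter's actual code:

```
# Minimal sketch of the setup described above; names are hypothetical.
from celery import Celery
import requests

app = Celery('myapp', broker='sqs://')  # SQS broker; credentials come from the environment

@app.task(bind=True, max_retries=3)
def push_result(self, url, payload):
    """Send a PUT request to an external server and retry on transient errors."""
    try:
        response = requests.put(url, json=payload, timeout=10)
        response.raise_for_status()
        return response.status_code
    except requests.RequestException as exc:
        raise self.retry(exc=exc, countdown=5)
```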
The strace dump for the hanging PID is:
```
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x999104, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x999100, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x999140, FUTEX_WAKE_PRIVATE, 1) = 1
```

Configuration:
celery == 4.0.2
kombu == 4

Labels: Amazon SQS Broker, Bug Report, Major, Confirmed ✔, Has Testcase ✔, Worker Hangs

All 29 comments

I have been facing a similar issue with the SQS broker, but it occurs after the workers pick up the messages from SQS. Celery remains at 100% CPU and I don't even see any activity in strace. The last thing I see is the message that the task is picked up.

[2016-12-19 03:37:26,266: DEBUG/MainProcess] TaskPool: Apply

When I run lsof it seems normal as well:
```
celery 19304 farshid 17u IPv4 550364 0t0 TCP 172.30.0.57:41524->54.239.28.115:https (CLOSE_WAIT)
celery 19304 farshid 18u IPv4 550366 0t0 TCP 172.30.0.57:41525->54.239.28.115:https (CLOSE_WAIT)
celery 19304 farshid 19r FIFO 0,8 0t0 550359 pipe
celery 19304 farshid 20r FIFO 0,8 0t0 550360 pipe
celery 19304 farshid 21u IPv4 550369 0t0 TCP 172.30.0.57:41526->54.239.28.115:https (ESTABLISHED)
```

Would someone be able to help me add a log statement in Celery that I can use to see why the task is never invoked?

I have been using the same code base with AMQP without any issue for months. This happened after I transitioned to SQS; it was okay for the first day or so, but things stopped working afterwards.

I was finally able to get this to work by:
1. using -P gevent
2. setting CELERYD_PREFETCH_MULTIPLIER=0

I added a lot of log messages to find out where the message gets stuck, and it seems the AsyncPool class's on_apply method does not start running the task.

I will spend more time on this to find out why AsyncPool does not run. Again, this issue only occurred when the SQS queue already had thousands of messages queued; if I started with a new queue, things would work fine.

Python 3.5.1 (default, May 11 2016, 01:09:13)
celery - latest
kombu (4.0.2)
celery (4.0.0) (also tried with 4.0.1 and 4.0.2)

Is this in the main process, or one of the child processes?

The private futex may imply a thread is being used, and some kind of mutex, when there should be no threads in the worker. Maybe 1) boto is using threads internally 2) your tasks are doing something, or 3) your tasks are initializing code in the parent that creates a mutex

@ask I ran into the same problem. With gdb I was able to narrow down the problem to code in SQS.py. Somehow this part of the code goes into an infinite loop:

```
def _loop1(self, queue, _=None):
    self.hub.call_soon(self._schedule_queue, queue)

def _schedule_queue(self, queue):
    if queue in self._active_queues:
        if self.qos.can_consume():
            self._get_bulk_async(
                queue, callback=promise(self._loop1, (queue,)),
            )
        else:
            self._loop1(queue)
```

Here is one of the backtraces from gdb:

```
(gdb) py-bt
Traceback (most recent call first):
  File "/usr/local/lib/python3.5/abc.py", line 182, in __instancecheck__
    if subclass in cls._abc_cache:
  File "/usr/local/lib/python3.5/site-packages/kombu/async/hub.py", line 193, in call_soon
    if not isinstance(callback, Thenable):
  File "/usr/local/lib/python3.5/site-packages/kombu/transport/SQS.py", line 284, in _loop1
    self.hub.call_soon(self._schedule_queue, queue)
  File "/usr/local/lib/python3.5/site-packages/kombu/transport/SQS.py", line 293, in _schedule_queue
    self._loop1(queue)
  File "/usr/local/lib/python3.5/site-packages/vine/promises.py", line 131, in __call__
    retval = self.fun(*final_args, **final_kwargs)
  File "/usr/local/lib/python3.5/site-packages/kombu/async/hub.py", line 282, in create_loop
    item()
  File "/usr/local/lib/python3.5/site-packages/celery/worker/loops.py", line 88, in asynloop
    next(loop)
  File "/usr/local/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 594, in start
    c.loop(*c.loop_args())
  File "/usr/local/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/usr/local/lib/python3.5/site-packages/celery/bootsteps.py", line 370, in start
    return self.obj.start()
  File "/usr/local/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.5/site-packages/celery/worker/worker.py", line 203, in start
    self.blueprint.start(self)
  File "/usr/local/lib/python3.5/site-packages/celery/bin/worker.py", line 256, in run
    worker.start()
  File "/usr/local/lib/python3.5/site-packages/celery/bin/base.py", line 244, in __call__
    ret = self.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/celery/bin/worker.py", line 221, in run_from_argv
    return self(*args, **options)
  File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 412, in execute
    ).run_from_argv(self.prog_name, argv[1:], command=argv[0])
  File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 480, in handle_argv
    return self.execute(command, argv)
  File "/usr/local/lib/python3.5/site-packages/celery/bin/base.py", line 281, in execute_from_commandline
    return self.handle_argv(self.prog_name, argv[1:])
  File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 488, in execute_from_commandline
    super(CeleryCommand, self).execute_from_commandline(argv)))
  File "/usr/local/lib/python3.5/site-packages/celery/bin/celery.py", line 326, in main
    cmd.execute_from_commandline(argv)
  File "/usr/local/lib/python3.5/site-packages/celery/__main__.py", line 14, in main
    _main()
  File "/usr/local/bin/celery", line 11, in <module>
    sys.exit(main())
```

@ask Also this is the case @farshidce ran into, where strace does not print anything.

> Is this in the main process, or one of the child processes?

The log messages were from the main process.

> The private futex may imply a thread is being used, and some kind of mutex, when there should be no threads in the worker. Maybe 1) boto is using threads internally 2) your tasks are doing something, or 3) your tasks are initializing code in the parent that creates a mutex

I could not reproduce this after I flushed out all the messages using gevent (instead of the async pool).
The workers and methods I have are straightforward and don't use any threads, and they have been working fine without any issue on RabbitMQ for months.

I have not reproduced this issue again since switching to gevent, but as soon as I find time I plan to run some load testing again, with 15k messages on SQS, to see if things get stuck again.

On a side note (sorry for the long response): has anyone asked the SQS team to implement what Celery needs to respond to "inspect active"? That could be quite helpful in debugging these sorts of issues...

Further investigation revealed some more information. The loop is indeed in the code I pasted above, in _loop1/_schedule_queue. The reason is that self.qos.can_consume() always returns False. I have the prefetch multiplier set to 8, and all 8 tasks are prefetched. The logs confirm it: before the problems begin I see 8 "received task" entries from the main process. After that Celery hangs, and the received tasks are never executed.

I think I got to the bottom of this. To trigger this bug you need the prefetch multiplier set to X and you need to receive X messages at once. At that point qos.can_consume() returns False, and a call to _loop1 is made, which schedules _schedule_queue in the hub.

This means that whenever the prefetch limit is reached, CPU usage will be high (especially if you have tasks that execute for a long time), which I think is a bug on its own.
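To make the prefetch interaction concrete, here is a toy model of the bookkeeping being described. This is not kombu's actual QoS class, just an illustration of why can_consume() stays False once the prefetch limit is reached:

```
# Toy model of the prefetch bookkeeping described above (not kombu's QoS class).
class ToyQoS:
    def __init__(self, prefetch_count):
        self.prefetch_count = prefetch_count  # 0 means "no limit"
        self.unacked = 0                      # messages delivered but not yet acked

    def can_consume(self):
        return self.prefetch_count == 0 or self.unacked < self.prefetch_count

qos = ToyQoS(prefetch_count=8)
qos.unacked = 8            # all 8 prefetched tasks received, none acked yet
print(qos.can_consume())   # False -> _schedule_queue keeps rescheduling itself
```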

Now, the real bug is in Hub.create_loop function in kombu/async/hub.py.

        while 1:
            for tick_callback in on_tick:
                tick_callback()

            while todo:
                item = todo.pop()
                if item:
                    item()

            poll_timeout = fire_timers(propagate=propagate) if scheduled else 1

The way call_soon() is implemented, if we schedule a function which schedules a function which schedules a function, and so on, it creates an infinite loop here. The loop never gets a chance to poll readers and writers because it is always stuck processing "todo" items.
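A stand-alone sketch of that pattern (a toy model, not kombu's actual Hub) shows why the inner while todo: loop never drains once a callback reschedules itself:

```
# Toy model of the starvation described above (not kombu's Hub).
todo = set()

def call_soon(fn):
    todo.add(fn)

def schedule_queue():
    # models SQS.py's _schedule_queue when qos.can_consume() is False:
    # it does no I/O and simply re-registers itself
    call_soon(schedule_queue)

def run_todo(max_items=10):
    """One pass over the todo set; returns how many items were executed."""
    ran = 0
    while todo:                # this inner loop never empties on its own
        item = todo.pop()
        item()                 # re-adds itself via call_soon()
        ran += 1
        if ran >= max_items:   # safety valve so the demo terminates
            break
    return ran

call_soon(schedule_queue)
print(run_todo())  # 10 -- hits the safety valve; the real loop would never reach its poll step
```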

Good catch @rafales. So this appears to affect SQS only. Is that correct? Also, have you found any sort of workaround? Or is "revert to Celery 3.x" the answer for now?

@grantmcconnaughey it may affect other backends too. For SQS setting prefetch multiplier to 0 should fix this for now. I hope it will get fixed soon though.

Cool, thanks a lot @rafales. I have CELERYD_PREFETCH_MULTIPLIER = 1, so I'll change that and see what happens.

Hey @rafales and @ask, I've done some more testing around this.

I'm using Celery/Kombu 4.0.2 and SQS. I have a small ECS cluster with 3 containers running. If I queue up 100 jobs all at once Celery will make it through 6 before it completely stops processing and CPU utilization goes up to 100% per container. It hangs at 6 because each container is running 2 celery processes (2 processes * 3 containers). That means each process executes one task before hanging.

I _do_ have CELERYD_PREFETCH_MULTIPLIER set to 0, yet this is still happening.

This is what I'm running from the command line:

$ celery worker -A myapp.celery -Q myapp-builds --loglevel=INFO

Update: I bet this setting needs to be CELERY_WORKER_PREFETCH_MULTIPLIER now. I should really just switch to the lowercase setting names...

Setting CELERY_WORKER_PREFETCH_MULTIPLIER to 0 _does_ fix this issue, which is great. Tasks no longer get stuck.
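For anyone else hitting this, a minimal sketch of that workaround using Celery 4's lowercase setting name (worker_prefetch_multiplier, which is what a CELERY_-namespaced CELERY_WORKER_PREFETCH_MULTIPLIER maps to) might look like this; the app name and broker URL are illustrative:

```
# Sketch of the workaround above; app name and broker URL are illustrative.
from celery import Celery

app = Celery('myapp', broker='sqs://')

# 0 removes the prefetch limit entirely, so qos.can_consume() never gets stuck
# returning False and the hub doesn't spin at 100% CPU on _schedule_queue.
app.conf.worker_prefetch_multiplier = 0
```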

-P solo worked for me, while -P gevent starts up and immediately exits. I didn't have time to do much investigating since I was in 🔥 fighting mode, but a few other notes:

  • Same behavior where it'd pre-fetch and then get stuck in a loop with 100% CPU usage.
  • Starting a solo worker in the foreground by itself exhibited the same behavior as multiple workers daemonized.
  • Had no issues with our SQS queues for 3 days on newly spun-up EC2 instances. All of a sudden it went kaput.
  • I have not tried setting CELERY_WORKER_PREFETCH_MULTIPLIER to 0 yet.

-P solo worked for me also; I've created a faux level of concurrency using supervisord numprocs for now.

CELERY_WORKER_PREFETCH_MULTIPLIER set to 0 did unblock the queue, but ultimately dumped everything into the dead-letter queue, so instead I set it to 2 (default: 4) in order to distribute queue messages more evenly across the celeryd processes. While this works, it is definitely more memory-hungry (but functional!).

I am experiencing this same issue. It seems like a pretty core piece of functionality; I'm surprised that it has gone several months without a fix. For what it's worth, thanks to @etos and @joestump I was able to work around the problem using -P solo as suggested.

Seeing the same issue. Is there a proposed fix at this point?

We'd appreciate a PR that fixes this.

So, I'm having the same issue when trying to run Celery on designated AWS ECS container images (Docker).

I was able to fix my workers with CELERY_WORKER_PREFETCH_MULTIPLIER = 0 and -P solo; however, the solo option is not available for celery beat.

Anybody got suggestions how to solve this?

Broker: AWS SQS
Affected version: 4.0.x

@HitDaCa Unless I'm wrong, celery beat is just a simple process that pushes new tasks on the given schedule; it doesn't do any processing and thus doesn't need any configuration for the worker pool.
Someone correct me if I'm wrong.

@rubendura sounds right, celery beat is a producer for recurring tasks. Thus the solo concurrency task pool setting should not apply here.
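To illustrate the distinction, a minimal beat configuration looks something like the sketch below; the task and schedule names are hypothetical. Beat only publishes the message, and a worker, possibly one stuck at 100% CPU, is what actually executes it:

```
# Sketch illustrating that beat only *enqueues* tasks; names are hypothetical.
from celery import Celery
from celery.schedules import crontab

app = Celery('myapp', broker='sqs://')

app.conf.beat_schedule = {
    'nightly-cleanup': {
        'task': 'myapp.tasks.cleanup',          # executed by a worker, not by beat
        'schedule': crontab(hour=3, minute=0),  # beat publishes a message at 03:00
    },
}
# Running the beat service publishes the task message on schedule; if the worker
# consuming the queue is hung at 100% CPU, the message just sits in SQS.
```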

@rubendura & @georgepsarakis Thank you for the clarification. I must have gotten myself confused about who processes a beat-scheduled task.

Following this, I assume the non-functioning Celery worker was responsible for not processing the scheduled task. The issue of beat not being triggered in the first place most likely resulted from the worker using 100% of the available CPU.

I'm also experiencing this with:

celery==4.0.2
astroid==1.1.1
kombu==4.0.2

Celery appears to run fine for a few minutes. The logs even show dozens of tasks completing, but then everything appears to stop, no more tasks are retrieved, even though the SQS queue lists hundreds of pending tasks. Celery's logs show nothing is being processed, yet the workers are consuming 100% CPU.

@etos What broker are you using? Would setting CELERY_WORKER_PREFETCH_MULTIPLIER = 0 cause tasks to be deleted and not processed with SQS?

Hi @chrisspen, yeah, I believe this issue only occurs with SQS.

Continuing with @rafales's debugging, we dug a little further. The problem is indeed that in kombu/async/hub.py the while todo: statement produces an infinite loop, because the todo item it calls immediately calls call_soon, which adds the item again, so the loop never ends.

I.e.: in _schedule_queue, if can_consume() is False, it calls _loop1, which calls call_soon with the _schedule_queue callback. call_soon just adds this callback to the todo set.
So once it is in the todo set, the loop starts and calls the _schedule_queue method, which immediately calls _loop1, which calls call_soon with _schedule_queue, which is added to todo again; the loop sees it has more items in the todo set, calls it again, and so on...

A possible solution (I'm not submitting a PR yet, because I'm not sure what other problems it might cause) is to "freeze" the todo set, let it call all of its items, let the rest of the code execute (the readers and writers), and then get back to the todo loop on the next iteration of the outer loop and let it do its magic again. So in kombu/async/hub.py's create_loop:

    todo = self._ready.copy()
    self._ready.clear()
    while todo:
        item = todo.pop()
        if item:
            item()

That way, if the item() being called is the _schedule_queue callback, it will be re-added to todo, but the rest of the code will get to execute and potentially let the worker clear some tasks and actually become able to consume more messages from the queue.
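Here is a toy illustration of that idea (not the actual kombu patch): draining a snapshot of the ready set each pass means a self-rescheduling callback lands in the next pass instead of extending the current inner loop forever.

```
# Toy illustration of the proposed fix (not the actual kombu patch).
_ready = set()

def call_soon(fn):
    _ready.add(fn)

def reschedules_itself():
    # models _schedule_queue -> _loop1 -> call_soon when can_consume() is False
    call_soon(reschedules_itself)

def one_loop_pass():
    todo = _ready.copy()   # freeze the current batch
    _ready.clear()
    while todo:            # bounded: only the frozen batch is processed
        item = todo.pop()
        if item:
            item()
    # control returns to the outer loop here, which can poll readers/writers
    # before the next pass picks up whatever was re-registered in the meantime

call_soon(reschedules_itself)
one_loop_pass()
print(len(_ready))  # 1 -- queued for the next pass instead of looping forever now
```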

One caveat I can see here is that this problem is only related to SQS but the code changed here is not.
Anyone who knows the internals a bit better (@ask, @thedrow) can maybe comment on this or a similar solution?

This seems to solve the issue at hand, but I'm still not sure it doesn't create others.
Will update.

> One caveat I can see here is that this problem is only related to SQS but the code changed here is not.

@gabriel-amram I've been following this issue for a while because I believed it might be related to similar behavior we're experiencing with RabbitMQ since switching to 4.x. Could be that it's not limited to SQS then?

@tpyo it might be; can you refer me to the issue you are talking about? Maybe the code for RabbitMQ uses the same mechanism, which would then explain the other issue as well.

One more here facing this issue:

celery==4.0.2
kombu==4.0.2
python 2.7
Amazon SQS

We have several workers running several different queues. It looks like the problem only occurs when one queue has too many jobs on it; then the workers assigned to that queue stop working.

We're planning to temporarily move to Redis, but I can run some more tests before that. I'll keep you all updated if I learn something new. Thanks everyone.

@gabriel-amram I'm not sure I fully understand the solution. I do understand the analysis and I think it is correct.
Could you please issue a PR for us to review? Make sure to ping me so I won't miss it.
