Celery: Tens of thousands of backend_cleanup tasks executed every night starting at 10:42 pm. All async functions stop working

Created on 21 Oct 2017  ·  3 Comments  ·  Source: celery/celery

Checklist

  • [x] I have included the output of celery -A proj report in the issue.
    (if you are not able to do this, then at least specify the Celery
    version affected).
  • [x] I have verified that the issue exists against the master branch of Celery.

My app is built with Django 1.11 and Celery 4.1 (installed with the [sqs] extra). It is deployed on AWS through Elastic Beanstalk, with Amazon SQS as the broker.

Steps to reproduce

The issue shows up every night at around 10:42 pm and lasts for about an hour. Tens of thousands of backend_cleanup tasks start executing. According to the list of task results in the Django Admin, each task completes successfully. CPU usage sometimes stays at 100%, and every page of the app becomes very slow to load. About an hour later, CPU usage returns to normal, but the queues are wedged: no delay() task gets executed. I have to purge the queue in SQS, restart the app, and re-deploy the entire code to AWS Elastic Beanstalk before delay() tasks run normally again. Without the purging, restarting, and re-deploying, no async task executes at all; even the built-in backend_cleanup task cannot start. My app has no periodic or crontab tasks other than the built-in celery.backend_cleanup task. Could anyone help me with this issue?
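For context on where these nightly runs come from: when a result backend is configured and celery beat is running, beat schedules the built-in celery.backend_cleanup task automatically based on the result-expiry setting, and setting that expiry to None prevents beat from scheduling the cleanup at all. A minimal sketch, assuming Celery 4's lowercase setting name (with Django's CELERY_ settings namespace the equivalent name is CELERY_RESULT_EXPIRES):

```python
# Sketch: stop celery beat from scheduling celery.backend_cleanup by
# declaring that task results never expire. Assumes Celery 4's lowercase
# setting style; in a namespaced Django settings module the equivalent
# name is CELERY_RESULT_EXPIRES.
result_expires = None  # None (or 0) means results never expire -> no cleanup task
```

This trades automatic cleanup for manual housekeeping of the result store, so it is mainly useful for confirming that backend_cleanup is the trigger.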

Expected behavior

I need the entire app to perform normally after the backend_cleanup tasks are finished.

Actual behavior

Described in the Steps to reproduce section above.

celery-sqs-worker-report.txt

Question

Most helpful comment

You can simply override the schedule and avoid using crontab:

    'celery.backend_cleanup': {
        'task': 'celery.backend_cleanup',
        'schedule': 86400,  # every 24 hours, instead of the crontab('0', '4', '*') used by celery beat
        'options': {'expires': 12 * 3600},
    },

All 3 comments

I just found that the problem is probably related to the built-in task, celery.backend_cleanup. I ran the app in the local development environment, and at 10:42 pm an enormous number of backend_cleanup tasks started and executed successfully. After a couple of seconds, however, an error occurred. Here is the error.

[2017-10-23 22:51:27,677: CRITICAL/MainProcess] Unrecoverable error: Exception('Request Empty body  HTTP 599  Server aborted the SSL handshake (None)',)
Traceback (most recent call last):
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/celery/worker/worker.py", line 203, in start
    self.blueprint.start(self)
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/celery/bootsteps.py", line 370, in start
    return self.obj.start()
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/celery/worker/consumer/consumer.py", line 320, in start
    blueprint.start(self)
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/celery/worker/consumer/consumer.py", line 596, in start
    c.loop(*c.loop_args())
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/celery/worker/loops.py", line 88, in asynloop
    next(loop)
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/kombu/async/hub.py", line 354, in create_loop
    cb(*cbargs)
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/kombu/async/http/curl.py", line 111, in on_readable
    return self._on_event(fd, _pycurl.CSELECT_IN)
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/kombu/async/http/curl.py", line 124, in _on_event
    self._process_pending_requests()
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/kombu/async/http/curl.py", line 132, in _process_pending_requests
    self._process(curl, errno, reason)
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/kombu/async/http/curl.py", line 178, in _process
    buffer=buffer, effective_url=effective_url, error=error,
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/vine/promises.py", line 150, in __call__
    svpending(*ca, **ck)
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/vine/promises.py", line 143, in __call__
    return self.throw()
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/vine/promises.py", line 140, in __call__
    retval = fun(*final_args, **final_kwargs)
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/vine/funtools.py", line 100, in _transback
    return callback(ret)
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/vine/promises.py", line 143, in __call__
    return self.throw()
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/vine/promises.py", line 140, in __call__
    retval = fun(*final_args, **final_kwargs)
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/vine/funtools.py", line 98, in _transback
    callback.throw()
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/vine/funtools.py", line 96, in _transback
    ret = filter_(*args + (ret,), **kwargs)
  File "/Users/user/code/accusize-021/lib/python3.4/site-packages/kombu/async/aws/connection.py", line 253, in _on_status_ready
    raise self._for_status(response, response.read())
Exception: Request Empty body  HTTP 599  Server aborted the SSL handshake (None)

Meanwhile, the queue disappeared from AWS SQS. A few backend_cleanup tasks then continued to execute, and eventually a warning showed up.

[2017-10-23 22:51:28,967: WARNING/MainProcess] Restoring 10 unacknowledged message(s)

Finally, the HTTPS connection to the SQS queue stopped. The app could not reconnect to SQS unless I re-deployed it to AWS Elastic Beanstalk.

Could anyone help me with this issue? Any response will be significantly appreciated.

I think I solved the issue. The problem has not happened since I changed CELERY_RESULT_BACKEND from "django-db" to "redis" and changed CELERY_TIMEZONE to "UTC". Although the Task Results list in the Django Admin stopped recording new results, my problem was largely solved.
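The settings change described above might look like this (a sketch; the Redis URL is a placeholder, and the uppercase names assume the CELERY_ prefix style used by django-celery-style configuration):

```python
# settings.py (sketch): the two changes described above.
# The Redis URL is a placeholder for your actual Redis instance.
CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'  # was 'django-db'
CELERY_TIMEZONE = 'UTC'
```

Note that switching away from the django-db backend is what stops results from appearing in the Django Admin, since django-celery-results no longer stores them.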

You can simply override the schedule and avoid using crontab:

    'celery.backend_cleanup': {
        'task': 'celery.backend_cleanup',
        'schedule': 86400,  # every 24 hours, instead of the crontab('0', '4', '*') used by celery beat
        'options': {'expires': 12 * 3600},
    },
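In a Django project this override might sit in the beat schedule roughly like this (a sketch using a plain dict; wiring it onto the Celery app via app.conf.beat_schedule, or the equivalent Django setting, is assumed):

```python
# Sketch: a beat schedule dict overriding the built-in backend_cleanup
# entry with a relative 24-hour interval instead of the default crontab.
beat_schedule = {
    'celery.backend_cleanup': {
        'task': 'celery.backend_cleanup',
        'schedule': 86400,                  # seconds: run every 24 hours
        'options': {'expires': 12 * 3600},  # discard queued cleanups after 12h
    },
}
```

Using a relative interval (with an expiry on the queued task) avoids a pile-up of cleanup tasks all firing at the same wall-clock moment, which is the symptom described in this issue.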
