celery raises error: [Errno 104] Connection reset by peer after starting

Created on 29 Jun 2018  ·  57 Comments  ·  Source: celery/celery

When I start the worker, it raises the error: [Errno 104] Connection reset by peer.

With the gevent pool, the error is raised about 3 minutes after the worker starts.

With the prefork pool, the error is raised about 15 minutes after the worker starts.

Not a Bug

Most helpful comment

Still getting this error with Celery 4.2.2.

All 57 comments

I see the same problem with Celery 4.2.0. I don't have it with Celery 4.1.1. Locally, I often, but not always, get the Errno 104. On a Travis build it seems to fail more consistently on 4.2.0 (and succeed on 4.1.1). I haven't noticed the time dependency that @axiaoxin reports.

Can you please provide the output of the following command:

$ celery -A proj report

Hi @georgepsarakis, this is my report:

software -> celery:4.2.0 (windowlicker) kombu:4.2.1 py:2.7.5
            billiard:3.5.0.3 py-amqp:2.3.2
platform -> system:Linux arch:64bit, ELF imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:amqp
results:sentinel://:**@10.18.7.1:26379/1;sentinel://:[email protected]:26379/1;sentinel://:[email protected]:26379/1

JSON_AS_ASCII: False
CACHED_OVER_EXEC_MILLISECONDS: 800
LOG_PEEWEE_SQL: False
SESSION_REFRESH_EACH_REQUEST: True
APP_ROOT_PATH: '/data/srv/zns/app'
REDIS_URL: 'redis://:[email protected]:6379/2'
PROJECT_ROOT_PATH: '/data/srv/zns'
FLATPAGES_ROOT: '/data/srv/zns/app/docs'
SESSION_COOKIE_SAMESITE: None
PROPAGATE_EXCEPTIONS: None
CELERYD_SEND_EVENTS: True
REDIS_LOCK_TIMEOUT: 1800
FAKE_HANDLE_TASK: False
SECRET_KEY: u'********'
BROKER_URL: u'amqp://notifer:********@zns.com:5672/notifer_celery_broker'
ENTRY_RATE_LIMIT: 0
SENTRY_DSN: 'http://6a0ce3f93804422da7321f45353c69d7:[email protected]/10'
SWAGGER: {
    'description': '<a href="/docs" target="_blank">\xe5\x85\xb6\xe4\xbb\x96\xe6\x96\x87\xe6\xa1\xa3</a>',
    'doc_expansion': 'list',
    'footer_text': u'\u6709\u4efb\u4f55\u7591\u95ee\u8bf7\u54a8\u8be2 ashinchen',
    'hide_top_bar': True,
    'specs': [{   'endpoint': 'apispec', 'route': '/apispec.json'}],
    'termsOfService': None,
    'title': 'zns API',
    'uiversion': 3,
    'version': '0.0.1'}
LOG_LEVEL: 'info'
APPLICATION_ROOT: '/'
SERVER_NAME: None
LOG_PATH: '/data/srv/zns/logs'
SERVICE_NAME: 'zns'
CELERYD_MAX_TASKS_PER_CHILD: 10000
TESTING: False
MYSQL_URL: 'mysql+pool://user:[email protected]:3306/zns?max_connections=40&stale_timeout=300'
TEMPLATES_AUTO_RELOAD: None
CELERY_RESULT_PERSISTENT: True
JSONIFY_MIMETYPE: 'application/json'
TOF_APP_KEY: u'********'
TOF_SYS_ID: 1
JSON_KEYCASE: u'********'
TOF_URL: 'http://tof.com/api/v1'
FLATPAGES_EXTENSION: ['.md', '.html', '.htm', '.txt']
SESSION_COOKIE_HTTPONLY: True
USE_X_SENDFILE: False
REQUESTS_POOL_SIZE: 10
API_BIND: u'********'
SESSION_COOKIE_SECURE: False
CACHED_EXPIRE_SECONDS: 60
REDIS_SENTINEL: {
    'db': 0,
    'master_name': 'redis-master',
    'nodes': [   ('10.18.7.1', 26379),
                 ('10.16.19.22', 26379),
                 ('10.16.19.21', 26379)],
    'password': u'********'}
SESSION_COOKIE_DOMAIN: None
SESSION_COOKIE_NAME: 'session'
EXCEPTION_RETRY_COUNT: 2
CELERY_TASK_RESULT_EXPIRES: 604800
MAX_COOKIE_SIZE: 4093
ENTRY_RATE_PER: 0
TOF_WEIXIN_SENDER: 'x-ashin'
ENV: 'production'
CELERYD_TASK_SOFT_TIME_LIMIT: 30
DEBUG: False
PREFERRED_URL_SCHEME: 'http'
EXPLAIN_TEMPLATE_LOADING: False
CELERY_RESULT_BACKEND: u'sentinel://:********@10.18.7.1:26379/1;sentinel://:[email protected]:26379/1;sentinel://:[email protected]:26379/1'
CACHED_CALL: False
FLATPAGES_AUTO_RELOAD: False
MAX_CONTENT_LENGTH: None
REQUEST_ID_KEY: u'********'
NOTIFY_MODULE: 'tof'
JSONIFY_PRETTYPRINT_REGULAR: False
LOG_FUNC_CALL: True
PERMANENT_SESSION_LIFETIME: datetime.timedelta(31)
TOF_EMAIL_SENDER: '[email protected]'
REDIS_CLUSTER: {
    }
TRAP_BAD_REQUEST_ERRORS: None
JSON_SORT_KEYS: u'********'
TRAP_HTTP_EXCEPTIONS: False
SESSION_COOKIE_PATH: None
SEND_FILE_MAX_AGE_DEFAULT: datetime.timedelta(0, 43200)
SPLIT_LOGFILE_BY_LEVEL: False
PRESERVE_CONTEXT_ON_EXCEPTION: None
CELERY_RESULT_BACKEND_TRANSPORT_OPTIONS: {
    'master_name': 'redis-master'}
LOG_IN_FILE: False

As shown in this report, some passwords are not replaced with *.

For what it's worth, I am seeing this as well with a clean project using gevent with RabbitMQ. A couple of minutes after starting the celery workers, we receive a connection reset and no tasks are consumed thereafter.

https://github.com/sihrc/celery-connection-reset

Still having the same issue so far (Celery 4.2). It was resolved by downgrading Celery to 4.1, but I don't know why this error occurred.

Could you try installing celery from the master branch, with all its dependencies also from master, and see what happens?

Still getting this error with Celery 4.2.2.

@auvipy Thanks, it works!

@yuda110 do you know what changes to your dependencies resolved the issue?

We are getting this ConnectionReset issue, and we are using Celery 4.2.1 with the following versions pinned which are compatible with the celery requirements:

billiard==3.5.0.4 # Celery needs billiard. There's a bug in 3.5.0.5
kombu==4.2.2-post1 # Celery needs kombu >= 4.2.0, < 5.0
redis==2.10.6

@charlescapps
Oh, I forgot to erase that answer ☹️☹️ It seemed that the problem was solved after upgrading the version (to 4.2.1), but then I faced an unknown problem again. Eventually I had to downgrade to version 4.0.

I downgraded to 4.1 and it fixed the error. Haven't tried 4.3 yet though.

This error happens pretty rarely for us, and it turns out it's a chained exception that starts with a ConnectionReset error from the redis client. I'm going to just enable retries when a kombu.exceptions.OperationalError is thrown, because the Celery changelog indicates this is a retry-able error.

Just wanted to say that the issue still persists in 4.3.0 when using RabbitMQ. Somehow moving to Redis fixed the issue.

We resolved this by doing retries with exponential backoff whenever a kombu.exceptions.OperationalError is thrown. Those are meant to be retried, according to the docs. The issue happens very rarely for us, so retries are a good solution. We're on 4.2.1.
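
For reference, a minimal sketch of that retry-with-backoff approach could look like the following; the helper name and backoff values are illustrative, not taken from the commenter's code:

import time

from kombu.exceptions import OperationalError


def send_with_retry(task, *args, max_retries=3, base_delay=0.5, **kwargs):
    # Call task.delay(), retrying on kombu's OperationalError with
    # exponential backoff (0.5 s, 1 s, 2 s, ...) before giving up.
    for attempt in range(max_retries + 1):
        try:
            return task.delay(*args, **kwargs)
        except OperationalError:
            if attempt == max_retries:
                raise  # out of retries, surface the error
            time.sleep(base_delay * (2 ** attempt))

Calling send_with_retry(my_task, some_arg) instead of my_task.delay(some_arg) then retries transient broker/backend disconnects, where my_task stands for any Celery task.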

Hi,

I am using rabbitmq as broker and backend and I am having the same issue.

Anyone has a solution?

Thanks in advance.

Same issue here. This is 100% reproducible for me. For some reason the socket to the broker dies after what seems to be the heartbeat interval.

report:

software -> celery:4.3.0 (rhubarb) kombu:4.5.0 py:3.6.7
            billiard:3.6.0.0 py-amqp:2.4.2
platform -> system:Linux arch:64bit, ELF
            kernel version:4.18.0-20-generic imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:amqp results:rpc:///

broker_url: 'amqp://guest:********@localhost:5672//'
result_backend: 'rpc:///'
result_persistent: False
task_default_queue: 'something something'
result_expires: 3600
task_ignore_result: False
task_serializer: 'json'
result_serializer: 'json'
accept_content: ['json']
timezone: 'Europe/Berlin'
enable_utc: True

I have to say that my problems started when I upgraded to Erlang 22.0. But that may as well be coincidental.

Can you suggest any fix? If you can, it will be included in 4.4.0rc2.

I can confirm this behavior on 4.3.0 with the gevent worker as well. Switching from gevent to prefork seems to resolve it. I tried downgrading to 4.1.1, but that doesn't seem to work on Python 3.7 because it requires an older version of gevent (1.2.2) that won't even compile on Python 3.7. I noticed that when the issue starts, the RabbitMQ logs show this message:

missed heartbeats from client, timeout: 60s

Interestingly, despite the heartbeat failing, the worker still picks up tasks and processes them fine. It's just that celery inspect commands all time out until the worker is restarted. Flower still shows information on the dashboard for the worker, but clicking on the worker itself returns a 404 Not Found error, and flower logs failures related to celery inspect commands:

monitor_1    | 2019-08-27 17:39:05.483286 [warning  ] 404 GET /worker/celery@38245f8fef62 (172.20.0.1): Unknown worker 'celery@38245f8fef62' [tornado.general] 
monitor_1    | 2019-08-27 17:39:24.608962 [warning  ] 'stats' inspect method failed [flower.api.control] 
monitor_1    | 2019-08-27 17:39:24.609429 [warning  ] 'active_queues' inspect method failed [flower.api.control] 
monitor_1    | 2019-08-27 17:39:24.609847 [warning  ] 'registered' inspect method failed [flower.api.control] 
monitor_1    | 2019-08-27 17:39:24.610221 [warning  ] 'scheduled' inspect method failed [flower.api.control] 
monitor_1    | 2019-08-27 17:39:24.610905 [warning  ] 'active' inspect method failed [flower.api.control] 
monitor_1    | 2019-08-27 17:39:24.611369 [warning  ] 'reserved' inspect method failed [flower.api.control] 
monitor_1    | 2019-08-27 17:39:24.611890 [warning  ] 'revoked' inspect method failed [flower.api.control] 
monitor_1    | 2019-08-27 17:39:24.612512 [warning  ] 'conf' inspect method failed [flower.api.control] 
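
As a side note, the dead-control-connection state described above can be checked programmatically with the remote-control ping; a minimal sketch, assuming an importable Celery application instance named app and an arbitrary 5-second timeout:

from myproj.celery import app  # hypothetical import path to your Celery app

# Ask all workers to reply within 5 seconds; an empty reply list suggests
# the control connection to the broker is gone (the same state in which
# `celery inspect` commands time out).
replies = app.control.ping(timeout=5.0)
if not replies:
    print('no pong received - workers unreachable over the control channel')
else:
    print('workers replying:', replies)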

Could someone verify this on celery 4.4.0rc3 + kombu 4.6.3?

Will do. FYI, celery 4.4.0rc3 requires kombu 4.6.4:

celery 4.4.0rc3 has requirement kombu<5.0,>=4.6.4, but you'll have kombu 4.6.3 which is incompatible.

Ok, 4.4.0rc3 does appear to resolve this issue. I've left it going for more than 5 minutes with no heartbeat errors using the gevent worker.

kombu 4.6.3 is also compatible

If that's the case, you might want to update the requirements file on the celery project.

But what did we change?

I would also love some insight into what was changed that caused this to be closed, or a link to a PR/code/etc. We're affected, and mitigating it (prefork, RabbitMQ, celery 4.3) by disabling heartbeats, which is suboptimal.

@auvipy Ping?

Ok, 4.4.0rc3 does appear to resolve this issue. I've left it going for more than 5 minutes with no heartbeat errors using the gevent worker.

The issue was closed based on this feedback.

@auvipy It seems there are multiple issues that lead to similar errors. It would be great if you tried to reproduce the bug locally, possibly with some older Celery versions, before resolving the issue.

You are advised to try the master branch.

I'm suggesting to reproduce the bug with one of the earlier versions (e.g. 4.1, 4.2, 4.3) and to make sure that upgrading to 4.4 alone fixed the issue. Closing the bug without your own verification - based on feedback of a single person - seems a bit hasty.

Since you are facing the issue, you should be the first to verify, as you suggested. @czyzby

you should be the first to verify as you suggested

_"Should"?_ ;) If there's anyone that _should_ care about the quality of the project, it's the official maintainers. I'm grateful that you're offering your software for free, but you cannot realistically expect all of your users to contribute to the project.

Anyway, I don't have the configuration that caused the problem anymore, but I think the issue persisted even in case of empty queues with no tasks defined, so it might be easy to reproduce. I've since resolved the issue with a workaround by migrating to Redis, as our team could easily change the technologies at the time. And, to be completely honest, we're considering a move to RQ due to other issues with Celery.

4.4.0rc4 also appears to solve this issue.

Can anyone else check whether it's fixed by celery==4.4.0rc4?

@auvipy I was on 4.3.0 with the occasional Connection reset. No more issues with 4.4.0rc4 for me.

Can anyone else check whether it's fixed by celery==4.4.0rc4?

I got the issue on 4.3.0 quite often; with 4.4.0rc4 the problem occurs far less often, but it still occurs from time to time.
I'm using redis-server 5.0.6 and the Python redis client 3.3.11 with ~14 periodic tasks (every 30 seconds).
So I'd ask you to reopen the issue.
Thanks!

Indeed, the issue still occurs with the default settings. However, as mentioned in other threads, setting broker_heartbeat = 0 in celeryconfig.py seems to help.
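
For reference, a minimal sketch of that workaround in celeryconfig.py; the broker URL is just a placeholder, and note that disabling heartbeats also disables dead-connection detection, so this is a mitigation rather than a fix:

# celeryconfig.py
broker_url = 'amqp://guest:guest@localhost:5672//'  # placeholder broker URL
broker_heartbeat = 0  # disable AMQP heartbeats entirely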

Even after upgrading to celery 4.4.0rc4 and adding CELERY_BROKER_HEARTBEAT = 0 in celery.py, nothing much seems to change for me; I'm still getting the error.

The issue was still not solved after downgrading from celery 4.2.0 to 4.1.0.

The following versions are used in our project:
billiard==3.5.0.2, kombu==4.1.0, celery==4.1.0, amqp==2.4.2

Please suggest a fix.

We started seeing this, or something that very much resembles this issue.

software -> celery:4.3.0 (rhubarb) kombu:4.5.0 py:2.7.12
billiard:3.6.0.0 redis:3.2.1

It started occurring a few times a day, with no real changes.
For our next release, we will try to upgrade to the most recent versions, celery 4.4.0, redis etc. and report back.

Happens for me with concurrency=1000 (gevent) and Redis as broker:
celery==4.4.0 (cliffs)
kombu==4.6.7
billiard==3.6.2.0
(py-)redis==3.4.1

Redis server version=5.0.7
Python 3.7.3

https://sentry.io/share/issue/85f87e60a7c441198c082b9ebf051693/

  • 7 tasks are set to run every 10 seconds.
  • The error occurs only in celery beat, and very rarely: fewer than 3 occurrences per hour.

Tags

  • logger: celery.beat
  • runtime: CPython 3.7.5

Environment

  • Linux-4.15.0-1060-aws-x86_64-with-Ubuntu-18.04-bionic
  • Python 3.7.5 (default, Nov 7 2019, 10:50:52) [GCC 8.3.0]
  • Redis server v=4.0.9 sha=00000000:0 malloc=jemalloc-3.6.0 bits=64 build=9435c3c2879311f3
  • celery==4.4.0
  • billiard==3.6.1.0
  • kombu==4.6.7
  • redis==3.3.11

This happens to me when I connect to a TCP server using asyncio's open_connection. Fifteen minutes after I connect to the remote server within our VPN, it disconnects me. I suspect this is because the connection is idle. The same thing doesn't happen when I'm connecting from within the remote server. This appears to be network-related.

I solved my case! Uff.

It was not a Celery problem. I've tried a few versions of it, including 4.[234].0, and a few versions of the Python interfaces to Redis, and I always had, more or less, the same failure ratio (about 2‰ over 0.5 million requests).

The solution was a Redis server reconfiguration, i.e. disabling client-output-buffer-limit for all classes. According to the Redis documentation:

A client is immediately disconnected once the hard limit is reached, or if the soft limit is reached and remains reached for the specified number of seconds (continuously).

Both the hard or the soft limit can be disabled by setting them to zero.

I hope it will help you all as well. Or maybe you'll improve on my solution.

This also worked for me, where none of the other suggestions did. Thank you!

I can also confirm that it resolved the issue for my setup! Thanks for putting time into this @rganowski!

Great if this fixes the issue, but before I start removing defaults from the configuration file, I'd like to know what that setting does and why it's part of the default config.

@Moulde I'm not quite sure what you mean by talking about deleting the defaults from the configuration file. Which configuration file? Do you mean changing the default Redis server settings that I pointed out?

I would also like to know why such defaults exist. Was that a conscious choice? If so, what is the risk of giving them up? But, to be honest, I'm not going to check that out. I had a 10 MD task and had to add 3 MD for free.

Nobody said that this fixes the issue. I said that I found a solution for my case. Two other folks said it also works for them. I read that with pleasure. But I read your words as "prove it". Am I wrong?

Please test it on your application and let us know if it works for you. If you resolve any other doubts, don't forget to share them with others.

@rganowski It sounds like we agree, and yes, I do see what you mean about my wording. It was not meant like that, but more to add a little healthy skepticism before modifying system defaults, and maybe a bit of critique of the documentation, as some info about why that setting is needed would be great, besides the "what it does" part that is in the file :)
And thanks for the time you put into this, I would not have figured that out on my own.

The issue is closed because there was no error in the Celery code, but the problem behind the problem is not solved. I think an adequate warning should be added to the Redis backend settings documentation.

Googling for 'client-output-buffer-limit' turns up many interesting articles. One, now already 6 years old, has a really beautiful title: The Replication Buffer - How to Avoid Devops Headaches. There we can read:

Before increasing the size of the replication buffers you must make sure you have enough memory on your machine.

In another article, Client Buffers - Before taking Redis in production check this!, the author says:

By default normal clients(normal execute commands) are not limited because they don’t receive data without asking (in a push way), but just after a request, so only asynchronous clients may create a scenario where data is requested faster than it can read.

Isn’t that our case?

For me, at least until now, the reconfiguration turned out to be salvation. No new '104' errors, even under a really heavy and sudden load.

@rganowski @Moulde @CM2Walki

Hi, I may sound very naive, but can you please tell me where I can make the necessary modification to disable client-output-buffer-limit for all classes? I am also getting the same error, but somehow I am not able to interpret your answer. Can you please elaborate so that I can make the necessary changes? Thank you!

I think an adequate warning should be added to the Redis backend settings documentation.

A doc improvement PR would be really appreciated.

@girijesh97 @auvipy

In redis.conf

client-output-buffer-limit normal 0 0 0
client-output-buffer-limit replica 0 0 0
client-output-buffer-limit pubsub 0 0 0

@rganowski Sir, again I may sound very naive, but I have a Django application in which I am using Celery (version 4.4.2) and facing the connection error issue. Can you please help me locate or create this redis.conf file? Do I need to create this file in my application, or is it available in some package?

If your case is the same one we were talking about, your Celery is using a Redis server as the results backend. The file I've mentioned is the standard Redis server configuration file. That's why this is in fact not a Celery problem but a side effect.
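
For anyone who cannot edit redis.conf directly, the same limits can also be inspected and changed at runtime; a minimal sketch using redis-py, where the connection details are placeholders and CONFIG SET changes are lost on a Redis restart unless they are also written to redis.conf:

import redis

r = redis.Redis(host='localhost', port=6379, password=None)  # placeholder connection details

# Show the current limits, then disable them for the 'normal' and 'pubsub'
# client classes (a Celery Redis results backend uses both regular commands
# and pub/sub).
print(r.config_get('client-output-buffer-limit'))
r.config_set('client-output-buffer-limit', 'normal 0 0 0 pubsub 0 0 0')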

@auvipy Is there a fix for RabbitMQ as broker and result backend for the same issue? Seeing this in 4.4 as well on long-running tasks. The above fix is for the Redis backend only.

Also seeing this problem occur intermittently with RabbitMQ and celery 4.2.0. Even built-in retry handling would be better than forcing that onto users of the package.

I'm also experiencing it. I'm on Celery 4.3.0 and RabbitMQ 3.3.5.
