celery raises error: [Errno 104] Connection reset by peer after starting

Created on 29 Jun 2018  ·  57 Comments  ·  Source: celery/celery

When I start the worker, it raises the error: [Errno 104] Connection reset by peer.

With the gevent pool, the error is raised about 3 minutes after the worker starts.

With the prefork pool, the error is raised about 15 minutes after the worker starts.

Not a Bug

Most helpful comment

Still getting this error with Celery 4.2.2.

All 57 comments

I see the same problem with Celery 4.2.0. I don't have it with Celery 4.1.1. Locally, I often, but not always, get the Errno 104. On a Travis build it seems to fail more consistently on 4.2.0 (and succeed on 4.1.1). I haven't noticed the time dependency that @axiaoxin reports.

Can you please provide the output of the following command:

$ celery -A proj report

Hi @georgepsarakis, this is my report:

software -> celery:4.2.0 (windowlicker) kombu:4.2.1 py:2.7.5
            billiard:3.5.0.3 py-amqp:2.3.2
platform -> system:Linux arch:64bit, ELF imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:amqp
results:sentinel://:**@10.18.7.1:26379/1;sentinel://:[email protected]:26379/1;sentinel://:[email protected]:26379/1

JSON_AS_ASCII: False
CACHED_OVER_EXEC_MILLISECONDS: 800
LOG_PEEWEE_SQL: False
SESSION_REFRESH_EACH_REQUEST: True
APP_ROOT_PATH: '/data/srv/zns/app'
REDIS_URL: 'redis://:[email protected]:6379/2'
PROJECT_ROOT_PATH: '/data/srv/zns'
FLATPAGES_ROOT: '/data/srv/zns/app/docs'
SESSION_COOKIE_SAMESITE: None
PROPAGATE_EXCEPTIONS: None
CELERYD_SEND_EVENTS: True
REDIS_LOCK_TIMEOUT: 1800
FAKE_HANDLE_TASK: False
SECRET_KEY: u'********'
BROKER_URL: u'amqp://notifer:********@zns.com:5672/notifer_celery_broker'
ENTRY_RATE_LIMIT: 0
SENTRY_DSN: 'http://6a0ce3f93804422da7321f45353c69d7:[email protected]/10'
SWAGGER: {
    'description': '<a href="/docs" target="_blank">\xe5\x85\xb6\xe4\xbb\x96\xe6\x96\x87\xe6\xa1\xa3</a>',
    'doc_expansion': 'list',
    'footer_text': u'\u6709\u4efb\u4f55\u7591\u95ee\u8bf7\u54a8\u8be2 ashinchen',
    'hide_top_bar': True,
    'specs': [{   'endpoint': 'apispec', 'route': '/apispec.json'}],
    'termsOfService': None,
    'title': 'zns API',
    'uiversion': 3,
    'version': '0.0.1'}
LOG_LEVEL: 'info'
APPLICATION_ROOT: '/'
SERVER_NAME: None
LOG_PATH: '/data/srv/zns/logs'
SERVICE_NAME: 'zns'
CELERYD_MAX_TASKS_PER_CHILD: 10000
TESTING: False
MYSQL_URL: 'mysql+pool://user:[email protected]:3306/zns?max_connections=40&stale_timeout=300'
TEMPLATES_AUTO_RELOAD: None
CELERY_RESULT_PERSISTENT: True
JSONIFY_MIMETYPE: 'application/json'
TOF_APP_KEY: u'********'
TOF_SYS_ID: 1
JSON_KEYCASE: u'********'
TOF_URL: 'http://tof.com/api/v1'
FLATPAGES_EXTENSION: ['.md', '.html', '.htm', '.txt']
SESSION_COOKIE_HTTPONLY: True
USE_X_SENDFILE: False
REQUESTS_POOL_SIZE: 10
API_BIND: u'********'
SESSION_COOKIE_SECURE: False
CACHED_EXPIRE_SECONDS: 60
REDIS_SENTINEL: {
    'db': 0,
    'master_name': 'redis-master',
    'nodes': [   ('10.18.7.1', 26379),
                 ('10.16.19.22', 26379),
                 ('10.16.19.21', 26379)],
    'password': u'********'}
SESSION_COOKIE_DOMAIN: None
SESSION_COOKIE_NAME: 'session'
EXCEPTION_RETRY_COUNT: 2
CELERY_TASK_RESULT_EXPIRES: 604800
MAX_COOKIE_SIZE: 4093
ENTRY_RATE_PER: 0
TOF_WEIXIN_SENDER: 'x-ashin'
ENV: 'production'
CELERYD_TASK_SOFT_TIME_LIMIT: 30
DEBUG: False
PREFERRED_URL_SCHEME: 'http'
EXPLAIN_TEMPLATE_LOADING: False
CELERY_RESULT_BACKEND: u'sentinel://:********@10.18.7.1:26379/1;sentinel://:[email protected]:26379/1;sentinel://:[email protected]:26379/1'
CACHED_CALL: False
FLATPAGES_AUTO_RELOAD: False
MAX_CONTENT_LENGTH: None
REQUEST_ID_KEY: u'********'
NOTIFY_MODULE: 'tof'
JSONIFY_PRETTYPRINT_REGULAR: False
LOG_FUNC_CALL: True
PERMANENT_SESSION_LIFETIME: datetime.timedelta(31)
TOF_EMAIL_SENDER: '[email protected]'
REDIS_CLUSTER: {
    }
TRAP_BAD_REQUEST_ERRORS: None
JSON_SORT_KEYS: u'********'
TRAP_HTTP_EXCEPTIONS: False
SESSION_COOKIE_PATH: None
SEND_FILE_MAX_AGE_DEFAULT: datetime.timedelta(0, 43200)
SPLIT_LOGFILE_BY_LEVEL: False
PRESERVE_CONTEXT_ON_EXCEPTION: None
CELERY_RESULT_BACKEND_TRANSPORT_OPTIONS: {
    'master_name': 'redis-master'}
LOG_IN_FILE: False

As shown in this report, some passwords are not replaced with *.

For what it's worth, I am seeing this as well with a clean project using gevent with RabbitMQ. A couple of minutes after starting the celery workers, we receive a connection reset and no tasks are consumed thereafter.

https://github.com/sihrc/celery-connection-reset

Still having the same issue so far (Celery 4.2). It was resolved by downgrading Celery to 4.1, but I don't know why this error occurred.

Could you try installing celery from the master branch, with all its dependencies also from master, and see what happens?

Still getting this error with Celery 4.2.2.

@auvipy Thanks, it works!

@yuda110 do you know what changes to your dependencies resolved the issue?

We are getting this ConnectionReset issue, and we are using Celery 4.2.1 with the following versions pinned which are compatible with the celery requirements:

billiard==3.5.0.4 # Celery needs billiard. There's a bug in 3.5.0.5
kombu==4.2.2-post1 # Celery needs kombu >= 4.2.0, < 5.0
redis==2.10.6

@charlescapps
Oh, I forgot to erase that answer ☹️☹️ It seemed that the problem was solved after upgrading the version (to 4.2.1), but then I faced an unknown problem again. Eventually I had to downgrade to version 4.0.

I downgraded to 4.1 and it fixed the error. Haven't tried 4.3 yet though.

This error happens pretty rarely for us, and it turns out it's a chained exception that starts with a ConnectionReset error from the redis client. I'm going to just enable retries when a kombu.exceptions.OperationalError is thrown, because the Celery changelog indicates this is a retry-able error.

Just wanted to say that the issue still persists in 4.3.0 when using RabbitMQ. Somehow moving to Redis fixed the issue.

We resolved this by doing retries with exponential backoff whenever a kombu.exceptions.OperationalError is thrown. Those are meant to be retried, according to the docs. The issue happens very rarely for us, so retries are a good solution. We're on 4.2.1.
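
For reference, a minimal sketch of that retry-with-backoff approach could look like the following; the helper name and backoff values are illustrative, not taken from the commenter's code:

import time

from kombu.exceptions import OperationalError


def send_with_retry(task, *args, max_retries=3, base_delay=0.5, **kwargs):
    # Call task.delay(), retrying on kombu's OperationalError with
    # exponential backoff (0.5 s, 1 s, 2 s, ...) before giving up.
    for attempt in range(max_retries + 1):
        try:
            return task.delay(*args, **kwargs)
        except OperationalError:
            if attempt == max_retries:
                raise  # out of retries, surface the error
            time.sleep(base_delay * (2 ** attempt))

Calling send_with_retry(my_task, some_arg) instead of my_task.delay(some_arg) then retries transient broker/backend disconnects, where my_task stands for any Celery task.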

Hi,

I am using rabbitmq as broker and backend and I am having the same issue.

Anyone has a solution?

Thanks in advance.

Same issue here. This is 100% reproducible for me. For some reason the socket to the broker dies after what seems to be the heartbeat interval.

report:

software -> celery:4.3.0 (rhubarb) kombu:4.5.0 py:3.6.7
            billiard:3.6.0.0 py-amqp:2.4.2
platform -> system:Linux arch:64bit, ELF
            kernel version:4.18.0-20-generic imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:amqp results:rpc:///

broker_url: 'amqp://guest:********@localhost:5672//'
result_backend: 'rpc:///'
result_persistent: False
task_default_queue: 'something something'
result_expires: 3600
task_ignore_result: False
task_serializer: 'json'
result_serializer: 'json'
accept_content: ['json']
timezone: 'Europe/Berlin'
enable_utc: True

I have to say that my problems started when I upgraded to Erlang 22.0. But that may as well be coincidental.

Can you suggest any fix? If you can, it will be included in 4.4.0rc2.

I can confirm this behavior on 4.3.0 with the gevent worker as well. Switching from gevent to prefork seems to resolve it. I tried downgrading to 4.1.1, but that doesn't seem to work on Python 3.7 because it requires an older version of gevent (1.2.2) that won't even compile on Python 3.7. I noticed that when the issue starts, the RabbitMQ logs show this message:

missed heartbeats from client, timeout: 60s

Interestingly, despite the heartbeat failing, the worker still picks up tasks and processes them fine. It's just that celery inspect commands all time out until the worker is restarted. Flower still shows information on the dashboard for the worker, but clicking on the worker itself returns a 404 Not Found error, and flower logs failures related to celery inspect commands:

monitor_1    | 2019-08-27 17:39:05.483286 [warning  ] 404 GET /worker/celery@38245f8fef62 (172.20.0.1): Unknown worker 'celery@38245f8fef62' [tornado.general] 
monitor_1    | 2019-08-27 17:39:24.608962 [warning  ] 'stats' inspect method failed [flower.api.control] 
monitor_1    | 2019-08-27 17:39:24.609429 [warning  ] 'active_queues' inspect method failed [flower.api.control] 
monitor_1    | 2019-08-27 17:39:24.609847 [warning  ] 'registered' inspect method failed [flower.api.control] 
monitor_1    | 2019-08-27 17:39:24.610221 [warning  ] 'scheduled' inspect method failed [flower.api.control] 
monitor_1    | 2019-08-27 17:39:24.610905 [warning  ] 'active' inspect method failed [flower.api.control] 
monitor_1    | 2019-08-27 17:39:24.611369 [warning  ] 'reserved' inspect method failed [flower.api.control] 
monitor_1    | 2019-08-27 17:39:24.611890 [warning  ] 'revoked' inspect method failed [flower.api.control] 
monitor_1    | 2019-08-27 17:39:24.612512 [warning  ] 'conf' inspect method failed [flower.api.control] 
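
As a side note, the dead-control-connection state described above can be checked programmatically with the remote-control ping; a minimal sketch, assuming an importable Celery application instance named app and an arbitrary 5-second timeout:

from myproj.celery import app  # hypothetical import path to your Celery app

# Ask all workers to reply within 5 seconds; an empty reply list suggests
# the control connection to the broker is gone (the same state in which
# `celery inspect` commands time out).
replies = app.control.ping(timeout=5.0)
if not replies:
    print('no pong received - workers unreachable over the control channel')
else:
    print('workers replying:', replies)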

Could someone verify this on celery 4.4.0rc3 + kombu 4.6.3?

Will do. FYI, celery 4.4.0rc3 requires kombu 4.6.4:

celery 4.4.0rc3 has requirement kombu<5.0,>=4.6.4, but you'll have kombu 4.6.3 which is incompatible.

Ok, 4.4.0rc3 does appear to resolve this issue. I've left it going for more than 5 minutes with no heartbeat errors using the gevent worker.

kombu 4.6.3 is also compatible

If that's the case, you might want to update the requirements file on the celery project.

But what did we change?

I would also love some insight into what was changed that caused this to be closed, or a link to a PR/code/etc. We're affected, and mitigating it (prefork, RabbitMQ, celery 4.3) by disabling heartbeats, which is suboptimal.

@auvipy Ping?

Ok, 4.4.0rc3 does appear to resolve this issue. I've left it going for more than 5 minutes with no heartbeat errors using the gevent worker.

The issue was closed based on this feedback.

@auvipy It seems there are multiple issues that lead to similar errors. It would be great if you tried to reproduce the bug locally, possibly with some older Celery versions, before resolving the issue.

You are advised to try the master branch.

I'm suggesting to reproduce the bug with one of the earlier versions (e.g. 4.1, 4.2, 4.3) and to make sure that upgrading to 4.4 alone fixed the issue. Closing the bug without your own verification - based on feedback of a single person - seems a bit hasty.

Since you are facing the issue, you should be the first to verify, as you suggested. @czyzby

you should be the first to verify as you suggested

_"Should"?_ ;) If there's anyone that _should_ care about the quality of the project, it's the official maintainers. I'm grateful that you're offering your software for free, but you cannot realistically expect all of your users to contribute to the project.

Anyway, I don't have the configuration that caused the problem anymore, but I think the issue persisted even in case of empty queues with no tasks defined, so it might be easy to reproduce. I've since resolved the issue with a workaround by migrating to Redis, as our team could easily change the technologies at the time. And, to be completely honest, we're considering a move to RQ due to other issues with Celery.

4.4.0rc4 also appears to solve this issue.

Can anyone else check whether it's fixed by celery==4.4.0rc4?

@auvipy I was on 4.3.0 with the occasional Connection reset. No more issues with 4.4.0rc4 for me.

Can anyone else check whether it's fixed by celery==4.4.0rc4?

I got the issue on 4.3.0 quite often; with 4.4.0rc4 the problem occurs far less often, but it still occurs from time to time.
I'm using redis-server 5.0.6 and the Python redis client 3.3.11 with ~14 periodic tasks (every 30 seconds).
So I'd ask you to reopen the issue.
Thanks!

Indeed, the issue still occurs with the default settings. However, as mentioned in other threads, setting broker_heartbeat = 0 in celeryconfig.py seems to help.
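
For reference, a minimal sketch of that workaround in celeryconfig.py; the broker URL is just a placeholder, and note that disabling heartbeats also disables dead-connection detection, so this is a mitigation rather than a fix:

# celeryconfig.py
broker_url = 'amqp://guest:guest@localhost:5672//'  # placeholder broker URL
broker_heartbeat = 0  # disable AMQP heartbeats entirely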

Even after upgrading to celery 4.4.0rc4 and adding CELERY_BROKER_HEARTBEAT = 0 in celery.py, nothing much seems to change for me; I'm still getting the error.

The issue was still not solved after downgrading from celery 4.2.0 to 4.1.0.

The following versions are used in our project:
billiard==3.5.0.2, kombu==4.1.0, celery==4.1.0, amqp==2.4.2

Please suggest a fix.

We started seeing this, or something that very much resembles this issue.

software -> celery:4.3.0 (rhubarb) kombu:4.5.0 py:2.7.12
billiard:3.6.0.0 redis:3.2.1

It started occurring a few times a day, with no real changes.
For our next release, we will try to upgrade to the most recent versions, celery 4.4.0, redis etc. and report back.

Happens for me with concurrency=1000 (gevent) and Redis as broker:
celery==4.4.0 (cliffs)
kombu==4.6.7
billiard==3.6.2.0
(py-)redis==3.4.1

Redis server version=5.0.7
Python 3.7.3

https://sentry.io/share/issue/85f87e60a7c441198c082b9ebf051693/

  • 7 tasks are set to run every 10 seconds.
  • The error occurs only in celery beat, and very rarely: fewer than 3 occurrences per hour.

Tags

  • logger: celery.beat
  • runtime: CPython 3.7.5

Environment

  • Linux-4.15.0-1060-aws-x86_64-with-Ubuntu-18.04-bionic
  • Python 3.7.5 (default, Nov 7 2019, 10:50:52) [GCC 8.3.0]
  • Redis server v=4.0.9 sha=00000000:0 malloc=jemalloc-3.6.0 bits=64 build=9435c3c2879311f3
  • celery==4.4.0
  • billiard==3.6.1.0
  • kombu==4.6.7
  • redis==3.3.11

This happens to me when I connect to a TCP server using asyncio's open_connection. Fifteen minutes after I connect to the remote server within our VPN, it disconnects me. I suspect this is because the connection is idle. The same thing doesn't happen when I'm connecting from within the remote server. This appears to be network-related.

I solved my case! Uff.

It was not a Celery problem. I've tried a few versions of it, including 4.[234].0, and a few versions of the Python interfaces to Redis, and I always had, more or less, the same failure ratio (about 2‰ over 0.5 million requests).

The solution was a Redis server reconfiguration, i.e. disabling client-output-buffer-limit for all classes. According to the Redis documentation:

A client is immediately disconnected once the hard limit is reached, or if the soft limit is reached and remains reached for the specified number of seconds (continuously).

Both the hard or the soft limit can be disabled by setting them to zero.

I hope it will help you all as well. Or maybe you'll improve on my solution.

This also worked for me, where none of the other suggestions did. Thank you!

I can also confirm that it resolved the issue for my setup! Thanks for putting time into this @rganowski!

Great if this fixes the issue, but before I start removing defaults from the configuration file, I'd like to know what that setting does and why it's part of the default config.

@Moulde I'm not quite sure what you mean by talking about deleting the defaults from the configuration file. Which configuration file? Do you mean changing the default Redis server settings that I pointed out?

I would also like to know why such defaults exist. Was that a conscious choice? If so, what is the risk of giving them up? But, to be honest, I'm not going to check that out. I had a 10 MD task and had to add 3 MD for free.

Nobody said that this fixes the issue. I said that I found a solution for my case. Two other folks said it also works for them. I read that with pleasure. But I read your words as "prove it". Am I wrong?

Please test it on your application and let us know if it works for you. If you resolve any other doubts, don't forget to share them with others.

@rganowski It sounds like we agree, and yes, I do see what you mean about my wording. It was not meant like that, but more to add a little healthy skepticism before modifying system defaults, and maybe a bit of critique of the documentation, as some info about why that setting is needed would be great, besides the "what it does" part that is in the file :)
And thanks for the time you put into this, I would not have figured that out on my own.

The issue is closed because there was no error in the Celery code, but the problem behind the problem is not solved. I think an adequate warning should be added to the Redis backend settings documentation.

Googling for 'client-output-buffer-limit' turns up many interesting articles. One, now already 6 years old, has a really beautiful title: The Replication Buffer - How to Avoid Devops Headaches. There we can read:

Before increasing the size of the replication buffers you must make sure you have enough memory on your machine.

In another article, Client Buffers - Before taking Redis in production check this!, the author says:

By default normal clients(normal execute commands) are not limited because they don’t receive data without asking (in a push way), but just after a request, so only asynchronous clients may create a scenario where data is requested faster than it can read.

Isn’t that our case?

For me, at least until now, the reconfiguration turned out to be salvation. No new '104' errors, even under a really heavy and sudden load.

@rganowski @Moulde @CM2Walki

Hi, I may sound very naive, but can you please tell me where I can make the necessary modification to disable client-output-buffer-limit for all classes? I am also getting the same error, but somehow I am not able to interpret your answer. Can you please elaborate so that I can make the necessary changes? Thank you!

I think an adequate warning should be added to the Redis backend settings documentation.

A doc improvement PR would be really appreciated.

@girijesh97 @auvipy

In redis.conf

client-output-buffer-limit normal 0 0 0
client-output-buffer-limit replica 0 0 0
client-output-buffer-limit pubsub 0 0 0

@rganowski Sir, again I may sound very naive, but I have a Django application in which I am using Celery (version 4.4.2) and facing the connection error issue. Can you please help me locate or create this redis.conf file? Do I need to create this file in my application, or is it available in some package?

If your case is the same one we were talking about, your Celery is using a Redis server as the results backend. The file I've mentioned is the standard Redis server configuration file. That's why this is in fact not a Celery problem but a side effect.
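
For anyone who cannot edit redis.conf directly, the same limits can also be inspected and changed at runtime; a minimal sketch using redis-py, where the connection details are placeholders and CONFIG SET changes are lost on a Redis restart unless they are also written to redis.conf:

import redis

r = redis.Redis(host='localhost', port=6379, password=None)  # placeholder connection details

# Show the current limits, then disable them for the 'normal' and 'pubsub'
# client classes (a Celery Redis results backend uses both regular commands
# and pub/sub).
print(r.config_get('client-output-buffer-limit'))
r.config_set('client-output-buffer-limit', 'normal 0 0 0 pubsub 0 0 0')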

@auvipy Is there a fix for RabbitMQ as broker and result backend for the same issue? Seeing this in 4.4 as well on long-running tasks. The above fix is for the Redis backend only.

Also seeing this problem occur intermittently with RabbitMQ and celery 4.2.0. Even built-in retry handling would be better than forcing that onto users of the package.

I'm also experiencing it. I'm on Celery 4.3.0 and RabbitMQ 3.3.5.
