gunicorn CRITICAL WORKER TIMEOUT

Created on 21 Jan 2017  ·  18 Comments  ·  Source: benoitc/gunicorn

I am running gunicorn with the following settings:
gunicorn --worker-class gevent --timeout 30 --graceful-timeout 20 --max-requests-jitter 2000 --max-requests 1500 -w 50 --log-level DEBUG --capture-output --bind 0.0.0.0:5000 run:app
and I am seeing [CRITICAL] WORKER TIMEOUT in all but 3 workers. After a while, gunicorn can't spawn any more workers, or is at least very slow at spawning them. This leaves the server unreachable and every request failing.

I reduced the number of workers to 3 and gave each worker 2 threads, and now I'm not seeing this issue anymore.

I can't get a stack trace from the timeouts, but it seems like gunicorn can't handle more than a certain number of workers?

Labels: unconfirmed, Bugs

Most helpful comment

Anyone who still has this problem: please check the resource availability for the application, in addition to increasing the timeout and changing the worker class type.

I was having this problem when I tried to deploy my application using Docker Swarm and realized I had set the resource limits too low for the application. Increasing the resources solved the problem for me:

deploy:
  replicas: 5
  resources:
    limits:
      cpus: "0.1"
      memory: 50M
  restart_policy:
    condition: on-failure

I think this is not a bug, just a matter of how we configure our apps.
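
For reference, a hedged sketch of what raising those limits might look like, in the same Compose/Swarm syntax; the cpus and memory values below are purely illustrative and have to be sized to your own application:

deploy:
  replicas: 5
  resources:
    limits:
      cpus: "1.0"    # illustrative: enough CPU for the workers to keep heartbeating
      memory: 512M   # illustrative: headroom for the app plus gunicorn itself
  restart_policy:
    condition: on-failure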

All 18 comments

When a worker times out it means that it didn't notify the arbiter in time that it was alive. Do you have any task executed during a request that could take longer than the timeout?

@jseaidou bump.

Sorry for the late reply @benoitc. The issue I'm seeing is actually the non-active threads doing this. My active threads don't time out; the non-active ones, under less load, give critical timeout errors rather than just gracefully timing out. I switched from gevent to tornado and that seems to have fixed the hang issue, but I'm still seeing 3 workers consistently giving the critical timeout every 30 seconds. If it's a normal timeout it shouldn't be a critical error.

I am facing exactly the same issue.
Gunicorn 19.7.1
gevent 1.2.1
Python 3.5.3
Running in Docker, image based on official "python:3.5"

@jseaidou it's an abnormal timeout in the sense that the arbiter reacts to it. It's critical because it should normally not happen.

Most probably one of your workers is doing a blocking operation that prevents the gunicorn worker from notifying the arbiter. If you have a long operation, make sure to trigger the gevent scheduler from time to time by sleeping or such, or anything that calls back into the tornado scheduler too.
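
As a concrete illustration of that advice, here is a minimal sketch (not from the thread; the function and the numbers are made up) of yielding to the gevent hub inside a long loop so the worker can keep sending its heartbeat:

import gevent

def crunch(items):
    # Illustrative long-running handler work; the real computation goes
    # where item * item is.
    total = 0
    for i, item in enumerate(items):
        total += item * item
        if i % 1000 == 0:
            gevent.sleep(0)  # yield to the gevent hub so the worker can notify the arbiter
    return total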

How can I reproduce the issue?

@saabeilin same ^^

I'm seeing the same thing: workers are timing out even when serving no requests. All I've done is launch my container on AWS ECS.

[2017-06-27 20:41:56 +0000] [1] [DEBUG] Current configuration:
proxy_protocol: False
worker_connections: 1000
statsd_host: None
max_requests_jitter: 0
post_fork: <function post_fork at 0x7f6bbc3f1938>
errorlog: -
enable_stdio_inheritance: False
worker_class: sync
ssl_version: 2
suppress_ragged_eofs: True
syslog: False
syslog_facility: user
when_ready: <function when_ready at 0x7f6bbc3f1668>
pre_fork: <function pre_fork at 0x7f6bbc3f17d0>
cert_reqs: 0
preload_app: False
keepalive: 2
accesslog: -
group: 0
graceful_timeout: 30
do_handshake_on_connect: False
spew: False
workers: 4
proc_name: None
sendfile: None
pidfile: None
umask: 0
on_reload: <function on_reload at 0x7f6bbc3f1500>
pre_exec: <function pre_exec at 0x7f6bbc3f1ed8>
worker_tmp_dir: None
limit_request_fields: 100
pythonpath: None
on_exit: <function on_exit at 0x7f6bbc3f7758>
config: None
logconfig: None
check_config: False
statsd_prefix:
secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
reload_engine: auto
proxy_allow_ips: ['127.0.0.1']
pre_request: <function pre_request at 0x7f6bbc3f70c8>
post_request: <function post_request at 0x7f6bbc3f71b8>
forwarded_allow_ips: ['127.0.0.1']
worker_int: <function worker_int at 0x7f6bbc3f1c08>
raw_paste_global_conf: []
threads: 1
max_requests: 0
chdir: /opt/app
daemon: False
user: 0
limit_request_line: 4094
access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
certfile: None
on_starting: <function on_starting at 0x7f6bbc3f1398>
post_worker_init: <function post_worker_init at 0x7f6bbc3f1aa0>
child_exit: <function child_exit at 0x7f6bbc3f7320>
worker_exit: <function worker_exit at 0x7f6bbc3f7488>
paste: None
default_proc_name: app:app
syslog_addr: udp://localhost:514
syslog_prefix: None
ciphers: TLSv1
worker_abort: <function worker_abort at 0x7f6bbc3f1d70>
loglevel: DEBUG
bind: ['0.0.0.0:5005']
raw_env: []
initgroups: False
capture_output: False
reload: False
limit_request_field_size: 8190
nworkers_changed: <function nworkers_changed at 0x7f6bbc3f75f0>
timeout: 30
keyfile: None
ca_certs: None
tmp_upload_dir: None
backlog: 2048
logger_class: gunicorn.glogging.Logger
[2017-06-27 20:41:56 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2017-06-27 20:41:56 +0000] [1] [DEBUG] Arbiter booted
[2017-06-27 20:41:56 +0000] [1] [INFO] Listening at: http://0.0.0.0:5005 (1)
[2017-06-27 20:41:56 +0000] [1] [INFO] Using worker: sync
[2017-06-27 20:41:56 +0000] [8] [INFO] Booting worker with pid: 8
[2017-06-27 20:41:57 +0000] [9] [INFO] Booting worker with pid: 9
[2017-06-27 20:41:57 +0000] [10] [INFO] Booting worker with pid: 10
[2017-06-27 20:41:57 +0000] [12] [INFO] Booting worker with pid: 12
[2017-06-27 20:41:57 +0000] [1] [DEBUG] 4 workers
[2017-06-27 20:42:15 +0000] [10] [DEBUG] Closing connection.
[2017-06-27 20:42:15 +0000] [8] [DEBUG] Closing connection.
[2017-06-27 20:42:45 +0000] [8] [DEBUG] Closing connection.
[2017-06-27 20:42:45 +0000] [10] [DEBUG] Closing connection.
[2017-06-27 20:42:46 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:9)
[2017-06-27 20:42:46 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:12)
[2017-06-27 20:42:46 +0000] [9] [INFO] Worker exiting (pid: 9)
[2017-06-27 20:42:46 +0000] [12] [INFO] Worker exiting (pid: 12)
[2017-06-27 20:42:46 +0000] [1] [DEBUG] 3 workers
[2017-06-27 20:42:46 +0000] [24] [INFO] Booting worker with pid: 24
[2017-06-27 20:42:46 +0000] [25] [INFO] Booting worker with pid: 25
[2017-06-27 20:42:46 +0000] [1] [DEBUG] 4 workers

This does not occur when running locally. :-/

Looks like switching to gevent workers has resolved it for me. ¯\_(ツ)_/¯

Duplicate of #1194, I think.

I’ve seen this happen repeatedly lately, and for me it seems connected with putting the laptop to sleep. When I open the lid, a bunch of these messages show. Not sure this helps, but I thought I’d mention it…

gunicorn --daemon --workers 2 --timeout 120 --bind 127.0.0.1:4000 --pid /var/run/mshc_admin.pid --user danilovskoe --group danilovskoe --chdir /home/danilovskoe/mshc2/src/flask/ --env MSHC_PRODUCTION=/etc/monit/mshc.config.py admin_gunicorn:app

timeout 30 seconds
GET request

ubuntu 16.04
flask 0.12.2
Python 3.6.3 (default, Oct 4 2017, 02:55:45)
[GCC 5.4.0 20160609] on linux
gunicorn (version 19.7.1)

I ran into the same trouble.

Just after starting, the app works, but once it serves a request, [CRITICAL] WORKER TIMEOUT is triggered. For example:

[2018-01-02 16:38:03 +0800] [24355] [INFO] Starting gunicorn 19.7.1
[2018-01-02 16:38:03 +0800] [24355] [DEBUG] Arbiter booted
[2018-01-02 16:38:03 +0800] [24355] [INFO] Listening at: http://0.0.0.0:8080 (24355)
[2018-01-02 16:38:03 +0800] [24355] [INFO] Using worker: gevent
[2018-01-02 16:38:03 +0800] [24358] [INFO] Booting worker with pid: 24358
[2018-01-02 16:38:03 +0800] [24355] [DEBUG] 1 workers

[2018-01-02 16:38:10 +0800] [24358] [DEBUG] GET /v1/bj2/CC/uuid
[2018-01-02 16:38:10 +0800] [24358] [DEBUG] Closing connection. 
[2018-01-02 16:38:41 +0800] [24355] [CRITICAL] WORKER TIMEOUT (pid:24358)
[2018-01-02 16:38:41 +0800] [24358] [INFO] Worker exiting (pid: 24358)
[2018-01-02 16:38:41 +0800] [24381] [INFO] Booting worker with pid: 24381

[2018-01-02 16:48:51 +0800] [24355] [CRITICAL] WORKER TIMEOUT (pid:24381)
[2018-01-02 16:48:51 +0800] [24381] [INFO] Worker exiting (pid: 24381)
[2018-01-02 16:48:51 +0800] [24703] [INFO] Booting worker with pid: 24703
[2018-01-02 16:48:51 +0800] [24703] [INFO] worker pid 24703 notify
[2018-01-02 16:48:51 +0800] [24703] [DEBUG] GET /v1/bj2/CC/uuid
[2018-01-02 16:48:51 +0800] [24703] [DEBUG] Closing connection. 
CentOS: 6.7
Python: 3.6.3
Gevent: 1.2.2
Greenlet: 0.4.9
Gunicorn: 19.7.1

RUN CMD: gunicorn --worker-class gevent --log-level debug --bind 0.0.0.0:8080 app

When I switched the worker class to eventlet, that is,
gunicorn --worker-class eventlet --log-level debug --bind 0.0.0.0:8080 app,
it's OK.

Note: my app runs on a physical host, not a virtual host or a cloud host.


Update:

So I guess it's a question of gevent or the gevent worker.

There are reports on this issue of gevent solving the problem and of gevent causing the problem. I cannot identify a root cause here. Some of the reports may be the same as #1194, but others may not be.

If anyone can share a minimal case to reproduce, that would help.

I'm not sure it's definitely the same issue, but I can reproduce this 100% of the time using Virtualbox with the following setup:

Host: Windows 10
Guest: Ubuntu 16.04
Gunicorn: 19.7.1

I forward TCP:8000 between host and guest over the default NAT connection. Using a sync worker, any requests on the host for localhost:8000 cause these errors to appear in the log, but if I make the same requests on the guest, the log is clear. Switching to --worker-class eventlet removes the trace.

I appreciate Virtualbox is a completely different dimension, but it does sound very similar to what is being described above, and is consistently reproducible (for me at least).

I am seeing this happen with slow uploads. During an upload (to a Django site), if the worker timeout is hit, the upload dies.

@lordmauve if you are using the sync worker that is expected. Long requests will block the worker and eventually the arbiter will kill it. You can use a different worker type if you expect long requests to succeed.
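
For example (a sketch only; app:app, the worker count and the timeout are placeholders to adapt to your setup), switching to an async worker class with a more generous timeout could look like:
gunicorn --worker-class gevent --workers 4 --timeout 120 app:app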

For anyone reading this thread, please re-open with a minimal case to reproduce. I cannot see any clean investigation to pursue here. For the case of AWS / ECS, I'm still leaving #1194 open until I can test the configurations I listed (https://github.com/benoitc/gunicorn/issues/1194#issuecomment-371250650).

Anyone who still has this problem: please check the resource availability for the application, in addition to increasing the timeout and changing the worker class type.

I was having this problem when I tried to deploy my application using Docker Swarm and realized I had set the resource limits too low for the application. Increasing the resources solved the problem for me:

deploy:
  replicas: 5
  resources:
    limits:
      cpus: "0.1"
      memory: 50M
  restart_policy:
    condition: on-failure

I think this is not a bug, just a matter of how we configure our apps.

Thanks, this makes sense.
For me this is happening when serving a large file that was previously downloaded from cloud storage.
Upon request, the file is retrieved to the local disk from cloud storage, then it is stream-decrypted to the client. The download from cloud storage works fine, even if it takes 10 minutes or longer.
Once the worker starts stream-decrypting the file from disk to the client, it dies after a few hundred MB, probably because it times out while being busy with this blocking operation.
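
Tying this back to the earlier advice about yielding to the scheduler, here is a minimal sketch (assumed setup, not the poster's actual code) of streaming a large file in chunks while periodically handing control back to the gevent hub so the worker can still notify the arbiter:

import gevent

CHUNK_SIZE = 64 * 1024  # illustrative chunk size

def stream_file(path):
    # Generator suitable for a chunked/streaming WSGI response body.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            # any per-chunk work (e.g. decryption) would go here
            yield chunk
            gevent.sleep(0)  # let the gevent hub run so the heartbeat gets sent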
