Gunicorn: CRITICAL WORKER TIMEOUT when running Flask app

Created on 5 Jun 2018  ·  82Comments  ·  Source: benoitc/gunicorn

It seems there have been already several reports related to [CRITICAL] WORKER TIMEOUT error but it just keeps popping up. Here is my issue.

I'm running this Flask hello world application.

from flask import Flask
application = Flask(__name__)

@application.route('/')
def hello_world():
    return 'Hello, World!'

The gunicorn command is this one:

gunicorn -b 0.0.0.0:5000 --log-level=debug hello

And this is the console output:

[2018-06-05 14:56:21 +0200] [11229] [INFO] Starting gunicorn 19.8.1
[2018-06-05 14:56:21 +0200] [11229] [DEBUG] Arbiter booted
[2018-06-05 14:56:21 +0200] [11229] [INFO] Listening at: http://0.0.0.0:5000 (11229)
[2018-06-05 14:56:21 +0200] [11229] [INFO] Using worker: sync
[2018-06-05 14:56:21 +0200] [11232] [INFO] Booting worker with pid: 11232
[2018-06-05 14:56:21 +0200] [11229] [DEBUG] 1 workers
[2018-06-05 14:56:32 +0200] [11232] [DEBUG] GET /
[2018-06-05 14:56:57 +0200] [11232] [DEBUG] Closing connection. 
[2018-06-05 14:57:16 +0200] [11232] [DEBUG] GET /
[2018-06-05 14:57:47 +0200] [11229] [CRITICAL] WORKER TIMEOUT (pid:11232)
[2018-06-05 14:57:47 +0200] [11232] [INFO] Worker exiting (pid: 11232)
[2018-06-05 14:57:47 +0200] [11324] [INFO] Booting worker with pid: 11324

Can you please clearly explain why do I get the error and if it's expected in this example? How do I fix it or if it's an expected behavior why critical error then?

Investigation unconfirmed

Most helpful comment

Using gunicorn with gevent didn't fixed the bug.

All 82 comments

The error is not expected, but there is nothing from your example that shows why it happens. Tell us more about your environment.

  • What client is used to connect to Gunicorn?
  • What reverse proxy, if any, is used to connect to Gunicorn?

Tell us more about your environment.

  • What client is used to connect to Gunicorn?
    => I just send request from Chromium: http://localhost:5000/
  • What reverse proxy, if any, is used to connect to Gunicorn?
    => No proxy, just Gunicorn + Flask

I've just reproduced the problem on a completely fresh setup, here are the steps:

mkdir gunicorn
cd gunicorn/
pipenv --python 3.6
pipenv install flask
pipenv install gunicorn
vim hello.py
pipenv shell
gunicorn -b 0.0.0.0:5000 --log-level=debug hello

The hello.py is exactly the same Flask application as I posted in the initial report.
Below is the complete log.

~$ gunicorn -b 0.0.0.0:5000 --log-level=debug hello
[2018-06-06 09:16:21 +0200] [19829] [DEBUG] Current configuration:
  config: None
  bind: ['0.0.0.0:5000']
  backlog: 2048
  workers: 1
  worker_class: sync
  threads: 1
  worker_connections: 1000
  max_requests: 0
  max_requests_jitter: 0
  timeout: 30
  graceful_timeout: 30
  keepalive: 2
  limit_request_line: 4094
  limit_request_fields: 100
  limit_request_field_size: 8190
  reload: False
  reload_engine: auto
  reload_extra_files: []
  spew: False
  check_config: False
  preload_app: False
  sendfile: None
  reuse_port: False
  chdir: /home/dima/work/gunicorn
  daemon: False
  raw_env: []
  pidfile: None
  worker_tmp_dir: None
  user: 1000
  group: 985
  umask: 0
  initgroups: False
  tmp_upload_dir: None
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  forwarded_allow_ips: ['127.0.0.1']
  accesslog: None
  disable_redirect_access_to_syslog: False
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  errorlog: -
  loglevel: debug
  capture_output: False
  logger_class: gunicorn.glogging.Logger
  logconfig: None
  logconfig_dict: {}
  syslog_addr: udp://localhost:514
  syslog: False
  syslog_prefix: None
  syslog_facility: user
  enable_stdio_inheritance: False
  statsd_host: None
  statsd_prefix: 
  proc_name: None
  default_proc_name: hello
  pythonpath: None
  paste: None
  on_starting: <function OnStarting.on_starting at 0x7f9757112d08>
  on_reload: <function OnReload.on_reload at 0x7f9757112e18>
  when_ready: <function WhenReady.when_ready at 0x7f9757112f28>
  pre_fork: <function Prefork.pre_fork at 0x7f9756c230d0>
  post_fork: <function Postfork.post_fork at 0x7f9756c231e0>
  post_worker_init: <function PostWorkerInit.post_worker_init at 0x7f9756c232f0>
  worker_int: <function WorkerInt.worker_int at 0x7f9756c23400>
  worker_abort: <function WorkerAbort.worker_abort at 0x7f9756c23510>
  pre_exec: <function PreExec.pre_exec at 0x7f9756c23620>
  pre_request: <function PreRequest.pre_request at 0x7f9756c23730>
  post_request: <function PostRequest.post_request at 0x7f9756c237b8>
  child_exit: <function ChildExit.child_exit at 0x7f9756c238c8>
  worker_exit: <function WorkerExit.worker_exit at 0x7f9756c239d8>
  nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7f9756c23ae8>
  on_exit: <function OnExit.on_exit at 0x7f9756c23bf8>
  proxy_protocol: False
  proxy_allow_ips: ['127.0.0.1']
  keyfile: None
  certfile: None
  ssl_version: 2
  cert_reqs: 0
  ca_certs: None
  suppress_ragged_eofs: True
  do_handshake_on_connect: False
  ciphers: TLSv1
  raw_paste_global_conf: []
[2018-06-06 09:16:21 +0200] [19829] [INFO] Starting gunicorn 19.8.1
[2018-06-06 09:16:21 +0200] [19829] [DEBUG] Arbiter booted
[2018-06-06 09:16:21 +0200] [19829] [INFO] Listening at: http://0.0.0.0:5000 (19829)
[2018-06-06 09:16:21 +0200] [19829] [INFO] Using worker: sync
[2018-06-06 09:16:21 +0200] [19832] [INFO] Booting worker with pid: 19832
[2018-06-06 09:16:22 +0200] [19829] [DEBUG] 1 workers
[2018-06-06 09:16:48 +0200] [19832] [DEBUG] GET /
[2018-06-06 09:17:19 +0200] [19829] [CRITICAL] WORKER TIMEOUT (pid:19832)
[2018-06-06 09:17:19 +0200] [19832] [INFO] Worker exiting (pid: 19832)
[2018-06-06 09:17:19 +0200] [19872] [INFO] Booting worker with pid: 19872
^C[2018-06-06 09:17:26 +0200] [19829] [INFO] Handling signal: int
[2018-06-06 09:17:26 +0200] [19872] [INFO] Worker exiting (pid: 19872)
[2018-06-06 09:17:26 +0200] [19829] [INFO] Shutting down: Master
~$ pip list
Package      Version
------------ -------
click        6.7    
Flask        1.0.2  
gunicorn     19.8.1 
itsdangerous 0.24   
Jinja2       2.10   
MarkupSafe   1.0    
pip          10.0.1 
setuptools   39.2.0 
Werkzeug     0.14.1 
wheel        0.31.1

@bigunyak I think it's because of the default timeout, your worker has been silent for 30s. http://docs.gunicorn.org/en/stable/settings.html#timeout

From your log,

[2018-06-05 14:57:16 +0200] [11232] [DEBUG] GET /
[2018-06-05 14:57:47 +0200] [11229] [CRITICAL] WORKER TIMEOUT (pid:11232)
[2018-06-06 09:16:48 +0200] [19832] [DEBUG] GET /
[2018-06-06 09:17:19 +0200] [19829] [CRITICAL] WORKER TIMEOUT (pid:19832)

I'm seeing the same thing: workers are timing out even when serving no requests, with sync worker.
To that sense, the critical level log is quite confusing.

Try use Gevent worker could solve this.

To that sense, the critical level log is quite confusing.

Exactly, that was my original question: if it's an expected behavior why critical error then?
Would be also nice to get some background on why workers need to be restarted, maybe this could be added to the Design document.

I'm seeing this as well (reproduced by using examples/test.py with gunicorn test:app -b localhost:9595 --log-level=debug --timeout=5) and I agree that the critical level is a bit confusing. I'd be OK change it to debug level. @benoitc @tilgovi what do you think?

I'm thinking info level might be a little bit better.

I had the same with MSYS2 on Win10 but finally could solved.

in notify() of ...\gunicorn\workers\workertmp.py, os.fchmod is used originally. But it does not work in MSYS. Instead of os.fchmod, I used os.utime. The code is followed. I think it could work for all platform.

    def notify(self):
        try:
            self.spinner = (self.spinner + 1) % 2
            os.fchmod(self._tmp.fileno(), self.spinner)
            if PLATFORM.startswith('MSYS') :
                os.utime(self._tmp.fileno(), None)
        except AttributeError:
            # python < 2.6
            self._tmp.truncate(0)
            os.write(self._tmp.fileno(), b"X")

@berkerpeksag I wouldn't expect that the worker exit because no requests happen. This error should only happen if the worker has been kept busy for a time > to the timeout. So the error is critical. Imo we should improve the documentation to provide more use cases and responses to such errors.

If the error still happen when the worker is not kept busy then there is something else happening and we have probably a bug.

[EDIT]
Same bug for me.
With Django 1.10 / gunicorn 19.6.0 / Python 2.7.15 in python2.7-alpine on Debian 8.8 and stock kernel 3.16, all was working fine.
After updating to Django 1.11 and gunicorn 19.8.1, the workers keep failing at boot with [CRITICAL WORKER TIMEOUT].
Downgrading gunicorn to 19.6.0 doesn't fix the problem.
We updated the host kernel to 4.9.0 (it was scheduled), and the workers successfully booted without errors.
But:

  • we use 4 workers
  • after exactly 4 calls to the Django app, next calls will timeout, showing [CRITICAL WORKER TIMEOUT] in the logs
  • the linux top command shows 4 gunicorn processes stuck with pretty high CPU consumption.

We're going to try gunicorn gevent to see if we'are able to get our app back online.

Using gunicorn with gevent didn't fixed the bug.

any update on this issue ?

It looks like @neocolor identified a real bug under MSYS. It might deserve a separate issue.

@bigunyak what platform are you running under? I have tried to reproduce with the simple example and I cannot do it following exactly the steps outlined above. This agrees with my experience running multiple applications in production on multiple frameworks. The worker notification system has not changed recently, to my knowledge. My platform is Python 3.7 on MacOS 10.13.6, but I run Gunicorn in production with sync workers for several applications on Python 2.7 and 3.6.5 and only see worker timeouts when there are legitimately long requests that block the workers.

For @Tberdy: what happens if you try to set --worker-tmp-dir to somewhere outside the tmpfs filesystem? I am just wondering if maybe alpine or docker or the combination interferes somehow with how workers notify the arbiter.

See also #1388 for Docker related tmpfs issues.

i have this issue to.

I have this issue too, gunicorn sync was working perfectly well until yesternight, ut started reporting, workers timeout [CRITICAL] using gevent seems to solve my issue, but I'd really want to know why this happened .

@timoj58 @cjmash can you provides more detail about the issue ? How are you running gunicorn (in a vm?, options ...), which fileystem, OS? ANthing that could help to reproduce would be very helpful :)

@benoitc I am running gunicorn to start my Django project on kubernetes my gunicorn arguments are --bind=$port --workers=7 --timeout=1200 --log-level=debug --access-logfile - error-logfile -"

the errors i get from the logs

```E [2018-08-09 21:47:56 +0000] [13] [INFO] Booting worker with pid: 13

E [2018-08-09 21:47:56 +0000] [14] [INFO] Booting worker with pid: 14

E [2018-08-09 21:47:56 +0000] [12] [INFO] Booting worker with pid: 12

E [2018-08-09 21:47:56 +0000] [1] [DEBUG] 7 workers

E [2018-08-09 21:47:56 +0000] [11] [INFO] Booting worker with pid: 11

E [2018-08-09 21:47:55 +0000] [10] [INFO] Booting worker with pid: 10

E [2018-08-09 21:47:55 +0000] [9] [INFO] Booting worker with pid: 9

E [2018-08-09 21:47:55 +0000] [8] [INFO] Booting worker with pid: 8

E [2018-08-09 21:47:55 +0000] [1] [INFO] Using worker: sync

E [2018-08-09 21:47:55 +0000] [1] [INFO] Listening at: http://0.0.0.0:4040 (1)

E [2018-08-09 21:47:55 +0000] [1] [DEBUG] Arbiter booted

E [2018-08-09 21:47:55 +0000] [1] [INFO] Starting gunicorn 19.7.1

E raw_paste_global_conf: []

E ciphers: TLSv1

E do_handshake_on_connect: False

E suppress_ragged_eofs: True

E ca_certs: None

E cert_reqs: 0

E ssl_version: 2

E certfile: None

E keyfile: None

E proxy_allow_ips: ['127.0.0.1']

E proxy_protocol: False

E on_exit:

E nworkers_changed:

E worker_exit:

E child_exit:

E post_request:

E pre_request:

E pre_exec:

E worker_abort:

E worker_int:

E post_worker_init:

E post_fork:

E pre_fork:

E when_ready:

E on_reload:

E on_starting:

E paste: None

E pythonpath: None

E default_proc_name: art.wsgi

E proc_name: None

E statsd_prefix:

E statsd_host: None

E enable_stdio_inheritance: False

E syslog_facility: user

E syslog_prefix: None

E syslog: False

E syslog_addr: udp://localhost:514

E logconfig: None

E logger_class: gunicorn.glogging.Logger

E capture_output: False

E loglevel: debug

E errorlog: -

E access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"

E accesslog: -

E forwarded_allow_ips: ['127.0.0.1']

E secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}

E tmp_upload_dir: None

E initgroups: False

E umask: 0

E group: 0

E user: 0

E worker_tmp_dir: None

E pidfile: None

E raw_env: []

E daemon: False

E chdir: /usr/src/app

E sendfile: None

E preload_app: False

E check_config: False

E spew: False

E reload_engine: auto

E reload: False

E limit_request_field_size: 8190

E limit_request_fields: 100

E limit_request_line: 4094

E keepalive: 2

E graceful_timeout: 30

E timeout: 1200

E max_requests_jitter: 0

E max_requests: 0

E worker_connections: 1000

E threads: 1

E worker_class: sync

E workers: 7

E backlog: 2048

E bind: ['0.0.0.0:4040']

E config: None

E [2018-08-09 21:47:55 +0000] [1] [DEBUG] Current configuration:

I No migrations to apply.

I Running migrations:

I Apply all migrations: admin, auth, contenttypes, core, dashboard, jet, oauth2_provider, sessions

I Operations to perform:

E [2018-08-09 21:47:20 +0000] [13] [INFO] Worker exiting (pid: 13)

E os.path.dirname(os.path.dirname(__file__)), '.env'))

E /usr/src/app/art/wsgi.py:19: UserWarning: Not reading /usr/src/app/.env - it doesn't exist.

E [2018-08-09 21:47:00 +0000] [21] [INFO] Booting worker with pid: 21

E [2018-08-09 21:47:00 +0000] [1] [INFO] Handling signal: term

E [2018-08-09 21:46:35 +0000] [12] [INFO] Booting worker with pid: 12

E [2018-08-09 21:46:34 +0000] [13] [INFO] Booting worker with pid: 13

E [2018-08-09 21:46:34 +0000] [11] [INFO] Booting worker with pid: 11

E [2018-08-09 21:46:34 +0000] [1] [DEBUG] 7 workers

E [2018-08-09 21:46:34 +0000] [10] [INFO] Booting worker with pid: 10

E [2018-08-09 21:46:34 +0000] [9] [INFO] Booting worker with pid: 9

E [2018-08-09 21:46:34 +0000] [8] [INFO] Booting worker with pid: 8

E [2018-08-09 21:46:34 +0000] [7] [INFO] Booting worker with pid: 7

E [2018-08-09 21:46:34 +0000] [1] [INFO] Using worker: sync

E [2018-08-09 21:46:34 +0000] [1] [INFO] Listening at: http://0.0.0.0:4040 (1)

E [2018-08-09 21:46:34 +0000] [1] [DEBUG] Arbiter booted

E [2018-08-09 21:46:34 +0000] [1] [INFO] Starting gunicorn 19.7.1

E raw_paste_global_conf: []

E ciphers: TLSv1

E do_handshake_on_connect: False

E suppress_ragged_eofs: True

E ca_certs: None

E cert_reqs: 0

E ssl_version: 2

E certfile: None

E keyfile: None

E proxy_allow_ips: ['127.0.0.1']

E proxy_protocol: False

E on_exit:

E nworkers_changed:

E worker_exit:

E child_exit:

E post_request:

E pre_request:

E pre_exec:

E worker_abort:

E worker_int:

E post_worker_init:

E post_fork:

E pre_fork:

E when_ready:

E on_reload:

E on_starting:

E paste: None

E pythonpath: None

E default_proc_name: art.wsgi

E proc_name: None

E statsd_prefix:

E statsd_host: None

E enable_stdio_inheritance: False

E syslog_facility: user

E syslog_prefix: None

E syslog: False

E syslog_addr: udp://localhost:514

E logconfig: None

E logger_class: gunicorn.glogging.Logger

E capture_output: False

E loglevel: debug

E errorlog: -

E access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"

E accesslog: -

E forwarded_allow_ips: ['127.0.0.1']

E secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}

E tmp_upload_dir: None

E initgroups: False

E umask: 0

E group: 0

E user: 0

E worker_tmp_dir: None

E pidfile: None

E raw_env: []

E daemon: False

E chdir: /usr/src/app

E sendfile: None

E preload_app: False

E check_config: False

E spew: False

E reload_engine: auto

E reload: False

E limit_request_field_size: 8190

E limit_request_fields: 100

E limit_request_line: 4094

E keepalive: 2

E graceful_timeout: 30

E timeout: 1200

E max_requests_jitter: 0

E max_requests: 0

E worker_connections: 1000

E threads: 1

E worker_class: sync

E workers: 7

E backlog: 2048

E bind: ['0.0.0.0:4040']

E config: None

E [2018-08-09 21:46:34 +0000] [1] [DEBUG] Current configuration:

```

I struggled a bit to reproduce the problem this time but it's still there in the latest gunicorn version 19.9.0.
The steps to reproduce it are exactly the same as I described in this post.
My system is Arch Linux x86_64 GNU/Linux (kernel 4.17.2-1-ARCH), Python 3.6.5
Here is a fresh log trace.

~$ gunicorn -b 0.0.0.0:5000 --log-level=debug hello
[2018-08-10 09:48:40 +0200] [23114] [DEBUG] Current configuration:
  config: None
  bind: ['0.0.0.0:5000']
  backlog: 2048
  workers: 1
  worker_class: sync
  threads: 1
  worker_connections: 1000
  max_requests: 0
  max_requests_jitter: 0
  timeout: 30
  graceful_timeout: 30
  keepalive: 2
  limit_request_line: 4094
  limit_request_fields: 100
  limit_request_field_size: 8190
  reload: False
  reload_engine: auto
  reload_extra_files: []
  spew: False
  check_config: False
  preload_app: False
  sendfile: None
  reuse_port: False
  chdir: /home/dima/lerning/python/modules/gunicorn
  daemon: False
  raw_env: []
  pidfile: None
  worker_tmp_dir: None
  user: 1000
  group: 985
  umask: 0
  initgroups: False
  tmp_upload_dir: None
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  forwarded_allow_ips: ['127.0.0.1']
  accesslog: None
  disable_redirect_access_to_syslog: False
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  errorlog: -
  loglevel: debug
  capture_output: False
  logger_class: gunicorn.glogging.Logger
  logconfig: None
  logconfig_dict: {}
  syslog_addr: udp://localhost:514
  syslog: False
  syslog_prefix: None
  syslog_facility: user
  enable_stdio_inheritance: False
  statsd_host: None
  statsd_prefix: 
  proc_name: None
  default_proc_name: hello
  pythonpath: None
  paste: None
  on_starting: <function OnStarting.on_starting at 0x7f5f8b9dad08>
  on_reload: <function OnReload.on_reload at 0x7f5f8b9dae18>
  when_ready: <function WhenReady.when_ready at 0x7f5f8b9daf28>
  pre_fork: <function Prefork.pre_fork at 0x7f5f8b4ec0d0>
  post_fork: <function Postfork.post_fork at 0x7f5f8b4ec1e0>
  post_worker_init: <function PostWorkerInit.post_worker_init at 0x7f5f8b4ec2f0>
  worker_int: <function WorkerInt.worker_int at 0x7f5f8b4ec400>
  worker_abort: <function WorkerAbort.worker_abort at 0x7f5f8b4ec510>
  pre_exec: <function PreExec.pre_exec at 0x7f5f8b4ec620>
  pre_request: <function PreRequest.pre_request at 0x7f5f8b4ec730>
  post_request: <function PostRequest.post_request at 0x7f5f8b4ec7b8>
  child_exit: <function ChildExit.child_exit at 0x7f5f8b4ec8c8>
  worker_exit: <function WorkerExit.worker_exit at 0x7f5f8b4ec9d8>
  nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7f5f8b4ecae8>
  on_exit: <function OnExit.on_exit at 0x7f5f8b4ecbf8>
  proxy_protocol: False
  proxy_allow_ips: ['127.0.0.1']
  keyfile: None
  certfile: None
  ssl_version: 2
  cert_reqs: 0
  ca_certs: None
  suppress_ragged_eofs: True
  do_handshake_on_connect: False
  ciphers: TLSv1
  raw_paste_global_conf: []
[2018-08-10 09:48:40 +0200] [23114] [INFO] Starting gunicorn 19.9.0
[2018-08-10 09:48:40 +0200] [23114] [DEBUG] Arbiter booted
[2018-08-10 09:48:40 +0200] [23114] [INFO] Listening at: http://0.0.0.0:5000 (23114)
[2018-08-10 09:48:40 +0200] [23114] [INFO] Using worker: sync
[2018-08-10 09:48:40 +0200] [23117] [INFO] Booting worker with pid: 23117
[2018-08-10 09:48:40 +0200] [23114] [DEBUG] 1 workers
[2018-08-10 09:48:45 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:48:54 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:00 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:18 +0200] [23117] [DEBUG] Closing connection. 
[2018-08-10 09:49:18 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:23 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:37 +0200] [23117] [DEBUG] Closing connection. 
[2018-08-10 09:49:37 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:49:50 +0200] [23117] [DEBUG] Closing connection. 
[2018-08-10 09:51:11 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:51:13 +0200] [23117] [DEBUG] GET /
[2018-08-10 09:51:43 +0200] [23114] [CRITICAL] WORKER TIMEOUT (pid:23117)
[2018-08-10 09:51:43 +0200] [23117] [INFO] Worker exiting (pid: 23117)
[2018-08-10 09:51:44 +0200] [23229] [INFO] Booting worker with pid: 23229

As before to test it I was just hitting http://0.0.0.0:5000/ in Chromium.
Below is the content of Pipfile and Pipfile.lock for you to see the exact environment.

  • Pipfile
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
flask = "*"
gunicorn = "*"

[dev-packages]

[requires]
python_version = "3.6"
  • Pipfile.lock
{
    "_meta": {
        "hash": {
            "sha256": "81cb5d5f0b11719d8d9c5ec9cc683fdcf959c652fda256d5552a82d0f459a99c"
        },
        "pipfile-spec": 6,
        "requires": {
            "python_version": "3.6"
        },
        "sources": [
            {
                "name": "pypi",
                "url": "https://pypi.org/simple",
                "verify_ssl": true
            }
        ]
    },
    "default": {
        "click": {
            "hashes": [
                "sha256:29f99fc6125fbc931b758dc053b3114e55c77a6e4c6c3a2674a2dc986016381d",
                "sha256:f15516df478d5a56180fbf80e68f206010e6d160fc39fa508b65e035fd75130b"
            ],
            "version": "==6.7"
        },
        "flask": {
            "hashes": [
                "sha256:2271c0070dbcb5275fad4a82e29f23ab92682dc45f9dfbc22c02ba9b9322ce48",
                "sha256:a080b744b7e345ccfcbc77954861cb05b3c63786e93f2b3875e0913d44b43f05"
            ],
            "index": "pypi",
            "version": "==1.0.2"
        },
        "gunicorn": {
            "hashes": [
                "sha256:aa8e0b40b4157b36a5df5e599f45c9c76d6af43845ba3b3b0efe2c70473c2471",
                "sha256:fa2662097c66f920f53f70621c6c58ca4a3c4d3434205e608e121b5b3b71f4f3"
            ],
            "index": "pypi",
            "version": "==19.9.0"
        },
        "itsdangerous": {
            "hashes": [
                "sha256:cbb3fcf8d3e33df861709ecaf89d9e6629cff0a217bc2848f1b41cd30d360519"
            ],
            "version": "==0.24"
        },
        "jinja2": {
            "hashes": [
                "sha256:74c935a1b8bb9a3947c50a54766a969d4846290e1e788ea44c1392163723c3bd",
                "sha256:f84be1bb0040caca4cea721fcbbbbd61f9be9464ca236387158b0feea01914a4"
            ],
            "version": "==2.10"
        },
        "markupsafe": {
            "hashes": [
                "sha256:a6be69091dac236ea9c6bc7d012beab42010fa914c459791d627dad4910eb665"
            ],
            "version": "==1.0"
        },
        "werkzeug": {
            "hashes": [
                "sha256:c3fd7a7d41976d9f44db327260e263132466836cef6f91512889ed60ad26557c",
                "sha256:d5da73735293558eb1651ee2fddc4d0dedcfa06538b8813a2e20011583c9e49b"
            ],
            "version": "==0.14.1"
        }
    },
    "develop": {}
}

Just FYI, I'm also seeing this failure very regularly with:

nohup gunicorn -w 8 --access-logfile=- --bind 127.0.0.1:5000 wsgi &

I just have a wsgi.py that has

from chart_api import application
if __name__ == "__main__":
    application.run()

Let me know if there is anything you want me to try/experiment or if there are specifics in the logs you want me to check. I'm running flask on a GCP VM.

sorry for late reply. I am running it as

gunicorn --log-file=/home/ubuntu/log/gunicorn.log predictor_api:app -b localhost:5000 &

i did use the gevent setting etc, but i have changed my design of what i required this for in order to work around issue, hence the basic setting above (which also failed but this is to be expected given no gevent)

Python version 3.6
env: aws tensorflow_p36 optmized (ubuntu)
nginx is sat in front of gunicorn, which is executing a flask app.

flask version 1.0.2
nginx version 1.10.3
gunicorn version 19.9.0

I did change the nginx timeouts as well incase this may have caused it.

Facing the same problem with gunicorn server

# gunicorn ApplicationServer:app -b 0.0.0.0:6001 -w 8 --threads 4 --backlog 2048 \
# --timeout 120 --graceful-timeout 60 --access-logfile logs/access.log \

--error-logfile logs/error.log --log-level=info

Flask==0.12.1
gunicorn==19.7.1

when i start server with above command, my system got freezing for some time and worker pids are keeping on booting , though i kept timeout 120 sec, and the server is not accepting single request.

Any update on this issue ? I have same issue

[CRITICAL] WORKER TIMEOUT

Wondering whether anyone has successfully reproduced this in a Docker image?

Also seeing this when trying to implement datadog's ddtrace-run worker on an existing application starting with gunicorn -k gevent --threads 4.

Funny trace of a SystemExit I've never seen before too...
[2018-11-07 11:11:50 +0000] [15987] [INFO] Booting worker with pid: 15987 [2018-11-07 11:11:50 +0000] [15977] [DEBUG] 1 workers [2018-11-07 11:12:20 +0000] [15977] [CRITICAL] WORKER TIMEOUT (pid:15987) Exception SystemExit: 1 in <bound method LibevLoop._loop_will_run of <cassandra.io.libevreactor.LibevLoop object at 0x7f3cb66d4a50>> ignored

I am able to resolve this issue by matching the number of workers and number of threads.

I had set workers = (2 * cpu_count) + 1 and did not set threads.

Once I changed threads = workers , everything started working fine. Just in case, if this helps someone.

This is how it looks now

def run(host='0.0.0.0', port=8080, workers=1 + (multiprocessing.cpu_count() * 2)):
    """Run the app with Gunicorn."""

    if app.debug:
        app.run(host, int(port), use_reloader=False)
    else:
        gunicorn = WSGIApplication()
        gunicorn.load_wsgiapp = lambda: app
        gunicorn.cfg.set('bind', '%s:%s' % (host, port))
        gunicorn.cfg.set('workers', workers)
        gunicorn.cfg.set('threads', workers)
        gunicorn.cfg.set('pidfile', None)
        gunicorn.cfg.set('worker_class', 'sync')
        gunicorn.cfg.set('keepalive', 10)
        gunicorn.cfg.set('accesslog', '-')
        gunicorn.cfg.set('errorlog', '-')
        gunicorn.cfg.set('reload', True)
        gunicorn.chdir()
        gunicorn.run()

I encountered this issue running Django with a single Docker container on AWS Elastic Beanstalk. I resolved the issue by fixing my security groups to ensure my EC2 instance could talk to my RDS instance. I recognize this may not be the solution for 99% of folks on this issue, but I'm leaving this note to help others avoid wasting hours falling down this rabbit hole.

In my case the issue was resolved by performing systemctl daemon-reload.
Long story short I knew of the default timeout at 30sec, but was unable to change it until i figured i had to of course reload the systemd daemon to apply the changes to the service.

@bigunyak @benoitc @tilgovi
After a long google goose chase and some experimentation I believe the root cause of this issue is chrome "preconnect/prediction service" (which is enabled by default in both Chrome and Chromium).

@jeiting did a nice write up about this issue
https://hackernoon.com/chrome-preconnect-breaks-singly-threaded-servers-95944be16400
Some additional reading:
https://github.com/corydolphin/flask-cors/issues/147
https://github.com/pallets/flask/issues/2169

TLDR
In some cases Chrome/Chromium opens and holds an "empty" TCP connection(s) (i.e its predicting that its going to fetch another resource soon). If the "empty" TCP connection(s) hits your gunicorn server first then any subsequent "real" request from chrome can get stuck behind the "empty" request until the worker(s) handling the "empty" request times out. This is more likely to happen if you are only running a single sync worker in gunicorn. But as my experiments found it can happen even if you are running multiple sync workers.

My Environment

  • My native OS is Ubuntu 17
  • I have a rest API running in a local docker container. Relevant lines from the Dockefile
FROM ubuntu:18.04
...
RUN pip3 install Flask==1.0.2
RUN pip3 install gunicorn==19.9.0
RUN pip3 install flask-cors==3.0.6
......
  • I have 3 browsers installed: Firefox 61.0.1, Chromium 67.0.3396.99, Chrome 70.0.3538.102
  • I have a react app that is served from a different docker container on a different port
  • The react app makes CORS AJAX requests to my API (they are CORS requests because the javascript is served on one localhost port while its sending API calls to another localhost port)
  • The request that always gets "blocked" in my experiments is a CORS OPTIONS request (triggered by the browser to ask for permission to then perform a POST request ). Seems logical that a browser would try to make predictive connections with the OPTIONS call because it's likely going to follow up with a POST call.

Experiment 1
Gunicorn config: 1 sync worker (default 30s timeout)

Firefox: On almost every react page load, the CORS OPTIONS request gets blocked for 5s, and then succeeds.
Chromium: On EVERY react page load, CORS OPTIONS request gets blocked for 1.5minutes!!!! and then succeeds.
Chromium(prediction service disabled): Everything loads fine
Chrome: Everything loads fine

Experiment 2
Gunicorn config: 4 sync workers (default 30s timeout)

Firefox: Everything loads fine
Chromium: On every 3rd react page load, CORS OPTIONS request gets blocked for 30s and then succeeds.
Chromium(prediction service disabled): Everything loads fine
Chrome: Everything loads fine

Experiment 3
Gunicorn config: 8 sync workers (default 30s timeout)

Firefox: Everything loads fine
Chromium: Everything loads fine
Chrome: Everything loads fine

Experiment 4
Run flask dev server with threaded=True

Firefox: Everything loads fine
Chromium: Everything loads fine
Chrome: Everything loads fine

Experiment Summary

  • Firefox seems to also have prediction service, but it seems more graceful. (Experiment 1)
  • Chromium seems the most aggressive in creating "prediction" connections. It appears to launch up to 3-4 "prediction" connections during an OPTIONS request (Experiment 1 and 2)
  • Surprisingly even though Chrome also has prediction service enabled it didn't cause any issues in any of the experiments (maybe the order or the number of "prediction" connections is different from Chromium)

Solution
In Production: Easiest fix is to put nginx in front of gunicorn. If you want to understand why nginx solves this issue, here is nice article that explains it: https://www.brianstorti.com/the-role-of-a-reverse-proxy-to-protect-your-application-against-slow-clients/

In dev: Easiest fix is to run flask dev server with threaded=True. Or you can disable the prediction service in Chrome/Chromium.

gunicorn debug improvements
To help debug an issue like this in the future, I think it's a good idea to add debug log statements next to the select() and accept() calls in the sync worker.
https://github.com/benoitc/gunicorn/blob/e974f30517261b2bc95cfb2017a8688f367c8bf3/gunicorn/workers/sync.py#L26
https://github.com/benoitc/gunicorn/blob/e974f30517261b2bc95cfb2017a8688f367c8bf3/gunicorn/workers/sync.py#L34
This would show that a worker has accepted a new TCP connection, but is not receiving any data.

@asnisarenko super write-up, thanks. #1929 is another unusual case of a slow-client which yields similar symptom in the single-threaded server - in that case, TLS handshake to a non-TLS port looks like a slow-writing client, since it fails to send sensible headers quickly.

It may be the Gunicorn, for single-threaded sync workers is going to need some new tunable, an aggressive client-drop for clients which fail to send at least a first request-header line, if not all headers, within a reasonable time.

If this happens in your app, you can add some temporary diagnostic code:

def trace_on_abort():
    import signal
    import traceback

    def print_trace(sig, frame):
        print(''.join(traceback.format_stack(frame)))

    signal.signal(signal.SIGABRT, print_trace)

Hopefully this should reveal where your app hangs.

For example, I somehow managed to import flask before gevent.monkey.patch_all() and so my Flask's app._before_request_lock ended up being a non-gevent lock (i.e. plain unpatched threading.Lock). In such a case, two requests in close succession on app start-up would cause it to hang (2nd request would lock the entire thread). But this was my bug, and your bug may differ.

@asnisarenko I'm not sure why you're experiencing a hang. You absolutely do _not_ need a worker per connection, since every worker can cooperatively handle thousands of connections (as long as the handling code doesn't monopolize the CPU / lock the entire thread). Specifically, if the browser stalls the request, it simply means that when gunicorn read()s from the socket, it will block a greenlet, but since read should be monkey-patched it'll yield to the other greenlets.

@ikonst
This issue is talking about running gunicorn with default configuration. The default worker class is sync http://docs.gunicorn.org/en/stable/settings.html#worker-class

sync workers use a pre-fork model and each worker processes one TCP connection/request at a time.

I don't know what caused this for me, but switching from the default sync worker type to eventlet resolved the problem:

CMD pipenv run gunicorn webapp -b 0.0.0.0:8080 -k eventlet

Good luck.

Maybe this is gonna help only me, but since it took me 7h to debug this issue on my side (a Flask/Gunicorn app running in Docker on EC2), if I can spare this headache to a few people, it's a small victory.

The container memory limit that was too low. I initially set it up at 100 MB, but apparently the app is requesting about 110 MB when running... Getting rid of the limit prevented my workers to timeout whith (apparently) no reason.

If I have memory leaks in the future, I'll try to put a higher memory limit, but for now, no limit will do.

Thanks

def trace_on_abort():
    import signal
    import traceback

    def print_trace(sig, frame):
        print(''.join(traceback.format_stack(frame)))

    signal.signal(signal.SIGABRT, print_trace)

You ыo much help me. EHU )))

Where does this trace_on_abort() method go? In the app code file? Does it need an annotation?

@mattg-vbt add this to the beginning of your app when you need to debug, and remove it once you're done debugging

@ikonst I added this to my app, but I don't see it getting hit. I get the worker timeout, but this method isn't hit.

@mattg-vbt can you try doing kill -ABRT $pid to see the print_trace function being called? (the idea is that you'll get SIGABRT from werkzeug when your worker times out, but let's first check that it's called at all)

@asnisarenko for the less savvy amongst us how then would I update the procfile?

web gunicorn app:app -k gevent --worker-connections 1000 ?

@SumNeuron
If you are using gevent workers you (instead of sync workers) then you should not be hitting this issue. So the command you posted seems fine.

I'm getting this issue with gevent workers only upon initialization, which is a bit annoying as I am processing some tasks before launching my app. But I am setting a high timeout for now.
I've posted a reproducible test repo here: https://github.com/zamponotiropita/test-gunicorn-worker-timeout -> test_0 fails, test_1 and test_2 pass

@zamponotiropita Are you trying to do this per-worker or per-app before the forking?

@ikonst please check the run file run.sh, it's per worker. The problem didn't come up with preloading, but I had issues connecting to the database when preloading, since the app object (and with it together the database connection) gets copied and passed the processes when forking -> my database couldn't handle identical connections of master and workers, and I couldn't find a workaround

I am having a similar issue. I am trying to generate a lot of data on-the-fly with Flask using
this method, and it fails because the worker times out after whatever was set in --timeout expires. A minimal example to reproduce:

test_gunicorn_timeout.py

import flask
from time import sleep


app = flask.Flask(__name__)


@app.route('/')
def gunicorn_timeout():
    def generator():
        for _ in range(10):
            yield b'Yet another line...'
            sleep(2)
    return flask.Response(generator(), mimetype='text/plain')

Then run gunicorn --timeout 10 test_gunicorn_timeout:app and when requesting localhost:8000 after 10 seconds you get a
[CRITICAL] WORKER TIMEOUT.

I also tried running with -k gevent and -k eventlet and nothing changed.

I am on windows 10 Pro. With docker compose
gunicorn app -b 0.0.0.0:8000 -k gevent
Worked for me by installing gevent in my python container
How to install gevent

Then run gunicorn --timeout 10 test_gunicorn_timeout:app and when requesting localhost:8000 after 10 seconds you get a
[CRITICAL] WORKER TIMEOUT.

I also tried running with -k gevent and -k eventlet and nothing changed.

@ltskv

have you found a solution to the [CRITICAL] WORKER TIMEOUT issue?

I have a similar problem with a flask streaming endpoint I am trying to deploy on heroku (which uses gunicorn). There, the proxy frontend is supposed to keep the connection alive as long as some data is sent within 30 sec, but it seems gunicorn simply kills the process if it has not finished within 30 sec even though it is still running and producing data.

The gunicorn documentation is not entierly clear to me on this point. For --timeout it says that Workers silent for more than this many seconds are killed and restarted. But it seems that workers are killed after 30sec even though they still produce data?

The gunicorn documentation is not entierly clear to me on this point. For --timeout it says that Workers silent for more than this many seconds are killed and restarted. But it seems that workers are killed after 30sec even though they still produce data?

@kurt-hectic that documentation should be improved. By silent, we mean silent from the perspective of the arbiter process, which communicates with the workers through a temporary file. If the worker is busy sending data, it does not update that file. From the perspective of the arbiter, the worker is missing heartbeats.

See also #1974.

Using any worker other than the default, synchronous worker should mitigate the issue with streaming large bodies, because the other workers will heartbeat the temporary file even when they are handling requests.

@kurt-hectic I re-checked once again the -k gevent option and inserted sleep(0) between generator iterations and it did actually work for me (I am not sure why it didn't work at the time I posted the question though).

--timeout=5

This is the most common cause of this issue.

I hope my solution could help you. I met this critical worker timeout problem a few days ago and tried a few solutions. It now works well.

Here are my understanding and solutions:

  1. Try preload in gunicorn

It fails to boot the workers because it needs more time to load the package, such as tensorflow backend, to start the service. So when you are experiencing slow app boot time, try to enable preload option in gunicorn (See https://devcenter.heroku.com/articles/python-gunicorn#advanced-configuration).

gunicorn hello:app --preload

  1. Try to increase the timeout for gunicorn

The default timeout is 30s. If your application really need much time to finish an api, increase the timeout.

gunicorn hello:app --timeout 10

However, from my perspective, it doesn't make sense if an api need more than 1 minutes to finish. If so, try to make some progress in your code.

  1. If you are using k8s, you can also set a timeoutSeconds for you container/image in yaml.

I faced the same issue today. In my case the api was taking about a minute to calculate data and return to the client, which resulted in CRITICAL WORKER TIMEOUT errors. I solved it by increasing the timeout flag for gunicorn to more than a minute - it worked, did not see the issue come back. Hope this helps. I am using uvicorn.workers.UvicornWorker.

I fixed this by adding extra workers to gnuicorn:

web: gunicorn --workers=3 BlocAPI:app --log-file -

No idea why.

Maybe you had a deadlock ? Does your app make requests to itself ?

On Sun, 5 Jan 2020, 10:52 alpinechicken, notifications@github.com wrote:

I fixed this by adding extra workers to gnuicorn:

web: gunicorn --workers=3 BlocAPI:app --log-file -

No idea why.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/benoitc/gunicorn/issues/1801?email_source=notifications&email_token=AAAEQJVQRCW3C63EZJWIN5DQ4G3WTA5CNFSM4FDLD5PKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEIDTZIA#issuecomment-570899616,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AAAEQJXZM4NLK56DZMFSZALQ4G3WTANCNFSM4FDLD5PA
.

Yep one route calls another - is that bad?

It means that you need at least two workers otherwise your server will
deadlock. The request will wait until the server responds to the second
request (which would be queued).

You get one concurrent request per worker.

On Mon, 6 Jan 2020, 02:45 alpinechicken, notifications@github.com wrote:

Yep one route calls another - is that bad?


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/benoitc/gunicorn/issues/1801?email_source=notifications&email_token=AAAEQJSFEFBBI6AMZJCM4C3Q4KLOJA5CNFSM4FDLD5PKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEIEIEXI#issuecomment-570983005,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AAAEQJXTCPOFIZJU5PUPOODQ4KLOJANCNFSM4FDLD5PA
.

Ah that makes sense. Thanks!

On Tue, Jan 7, 2020 at 6:23 AM bobf notifications@github.com wrote:

It means that you need at least two workers otherwise your server will
deadlock. The request will wait until the server responds to the second
request (which would be queued).

You get one concurrent request per worker.

On Mon, 6 Jan 2020, 02:45 alpinechicken, notifications@github.com wrote:

Yep one route calls another - is that bad?


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<
https://github.com/benoitc/gunicorn/issues/1801?email_source=notifications&email_token=AAAEQJSFEFBBI6AMZJCM4C3Q4KLOJA5CNFSM4FDLD5PKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEIEIEXI#issuecomment-570983005
,
or unsubscribe
<
https://github.com/notifications/unsubscribe-auth/AAAEQJXTCPOFIZJU5PUPOODQ4KLOJANCNFSM4FDLD5PA

.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/benoitc/gunicorn/issues/1801?email_source=notifications&email_token=AAH2WRPVPVO2EJ53BKQW5B3Q4OHLRA5CNFSM4FDLD5PKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEIGVJ7Q#issuecomment-571299070,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AAH2WRM2LLIB4O6OHCU5UG3Q4OHLRANCNFSM4FDLD5PA
.

worker_class', 'sync')

I am able to resolve this issue by matching the number of workers and number of threads.

I had set workers = (2 * cpu_count) + 1 and did not set threads.

Once I changed threads = workers , everything started working fine. Just in case, if this helps someone.

This is how it looks now

def run(host='0.0.0.0', port=8080, workers=1 + (multiprocessing.cpu_count() * 2)):
    """Run the app with Gunicorn."""

    if app.debug:
        app.run(host, int(port), use_reloader=False)
    else:
        gunicorn = WSGIApplication()
        gunicorn.load_wsgiapp = lambda: app
        gunicorn.cfg.set('bind', '%s:%s' % (host, port))
        gunicorn.cfg.set('workers', workers)
        gunicorn.cfg.set('threads', workers)
        gunicorn.cfg.set('pidfile', None)
        gunicorn.cfg.set('worker_class', 'sync')
        gunicorn.cfg.set('keepalive', 10)
        gunicorn.cfg.set('accesslog', '-')
        gunicorn.cfg.set('errorlog', '-')
        gunicorn.cfg.set('reload', True)
        gunicorn.chdir()
        gunicorn.run()

As per gunicorn doc, it changes the worker class from sync to gthread if more than one threads are mentioned.
PS:-
If you try to use the sync worker type and set the threads setting to more than 1, the gthread worker type will be used instead.

My case:

Environment: Ubuntu18.04+ gunicorn+ nginx +flask

pip install gunicorn[gevent] in my virtual environment

Change gunicorn -b localhost:8000 -w 4 web:app to gunicorn -b localhost:8000 -k gevent web:app

It works.

Thank you to everyone here who has done so much to help one another resolve their issues. Please continue to post to this issue if it seems appropriate.

However, I am closing this issue because I don't think there is any bug in Gunicorn here and I don't think there is any action to take, although I will happily help review PRs that try to add documentation for this somehow or improve log messages.

Please do not misunderstand my intention. If you suspect a bug in Gunicorn and want to continue discussing, please do so. Preferably, open a new ticket with an example application that reproduces your issue. However, at this point, there are too many different problems, resolutions, and conversations in this issue for it to be very legible.

If you run Gunicorn without a buffering reverse proxy in front of it you _will_ get timeouts with the default, sync worker for any number of reasons. Common ones are:

  • Slow clients
  • Preconnect / prefetch connections that are left open by browsers and proxies
  • Long responses due to calling external APIs or doing lots of CPU-bound work

You can switch to asynchronous or threaded worker types, or you can put Gunicorn behind a buffering reverse proxy. If you know that your timeouts are due to your own code making slow calls to external APIs or doing significant work that you expect, you may increase the --timeout option.

It means that you need at least two workers otherwise your server will deadlock. The request will wait until the server responds to the second request (which would be queued). You get one concurrent request per worker.

On Mon, 6 Jan 2020, 02:45 alpinechicken, @.*> wrote: Yep one route calls another - is that bad?

Is this the case when calling the 'redirect' function as the return value for a route?

Is this the case when calling the 'redirect' function as the return value for a route?

No. A flask redirect responds with an HTTP redirect and the worker is then free to accept a new request. The client makes another request when it sees this response and whenever a worker is ready in will receive this request.

I fixed this by adding extra workers to gnuicorn:

web: gunicorn --workers=3 BlocAPI:app --log-file -

No idea why.

Is this related to @anilpai comment earlier where he set workers=1 + (multiprocessing.cpu_count() * 2) .. ?

I had a similar issue to this. Turns out I had an error in my entrypoint to the application. From debugging it seemed that I was essentially launching a flask app from gunicorn, who's workers subsequently enter an infinite connection loop which times out every 30s.

I'm sure that this doesn't affect all users above, but may well affect some.

In my module/wsgi.py file which I'm running with gunicorn module.wsgi I had -

application = my_create_app_function()
application.run(host="0.0.0.0")

Whereas I should've had -

application = my_create_app_function()
if __name__ == "__main__":
     application.run(host="0.0.0.0")

Essentially, you don't want to call application.run() when using gunicorn. The __name__ under gunicorn won't be "__main__", but it will in Flask, so you can still debug locally.

I couldn't find a reference to this in the gunicorn docs, but could imagine it being a common error case, so maybe some warning is necessary.

This is still occuring. Adding --preload to the Gunicorn call fixed the issue for me.

Is this bug still not fixed? I am observing this exact behavior.

Gunicorn starts like this in systemd:

[Service]
PIDFile = /run/gunicorn.pid
WorkingDirectory = /home/pi/pyTest
ExecStart=/usr/local/bin/gunicorn  app:app  -b 0.0.0.0:80 --pid /run/gunicorn.pid
RuntimeDirectory=/home/pi/pyTest
Restart=always
KillSignal=SIGQUIT
Type=notify
StandardError=syslog
NotifyAccess=all
User=root
Group=root
ExecReload = /bin/kill -s HUP $MAINPID
ExecStop = /bin/kill -s TERM $MAINPID
ExecStopPost = /bin/rm -rf /run/gunicorn
PrivateTmp = true

Worker process constantly times out and restarts:

Jul 10 15:19:20 raspberryVM gunicorn[10941]: [2020-07-10 15:19:20 -0700] [10941] [CRITICAL] WORKER TIMEOUT (pid:10944)
Jul 10 15:19:20 raspberryVM gunicorn[10941]: [2020-07-10 15:19:20 -0700] [10944] [INFO] Worker exiting (pid: 10944)
Jul 10 15:20:15 raspberryVM gunicorn[10941]: [2020-07-10 15:20:15 -0700] [10985] [INFO] Booting worker with pid: 10985

app.py is a trival Flask app.

Is this issue closed as Don't Fix?

I was also having the same issue

But after Debugging Im able to find that while gunicorn starts Django App one of the dependency was taking longer than the expected time , ( In my case external DB connection ) which make the gunicron worker to timeout

When I resolved the connection issue , timeout issue also resolved ...

This would not my case. I tested with “Hello, World” type of app, with no dependencies. So I am still puzzled by this, but it seems it’s not possible to have Gunicorn with long running thread. Worker process restarts and therefore kill the long running thread.

@leonbrag
This is likely NOT a gunicorn bug. See my commend above in the thread. It's a side-effect of browsers sending empty "predicted" TCP connections, and running gunicorn with only a few sync workers without protection from empty TCP connections.

Is there a reference architecture/design that shows a proper way to set up Gunicorn flask app with long (permanent) worker thread ?

If this is not a bug, then it’s seems an artifact or a limitation of the Gunicorn architecture/design.

Why would not sync worker run forever and accept clients connections. Such worker would close socket as needed, yet continue to run without exIting (and therefor worker thread continue to run).

@leonbrag
You should be more specific about what problem you are trying to solve.

The problem discussed in this thread happens in dev environment and the easiest solution is either to add more sync workers or use threaded workers.

If you want to avoid this issue in production setup, you can use gevent workers, or you can put an nginx infront of gunicorn.
Some PaaS already put an nginx in front of your docker container, so you don't have to worry about it. Again the solution depends on the context and the details.

This is a good reading.
https://www.brianstorti.com/the-role-of-a-reverse-proxy-to-protect-your-application-against-slow-clients/

you can check the design page from the documentation. Async workers is one
way to run long tasks.

On Sat 8 Aug 2020 at 18:00 leonbrag notifications@github.com wrote:

Is there a reference architecture/design that shows a proper way to set up
Gunicorn flask app with long (permanent) worker thread ?

If this is not a bug, then it’s seems an artifact or a limitation of the
Gunicorn architecture/design.

Why would not sync worker run forever and accept clients connections. Such
worker would close socket as needed, yet continue to run without exIting
(and therefor worker thread continue to run).


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/benoitc/gunicorn/issues/1801#issuecomment-670944797,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AAADRIWRQGIP3R5PMVJ5ENTR7VZA3ANCNFSM4FDLD5PA
.

>

Sent from my Mobile

web: gunicorn --workers=3 app:app --timeout 200 --log-file -

I fixed my problem by incresing the --timeout

See also #1388 for Docker related tmpfs issues.

Oh, thanks a lot Randall, I forgot to add --worker-tmp-dir /dev/shm to gunicorn arguments when I was running gunicorn in Docker.

BTW will 64 Mb be enough for gunicorn cache?

gunicorn app:app --timeout 1000
Or
gunicorn app:app --preload

Worked for me... I prefer timeout one.

Strange, I added --worker-tmp-dir /dev/shm but still receiving:

[2020-11-27 21:01:42 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:17)

To make sure /dev/shm is ramfs I benchmarked it:

image

The params are next:

    command: /bin/bash -c "cd /code/ && pipenv run gunicorn --worker-tmp-dir /dev/shm conf.wsgi:application --bind 0.0.0.0:8022 --workers 5 --worker-connections=1000"

PS: I am using PyPy

@attajutt timeout is nice but you are risking that gunicorn master process will detect hangup in your worker process only after 1000 seconds, and you will miss a lot of requests. Also it will be hard to detect it if only one of several workers will hangup. I would not do 1000 at least.

@ivictbor thanks for lmk. 1000 is for reference. Nevertheless, I got the app rolling once Its loaded It is running perfectly fine.

I got this error problem too and after several times, I found that the problem is probably caused :

  1. Nginx configuration
  2. Gunicorn/Uwsgi

If you deploy your app in cloud like GAE, that will not surface anything hint error.
you can try to surface the error using this case solution : https://stackoverflow.com/questions/38012797/google-app-engine-502-bad-gateway-with-nodejs

If raised 502 bad gateway;
probably will have 2 possibilities:

  1. gunicorn isn't running
  2. gunicorn got timeout

complete sulotion explained in here : https://www.datadoghq.com/blog/nginx-502-bad-gateway-errors-gunicorn/

hope that can fix anyone got error in [CRITICAL] WORKER TIMEOUT

Adding another possibility for those who find this thread...

This can also be caused by having docker imposed resource constrains that are too low for you web application. For example I had the following constraints:

services:
  web_app:
    image: blah-blah
    deploy:
      resources:
        limits:
          cpus: "0.25"
          memory: 128M

and these were evidently too low for gunicorn so I constantly got the [CRITICAL] WORKER TIMEOUT error until I removed the constraints.

For gunicorn this resources are perfectly fine. But you indeed need to
plane for the number of workers and the resources needed for your
application. 128M and 0.25cpu seems really low for a web application
written in Python.... generally speaking you need at least 1 core /vcpu and
512MB of RAM as a bare minimum.

On Fri 26 Mar 2021 at 02:14, Colton Hicks @.*> wrote:

Adding another possibility for those who find this thread...

This can also be caused by having docker imposed resource constrains that
are too low for you web application. For example I had the following
constraints:

services:
web_app:
image: blah-blah
deploy:
resources:
limits:
cpus: "0.25"
memory: 128M

and these were evidently too low for gunicorn so I constantly got the [CRITICAL]
WORKER TIMEOUT error until I removed the constraints.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/benoitc/gunicorn/issues/1801#issuecomment-807855647,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AAADRITPZB7BMA6QW7LFNVLTFPNV3ANCNFSM4FDLD5PA
.

>

Sent from my Mobile

--timeout=1000 worked form me. Issue was a low-cpu resourced GCP machine. It worked fine on my local machine with the default timeout.

Was this page helpful?
0 / 5 - 0 ratings