Gunicorn sync workers time out on docker + AWS

Created on 29 Jan 2016 · 78 Comments · Source: benoitc/gunicorn

Using gunicorn 19.4.5 inside a docker image, which is running on Ubuntu 14.04 LTS on AWS, I see constant worker timeouts when running with >1 sync worker and the default 30s timeout period. This may be related to #588 or #942, although this is a significantly newer version of gunicorn so I couldn't tell for sure.

Circumstantial evidence over several runs seems to suggest that one of the workers is not affected, and it's only the N-1 remaining workers that keep failing and restarting. However, I was only able to get a few data points that suggest this, so don't take it as a strong signal.

Here are things I checked:

  • Do the issues happen outside of docker, or when using the same docker image, but not on AWS?
    No, I was only able to replicate the issue with docker on AWS.
  • Does the application take >30s to initialize?
    No, the application finishes initializing in under 1 second.
  • Does the application receive a request that causes it to hang?
    The issue popped up without any requests being made to the application.
  • Does the application have broken initialization code that confuses gunicorn?
    The issue popped up when running gunicorn in the same configuration (with >1 sync worker), but with a different application. This new application runs without problems on gunicorn when using gevent workers.
  • Does docker's filesystem fail to update the ctime?
    See the below log -- the ctime seems to usually advance fine. Docker's filesystem doesn't allow stat -L on deleted files, but preventing gunicorn from unlinking the file also results in stat -L showing that ctime is being updated as usual.

I added some code to workertmp.py to log the ctime read on each check, and here is one such log, interleaved between all workers (a sketch of the logging patch follows the log):

[2016-01-29 00:21:38 +0000] [3238] [INFO] Starting gunicorn 19.4.0
[2016-01-29 00:21:38 +0000] [3238] [INFO] Listening at: http://0.0.0.0:5000 (3238)
[2016-01-29 00:21:38 +0000] [3238] [INFO] Using worker: sync
[2016-01-29 00:21:38 +0000] [3243] [INFO] Booting worker with pid: 3243
[2016-01-29 00:21:38 +0000] [3244] [INFO] Booting worker with pid: 3244
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026899.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026899.03
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026900.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026899.03
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026900.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026899.03
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026900.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026899.03
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026900.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026899.03
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026904.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026899.03
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026904.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026905.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026904.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026905.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026904.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026905.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026904.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026905.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026904.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026909.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026910.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026909.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026910.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026909.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026910.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026909.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026910.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026909.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026914.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026933.73
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026934.74
FOOBAR: modify /tmp/wgun[2016-01-29 00:22:25 +0000] [3238] [CRITICAL] WORKER TIMEOUT (pid:3243)
[2016-01-29 00:22:25 +0000] [3243] [INFO] Worker exiting (pid: 3243)
[2016-01-29 00:22:25 +0000] [3330] [INFO] Booting worker with pid: 3330
icorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026935.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026935.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026935.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026935.0
FOOBAR: modify /tmp/wgunic54026934.74
FOOBAR: modify /tmp/wgun[2016-01-29 00:22:25 +0000] [3238] [CRITICAL] WORKER TIMEOUT (pid:3243)
[2016-01-29 00:22:25 +0000] [3243] [INFO] Worker exiting (pid: 3243)
[2016-01-29 00:22:25 +0000] [3330] [INFO] Booting worker with pid: 3330
icorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time:fy] [3238] [CRITICAL] WORKER TIMEOUT (pid:3243)
[2016-01-29 00p/2:25 +0000] [3243] [INFO] Worker exiting (pid: 3243)
[2016-0ic29 00:22:25 +0000] [3330] [INFO] Booting worker with pid: 33myicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorntiyjLwI time: 1454026935.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt45ime: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026945.82
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026946.01
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026946.01
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026946.01
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8corn-myjLwI time:fy] [3238] [CRITICAL] WORKER TIMEOUT (pid:32BA)
[2016-01-29 00p/2:25 +0000] [3243] [INFO] Worker exiting (pdify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
FOOBAR: modify /tmp/wgunicorn-Mw64T1 timeodify /tmp/wgunicorn-myjLwI time: 1454026964.74
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026965.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026965.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026965.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026965.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026969.74
FOO[2016-01-29 00:22:59 +0000] [3238] [CRITICAL] WORKER TIMEOUT (pid:3330)
[2016-01-29 00:22:59 +0000] [3330] [INFO] Worker exiting (pid: 3330)
icorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026935.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026935.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026935.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026935.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026939.74
FOOBAR: modify /tmp/wgunicorn-LwI time: 1454026965.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 tI time: 1454026940.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026940.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026940.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026940.0
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026944.74
FOOBAR: modify /tmp/wgunicorn-RhAFmt time: 1454026915.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.0
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026945.8
[2016-01-29 00:22:59 +0000] [3396] [INFO] Booting worker with pid: 3396
BAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjL26949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjL26949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjL26949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjL26949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjL26949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
FOOBAR: modify /tmp/wgunicorn-myjL26949.74
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
FOOBAR: modify /tmp/w79.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026979.97
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
80.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454029.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454029.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454029.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026979.95
FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
 WORKER TIMEOUT (pid:3396)
 [2016-01-29 00:23:31 +0000] [3396] [INFO] Worker exiting (pid: 3396)
 BAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
 FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
 FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
 FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026970.0
 FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026974.74
 FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
 FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
 FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
 FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
 FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
 FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
 FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
 FOOBAR: modify /tmp/wgunicorn-Mw64T1 time: 1454026949.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454026975.0
 FOOBAR: modify /tmp/wgunicorn-Mw6nicorn-k3uZLy time: 1454026980.16
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027005.0
 FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027010.0
 FOOBAR: modify /tmp/wgunicorn-k3uZLy time: 1454026980.16
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027010.0
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027011.08
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027011.28
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027011.28
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027011.28
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmprn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmp/wgunicorn-Znicorn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027011.06
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027028.98
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027030.0
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027030.0
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wguicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027030.0
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027030.0
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wguicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027030.0
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027030.0
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wguicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027030.0
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn--ZmVaVS time: 1454027014.74
 FOOBAR: modify /tmp/wgunicorn-myjLwI time: 1454027035.0
 FOOBAR: modify /tmp/wgunicorn-ZmVaVS time: 1454027014.74
 FOOBAR: modify /t[2016-01-29 00:24:05 +0000] [3238] [CRITICAL] WORKER TIMEOUT (pid:3453)
 [2016-01-29 00:24:05 +0000] [3453] [INFO] Worker exiting (pid: 3453)
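
For reference, a minimal sketch of the kind of logging patch described above (hypothetical; the exact change is not shown in this issue). gunicorn's WorkerTmp.last_update() reads the ctime of the per-worker temp file, so the idea is simply to print the file name and the ctime each time it is read:

# Hypothetical sketch of the extra logging, not the exact patch used above.
# The arbiter reads the temp file's ctime on each check; printing the file
# name and ctime produces lines like the "FOOBAR: modify ..." entries above.
import os

def log_last_update(tmp_name, tmp_fd):
    ctime = os.fstat(tmp_fd).st_ctime
    print("FOOBAR: modify %s time: %s" % (tmp_name, ctime))
    return ctime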

Most helpful comment

Try using the gevent workers instead, this has solved the issue for me.

All 78 comments

hrmmm from the log above I can see that the time is never changing. I will check

Thanks, I appreciate it!

I might be affected by this bug.

I am running gunicorn inside a Docker image on AWS (Elastic Beanstalk + EC2 Container Service).
There are constant worker timeouts that start happening almost immediately after initialization, and only when running the Docker image on AWS.
The same HTTP request takes sometimes a few ms, sometimes 20-30 seconds. Doesn't matter if it's for a non-existent page, a simple non-authenticated "Hello world"-string page or a page that fetches records from a database.

Additional details:

  • Gunicorn versions: 19.5.0 and 19.6.0.
  • Base Docker image: python:3.4 (OS: Debian Jessie)
  • Docker version: 1.9.1 (managed by Amazon ECS), OS: Amazon Linux AMI 2016.03 (managed by Amazon EB).
  • Web application framework: Django versions 1.9.6 and 1.9.7.

Environments _not_ affected:

  • Local machine, both _inside_ and _outside_ a Docker container. OS: Ubuntu 16.04 LTS, Docker: 1.11.2.
  • Heroku (OS: Ubuntu 14.04 LTS)

Amazon container logs:

[2016-06-15 16:46:15 +0000] [1] [DEBUG] Current configuration:
  cert_reqs: 0
  post_request: <function PostRequest.post_request at 0x7fb13e187c80>
  worker_connections: 1000
  tmp_upload_dir: None
  enable_stdio_inheritance: False
  timeout: 30
  pre_fork: <function Prefork.pre_fork at 0x7fb13e1871e0>
  worker_int: <function WorkerInt.worker_int at 0x7fb13e1876a8>
  backlog: 2048
  max_requests: 0
  max_requests_jitter: 0
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  logconfig: None
  syslog_addr: udp://localhost:514
  preload_app: False
  pidfile: None
  umask: 0
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  paste: None
  proxy_protocol: False
  worker_class: sync
  on_starting: <function OnStarting.on_starting at 0x7fb13e414c80>
  worker_abort: <function WorkerAbort.worker_abort at 0x7fb13e187840>
  worker_exit: <function WorkerExit.worker_exit at 0x7fb13e187e18>
  config: config/gunicorn.py
  nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7fb13e18a048>
  daemon: False
  accesslog: None
  errorlog: -
  loglevel: debug
  syslog_prefix: None
  ssl_version: 3
  suppress_ragged_eofs: True
  limit_request_field_size: 8190
  reload: False
  logger_class: gunicorn.glogging.Logger
  statsd_host: None
  keyfile: None
  raw_env: []
  threads: 1
  django_settings: None
  proc_name: None
  proxy_allow_ips: ['127.0.0.1']
  limit_request_fields: 100
  ciphers: TLSv1
  check_config: False
  do_handshake_on_connect: False
  post_worker_init: <function PostWorkerInit.post_worker_init at 0x7fb13e187510>
  graceful_timeout: 30
  worker_tmp_dir: None
  certfile: None
  sendfile: None
  keepalive: 2
  chdir: /app/testapp
  when_ready: <function WhenReady.when_ready at 0x7fb13e187048>
  ca_certs: None
  on_exit: <function OnExit.on_exit at 0x7fb13e18a1e0>
  spew: False
  bind: [':8000']
  post_fork: <function Postfork.post_fork at 0x7fb13e187378>
  limit_request_line: 4094
  syslog_facility: user
  workers: 3
  syslog: False
  pre_request: <function PreRequest.pre_request at 0x7fb13e187b70>
  user: 0
  group: 0
  forwarded_allow_ips: ['127.0.0.1']
  pythonpath: None
  on_reload: <function OnReload.on_reload at 0x7fb13e414e18>
  pre_exec: <function PreExec.pre_exec at 0x7fb13e1879d8>
  default_proc_name: config.wsgi
  statsd_prefix: 
[2016-06-15 16:46:15 +0000] [1] [INFO] Starting gunicorn 19.5.0
[2016-06-15 16:46:15 +0000] [1] [DEBUG] Arbiter booted
[2016-06-15 16:46:15 +0000] [1] [INFO] Listening at: http://0.0.0.0:8000 (1)
[2016-06-15 16:46:15 +0000] [1] [INFO] Using worker: sync
[2016-06-15 16:46:15 +0000] [8] [INFO] Booting worker with pid: 8
[2016-06-15 16:46:15 +0000] [9] [INFO] Booting worker with pid: 9
[2016-06-15 16:46:15 +0000] [10] [INFO] Booting worker with pid: 10
[2016-06-15 16:46:15 +0000] [1] [DEBUG] 3 workers
[2016-06-15 16:46:52 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:9)
[2016-06-15 16:46:52 +0000] [9] [INFO] Worker exiting (pid: 9)
[2016-06-15 16:46:52 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:10)
[2016-06-15 16:46:52 +0000] [10] [INFO] Worker exiting (pid: 10)
[2016-06-15 16:46:53 +0000] [20] [INFO] Booting worker with pid: 20
[2016-06-15 16:46:53 +0000] [1] [DEBUG] 2 workers
[2016-06-15 16:46:53 +0000] [21] [INFO] Booting worker with pid: 21
[2016-06-15 16:46:53 +0000] [1] [DEBUG] 3 workers
[2016-06-15 16:47:23 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:8)
(...)


Thanks.

I am seeing this too. My suspicion is the Elastic Load Balancer. We used paste at a previous company and it caused issues with the ELB. I believe the ELB re-uses connections somehow, and this was causing hiccups in paste.

I am now in a new job where we use gunicorn, and everything was fine until we began to put the server behind an ELB; the webapp is now often very unresponsive and seemingly hangs.

Try using the gevent workers instead, this has solved the issue for me.

So is this an issue with the ELB keeping connections alive and tying up the sync worker? That would strike me as very weird since we deliberately signal that the connection should be closed with the sync worker: https://github.com/benoitc/gunicorn/blob/master/gunicorn/workers/sync.py#L168

We could close the connection deliberately, though. If anyone wants to test moving the client.close() call to be unconditional that might be an interesting test: https://github.com/benoitc/gunicorn/blob/master/gunicorn/workers/sync.py#L199
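
For anyone who wants to try that experiment, here is a minimal sketch of the idea (illustrative only, not gunicorn's actual sync worker code; the function and argument names are made up):

# Sketch of the suggested experiment: always close the client socket after a
# request has been handled, instead of only on certain code paths.
def handle(client, addr, process_request):
    try:
        process_request(client, addr)
    finally:
        try:
            client.close()   # unconditional close
        except OSError:
            pass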

You could also make sure that the idle timeout on the ELB is lower than the worker timeout, as a quick sanity check that this is even the cause of the issue: http://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-idle-timeout.html

@tilgovi the ELB thing is my current best guess, because a) it happens only if gunicorn is behind an ELB and b) the issue is fixed by using the gevent worker. But then again, who knows...

It happens only behind an ELB, and only when the listener protocol on the ELB is HTTP or HTTPS.
If you change the protocol to TCP or SSL the problem disappears.
Does anyone know how to fix it while still using sync workers? I have used the fix I described (TCP instead of HTTP) until now, but I want to monitor access logs and see the X-Forwarded-For header, which is only available for HTTP/HTTPS listeners on ELB :)

Strange, just to make sure we are on the same page, this means there is some kind of optimization when using the HTTP/HTTPS option in the ELB? And this optimization is causing issues with the sync workers?

@ecs I think in the HTTP/HTTPS case the load balancer reuses the same TCP connection, which the ELB keeps alive for a long time, something along those lines. But that doesn't explain why the gevent worker works fine... (although I haven't tested gevent yet, just read about it in these comments)

By the way, there are certainly a lot of factors (gunicorn version, worker class, inside Docker or not, ELB protocol) that can influence this, so it would be great if someone could test and confirm that the TCP protocol on the ELB works where HTTP does not. I'm testing on Docker hosted on Ubuntu 14.04 with gunicorn 19.3.0.

I have the same problem:
[2016-10-03 12:13:17 +0000] [6] [CRITICAL] WORKER TIMEOUT (pid:16)
[2016-10-03 12:13:17 +0000] [16] [INFO] Worker exiting (pid: 16)
[2016-10-03 12:13:17 +0000] [6] [CRITICAL] WORKER TIMEOUT (pid:20)
[2016-10-03 12:13:17 +0000] [20] [INFO] Worker exiting (pid: 20)
[2016-10-03 12:13:17 +0000] [33] [INFO] Booting worker with pid: 33

My setup.

I am using kube-aws to set up a kubernetes cluster with core-os on AWS: https://github.com/coreos/coreos-kubernetes/tree/master/multi-node/aws

I have a python app using gunicorn and sync workers. The service for them uses an HTTPS load balancer with my site's certificate. All workers seem to be considered timed out after the default timeout of 30s. They are eventually all restarted. Kubernetes eventually restarts the container when it exits, and these restarts happen continuously. Someone above mentioned that gevent solves the problem; I may try that out, but I'd really like to understand why this is happening.

We're seeing this issue too currently, I'll try and post more details when I get a chance. It is related to the sync worker as moving to gevent does indeed solve the issue. This is on AWS with ECS and ELB.

Had the same weird problem as well with one of our apps. Changing the listeners from HTTP to TCP indeed solves the problem.

Can anyone provide me with a step-by-step way to reproduce the issue (config steps and such)? Or at least point me to a simple way to set up that environment :) I will have a look at it this week.

Relying on HTTP for such a balancer generally requires handling keep-alive behind it (except for TCP), so anything using a sync worker may be an issue until the sync worker is updated to support it.

Anyway let me know.

Just set up a basic Flask app and run it with gunicorn with sync workers inside docker on AWS, with HTTP load-balancing and the configuration from above:

Using gunicorn 19.4.5 inside a docker image, which is running on Ubuntu 14.04 LTS on AWS, I see constant worker timeouts when running with >1 sync worker and the default 30s timeout period.

You don't have to make any requests to the application, just monitor the logs and you'll see workers timing out.
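
For completeness, the reproduction app can be as trivial as the following (a hypothetical example; any minimal WSGI app should behave the same):

# app.py - trivial Flask app, only used to observe the idle-worker timeouts
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "hello world"

Run it inside the Docker image with something like gunicorn -w 4 -b 0.0.0.0:5000 app:app and watch the logs.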

  • I can confirm this behaviour as well as the basic approach to reproduce it.

    • The behaviour is not limited to Docker but rather to the sync worker / the keepalive setting, so you could consider changing the title...

  • I have modified the keepalive setting to follow what is described here: https://serverfault.com/questions/782022/keepalive-setting-for-gunicorn-behind-elb-without-nginx/ (the most interesting part is the quote from the AWS ELB documentation) and found that the server no longer exhibits this behaviour.
  • I have additionally changed to the gthread worker class and found the result to be very much acceptable so far.

I am currently using gevent workers with 65s timeout period (AWS ELB timeout is 60s, AFAIK). I still sometimes experience 504 errors even when only one client makes requests, with 1 request in flight at a time, but I wasn't able to pin down a reason just yet.

@obi1kenobi Did you try to increase the keepalive to beyond the ELB timeout option?

@lenucksi the keepalive is already higher than the ELB timeout by 5s -- hopefully the servers' clocks don't drift quite that much :)
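
For context, the configuration under discussion looks roughly like this (values are examples; the only constraint mentioned here is keeping timeout and keepalive above the 60s ELB idle timeout):

# gunicorn.conf.py -- example values only
worker_class = 'gevent'
timeout = 65      # above the 60s ELB idle timeout
keepalive = 65    # likewise above the ELB idle timeout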

I too encountered this issue running gunicorn 19.6 inside a Docker container on an EC2 instance behind an ELB (using the default HTTP protocol for reverse-proxying to the EC2 instance).

As suggested earlier, setting the worker class to gevent resolved this issue for me (where previously I had been using the default sync worker class).

I'll try and document a minimal example for reproducing the issue.

To corroborate what others have said, the ELB seems to be opening a connection to the EC2 instance and holding it open in a way that locks up gunicorn from handling additional requests.

When that held-open connection terminates due to gunicorn timing out the request from the ELB, all pending requests queued up behind it also fail (including the ELB's health check, causing the instance to be removed from the ELB, making this harder to debug ;) ).

This symptom suggests this is a duplicate of what's been observed in this previous issue that is unrelated to Docker: https://github.com/benoitc/gunicorn/issues/588

@jelis Do you know if this is related to some kubernetes configuration? Was your issue solved?

I'm also on an ELB with SSL termination. I'm seeing this in my logs:

[017-03-01 07:36:32 +0000] [1249] [DEBUG] GET /healthcheck
[01/Mar/2017:07:36:32 +0000] "GET /healthcheck HTTP/1.1" 200 22 "-" "ELB-HealthChecker/1.0"
[017-03-01 07:36:37 +0000] [12] [CRITICAL] WORKER TIMEOUT (pid:1203)
[017-03-01 07:36:37 +0000] [12] [CRITICAL] WORKER TIMEOUT (pid:1225)
[017-03-01 07:36:37 +0000] [12] [CRITICAL] WORKER TIMEOUT (pid:1234)
[017-03-01 07:36:37 +0000] [12] [CRITICAL] WORKER TIMEOUT (pid:1243)
[017-03-01 07:36:37 +0000] [1203] [INFO] Worker exiting (pid: 1203)
[017-03-01 07:36:37 +0000] [1225] [INFO] Worker exiting (pid: 1225)
[017-03-01 07:36:37 +0000] [1243] [INFO] Worker exiting (pid: 1243)
[017-03-01 07:36:37 +0000] [1234] [INFO] Worker exiting (pid: 1234)
[017-03-01 07:36:37 +0000] [1256] [INFO] Booting worker with pid: 1256
[017-03-01 07:36:37 +0000] [12] [DEBUG] 14 workers
[017-03-01 07:36:37 +0000] [1257] [INFO] Booting worker with pid: 1257
[017-03-01 07:36:37 +0000] [1262] [INFO] Booting worker with pid: 1262
[017-03-01 07:36:37 +0000] [1265] [INFO] Booting worker with pid: 1265
[017-03-01 07:36:37 +0000] [12] [DEBUG] 16 workers
[017-03-01 07:36:40 +0000] [1262] [DEBUG] GET /healthcheck

Update - I just switched from gunicorn to gevent (http://flask.pocoo.org/docs/0.12/deploying/wsgi-standalone/#gevent) and the problem is resolved. This is in reference to my above comment https://github.com/benoitc/gunicorn/issues/1194#issuecomment-283269046

I have been hammering on this issue for a couple of days. Everyone's notes here are fantastic, and are giving us a number of avenues to pursue fixes/workarounds/mitigations. (the log of my work is in the bug we're tracking here)

I just attempted the use of gevent with a boost in the worker count and that didn't help us at all - in fact it surprisingly made it look worse - all five gevent workers died off at the same time, which is a new behaviour from what I've read above:

[2017-03-30 23:38:37 +0000] [5] [INFO] Starting gunicorn 19.7.1
[2017-03-30 23:38:37 +0000] [5] [INFO] Listening at: http://0.0.0.0:8000 (5)
[2017-03-30 23:38:37 +0000] [5] [INFO] Using worker: eventlet
[2017-03-30 23:38:38 +0000] [8] [INFO] Booting worker with pid: 8
[2017-03-30 23:38:40 +0000] [9] [INFO] Booting worker with pid: 9
[2017-03-30 23:38:41 +0000] [11] [INFO] Booting worker with pid: 11
[2017-03-30 23:38:42 +0000] [12] [INFO] Booting worker with pid: 12
[2017-03-30 23:38:42 +0000] [10] [INFO] Booting worker with pid: 10
[2017-03-30 23:40:08 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:8)
[2017-03-30 23:40:12 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:9)
[2017-03-30 23:40:12 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:10)
[2017-03-30 23:40:12 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:11)
[2017-03-30 23:40:12 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:12)
[2017-03-30 23:40:13 +0000] [8] [INFO] Worker exiting (pid: 8)
[2017-03-30 23:40:13 +0000] [10] [INFO] Worker exiting (pid: 10)
[2017-03-30 23:40:16 +0000] [16] [INFO] Booting worker with pid: 16
[2017-03-30 23:40:16 +0000] [17] [INFO] Booting worker with pid: 17
[2017-03-30 23:40:18 +0000] [18] [INFO] Booting worker with pid: 18
[2017-03-30 23:40:20 +0000] [19] [INFO] Booting worker with pid: 19
[2017-03-30 23:40:23 +0000] [20] [INFO] Booting worker with pid: 20

This is likely some screw-up on my part, and wasn't the same behaviour as one of my project counterparts (with a similar but not identical codebase in a near-identical AWS/CloudFormation/ALB/ECS deployment). However I figured I'd report it in case it inspires any new insights into the interactions between ALB, ECS, Docker and gunicorn.

I have a similar setup (convox running ELB->ECS Docker Container->Gunicorn->Django) and I can confirm that the following configuration changes seem to have helped:

keepalive = 75 # needs to be longer than the ELB idle timeout
timeout = 90
worker_class = 'gevent'

I set my ELB idle timeout to 60 seconds.

Prior to changing these settings I was using the defaults for keepalive, timeout, and the sync worker.

Interestingly, this issue started showing up at 4am UTC on April 10. The app had been running fine without any changes since the previous Thursday.

Just piling on: no Docker, but a bare AWS instance was having the same issue behind an ELB. It works fine on OS X, but on Ubuntu 16.04 behind an ELB the same issue cropped up. The gevent worker class seems to fix it [for me].

@erikcw seems to have found this solution to work. I'd like to close this issue. If anyone can reproduce this after raising the Gunicorn worker timeout and keepalive timeout, please say so.

Alternatively, lower the idle timeout for ELB.

Couldn't a client in the wild (or a nefarious client) act like the ELB and cause the same issue?

@six8 yes. ELB doesn't protect you from this either. The documentation discusses this: http://docs.gunicorn.org/en/latest/design.html#choosing-a-worker-type

Some people don't care about this threat and just want to deploy Gunicorn, at the edge or behind an ELB. The issue discussed here is worker timeouts caused by the ELB itself. I would like to provide recommended configuration that makes gunicorn work behind ELB with sync workers and without an additional reverse proxy.
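
A starting point for such a configuration, judging from this thread, might look like the sketch below (an assumption on my part, not an official recommendation): the key point is that the worker timeout exceeds the ELB idle timeout, so an idle pre-opened connection from the ELB is less likely to trip the timeout.

# gunicorn.conf.py -- sketch only, assuming the ELB idle timeout is 60s
workers = 4
worker_class = 'sync'
timeout = 75      # longer than the ELB idle timeout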

I tried @erikcw's solution as well by changing the gunicorn invocation to:

CMD [ "gunicorn", "-k", "gevent", "--timeout", "10", "--keep-alive", "75", "--paste", "/etc/jzoo/app.ini" ]

Unfortunately that does not help:

[2017-07-04 08:36:23 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:24)
[2017-07-04 08:38:11 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:28)
[2017-07-04 08:39:39 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:29)
[2017-07-04 08:44:14 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:25)
[2017-07-04 08:44:55 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:31)

I have an application running on ECS behind an ALB with Gunicorn sync workers. I've not seen any issue yet.

Can anyone provide a reproducible deployment?

Piling on as well: seeing this issue using Kubernetes on AWS with ELB. Currently have 8 workers running. Per https://serverfault.com/questions/782022/keepalive-setting-for-gunicorn-behind-elb-without-nginx/, we're going to raise keepalive, and per this thread we're going to try gevent workers instead. But right now, what's failing are some performance tests for API calls that shouldn't be blocking or causing timeouts on their own. We would like to know what's going on here with sync workers. Even if we get this working per some of the good guidelines in this thread, we'd like a bit more confidence about what's going on before deploying to production.

Setting --worker-class to gevent worked for me. Thanks for the comments and help!

@wichert this is very strange. With the gevent worker, the workers should notify the arbiter of their liveness on a separate greenlet from any connections. Your keep-alive setting should not affect this at all. Something is either wrong with your application, such that gevent is not working as designed, or something about the setup is problematic that might be common to these other scenarios.

Please, everyone on this thread, provide an application that reproduces this problem. I have not been able to reproduce it. Without steps to reproduce, I am tempted to close. There are a mix of recommendations and maybe very different issues buried within these comments. It is hard to know what to do with this issue.

@tilgovi I experienced this problem when not specifying the worker class as gevent (i.e. using the default sync worker). My setup is a virtualenv using pypy3 (PyPy 5.8.0-beta0), falcon 1.2, etc. (understanding that Falcon is not officially supported on pypy3). I used the cookiecutter from https://github.com/7ideas/cookiecutter-falcon and upgraded to all the latest packages; when loading the sample_resource in the browser and hitting it many times with httpie I would intermittently see the worker crash error (i.e. the request would be delayed and then I would see the error; if I didn't cancel the request it would eventually respond). This only happened with the default worker class of sync. When I specified gevent, everything worked.

Thank you, @jhillhouse92, I will give it a try.

I'm continuing to see problems with timeouts, @tilgovi. I will be troubleshooting this more tomorrow and will post an update with more precise information. Please do keep this issue open; you're right that there may be a variety of problems going on, but the general theme is that gunicorn workers (maybe just with the sync worker class, tbd) are having timeout issues when using gunicorn with an application deployed using AWS/ELB. There is something wrong here, even if we haven't isolated the exact problem. It's worth digging further. I'll update tomorrow.

Worth noting, I experienced this issue locally, not within Docker/AWS/ELB, with the stack I mentioned above.

So I'm using gevent, trying to capture my access logs, and trying to deploy my app using AWS/ELB.

dockerfile commands:

FROM cortex/base

CMD ["gunicorn", "-w", "8", "-b", "0.0.0.0:5000", "--capture-output", "--enable-stdio-inheritance", "--access-logfile", "-", "--worker-class", "gevent", "-t", "60", "--log-level", "debug", "app:app"]

I'm getting crazy OOM errors in relation to Error code 14, regardless of what I set the limits to. Has anyone else seen this in the context of this issue? Previously, I was able to see worker timeouts. With my current configs, I can't reproduce the timeouts / get that far.

[2017-07-19 18:27:39 +0000] [34] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/gunicorn/arbiter.py", line 578, in spawn_worker
    worker.init_process()
  File "/usr/local/lib/python3.5/site-packages/gunicorn/workers/ggevent.py", line 190, in init_process
    super(GeventWorker, self).init_process()
  File "/usr/local/lib/python3.5/site-packages/gunicorn/workers/base.py", line 126, in init_process
    self.load_wsgi()
  File "/usr/local/lib/python3.5/site-packages/gunicorn/workers/base.py", line 135, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/usr/local/lib/python3.5/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/usr/local/lib/python3.5/site-packages/gunicorn/app/wsgiapp.py", line 65, in load
    return self.load_wsgiapp()
  File "/usr/local/lib/python3.5/site-packages/gunicorn/app/wsgiapp.py", line 52, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/usr/local/lib/python3.5/site-packages/gunicorn/util.py", line 376, in import_app
    __import__(module)
  File "/code/app/__init__.py", line 2, in <module>
    from flask_alchy import Alchy
  File "/usr/local/lib/python3.5/site-packages/flask_alchy.py", line 8, in <module>
    from flask_sqlalchemy import SQLAlchemy
  File "/usr/local/lib/python3.5/site-packages/flask_sqlalchemy/__init__.py", line 18, in <module>
    import sqlalchemy
  File "/usr/local/lib/python3.5/site-packages/sqlalchemy/__init__.py", line 9, in <module>
    from .sql import (
  File "/usr/local/lib/python3.5/site-packages/sqlalchemy/sql/__init__.py", line 8, in <module>
    from .expression import (
  File "/usr/local/lib/python3.5/site-packages/sqlalchemy/sql/expression.py", line 34, in <module>
    from .functions import func, modifier, FunctionElement, Function
  File "/usr/local/lib/python3.5/site-packages/sqlalchemy/sql/functions.py", line 11, in <module>
    from . import sqltypes, schema
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 658, in exec_module
  File "<frozen importlib._bootstrap_external>", line 763, in get_code
  File "<frozen importlib._bootstrap_external>", line 816, in get_data
OSError: [Errno 14] Bad address

To reiterate, I don't see this locally without docker. I don't see this locally with Docker. This, and timeout issues, only become apparently in the Docker/AWS/ELB stack (for me).

@tilgovi One thing I initially did wrong (as can be seen in the snippet I posted above) is that I had the timeout wrongly set to 10 seconds. After changing that to 75 seconds the number of worker timeouts went down significantly to perhaps a couple per day. For completeness this is the exact command I am using to start our app:

gunicorn -k gevent --timeout 90 --keep-alive 75 --paste /etc/jzoo/app.ini "$@"

The .ini file looks like this:

[DEFAULT]
sqlalchemy.url = postgresql://XXXX

[app:main]
use = egg:jzoo.api.admin

[server:main]
use = egg:gunicorn#main
bind = 0.0.0.0:5000
workers = 4
preload_app = true
loglevel = warning

# Logging configuration
[loggers]
keys = root,sqlalchemy,alembic,jzoo

# .. more logging configuration

The amount of traffic does not seem to be a factor - I see this happening in our QA environment, which sees very little traffic.

Looking at how often this happens I am wondering if this could be related to app restarts?

I have a somewhat similar problem. After some investigation, I think it's probably a network connection error, not gunicorn's fault.

Here's my setup:

  • Postgresql server runs on Docker on EC2. The docker image is the official postgres 9.6.
  • App server runs on Docker in a VM outside EC2. The docker image is the official python 3.6.
  • Only port 5432 / tcp is opened in AWS security group.

Funnily, most requests from the app server to the postgres server were OK, except one. That query seems to run forever, which times out the gunicorn worker.

The problem is gone if I use the app locally.

The problem is also gone if I don't dockerize the app server.

My temporary fix is to open all connections between the app server and the postgres server through the AWS security group settings. Now everything works properly again.

My guess is that there's a connection problem between the Python container and AWS. I'll update after examining the AWS log stream thoroughly.

Thank you!! Changing ELB and Instance protocol from HTTP to TCP worked just fine 👍

The problem doesn't seem specific to AWS, but to Docker itself when combined with Gunicorn. I see this with gevent as well, at least in our production environment.

@tilgovi Here's a simple app to recreate this error (every now and then, haven't found the pattern, perhaps you might):

app.py:

import falcon


class HelloResource:
    def on_get(self, req, resp):
        resp.media = {'info': 'hello world'}


app = falcon.API()
app.add_route('/', HelloResource())

Dockerfile:

FROM python:3.6
WORKDIR /code
RUN pip install gunicorn falcon
COPY app.py .
CMD ["gunicorn", "app:app", "--bind", "0.0.0.0:8000"]

To run: docker build -t gunicorn-and-docker . && docker run -p 8000:8000 -it --rm gunicorn-and-docker

The logs will look like this (at least for me on macOS Sierra using Docker 17.06.2-ce-mac27 (19124)):

[2017-09-23 22:31:00 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2017-09-23 22:31:00 +0000] [1] [INFO] Listening at: http://0.0.0.0:8000 (1)
[2017-09-23 22:31:00 +0000] [1] [INFO] Using worker: sync
[2017-09-23 22:31:00 +0000] [9] [INFO] Booting worker with pid: 9
[2017-09-23 23:22:45 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:9)
[2017-09-23 23:22:45 +0000] [9] [INFO] Worker exiting (pid: 9)
[2017-09-23 23:22:45 +0000] [11] [INFO] Booting worker with pid: 11
[2017-09-23 23:39:46 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:11)
[2017-09-23 23:39:46 +0000] [11] [INFO] Worker exiting (pid: 11)
[2017-09-23 23:39:46 +0000] [13] [INFO] Booting worker with pid: 13
[2017-09-23 23:41:10 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:13)
[2017-09-23 23:41:10 +0000] [13] [INFO] Worker exiting (pid: 13)
[2017-09-23 23:41:10 +0000] [15] [INFO] Booting worker with pid: 15
[2017-09-23 23:42:27 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:15)
[2017-09-23 23:42:27 +0000] [15] [INFO] Worker exiting (pid: 15)
[2017-09-23 23:42:27 +0000] [17] [INFO] Booting worker with pid: 17
[2017-09-23 23:43:44 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:17)
[2017-09-23 23:43:44 +0000] [17] [INFO] Worker exiting (pid: 17)
[2017-09-23 23:43:44 +0000] [19] [INFO] Booting worker with pid: 19
[2017-09-24 18:37:12 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:19)
[2017-09-24 18:37:12 +0000] [19] [INFO] Worker exiting (pid: 19)
[2017-09-24 18:37:12 +0000] [21] [INFO] Booting worker with pid: 21
[2017-09-24 19:49:20 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:21)
[2017-09-24 19:49:20 +0000] [21] [INFO] Worker exiting (pid: 21)
[2017-09-24 19:49:20 +0000] [23] [INFO] Booting worker with pid: 23
[2017-09-24 19:50:37 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:23)
[2017-09-24 19:50:37 +0000] [23] [INFO] Worker exiting (pid: 23)
[2017-09-24 19:50:37 +0000] [25] [INFO] Booting worker with pid: 25
[2017-09-24 20:25:48 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:25)
[2017-09-24 20:25:48 +0000] [25] [INFO] Worker exiting (pid: 25)
[2017-09-24 20:25:48 +0000] [27] [INFO] Booting worker with pid: 27

Hi,

I've recently bumped into the same problem.
In my case (ECS + gunicorn 19.7.1 + Django 1.11.5) I (apparently) managed to solve it by simply increasing the amount of memory dedicated to the container in the task definition (from 128 to 256mb).
No more errors so far. I'll report here if it happens again.

Adding to the chorus here a bit, but I was able to solve this only by switching to gevent.

I tried to increase timeout to 90 and keepalive to 75 with the sync worker with no luck.

Also, interestingly, my ELB in this scenario is only doing TCP level forwarding and not HTTP.

Okay, I found some references that explain this. [1]

It could be solved with small changes on the Gunicorn side. If, in the sync worker, after accepting a new connection, we called select() with a short timeout then we could handle these connections. It could get a little complicated, especially with multiple listening sockets. The simplest thing would be to have a timeout to first byte. I'm not sure how that would interact with SSL, either.

[1] https://cloudavail.com/2013/12/21/aws-elb-pre-open-connection-expose-part-1/
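
A rough sketch of the "timeout to first byte" idea mentioned above (plain sockets only, ignoring SSL and multiple listeners; this is not a patch against gunicorn):

# After accept(), wait briefly for the first byte before handing the socket to
# the request parser; a pre-opened ELB connection that never sends anything is
# dropped instead of blocking the sync worker until the worker timeout fires.
import select

def wait_for_first_byte(client_sock, timeout=5.0):
    readable, _, _ = select.select([client_sock], [], [], timeout)
    if not readable:
        client_sock.close()   # nothing arrived within the timeout
        return False
    return True               # data is waiting; safe to parse the request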

If you are experiencing this and do not want to use gevent/eventlet/asyncio, I recommend switching to the threaded worker (provide the --threads setting).
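
For example (illustrative values only):

gunicorn --workers 2 --threads 4 app:app

With --threads set above 1, gunicorn uses the gthread worker, so a connection held open idle no longer pins an entire worker process.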

@tilgovi The issue seems to exist with gevent worker as well, if you look at my example above.

@jmagnusson your example (from this comment: https://github.com/benoitc/gunicorn/issues/1194#issuecomment-331740187) seems to be using the sync worker.

[2017-09-23 22:31:00 +0000] [1] [INFO] Using worker: sync

@tilgovi True, but I've experienced this with the gevent worker as well, in production. I'll try to find some time for reproducing with gevent.

This may just be me (silly), but in the EC2 security group inbound rules I had a custom TCP rule for port 5000 (gunicorn here) allowing only a single IP (my IP: XX.XX.XX.XX/32); changing this to allow all IPs helps.

I'm running into this trouble, too.

[2018-01-02 16:38:03 +0800] [24355] [INFO] Starting gunicorn 19.7.1
[2018-01-02 16:38:03 +0800] [24355] [DEBUG] Arbiter booted
[2018-01-02 16:38:03 +0800] [24355] [INFO] Listening at: http://0.0.0.0:8080 (24355)
[2018-01-02 16:38:03 +0800] [24355] [INFO] Using worker: gevent
[2018-01-02 16:38:03 +0800] [24358] [INFO] Booting worker with pid: 24358
[2018-01-02 16:38:03 +0800] [24355] [DEBUG] 1 workers
[2018-01-02 16:38:03 +0800] [24358] [INFO] worker pid 24358 notify
[2018-01-02 16:38:04 +0800] [24358] [INFO] worker pid 24358 notify
[2018-01-02 16:38:05 +0800] [24358] [INFO] worker pid 24358 notify
[2018-01-02 16:38:06 +0800] [24358] [INFO] worker pid 24358 notify
[2018-01-02 16:38:07 +0800] [24358] [INFO] worker pid 24358 notify
[2018-01-02 16:38:08 +0800] [24358] [INFO] worker pid 24358 notify
[2018-01-02 16:38:09 +0800] [24358] [INFO] worker pid 24358 notify
[2018-01-02 16:38:10 +0800] [24358] [INFO] worker pid 24358 notify
[2018-01-02 16:38:10 +0800] [24358] [DEBUG] GET /v1/bj2/CC/uuid
[2018-01-02 16:38:10 +0800] [24358] [DEBUG] Closing connection. 
[2018-01-02 16:38:41 +0800] [24355] [CRITICAL] WORKER TIMEOUT (pid:24358)
[2018-01-02 16:38:41 +0800] [24358] [INFO] Worker exiting (pid: 24358)
[2018-01-02 16:38:41 +0800] [24381] [INFO] Booting worker with pid: 24381
[2018-01-02 16:38:41 +0800] [24381] [INFO] worker pid 24381 notify
[2018-01-02 16:38:42 +0800] [24381] [INFO] worker pid 24381 notify
[2018-01-02 16:38:43 +0800] [24381] [INFO] worker pid 24381 notify
[2018-01-02 16:38:44 +0800] [24381] [INFO] worker pid 24381 notify
[2018-01-02 16:38:45 +0800] [24381] [INFO] worker pid 24381 notify
[2018-01-02 16:38:46 +0800] [24381] [INFO] worker pid 24381 notify
[2018-01-02 16:38:47 +0800] [24381] [INFO] worker pid 24381 notify
......
[2018-01-02 16:48:20 +0800] [24381] [INFO] worker pid 24381 notify
[2018-01-02 16:48:51 +0800] [24355] [CRITICAL] WORKER TIMEOUT (pid:24381)
[2018-01-02 16:48:51 +0800] [24381] [INFO] Worker exiting (pid: 24381)
[2018-01-02 16:48:51 +0800] [24703] [INFO] Booting worker with pid: 24703
[2018-01-02 16:48:51 +0800] [24703] [INFO] worker pid 24703 notify
[2018-01-02 16:48:51 +0800] [24703] [DEBUG] GET /v1/bj2/CC/uuid
[2018-01-02 16:48:51 +0800] [24703] [DEBUG] Closing connection. 

I added the debug log line worker pid {WORKER_PID} notify.
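
(Presumably the line was added where the worker heartbeats, e.g. something like the sketch below in gunicorn/workers/base.py; this is my guess at the patch, not the poster's actual change.)

import os  # already imported in gunicorn/workers/base.py

def notify(self):
    # hypothetical: log every heartbeat so it shows up in the output above
    self.log.info("worker pid %s notify", os.getpid())
    self.tmp.notify()   # touch the temp file the arbiter's timeout check watches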

According to the debug log,

  1. Once the worker process receives a request, the notifications stop. So, whether or not further requests arrive, the master will kill the worker process after 30s.
  2. Once the first request was closed, I immediately sent a second request, but it was blocked and there was no response. When the 30s timeout was reached, the worker process was killed and the new worker process responded to the second request. Then the server hung again.
  3. While the worker process is hung, it does not receive signals from the master, such as SIGTERM. When the graceful timeout is reached, the master kills it forcibly with SIGKILL.
Python: 3.6.3
Gevent: 1.2.2
Gunicorn: 19.7.1
RUN CMD: gunicorn --worker-class gevent --log-level debug --bind 0.0.0.0:8080 app

NOTICE: When I use eventlet as the worker, it's OK. I guess it's a problem with gevent or the gevent worker.

I'm going to close this issue since I think it's sort of out of date. As best I can determine, the observed behavior is due to AWS Classic Load Balancers doing a pre-open optimization. This optimization is only done by Classic Load Balancers, according to the documentation on how load balancing works. Switching to Application Load Balancers should resolve the issue. Additionally, as documented in this blog post, it's possible to request AWS to disable the pre-open optimization or to increase the worker timeout for Gunicorn.

@tilgovi FWIW I saw this with ALB. I've never used classic load balancers.

@tilgovi I too observed this repeatedly in ALB-only environments, never having used classic. Even with worker timeout adjustments, we continued to see this.

@tilgovi I met with this. I've never used AWS.

Someone has to help reproduce this, then, and try to debug it. I cannot reproduce it with ALB and I cannot see how to reproduce it outside AWS.

@xgfone you may want to open your own issue, the title of this issue is "docker + AWS"

Did anyone experience this with the AWS load balancers but without Docker? I'm a little confused, I have to admit :)

My impression was that the issue was really more related to Docker - at least I had exactly the same issues running multiple gunicorn sync workers inside Docker on my local machine.

This config worked fine without Docker on Ubuntu and RHEL machines, both bare metal and virtualized in Digital Ocean:

workers = 2
worker_class = 'sync'
worker_connections = 1000
timeout = 30
keepalive = 2

Inside Docker (also RHEL, everything else the same), the workers always ran into timeout and kept restarting permanently. Taking the recommendations from this thread, I've switched to this, which works perfectly fine:

workers = 1
threads = 8
worker_class = 'sync'
worker_connections = 1000
timeout = 30
keepalive = 2

If you think this is unrelated to this thread, I'm happy to remove it from here and put it someplace else ;-)

Best, Boris

@tilgovi

This error occurs in arbiter.py, in murder_workers. The murder_workers function is invoked when SIG_QUEUE is empty (i.e. while waiting for signals).

It then checks every second whether the timeout has been reached. The time remaining to live was always 1 second or 0 seconds, because the gevent worker sits in a controlled loop (while self.alive) that calls self.notify(), which updates a file. Those file updates are how gunicorn tracks when the worker was last active.
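A loose sketch of that check (simplified, not the actual gunicorn source), assuming each worker heartbeats by touching a temp file:

import os
import signal
import time

def murder_workers(workers, timeout):
    # workers: hypothetical dict mapping pid -> path of that worker's heartbeat file
    for pid, heartbeat_file in list(workers.items()):
        last_beat = os.stat(heartbeat_file).st_ctime  # bumped by worker.notify()
        if time.time() - last_beat > timeout:
            # the worker looks stuck: abort it so the arbiter can boot a replacement
            os.kill(pid, signal.SIGABRT)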

Thus, the only way the time-to-live check in murder_workers can fail (i.e. decide the worker has been idle too long) is if the gevent worker's alive property has been changed. The following methods control this:

  • handle_quit
  • handle_request (if the maximum request count was reached; by default this is 2^64 requests, or 2^32 on a 32-bit OS)
  • changed (invoked from the reloader if a file was changed)
  • handle_exit
  • handle_abort

These methods are invoked by signals:

  • handle_quit - SIGQUIT, SIGINT
  • handle_exit - SIGTERM
  • handle_abort - SIGABRT
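All of these handlers end the worker in essentially the same way: they flip the flag that the while self.alive loop checks, after which notify() is no longer called and the heartbeat file stops updating. A simplified illustration (not the exact gunicorn code):

class WorkerSketch:
    alive = True

    def handle_quit(self, sig, frame):
        # once alive is False the run loop exits and the heartbeat stops
        self.alive = False

    # in this sketch handle_exit and handle_abort have the same net effect
    handle_exit = handle_abort = handle_quit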

Thus, there are 2 possible problems that I see:

  1. The notify method is not able to update the file during its loop
  2. A signal is getting sent incorrectly

I think the more probable cause is the 2nd one. The main source of signals is within the arbiter file and triggered by these exceptions:

  • StopIteration
  • KeyboardInterrupt
  • HaltServer
  • Exception (unhandled exception)

I think the StopIteration exception is the most likely culprit, triggered in async.py at req = six.next(parser). It appears that this next method just invokes parser.next().

Parser.next() raises a StopIteration exception on the following 2 conditions:

1. If the message indicates it should close:
    1. http Connection close (and keep-alive false)
    2. WSGI - the request method is not HEAD, response_length is None, and the status code is > 200 (but not 204 or 304)
2. If self.mesg (the request message) is falsy
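To make that concrete, here is a toy illustration (not gunicorn's actual code) of the pattern: the parser raising StopIteration is treated as the normal end of the connection, which matches the "Closing connection." debug lines in the logs above:

def handle_connection(parser, log):
    # mimics the `req = six.next(parser)` loop described above
    try:
        while True:
            req = next(parser)
            log.append("handled %s" % req)
    except StopIteration:
        log.append("Closing connection.")
    return log

print(handle_connection(iter(["GET /a", "GET /b"]), []))
# ['handled GET /a', 'handled GET /b', 'Closing connection.']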

I'm not sure what exactly could be causing that unintentionally, but with 62 comments and this behavior being seen across multiple apps and environments, I think it is worth investigating further. I recommend re-opening this issue or, if we don't like the title, creating a new umbrella issue for workers critically timing out.

I'll re-open this and try to test these combinations:

  • Docker local
  • Docker AWS with Classic Load Balancer
  • Docker AWS with Application Load Balancer

If anyone wants to create a repo that can reproduce this, that would be helpful, but I can probably do it this weekend.

I'm having this problem and I'm not using AWS.
My setup is NGINX -> Docker container (Gunicorn with gevent workers + Django). I'm using Gunicorn's default timeouts.

Any tips?

Thanks.

Hi @davidmir,
I had the same problem, changing the worker type did not help. Switching to a single sync worker with multiple threads did the trick for me:

workers = 1
threads = 8
worker_class = 'sync'
worker_connections = 1000
timeout = 30
keepalive = 2

Best, Boris

While it has been mentioned above, to clarify from my own debugging of this issue: some of these are out-of-memory errors, especially if the problem goes away after reducing the worker count to 1.

This is especially problematic on Docker if you have something like this in your config:

workers = multiprocessing.cpu_count() * 2 + 1

But are limiting the memory available to the container: the sync worker setting will start multiple potentially heavyweight Python interpreters. Note that the CPU count is based on the overall CPU count of the machine, not what you've provisioned the Docker container to use; if you're running a limited container, you should base the worker count on that instead (a sketch follows below).

While it varies from application to application, my rule of thumb is 256 megs per worker in a lean configuration and 512 to be safe.
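As an illustration of basing the worker count on the container rather than the host, here is a hedged gunicorn.conf.py sketch. It assumes Linux; os.sched_getaffinity reflects cpusets and affinity masks but not CFS quotas, so treat it as a starting point rather than a rule:

import multiprocessing
import os

def _available_cpus():
    try:
        # CPUs this process may actually run on; respects container cpusets,
        # unlike multiprocessing.cpu_count()
        return len(os.sched_getaffinity(0))
    except AttributeError:
        return multiprocessing.cpu_count()

workers = _available_cpus() * 2 + 1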

Faced the same issue using a Classic ELB (HTTPS => HTTP) in front of Gunicorn sync workers.

As others mentioned, the easiest fix was to use the -k gevent argument instead of the default sync worker.

We saw this same issue crop up again for an entirely new Docker container (hosted in ECS behind an ALB), and found an entirely different (but in hindsight, should've-been-obvious) solution: allocate more memory to the container in the CloudFormation template.

We suffered from this issue last year and never quite found a smoking gun root cause, but tonight I spotted it on a new container, and happened to notice that the container was eating up all the memory we'd allocated. As soon as I doubled the allocation, the containers stopped throwing this error (even before I allocated more than it would ultimately consume):
https://github.com/hackoregon/civic-devops/issues/157

Facing a similar issue here. The code is as simple as

import time

while True:
    time.sleep(10)

which can lead to the worker dying (without this code our deploy is normal).
It is very puzzling, because sleep should allow the heartbeat to be sent.

I have also already changed to gevent; it did not help. Does anyone have any thoughts?

sample output:
[2018-06-16 16:00:16 +0000] [136] [CRITICAL] WORKER TIMEOUT (pid:521)
[2018-06-16 16:00:16 +0000] [136] [CRITICAL] WORKER TIMEOUT (pid:522)
[2018-06-16 16:00:16 +0000] [136] [CRITICAL] WORKER TIMEOUT (pid:523)
[2018-06-16 16:00:16 +0000] [136] [CRITICAL] WORKER TIMEOUT (pid:524)
[2018-06-16 16:00:16 +0000] [136] [CRITICAL] WORKER TIMEOUT (pid:525)
[2018-06-16 16:00:16 +0000] [524] [INFO] Worker exiting (pid: 524)
[2018-06-16 16:00:16 +0000] [521] [INFO] Worker exiting (pid: 521)
[2018-06-16 16:00:16 +0000] [523] [INFO] Worker exiting (pid: 523)
[2018-06-16 16:00:16 +0000] [522] [INFO] Worker exiting (pid: 522)
[2018-06-16 16:00:16 +0000] [525] [INFO] Worker exiting (pid: 525)
[2018-06-16 16:00:17 +0000] [531] [INFO] Booting worker with pid: 531
/usr/local/lib/python3.6/site-packages/gunicorn/workers/ggevent.py:65: MonkeyPatchWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. Please monkey-patch earlier. See https://github.com/gevent/gevent/issues/1016
  monkey.patch_all(subprocess=True)
[2018-06-16 16:00:17 +0000] [136] [DEBUG] 1 workers
[2018-06-16 16:00:17 +0000] [532] [INFO] Booting worker with pid: 532
[2018-06-16 16:00:17 +0000] [533] [INFO] Booting worker with pid: 533
/usr/local/lib/python3.6/site-packages/gunicorn/workers/ggevent.py:65: MonkeyPatchWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. Please monkey-patch earlier. See https://github.com/gevent/gevent/issues/1016
  monkey.patch_all(subprocess=True)
[2018-06-16 16:00:17 +0000] [534] [INFO] Booting worker with pid: 534
/usr/local/lib/python3.6/site-packages/gunicorn/workers/ggevent.py:65: MonkeyPatchWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. Please monkey-patch earlier. See https://github.com/gevent/gevent/issues/1016
  monkey.patch_all(subprocess=True)
[2018-06-16 16:00:17 +0000] [535] [INFO] Booting worker with pid: 535
[2018-06-16 16:00:17 +0000] [136] [DEBUG] 5 workers
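As a side note on the MonkeyPatchWarning in that output: it is gevent complaining that ssl was already imported before monkey.patch_all() ran. A commonly suggested (though not guaranteed) workaround is to patch as early as possible, for example at the top of the gunicorn config file; the values below are only illustrative:

# gunicorn.conf.py
from gevent import monkey
monkey.patch_all()  # patch before anything else pulls in ssl

workers = 5
worker_class = "gevent"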

Sorry to muddy the waters here, but I just wanted to share my experience. My stack is HTTP ELB > Nginx > gunicorn/gevent on EC2, with processes run using circusd.

The problem I found was that after instrumenting my Django app with elastic-apm, my HTTP healthchecks would start failing almost immediately and gunicorn would become unresponsive after a few seconds.

I tried a few local wrk benchmarks against my healthcheck: sync was getting about 300rps with 1-5 workers, but gevent would only get 1rps. Turning off APM brought gevent up to about 50rps, with a negligible increase for sync.

Didn't notice any strange memory usage in valgrind; typically about 4MB for gunicorn and 40MB per worker.

Anyway, updating gunicorn 19.6.0 > 19.9.0 and gevent 1.2.2 > 1.3.6 is giving me a rock-steady ELB healthcheck again and ~300rps locally. I deliberately hadn't been updating gunicorn/gevent until I had some APM logging in place; it seems that doing so has prompted some serendipity. Updating gevent gave the greatest boost, understandably, but updating gunicorn then seemed to add another 10% or so.

Haven't looked at the changelogs in much detail yet, perhaps the maintainers can shed light on most likely cause?

Anyway, updating gunicorn 19.6.0 > 19.9.0 and gevent 1.2.2 > 1.3.6 is giving me a rock-steady ELB healthcheck again and ~300rps locally....Updating gevent gave the greatest boost

Wow, that's pretty cool!

Closing this issue; there has been no activity for a while.

FWIW, I'm still having this issue @benoitc :-\

Django 2.1.5
gunicorn 19.8.1
gevent 1.2.2
AKS 1.15.7 with 5 Nodes (standard B2's - 2 vcores and 2GB memory)
3 pods/docker containers, each of which has:

    spec:
      containers:
        - name: web-container
          command: ["newrelic-admin"]
          args: [
              "run-program",
              "gunicorn",
              "--limit-request-field_size=16380", # limit also specified in nginx
              "--workers=3", # (2 x vcpu) + 1. vcpu < limit defined in resources
              "-k=gevent",
              "-b=0.0.0.0:8000",
              "ourapp.wsgi",
              "--timeout 60",
              "--log-level=debug",
            ]
          resources:
            requests:
              memory: "200Mi"
              cpu: "250m"
            limits:
              memory: "3000Mi"
              cpu: "2000m"

Same issue here with EKS, Nginx, and an NLB.

@salmicrosoft @sinkr OK, but please open a new issue and provide any logs you have that could help us figure out what happened. Also, what does your app do? Do you have any long requests?

I had the same issue on ECS Fargate, but it was related to incorrect keepalive/timeout settings in Docker/gunicorn.conf.py; they were
keepalive (not set)
timeout=2

Before the fix it worked well with a container running locally but failed on ECS Fargate; now it works correctly in both environments.
