Gunicorn: Instant worker timeouts after long-running requests

Created on 8 Aug 2013  ·  45 Comments  ·  Source: benoitc/gunicorn

We have been experiencing a very odd issue involving request timeouts with Gunicorn running in front of our Django application on Heroku.

In essence, the issue occurs when a very long-running request is made to the application, causing it to be cut off by Heroku's platform-level request timeout after 30 seconds. This in itself is not the problem - such long-running requests should be timing out. The actual issue occurs on a subsequent request. There appears to be a chance that a request occurring _after_ this long-running request will instantaneously throw a Gunicorn error of the form:

[CRITICAL] WORKER TIMEOUT (pid:15)

This results in an Internal Server Error page being returned to the client straight away. This behaviour is not consistent - sometimes the next request will succeed, sometimes it will not. My suspicion is that the error will occur if the request is handled by the same Gunicorn worker that was dealing with the long-running request.

We are running Gunicorn directly from Django as follows:

python app/manage.py run_gunicorn -b 0.0.0.0:$PORT -w 2 -k gevent

Other relevant configuration parameters, such as worker_connections and max_requests, are set at their defaults. However, we have experimented with changing these settings, as well as the number of workers, but the problem described above occurs in all cases that we have tried. Our current Gunicorn version is 0.14.6, having rolled back after we upgraded to 17.5 in case that was the source of the issue. Unfortunately the same behaviour occurs on both versions.

We are having great difficulty understanding the root cause of this issue. From the Heroku documentation linked above:

While the router has returned a response to the client, your application will not know that the request it is processing has reached a time-out, and your application will continue to work on the request.

I assume this means that despite the fact that the original long-running request has timed out from the perspective of the client, a Gunicorn worker is still processing the request. What baffles me is how this would cause a new, subsequent request to throw a worker timeout straight away, given that the workers are using gevent and thus asynchronous threads to handle requests. I would have thought that the first, long-running request would be

  1. non-blocking due to the fact that there are multiple worker gevent threads, and
  2. killed by Gunicorn before any subsequent requests would occur, since the default request timeout for Gunicorn is 30 seconds - identical to Heroku's own request timeout.

It may be worth mentioning that the long-running requests are connecting to our PostgreSQL database and performing several complex queries. The crashes seem to occur only when the long-running request is cut off by Heroku while it is interfacing with the DB. We tested this on our test server by creating two separate views, one of which executes long queries against the database until a Heroku timeout is reached, and the other of which simply waits past the 30-second mark to produce the Heroku timeout. The former view was able to cause Gunicorn worker timeouts on the next request, whereas the 'sleeping' view was not.
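
For context, a rough sketch of those two test views (illustrative names and queries, not our actual code; pg_sleep is just a convenient stand-in for a slow query):

    import time

    from django.db import connection
    from django.http import HttpResponse

    def long_query_view(request):
        # Keep the worker busy inside the database driver until Heroku's
        # 30-second router timeout cuts the request off.
        cursor = connection.cursor()
        for _ in range(10):
            cursor.execute("SELECT pg_sleep(5)")
        return HttpResponse("done")

    def sleep_view(request):
        # Simply exceed the router timeout without touching the database.
        time.sleep(35)
        return HttpResponse("done")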

It would be fantastic if anyone has any insight into the underlying behaviour causing this problem - any additional information that I can provide to help diagnose the issue, please let me know. I hope that this is the correct way to handle such an issue and that it has not been raised/addressed before.

All 45 comments

@Jwpe Excellent bug report.

The first thing I would look into is whether you're using a green'ed PostgreSQL driver. There used to be a specific version that you needed to use, with a monkey patch, to make this work correctly. I don't know the current state of things now. But the timeout you pasted is a worker timeout, and in one of the async workers that basically boils down to "a greenlet has not yielded for over $timeout seconds". Given that you're describing long-running PostgreSQL requests, that's the first place I'd investigate.

@davisp Thanks for the rapid response on this!

Following your pointer, I had a look at the psycopg2 adapter - which we use to connect our Django app to Postgres - and discovered this section of the documentation which states:

Warning: Psycopg connections are not green thread safe and can’t be used concurrently by different green threads. Trying to execute more than one command at time using one cursor per thread will result in an error (or a deadlock on versions before 2.4.2).
Therefore, programmers are advised to either avoid sharing connections between coroutines or to use a library-friendly lock to synchronize shared connections, e.g. for pooling.

In other words - psycopg2 doesn't like green threads. Based on the behaviour we encountered, I would guess that this is the source of the error. The suggested way to deal with this issue, according to the psycopg docs, is to use a library which enables psycopg support for coroutines. The recommended library is psycogreen.

I will attempt to recreate the issue on our test server using the psycogreen library, and see if it solves the problem. Hopefully I should be able to report some results relatively soon.
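
For anyone hitting this later, a minimal sketch of what that patching can look like, assuming a gunicorn config file loaded with -c (the file name and log message are illustrative; post_fork is gunicorn's standard server hook and patch_psycopg comes from psycogreen):

    # gunicorn_conf.py -- illustrative config, loaded with: gunicorn -c gunicorn_conf.py app.wsgi
    from psycogreen.gevent import patch_psycopg

    def post_fork(server, worker):
        # Patch psycopg2 after each worker forks so its blocking calls yield
        # to the gevent hub instead of stalling the whole worker.
        patch_psycopg()
        worker.log.info("psycopg2 patched for gevent in worker %s", worker.pid)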

@Jwpe That looks like the hook that I was remembering.

@benoitc Perhaps we should call this out in Gunicorn docs somewhere? Not entirely sure where the best place would be though.

@davisp maybe in an FAQ section?

hmppff, forgot it was already existing (http://docs.gunicorn.org/en/latest/faq.html). So maybe we could have a link to a "troubleshooting" section from there. Not sure, but we could also move into such a section any information about virtualenv and other tricks.

Maybe a Caveats or "Notes about Green Threads" page. Not super pressing.

I was seeing virtually the same behavior on Heroku running gunicorn with gevent workers plus some others, so I switched to synchronous workers, as recommended on this thread, and was still seeing basically the same serious problem. I would get an H12 30-second Heroku cutoff and then repeated H12s and [CRITICAL] WORKER TIMEOUT errors until I reset the entire dyno (sometimes an hour or more later, or until max-requests was hit). So I adjusted the gunicorn timeout down to 28 seconds so it would time out before Heroku cut it off. The same (or a very similar) problem happens once or twice a day and persists until the master process is restarted, only this time it begins with an H13 (connection closed) since gunicorn is cutting it off. There is no significant spike in traffic during those times.

Here is my current procfile entry:

web: newrelic-admin run-program gunicorn publisher.wsgi -b 0.0.0.0:$PORT -w 4 --max-requests 1000 --timeout 28 --preload

Detail of the series of events:

First, I get a few requests that seem to take really long (5+ seconds), then a request fails with an H12 timeout (the worker timeout cuts it off) and a few more requests will finish, but with exceptionally long times (20 seconds). From then on it is pure H12 30-second Heroku cutoffs until I restart the dyno.

We use gunicorn (v 18.0) as our webserver (which runs python/django).

We run newrelic, which shows the downtime and insanely elevated request queueing times, but offers no other insight. There were no spikes in throughput or other abnormal conditions that I can see in NR. We use papertrail for log processing and to send out error emails.

Oct 15 15:08:53 nutrislice-stockton heroku/router: at=info method=GET path=/marketingtools/api/slides/?format=json-p&callback=_jqjsp&_1381871332239= host=oldham.nutrislice.com fwd="74.138.24.95" dyno=web.2 connect=15ms service=216ms status=200 bytes=21
Oct 15 15:08:54 nutrislice-stockton heroku/router: at=info method=GET path=/menu/api/menutypes/?format=json-p&callback=_jqjsp&_1381871332232= host=oldham.nutrislice.com fwd="74.138.24.95" dyno=web.2 connect=2ms service=90ms status=200 bytes=231
Oct 15 15:08:56 nutrislice-stockton heroku/router: at=info method=GET path=/menu/api/schools/?format=json-p&callback=_jqjsp&_1381871323514= host=oldham.nutrislice.com fwd="74.138.24.95" dyno=web.2 connect=3ms service=94ms status=200 bytes=5986
Oct 15 15:09:03 nutrislice-stockton heroku/router: at=info method=HEAD path=/heartbeat/ host=stockton.nutrislice.com fwd="54.247.188.179" dyno=web.2 connect=3ms service=23ms status=200 bytes=0
Oct 15 15:09:13 nutrislice-stockton heroku/router: at=info method=GET path=/menu/api/settings/?format=json-p&callback=_jqjsp&_1381871237946= host=pcsb.nutrislice.com fwd="66.87.110.127" dyno=web.2 connect=5ms service=166ms status=200 bytes=468
Oct 15 15:09:20 nutrislice-stockton heroku/router: at=info method=GET path=/menu/api/settings/?format=json-p&callback=_jqjsp&_1381871323611= host=oldham.nutrislice.com fwd="74.138.24.95" dyno=web.2 connect=6ms service=183ms status=200 bytes=453
Oct 15 15:09:40 nutrislice-stockton heroku/router: at=info method=GET path=/ host=nps.nutrislice.com fwd="74.190.240.28" dyno=web.2 connect=1ms service=260ms status=200 bytes=35951
Oct 15 15:09:55 nutrislice-stockton heroku/router: at=info method=GET path=/menuwidgets/api/list/school-menu-profile/87/menu-type/43/?format=json-p&callback=jQuery18008709754704032093_1381871379465&_=1381871393589 host=nps.nutrislice.com fwd="74.190.240.28" dyno=web.2 connect=15ms service=129ms status=200 bytes=400
Oct 15 15:09:55 nutrislice-stockton heroku/router: at=info method=GET path=/menuwidgets/api/list/school-menu-profile/306/menu-type/187/?format=json-p&callback=jQuery180013075259909965098_1381873891397&_=1381873896600 host=sdhc.nutrislice.com fwd="72.186.96.121" dyno=web.2 connect=2ms service=33ms status=200 bytes=486
Oct 15 15:10:00 nutrislice-stockton heroku/router: at=info method=GET path=/menuwidgets/186/?smp=257 host=coppellisd.nutrislice.com fwd="76.199.114.157" dyno=web.2 connect=7ms service=103ms status=200 bytes=323
Oct 15 15:10:00 nutrislice-stockton app/web.2: INFO http://stockton.nutrislice.com/heartbeat/ Pinged from IP: 10.190.159.205 -- AGENT: NewRelicPinger/1.0 (269661)
Oct 15 15:10:00 nutrislice-stockton heroku/router: at=info method=HEAD path=/heartbeat/ host=stockton.nutrislice.com fwd="50.112.95.211" dyno=web.2 connect=1ms service=10ms status=200 bytes=0
Oct 15 15:10:09 nutrislice-stockton heroku/router: at=info method=GET path=/menuwidgets/239/?smp=341 host=edenpr.nutrislice.com fwd="75.73.177.139" dyno=web.2 connect=8ms service=334ms status=200 bytes=277
Oct 15 15:10:16 nutrislice-stockton heroku/router: at=info method=GET path=/menuwidgets/395/?smp=306 host=sdhc.nutrislice.com fwd="72.186.96.121" dyno=web.2 connect=1ms service=96ms status=200 bytes=245
Oct 15 15:10:20 nutrislice-stockton heroku/router: at=info method=GET path=/menuwidgets/391/?smp=305 host=sdhc.nutrislice.com fwd="173.170.34.126" dyno=web.2 connect=32ms service=5207ms status=200 bytes=290
Oct 15 15:10:22 nutrislice-stockton heroku/router: at=info method=GET path=/menuwidgets/350/?smp=305 host=sdhc.nutrislice.com fwd="173.170.34.126" dyno=web.2 connect=60ms service=7676ms status=200 bytes=1147
Oct 15 15:10:31 nutrislice-stockton heroku/router: at=info method=GET path=/menuwidgets/258/?smp=341 host=edenpr.nutrislice.com fwd="75.73.177.139" dyno=web.2 connect=42ms service=517ms status=200 bytes=26974
Oct 15 15:10:43 nutrislice-stockton heroku/router: at=info method=GET path=/menu/api/schools/?format=json-p&callback=_jqjsp&_1381871432885= host=ocps.nutrislice.com fwd="71.47.21.97" dyno=web.2 connect=1490ms service=9883ms status=200 bytes=1565
Oct 15 15:10:52 nutrislice-stockton heroku/router: at=error code=H13 desc="Connection closed without response" method=GET path=/ host=jordandistrict.nutrislice.com fwd="71.199.48.37" dyno=web.2 connect=1959ms service=29230ms status=503 bytes=0
Oct 15 15:10:52 nutrislice-stockton app/web.2: 2013-10-15 21:10:50 [2] [CRITICAL] WORKER TIMEOUT (pid:12)
Oct 15 15:10:52 nutrislice-stockton app/web.2: 2013-10-15 21:10:50 [2] [CRITICAL] WORKER TIMEOUT (pid:12)
Oct 15 15:10:52 nutrislice-stockton app/web.2: 2013-10-15 21:10:50 [26] [INFO] Booting worker with pid: 26
Oct 15 15:10:52 nutrislice-stockton app/web.2: 2013-10-15 21:10:50,930 (26/MainThread) newrelic.core.agent INFO - New Relic Python Agent (2.0.0.1)
Oct 15 15:10:54 nutrislice-stockton heroku/router: at=info method=GET path=/surveys/api/activesurveycount/?format=json-p&callback=_jqjsp&_1381871433429= host=henrico.nutrislice.com fwd="96.248.5.53" dyno=web.2 connect=1181ms service=20074ms status=200 bytes=32
Oct 15 15:10:55 nutrislice-stockton heroku/router: at=info method=GET path=/menu/api/schooltypes/?format=json-p&callback=_jqjsp&_1381871433374= host=henrico.nutrislice.com fwd="96.248.5.53" dyno=web.2 connect=1136ms service=20393ms status=200 bytes=142
Oct 15 15:11:01 nutrislice-stockton app/web.2: using heroku production settings
Oct 15 15:11:01 nutrislice-stockton app/web.2: WARNING /app/.heroku/python/lib/python2.7/site-packages/django/utils/hashcompat.py:9: DeprecationWarning: django.utils.hashcompat is deprecated; use hashlib instead
Oct 15 15:11:01 nutrislice-stockton app/web.2: DeprecationWarning)
Oct 15 15:11:01 nutrislice-stockton heroku/router: at=info method=GET path=/menu/api/settings/?format=json-p&callback=_jqjsp&_1381871432922= host=ocps.nutrislice.com fwd="71.47.21.97" dyno=web.2 connect=1435ms service=23198ms status=200 bytes=486
Oct 15 15:11:03 nutrislice-stockton app/web.2: WARNING /app/.heroku/python/lib/python2.7/site-packages/django/conf/urls/defaults.py:3: DeprecationWarning: django.conf.urls.defaults is deprecated; use django.conf.urls instead
Oct 15 15:11:03 nutrislice-stockton app/web.2: DeprecationWarning)
Oct 15 15:11:05 nutrislice-stockton heroku/router: at=info method=GET path=/menu/api/schooltypes/?format=json-p&callback=_jqjsp&_1381871443300= host=martinschools.nutrislice.com fwd="99.114.229.202" dyno=web.2 connect=1089ms service=20040ms status=200 bytes=268
Oct 15 15:11:10 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/menu/api/weeks/school-menu-profile/135/menu-type/63/2013/10/14/?format=json-p&callback=_jqjsp&_1381871439548= host=henrico.nutrislice.com fwd="96.248.5.53" dyno=web.2 connect=1018ms service=30001ms status=503 bytes=0
Oct 15 15:11:15 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/menu/api/sales/?format=json-p&callback=_jqjsp&_1381871443267= host=martinschools.nutrislice.com fwd="99.114.229.202" dyno=web.2 connect=1096ms service=30001ms status=503 bytes=0
Oct 15 15:11:15 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/menu/api/schools/?format=json-p&callback=_jqjsp&_1381871443296= host=martinschools.nutrislice.com fwd="99.114.229.202" dyno=web.2 connect=1108ms service=30000ms status=503 bytes=0
Oct 15 15:11:23 nutrislice-stockton heroku/router: at=info method=GET path=/menu/api/weeks/school-menu-profile/48/menu-type/21/2013/10/14/?format=json-p&callback=_jqjsp&_1381871449451= host=martinschools.nutrislice.com fwd="99.114.229.202" dyno=web.2 connect=1114ms service=31756ms status=200 bytes=48771
Oct 15 15:11:26 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/menu/api/sales/?format=json-p&callback=_jqjsp&_1381871455129= host=pcsb.nutrislice.com fwd="66.87.110.127" dyno=web.2 connect=990ms service=30001ms status=503 bytes=0
Oct 15 15:11:26 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/menu/api/schools/?format=json-p&callback=_jqjsp&_1381871455291= host=pcsb.nutrislice.com fwd="66.87.110.127" dyno=web.2 connect=1028ms service=30008ms status=503 bytes=0
Oct 15 15:11:31 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/menuwidgets/179/?smp=6 host=cusdnutrition.nutrislice.com fwd="68.99.246.16" dyno=web.2 connect=2492ms service=30000ms status=503 bytes=0
Oct 15 15:11:32 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/menuwidgets/192/?smp=6 host=cusdnutrition.nutrislice.com fwd="68.99.246.16" dyno=web.2 connect=2713ms service=30003ms status=503 bytes=0
Oct 15 15:11:39 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/menu/ host=hebisd.nutrislice.com fwd="38.107.226.1" dyno=web.2 connect=2115ms service=30001ms status=503 bytes=0
Oct 15 15:11:45 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/menu/api/weeks/school-menu-profile/44/menu-type/19/2013/10/14/?format=json-p&callback=_jqjsp&_1381871472583= host=pcsb.nutrislice.com fwd="66.87.110.127" dyno=web.2 connect=2168ms service=30000ms status=503 bytes=0
Oct 15 15:11:48 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/marketingtools/api/active-announcements/?format=json-p&callback=_jqjsp&_1381871476287= host=sdhc.nutrislice.com fwd="65.34.72.116" dyno=web.2 connect=1927ms service=30000ms status=503 bytes=0
Oct 15 15:11:48 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/surveys/api/activesurveycount/?format=json-p&callback=_jqjsp&_1381871476543= host=sdhc.nutrislice.com fwd="65.34.72.116" dyno=web.2 connect=2117ms service=30000ms status=503 bytes=0
Oct 15 15:11:48 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/menu/api/schooltypes/?format=json-p&callback=_jqjsp&_1381871476481= host=sdhc.nutrislice.com fwd="65.34.72.116" dyno=web.2 connect=2111ms service=30009ms status=503 bytes=0
Oct 15 15:11:50 nutrislice-stockton app/web.2: 2013-10-15 15:11:32,597 (26/NR-Activate-Session/nutrislice-stockton) newrelic.core.data_collector INFO - Successfully registered New Relic Python agent where app_name='nutrislice-stockton', pid=26, redirect_host='collector-2.newrelic.com' and agent_run_id=474482914, in 40.26 seconds.
Oct 15 15:11:50 nutrislice-stockton app/web.2: INFO Successfully registered New Relic Python agent where app_name='nutrislice-stockton', pid=26, redirect_host='collector-2.newrelic.com' and agent_run_id=474482914, in 40.26 seconds.
Oct 15 15:11:52 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/marketingtools/api/active-announcements/?format=json-p&callback=_jqjsp&_1381871480294= host=sdhc.nutrislice.com fwd="65.34.72.116" dyno=web.2 connect=1689ms service=30006ms status=503 bytes=0
Oct 15 15:11:55 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/menu/api/settings/?format=json-p&callback=_jqjsp&_1381871482566= host=henrico.nutrislice.com fwd="72.84.233.45" dyno=web.2 connect=2067ms service=30004ms status=503 bytes=0
Oct 15 15:11:57 nutrislice-stockton app/web.2: 2013-10-15 21:11:41 [2] [CRITICAL] WORKER TIMEOUT (pid:26)
Oct 15 15:11:57 nutrislice-stockton app/web.2: 2013-10-15 21:11:41 [2] [CRITICAL] WORKER TIMEOUT (pid:26)
Oct 15 15:11:57 nutrislice-stockton app/web.2: 2013-10-15 21:11:41 [29] [INFO] Booting worker with pid: 29
Oct 15 15:11:57 nutrislice-stockton app/web.2: 2013-10-15 21:11:41,067 (29/MainThread) newrelic.core.agent INFO - New Relic Python Agent (2.0.0.1)
Oct 15 15:11:57 nutrislice-stockton app/web.2: using heroku production settings
Oct 15 15:11:57 nutrislice-stockton app/web.2: WARNING /app/.heroku/python/lib/python2.7/site-packages/django/utils/hashcompat.py:9: DeprecationWarning: django.utils.hashcompat is deprecated; use hashlib instead
Oct 15 15:11:57 nutrislice-stockton app/web.2: DeprecationWarning)
Oct 15 15:11:57 nutrislice-stockton app/web.2: 2013-10-15 21:11:44 [2] [CRITICAL] WORKER TIMEOUT (pid:23)
Oct 15 15:11:57 nutrislice-stockton app/web.2: 2013-10-15 21:11:44 [2] [CRITICAL] WORKER TIMEOUT (pid:23)
Oct 15 15:11:57 nutrislice-stockton app/web.2: 2013-10-15 21:11:44 [32] [INFO] Booting worker with pid: 32
Oct 15 15:11:57 nutrislice-stockton app/web.2: 2013-10-15 21:11:44,154 (32/MainThread) newrelic.core.agent INFO - New Relic Python Agent (2.0.0.1)
Oct 15 15:11:57 nutrislice-stockton app/web.2: using heroku production settings
Oct 15 15:11:57 nutrislice-stockton app/web.2: WARNING /app/.heroku/python/lib/python2.7/site-packages/django/utils/hashcompat.py:9: DeprecationWarning: django.utils.hashcompat is deprecated; use hashlib instead
Oct 15 15:11:57 nutrislice-stockton app/web.2: DeprecationWarning)
Oct 15 15:11:57 nutrislice-stockton app/web.2: 2013-10-15 21:11:48 [2] [CRITICAL] WORKER TIMEOUT (pid:14)
Oct 15 15:11:57 nutrislice-stockton app/web.2: 2013-10-15 21:11:48 [2] [CRITICAL] WORKER TIMEOUT (pid:14)
Oct 15 15:11:57 nutrislice-stockton app/web.2: 2013-10-15 21:11:48 [35] [INFO] Booting worker with pid: 35
Oct 15 15:11:57 nutrislice-stockton app/web.2: 2013-10-15 21:11:48,273 (35/MainThread) newrelic.core.agent INFO - New Relic Python Agent (2.0.0.1)
Oct 15 15:11:57 nutrislice-stockton app/web.2: using heroku production settings
Oct 15 15:11:57 nutrislice-stockton app/web.2: WARNING /app/.heroku/python/lib/python2.7/site-packages/django/utils/hashcompat.py:9: DeprecationWarning: django.utils.hashcompat is deprecated; use hashlib instead
Oct 15 15:11:57 nutrislice-stockton app/web.2: DeprecationWarning)
Oct 15 15:11:57 nutrislice-stockton heroku/router: at=info method=GET path=/menuwidgets/353/?smp=306 host=sdhc.nutrislice.com fwd="72.186.96.121" dyno=web.2 connect=21ms service=76ms status=200 bytes=255
Oct 15 15:12:00 nutrislice-stockton app/web.2: 2013-10-15 21:11:54 [2] [CRITICAL] WORKER TIMEOUT (pid:13)
Oct 15 15:12:00 nutrislice-stockton app/web.2: 2013-10-15 21:11:54 [2] [CRITICAL] WORKER TIMEOUT (pid:13)
Oct 15 15:12:00 nutrislice-stockton app/web.2: 2013-10-15 21:11:54 [38] [INFO] Booting worker with pid: 38
Oct 15 15:12:00 nutrislice-stockton app/web.2: 2013-10-15 21:11:54,388 (38/MainThread) newrelic.core.agent INFO - New Relic Python Agent (2.0.0.1)
Oct 15 15:12:00 nutrislice-stockton app/web.2: using heroku production settings
Oct 15 15:12:01 nutrislice-stockton app/web.2: WARNING /app/.heroku/python/lib/python2.7/site-packages/django/utils/hashcompat.py:9: DeprecationWarning: django.utils.hashcompat is deprecated; use hashlib instead
Oct 15 15:12:01 nutrislice-stockton app/web.2: DeprecationWarning)
Oct 15 15:12:02 nutrislice-stockton app/web.2: WARNING /app/.heroku/python/lib/python2.7/site-packages/django/conf/urls/defaults.py:3: DeprecationWarning: django.conf.urls.defaults is deprecated; use django.conf.urls instead
Oct 15 15:12:02 nutrislice-stockton app/web.2: DeprecationWarning)
Oct 15 15:12:03 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/menuwidgets/353/?smp=306 host=sdhc.nutrislice.com fwd="108.9.154.78" dyno=web.2 connect=3650ms service=30006ms status=503 bytes=0
Oct 15 15:12:03 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/menuwidgets/395/?smp=306 host=sdhc.nutrislice.com fwd="108.9.154.78" dyno=web.2 connect=3581ms service=30006ms status=503 bytes=0
Oct 15 15:12:06 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/menu/api/settings/?format=json-p&callback=_jqjsp&_1381871492466= host=canyonsdistrict.nutrislice.com fwd="174.52.155.49" dyno=web.2 connect=3582ms service=30001ms status=503 bytes=0
Oct 15 15:12:09 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/mobile/api_version/?deviceType=iphone host=pasco.nutrislice.com fwd="173.65.148.9" dyno=web.2 connect=3837ms service=30004ms status=503 bytes=0
Oct 15 15:12:11 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/mobile/api_version/?deviceType=iphone host=canyonsdistrict.nutrislice.com fwd="174.52.155.49" dyno=web.2 connect=3987ms service=30001ms status=503 bytes=0
Oct 15 15:12:11 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/menu/api/schools/?format=json-p&callback=_jqjsp&_1381871497105= host=canyonsdistrict.nutrislice.com fwd="174.52.155.49" dyno=web.2 connect=3962ms service=30001ms status=503 bytes=0
Oct 15 15:12:11 nutrislice-stockton heroku/router: at=error code=H12 desc="Request timeout" method=GET path=/menu/api/menutypes/?format=json-p&callback=_jqjsp&_1381871497128= host=canyonsdistrict.nutrislice.com fwd="174.52.155.49" dyno=web.2 connect=4020ms service=30007ms status=503 bytes=0
And this "death-spiral" of H12 Time-outs and [CRITICAL] WORKER TIMEOUT's continues on and on until the master process is restarted.

Any ideas whether this is the same problem or a different one? Any suggestions to fix it? I would think that since gunicorn kills off the worker processes after they time out and starts them fresh, they'd be able to start processing things normally, but it seems like requests aren't even getting to them after that. Maybe it's a Heroku issue.

:+1: Also seeing this problem.

I'm also seeing this issue on Heroku

@sprynmr @richardkeen can you share the way you are launching gunicorn and any logs you can? It would help a lot in fixing the issue.

@nebstrebor what if you don't limit the number of requests?

We were launching it like this:
web: newrelic-admin run-program python manage.py run_gunicorn -b "0.0.0.0:$PORT" --log-level=DEBUG -w 3 -k gevent --max-requests 250

Even with log-level DEBUG we didn't get much in the way of logs about what was happening. Not using gevent has solved the problem for us for now, but it's not ideal.

@sprynmr what if you remove the max number of requests? Also which version of gevent?

We were on 1.0rc2. Unsure about the max number of requests. Don't really want to switch it back on for our production environment right now.

Reading the Heroku page on the request timeout, they suggest setting the application timeout well under 30 seconds, such as 10 or 15. Did you try that suggestion using the timeout parameter in gunicorn?

Anyway, the problem appearing with gevent is probably due to the way the connections are handled. The current version of the worker spawns a greenlet per accepted connection without any real supervision of this greenlet. What probably happens is that one of the accepted sockets waits for a long time without yielding, no other events are handled, and the gunicorn worker is restarted. But I need to check..
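
To illustrate the "no other events are handled" part, here is a minimal sketch (plain gevent, not gunicorn code) where time.sleep() stands in for any non-cooperative call, such as an unpatched database driver:

    import time
    import gevent

    def blocking_request():
        # Blocks the whole OS thread: the gevent hub never runs, so nothing
        # else in this worker makes progress and no heartbeat can be sent.
        time.sleep(10)

    def other_request():
        print("handled")  # only runs once the blocking call returns

    gevent.joinall([gevent.spawn(blocking_request), gevent.spawn(other_request)])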

Interesting. No we didn't try that.

:+1: I am seeing this on my application as well. @Jwpe / @sprynmr did you ever get to a resolution on this problem? It looks like it's been over 2 months since any activity occurred on this issue. I am eager to hear if you found anything to address the problem.

Thanks!

@dencold did you try https://github.com/benoitc/gunicorn/issues/588#issuecomment-29267541

I have no account at Heroku, so I'm not sure what it does. Also, do you have the same logs?

Hi @benoitc, thanks so much for the quick response. The logs are exactly the same as what @Jwpe and @nebstrebor reported. I'm happy to pull out the relevant bits and add them here if that would help isolate the issue, just let me know.

I am going to take your advice from your earlier comment and configure the gunicorn timeout to be lower than the heroku router timeout and see if that helps. Thanks for the suggestion.

It would be helpful to find out the actual cause of this cascading failure. It is really strange to see one worker fail and then have the entire gunicorn server become completely unresponsive. Let me know if there is anything I can do to help diagnose. I am not able to reproduce this timeout locally. It appears to only occur on heroku infrastructure.
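
For anyone following along, lowering the gunicorn timeout below Heroku's 30-second router timeout looks roughly like this in a Procfile (the app module, worker count, and exact value are illustrative):

web: gunicorn myproject.wsgi -b 0.0.0.0:$PORT -w 2 --timeout 15 --graceful-timeout 15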

Sorry @dencold I never did find a solution. Seemed a very deep mysterious bug, and I didn't have time to fight with it. Will be curious to see if the timeout bit helps you. Report back.

@dencold are you also using new relic?

@benoitc yep, we are also using new relic.

@dencold likewise, we never successfully resolved this issue. We didn't have the time to invest in figuring out what was causing the errors, and needed to focus on reducing the duration of the original requests. However, the critical nature of the error - since it was user-facing - led us to switch WSGI servers to avoid the issue in the interim. Hence we haven't run any further experiments to characterise the problem.

@Jwpe did you at least try the proposed solution? Also the result is weird; I never reproduced that with gunicorn by itself.

I had the cutoff time down to 15 seconds and was still seeing the issue

@Jwpe / @nebstrebor thanks so much for chiming back in on this issue.

@benoitc we've implemented the proposed timeout solution on our heroku instance. I'll be keeping careful watch on this over the coming days. Will post back if I see the same problem pop up after the new timeout is in place.

Thanks again to everyone for the help. Good to know that I'm not the only one experiencing this issue. Hope we can get to the bottom of the cause.

@benoitc just wanted to report that it's been about two weeks with the lower timeout in place. We've seen a couple of worker timeouts, but none that cause the entire gunicorn process to become unresponsive. This seems to solve the problem, but I'm still not sure why 30s timeouts were causing the cascading error in the first place.

Just wanted to keep you posted. Thanks again for all the help here!

Having the same [CRITICAL] WORKER TIMEOUT error running Django with gunicorn on a Raspberry Pi.

No news.

I'm getting the same [CRITICAL] WORKER TIMEOUT.
The issue is that after a worker's real timeout (a long request), gunicorn kills the worker and tries to spawn a new one, but the new worker can't start within the TIMEOUT limit and never reaches the point where it informs the parent process that it is alive, so gunicorn kills it again and again :(. In our case the issue was related to gevent and Sentry (https://github.com/getsentry/raven-python/issues/305). Sentry just hangs on startup.
BTW, it would be helpful to have a separate timeout for worker start time, and some additional logging from a worker when it can't start within the "timeout" window.

The timeout is not a request timeout except on the sync worker. The other workers still heartbeat to the arbiter even when handling long requests. In that case the timeout effectively serves as the startup timeout.

In a sense, there is no "request timeout" in gunicorn.
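
For readers wondering how that heartbeat works, here is a simplified sketch of the mechanism (not gunicorn's actual code; names and details are condensed): each worker periodically touches a temporary file, and the arbiter kills any worker whose file has not been touched within --timeout seconds.

    import os
    import signal
    import tempfile
    import time

    class WorkerHeartbeat:
        """Worker side: touch a temp file to prove the worker is alive."""
        def __init__(self):
            fd, self.path = tempfile.mkstemp(prefix="wgunicorn-")
            os.close(fd)

        def notify(self):
            # Called between requests (sync worker) or from the event loop
            # (async workers). A greenlet that never yields never gets here.
            os.utime(self.path, None)

        def last_update(self):
            return os.stat(self.path).st_mtime

    def murder_workers(heartbeats, timeout):
        """Arbiter side: kill workers whose heartbeat has gone stale."""
        for pid, hb in heartbeats.items():
            if time.time() - hb.last_update() > timeout:
                print("[CRITICAL] WORKER TIMEOUT (pid:%s)" % pid)
                os.kill(pid, signal.SIGABRT)  # gunicorn then boots a replacement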

Yes, the worker timeout differs from a request timeout for async workers. But long-running requests without any function calls where async "context switching" can happen (like socket read/write) are what cause worker timeouts. And besides, there can be a timeout on worker startup (where there is no request at all). But in the Gunicorn logs there is no way to distinguish those two different cases; there is just one message, "WORKER TIMEOUT".

Should we add a request timeout setting? That would let apps with long startups bump the worker timeout.

Don't think so, because:
1) Measuring request run time is hard; we can't just use the time difference between the beginning and the end of the request.
2) As far as I know, catching timeouts of "execution units" (the amount of code that runs between context switches) is not possible in gevent (see the sketch below).
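
A small illustration of point 2, assuming gevent without monkey-patching time (the sleep stands in for any non-cooperative call): gevent.Timeout can only fire when the greenlet yields to the hub, so it cannot interrupt code that never cooperates.

    import time
    import gevent

    try:
        with gevent.Timeout(1):
            time.sleep(5)  # blocks the thread; the hub never gets a chance to run
            print("finished despite the 1-second timeout")
    except gevent.Timeout:
        # Not reached here: the timeout could only be raised at a yield point.
        print("timeout fired")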

Any suggestion then?

If this issue is due only to legitimate timeouts and potentially cascading timeouts due to server load that doesn't exist at first run, is there anything to do here? Or is this issue closable and this is better handled operationally at a different level of infrastructure?

The suggestion is to have two timeout parameters and two different log messages:
1) One is the current timeout parameter, which would apply only during request handling.
2) The second would be a worker startup timeout.

Also seeing this problem

We ended up switching to uWSGI for Heroku and have had better success ... not a fix for Gunicorn, I realize, but I thought the suggestion might be helpful to someone who's seeing this issue on their production server and needs a resolution sooner rather than later.

@nebstrebor well did you try the suggestion to lower the timeout?

@CrazyPython the ticket has been closed. Can you open a new ticket describing the issue you have and how we can reproduce it? It would definitely help to figure out whether it's the same issue, something else, etc. :)

Yes, it didn't make a difference, as noted above. The problem was sporadic enough (we were never able to reproduce it) and catastrophic enough that we had to make that move (it's been more than a year). I really like Gunicorn though, and we use it outside of Heroku without issue.

I think we're seeing this issue too. Both workers (--workers 2) got hung up on some long-running requests and eventually got killed (--timeout 20); we immediately saw two H13s, and then we started to get H12s & WORKER TIMEOUT restarts. Over the next 10 minutes this continued; the workers never successfully processed a request before they would hit the timeout and get restarted. So we restarted the dyno, and that fixed it.

One interesting thing that I noticed that I didn't see anyone else note - we see two H13s (Connection closed without response) at the start of the issue, and then when we finally issued the SIGTERM, we saw a flood of H13s - 48 to be exact, which is the same number of WORKER TIMEOUTs we saw (excluding the first two which we saw an immediate H13 for). I'm not exactly sure what that means, but it seems suspicious....

What are H13s and H12s? Is this something from Heroku? How is the port that gunicorn binds to given? I suspect that the Heroku proxy is not detecting that the socket is closed.

Also, which worker are you using?

Yes, they're Heroku error codes:
H12 - Request Timeout - usually 30s (https://devcenter.heroku.com/articles/error-codes#h12-request-timeout)
H13 - Connection Closed without Response - (https://devcenter.heroku.com/articles/error-codes#h13-connection-closed-without-response)

Edit: I just noticed that gunicorn will use PORT if it exists as an environment variable, so that's how the port is configured.

We're using synchronous workers.

And I also forgot to mention, we're running gunicorn 19.3.0

Closing the issue since no activity has happened here for a long time. Maybe having a delayed timeout would be good for slow-starting applications, but that should be done in another ticket if needed.

Hey, I'm also facing the same problem with gunicorn + uvicorn workers on Heroku with a newrelic-admin Procfile:

newrelic-admin run-program gunicorn -w 4 -k uvicorn.workers.UvicornWorker app.main:fastapi_app -b 0.0.0.0:${PORT:-5000} --log-level info --access-logfile=- --logger-class=gunicorn_color.Logger --preload

This leads to the following log as soon as the system is booted:

2021-03-19T13:18:19.187532+00:00 heroku[web.1]: State changed from starting to up
2021-03-19T13:18:51.964740+00:00 heroku[router]: at=error code=H12 desc="Request timeout" method=GET path="/" host=api-app-clienti-pr-49.herokuapp.com request_id=8742009a-3e56-4f83-a147-97ff84d4e30b fwd="5.89.111.249" dyno=web.1 connect=1ms service=30003ms status=503 bytes=0 protocol=https
2021-03-19T13:19:04.292784+00:00 heroku[router]: at=error code=H12 desc="Request timeout" method=GET path="/api" host=api-app-clienti-pr-49.herokuapp.com request_id=85b6320a-7728-4074-87eb-b0992e7c3f9d fwd="5.89.111.249" dyno=web.1 connect=3ms service=30001ms status=503 bytes=0 protocol=https