Lua-resty-auto-ssl: Locking problem

Created on 1 Feb 2017  ·  20 Comments  ·  Source: auto-ssl/lua-resty-auto-ssl

I've been encountering a problem with our (reverse proxy) nginx servers: they have been crashing. They stop responding to requests completely, and don't seem to come out of this state by themselves. These servers deal with a fairly high volume of requests (>5 million a day).

For the past few days I've been at a bit of a loss, restarting the Docker instance manually whenever monitoring alerted me to the problem, but I decided to put a helper cron script in place that checks whether nginx is still responding and restarts it via supervisord if there is an issue.

Because I was initially restarting the whole Docker container, I wasn't really getting any debugging information -- the logging would just stop. However, after changing this to restart nginx inside the container instead, I have the following in the logs:

2017/02/01 01:14:16 [alert] 489#0: worker process 501 exited on signal 9
2017/02/01 01:14:16 [alert] 489#0: shared memory zone "auto_ssl" was locked by 501
2017/02/01 01:14:16 [alert] 489#0: worker process 502 exited on signal 9

I had a hunt around on Google and the only reference I can find is https://github.com/18F/api.data.gov/issues/325 -- however, although it looks like lock expirations were put in place there, that doesn't seem to be working on our setup, as we (due to bad monitoring) recently ended up with about 7 hours of downtime.

I should mention I cannot recreate this bug at all locally, even using the same Docker container.

I'm at a bit of a loss; our automatic restart script has sorted out the issue for now, but it would be nice to see if anyone has ideas. I'd be happy to turn on extra logging and attempt the debug log (I've been a bit scared to turn it on on our production servers).

bug

Most helpful comment

Also ran into this issue in production – thanks @koszik et al. Just to confirm, to resolve this issue:

Update OpenResty to >= 1.15.8.1

This seems so pernicious that it might be worth releasing f66bb61f11a654f66d35dd793ceaf0293d9c0f46 soon, or at least updating the documentation to make this a requirement rather than a recommendation.

All 20 comments

Ouch, sorry to hear this led to an outage!

I unfortunately haven't seen anything like this in our installation, which gets a decent amount of traffic (since the incident from last March you referred to). However, there was another somewhat similar issue reported in #29, where we fixed something that might have been related, though it didn't totally explain that report either. That issue may also be unrelated (it was specific to when registrations occurred).

Thanks for the offer to help debug this, though, it would definitely be good to get to the bottom of. I have a few initial questions:

  • What version of lua-resty-auto-ssl are you running?
  • Are you running OpenResty or nginx with the lua module manually installed?
  • What versions of OpenResty or nginx+lua are you running?
  • What storage mechanism are you using with lua-resty-auto-ssl (Redis, filesystem, something else)?
  • How frequently do things hang? Does it seem to only happen when new certs are being registered or renewals are taking place, or is it seemingly random?
  • Are you reloading nginx at all (sending a SIGHUP to the master process and spawning new workers instead of fully restarting the master process)?
  • How many nginx workers do you have running (worker_processes setting in nginx config)?
  • Do you have any other nginx plugins installed (beyond the ones that come with OpenResty by default if you're on OpenResty)?

lua-resty-auto-ssl is 0.10.3-1 from luarocks
We're using OpenResty 1.11.2.2.

nginx version: openresty/1.11.2.2
built by gcc 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC)
built with OpenSSL 1.0.2h  3 May 2016
TLS SNI support enabled
configure arguments: --prefix=/usr/local/openresty/nginx --with-cc-opt=-O2 --add-module=../ngx_devel_kit-0.3.0 --add-module=../echo-nginx-module-0.60 --add-module=../xss-nginx-module-0.05 --add-module=../ngx_coolkit-0.2rc3 --add-module=../set-misc-nginx-module-0.31 --add-module=../form-input-nginx-module-0.12 --add-module=../encrypted-session-nginx-module-0.06 --add-module=../srcache-nginx-module-0.31 --add-module=../ngx_lua-0.10.7 --add-module=../ngx_lua_upstream-0.06 --add-module=../headers-more-nginx-module-0.32 --add-module=../array-var-nginx-module-0.05 --add-module=../memc-nginx-module-0.17 --add-module=../redis2-nginx-module-0.13 --add-module=../redis-nginx-module-0.3.7 --add-module=../rds-json-nginx-module-0.14 --add-module=../rds-csv-nginx-module-0.07 --with-ld-opt=-Wl,-rpath,/usr/local/openresty/luajit/lib --with-http_ssl_module --with-http_perl_module --with-http_v2_module --with-http_secure_link_module --add-module=/nginx-build/openresty-1.11.2.2/../testcookie-nginx-module --add-module=/nginx-build/openresty-1.11.2.2/../lua-upstream-cache-nginx-module --add-module=/nginx-build/openresty-1.11.2.2/../nginx-module-vts --with-openssl=/openssl

File system for now, as the subdomains each server deals with are separate.
When a crash occurs seems to be completely random -- anything from 3 hours to a good 3 days apart.
Not reloading nginx atm, just doing a restart, but I'll try this and see if this works too.
Initially I was using 1 worker, but I tried increasing this to 2 when the problems occurred to see if it would make any difference.
Using the following non-OpenResty modules:

I'm not using any other Lua in the configuration code yet other than this project.

Sorry for the delayed followup! After searching around some more, I have some theories on what might be happening:

  • The fact that you're seeing "exited on signal 9" errors might indicate you're hitting out-of-memory errors, and the system is aggressively killing off processes: http://raspberrypi.stackexchange.com/questions/40883/nginx-out-of-memory-kill-process
  • When a process crashes or gets forcibly killed like this, it may lead to nginx thinking the shared memory is still locked by the dead worker process. For instance, in your initial log, worker process 501 gets killed first, but nginx then still thinks the memory is locked by pid 501, leading to this deadlock.

    • It does seem like nginx is supposed to unlock shared memory on crashes, so I'm not entirely sure why that might not be happening. But if workers are getting killed off with SIGKILL (9), then all bets might be off (since sigkill usually means forcibly kill the process and there's no chance to clean up).

So do you see anything in your system-level logs about out of memory or oom-killer? Do you have any other monitoring on these servers that might indicate memory growth or a memory leak in nginx? I don't think we've seen any memory leaks in any of our lua-resty-auto-ssl installations, so I'm wondering if some of the other nginx modules might also be playing a role (there is this mention of a memory leak in lua-upstream-cache-nginx-module).

Sorry - I meant to clarify back to @GUI that the "exited on signal 9" messages aren't connected to the bug; they come from us deliberately killing the nginx process to counter the issue. There is no memory issue on these servers: they have around 2GB of memory, with a tiny amount actually in use and the rest mostly cache. No OOM kills in dmesg.

I should add that I've removed some modules to try to help the issue (the deprecated lua-upstream-cache-nginx-module, and pagespeed), but this doesn't seem to have helped.

I have a few more error lines that may be helpful, I'll try and get them from the servers shortly.

@ajmgh: I'm not entirely certain if it's related, but I think I tracked down some potential problems that could lead to strange errors if the configured lua_shared_dict memory size was too low: https://github.com/GUI/lua-resty-auto-ssl/issues/48#issuecomment-294397379

So do you know roughly how many certificates are in your system, and how big lua_shared_dict auto_ssl is configured to be in your nginx configuration? You might also try upgrading to v0.10.6, if possible, since there have been a few updates since 0.10.3 that might fix this (if we're lucky), or at least provide better error handling and messages.

I'm facing exactly the same error.
I just updated lua-resty-auto-ssl to version 0.10.6-1 and increased lua_shared_dict auto_ssl_settings to 1000m (before it was set to 64k).
lua_shared_dict auto_ssl stays the same: 1000m

Just waiting to see if these changes will fix this issue :/

@ajmgh did you solve your problem?

@aiev auto_ssl_settings currently only stores a short string as well as one boolean, so changing it won't make any difference. The certificates are stored in auto_ssl. So please try increasing that instead.
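For anyone comparing configs, here's a minimal sketch of how the two dicts are typically declared in the nginx http block (the sizes below are only illustrative, not a recommendation; the certificates live in auto_ssl, so that is the dict to grow):

    http {
        # Holds the certificates themselves; size this for the number of
        # certs you expect to serve.
        lua_shared_dict auto_ssl 1m;

        # Only holds a short settings string and a boolean, so it can stay tiny.
        lua_shared_dict auto_ssl_settings 64k;

        # ... rest of the lua-resty-auto-ssl setup ...
    }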

No, the latest update doesn't fix our issue. I've upped the auto_ssl size to 8m, which is overkill since we only use around 10 certificates, and didn't see any change.

# Log entries after my script detects nginx is unresponsive and force kills it
2017/05/24 13:29:15 [alert] 462#0: worker process 474 exited on signal 9
2017/05/24 13:29:15 [alert] 462#0: worker process 475 exited on signal 9
2017/05/24 13:29:15 [alert] 462#0: shared memory zone "auto_ssl" was locked by 475

I have experienced the same issue a few times.
I'm using OpenResty 1.11.2.3/4 and lua-resty-auto-ssl 0.11.0-1 from luarocks.
When this issue appears, more than 100 TCP connections are stuck in the CLOSE_WAIT state.

We have experienced the same issue many times as well.
nginx version: openresty/1.11.2.4
lua-resty-auto-ssl 0.11.0-1
There are many connections in the CLOSE_WAIT state, and nginx cannot respond anymore. We either need to kill the CLOSE_WAIT connections or restart the Docker container to resolve this issue.

@ajmgh have you solved your issue? We are experiencing the same issue in our openresty containers. We got ~1200 connections in the CLOSE_WAIT state and lots of dehydrated files in /tmp on our servers that run only openresty with lua-resty-auto-ssl.

Here is our system configuration

  • What version of lua-resty-auto-ssl are you running?
    0.11.0-1
  • Are you running OpenResty or nginx with the lua module manually installed?
    openresty
  • What versions of OpenResty or nginx+lua are you running?
    openresty 1.11.2.4
  • What storage mechanism are you using with lua-resty-auto-ssl (Redis, filesystem, something else)?
    redis
  • How frequently do things hang? Does it seem to only happen when new certs are being registered or renewals are taking place, or is it seemingly random?
    It looks very random. It just happened yesterday and led to a 30 min downtime in our system. The previous time it happened was 2 months ago.
  • Are you reloading nginx at all (sending a SIGHUP to the master process and spawning new workers instead of fully restarting the master process)?
    we just replaced all docker containers
  • How many nginx workers do you have running (worker_processes setting in nginx config)?
    2
  • Do you have any other nginx plugins installed (beyond the ones that come with OpenResty by default if you're on OpenResty)?
    nope, lua-resty-auto-ssl is the only plugin we have installed

@ronail No; however, we added extra servers to our round-robin and have an automated restart script for when this issue happens, so we've heavily mitigated it.

Is everyone else using Docker who experiences this bug? Maybe it's something really weird going on with a mix of Lua/OpenResty and Docker.

I'm not using Docker and I'm facing the same problem.

My guess is that this is an issue that occurs when dehydrated tries to issue the certificate.

I'm also getting a similar issue, having to force Jenkins to restart OpenResty every 30 minutes (it crashes every hour or so, constantly...).

I have high memory limits set; however, I have noticed I'm getting a fair few rate limits for failed authorizations on Let's Encrypt, if that helps?

We were hit with the same issue yesterday, and found these reports (#43, #136), which contained no pointers as to what might be the root cause. We were unable to reproduce the issue on our test system, so we were forced to debug on the production one. 'Fortunately', the hangs were frequent enough that we could quickly iterate through our debugging methods. First it was only a strace -fp $pid on all nginx processes, and this revealed that all of them were waiting on a futex() - consistent with the fact that one of the pids always held a lock on a shdict. Next, I added a dump of the gdb backtrace of each process, and after adding debug symbols to the image it became clear that the issue is on the following code path:

#3  0x00007f8f4ea50219 in ngx_shmtx_lock (mtx=0x7f8f31a0c068) at src/core/ngx_shmtx.c:111
#4  0x00007f8f4eb7afbe in ngx_http_lua_shdict_set_helper (L=0x418257a0, flags=0) at ../ngx_lua-0.10.13/src/ngx_http_lua_shdict.c:1016
#5  0x00007f8f4eb7a4a4 in ngx_http_lua_shdict_delete (L=0x418257a0) at ../ngx_lua-0.10.13/src/ngx_http_lua_shdict.c:632
#6  0x00007f8f4debd2f3 in lj_BC_FUNCC () from /usr/local/openresty/luajit/lib/libluajit-5.1.so.2
#7  0x00007f8f4dec0b9f in gc_call_finalizer (g=0x418063b8, L=0x418257a0, mo=0x7ffc7592da00, o=0x40e11948) at lj_gc.c:475
#8  0x00007f8f4dec0e2b in gc_finalize (L=0x418257a0) at lj_gc.c:509
#9  0x00007f8f4dec15d9 in gc_onestep (L=0x418257a0) at lj_gc.c:659
#10 0x00007f8f4dec16ef in lj_gc_step (L=0x418257a0) at lj_gc.c:689
#11 0x00007f8f4ded8c3d in lua_pushlstring (L=0x418257a0, str=0x7f8f330a6066 "0\202\002\v\n\001", len=527) at lj_api.c:639
#12 0x00007f8f4eb7a225 in ngx_http_lua_shdict_get_helper (L=0x418257a0, get_stale=0) at ../ngx_lua-0.10.13/src/ngx_http_lua_shdict.c:538
#13 0x00007f8f4eb79eb6 in ngx_http_lua_shdict_get (L=0x418257a0) at ../ngx_lua-0.10.13/src/ngx_http_lua_shdict.c:419

A quick glance at ngx_http_lua_shdict_get_helper() makes the root cause clear: the shdict gets locked, and lua_pushlstring sometimes triggers the garbage collector, which may want to remove items from the same shdict, causing the deadlock.

My quick and dirty solution was this (it's so ugly I'm not going to publish a patch):

    case SHDICT_TSTRING:
    {
        int len = value.len;

        /* Copy the value out of shared memory so the lock can be released
         * before any Lua allocation happens. */
        char *tmp = malloc(len);
        if (!tmp) {
            ngx_log_error(NGX_LOG_ERR, ctx->log, 0, "dict get: malloc: out of memory");
            return luaL_error(L, "out of memory");
        }
        ngx_memcpy(tmp, value.data, value.len);

        /* Unlock the shdict first: lua_pushlstring may trigger the garbage
         * collector, which can try to take the same lock and deadlock. */
        ngx_shmtx_unlock(&ctx->shpool->mutex);

        lua_pushlstring(L, tmp, len);
        free(tmp);
    }
        break;

So far this runs flawlessly - someone with more insight into the inner workings of the system might want to produce a better fix.

Interestingly, it's a known fact!
https://github.com/openresty/lua-nginx-module/issues/1207#issuecomment-350745592

That's interesting indeed. As per the issue you mentioned, using lua-resty-core would fix the problem, and according to its documentation it's automatically loaded since OpenResty 1.15.8.1, so this bug was silently fixed in that version. We'll upgrade our proxy and report back.
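In case it helps anyone stuck on an older OpenResty in the meantime, here's a minimal sketch of loading lua-resty-core explicitly (this is standard resty.core usage, not anything specific to lua-resty-auto-ssl; on 1.15.8.1 and later it's loaded automatically and this is unnecessary):

    http {
        init_by_lua_block {
            -- Use the FFI-based shared dict implementation from lua-resty-core
            -- instead of the C functions seen in the backtrace above.
            require "resty.core"
        }
    }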

Looks like it's working perfectly - assuming the conditions that caused it to hang before still persist, I'd say the bug has been fixed.

Just ran into this, after 3+ years of running smoothly.

Also ran into this issue in production – thanks @koszik et al. Just to confirm, to resolve this issue:

Update OpenResty to >= 1.15.8.1

This seems so pernicious that it might be worth releasing f66bb61f11a654f66d35dd793ceaf0293d9c0f46 soon, or at least updating the documentation to make this a requirement rather than a recommendation.

