supervisord crashes when stopping a subprocess

Created on 9 Dec 2014  ·  4 Comments  ·  Source: Supervisor/supervisor

When stopping a subprocess of supervisord, the stop operation exceeded stopwaitsecs, so supervisord fell back to 'options.kill' to kill the subprocess with SIGKILL:

2014-11-20 16:42:45,352 INFO success: resumed process 'hbase--dptst-example--regionserver' with pid 46651
2014-12-09 02:56:47,389 WARN killing 'hbase--dptst-example--regionserver' (46651) with SIGKILL
2014-12-09 02:56:47,422 CRIT unknown problem killing hbase--dptst-example--regionserver (46651):Traceback (most recent call last):
  File "/home/work/app/supervisor/supervisor/process.py", line 390, in kill
    options.kill(pid, sig)
  File "/home/work/app/supervisor/supervisor/options.py", line 1219, in kill
    os.kill(pid, signal)
OSError: [Errno 3] No such process

However, this exception crashes supervisord. Can we ignore and skip this exception rather than crashing supervisord?

Any ideas to share? Thanks.
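
For illustration, "ignore and skip" could look roughly like the sketch below. This is not supervisor's code: kill_if_alive and its injectable kill parameter are hypothetical names, and the sketch only assumes that the failure is the ESRCH error shown in the traceback above.

```python
import errno
import os
import signal


def kill_if_alive(pid, sig=signal.SIGTERM, kill=os.kill):
    """Send sig to pid; return False instead of raising if the pid is gone.

    Hypothetical helper, not part of supervisor. The `kill` parameter
    exists only so the behaviour can be exercised without real processes.
    """
    try:
        kill(pid, sig)
    except OSError as exc:
        if exc.errno == errno.ESRCH:
            # "No such process": the target exited before the signal was sent.
            return False
        # Anything else (for example EPERM) is still a real problem.
        raise
    return True
```

Only ESRCH is swallowed here; other errors still propagate, so a genuine permissions failure would not be hidden.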

signals

All 4 comments

:+1:

Related: #445

This crash is caused by Subprocess.finish() having no logic to deal with ProcessStates.UNKNOWN. That state only comes up in a few situations. For example, "supervisorctl stop" is called to terminate a flapping process, but the PID becomes invalid moments before options.kill runs; options.kill then bombs with an exception, which is caught and changes the state to UNKNOWN. finish() cannot handle that state, so the daemon ends up crashing. You can simulate the race condition by dropping in a "raise Exception()" after options.kill(pid, sig).

We opted to fix this for our application servers by changing the state to FATAL (so that we weren't barred from fixing the process), and adding in a self.pid sanity check for the final else branch in finish().
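
A toy model of that workaround, as a sketch only: ToyProcess, State, and the injectable kill parameter are made up for illustration and do not mirror supervisor's actual Subprocess class. It assumes that a failed kill should park the process in FATAL and that finish() should do nothing when the pid has already been cleared.

```python
import os
import signal
from enum import Enum


class State(Enum):
    RUNNING = "RUNNING"
    STOPPING = "STOPPING"
    STOPPED = "STOPPED"
    EXITED = "EXITED"
    FATAL = "FATAL"


class ToyProcess:
    """Illustrative stand-in for a supervised process (hypothetical)."""

    def __init__(self, pid, kill=os.kill):
        self.pid = pid
        self.state = State.RUNNING
        self.killing = False
        self._kill = kill  # injectable so the race can be simulated

    def stop(self, sig=signal.SIGTERM):
        self.state = State.STOPPING
        self.killing = True
        try:
            self._kill(self.pid, sig)
        except Exception:
            # Park in FATAL rather than a state that finish() cannot handle.
            self.state = State.FATAL
            self.killing = False
            self.pid = 0

    def finish(self):
        if self.killing:
            # Exit was the result of a stop request.
            self.killing = False
            self.state = State.STOPPED
        elif self.pid:
            # Sanity check passed: a tracked process exited on its own.
            self.state = State.EXITED
        # Otherwise the pid was already cleared; leave the state alone.
        self.pid = 0
```

Passing a kill function that raises plays the role of the "raise Exception()" trick: stop() leaves the process in FATAL, and a later finish() does not disturb it.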

We've seen this issue as well, and I would like some insight into the root cause. Supervisor seems to think the process has already terminated by the time supervisorctl stop sends the kill signal, which leads to the OSError: [Errno 3] No such process exception (which, as noted above, supervisor fails to handle correctly). But every time we've seen this issue, the process we were trying to kill was in fact still running, and its PID was still in the list of processes (ps). Some kind of permissions issue, maybe?
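
One way to narrow this down on a POSIX system: a permissions problem would normally surface as EPERM (Errno 1) rather than ESRCH (Errno 3), so it may be worth checking which PID is actually being signalled. The sketch below probes a PID with signal 0, which performs the existence and permission checks without delivering a signal. probe_pid and its kill parameter are hypothetical names, not part of supervisor.

```python
import errno
import os


def probe_pid(pid, kill=os.kill):
    """Classify a pid using signal 0 (POSIX only; hypothetical helper)."""
    try:
        kill(pid, 0)  # signal 0: error checking only, nothing is delivered
    except OSError as exc:
        if exc.errno == errno.ESRCH:
            return "missing"
        if exc.errno == errno.EPERM:
            return "exists, not permitted"
        raise
    return "exists, signalable"
```

Running this as the same user as supervisord against the PID from the log and the PID shown by ps should indicate whether the two really refer to the same process.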
