Celery: Restarting celery issues and better supervisord config file

Created on 9 May 2010  ·  16 comments  ·  Source: celery/celery

I use a supervisord config file based on the example in the celery repository and have some issues when celeryd is restarted: sometimes task processing silently stops after restarting celery, without any error messages in the logs. The processes remain visible in the process list.

Eventually I figured out that sometimes, when the processes are restarted, celery spawns an additional process that is not managed by supervisord, and this leads to these bugs. So I started watching the ps output carefully after each restart and killing the extra processes manually via kill. After killing these processes, tasks start being executed properly again. This is the sort of hack that has kept things working for a week or so.

And today I think I've found the real reason. The default value for supervisord's 'stopwaitsecs' option is 10s. This means that after 10s the celery process will be killed with the KILL signal instead of TERM. It seems that celery doesn't like being killed and tries to spawn an additional process in that case.

So I think it would be good to add something like 'stopwaitsecs=600' to all supervisord example config files (from the FAQ: "You should never stop celeryd with the KILL signal (-9), unless you’ve tried TERM a few times and waited a few minutes to let it get a chance to shut down.") and to investigate celeryd's behaviour on the KILL signal: the docs mention that tasks will be lost (which is tolerable in many cases), but the issue with the spawned process is a bit weird.
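For illustration, a minimal sketch of what such a supervisord section could look like; the program name, command, and paths are placeholders, not copied from the celery example file:

 [program:celeryd]
 ; placeholder command and paths - adjust for your project
 command=/path/to/python /path/to/manage.py celeryd --loglevel=INFO
 autostart=true
 autorestart=true
 ; on stop, supervisord sends TERM and waits this many seconds for running
 ; tasks to finish before escalating to KILL (the default is only 10)
 stopwaitsecs=600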

Most helpful comment

In case you are still having issues terminating your Celery workers, you might try setting stopasgroup=true before increasing your stopwaitsecs.

All 16 comments

Processes being spawned on receiving the KILL signal is weird indeed. I don't see that behavior when celeryd is used outside of supervisord, so maybe this is something caused by it?

If you install the setproctitle module, celery should report the kind of process in ps listings; could you do that to investigate what kind of process is created?

(easy_install setproctitle)

Setting the timeout to 600 is probably good. Is there any setting for infinity (maybe with a warning if it takes too long)? When celeryd is killed via TERM (which is the preferred shutdown signal) it stops receiving messages and waits for the currently executing tasks to finish. And I guess for most applications, termination mid-execution is not acceptable.

As for the process spawning: setproctitle and watching the process ids were helpful. It is not process spawning after all: the worker processes remain alive when the parent process is killed.
Here is a simulation of a supervisord restart with manual killing and a zero timeout:

 4976 ?        Ss     0:00 /usr/bin/python /usr/bin/supervisord --pidfile /var/run/supervisord.pid
 5422 ?        S      0:01  \_ [celerybeat] --schedule=/var/lib/celery/celerybeat-schedule-nadovmeste --loglevel=INFO                                                             
 6101 ?        Sl     0:00  \_ [celeryd.MainProcess] Running... (--loglevel=INFO)                                                           
 6108 ?        S      0:00      \_ [celeryd.PoolWorker-1]                                                                                       
 nadovmeste:~# kill 6101 & kill -9 6101 &

ps -afx:

 4976 ?        Ss     0:00 /usr/bin/python /usr/bin/supervisord --pidfile /var/run/supervisord.pid
 5422 ?        S      0:01  \_ [celerybeat] --schedule=/var/lib/celery/celerybeat-schedule-nadovmeste --loglevel=INFO                                                             
 6867 ?        Sl     0:00  \_ [celeryd.MainProcess] Running... (--loglevel=INFO)                                                           
 6875 ?        S      0:00      \_ [celeryd.PoolWorker-1]                                                                                       
 6108 ?        S      0:00 [celeryd.PoolWorker-1]       

I was able to reproduce this only with such an artificial race between kill and kill -9. Sometimes the worker gets killed properly. The issue seems to be supervisord-specific, because when I start celeryd from the console I have no luck reproducing it.

I was able to reproduce this with console-started scripts after several attempts:

/home/nadovmeste/envs/nadovmeste/bin/python /home/nadovmeste/src/nadovmeste/manage.py celeryd -B --loglevel=INFO&

and then in another terminal session:

nadovmeste:~# ps -afx

 6450 ?        Ss     0:00  \_ sshd: root@pts/2 
 6452 pts/2    Ss+    0:00      \_ -bash
 9343 pts/2    Sl     0:00          \_ [celeryd.MainProcess] Running... (-B --loglevel=INFO)                                                           
 9350 pts/2    S      0:00              \_ [celeryd.PoolWorker-2]                                                                                          
 9355 pts/2    S      0:00              \_ [celerybeat]     

nadovmeste:~# kill 9343 & kill -9 9343

nadovmeste:~# ps -afx

 4526 ?        Ss     0:00  \_ sshd: root@pts/1 
 4529 pts/1    Ss     0:00  |   \_ -bash
 9366 pts/1    R+     0:00  |       \_ ps -afx
 6450 ?        Ss     0:00  \_ sshd: root@pts/2 
 6452 pts/2    Ss+    0:00      \_ -bash    
 ...
 9350 pts/2    S      0:00 [celeryd.PoolWorker-2]                                                                                          
 9355 pts/2    S      0:00 [celerybeat]

I haven't found any special option for an infinite timeout with a warning in the supervisord docs. A very large number will probably suffice, if that is what we want.
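A hedged sketch of that workaround; stopwaitsecs only accepts a number of seconds, and the value below is arbitrary:

 [program:celeryd]
 ; there is no 'infinite' setting, so pick a value comfortably above the
 ; longest task you ever expect to run (this number is only an example)
 stopwaitsecs=3600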

Maybe it's something related to celerybeat, because I was able to reproduce the issue with console-started celeryd only after using the -B option.

If I am testing some celery tasks locally and use the -B option, sometimes the process is not killed when I use ctrl-c.

I can't reproduce this locally. Btw, are you running the master branch? I just fixed a bug that could hang shutdown. If you could test with this it would be nice.

Yes, I'm running the latest master branch. I saw your bug-fixing commit and hoped that it would help, but it doesn't seem to help in my case: the latest celery behaves the same. It is possible, though, that the initial problem is solved - I checked this only with an immediate kill, and can't wrap my head around it right now :) The ctrl-c issue is not reproducible with my setup.

So here is the bug report, simplified: http://gist.github.com/401028 . The results are always the same (not just sometimes). I have some periodic tasks and some non-periodic ones. The tasks are simple and don't take much time to finish. Is it a bug that child processes stay alive after killing the main process? If so, and you can't reproduce it, then I'll try to provide a minimal project.

The celerybeat killing behaviour is interesting: when I kill the hanging(?) celerybeat process, the hanging(?) worker process also shuts down.

@kmike I still can't reproduce with the commands above. Maybe because I'm on OS X, or maybe you're running Python 2.5? (I'm running 2.6.1)

Could you run it with --loglevel=DEBUG? It could provide some info about where it stops.

The celerybeat process is started by the main process, so I'm assuming the main process is waiting
for the celerybeat to exit before it kills the remaining pool processes.

I thought that the main process was killed: it is not visible in the process list. I don't have much experience with process management though.

My setup was Debian Lenny + python 2.5.

I'll try to run celeryd with --loglevel=DEBUG and to reproduce it on my macbook.

hmm, you're right of course. It's almost like the beat process takes ownership of the pool processes.

I just tried to reproduce on Debian Lenny with python 2.5, and it works just right there.
Tried killing with both TERM and INT.

Ask, thank you for your help.

I think the initial problem was solved by the increased supervisord timeout and your bug-fixing commit. My simulation was incorrect, because I used kill -9 commands, which send the KILL signal instead of TERM. With the TERM signal the processes get killed properly.

Supervisord uses the TERM signal, so all should be fine.

But the thing that scares me a bit is that the initial bug wasn't investigated. I'll try to reproduce it and let you know.

Ah! I'm so sorry. I didn't read the issue carefully enough. Yes! That's exactly what happens when you kill it with SIGKILL. The 9 signal can't be caught, so there's nothing we can do about this AFAIK.

In case you are still having issues terminating your Celery workers, you might try setting stopasgroup=true before increasing your stopwaitsecs.
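As a sketch, the suggested option would sit in the program section like this (the stopwaitsecs value is just carried over from earlier in the thread):

 [program:celeryd]
 ; send the stop signal to the whole process group, so the pool workers
 ; and celerybeat receive it too, not only the parent process
 stopasgroup=true
 ; still leave running tasks time to finish before supervisord sends KILL
 stopwaitsecs=600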
