Gunicorn: "Booting worker" is looping infinitely despite no exit signals

Created on 2017-12-11 · 65 comments · Source: benoitc/gunicorn

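I'm trying to set up gunicorn on Docker. It works well locally, and the production image is exactly the same as the local image, but I'm getting this strange behaviour on the production Docker engine: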

ml-server_1     | [2017-12-11 13:18:50 +0000] [1] [INFO] Starting gunicorn 19.7.1
ml-server_1     | [2017-12-11 13:18:50 +0000] [1] [DEBUG] Arbiter booted
ml-server_1     | [2017-12-11 13:18:50 +0000] [1] [INFO] Listening at: http://0.0.0.0:80 (1)
ml-server_1     | [2017-12-11 13:18:50 +0000] [1] [INFO] Using worker: sync
ml-server_1     | [2017-12-11 13:18:50 +0000] [8] [INFO] Booting worker with pid: 8
ml-server_1     | [2017-12-11 13:18:50 +0000] [1] [DEBUG] 1 workers
ml-server_1     | Using TensorFlow backend.
ml-server_1     | [2017-12-11 13:18:54 +0000] [11] [INFO] Booting worker with pid: 11
ml-server_1     | Using TensorFlow backend.
ml-server_1     | [2017-12-11 13:18:58 +0000] [14] [INFO] Booting worker with pid: 14
ml-server_1     | Using TensorFlow backend.
ml-server_1     | [2017-12-11 13:19:02 +0000] [17] [INFO] Booting worker with pid: 17
ml-server_1     | Using TensorFlow backend.

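Despite there being no obvious error messages or exit signals, gunicorn appears to boot a new worker every 4-5 seconds. This behaviour continues indefinitely until it is terminated.

Is it possible for a worker to exit without logging anything to stderr/stdout, or for the arbiter to spawn workers infinitely?

Since they are the same Docker image, they run exactly the same code on exactly the same architecture, so I'm really confused about what this could be (a bug?). Any help is greatly appreciated!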

Improvement help wanted

benhjames

👍19

All 65 comments

ssh-ing into the Docker container led me to find this error:

Illegal instruction (core dumped)

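Perhaps gunicorn should surface errors like this instead of swallowing them, or handle them differently? Not sure, I just thought I'd raise this as it might help someone else!

benhjames on 2017-12-11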

👍7

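Thanks for reporting the issue!

If you can figure out where this happens, that would be very helpful.

Perhaps we can add logging when workers exit. Usually, the worker itself logs, but if it is killed very abruptly it won't.

tilgovi on 2017-12-11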

No worries!

There seems to be a problem with spaCy, which I've just added to this thread: https :

Anyway, it is causing a SIGILL, as strace confirms:

--- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x7ff48bbe6cea} ---
+++ killed by SIGILL (core dumped) +++
Illegal instruction (core dumped)

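I think it would be good if gunicorn could identify this and log an error rather than phantom-restarting the worker, but I know very little about how exit codes work!

benhjames on 2017-12-11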

👍4

Some exit codes definitely have special meanings, and we could probably log those.
http://tldp.org/LDP/abs/html/exitcodes.html

tilgovi on 2017-12-11

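Sounds good! In addition, if the exit code isn't a reserved exit code (as in this case), it would be cool if it could still be logged (even without an explanation), so it is obvious that the worker is indeed terminating 🙂

benhjames on 2017-12-11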
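
The exit-code conventions discussed above can be sketched in a few lines. This is only an illustration, assuming a POSIX system and the Python `subprocess` convention in which a negative return code -N means the child was terminated by signal N. The helper name `describe_returncode` is hypothetical and is not part of gunicorn.

```python
import signal


def _signal_name(signum: int) -> str:
    """Return the symbolic name of a signal number, or a generic label."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        return f"signal {signum}"


def describe_returncode(returncode: int) -> str:
    """Turn a subprocess-style return code into a readable message.

    Assumptions: a negative value -N means "killed by signal N" (the
    convention used by Python's subprocess module), and a value above 128
    follows the shell convention 128 + N from the exit-code table linked
    in the comment above.
    """
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        return f"killed by {_signal_name(-returncode)}"
    if returncode > 128:
        name = _signal_name(returncode - 128)
        return f"exited with code {returncode} (shell convention for {name})"
    return f"exited with code {returncode}"
```

With a helper like this, the SIGILL seen in the strace output above would be reported as "killed by SIGILL" rather than passing silently.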

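I have a similar problem: gunicorn always boots a new worker when I make an HTTP request. I don't get any response back; it just keeps booting new workers. Strace log from two HTTP requests: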

select(4, [3], [], [], {1, 0})          = 0 (Timeout)
fstat(6, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
select(4, [3], [], [], {1, 0})          = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=510, si_uid=0, si_status=SIGSEGV, si_utime=160, si_stime=32} ---
getpid()                                = 495
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV && WCOREDUMP(s)}], WNOHANG, NULL) = 510
lseek(8, 0, SEEK_CUR)                   = 0
close(8)                                = 0
wait4(-1, 0x7ffd455ad844, WNOHANG, NULL) = 0
write(4, ".", 1)                        = 1
select(4, [3], [], [], {0, 840340})     = 1 (in [3], left {0, 840338})
read(3, ".", 1)                         = 1
read(3, 0x7f2682025fa0, 1)              = -1 EAGAIN (Resource temporarily unavailable)
fstat(6, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG, st_size=0, ...}) = 0
umask(0)                                = 022
getpid()                                = 495
open("/tmp/wgunicorn-q4aa72u7", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = 8
fcntl(8, F_SETFD, FD_CLOEXEC)           = 0
chown("/tmp/wgunicorn-q4aa72u7", 0, 0)  = 0
umask(022)                              = 0
unlink("/tmp/wgunicorn-q4aa72u7")       = 0
fstat(8, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
ioctl(8, TIOCGWINSZ, 0x7ffd455b8e50)    = -1 ENOTTY (Not a tty)
lseek(8, 0, SEEK_CUR)                   = 0
lseek(8, 0, SEEK_CUR)                   = 0
rt_sigprocmask(SIG_BLOCK, ~[], [], 8)   = 0
fork()                                  = 558
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(0, NULL, NULL, NULL, {0, 37381}[2017-12-28 17:50:23 +0000] [558] [INFO] Booting worker with pid: 558
) = 0 (Timeout)
select(4, [3], [], [], {1, 0}loading test-eu-ovh settings
)          = 0 (Timeout)
fstat(6, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
select(4, [3], [], [], {1, 0}
)          = 0 (Timeout)
fstat(6, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
select(4, [3], [], [], {1, 0})          = 0 (Timeout)
fstat(6, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
select(4, [3], [], [], {1, 0})          = 0 (Timeout)
fstat(6, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG, st_size=0, ...}) = 0
select(4, [3], [], [], {1, 0})          = 0 (Timeout)
fstat(6, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG, st_size=0, ...}) = 0
select(4, [3], [], [], {1, 0})          = 0 (Timeout)
fstat(6, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
select(4, [3], [], [], {1, 0})          = 0 (Timeout)
fstat(6, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG, st_size=0, ...}) = 0
select(4, [3], [], [], {1, 0})          = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=499, si_uid=0, si_status=SIGSEGV, si_utime=160, si_stime=31} ---
getpid()                                = 495
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV && WCOREDUMP(s)}], WNOHANG, NULL) = 499
lseek(7, 0, SEEK_CUR)                   = 0
close(7)                                = 0
wait4(-1, 0x7ffd455ad844, WNOHANG, NULL) = 0
write(4, ".", 1)                        = 1
select(4, [3], [], [], {0, 450691})     = 1 (in [3], left {0, 450689})
read(3, ".", 1)                         = 1
read(3, 0x7f2682067de8, 1)              = -1 EAGAIN (Resource temporarily unavailable)
fstat(6, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG, st_size=0, ...}) = 0
umask(0)                                = 022
getpid()                                = 495
open("/tmp/wgunicorn-5x9a40ca", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW|O_CLOEXEC, 0600) = 7
fcntl(7, F_SETFD, FD_CLOEXEC)           = 0
chown("/tmp/wgunicorn-5x9a40ca", 0, 0)  = 0
umask(022)                              = 0
unlink("/tmp/wgunicorn-5x9a40ca")       = 0
fstat(7, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
ioctl(7, TIOCGWINSZ, 0x7ffd455b8e50)    = -1 ENOTTY (Not a tty)
lseek(7, 0, SEEK_CUR)                   = 0
lseek(7, 0, SEEK_CUR)                   = 0
rt_sigprocmask(SIG_BLOCK, ~[], [], 8)   = 0
fork()                                  = 579
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(0, NULL, NULL, NULL, {0, 8144}[2017-12-28 17:50:30 +0000] [579] [INFO] Booting worker with pid: 579
)  = 0 (Timeout)
select(4, [3], [], [], {1, 0})          = 0 (Timeout)
fstat(6, {st_mode=S_IFREG, st_size=0, ...}) = 0
fstat(7, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
fstat(9, {st_mode=S_IFREG|01, st_size=0, ...}) = 0
fstat(8, {st_mode=S_IFREG|01, st_size=0, ...}) = 0

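zetaab on 2017-12-28

I am facing the same issue: gunicorn boots repeatedly within a few seconds for the sync worker type. Setting the worker timeout to 900 does not help.

In my before-action load step, I am downloading data from AWS S3. It takes approximately 1 min 10 sec to download the various files.

sara-02 on 2018-07-05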

👍3

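@sara-02 what is the command line used to launch gunicorn?

benoitc on 2018-07-05

@benoitc gunicorn --pythonpath /src -b 0.0.0.0:$SERVICE_PORT --workers=1 -k sync -t $SERVICE_TIMEOUT flask_endpoint:app
Present here

sara-02 on 2018-07-05

@sara-02 thanks.

Do the old workers really exit, or are they kept online while new workers are spawned? What does the debug log show?

benoitc on 2018-07-05

The logs are mixed with botocore logs, but it is something like this: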

[INFO] Booting worker with pid:  a
[INFO] Booting worker with pid:  b
[INFO] Booting worker with pid:  c

sara-02 on 2018-07-05

But are the workers killed? What does the command ps ax|grep gunicorn return?

benoitc on 2018-07-05

@benoitc
screenshot from 2018-07-05 19-14-00

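sara-02 on 2018-07-05

But one question: why do we see 2 gunicorn processes when the worker limit is set to 1? Is one the master and one the worker?

sara-02 on 2018-07-05

There is 1 arbiter process (the master) and N worker processes, yes :)

So you ran the command each time a worker was booted, right? If so, it seems the old worker is killed and a new one is spawned. I will investigate.

benoitc on 2018-07-06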
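
The arbiter/worker split described above can be checked from a process listing. The sketch below is an illustration only: it assumes text in the format printed by `ps -eo pid,ppid,args`, and both the function name and the sample listing are hypothetical.

```python
from typing import Dict, List, Tuple


def split_arbiter_and_workers(ps_output: str) -> Tuple[List[int], Dict[int, List[int]]]:
    """Group gunicorn processes into arbiters and their workers.

    A gunicorn process whose parent is also a gunicorn process is treated
    as a worker; any other gunicorn process is treated as an arbiter.
    """
    parents = {}  # pid -> ppid, for gunicorn processes only
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # Skip the header line and anything that is not a gunicorn process.
        if len(parts) < 3 or not parts[0].isdigit() or "gunicorn" not in parts[2]:
            continue
        parents[int(parts[0])] = int(parts[1])

    arbiters = sorted(pid for pid, ppid in parents.items() if ppid not in parents)
    workers = {
        arbiter: sorted(pid for pid, ppid in parents.items() if ppid == arbiter)
        for arbiter in arbiters
    }
    return arbiters, workers


# Hypothetical listing for a container started with --workers=1.
SAMPLE = """\
  PID  PPID COMMAND
    1     0 /usr/local/bin/python /usr/local/bin/gunicorn --workers=1 flask_endpoint:app
    8     1 /usr/local/bin/python /usr/local/bin/gunicorn --workers=1 flask_endpoint:app
   42     0 ps -eo pid,ppid,args
"""
```

Running the listing twice a few seconds apart and comparing the worker pids would show whether old workers are being replaced, which is the question asked above.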

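@sara-02 one last thing: is this also happening in Docker?

benoitc on 2018-07-06

@benoitc with docker-compose it works as expected, but when I put the same code on OpenShift I see this error. Increasing the memory requirement did fix the issue, but when I run the application via docker-compose it uses less memory than the limit.

sara-02 on 2018-07-06

Just an update: the issue for me was actually a memory error, and it was fixed once the memory problem was fixed.

sara-02 on 2018-08-07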

👍5 🎉2

@benoitc
I am having the same issue when trying to spawn 5 gunicorn workers in Docker.
@sara-02
How did you determine that the cause was a memory error?

gulshan-gaurav on 2018-10-08

👍2

@gulshan-gaurav 2 things helped me:
First, I increased the memory allocated to my pod and the crashes stopped. Second, we checked our OpenShift Zabbix logs.

sara-02 on 2018-10-09

❤5

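@sara-02
Even on my staging pod, the files and models I load into memory amount to 50 MB, so 2 GB of memory should be sufficient for 5 workers.

gulshan-gaurav on 2018-10-09

@gulshan-gaurav which problem are you facing? Having 5 processes there looks fine...

benoitc on 2018-10-09

I had the same problem. I didn't find the exact cause, but it was solved once I upgraded from Python 3.5 to 3.6.

emilwallner on 2018-10-29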

👍1

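I am facing the same issue in a Docker container. Gunicorn keeps booting a new worker every time I call the endpoint that causes the failure, but no exception or error is written to Gunicorn's log files. The things I choose to print get logged, and then suddenly the log file just says "Booting worker with pid..."

One step that helped was adding the environment variable PYTHONUNBUFFERED. Before that, even the print statements would disappear and not be saved in Gunicorn's logs.

The application's 2 other endpoints work correctly.

I run Gunicorn with: gunicorn run:app -b localhost:5000 --enable-stdio-inheritance --error-logfile /var/log/gunicorn/error.log --access-logfile /var/log/gunicorn/access.log --capture-output --log-level debug

I am already running Python 3.6, and checking with top, memory does not seem to be the issue.

Edit: It looks like this was a Python issue and not Gunicorn's fault. Some version discrepancies were causing Python to die without any trace while performing a certain operation.

m3h0w on 2019-01-11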
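
For crashes where Python dies without any trace, one option is the standard-library faulthandler module. The following is only a sketch of a gunicorn.conf.py, assuming the `post_fork` server hook and the `loglevel`/`errorlog` settings that appear in the configuration dump later in this thread; whether it reveals anything depends on the kind of crash.

```python
# gunicorn.conf.py (illustrative sketch, not an official recommendation)
import faulthandler

loglevel = "debug"
errorlog = "-"  # "-" sends the error log to stderr


def post_fork(server, worker):
    """Server hook that gunicorn calls in each worker just after the fork.

    faulthandler.enable() installs handlers for SIGSEGV, SIGFPE, SIGABRT,
    SIGBUS and SIGILL that dump the Python traceback to stderr before the
    process dies.
    """
    faulthandler.enable()
```

Setting the environment variable PYTHONFAULTHANDLER=1 enables the same handler at interpreter start-up. Neither approach helps when the kernel OOM killer sends SIGKILL, because that signal cannot be handled.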

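I am facing a similar issue, where worker nodes keep coming up with
Booting worker with pid: 17636. I don't know whether the previous worker nodes are being killed or whether they still exist. But the number of workers given in the gunicorn command-line arguments is only 3: --workers=3. Also, I am using Python 3.7.

I had mismatched scikit-learn dependencies, but even after resolving that I still get the same infinite workers. What kind of Python version discrepancies should I look for, and how do I identify them?

sumbb on 2019-01-29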

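I'm facing the same issue in OpenShift.

As you can see in the image, I'm using 6 workers (I also tried with 3).
I increased the pod's memory, but it didn't help.

Build config:

Any ideas?

Thanks

jpramos123 on 2019-02-05

Are you running this in AWS behind an ELB? I solved this issue by putting an nginx ingress between the ELB and gunicorn.

zetaab on 2019-02-05

I have the same issue.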

flask_1  | [2019-02-23 09:08:17 +0000] [1] [INFO] Starting gunicorn 19.9.0
flask_1  | [2019-02-23 09:08:17 +0000] [1] [INFO] Listening at: http://0.0.0.0:5000 (1)
flask_1  | [2019-02-23 09:08:17 +0000] [1] [INFO] Using worker: sync
flask_1  | [2019-02-23 09:08:17 +0000] [8] [INFO] Booting worker with pid: 8
flask_1  | [2019-02-23 09:08:19 +0000] [12] [INFO] Booting worker with pid: 12
flask_1  | [2019-02-23 09:08:19 +0000] [16] [INFO] Booting worker with pid: 16
flask_1  | [2019-02-23 09:08:20 +0000] [20] [INFO] Booting worker with pid: 20
flask_1  | [2019-02-23 09:08:21 +0000] [24] [INFO] Booting worker with pid: 24
flask_1  | [2019-02-23 09:08:22 +0000] [28] [INFO] Booting worker with pid: 28
flask_1  | [2019-02-23 09:08:23 +0000] [32] [INFO] Booting worker with pid: 32
flask_1  | [2019-02-23 09:08:25 +0000] [36] [INFO] Booting worker with pid: 36
flask_1  | [2019-02-23 09:08:26 +0000] [40] [INFO] Booting worker with pid: 40
flask_1  | [2019-02-23 09:08:27 +0000] [44] [INFO] Booting worker with pid: 44
flask_1  | [2019-02-23 09:08:29 +0000] [48] [INFO] Booting worker with pid: 48
flask_1  | [2019-02-23 09:08:30 +0000] [52] [INFO] Booting worker with pid: 52
flask_1  | [2019-02-23 09:08:31 +0000] [56] [INFO] Booting worker with pid: 56
flask_1  | [2019-02-23 09:08:33 +0000] [60] [INFO] Booting worker with pid: 60
flask_1  | [2019-02-23 09:08:34 +0000] [64] [INFO] Booting worker with pid: 64
flask_1  | [2019-02-23 09:08:35 +0000] [68] [INFO] Booting worker with pid: 68
flask_1  | [2019-02-23 09:08:36 +0000] [72] [INFO] Booting worker with pid: 72
flask_1  | [2019-02-23 09:08:37 +0000] [76] [INFO] Booting worker with pid: 76
flask_1  | [2019-02-23 09:08:38 +0000] [80] [INFO] Booting worker with pid: 80
flask_1  | [2019-02-23 09:08:40 +0000] [84] [INFO] Booting worker with pid: 84
flask_1  | [2019-02-23 09:08:41 +0000] [88] [INFO] Booting worker with pid: 88
flask_1  | [2019-02-23 09:08:42 +0000] [92] [INFO] Booting worker with pid: 92
flask_1  | [2019-02-23 09:08:44 +0000] [96] [INFO] Booting worker with pid: 96
flask_1  | [2019-02-23 09:08:45 +0000] [100] [INFO] Booting worker with pid: 100
flask_1  | [2019-02-23 09:08:45 +0000] [104] [INFO] Booting worker with pid: 104
flask_1  | [2019-02-23 09:08:46 +0000] [108] [INFO] Booting worker with pid: 108
flask_1  | [2019-02-23 09:08:47 +0000] [112] [INFO] Booting worker with pid: 112
flask_1  | [2019-02-23 09:08:48 +0000] [116] [INFO] Booting worker with pid: 116
flask_1  | [2019-02-23 09:08:49 +0000] [120] [INFO] Booting worker with pid: 120
flask_1  | [2019-02-23 09:08:50 +0000] [124] [INFO] Booting worker with pid: 124
flask_1  | [2019-02-23 09:08:52 +0000] [128] [INFO] Booting worker with pid: 128

Here is the docker-compose.yml:

version: '3'
services:
  flask:
    build: .
    command: gunicorn -b 0.0.0.0:5000 hello:app --reload
    environment:
      - FLASK_APP=hello.py
      - FLASK_DEBUG=1
      - PYTHONUNBUFFERED=True
    ports:
      - "5000:5000"
    volumes:
      - ./:/root

iamtodor on 2019-02-23

What Docker image does it use?

benoitc on 2019-02-23

@benoitc

[ec2-user@ip-172-31-85-181 web-services-course]$ docker --version
Docker version 18.06.1-ce, build e68fc7a215d7133c34aa18e3b72b4a21fd0c6136
[ec2-user@ip-172-31-85-181 web-services-course]$ docker-compose --version
docker-compose version 1.23.2, build 1110ad01

Here are the links:

Dockerfile https://github.com/glebmikha/web-services-course/blob/master/Dockerfile
docker-compose.yml https://github.com/glebmikha/web-services-course/blob/master/docker-compose.yml

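iamtodor on 2019-02-23

I found out that it might be caused by a lack of memory. The app requires more memory than is available.
But that is just an assumption.

iamtodor on 2019-02-27

Just for information: I observed this behaviour when I configured gunicorn with 3 workers but deployed the code in a virtual machine with a single-core CPU. I then changed the environment to use 2 cores, and the problem apparently disappeared.

zeandrade on 2019-04-03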

👍6

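Why is "Worker exiting" only at INFO level? Why would a worker exit other than as the result of an error? It took me a long time to work out that my workers were being killed by the system OOM killer, with nothing in the logs except, as reported by some others above, "Booting worker with pid" from time to time.

HughWarrington on 2019-04-15

@HughWarrington because a worker exiting is not necessarily an error. Workers can be terminated by signals or by options such as --max-requests.

tilgovi on 2019-04-15

@HughWarrington we could probably add logging in the arbiter for when a worker exits with an abnormal exit code.

tilgovi on 2019-04-15

You can open a ticket for that, or contribute a PR that adds this code to the reap_workers method.

tilgovi on 2019-04-15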
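
The kind of logging suggested above could look roughly like the following. This is a hypothetical sketch, not the code that was later proposed for gunicorn: it only shows how a raw wait status, as returned by `os.waitpid`, can be decoded with the standard `os.WIF*` helpers on a POSIX system. The function name `describe_wait_status` is an assumption.

```python
import os
import signal


def describe_wait_status(pid: int, status: int) -> str:
    """Describe a raw wait status such as the one returned by os.waitpid()."""
    if os.WIFSIGNALED(status):
        signum = os.WTERMSIG(status)
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        core = " (core dumped)" if os.WCOREDUMP(status) else ""
        return f"Worker (pid:{pid}) was killed by {name}{core}"
    if os.WIFEXITED(status):
        code = os.WEXITSTATUS(status)
        if code == 0:
            return f"Worker (pid:{pid}) exited normally"
        return f"Worker (pid:{pid}) exited with code {code}"
    return f"Worker (pid:{pid}) changed state (raw status {status})"


# Illustrative use inside a reaping loop (not executed here):
#     pid, status = os.waitpid(-1, os.WNOHANG)
#     if pid:
#         log.warning(describe_wait_status(pid, status))
```

Applied to the strace output earlier in the thread, where wait4 reported WTERMSIG(s) == SIGSEGV && WCOREDUMP(s) for pid 510, a message of this kind would have made the segfault visible in the gunicorn log.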

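I had the same issue; the solution was to increase the pod's memory size.

ahmedash95 on 2019-04-18

I had the same problem running Gunicorn on Docker with a large spaCy model: it kept restarting workers without any error message. The solution was to increase the memory for the Docker container.

chaolyang on 2019-05-29

👍4

I just encountered this issue today on the latest (19.9.0) gunicorn, with gevent (1.4.0) workers running on Kubernetes. The app is a Falcon app, and the Docker image is the official Python image with the 3.7.3 tag.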

[2019-07-05 00:07:42 +0000] [8] [INFO] Starting gunicorn 19.9.0
[2019-07-05 00:07:42 +0000] [8] [INFO] Listening at: http://0.0.0.0:5000 (8)
[2019-07-05 00:07:42 +0000] [8] [INFO] Using worker: gevent
[2019-07-05 00:07:43 +0000] [35] [INFO] Booting worker with pid: 35
[2019-07-05 00:07:43 +0000] [36] [INFO] Booting worker with pid: 36
[2019-07-05 00:07:43 +0000] [37] [INFO] Booting worker with pid: 37
[2019-07-05 00:07:43 +0000] [38] [INFO] Booting worker with pid: 38
[2019-07-05 00:07:43 +0000] [41] [INFO] Booting worker with pid: 41
[2019-07-05 00:07:43 +0000] [43] [INFO] Booting worker with pid: 43
[2019-07-05 00:07:43 +0000] [45] [INFO] Booting worker with pid: 45
[2019-07-05 00:07:43 +0000] [49] [INFO] Booting worker with pid: 49
[2019-07-05 00:07:43 +0000] [47] [INFO] Booting worker with pid: 47
[2019-07-05 00:07:49 +0000] [53] [INFO] Booting worker with pid: 53
[2019-07-05 00:07:50 +0000] [54] [INFO] Booting worker with pid: 54
[2019-07-05 00:07:53 +0000] [57] [INFO] Booting worker with pid: 57
[...]

The pod had the following resource settings:

resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Doubling everything fixed the issue.

alexferl on 2019-07-05
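
A rough way to reason about limits like the ones above is to compare the peak memory of one worker, multiplied by the worker count, against the container limit. The sketch below is a simplification under stated assumptions: it runs on Linux (where `ru_maxrss` is reported in kilobytes), it understands only the Ki/Mi/Gi suffixes, and it ignores the arbiter process and any memory shared between workers. All function names are hypothetical.

```python
import resource

# Binary suffixes used in Kubernetes-style memory quantities.
_UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_quantity(text: str) -> int:
    """Convert a memory quantity such as '512Mi' to a number of bytes."""
    for suffix, factor in _UNITS.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)  # no suffix: treat as a plain byte count


def peak_rss_bytes() -> int:
    """Peak resident set size of the current process (Linux: KB -> bytes)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024


def fits_in_limit(limit: str, per_worker_bytes: int, workers: int) -> bool:
    """Rough check: do `workers` processes of this size fit under `limit`?"""
    return per_worker_bytes * workers <= parse_quantity(limit)
```

For example, logging `peak_rss_bytes()` from a worker after the application has loaded gives a per-worker figure that can be passed to `fits_in_limit("512Mi", ...)` together with the configured worker count.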

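One interesting thing we noticed was that, when looking at dmesg on the host machine, we could see it segfaulting in libcrypto when accessing the server using SSL.

HewittJC on 2019-07-08

Memory doesn't seem to be the issue for me, because I'm not loading any large models into memory. The workers just keep crashing and I can't see any error messages. Is there a fix for this?

aleSuglia on 2019-07-20

👍3

Same problem for me; any idea how to fix it? Python 3.6.3 with gunicorn 19.9.0.

MrKiven on 2019-09-16

@MrKiven what does your app do? Are you using something like requests?

benoitc on 2019-09-16

Can someone provide a way to reproduce the issue?

benoitc on 2019-09-16

It is a manager of several components that are executed in a pipeline. Some of them may start HTTP requests to other components on the same machine or on remote machines. Some of the modules of the pipeline can be executed in parallel, but they are executed using a ThreadPoolExecutor. They don't use any shared objects; they only generate data structures that are later aggregated into a single result.

Unfortunately, I'm not sure whether I can put together a minimal example without exposing the system we have.

aleSuglia on 2019-09-16

requests does a lot of unsafe things with threads, which sometimes fork a new process. I would advise using another client. Can you at least paste the lines you are using to make a request? Are you using its timeout feature?

benoitc on 2019-09-16

One of them could be: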

try:
     resp = requests.post(self._endpoint, json=request_data)

     if resp.status_code != 200:
          logger.critical("[Error]: status code is {}".format(resp.status_code))
          return None

     response = resp.json()
     return {"intent": response["intent"], "intent_ranking": response["intent_ranking"]}
except ConnectionError as exc:
     logger.critical("[Exception] {}".format(str(exc)))
     return None

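aleSuglia on 2019-09-16

Thanks. I will try to create a simple example from it.

It would be cool anyway if someone could send us a PR that reproduces the behaviour, either as an example or as a unit test, so we can make sure we are actually fixing the right thing.

benoitc on 2019-09-17

Not sure if this can help someone, but I had the same issue while running a dockerized Flask webapp, and solved it by updating the base image of my Dockerfile to python:3.6.9-alpine.

dmesg on the host showed a segfault in libpython3.6m.so.1.0: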

[626278.653010] gunicorn[19965]: segfault at 70 ip 00007f6423e7faee sp 00007ffc4e9a2a38 error 4 in libpython3.6m.so.1.0[7f6423d8a000+194000]

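My Docker image was based on python:3.6-alpine and ran apk update, which was updating Python to 3.6.8.

As said, changing the base image to python:3.6.9-alpine solved it for me.

g-bon on 2019-10-03

I faced the same challenge running Flask + Docker + Kubernetes. Increasing the CPU and memory limits solved it for me.

Ogala on 2019-11-22

👍4

The same thing happened to us. Increasing the resource limits fixed the problem.

ramnes on 2019-12-18

👍2

This suddenly started happening to me on macOS Catalina (not containerized).

What helped me was:

Installing openssl: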

brew install openssl

Running this and adding it to my ~/.zshrc:

export DYLD_LIBRARY_PATH=/usr/local/opt/openssl/lib:$DYLD_LIBRARY_PATH

Source: https :

minderov on 2020-02-08

I'm having a similar challenge and would be grateful if someone could help me out.
This is what I have:

root@ubuntu-s-1vcpu-1gb-nyc1-01:~# sudo systemctl status gunicorn.service
● gunicorn.service - gunicorn daemon
   Loaded: loaded (/etc/systemd/system/gunicorn.service; disabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-02-24 07:48:04 UTC; 44min ago
 Main PID: 4846 (gunicorn)
    Tasks: 4 (limit: 1151)
   CGroup: /system.slice/gunicorn.service
           ├─4846 /home/bright/djangoprojectdir/djangoprojectenv/bin/python /home/bright/djangoprojectdir/djangoprojectenv/bin/gunicorn -
           ├─4866 /home/bright/djangoprojectdir/djangoprojectenv/bin/python /home/bright/djangoprojectdir/djangoprojectenv/bin/gunicorn -
           ├─4868 /home/bright/djangoprojectdir/djangoprojectenv/bin/python /home/bright/djangoprojectdir/djangoprojectenv/bin/gunicorn -
           └─4869 /home/bright/djangoprojectdir/djangoprojectenv/bin/python /home/bright/djangoprojectdir/djangoprojectenv/bin/gunicorn -

Feb 24 07:48:04 ubuntu-s-1vcpu-1gb-nyc1-01 systemd[1]: Stopped gunicorn daemon.
Feb 24 07:48:04 ubuntu-s-1vcpu-1gb-nyc1-01 systemd[1]: Started gunicorn daemon.
Feb 24 07:48:05 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: [2020-02-24 07:48:05 +0000] [4846] [INFO] Starting gunicorn 20.0.4
Feb 24 07:48:05 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: [2020-02-24 07:48:05 +0000] [4846] [INFO] Listening at: unix:/run/gunicorn.soc
Feb 24 07:48:05 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: [2020-02-24 07:48:05 +0000] [4846] [INFO] Using worker: sync
Feb 24 07:48:05 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: [2020-02-24 07:48:05 +0000] [4866] [INFO] Booting worker with pid: 4866
Feb 24 07:48:05 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: [2020-02-24 07:48:05 +0000] [4868] [INFO] Booting worker with pid: 4868
Feb 24 07:48:05 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: [2020-02-24 07:48:05 +0000] [4869] [INFO] Booting worker with pid: 4869
Feb 24 08:03:41 ubuntu-s-1vcpu-1gb-nyc1-01 gunicorn[4846]: - - [24/Feb/2020:08:03:41 +0000] "GET / HTTP/1.0" 400 26 "-" "Mozilla/5.0 (Wi
lines 1-20/20 (END)

Can anyone help me fix this?

BrightNana on 2020-02-24

@BrightNana can you try running dmesg to see whether you have any gunicorn errors?
dmesg | grep gunicorn could help filter out the other errors.

g-bon on 2020-02-24
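
Once dmesg has been filtered, a kernel segfault line can be broken into its parts. The sketch below is illustrative only: the regular expression is written against the two "segfault at ..." lines quoted in this thread, and the helper name `parse_segfault` is hypothetical. Other kernel message formats, such as "traps: ... general protection", are not handled.

```python
import re
from typing import Optional

# Matches lines such as:
#   gunicorn[19965]: segfault at 70 ip ... sp ... error 4 in lib.so[base+size]
_SEGFAULT = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>\d+) "
    r"in (?P<object>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line: str) -> Optional[dict]:
    """Extract process, pid and faulting shared object from a dmesg line."""
    match = _SEGFAULT.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["pid"] = int(info["pid"])
    # Offset of the faulting instruction inside the shared object:
    # instruction pointer minus the address the object was mapped at.
    info["offset"] = hex(int(info["ip"], 16) - int(info["base"], 16))
    return info


# Line quoted earlier in this thread.
SAMPLE = (
    "[626278.653010] gunicorn[19965]: segfault at 70 ip 00007f6423e7faee "
    "sp 00007ffc4e9a2a38 error 4 in libpython3.6m.so.1.0[7f6423d8a000+194000]"
)
```

The `object` field is usually the most useful part, since it names the library in which the crash happened (libpython, libcrypto and _greenlet in the reports collected here).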

Hello,
I'm getting the same error on Debian 9 when I try to run gunicorn as a systemd service. If I start it from the CLI, gunicorn runs without errors.

Extract from dmesg | grep gunicorn:

Extract from journalctl:
Mär 12 07:01:06 build-server gunicorn[828]: [2020-03-12 07:01:06 +0100] [1054] [INFO] Booting worker with pid: 1054
Mär 12 07:01:06 build-server gunicorn[828]: [2020-03-12 07:01:06 +0100] [1057] [INFO] Booting worker with pid: 1057
Mär 12 07:01:06 build-server gunicorn[828]: [2020-03-12 07:01:06 +0100] [1060] [INFO] Booting worker with pid: 1060
Mär 12 07:01:07 build-server gunicorn[828]: [2020-03-12 07:01:07 +0100] [1064] [INFO] Booting worker with pid: 1064
Mär 12 07:01:07 build-server gunicorn[828]: [2020-03-12 07:01:07 +0100] [1067] [INFO] Booting worker with pid: 1067
Mär 12 07:01:07 build-server gunicorn[828]: [2020-03-12 07:01:07 +0100] [1070] [INFO] Booting worker with pid: 1070
Mär 12 07:01:07 build-server gunicorn[828]: [2020-03-12 07:01:07 +0100] [1073] [INFO] Booting worker with pid: 1073
Mär 12 07:01:07 build-server gunicorn[828]: [2020-03-12 07:01:07 +0100] [1076] [INFO] Booting worker with pid: 1076
Mär 12 07:01:08 build-server gunicorn[828]: [2020-03-12 07:01:08 +0100] [1079] [INFO] Booting worker with pid: 1079
Mär 12 07:01:08 build-server gunicorn[828]: [2020-03-12 07:01:08 +0100] [1082] [INFO] Booting worker with pid: 1082
Mär 12 07:01:08 build-server gunicorn[828]: [2020-03-12 07:01:08 +0100] [1085] [INFO] Booting worker with pid: 1085
Mär 12 07:01:08 build-server gunicorn[828]: [2020-03-12 07:01:08 +0100] [1088] [INFO] Booting worker with pid: 1088
Mär 12 07:01:08 build-server gunicorn[828]: [2020-03-12 07:01:08 +0100] [1091] [INFO] Booting worker with pid: 1091
Mär 12 07:01:09 build-server gunicorn[828]: [2020-03-12 07:01:09 +0100] [1094] [INFO] Booting worker with pid: 1094
Extract from systemctl status:
● api.service - API Server for BuildingChallenge served with Gunicorn
   Loaded: loaded (/etc/systemd/system/api.service; disabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-03-12 08:26:01 CET; 22min ago
 Main PID: 8150 (gunicorn)
    Tasks: 3 (limit: 4915)
   Memory: 37.7M (high: 100.0M max: 500.0M)
   CGroup: /system.slice/api.service
           ├─ 8150 /opt/api/venv/bin/python /opt/api/venv/bin/gunicorn --bind unix:api.sock wsgi:app
           ├─28936 /opt/api/venv/bin/python /opt/api/venv/bin/gunicorn --bind unix:api.sock wsgi:app
           └─28938 /usr/bin/python3 -Es /usr/bin/lsb_release -a
Mär 12 08:48:01 build-server gunicorn[8150]: [2020-03-12 08:48:01 +0100] [28909] [INFO] Booting worker with pid: 28909
Mär 12 08:48:01 build-server gunicorn[8150]: [2020-03-12 08:48:01 +0100] [28912] [INFO] Booting worker with pid: 28912
Mär 12 08:48:01 build-server gunicorn[8150]: [2020-03-12 08:48:01 +0100] [28915] [INFO] Booting worker with pid: 28915
Mär 12 08:48:01 build-server gunicorn[8150]: [2020-03-12 08:48:01 +0100] [28918] [INFO] Booting worker with pid: 28918
Mär 12 08:48:01 build-server gunicorn[8150]: [2020-03-12 08:48:01 +0100] [28921] [INFO] Booting worker with pid: 28921
Mär 12 08:48:01 build-server gunicorn[8150]: [2020-03-12 08:48:01 +0100] [28924] [INFO] Booting worker with pid: 28924
Mär 12 08:48:02 build-server gunicorn[8150]: [2020-03-12 08:48:02 +0100] [28927] [INFO] Booting worker with pid: 28927
Mär 12 08:48:02 build-server gunicorn[8150]: [2020-03-12 08:48:02 +0100] [28930] [INFO] Booting worker with pid: 28930
Mär 12 08:48:02 build-server gunicorn[8150]: [2020-03-12 08:48:02 +0100] [28933] [INFO] Booting worker with pid: 28933
Mär 12 08:48:02 build-server gunicorn[8150]: [2020-03-12 08:48:02 +0100] [28936] [INFO] Booting worker with pid: 28936

Thanks for your help.

GUlbricht on 2020-03-12

I've put up a PR that might help debug these kinds of situations. Can anyone take a look?
https://github.com/benoitc/gunicorn/pull/2315

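tilgovi on 2020-04-21

I had the same issue with a Flask application running inside Docker. The workers restarted infinitely, with increasing process IDs.

The issue was memory-related for me: when I increased the memory allowed for Docker, the workers spawned properly.

Ahmed-Mosharafa on 2020-06-17

👍1

@tilgovi, I don't mind if you'd like to incorporate my changes into your PR, since you got there first. This would cover workers being killed via signals.

mildebrandt on 2020-09-12

@mildebrandt I'll take a look, thanks!

tilgovi on 2020-09-13

👍1

I'm also suddenly seeing this behaviour, using Gunicorn (20.0.4) + Gevent (1.5.0) + Flask inside a Docker container.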

[  328.699160] gunicorn[5151]: segfault at 78 ip 00007fc1113c16be sp 00007ffce50452a0 error 4 in _greenlet.cpython-37m-x86_64-linux-gnu.so[7fc11138d000+3e000]

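In my case, as you can see, the segfault is being caused by gevent. What is weird is that this container worked fine 5 days ago, and none of the code changes since then changed the version of any library; all of them are pinned to specific versions. I did remove flask-mail as a dependency, which may have slightly altered the versions of other dependencies.

Updating from gevent==1.5.0 to gevent==20.9.0 resolved the issue for me.

ifiddes on 2020-09-24

👍1

@ifiddes your issue is likely unrelated. You are seeing an ABI compatibility issue between an old version of gevent and the latest version of greenlet. See https://github.com/python-greenlet/greenlet/issues/178

jamadden on 2020-09-24

Ah, thanks @jamadden. This thread was all I could find when searching for infinite spawning of booting workers, but that issue and its timing fit my problem.

ifiddes on 2020-09-24

I'm getting a similar error on a new AWS machine with Ubuntu 20.04 Server, running the same production code.

The machine was configured with Ansible, like the other production machines.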

[2020-10-15 15:11:49 +0000] [18068] [DEBUG] Current configuration:
  config: None
  bind: ['127.0.0.1:8000']
  backlog: 2048
  workers: 1
  worker_class: uvicorn.workers.UvicornWorker
  threads: 1
  worker_connections: 1000
  max_requests: 0
  max_requests_jitter: 0
  timeout: 30
  graceful_timeout: 30
  keepalive: 2
  limit_request_line: 4094
  limit_request_fields: 100
  limit_request_field_size: 8190
  reload: False
  reload_engine: auto
  reload_extra_files: []
  spew: False
  check_config: False
  preload_app: False
  sendfile: None
  reuse_port: False
  chdir: /var/www/realistico/app
  daemon: False
  raw_env: []
  pidfile: None
  worker_tmp_dir: None
  user: 1001
  group: 1001
  umask: 0
  initgroups: False
  tmp_upload_dir: None
  secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
  forwarded_allow_ips: ['127.0.0.1']
  accesslog: /var/www/realistico/logs/gunicorn/access.log
  disable_redirect_access_to_syslog: False
  access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
  errorlog: /var/www/realistico/logs/gunicorn/error.log
  loglevel: debug
  capture_output: False
  logger_class: gunicorn.glogging.Logger
  logconfig: None
  logconfig_dict: {}
  syslog_addr: udp://localhost:514
  syslog: False
  syslog_prefix: None
  syslog_facility: user
  enable_stdio_inheritance: False
  statsd_host: None
  dogstatsd_tags: 
  statsd_prefix: 
  proc_name: None
  default_proc_name: realistico.asgi:application
  pythonpath: None
  paste: None
  on_starting: <function OnStarting.on_starting at 0x7f7ba5fdd550>
  on_reload: <function OnReload.on_reload at 0x7f7ba5fdd670>
  when_ready: <function WhenReady.when_ready at 0x7f7ba5fdd790>
  pre_fork: <function Prefork.pre_fork at 0x7f7ba5fdd8b0>
  post_fork: <function Postfork.post_fork at 0x7f7ba5fdd9d0>
  post_worker_init: <function PostWorkerInit.post_worker_init at 0x7f7ba5fddaf0>
  worker_int: <function WorkerInt.worker_int at 0x7f7ba5fddc10>
  worker_abort: <function WorkerAbort.worker_abort at 0x7f7ba5fddd30>
  pre_exec: <function PreExec.pre_exec at 0x7f7ba5fdde50>
  pre_request: <function PreRequest.pre_request at 0x7f7ba5fddf70>
  post_request: <function PostRequest.post_request at 0x7f7ba5f6e040>
  child_exit: <function ChildExit.child_exit at 0x7f7ba5f6e160>
  worker_exit: <function WorkerExit.worker_exit at 0x7f7ba5f6e280>
  nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x7f7ba5f6e3a0>
  on_exit: <function OnExit.on_exit at 0x7f7ba5f6e4c0>
  proxy_protocol: False
  proxy_allow_ips: ['127.0.0.1']
  keyfile: None
  certfile: None
  ssl_version: 2
  cert_reqs: 0
  ca_certs: None
  suppress_ragged_eofs: True
  do_handshake_on_connect: False
  ciphers: None
  raw_paste_global_conf: []
  strip_header_spaces: False
[2020-10-15 15:11:49 +0000] [18068] [INFO] Starting gunicorn 20.0.4
[2020-10-15 15:11:49 +0000] [18068] [DEBUG] Arbiter booted
[2020-10-15 15:11:49 +0000] [18068] [INFO] Listening at: unix:/run/gunicorn.sock (18068)
[2020-10-15 15:11:49 +0000] [18068] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2020-10-15 15:11:49 +0000] [18080] [INFO] Booting worker with pid: 18080
[2020-10-15 15:11:49 +0000] [18068] [DEBUG] 1 workers
[2020-10-15 15:11:51 +0000] [18083] [INFO] Booting worker with pid: 18083
[2020-10-15 15:11:53 +0000] [18086] [INFO] Booting worker with pid: 18086
...
[2020-10-15 15:12:09 +0000] [18120] [INFO] Booting worker with pid: 18120
[2020-10-15 15:12:11 +0000] [18123] [INFO] Booting worker with pid: 18123

After many failed attempts to solve this problem (and with no errors in the logs), I tried this Hello world and found this error:

ModuleNotFoundError: No module named 'httptools'

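After installing httptools, the Hello world application works fine and, unexpectedly, my application works too.

I have no idea why the error was not logged, or why this library was installed on the other machines but not on the new one, but this fixed the problem for me.

ciotto on 2020-10-15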
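
A missing dependency like the one above can be detected before gunicorn is started. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the list of names to check (for example uvicorn and httptools, as in this comment) has to be supplied by the reader.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found.

    importlib.util.find_spec() returns None when a top-level module is not
    installed for the current interpreter, without importing it.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules(["uvicorn", "httptools"])` with the same interpreter that runs gunicorn would list any module that is absent on the new machine.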

I had this happen recently, and it took down the Kubernetes node it was on by consuming all the CPU. Thanks to the hint about dmesg, I eventually found an error:

[225027.348869] traps: python[44796] general protection ip:7f8bd8f8f8b0 sp:7ffc21a0b370 error:0 in libpython3.7m.so.1.0[7f8bd8dca000+2d9000]

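In the end, my issue was another instance of https://github.com/python-greenlet/greenlet/issues/178 and was resolved by updating gunicorn, gevent, and greenlet to the latest versions.

Since these types of exceptions create no Python logs, cannot be caught, return exit code 0, and can hang the machine when they occur, they are pretty difficult to manage.

I propose that gunicorn detect rapid crash-looping of this nature and:

either give up or throttle the spawning of new workers
provide a helpful message pointing people to this issue and https://github.com/python-greenlet/greenlet/issues/178

Maybe max_consecutive_startup_crashes, defaulting to num_workers * 10?

brycedrennan on 2020-11-10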
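
The proposal above could be prototyped along these lines. This is a hypothetical sketch, not gunicorn code: the class name `CrashLoopGuard`, the `startup_window` parameter and its default are assumptions, and only the `max_consecutive_startup_crashes = num_workers * 10` default comes from the comment.

```python
import time
from typing import Callable, Dict


class CrashLoopGuard:
    """Counts consecutive workers that die shortly after booting."""

    def __init__(
        self,
        num_workers: int,
        startup_window: float = 5.0,  # assumed: seconds that count as "startup"
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_consecutive_startup_crashes = num_workers * 10
        self.startup_window = startup_window
        self.consecutive_crashes = 0
        self._clock = clock
        self._boot_times: Dict[int, float] = {}

    def worker_booted(self, pid: int) -> None:
        """Remember when a worker was started."""
        self._boot_times[pid] = self._clock()

    def worker_exited(self, pid: int) -> bool:
        """Record an exit; return True when spawning should stop."""
        booted = self._boot_times.pop(pid, None)
        if booted is not None and self._clock() - booted < self.startup_window:
            self.consecutive_crashes += 1
        else:
            # A worker that survived the startup window resets the streak.
            self.consecutive_crashes = 0
        return self.consecutive_crashes >= self.max_consecutive_startup_crashes
```

An arbiter-like loop would call `worker_booted` after each fork and stop (or slow down) respawning once `worker_exited` returns True, logging a message that points at the likely causes collected in this thread.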

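Let's track the crash-loop feature request in #2504. We also have the PR for additional logging in #2315. I'll close this issue, since it seems everyone has debugged their problems and we now have some feature requests and improvements to help others. Thanks, everyone!

tilgovi on 2021-02-16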

👍1
