Go: runtime: mlock of signal stack failed: 12

Created on 25 Feb 2020  ·  128 Comments  ·  Source: golang/go

What version of Go are you using (go version)?

$ go version
go version go1.14rc1 linux/amd64

Does this issue reproduce with the latest release?

I hit this with the golang:1.14-rc-alpine docker image; the error does not happen in 1.13.

What operating system and processor architecture are you using (go env)?

go env Output

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build968395959=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Clone https://github.com/ethereum/go-ethereum, change the builder version in the Dockerfile to golang:1.14-rc-alpine (or use the Dockerfile from below), then from the root build the docker image:

$ docker build .

FROM golang:1.14-rc-alpine

RUN apk add --no-cache make gcc musl-dev linux-headers git

ADD . /go-ethereum
RUN cd /go-ethereum && make geth

What did you expect to see?

Go should run our build scripts successfully.

What did you see instead?

Step 4/9 : RUN cd /go-ethereum && make geth
 ---> Running in 67781151653c
env GO111MODULE=on go run build/ci.go install ./cmd/geth
runtime: mlock of signal stack failed: 12
runtime: increase the mlock limit (ulimit -l) or
runtime: update your kernel to 5.3.15+, 5.4.2+, or 5.5+
fatal error: mlock failed

runtime stack:
runtime.throw(0xa3b461, 0xc)
    /usr/local/go/src/runtime/panic.go:1112 +0x72
runtime.mlockGsignal(0xc0004a8a80)
    /usr/local/go/src/runtime/os_linux_x86.go:72 +0x107
runtime.mpreinit(0xc000401880)
    /usr/local/go/src/runtime/os_linux.go:341 +0x78
runtime.mcommoninit(0xc000401880)
    /usr/local/go/src/runtime/proc.go:630 +0x108
runtime.allocm(0xc000033800, 0xa82400, 0x0)
    /usr/local/go/src/runtime/proc.go:1390 +0x14e
runtime.newm(0xa82400, 0xc000033800)
    /usr/local/go/src/runtime/proc.go:1704 +0x39
runtime.startm(0x0, 0xc000402901)
    /usr/local/go/src/runtime/proc.go:1869 +0x12a
runtime.wakep(...)
    /usr/local/go/src/runtime/proc.go:1953
runtime.resetspinning()
    /usr/local/go/src/runtime/proc.go:2415 +0x93
runtime.schedule()
    /usr/local/go/src/runtime/proc.go:2527 +0x2de
runtime.mstart1()
    /usr/local/go/src/runtime/proc.go:1104 +0x8e
runtime.mstart()
    /usr/local/go/src/runtime/proc.go:1062 +0x6e

...
make: *** [Makefile:16: geth] Error 2
NeedsInvestigation

All 128 comments

That is the consequence of trying to work around a kernel bug that significantly impacts Go programs. See https://github.com/golang/go/issues/35777. The error message suggests the only two known available fixes: increase the ulimit or upgrade to a newer kernel.

The error message suggests the only two known available fixes: increase the ulimit or upgrade to a newer kernel.

Well, I'm running the official alpine docker image, the purpose of which is to be able to build a Go program. Apparently it cannot. IMHO the upstream image should be the one fixed to fulfill its purpose, not our build infra to hack around a bug in the upstream image.

Is the Alpine image maintained by the Go team? (Genuine question. I don’t know about it.) Either way, yes, the image should be fixed, ideally with a kernel upgrade.

I'm not fully sure who maintains the docker images and how (https://hub.docker.com/_/golang), but the docker hub repo is an "Official Image", which is a super hard status to obtain, so I assume someone high enough up the food chain is responsible.

It's "maintained by the Docker Community". Issues should be filed at

https://github.com/docker-library/golang/issues

EDIT: the problem is the host kernel, not the Docker library image, so they can't fix it.

So, the official solution to Go crashing is to point fingers to everyone else to hack around your code? Makes sense.

@karalabe I would like to remind you of https://golang.org/conduct. In particular, please be respectful and be charitable.

Please answer the question

It is standard practice to redirect issues to the correct issue tracking system.

There is an extensive discussion of possible workarounds and fixes in the issue I linked to earlier, if you would like to see what options were considered on the Go side.

This issue does not happen with Go 1.13. Ergo, it is a bug introduced in Go 1.14.

Saying you can't fix it and telling people to use workarounds is dishonest, because reverting a piece of code would actually fix it. An alternative solution would be to detect the problematic platforms / kernels and provide a fallback mechanism baked into Go.

Telling people to use a different kernel is especially nasty, because it's not as if most people can go around and build themselves a new kernel. If alpine doesn't release a new kernel, there's not much most devs can do. And lastly if your project relies on a stable infrastructure where you can't just swap out kernels, you're again in a pickle.

It is standard practice to redirect issues to the correct issue tracking system.

The fact that Go crashes is not the fault of docker. Redirecting a Go crash to a docker repo is deflection.

You could also disable preemptive scheduling at runtime

$ GODEBUG=asyncpreemptoff=1 ./your_app

@ianlancetaylor we have a suggestion to do this when running on an affected kernel; is that viable?

BTW, It's a known problem that Docker library modules don't get timely updates, which is a security liability. Caveat emptor.

The kernel bug manifested as random memory corruption in Go 1.13 (both with and without preemptive scheduling). What is new in Go 1.14 is that we detect the presence of the bug, attempt to work around it, and prefer to crash early and loudly if that is not possible. You can see the details in the issue I referred you to.

Since you have called me dishonest and nasty, I will remind you again about the code of conduct: https://golang.org/conduct. I am also done participating in this conversation.

@karalabe, I misspoke, the issue is your host kernel, not the Docker image. Are you unable to update it?

I'm on latest Ubuntu and latest available kernel. Apparently all available Ubuntu kernels are unsuitable for Go 1.14 https://packages.ubuntu.com/search?keywords=linux-image-generic based on the error message.

Can you add the output of $ uname -a to the main issue text? And maybe remove the goroutine stack traces?

I've posted a note to golang-dev.

cc @aclements

When you say you are on the latest ubuntu and kernel what exactly do you mean (i.e. output of dpkg -l linux-image-*, lsb_release -a, uname -a, that sort of thing) because as far as I can see the fix is in the kernel in the updates pocket for both 19.10 (current stable release) and 20.04 (devel release). It's not in the GA kernel for 18.04 but is in the HWE kernel, but otoh those aren't built with gcc 9 and so shouldn't be affected anyway.

@networkimprov Disabling signal preemption makes the bug less likely to occur but it is still present. It's a bug in certain Linux kernel versions. The bug affects all programs in all languages. It's particularly likely to be observable with Go programs that use signal preemption, but it's present for all other programs as well.

Go tries to work around the bug by mlocking the signal stack. That works fine unless you run into the mlock limit. I suppose that one downside of this workaround is that we make the problem very visible, rather than occasionally failing due to random memory corruption as would happen if we didn't do the mlock.

At some point there is no way to work around a kernel bug.
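For anyone curious what the failing call looks like in isolation, here is a hypothetical standalone demo (not the runtime's code): it keeps mlocking 4 KiB buffers until RLIMIT_MEMLOCK is exhausted, at which point Mlock returns ENOMEM, i.e. errno 12, the number in the crash message above.

package main

import (
	"fmt"
	"syscall"
)

func main() {
	var bufs [][]byte // keep references so the locked buffers stay alive
	for i := 0; i < 1<<14; i++ {
		buf := make([]byte, 4096)
		if err := syscall.Mlock(buf); err != nil {
			fmt.Printf("mlock failed after %d buffers: %v\n", i, err)
			return
		}
		bufs = append(bufs, buf)
	}
	fmt.Printf("locked %d buffers without hitting the limit\n", len(bufs))
}

Under Docker's default 64 KiB limit this fails after a handful of iterations; with an unlimited setting it runs to completion.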

@karalabe

I'm on latest Ubuntu and latest available kernel

$ docker pull -q ubuntu:latest
docker.io/library/ubuntu:latest
$ docker run --rm -i -t ubuntu
root@e2689d364a25:/# uname -a
Linux e2689d364a25 5.4.8-050408-generic #202001041436 SMP Sat Jan 4 19:40:55 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

which does satisfy the minimum version requirements.

Similarly:

$ docker pull -q golang:1.14-alpine
docker.io/library/golang:1.14-alpine
$ docker run --rm -i -t golang:1.14-alpine
/go # uname -a
Linux d4a35392c5b8 5.4.8-050408-generic #202001041436 SMP Sat Jan 4 19:40:55 UTC 2020 x86_64 Linux

Can you clarify what you're seeing?

@mwhudson

$ dpkg -l linux-image-*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                   Version      Architecture Description
+++-======================================-============-============-===============================================================
rc  linux-image-4.13.0-16-generic          4.13.0-16.19 amd64        Linux kernel image for version 4.13.0 on 64 bit x86 SMP
rc  linux-image-4.13.0-19-generic          4.13.0-19.22 amd64        Linux kernel image for version 4.13.0 on 64 bit x86 SMP
rc  linux-image-4.13.0-21-generic          4.13.0-21.24 amd64        Linux kernel image for version 4.13.0 on 64 bit x86 SMP
rc  linux-image-4.13.0-25-generic          4.13.0-25.29 amd64        Linux kernel image for version 4.13.0 on 64 bit x86 SMP
rc  linux-image-4.13.0-36-generic          4.13.0-36.40 amd64        Linux kernel image for version 4.13.0 on 64 bit x86 SMP
rc  linux-image-4.13.0-37-generic          4.13.0-37.42 amd64        Linux kernel image for version 4.13.0 on 64 bit x86 SMP
rc  linux-image-4.13.0-38-generic          4.13.0-38.43 amd64        Linux kernel image for version 4.13.0 on 64 bit x86 SMP
rc  linux-image-4.13.0-41-generic          4.13.0-41.46 amd64        Linux kernel image for version 4.13.0 on 64 bit x86 SMP
rc  linux-image-4.13.0-45-generic          4.13.0-45.50 amd64        Linux kernel image for version 4.13.0 on 64 bit x86 SMP
rc  linux-image-4.15.0-23-generic          4.15.0-23.25 amd64        Signed kernel image generic
rc  linux-image-4.15.0-30-generic          4.15.0-30.32 amd64        Signed kernel image generic
rc  linux-image-4.15.0-32-generic          4.15.0-32.35 amd64        Signed kernel image generic
rc  linux-image-4.15.0-34-generic          4.15.0-34.37 amd64        Signed kernel image generic
rc  linux-image-4.15.0-36-generic          4.15.0-36.39 amd64        Signed kernel image generic
rc  linux-image-4.15.0-39-generic          4.15.0-39.42 amd64        Signed kernel image generic
rc  linux-image-4.15.0-42-generic          4.15.0-42.45 amd64        Signed kernel image generic
rc  linux-image-4.15.0-43-generic          4.15.0-43.46 amd64        Signed kernel image generic
rc  linux-image-4.15.0-45-generic          4.15.0-45.48 amd64        Signed kernel image generic
rc  linux-image-4.15.0-47-generic          4.15.0-47.50 amd64        Signed kernel image generic
rc  linux-image-4.18.0-17-generic          4.18.0-17.18 amd64        Signed kernel image generic
rc  linux-image-5.0.0-13-generic           5.0.0-13.14  amd64        Signed kernel image generic
rc  linux-image-5.0.0-15-generic           5.0.0-15.16  amd64        Signed kernel image generic
rc  linux-image-5.0.0-16-generic           5.0.0-16.17  amd64        Signed kernel image generic
rc  linux-image-5.0.0-17-generic           5.0.0-17.18  amd64        Signed kernel image generic
rc  linux-image-5.0.0-19-generic           5.0.0-19.20  amd64        Signed kernel image generic
rc  linux-image-5.0.0-20-generic           5.0.0-20.21  amd64        Signed kernel image generic
rc  linux-image-5.0.0-21-generic           5.0.0-21.22  amd64        Signed kernel image generic
rc  linux-image-5.0.0-25-generic           5.0.0-25.26  amd64        Signed kernel image generic
rc  linux-image-5.0.0-27-generic           5.0.0-27.28  amd64        Signed kernel image generic
rc  linux-image-5.0.0-29-generic           5.0.0-29.31  amd64        Signed kernel image generic
rc  linux-image-5.0.0-32-generic           5.0.0-32.34  amd64        Signed kernel image generic
rc  linux-image-5.3.0-19-generic           5.3.0-19.20  amd64        Signed kernel image generic
rc  linux-image-5.3.0-22-generic           5.3.0-22.24  amd64        Signed kernel image generic
rc  linux-image-5.3.0-23-generic           5.3.0-23.25  amd64        Signed kernel image generic
rc  linux-image-5.3.0-24-generic           5.3.0-24.26  amd64        Signed kernel image generic
rc  linux-image-5.3.0-26-generic           5.3.0-26.28  amd64        Signed kernel image generic
ii  linux-image-5.3.0-29-generic           5.3.0-29.31  amd64        Signed kernel image generic
ii  linux-image-5.3.0-40-generic           5.3.0-40.32  amd64        Signed kernel image generic
rc  linux-image-extra-4.13.0-16-generic    4.13.0-16.19 amd64        Linux kernel extra modules for version 4.13.0 on 64 bit x86 SMP
rc  linux-image-extra-4.13.0-19-generic    4.13.0-19.22 amd64        Linux kernel extra modules for version 4.13.0 on 64 bit x86 SMP
rc  linux-image-extra-4.13.0-21-generic    4.13.0-21.24 amd64        Linux kernel extra modules for version 4.13.0 on 64 bit x86 SMP
rc  linux-image-extra-4.13.0-25-generic    4.13.0-25.29 amd64        Linux kernel extra modules for version 4.13.0 on 64 bit x86 SMP
rc  linux-image-extra-4.13.0-36-generic    4.13.0-36.40 amd64        Linux kernel extra modules for version 4.13.0 on 64 bit x86 SMP
rc  linux-image-extra-4.13.0-37-generic    4.13.0-37.42 amd64        Linux kernel extra modules for version 4.13.0 on 64 bit x86 SMP
rc  linux-image-extra-4.13.0-38-generic    4.13.0-38.43 amd64        Linux kernel extra modules for version 4.13.0 on 64 bit x86 SMP
rc  linux-image-extra-4.13.0-41-generic    4.13.0-41.46 amd64        Linux kernel extra modules for version 4.13.0 on 64 bit x86 SMP
rc  linux-image-extra-4.13.0-45-generic    4.13.0-45.50 amd64        Linux kernel extra modules for version 4.13.0 on 64 bit x86 SMP
ii  linux-image-generic                    5.3.0.40.34  amd64        Generic Linux kernel image

$ lsb_release -a

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 19.10
Release:    19.10
Codename:   eoan
$ uname -a

Linux roaming-parsley 5.3.0-40-generic #32-Ubuntu SMP Fri Jan 31 20:24:34 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ sudo apt-get dist-upgrade 

Reading package lists... Done
Building dependency tree       
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

@myitcv

FROM golang:1.14-alpine
RUN  apk add --no-cache make gcc musl-dev linux-headers git wget

RUN \
  wget -O geth.tgz "https://github.com/ethereum/go-ethereum/archive/v1.9.11.tar.gz" && \
  mkdir /go-ethereum && tar -C /go-ethereum -xzf geth.tgz --strip-components=1 && \
  cd /go-ethereum && make geth
$ docker build .

Sending build context to Docker daemon  2.048kB
Step 1/3 : FROM golang:1.14-alpine
1.14-alpine: Pulling from library/golang
c9b1b535fdd9: Already exists 
cbb0d8da1b30: Already exists 
d909eff28200: Already exists 
8b9d9d6824f5: Pull complete 
a50ef8b76e53: Pull complete 
Digest: sha256:544b5e7984e7b2e7a2a9b967bbab6264cf91a3b3816600379f5dc6fbc09466cc
Status: Downloaded newer image for golang:1.14-alpine
 ---> 51e47ee4db58

Step 2/3 : RUN  apk add --no-cache make gcc musl-dev linux-headers git wget
 ---> Running in 879f98ddb4ff
[...]
OK: 135 MiB in 34 packages
Removing intermediate container 879f98ddb4ff
 ---> 9132e4dae4c3

Step 3/3 : RUN   wget -O geth.tgz "https://github.com/ethereum/go-ethereum/archive/v1.9.11.tar.gz" &&   mkdir /go-ethereum && tar -C /go-ethereum -xzf geth.tgz --strip-components=1 &&   cd /go-ethereum && make geth
 ---> Running in a24c806c60d3
2020-02-26 07:18:54--  https://github.com/ethereum/go-ethereum/archive/v1.9.11.tar.gz
[...]
2020-02-26 07:18:58 (2.48 MB/s) - 'geth.tgz' saved [8698235]

env GO111MODULE=on go run build/ci.go install ./cmd/geth
runtime: mlock of signal stack failed: 12
runtime: increase the mlock limit (ulimit -l) or
runtime: update your kernel to 5.3.15+, 5.4.2+, or 5.5+
fatal error: mlock failed

Sorry, my previous comment was misleading. Because of course the kernel version returned by uname -a within the Docker container will be that of the host.

Hence per:

$ uname -a

Linux roaming-parsley 5.3.0-40-generic #32-Ubuntu SMP Fri Jan 31 20:24:34 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

you need to upgrade the host OS kernel.

FWIW, the steps you lay out above using Alpine to make geth work for me:

...
Done building.
Run "./build/bin/geth" to launch geth.

Yes, but in my previous posts I highlighted that I'm already on the latest Ubuntu and have installed the latest available kernel from the package repository. I don't see how I could update my kernel to work with Go 1.14 apart from rebuilding the entire kernel from source. Maybe I'm missing something?

Just to emphasize, I do understand what the workaround is and if I want to make it work, I can. I opened this issue report because I'd expect other people to hit the same problem eventually. If just updating my system would fix the issue I'd gladly accept that as a solution, but unless I'm missing something, the fixed kernel is not available for (recent) Ubuntu users, so quite a large userbase might be affected.

Yes, but in my previous posts I highlighted that I'm already on the latest Ubuntu and have installed the latest available kernel from the package repository. I don't see how I could update my kernel to work with Go 1.14 apart from rebuilding the entire kernel from source. Maybe I'm missing something?

Hm yes, I have just reproduced on focal too. The fix is present in the git for the Ubuntu eoan kernel: https://kernel.ubuntu.com/git/ubuntu/ubuntu-eoan.git/commit/?id=59e7e6398a9d6d91cd01bc364f9491dc1bf2a426 and that commit is in the ancestry of 5.3.0-40.32, so the fix should be in the kernel you are using. In other words, I think we need to get the kernel team involved -- I'll try to do that.

@karalabe - I've just realised my mistake: I thought I was using the latest Ubuntu; I am in fact using eoan.

@mwhudson - just one thing to note (although you're probably already aware of this), a superficial glance at the code responsible for this switch:

https://github.com/golang/go/blob/20a838ab94178c55bc4dc23ddc332fce8545a493/src/runtime/os_linux_x86.go#L56-L61

seems to suggest that the Go side is checking for patch release 15 or greater. What does 5.3.0-40.32 report as a patch version? I'm guessing 0?
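For illustration, a standalone sketch of that kind of release-string parsing (not the actual runtime code) shows why a distribution release such as 5.3.0-40-generic reports a patch version of 0:

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRelease extracts major.minor.patch from a uname release string,
// stopping at the first character that is neither a digit nor a dot.
func parseRelease(rel string) (major, minor, patch int) {
	num := rel
	for i, c := range rel {
		if (c < '0' || c > '9') && c != '.' {
			num = rel[:i]
			break
		}
	}
	parts := strings.SplitN(num, ".", 3)
	at := func(i int) int {
		if i >= len(parts) {
			return 0
		}
		n, _ := strconv.Atoi(parts[i])
		return n
	}
	return at(0), at(1), at(2)
}

func main() {
	for _, rel := range []string{"5.3.0-40-generic", "5.3.15", "5.4.0-3-amd64"} {
		maj, min, pat := parseRelease(rel)
		fmt.Printf("%-18s -> major=%d minor=%d patch=%d\n", rel, maj, min, pat)
	}
}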

Re-opening this discussion until we round out the issue here.

A little summary because I had to piece it together myself:

So it seems like Ubuntu's kernel is patched, but the workaround gets enabled anyways.

So it seems like Ubuntu's kernel is patched, but the workaround gets enabled anyways.

Oh right, yes I should actually read the failure shouldn't I? This is the workaround failing rather than the original bug, in a case where the workaround isn't actually needed but there's no good way for Go to know this. I can patch the check out of the Go 1.14 package in Ubuntu but that doesn't help users running e.g. the docker golang:1.14-alpine image. Hrm.

I guess the question is, how many users are using "vulnerable" kernels at this point. There can't be all that many distributions that are compiling an unpatched kernel with gcc 9 by now.

There can't be all that many...

Famous last words? :D

In all seriousness though, I don't think people update kernels frequently, and many systems can't. Perhaps a better question to ask would be what systems/distros use vulnerable kernels by default and just assume that there will be many people stuck on them.

I can patch the check out of the Go 1.14 package in Ubuntu but that doesn't help users running e.g. the docker golang:1.14-alpine image. Hrm.

You would also miss users that build from source. E.g. Ethereum does source builds for our PPAs because there's no recent Go bundle for all the distros on Launchpad.

Is it common for Ubuntu (and other distributions?) to use cherry-picking instead of following Linux kernel patch releases? On https://packages.ubuntu.com/search?keywords=linux-image-generic all kernels have a patch release of zero.

As far as a quick Google search goes, Ubuntu does not release new kernels after they ship the distro; they just cherry-pick security fixes (i.e. no patch version bump). The exception to this is LTS versions (supported for 5 years), which may get kernel updates every few years to support new hardware (but that's also super rare). Don't know about other distros.

With my limited knowledge right now, this seems weird. Patch releases are intended to distribute patches in a controlled way so people (and in this case the Go runtime) can know which patches are included and which are not. However, this is the way it is, so we have to live with it.

Open questions:

  • How many people are affected by this issue?
  • Is ulimit a viable workaround?
  • Is Ubuntu the only distribution that "disregards" Linux kernel patch numbers?
  • Depending on the answers to the questions above: Would it be reasonable to add a special detection for Ubuntu?

@neelance

Is it common for Ubuntu (and other distributions?) to use cherry-picking instead of following Linux kernel patch releases?

A lot of distributions do it, not just Ubuntu. Debian does it, Red Hat Enterprise Linux does it, and I expect that SUSE does it for their enterprise distributions as well. Cherry-picking is the only way to get any bug fixes at all if you cannot aggressively follow upstream stable releases (and switch stable releases as upstream support goes away). Fedora is an exception; it rebases to the latest stable upstream kernel after a bit.

There's also the matter of proprietary kernels used by container engines. We can't even look at sources for them, and some of them have lied about kernel version numbers in the past. I expect they also use cherry-picking.

Generally, version checks for kernel features (or bugs) are a really bad idea. It's worse for Go due to the static linking, so it's impossible to swap out the run-time underneath an application to fix its kernel expectations.

Is it common for Ubuntu (and other distributions?) to use cherry-picking instead of following Linux kernel patch releases? On https://packages.ubuntu.com/search?keywords=linux-image-generic all kernels have a patch release of zero.

The base kernel version string does not change, that's true. But it doesn't mean that upstream stable releases are not merged, the ABI number is bumped instead when there are code changes.

Note that you picked the meta-package which doesn't show the proper changelog, here you can see that the latest version 5.4.0-14.17 has merged 5.4.18 stable release:
http://changelogs.ubuntu.com/changelogs/pool/main/l/linux-5.4/linux-5.4_5.4.0-14.17/changelog

It sounds like proper automatic detection across all distributions is nearly impossible. I see three options:

  • Do nothing.
  • Make the workaround opt-in.
  • Make the workaround opt-out.

Or disable async preemption by default on 5.3.x & 5.4.x, and let users enable it at runtime.

https://github.com/golang/go/issues/37436#issuecomment-591237929 states that disabling async preemption is not a proper solution, is it?

Not strictly speaking, but there weren't reports of the problem before async preemption landed.

Actually, could the runtime fork a child process on startup (for 5.3.x & 5.4.x) that triggers the bug and enable the workaround if it does? IIRC there is a reliable reproducer, see https://github.com/golang/go/issues/35326#issuecomment-558690446

Disabling async preemption is a distraction. Programs running on faulty kernels are still broken. It's just that the brokenness shows up as weird memory corruption rather than as an error about running into an mlock limit that points to kernel versions. While obviously we want to fix the problem entirely, I think that given the choice of a clear error or random memory corruption we should always pick the clear error.

I agree that kernel version detection is terrible, it's just that we don't know of any other option. If anybody has any suggestions in that regard, that would be very helpful.

One thing that we could do is add a GODEBUG setting to disable mlocking the signal stack. That would give people a workaround that is focused on the actual problem. We can mention that setting in the error message. I'm afraid that it will lead to people to turn on the setting whether they have a patched kernel or not. But at least it will give people who really do have a patched kernel a way to work around this problem. CC @aclements

Actually, could the runtime fork a child process on startup (for 5.3.x & 5.4.x) that triggers the bug and enable the workaround if it does? IIRC there is a reliable reproducer, see #35326 (comment)

It's an interesting idea but I think that in this case the test is much too expensive to run at startup for every single Go program.

I may have missed something (this thread got long fast!), but what's the downside or difficulty of just raising the mlock limit? There's little reason to not just set it to unlimited, but even if you don't want to do that, you only need 4 KiB per thread, so a mere 64 MiB is literally more than the runtime of a single process is capable of mlocking. AFAIK, most distros leave it unlimited by default. The only notable exception I'm aware of is Docker, which sets it to (I think) 64 KiB by default, but this can be raised by passing --ulimit memlock=67108864 to Docker.

It seems like we already have a fairly simple workaround in place. Is there something preventing people from doing this?

The point of the issue is that you shouldn't have to apply a manual workaround, if at all possible. It looks like a regression in 1.14.

It can't be fixed on the Docker library side: https://github.com/docker-library/golang/issues/320

The issue with the ulimit workaround is that it's a time bomb. Currently no automated process needs the ulimit raised with Go 1.13. Anyone for whom Go 1.14 fails immediately after the update will notice and fix it.

The more interesting scenario is what happens if someone uses an old version of a kernel that isn't affected, and at some point switches over the kernel to a new one that is. All of a sudden things will start breaking, but since Go is app layer and the kernel is OS layer, it will take time to make the connection and figure out a fix. The question is what will be the cost.


What's not immediately clear though is whether just the Go compiler has issues, or all Go-built apps too. If just the compiler, that's a lucky case and the fallout can be contained. However, if all Go apps built with 1.14 have a tendency to panic, this could really hurt pre-built binary portability, because all of a sudden my code could fail to work on a different system (one that's actually completely valid, just uses a different kernel versioning scheme).

What's not immediately clear though is whether just the Go compiler has issues, or all Go-built apps too.

The Linux kernel bug is not even Go-specific. If I understand correctly, it affects any program — even one written in C! — that uses the XMM or YMM registers and may receive signals. Go programs under 1.14 happen to be _more_ severely affected than many other programs because they use signals internally for goroutine preemption, and that is why the Go 1.14 runtime includes the mlock workaround.

The Linux kernel bug is not even Go-specific.

Yeah, but my kernel is patched, but Go still panics on it :)

@aclements

The only notable exception I'm aware of is Docker, which sets it to (I think) 64 KiB by default, but this can be raised by passing --ulimit memlock=67108864 to Docker.

It seems like we already have a fairly simple workaround in place. Is there something preventing people from doing this?

Unfortunately, yes. In our case here, we can't tell our customers to reconfigure their docker containers. There are too many and they are sensitive to environment changes in their setups; indeed, that's why we chose to deliver the application in a docker container, so the façade of an isolation tool would ease the worry about changing the configuration options. Changing the container contents is fine, but how we invoke the container may not be as fine.

@aclements

There's little reason to not just set it to unlimited, but even if you don't want to do that, you only need 4 KiB per thread, so a mere 64 MiB is literally more than the runtime of a single process is capable of mlocking. AFAIK, most distros leave it unlimited by default.

That does not seem to be the case - this is an Ubuntu LTS with just 64 KiB of locking space:

~$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 3795
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 3795
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.6 LTS
Release:    16.04
Codename:   xenial
~$ uname -a
Linux bastion0 4.4.0-174-generic #204-Ubuntu SMP Wed Jan 29 06:41:01 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

What are the contents of /proc/version for an example kernel that is patched to work correctly but for which Go is currently producing the mlock error? Thanks.

@ucirello Are you seeing any problem running Go programs on the system that you describe?

$ cat /proc/version
Linux version 5.3.0-40-generic (buildd@lcy01-amd64-026) (gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)) #32-Ubuntu SMP Fri Jan 31 20:24:34 UTC 2020

Perhaps we could use the date recorded in /proc/version as an additional signal. It should probably be release specific, which is a pain. But the whole thing is painful.

The discussion seems to be about either accepting more false positives or false negatives. Here's a summary:

False positive: The workaround gets enabled on a patched kernel.

  • Reproducible. Instructions can be shown.
  • Looks like a regression.
  • Hard to fix in certain environments.
  • Go binary may run in some environments but fails to run in others.

False negative: The workaround is not enabled on an unpatched kernel.

  • Failure only happens rarely, especially if async preemption is disabled.
  • Possibly severe consequences due to memory corruption.
  • Hard to debug.

@ianlancetaylor I don't see any problems in the system described, but from what I gather this could be a coincidence, namely the error message says I should upgrade past 5.3.15, 5.4.2 or 5.5 and this is a 4.4.x box.

In any case, what I wanted to highlight is that the assumption that most distros deliver ulimit -l with 64M seems to be incorrect. This is a Linux box whose ulimit -l is 64K - and it is a standard installation.

Is there a way to detect the ulimit setting and avoid the crash if the workaround can not be applied? This way we could still improve the situation but not cause new crashes? We could then instead print a warning about potential memory corruption.

Is there a way to detect the ulimit setting and avoid the crash if the workaround can not be applied?

In effect, the current code is already detecting the ulimit setting. It doesn't query it directly, but it does detect when it's run up against it, which is why it's printing that particular error message asking the user to raise the ulimit. If something else goes wrong mlocking, it will actually print a different message. The crash is intentional because we considered that preferable to random memory corruption.
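If you do want to inspect the limit from Go directly, a hypothetical helper could query RLIMIT_MEMLOCK via getrlimit (the constant value 8 below is the Linux resource number; the stdlib syscall package does not export it by name):

package main

import (
	"fmt"
	"syscall"
)

// rlimitMemlock is RLIMIT_MEMLOCK on Linux; not exported by the syscall package.
const rlimitMemlock = 8

func main() {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(rlimitMemlock, &lim); err != nil {
		fmt.Println("getrlimit:", err)
		return
	}
	fmt.Printf("max locked memory: soft=%d hard=%d bytes\n", lim.Cur, lim.Max)
}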

In any case, what I wanted to highlight is that the assumption that most distros deliver ulimit -l with 64M seems to be incorrect. This is a Linux box whose ulimit -l is 64K - and it is a standard installation.

@ucirello, thanks for that data point. I don't have an Ubuntu box handy, but I did the original debugging on an Ubuntu install and could have sworn it didn't have a ulimit set. This is definitely not in a container?

@ianlancetaylor I've created quick and dirty script to check what uname reports: https://gist.github.com/Tasssadar/7424860a2764e3ef42c7dcce7ecfd341

Here's the result on up-to-date (well, -ish) Debian testing:

tassadar@dorea:~/tmp$ go run gouname.go 
real uname
Linux dorea 5.4.0-3-amd64 #1 SMP Debian 5.4.13-1 (2020-01-19) x86_64 GNU/Linux

our uname
sysname Linux
nodename dorea
release 5.4.0-3-amd64  <-- used by go
version #1 SMP Debian 5.4.13-1 (2020-01-19)
machine x86_64
domainname (none)

Since Go is only using the release string, the patch version check basically does not work anywhere but on vanilla kernels - both Debian and RHEL/CentOS (which luckily has a kernel too old to be affected) do it this way: they keep the .0 and specify the real patch version later. Unfortunately, they don't use the same format for version.

EDIT: and to make it even more awkward, Ubuntu does not put the patch number into uname at all, even though they probably have all the fixes incorporated. Perhaps the best course of action is to make this a warning instead of a crash? At this point, most kernels are probably already updated anyway.

The crash is intentional because we considered that preferable to random memory corruption.

@aclements There is a small flaw in this argument: the problem is not the true positives that run into the ulimit, but the false positives that would not have memory corruption without the workaround.

@aclements

This is definitely not in a container?

I am 100% sure it is not a container. It is a VM though.

$ sudo virt-what
xen
xen-hvm

@ucirello, thanks for that data point. I don't have an Ubuntu box handy, but I did the original debugging on an Ubuntu install and could have sworn it didn't have a ulimit set. This is definitely not in a container?

As a different data point, my Ubuntu Eoan main system has 64MB set by default for ulimit -l.

@ucirello There is no problem on 4.4.x Linux kernels. The bug first appeared in kernel version 5.2.

@ianlancetaylor -- thanks for the information. I guess I misparsed the error message then. I didn't know of this fact, and the error message made me believe that the new binaries would be incompatible with older kernels.

Go programs will only report the error message on kernels with a version number that indicates that they may have the problem.

Here is something we could do:

  1. Use uname to check the kernel version for a vulnerable kernel, as we do today.
  2. If the kernel is vulnerable according to the version, read /proc/version.
  3. If /proc/version contains the string "2020", assume that the kernel is patched.
  4. If /proc/version contains the string "gcc version 8" assume that the kernel works even if patched (as the bug only occurs when the kernel is compiled with GCC 9 or later).
  5. Otherwise, call mlock on signal stacks as we do today on vulnerable kernels.

The point of this is to reduce the number of times that Go programs run out of mlock space.
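A minimal sketch of steps 2-4 (assuming the release-based check has already flagged the kernel; this is not the actual runtime implementation) could look like:

package main

import (
	"fmt"
	"io/ioutil"
	"strings"
)

// kernelLikelyPatched applies the /proc/version heuristic: a 2020 build date
// suggests the fix was picked up, and a kernel built with GCC 8 is unaffected
// because the bug only appears in kernels compiled with GCC 9 or later.
func kernelLikelyPatched() bool {
	data, err := ioutil.ReadFile("/proc/version")
	if err != nil {
		return false // cannot tell; assume the workaround is needed
	}
	v := string(data)
	return strings.Contains(v, "2020") || strings.Contains(v, "gcc version 8")
}

func main() {
	if kernelLikelyPatched() {
		fmt.Println("skipping mlock workaround")
	} else {
		fmt.Println("mlocking signal stacks (workaround enabled)")
	}
}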

Does anybody know of any unpatched kernels that may have the string "2020" in /proc/version?

For safety we should probably try to identify the times when the kernel was patched for the major distros. Is there anybody who can identify that for any particular distro? Thanks.

I am not sure if this is at all helpful, but Ubuntu apparently does make the standard kernel version available to those that go looking:

$ cat /proc/version_signature
Ubuntu 5.3.0-1013.14-azure 5.3.18

@jrockway Thanks, the problem is not that we don't have the kernel version, it's that Ubuntu is using a kernel version that has the bug, but Ubuntu has applied a patch for the bug, so the kernel actually works, but we don't know how to detect that fact.

Adding to @ianlancetaylor's string matching heuristic, you could also check /proc/version for
5.3.x && x >= 15 || 5.4.x && x >= 2 (not real code, but you get the idea)

Thanks, we already check that in the uname results.

I'm referring to https://github.com/golang/go/issues/37436#issuecomment-591503305:

release 5.4.0-3-amd64  <-- used by go
version #1 SMP Debian 5.4.13-1 (2020-01-19) <-- has actual version

EDIT: On Ubuntu, you can check /proc/version_signature as I suggested.

Ah, sorry, missed that. So according to that comment, on some systems the release field and the version field returned by uname differ as to the kernel version that they report. We currently check the release field but not the version field.

The /proc/version file has a different set of information again.

My Ubuntu system has /proc/version but does not have /proc/version_signature.

@bbarenblat helpfully collected uname and /proc/version from several versions of several distros to see what they look like in the wild (I added a few of my own):

Debian unstable:

$ uname -v
#1 SMP Debian 5.3.9-3 (2019-11-19)
$ uname -r
5.3.0-2-amd64
$ cat /proc/version
Linux version 5.3.0-2-amd64 ([email protected]) (gcc version 9.2.1 20191109 (Debian 9.2.1-19)) #1 SMP Debian 5.3.9-3 (2019-11-19)

Debian 10.3 (current stable):

$ uname -v
#1 SMP Debian 4.19.98-1 (2020-01-26)
$ uname -r
4.19.0-8-amd64
# cat /proc/version
Linux version 4.19.0-8-amd64 ([email protected]) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.98-1 (2020-01-26)

Debian 7 (very old):

$ uname -v
#1 SMP Debian 3.2.78-1
$ uname -r
3.2.0-4-amd64
$ cat /proc/version
Linux version 3.2.0-4-amd64 ([email protected]) (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.78-1

Debian “somewhere past stable, with the occasional unstable package”:

$ uname -v
#1 SMP Debian 4.19.67-2+deb10u1 (2019-09-20)
$ uname -r
4.19.0-6-amd64
$ cat /proc/version
Linux version 4.19.0-6-amd64 ([email protected]) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.67-2+deb10u1 (2019-09-20)

Ubuntu 19.10:

$ uname -v
#32-Ubuntu SMP Fri Jan 31 20:24:34 UTC 2020
$ uname -r
5.3.0-40-generic
$ cat /proc/version  
Linux version 5.3.0-40-generic (buildd@lcy01-amd64-026) (gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)) #32-Ubuntu SMP Fri Jan 31 20:24:34 UTC 2020

Ubuntu 19.10 with GCP kernel (this really is 5.3.0):

$ uname -v
#9-Ubuntu SMP Mon Nov 11 09:52:23 UTC 2019
$ uname -r
5.3.0-1008-gcp
$ cat /proc/version
Linux version 5.3.0-1008-gcp (buildd@lgw01-amd64-038) (gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)) #9-Ubuntu SMP Mon Nov 11 09:52:23 UTC 2019

Ubuntu 19.10 with hand-built kernel:

$ uname -v
#36 SMP Wed Feb 26 20:55:52 UTC 2020
$ uname -r
5.3.10
$ cat /proc/version
Linux version 5.3.10 (austin@austin-dev-ubuntu) (gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)) #36 SMP Wed Feb 26 20:55:52 UTC 2020

CentOS 7.7:

$ uname -v
#1 SMP Fri Dec 6 15:49:49 UTC 2019
$ uname -r
3.10.0-1062.9.1.el7.x86_64
$ cat /proc/version
Linux version 3.10.0-1062.9.1.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Fri Dec 6 15:49:49 UTC 2019

@aclements I am a bit confused about which of the kernels you posted need the workaround and which are already patched. Maybe you could annotate your post.

The problem is that if you are inside the potentially bad version ranges you _might be impacted_; if you are outside of the version range you are safe.
When the initial corruption was found we immediately pulled the patch, tested it, and rolled it out. So our kernel was 100% within the bad range but patched. Being outside the range means you are safe, but inside the range you can't prove that the bug exists without testing for it. (In other words: false positives will happen by design)

I am right now building another patched ubuntu kernel with upstream version numbers

  • make kernelversion
  • patch debian{,.master}/{changelog,control} with this version

Software that relies on the kernel version is the reason why distributions like to keep it stable: if an update to the version number might break software that worked before then it's better not to change the version (but still apply the patches!).

I am not happy to roll a new kernel just because of golang 1.14, but I think it will work.
I would discourage everyone from shipping golang 1.14 binaries outside of controlled environments.

@rtreffer's post again sounds like false positives are unavoidable. Currently I still vote for the solution of limiting the negative impact of a false positive.

I think it should be opt-in to avoid all false positives. For example, add a GOWORKAROUNDS env var that acts as a boolean, or takes a list of workarounds, or enables the heuristic to try to find them.

This would be the least intrusive solution IMO.

@fcuello-fudo the problem is that if the workaround is not enabled on a bad kernel, the symptoms are very obscure.

How about re-using the "tainted" concept from the Linux kernel? The Go runtime would keep detecting bad kernels and apply the mlock workaround, but would mark itself tainted if mlock fails (and not crash). Then, make sure to add a note to any panic and throw messages if the taint flag is set.

The upside is that false positive crashes are avoided, while still providing a clear indication in case a bad kernel causes a crash.

The downside is that a bad kernel may silently corrupt memory, not causing a visible crash.

@fcuello-fudo the problem is that if the workaround is not enabled on a bad kernel, the symptoms are very obscure.

And leave it enabled by default? It would still be useful to be able to disable workarounds if the user wants.

@howardjohn (primary dev of Google's Istio) on https://github.com/istio/istio/issues/21672:

My kernel on my personal machine is not patched. But it doesn't really matter what my machine has, we ship Istio to 1000s of users. I cannot control what machine they run it on, and I would love to not have to restrict that to some subset of kernel versions

Prometheus project is also holding back its go1.14 upgrade: https://github.com/prometheus/golang-builder/pull/85#issuecomment-592082645

I encountered this issue when running go applications on a self-hosted Kubernetes cluster. I was able to allow the _mlock_ workaround to take effect by increasing the relevant ulimit. However, as the process for changing ulimits for Docker containers running in Kubernetes isn't exactly easy to find, it might help someone else to put the details here.

  1. Update /etc/security/limits.conf to include something like
* - memlock unlimited
  2. Update /etc/docker/daemon.json to include
"default-ulimits": { "memlock": { "name": "memlock", "hard": -1, "soft": -1 } }
  3. Restart docker/kubernetes, bring your pods back up.

  4. Enter a running container and verify that the ulimit has been increased:

$ kubectl exec -it pod-name -- /bin/sh
/ # ulimit -l
unlimited

You may be able to get away with using something more subtle than the _unlimited_ hammer; for me, a limit of 128KiB (131072) seemed to work.

also running into this issue trying to build a docker image for https://github.com/RTradeLtd/Temporal on go 1.14

And the pile keeps on piling. See, this is why I got angry at the beginning of this thread (which was a bad mistake on my part, I agree). Even though I made a lot of effort to explain that this is a blocker and to provide a repro, I was shut down so as not to interfere with the release. Even after it was clear that it's not a docker issue.

Now we're in a much worse space since various projects are blacklisting Go 1.14. This bug is currently slated to be fixed in Go 1.15 only. Based on the above linked issues, are we confident that it's a good idea to postpone this by 8 months? I think it would be nice to acknowledge the messup and try to fix it in a patch release, not wait for more projects to be bitten.

Yes, I'm aware that I'm just nagging people here instead of fixing it myself. I'm sorry I can't contribute more meaningfully, I just don't want to fragment the ecosystem. Go modules were already a blow to many projects, let's not double down with yet another quirk that tools need to become aware of.

Ian Lance-Taylor has said that the fix will be backported once there is one: https://groups.google.com/d/msg/golang-dev/_FbRwBmfHOg/mmtMSjO1AQAJ

@lmb Oh that's very nice to hear, thank you for the link.

FYI: Ubuntu 20.04 will likely ship 5.4.0, meaning that from the end of next month the most recent Ubuntu LTS and golang won't work together out of the box.


@karalabe in as strong terms as I can muster, you were not "shut down to not interfere with the release". Josh, who originally closed the issue, is not on the Go team at Google (i.e. not a decision maker) nor am I. We initially assumed that the Docker project could (and should) mitigate the problem in their build. When it became clear they couldn't, I promptly raised this on golang-dev.

Furthermore, I was the first to note that the problem stems from your host kernel, not the Docker module. You didn't mention you're on Ubuntu until I pointed that out.

I think you owe us yet another apology, after that note.

EDIT: I also asked you to remove the goroutine stack traces (starting goroutine x [runnable]) from your report, as they make the page difficult to read/navigate. [Update: Russ has edited out the stacks.]

Everyone please keep in mind that successful communication is hard and a skill one needs to practice. Emotions can work against us and hinder our goal of successful communication, I've been there myself. Yes, there has been a violation of the code of conduct and pointing it out is good. A voluntary apology is also helpful. Now let's try to make sure that every post has a positive net impact on collaboration and solving this issue.

@rtreffer Do you know if the kernel version planned for Ubuntu 20.04 will include the patch? It looks like it was cherry picked into Ubuntu's 5.3.0 kernel (https://kernel.ubuntu.com/git/ubuntu/ubuntu-eoan.git/commit/?id=59e7e6398a9d6d91cd01bc364f9491dc1bf2a426). I would guess that it is also in their 5.4.0 kernel, but it would be nice to be certain. Thanks.

@rtreffer Do you know if the kernel version planned for Ubuntu 20.04 will include the patch? It looks like it was cherry picked into Ubuntu's 5.3.0 kernel (https://kernel.ubuntu.com/git/ubuntu/ubuntu-eoan.git/commit/?id=59e7e6398a9d6d91cd01bc364f9491dc1bf2a426). I would guess that it is also in their 5.4.0 kernel, but it would be nice to be certain. Thanks.

It will be included in the release kernel for 20.04, yes:

(master)mwhudson@anduril:~/src/upstream/linux-2.6$ git log -1 --oneline ad9325e9870914b18b14857c154a0fee0bd77287
ad9325e98709 x86/fpu: Don't cache access to fpu_fpregs_owner_ctx
(master)mwhudson@anduril:~/src/upstream/linux-2.6$ git tag --contains ad9325e9870914b18b14857c154a0fee0bd77287
Ubuntu-5.4-5.4.0-10.13
Ubuntu-5.4-5.4.0-11.14
Ubuntu-5.4-5.4.0-12.15
Ubuntu-5.4-5.4.0-13.16
Ubuntu-5.4-5.4.0-14.17
Ubuntu-5.4.0-15.18
Ubuntu-5.4.0-16.19
Ubuntu-5.4.0-17.20
Ubuntu-5.4.0-17.21
Ubuntu-5.4.0-8.11
Ubuntu-5.4.0-9.12
Ubuntu-raspi2-5.4-5.4.0-1002.2
Ubuntu-raspi2-5.4.0-1003.3
Ubuntu-raspi2-5.4.0-1004.4

Based on discussion with @aclements, @dr2chase, @randall77, and others, our plan for the 1.14.1 release is:

  • write a wiki page describing the problem
  • continue to use mlock on a kernel version that may be buggy
  • if mlock fails, silently note that fact and continue executing
  • if we see an unexpected SIGSEGV or SIGBUS, and mlock failed, then in the crash stack trace point people at the wiki page

The hope is that this will provide a good combination of executing correctly in the normal case while directing people on potentially buggy kernels to information to help them decide whether the problem is their kernel, their program, or a bug in Go itself.

This can also be combined with better attempts to identify whether a particular kernel has been patched, based on the uname version field (we currently only check the release field).
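For illustration, a rough user-level sketch of that record-now, report-on-crash pattern (purely hypothetical, not the actual runtime change) might look like:

package main

import (
	"fmt"
	"syscall"
)

// mlockFailed records whether the workaround could not be applied.
// It is only surfaced if the program later crashes.
var mlockFailed bool

func tryMlockSignalStack(stack []byte) {
	if err := syscall.Mlock(stack); err != nil {
		mlockFailed = true // note it silently and keep executing
	}
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("fatal error:", r)
			if mlockFailed {
				fmt.Println("note: mlock of signal stack failed; your kernel may be buggy,")
				fmt.Println("see https://golang.org/wiki/LinuxKernelSignalVectorBug")
			}
		}
	}()
	tryMlockSignalStack(make([]byte, 4096))
	panic("simulated crash")
}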

One other thing we discussed: if an mlock failed, and we're about to send ourselves a signal, touch the signal stack first.

@ianlancetaylor that sounds like a great idea!

If you haven't considered it yet: link to that wiki page from the error message (potentially with one more indirection)

E.g. iPXE does that via links like http://ipxe.org/1d0c6539 (their output is optimized for boot ROMs, limited real estate and alike)

  • write a wiki page describing the problem

  • continue to use mlock on a kernel version that may be buggy

  • if mlock fails, silently note that fact and continue executing

Wouldn't it be better to also disable async preemption? It's true that the problem can happen even with asyncpreemptoff=1, but the bug was rare enough that without it nobody noticed for months.

Is there a way to test (outside of the Go runtime) to see if your kernel is patched? We maintain our own kernels outside of the distro, which makes this even worse.

@gopherbot please backport to go1.14. It is a serious problem with no workaround.

Backport issue(s) opened: #37807 (for 1.14).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@nemith there is a C reproducer here: https://github.com/golang/go/issues/35326#issuecomment-558690446

@randall77 @aarzilli Upon consideration, I actually don't think it is a great idea to add additional partial mitigations like touching the signal stack page or disabling asynchronous preemption. It's a kernel level bug that can affect any program that receives a signal. People running with a buggy kernel should upgrade the kernel one way or another. Using mlock is a reliable mitigation that should always work, and as such it's reasonable to try it. Touching the signal stack before sending a preemption signal, or disabling signal preemption entirely, is not a reliable mitigation. I think that we should not fall back to an unreliable mitigation; we should tell the user to upgrade the kernel.

People who can't upgrade the kernel have the option of running with GODEBUG=asyncpreemptoff=1, which will be just as effective a partial mitigation as the other two.

I think that we should not fall back to an unreliable mitigation; we should tell the user to upgrade the kernel.

I would agree with you if the bug always manifested itself in a crash. But it doesn't - it just randomly corrupts memory. Maybe it crashes, but maybe it just randomly corrupts the program's data and keeps on running. I think it is incumbent upon us to take whatever measures we can to prevent data corruption, even if that means we won't message about the required kernel upgrade as often.

It is a now-vs-later decision. We can try to prevent data corruption now by touching the signal pages before use. Or we can try to prevent corruption later by messaging the user in the event of a crash. We can't choose both.

I wish there was a way we could message when we detect an mlock failure, without crashing, and then mitigate by touching pages. I don't think the Go program can do that on stdout/stderr, unfortunately. Maybe we could throw something in the syslog? It would only help those who glance at the syslog, which is probably not many.

Fair point.

How about posting a message on installation? That entails a change to install procedures for your binary releases and the distros', but that would let you run the reproducer to definitively test the bug.

@networkimprov That is a good idea, but since people send compiled programs across machines that may have different kernel versions, I think we also need the approach described above.

Change https://golang.org/cl/223121 mentions this issue: runtime: don't crash on mlock failure

I've sent out https://golang.org/cl/223121. It would be helpful if people having trouble with 1.14 could see if that change fixes their problem. Thanks.

Well, @randall77 / @ianlancetaylor, I tend to disagree that this is a golang issue at all. Golang discovered the memory corruption issue, but it is a very severe kernel bug.

As such it should escalate through your kernel paths.
Distributions picked up the patch and shipped it. It was backported. Every new installation will get a non-affected kernel.
If you roll your own kernel you have to do that work yourself. As usual.

Be helpful for users that hit it, and be as helpful as possible.
But I don't think it is golang's responsibility to fix a kernel bug or even force users to apply the patch.

@rtreffer That's what we're trying to do: be as helpful as possible.

On buggy kernels, Go programs built with Go 1.14 behaved unpredictably and badly. We don't want to do that even on a buggy kernel. If a program would just fail quickly and cleanly, that would be one thing. But what we saw was memory corruption leading to obscure errors. See #35326, among others.

Do you think we should take some action different than what we are doing now?

@rtreffer Well, sorta. We have some production 5.2 kernels that are not affected as they aren't compiled with gcc9, and we could also easily patch the fix into our kernel line without affecting anything else and be fine. The kernel bug doesn't exist in our environment, and upgrading major versions takes a lot more testing and careful rollout across the fleet, so just "upgrade your kernel" isn't a good answer.

On the flip side the workaround based on kernel version numbers caused us to move to mlocks which DID fail due to ulimit issues. That isn't a kernel bug.

That being said I am not sure there is a better solution here and the Go team probably made the right call.

@ianlancetaylor maybe you could ship the C reproducer in the source & binary releases, and reference that on the wiki page as a way to vet any kernel.

How about posting a message on installation? That entails a change to install procedures for your binary releases and the distros', but that would let you run the reproducer to definitively test the bug.

Any distro that's paying enough attention to do this will just patch their kernel instead, surely.

@ianlancetaylor I totally agree with the way forward, the patch and wiki page look great.

I wanted to emphasize that the corruption is not golang's fault or bug to begin with, and distributions are shipping fixed kernels. It should already be fading away.

As a result I don't think anything more than the suggested hints (wiki + panic) is needed.

@rtreffer Great, thanks.

Change https://golang.org/cl/223417 mentions this issue: [release-branch.go1.14] runtime: don't crash on mlock failure

Just to clarify, based on what the wiki says, if mlock fails on 1.14.1, does that mean the program is vulnerable to memory corruption?

Thanks

@smasher164 Not necessarily. We no longer print the "mlock failed" message when the mlock call fails. Instead, we just save the fact that it failed. If your program crashes, we print the fact that the mlock failed in the error text. Which means "we think your kernel might be buggy. We tried the mlock workaround, and it failed. We ran your program anyway, and it ended up crashing." Maybe it was due to the kernel bug, maybe it's just a bug in your program.

@randall77 Thanks for responding. So is it safe to say that if mlock failed and the program does not crash when touching the stack before sending a preemption signal, that async preemption related memory corruption does not exist in the program?

Unfortunately not. If mlock fails and you have a buggy kernel, then memory corruption might be occurring. Just because the program isn't crashing doesn't mean there wasn't corruption somewhere. Crashing is a side-effect of memory corruption - just the mlock failing will not cause a crash. (We used to do that in 1.14. That's one of the things we changed for 1.14.1.)
Even if you turn async preemption off, memory corruption might still be occurring. Just at a lower rate, as your program is probably still getting other signals (timers, etc.).

@smasher164 Let us know if you find the wiki page https://golang.org/wiki/LinuxKernelSignalVectorBug to be unclear. Thanks.

How can I opt out of this?

I am within an unprivileged lxc container created by lxd, so it has the same kernel as the host but is unable to set system-wide limits:

# container
$ uname -a
Linux runner-33-project-447-concurrent-0-job-25150 5.4.0-31-generic #35-Ubuntu SMP Thu May 7 20:20:34 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ /proc/version_signature
/proc/version_signature: Permission denied
$ ulimit -l 123456
ulimit: max locked memory: cannot modify limit: Operation not permitted
# host
$ uname -a
Linux gitlab-ci-runner-lxd-2 5.4.0-31-generic #35-Ubuntu SMP Thu May 7 20:20:34 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ cat /proc/version_signature 
Ubuntu 5.4.0-31.35-generic 5.4.34

According to https://github.com/golang/go/issues/37436#issuecomment-595836976 the host (focal) contains the kernel patch.

Go was built a few days ago using go get golang.org/dl/go1.14 and go1.14 download

$ go version
go version go1.14 linux/amd64

Again: How can I opt out of this?

I maybe can't/don't want to change the system-wide limits, as other programs/pipelines are affected and a collective consensus has to be reached for all these possibilities before changing system-wide options.

It is good that we have such detection, but it is broken that we can't opt out easily when we know the kernels have the fix patched in and the detection is prone to false positives.

@dionysius This was fixed in 1.14.1 as described in https://github.com/golang/go/issues/37436#issuecomment-597360484. I believe you need to upgrade to a newer version of go: https://godoc.org/golang.org/dl/go1.14.2.

Gonna try explicitly loading the go version then; I expected go get golang.org/dl/go1.14 to load the latest 1.14. Will report back.

Edit: seems 1.14.3 is the latest 1.14 as of today.

Update: looks good with go get golang.org/dl/go1.14.3. It's unexpected that without the patch version it does not load the latest; good to know (I would never have landed on this issue otherwise).

Just a heads up - Go 1.15 is about to be released, and a beta is already out, but the temporary patch has not yet been removed (there are TODO comments to remove it at Go 1.15).

I think it is important to remove the workaround since Ubuntu 20.04 LTS uses a patched 5.4.0 kernel. This means that any user on Ubuntu 20.04 will still unnecessarily mlock pages, and if he runs in a docker container, that warning will be displayed for every crash, disregarding the fact that his kernel is not really buggy. So those users might be sent on a wild goose chase trying to understand and read all this info, and it will have nothing to do with their bug, probably for the entirety of Ubuntu 20.04 life cycle.

@DanielShaulov thanks. Could you open a new issue for that? This one pertains to the problem in 1.14.

@networkimprov sure: #40184

Change https://golang.org/cl/243658 mentions this issue: runtime: let GODEBUG=mlock=0 disable mlock calls

Change https://golang.org/cl/244059 mentions this issue: runtime: don't mlock on Ubuntu 5.4 systems

When does go1.14.7 release with this modification?

The fix has been in every release since 1.14.1, which shipped months ago.

Change https://golang.org/cl/246200 mentions this issue: runtime: revert signal stack mlocking
