Moby: Docker 1.9.1 hanging at build step "Setting up ca-certificates-java"

Created on 24 Nov 2015  ·  258 Comments  ·  Source: moby/moby

A few of us within the office upgraded to the latest version of docker toolbox backed by Docker 1.9.1 and builds are hanging as per the below build output.

docker version:

```
Client:
Version: 1.9.1
API version: 1.21
Go version: go1.4.3
Git commit: a34a1d5
Built: Fri Nov 20 17:56:04 UTC 2015
OS/Arch: darwin/amd64

Server:
Version: 1.9.1
API version: 1.21
Go version: go1.4.3
Git commit: a34a1d5
Built: Fri Nov 20 17:56:04 UTC 2015
OS/Arch: linux/amd64
```

`docker info`: 

Containers: 10
Images: 57
Server Version: 1.9.1
Storage Driver: aufs
Root Dir: /mnt/sda1/var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 77
Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.1.13-boot2docker
Operating System: Boot2Docker 1.9.1 (TCL 6.4.1); master : cef800b - Fri Nov 20 19:33:59 UTC 2015
CPUs: 1
Total Memory: 1.956 GiB
Name: vbootstrap-vm
ID: LLM6:CASZ:KOD3:646A:XPRK:PIVB:VGJ5:JSDB:ZKAN:OUC4:E2AK:FFTC
Debug mode (server): true
File Descriptors: 13
Goroutines: 18
System Time: 2015-11-24T02:03:35.597772191Z
EventsListeners: 0
Init SHA1:
Init Path: /usr/local/bin/docker
Docker Root Dir: /mnt/sda1/var/lib/docker
Labels:
provider=virtualbox


`uname -a`: 

Darwin JRedl-MB-Pro.local 15.0.0 Darwin Kernel Version 15.0.0: Sat Sep 19 15:53:46 PDT 2015; root:xnu-3247.10.11~1/RELEASE_X86_64 x86_64


Here is a snippet from the docker build output, which hangs on the "Setting up ca-certificates-java" line. Could this be something to do with the latest version of Docker and OpenJDK?

``` bash
update-alternatives: using /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/tnameserv to provide /usr/bin/tnameserv (tnameserv) in auto mode
update-alternatives: using /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/jexec to provide /usr/bin/jexec (jexec) in auto mode
Setting up ca-certificates-java (20140324) ...
```

Dockerfile example:

FROM gcr.io/google_appengine/base

# Prepare the image.
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y -qq --no-install-recommends build-essential wget curl unzip python python-dev php5-mysql php5-cli php5-cgi openjdk-7-jre-headless openssh-client python-openssl && apt-get clean

I can confirm that this is not an issue with Docker 1.9.0 or Docker Toolbox 1.9.0d. Let me know if I can provide any further information but this feels like a regression of some sort within the new version.

area/kernel kind/bug

Most helpful comment

Debian supported this issue.

LATEST QUICK WORKAROUNDS

| Distro | Workaround |
| --- | --- |
| General | Use devicemapper/overlay/btrfs (but it may cause another problem). If you can upgrade AUFS and build the kernel manually, you can also use AUFS v20160111 or later. |
| Boot2Docker | :white_check_mark: Upgrade to v1.10.0 or later |
| Ubuntu 14.04LTS | :white_check_mark: Upgrade kernel to 3.13.0-79.123 or later |
| Ubuntu 15.04 | :white_check_mark: Upgrade kernel to 3.19.0-51.57 or later |
| Ubuntu 15.10 | :white_check_mark: Upgrade kernel to 4.2.0-30.35 or later |
| Debian 7 | :white_check_mark: Upgrade kernel to 3.2.73-2+deb7u3 (of linux-image-3.2.0-4-amd64 package) or later |
| Debian 8 | :white_check_mark: Upgrade kernel to 3.16.7-ckt20-1+deb8u4 (of linux-image-3.16.0-4-amd64 package) or later |
| Debian 9 | :white_check_mark: (does not support AUFS since kernel 3.18-1~exp1) |
| Gentoo | :white_check_mark: Upgrade to recent ones (:warning: not tested) |
| RHEL/CentOS | :white_check_mark: (does not support AUFS) |
| openSUSE | :white_check_mark: (does not support AUFS) |

Distributors Issue Tickets

| Distro | Status | Issue URL |
| --- | --- | --- |
| Boot2Docker | :white_check_mark: Closed | https://github.com/boot2docker/boot2docker/pull/1113 |
| Ubuntu | :white_check_mark: Closed | https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043 |
| Debian | :white_check_mark: Closed | https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=812207 |

All 258 comments

I am facing same problem. I am investigating.

We're facing the same problem as well.

Yep, it is a problem in Docker 1.9. I downgraded to 1.8.3 and all problems were solved. Now I am investigating a workaround and will post it here. Thanks!

I'm having the same issue with docker 1.9.1a

I have docker 1.8.3, so maybe the process of installing a different version of docker remedies the situation. @bsao.

having this same issue with docker version 1.9.1, build a34a1d5

Are you only seeing this on boot2docker?

I cannot repro on a stock Ubuntu with aufs or on my machine. Let me try with boot2docker to see if I can repro there.

+1 in Docker 1.9.1 for ubuntu:14.10 using OSX

This is an issue that started appearing after I turned on VPN for work. Even after I turned off VPN and restarted the docker machine on OSX it continued to have this problem. I re-installed Docker 1.9.1 and then 1.8.3, still seeing the issue. Blocks me from using most if not all of my dockers on the Mac.

+1 in Docker 1.9.1 for ubuntu 12.04 using OS X 10.11

Developers in my office came across this by accident too.

This version/build worked: Docker version 1.9.0, build 76d6bc9

This version/build hung: Docker version 1.9.1, build a34a1d5

@crosbymichael I unfortunately have not tried it on any other environment than Boot2Docker.

Someone with the know-how of git-bisecting and docker could use the build IDs provided by @chico1198!

I experienced the same problem with 1.9.1 on OSX El Capitan, downgrading to 1.9.0 didn't help.

Same issue here on OSX 10.9.3 with:
Docker version 1.9.1, build a34a1d5
Docker version 1.9.0, build 76d6bc9

@crosbymichael I logged in boot2docker and ran ps auxf, this is what I saw:

root      1290  0.4  1.8 1346656 75692 ?       Sl   Nov27   4:53 /usr/local/bin/docker daemon -D -g /var/lib/docker -H unix:// -H tcp://0.0.0.0:2376 [...]
root      8556  0.0  0.0      0     0 ?        Ss   05:12   0:00  \_ [sh]
root     24221 99.8  0.0      0     0 ?        Zl   05:33  64:17  |   \_ [java] <defunct>
root     24657  0.0  0.0      0     0 ?        Ss   06:07   0:00  \_ [sh]
root      6174 79.6  0.0      0     0 ?        Zl   06:22  12:33      \_ [java] <defunct>
root      7295 49.3  0.0      0     0 ?        Zl   06:32   2:49      \_ [java] <defunct>
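
For anyone who wants to check for this symptom without eyeballing the process tree, here is a minimal Python sketch that scans `ps aux`-style text for `<defunct>` entries and reports their PID and CPU share. The names `find_defunct` and `SAMPLE` are illustrative only; the sample lines are abridged from the listing above, and the parsing assumes the usual `ps aux` column order.

```python
# Sample lines abridged from the `ps auxf` listing in this thread.
SAMPLE = r"""
root      1290  0.4  1.8 1346656 75692 ?       Sl   Nov27   4:53 /usr/local/bin/docker daemon -D
root     24221 99.8  0.0      0     0 ?        Zl   05:33  64:17  |   \_ [java] <defunct>
root      6174 79.6  0.0      0     0 ?        Zl   06:22  12:33      \_ [java] <defunct>
root      7295 49.3  0.0      0     0 ?        Zl   06:32   2:49      \_ [java] <defunct>
"""


def find_defunct(ps_output):
    """Return pid, %CPU and state for every line that ps marks <defunct>."""
    hits = []
    for line in ps_output.splitlines():
        if "<defunct>" not in line:
            continue
        fields = line.split()
        # Assumed column order: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        hits.append({
            "pid": int(fields[1]),
            "cpu": float(fields[2]),
            "stat": fields[7],
        })
    return hits


if __name__ == "__main__":
    for hit in find_defunct(SAMPLE):
        print(hit)
```

A zombie normally uses no CPU at all, so a `<defunct>` entry with a high `%CPU` value is the unusual part worth reporting.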

+1 with docker 1.9.1 on OSX 10.11 with attempting to build image from ubuntu 14.04

+1
use DockerToolbox-1.9.1a.pkg

docker version
Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   a34a1d5
 Built:        Fri Nov 20 17:56:04 UTC 2015
 OS/Arch:      darwin/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   a34a1d5
 Built:        Fri Nov 20 17:56:04 UTC 2015
 OS/Arch:      linux/amd64

Downgrading to Docker 1.8.3 is my temporary workaround. Here's the target I use in my Makefile.

downgrade-docker:
  docker-machine ssh $(DOCKER_MACHINE_NAME) sudo /etc/init.d/docker stop
  docker-machine ssh $(DOCKER_MACHINE_NAME) "while sudo /etc/init.d/docker status ; do sleep 1; done"
  docker-machine ssh $(DOCKER_MACHINE_NAME) "sudo curl 'https://get.docker.com/builds/Linux/x86_64/docker-1.8.3' -o /usr/local/bin/docker-1.8.3"
  docker-machine ssh $(DOCKER_MACHINE_NAME) "sudo ln -sf /usr/local/bin/docker-1.8.3 /usr/local/bin/docker"
  # FIXME: Starting machine is not enough; always fails with message like "Need TLS certs for 127.0.0.1,10.0.2.15,192.168.99.100"
  #docker-machine ssh $(DOCKER_MACHINE_NAME) sudo /etc/init.d/docker start
  docker-machine stop $(DOCKER_MACHINE_NAME) 
  docker-machine start $(DOCKER_MACHINE_NAME) 

I couldn't reproduce this. Does it always hang at "setting up certificates" ? Did you try sending a ^D to close some pipe? Can you also try sending a SIGUSR1 to the daemon and paste the stack trace here when it's stuck?

+1 with docker 1.9.1 on OS X 10.10

I tried downgrading to 1.8.3 using @osterman 's Makefile and also had troubles with the SSH key:

ip-10-100-0-211:docker-dev leaf$ docker-machine start default
(default) OUT | Starting VM...
Too many retries waiting for SSH to be available.  Last error: Maximum number of retries (60) exceeded

Tested it by doing different openjdk installs inside debian:jessie and ubuntu
OSX 10.11.1, boot2docker 1.9.1: hangs
OSX 10.11.1, boot2docker 1.9.0: works
Ubuntu 14.04 with docker 1.9.1: works

The boot2docker vms were created with:
docker-machine create -d virtualbox --virtualbox-boot2docker-url=https://github.com/boot2docker/boot2docker/releases/download/v1.9.0/boot2docker.iso
and
docker-machine create -d virtualbox --virtualbox-boot2docker-url=https://github.com/boot2docker/boot2docker/releases/download/v1.9.1/boot2docker.iso

On Ubuntu 14.04 docker was installed following the documentation on https://docs.docker.com/engine/installation/ubuntulinux/

+1, running docker 1.9.1 build a34a1d5 on OSX Yosemite 10.10.5.

I can't reproduce this.

Same issue here.
Is there any way to downgrade to an earlier version on Windows?

+1, docker 1.9.1 @ El Capitan

+1, Docker 1.9.1 on OS X 10.11.1

+1, Docker 1.9.1a, OS X 10.10.5

+1, Docker 1.9.1 build a34a1d5, Windows 10

+1, Docker 1.9.1 build a34a1d5, OS X 10.11.1, Docker-Machine 0.5.1 build 7e8e38e

+1

Same on Docker-machine on OSX 10.11.1
Docker version 1.9.1, build a34a1d5
docker-machine version 0.5.1 (HEAD)

I'm able to reproduce this on docker-machine, OS X 10.10.5, so this may be something related to boot2docker. docker top also gives me <defunct> for a java process;

docker top dreamy_sammet                                                                  Tue Dec  1 15:58:47 2015
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                2538                1023                0                   14:44               ?                   00:00:00            /bin/sh -c apt-get update && apt-get install -y -qq --no-install-recommends build-essential wget curl unzip python python-dev php5-mysql php5-cli php5-cgi openjdk-7-jre-headless openssh-client python-openssl && apt-get clean
root                2566                2538                1                   14:44               ?                   00:00:16            apt-get install -y -qq --no-install-recommends build-essential wget curl unzip python python-dev php5-mysql php5-cli php5-cgi openjdk-7-jre-headless openssh-client python-openssl
root                4830                2566                0                   14:46               pts/0               00:00:00            /usr/bin/dpkg --status-fd 14 --configure libgdbm3:amd64 libjson-c2:amd64 libbsd0:amd64 libedit2:amd64 libkeyutils1:amd64 libkrb5support0:amd64 libk5crypto3:amd64 libkrb5-3:amd64 libgssapi-krb5-2:amd64 libidn11:amd64 libsasl2-modules-db:amd64 libsasl2-2:amd64 libldap-2.4-2:amd64 libmagic1:amd64 libsqlite3-0:amd64 libwrap0:amd64 libxml2:amd64 perl-modules:all perl:amd64 mime-support:all libexpat1:amd64 libpython2.7-stdlib:amd64 python2.7:amd64 libpython-stdlib:amd64 python:amd64 libasan1:amd64 libasyncns0:amd64 libatomic1:amd64 libavahi-common-data:amd64 libavahi-common3:amd64 libdbus-1-3:amd64 libavahi-client3:amd64 libcilkrts5:amd64 libisl10:amd64 libcloog-isl4:amd64 libcups2:amd64 librtmp1:amd64 libssh2-1:amd64 libcurl3:amd64 libogg0:amd64 libflac8:amd64 libpng12-0:amd64 libfreetype6:amd64 ucf:all fonts-dejavu-core:all fontconfig-config:all libfontconfig1:amd64 libglib2.0-0:amd64 libgomp1:amd64 x11-common:all libice6:amd64 libicu52:amd64 libitm1:amd64 liblcms2-2:amd64 liblsan0:amd64 libmpfr4:amd64 mysql-common:all libmysqlclient18:amd64 libnspr4:amd64 libnss3:amd64 libonig2:amd64 libpcsclite1:amd64 libsm6:amd64 libvorbis0a:amd64 libvorbisenc2:amd64 libsndfile1:amd64 libxau6:amd64 libxdmcp6:amd64 libxcb1:amd64 libx11-data:all libx11-6:amd64 libx11-xcb1:amd64 libxext6:amd64 libxi6:amd64 libxtst6:amd64 libpulse0:amd64 libpython2.7:amd64 libc-dev-bin:amd64 linux-libc-dev:amd64 libc6-dev:amd64 libexpat1-dev:amd64 libpython2.7-dev:amd64 libquadmath0:amd64 libsctp1:amd64 libtsan0:amd64 libubsan0:amd64 tzdata-java:all java-common:all libjpeg62-turbo:amd64 ca-certificates-java:all openjdk-7-jre-headless:amd64 libmpc3:amd64 libpsl0:amd64 wget:amd64 bzip2:amd64 libperl4-corelibs-perl:all lsof:amd64 openssh-client:amd64 patch:amd64 xz-utils:amd64 binutils:amd64 cpp-4.9:amd64 cpp:amd64 libgcc-4.9-dev:amd64 gcc-4.9:amd64 gcc:amd64 
libstdc++-4.9-dev:amd64 g++-4.9:amd64 g++:amd64 make:amd64 libtimedate-perl:all libdpkg-perl:all dpkg-dev:all build-essential:amd64 curl:amd64 libpython-dev:amd64 libqdbm14:amd64 psmisc:amd64 php5-common:amd64 php5-json:amd64 php5-cli:amd64 php5-cgi:amd64 php5-mysql:amd64 python-ply:all python-pycparser:all python-cffi:amd64 python-pkg-resources:all python-six:all python-cryptography:amd64 python2.7-dev:amd64 python-dev:amd64 python-openssl:all unzip:amd64
root                6711                4830                0                   14:46               pts/0               00:00:00            /bin/bash /var/lib/dpkg/info/ca-certificates-java.postinst configure
root                6725                6711                97                  14:46               pts/0               00:12:25            [java] <defunct>

/cc @tianon @nathanleclaire @jeffdm perhaps any of you has an idea where to look, or what to debug, I couldn't really find something

How much RAM does your VM have? Could be OOM given that it looks like the process dies unexpectedly. :disappointed:

Looks like memory is not the problem, however the <defunct> process does consume 100% CPU;

CONTAINER           CPU %               MEM USAGE / LIMIT   MEM %               NET I/O               BLOCK I/O
d263da116bfd        99.51%              689.3 MB / 2.1 GB   32.82%              157.9 MB / 2.754 MB   25.15 MB / 130.4 MB

The container seems to be stuck as well, and I had to reboot the vm to get it killed

+1 Docker version 1.9.1, build a34a1d5, Win 7.

I've run into similar problems that turned out to be OOM, even though the stats command shows memory available to the container. The problem happened soon after task manager showed 0 free physical memory, while stats continued to show <100%.

Weird thing is, that the process kept running, so it was not killed. I can retry with a -m, however, it's strange that this happens on 1.9.x, but (following this discussion) not on 1.8. Also, running the same on a 1GB DigitalOcean droplet (also 1.9.1) succeeded. Perhaps that one uses swap, should check that

It actually kept happening to me after I uninstalled 1.9.1 and installed 1.8.3. Looked like the uninstall wasn't very thorough though on Mac because firing up the shell was without delay on 1.8.3, unlike a normal first run where it sets up ssh keys and stuff.

_USER POLL_

_The best way to get notified when there are changes in this discussion is by clicking the Subscribe button in the top right._

The people listed below have appreciated your meaningful discussion with a random +1:

@mattes

31 participants on this issue and counting.

@bean5 please keep your comments constructive

@thaJeztah I didn't mean to offend nor be deconstructive. I mean to draw attention to the fact that github shows the number of people participating...and I gathered that @GordonTheTurtle wanted to construct a list of people who have done +1. Maybe I was confused by what he meant. In any case, I watch this issue with great anticipation since it has affected me on more than one occasion in the past weeks. I am glad we have information from various users.

I am able to duplicate this issue on my setup (using Docker Machine on Mac).

Here are my findings so far.

As noted by other posters, the simplest way to duplicate this has been to use the boot2docker 1.9.1 ISO with AUFS. This Dockerfile should minimally reproduce the problem fairly quickly:

FROM debian:jessie

ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends openjdk-7-jre-headless

Looking at dmesg, I see some AUFS errors after attempting such a build, but I am not 100% sure they are related:

docker@default:~$ dmesg | tail
aufs au_opts_verify:1597:docker[14186]: dirperm1 breaks the protection by the permission bits on the lower branch
aufs au_opts_verify:1597:docker[14186]: dirperm1 breaks the protection by the permission bits on the lower branch
aufs au_opts_verify:1597:docker[14186]: dirperm1 breaks the protection by the permission bits on the lower branch
device veth955cc15 entered promiscuous mode
IPv6: ADDRCONF(NETDEV_UP): veth955cc15: link is not ready
eth0: renamed from vethc63e038
IPv6: ADDRCONF(NETDEV_CHANGE): veth955cc15: link becomes ready
docker0: port 2(veth955cc15) entered forwarding state
docker0: port 2(veth955cc15) entered forwarding state
docker0: port 2(veth955cc15) entered forwarding state

If I create a Docker 1.9.1 machine which uses overlay as the storage driver:

$ docker-machine create -d virtualbox --engine-storage-driver overlay overlay

The process does NOT hang and this line runs successfully! Looks like AUFS and/or kernel is the problem.
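
Since the storage driver and kernel version are the two facts that matter here, a small Python sketch for pulling them out of `docker info` text may help when comparing machines. The field names come from the `docker info` output posted in this thread; the helper names `parse_docker_info` and `uses_aufs` are made up for illustration.

```python
# Excerpt of the `docker info` output posted earlier in this thread.
SAMPLE_INFO = """\
Server Version: 1.9.1
Storage Driver: aufs
Kernel Version: 4.1.13-boot2docker
Operating System: Boot2Docker 1.9.1 (TCL 6.4.1); master : cef800b - Fri Nov 20 19:33:59 UTC 2015
"""


def parse_docker_info(text):
    """Turn 'Key: value' lines into a dict, splitting on the first colon only."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def uses_aufs(info):
    """True when the reported storage driver is aufs."""
    return info.get("Storage Driver", "").lower() == "aufs"


if __name__ == "__main__":
    info = parse_docker_info(SAMPLE_INFO)
    print(info["Storage Driver"], info["Kernel Version"], uses_aufs(info))
```

Splitting on the first colon only keeps values such as the `Operating System` line intact, even though they contain further colons.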

boot2docker/boot2docker _did_ bump both kernel versions and AUFS commit for the 1.9.1 release, so those are both factors which need to be ruled out or investigated further:

Currently trying the 1.9.0 ISO with a 1.9.1 binary to see if the potential bug surface can be narrowed down further.

The Dockerfile will build fine and not hang on a boot2docker 1.9.0 ISO with a Docker 1.9.1 binary. The issue seems not to lie with Docker 1.9.1, but rather the environment in which it is being run.

I am using the 1.9.1 release with no issue on aufs, but have significantly more cpu/ram/storage than the default machine config.

I just tried raising the memory to 4GB for my VM, but still able to reproduce

@cpuguy83 AUFS on boot2docker 1.9.1?

As noted above, b2d bundles a very specific version of AUFS.

Yep

Containers: 13
Images: 191
Server Version: 1.9.1
Storage Driver: aufs
 Root Dir: /mnt/sda1/var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 221
 Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.1.13-boot2docker
Operating System: Boot2Docker 1.9.1 (TCL 6.4.1); master : cef800b - Fri Nov 20 19:33:59 UTC 2015
CPUs: 1
Total Memory: 3.859 GiB
Name: default
ID: XMQH:4YAW:ZDSA:OWC7:GAPC:US5P:YQ4M:SVMQ:VXNL:RRZC:YNHT:ZBHE
Debug mode (server): true
 File Descriptors: 12
 Goroutines: 19
 System Time: 2015-12-01T23:05:28.760107918Z
 EventsListeners: 0
 Init SHA1:
 Init Path: /usr/local/bin/docker
 Docker Root Dir: /mnt/sda1/var/lib/docker
Labels:
 provider=virtualbox

I also see some java processes becoming defunct in a container. I am able to reproduce this issue with the following steps.
Run the container:

docker run --rm -it myJavaContainerFromCentos7 bash

create Foo.java with the following:

class Foo {
    public static void main (String[] a) {
        System.out.println("hello world");
    }
}

Compiling and running it results in a defunct java process, with 1 core at 100% CPU:
javac Foo.java && java Foo

however... if a System.exit(0); is added after the println everything is ok:

class Foo {
    public static void main (String[] a) {
        System.out.println("hello world");
        System.exit(0);  // clean exit, no hang
    }
}
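
To turn this into a repeatable check, one option is to run the command under a timeout and report whether it exited. Below is a rough Python sketch of that idea; `exits_within` is a hypothetical helper name, and the `java Foo` invocation in the comment assumes the class from this comment has been compiled in the current directory. The helper deliberately does not wait for the child after sending SIGKILL, on the assumption that a process stuck like the ones in this thread may never be reaped.

```python
import subprocess
import sys


def exits_within(cmd, seconds):
    """Return True if cmd exits on its own within `seconds`, else False."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        proc.wait(timeout=seconds)
        return True
    except subprocess.TimeoutExpired:
        # Best effort only: do not block on a process that may never go away.
        proc.kill()
        return False


if __name__ == "__main__":
    # Inside the affected container one would try: exits_within(["java", "Foo"], 30)
    print(exits_within([sys.executable, "-c", "print('hello world')"], 30))
```

Running it once for each variant of `Foo` should show whether only the version without `System.exit(0)` fails to exit in a given environment.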

version info:
osx 10.10.3
docker 1.9.1
boot2docker version 1.9.1 uname -a is "linux ci 4.1.13-boot2docker"
numproc = 1

strace output with System.exit(0);

open("/usr/java/jdk1.7.0_75/jre/lib/amd64/jvm.cfg", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0755, st_size=677, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f27b1dab000
read(3, "# Copyright (c) 2003, Oracle and"..., 4096) = 677
read(3, "", 4096)                       = 0
close(3)                                = 0
munmap(0x7f27b1dab000, 4096)            = 0
stat("/usr/java/jdk1.7.0_75/jre/lib/amd64/server/libjvm.so", {st_mode=S_IFREG|0755, st_size=15224066, ...}) = 0
futex(0x7f27b17580d0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
open("/usr/java/jdk1.7.0_75/jre/lib/amd64/server/libjvm.so", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\245\36\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=15224066, ...}) = 0
mmap(NULL, 15167976, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f27b031c000
mprotect(0x7f27b0e8f000, 2097152, PROT_NONE) = 0
mmap(0x7f27b108f000, 802816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xb73000) = 0x7f27b108f000
mmap(0x7f27b1153000, 262632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f27b1153000
close(3)                                = 0
open("/usr/java/jdk1.7.0_75/bin/../lib/amd64/jli/libm.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=11922, ...}) = 0
mmap(NULL, 11922, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f27b1da9000
close(3)                                = 0
open("/lib64/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260T\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1141552, ...}) = 0
mmap(NULL, 3150168, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f27b001a000
mprotect(0x7f27b011b000, 2093056, PROT_NONE) = 0
mmap(0x7f27b031a000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x100000) = 0x7f27b031a000
close(3)                                = 0
mprotect(0x7f27b031a000, 4096, PROT_READ) = 0
munmap(0x7f27b1da9000, 11922)           = 0
mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f27b1ca4000
mprotect(0x7f27b1ca4000, 4096, PROT_NONE) = 0
clone(child_stack=0x7f27b1da3fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f27b1da49d0, tls=0x7f27b1da4700, child_tidptr=0x7f27b1da49d0) = 118
futex(0x7f27b1da49d0, FUTEX_WAIT, 118, NULLhellowerld
 <unfinished ...>
 +++ exited with 0 +++

strace output _without_ System.exit(0);

open("/usr/java/jdk1.7.0_75/jre/lib/amd64/jvm.cfg", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0755, st_size=677, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fac9a490000
read(3, "# Copyright (c) 2003, Oracle and"..., 4096) = 677
read(3, "", 4096)                       = 0
close(3)                                = 0
munmap(0x7fac9a490000, 4096)            = 0
stat("/usr/java/jdk1.7.0_75/jre/lib/amd64/server/libjvm.so", {st_mode=S_IFREG|0755, st_size=15224066, ...}) = 0
futex(0x7fac99e3d0d0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
open("/usr/java/jdk1.7.0_75/jre/lib/amd64/server/libjvm.so", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\245\36\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=15224066, ...}) = 0
mmap(NULL, 15167976, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac98a01000
mprotect(0x7fac99574000, 2097152, PROT_NONE) = 0
mmap(0x7fac99774000, 802816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xb73000) = 0x7fac99774000
mmap(0x7fac99838000, 262632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fac99838000
close(3)                                = 0
open("/usr/java/jdk1.7.0_75/bin/../lib/amd64/jli/libm.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=11922, ...}) = 0
mmap(NULL, 11922, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fac9a48e000
close(3)                                = 0
open("/lib64/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260T\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1141552, ...}) = 0
mmap(NULL, 3150168, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac986ff000
mprotect(0x7fac98800000, 2093056, PROT_NONE) = 0
mmap(0x7fac989ff000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x100000) = 0x7fac989ff000
close(3)                                = 0
mprotect(0x7fac989ff000, 4096, PROT_READ) = 0
munmap(0x7fac9a48e000, 11922)           = 0
mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fac9a389000
mprotect(0x7fac9a389000, 4096, PROT_NONE) = 0
clone(child_stack=0x7fac9a488fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fac9a4899d0, tls=0x7fac9a489700, child_tidptr=0x7fac9a4899d0) = 142
futex(0x7fac9a4899d0, FUTEX_WAIT, 142, NULLhellowerld
) = 0
exit_group(0)                           = ?

the process is now hung but you can enter the container:

docker exec -it myContainer bash

and see the following:

ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 23:47 ?        00:00:00 bash
root       138     1  0 23:51 ?        00:00:00 strace java Foo
root       141   138 24 23:51 ?        00:01:21 [java] <defunct>
root       151     0  1 23:57 ?        00:00:00 bash
root       167   151  0 23:57 ?        00:00:00 ps -ef

quick look at stats:

CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O               BLOCK I/O
myContainer                  24.72%              64.18 MB / 8.365 GB   0.77%               11.09 MB / 202.6 kB   8.192 kB / 14.99

Everything works fine in 1.8.3.

+1, Docker version 1.9.1, build a34a1d5, OS X

+1, Docker version 1.9.1, build a34a1d5, OS X 10.10.5, Docker Machine Version: 0.5.1 (HEAD)

+1

Docker version 1.9.1, build a34a1d5, OS X 10.11.1 (15B42)

+1

Docker version 1.9.1, build a34a1d5 on OS X 10.11.1

This issue really is quite bizarre. If I strace the failing apt-get command, the end of the output is:

stat("/etc/apt/sources.list", {st_mode=S_IFREG|0644, st_size=161, ...}) = 0
open("/etc/apt/sources.list", O_RDONLY) = 5
read(5, "deb http://httpredir.debian.org/"..., 8191) = 161
pipe([6, 7])                            = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fc6fc88aa10) = 14
close(7)                                = 0
fcntl(6, F_GETFL)                       = 0 (flags O_RDONLY)
fstat(6, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc6fc892000
lseek(6, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
read(6, Process 14 attached
 <unfinished ...>
[pid    14] rt_sigaction(SIGPIPE, {SIG_DFL, [PIPE], SA_RESTORER|SA_RESTART, 0x7fc6fb531180}, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fc6fb531180}, 8) = 0
[pid    14] rt_sigaction(SIGQUIT, {SIG_DFL, [QUIT], SA_RESTORER|SA_RESTART, 0x7fc6fb531180}, {SIG_DFL, [], 0}, 8) = 0
[pid    14] rt_sigaction(SIGINT, {SIG_DFL, [INT], SA_RESTORER|SA_RESTART, 0x7fc6fb531180}, {SIG_DFL, [], 0}, 8) = 0
[pid    14] rt_sigaction(SIGWINCH, {SIG_DFL, [WINCH], SA_RESTORER|SA_RESTART, 0x7fc6fb531180}, {0x7fc6fc0e5750, [WINCH], SA_RESTORER|SA_RESTART, 0x7fc6fb531180}, 8) = 0
[pid    14] rt_sigaction(SIGCONT, {SIG_DFL, [CONT], SA_RESTORER|SA_RESTART, 0x7fc6fb531180}, {SIG_DFL, [], 0}, 8) = 0
[pid    14] rt_sigaction(SIGTSTP, {SIG_DFL, [TSTP], SA_RESTORER|SA_RESTART, 0x7fc6fb531180}, {SIG_DFL, [], 0}, 8) = 0
[pid    14] getrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
[pid    14] fcntl(3, F_SETFD, FD_CLOEXEC) = 0
[pid    14] getrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
[pid    14] fcntl(4, F_SETFD, FD_CLOEXEC) = 0
[pid    14] getrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
[pid    14] fcntl(5, F_SETFD, FD_CLOEXEC) = 0
[pid    14] getrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
[pid    14] fcntl(6, F_SETFD, FD_CLOEXEC) = 0
[pid    14] getrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
[pid    14] fcntl(7, F_SETFD, FD_CLOEXEC) = 0
[pid    14] getrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
[pid    14] fcntl(8, F_SETFD, FD_CLOEXEC) = -1 EBADF (Bad file descriptor)
[pid    14] getrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
[pid    14] fcntl(9, F_SETFD, FD_CLOEXEC) = -1 EBADF (Bad file descriptor)
[pid    14] getrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
[pid    14] fcntl(10, F_SETFD, FD_CLOEXEC) = -1 EBADF (Bad file descriptor)
[pid    14] getrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
[pid    14] fcntl(11, F_SETFD, FD_CLOEXEC) = -1 EBADF (Bad file descriptor)

Where those (Bad file descriptor) errors continue to loop indefinitely.

RLIMIT_NOFILE
              Specifies a value one greater than the maximum file descriptor
              number that can be opened by this process.  Attempts (open(2),
              pipe(2), dup(2), etc.)  to exceed this limit yield the error
              EMFILE.  (Historically, this limit was named RLIMIT_OFILE on
              BSD.)
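
One possible reading of the trace (an interpretation, not something confirmed in this thread) is that the child walks every descriptor number from 3 up to the `RLIMIT_NOFILE` soft limit and tries to set `FD_CLOEXEC` on each, getting `EBADF` for the ones that are not open. With a limit of 1024*1024 that is roughly a million `fcntl` calls, which is slow under strace and may look endless even if it eventually finishes. A minimal Python sketch of that pattern, with the hypothetical helper name `mark_cloexec` (Unix only, because it uses the `fcntl` and `resource` modules):

```python
import errno
import fcntl
import resource


def mark_cloexec(first_fd, limit):
    """Try to set FD_CLOEXEC on every descriptor number in [first_fd, limit).

    Returns (ok, bad): how many calls succeeded and how many hit EBADF.
    """
    ok = bad = 0
    for fd in range(first_fd, limit):
        try:
            fcntl.fcntl(fd, fcntl.F_SETFD, fcntl.FD_CLOEXEC)
            ok += 1
        except OSError as exc:
            if exc.errno != errno.EBADF:
                raise
            bad += 1  # descriptor number is not open in this process
    return ok, bad


if __name__ == "__main__":
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft RLIMIT_NOFILE:", soft)
    # Small range so the demo finishes instantly; the trace implies the full limit.
    print(mark_cloexec(3, 64))
```

The cost of the loop scales with the limit rather than with the number of descriptors actually open, which is why a very high `RLIMIT_NOFILE` makes this pattern expensive.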

Is SIGPIPE failing? This might correspond to my previous post, where I saw a java "hello world" causing zombie processes without an explicit "System.exit(0);" -- or maybe that's a completely different issue. If so, sorry for the noise.

what happens to your cpu while looping indefinitely?

@andrewgdavis It's at 100%

> java "hello world" causing zombie processes without an explicit "System.exit(0);"

That certainly sounds similar to the problem encountered here.

I can definitely confirm the b2d issue (even did the bisect to track it most positively to the 4.1.13 kernel bump). I can also reproduce on 4.2.6 with b2d.

As an additional kink, my Gentoo host is currently on 4.1.13 + AUFS patches also, and I'm seeing the same exact problem, so we've definitely ruled out anything b2d-specific.

I think it might be worth trawling through commits between 4.1.12 and 4.1.13 to see if anything that might be related jumps out.

(ie, https://www.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.1.13)

Yup, something breaks from kernel 4.1.12 => 4.1.13. I can confirm that baking a boot2docker ISO for the former doesn't trip this bug but the latter does.

So, it's not specifically related to boot2docker, but seems to be related to the kernel version interacting with AUFS.

or perhaps the specific way the AUFS driver in Docker interacts with the newer kernel -- TBD, probably with a linux-stable git bisect between 4.1.12 and 4.1.13 :cry:
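
For readers who have not used bisection before, the idea is a binary search over the commit range: test the midpoint, then keep whichever half still contains the change from good to bad. The Python sketch below shows only that logic; `first_bad` and the commit names are hypothetical, and `is_bad` stands in for "build a kernel at this commit and see whether the build hangs".

```python
def first_bad(commits, is_bad):
    """Find the first bad commit by binary search.

    `commits` is ordered oldest to newest; commits[0] is assumed known good
    and commits[-1] is assumed known bad.
    """
    lo, hi = 0, len(commits) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the breaking change is at mid or earlier
        else:
            lo = mid  # the breaking change is after mid
    return commits[hi]


if __name__ == "__main__":
    # Made-up history: c0 is good, c9 is bad, c6 introduced the problem.
    history = ["c%d" % i for i in range(10)]
    print(first_bad(history, lambda c: int(c[1:]) >= 6))
```

Each test halves the range, so even a few hundred commits between two stable releases need fewer than ten kernel builds.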

I've got a harebrained theory...

http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=v4.1.13&id=6c0da28df5dac10672efe955eb89051a850008eb

The commit above makes a change in filemap.c to generic_perform_write(struct file *file, struct iov_iter *i, loff_t pos).

Below is the chunk of code I personally want to test, because the comments describe both deadlock and livelock race conditions and I see the CPU pegged at 100%. But that's just me and my jump-to-conclusions mat.

4.1.13 mm/filemap.c#l_2448

...
again:
      /*
       * Bring in the user page that we will copy from _first_.
       * Otherwise there's a nasty deadlock on copying from the
       * same page as we're writing to, without it being marked
       * up-to-date.
       *
       * Not only is this an optimisation, but it is also required
       * to check that the address is actually valid, when atomic
       * usercopies are used, below.
       */
      if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
          status = -EFAULT;
          break;
      }

      if (fatal_signal_pending(current)) {
          status = -EINTR;
          break;
      }

      status = a_ops->write_begin(file, mapping, pos, bytes, flags,
                      &page, &fsdata);
      if (unlikely(status < 0))
          break;

      if (mapping_writably_mapped(mapping))
          flush_dcache_page(page);

      copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
      flush_dcache_page(page);

      status = a_ops->write_end(file, mapping, pos, bytes, copied,
                      page, fsdata);
      if (unlikely(status < 0))
          break;
      copied = status;

      cond_resched();

      iov_iter_advance(i, copied);
      if (unlikely(copied == 0)) {
          /*
           * If we were unable to copy any data at all, we must
           * fall back to a single segment length write.
           *
           * If we didn't fallback here, we could livelock
           * because not all segments in the iov can be copied at
           * once without a pagefault.
           */
          bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
                      iov_iter_single_seg_count(i));
          goto again;
      }
      pos += copied;
      written += copied;

      balance_dirty_pages_ratelimited(mapping);
  } while (iov_iter_count(i));

@andrewgdavis one could use that commit during git bisect as a specific testing point!
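For anyone attempting the bisect, here is a minimal sketch of a pass/fail check that could be wired into `git bisect run`. It only covers the last step: it assumes each candidate kernel has already been built with the AUFS patches, booted, and had the Java reproduction run on it. The helper name `no_defunct_java` is made up for illustration; it simply looks for the `[java] <defunct>` entry that people in this thread see in `ps` output.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): reads a ps-style listing on
# stdin and succeeds only if no defunct java process is listed.
# git bisect run treats exit status 0 as "good" and 1 as "bad".
no_defunct_java() {
    if grep -q '\[java\] <defunct>'; then
        return 1    # symptom present: this kernel is "bad"
    fi
    return 0        # symptom absent: this kernel is "good"
}

# Example, run on the machine under test after the reproduction:
#   ps aux | no_defunct_java
```

In a linux-stable checkout, `git bisect start v4.1.13 v4.1.12` marks the bad and good endpoints discussed here; the build-and-boot step between bisect iterations is the slow part and is not covered by this sketch.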

Seeing a similar hang when shutting down mongodb. Definitely present in 1.9.x. Not present in 1.8.x.

I've been able to solve this issue for myself by increasing the docker-machine VM's memory from 1024 to 2048 MB and assigning 2 CPUs instead of 1.

Works:

VM: Ubuntu 14.04 (2gb ram)
Docker Engine: 1.9.1
Docker base image: ubuntu:latest

Does not work:

VM: Ubuntu 15.10 (2 gb ram)
Docker Engine: 1.9.1,1.9.0,1.8.3
Docker base image: ubuntu:latest, ubuntu:14.04

@marsinvasion If possible, can you print the output of uname -a on both tested systems?

VM: Ubuntu 14.04
Linux ubuntu 3.19.0-25-generic #26~14.04.1-Ubuntu SMP Fri Jul 24 21:16:20 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

VM: Ubuntu 15.10
Linux ubuntu 4.2.0-19-generic #23-Ubuntu SMP Wed Nov 11 11:39:30 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

+1
Docker version 1.9.1, build a34a1d5 on OS X 10.11.1

Encountered on OS X 10.9.5 with docker 1.9.1.

Inspired by @marsinvasion, I got a successful workaround by giving my docker-machine 2 CPUs and 4096Mb RAM.

Oops, spoke too soon. It stopped working upon changing a Dockerfile I'm working on and re-running build.

Also seeing this hellacious bug (docker-machine boot2docker 1.9.1 on OS X), from a previously building ubuntu:15.04 image. It seems to require restarting my docker server to get those zombie containers to go away.

I thought https://github.com/docker-library/java/issues/19 was related but maybe not, here we're getting a hang, there they got an error about not finding "java".

Switched my server to overlay as a workaround. Before that it created a bunch of zombie containers as well.

Docker version 1.9.1, build a34a1d5 on OS X 10.11.1

Anyone know what's involved in migrating an existing boot2docker.iso system to https://docs.docker.com/engine/userguide/storagedriver/overlayfs-driver/ or is it easier to do a full rebuild? That page has ominous warnings about CentOS image builds -- what are the "yum" workarounds, is it related to https://github.com/docker/docker/issues/10180?

It's fixed in 1.9.1a - install this if you're on OSX - https://github.com/docker/toolbox/releases/download/v1.9.1a/DockerToolbox-1.9.1a.pkg

Definitely not fixed by Docker Toolbox 1.9.1a. Suffering from this bug with that version. Looking back through the comments, it looks like I'm not the only one.

nope still not building

I had to delete the VM in virtualbox and start from scratch for it to work.

Also, tried deleting and creating a new VM several times to no avail.

Installed 1.9.1a, did docker-machine rm default and used Docker Quickstart Terminal to regenerate default machine. Rebuilt images (that derive from java:7-jre) and ran, still does not work. Continues to work just fine with overlay machine built as suggested above:

$ docker-machine create -d virtualbox --engine-storage-driver overlay overlay

^thanks! I can confirm the overlay machine is working.

Using overlay as the engine storage driver also worked for fixing the MongoDB shutdown hang.
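To double-check which storage driver a machine actually ended up with, the `Storage Driver:` line of `docker info` (as shown in the original report) can be extracted. A small sketch; the function name `storage_driver` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the "Storage Driver:" line
# from `docker info` output supplied on stdin.
storage_driver() {
    sed -n 's/^[[:space:]]*Storage Driver:[[:space:]]*//p'
}

# Example, assuming the machine named "overlay" from the command above:
#   eval "$(docker-machine env overlay)"
#   docker info | storage_driver
```

If this still prints `aufs`, the client is probably still pointed at the old default machine rather than the new one.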

You can work around the Dockerfile build failure by installing Oracle java instead of OpenJDK:

# Oracle java is bulkier but avoids boot2docker/aufs docker issue 18180
RUN apt-get install -y software-properties-common python-software-properties && add-apt-repository -y ppa:webupd8team/java && apt-get update
RUN echo oracle-java8-installer shared/accepted-oracle-license-v1-1 select true | /usr/bin/debconf-set-selections
RUN apt-get install -y oracle-java8-installer && apt-get install -y oracle-java8-set-default

But I was underestimating the scope of the problem, boot2docker 1.9.1 leads to zombie java processes, even on CentOS containers where openjdk installs fine.
root 322 11.1 0.0 0 0 ? Zsl 18:43 29:48 [java] <defunct>

I'm unable to configure my docker server with --engine-storage-driver overlay because I build CentOS-based images, and overlayfs is not compatible with yum (https://github.com/docker/docker/issues/10180).

I'm sure Docker folks would _not_ recommend this, but the way I moved past this blocking issue is by building a boot2docker.iso that uses docker 1.9.1 with a slightly older AUFS. Instructions in https://github.com/boot2docker/boot2docker/issues/1099#issuecomment-163052066.

tried oracle jdk1.7.0_75 and jdk1.8.0_65; both hang and create a defunct java process.

FROM : https://github.com/docker/docker/issues/10589
@neverfox exactly the same problem here, with the same image +1

~ docker version
Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.5.1
 Git commit:   a34a1d5
 Built:        Sat Nov 21 00:49:19 UTC 2015
 OS/Arch:      darwin/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   a34a1d5
 Built:        Fri Nov 20 17:56:04 UTC 2015
 OS/Arch:      linux/amd64


~ docker-machine inspect default
{
    "ConfigVersion": 3,
    "Driver": {
        "Driver": {
            "VBoxManager": {},
            "IPAddress": "192.168.99.100",
            "MachineName": "default",
            "SSHUser": "docker",
            "SSHPort": 61012,
            "SSHKeyPath": "/Users/myuser/.docker/machine/machines/default/id_rsa",
            "StorePath": "/Users/myuser/.docker/machine",
            "SwarmMaster": false,
            "SwarmHost": "tcp://0.0.0.0:3376",
            "SwarmDiscovery": "",
            "CPU": 1,
            "Memory": 4096,
            "DiskSize": 20000,
            "Boot2DockerURL": "",
            "Boot2DockerImportVM": "",
            "HostOnlyCIDR": "192.168.99.1/24",
            "HostOnlyNicType": "82540EM",
            "HostOnlyPromiscMode": "deny",
            "NoShare": false
        },
        "Locker": {}
    },
    "DriverName": "virtualbox",
    "HostOptions": {
        "Driver": "",
        "Memory": 0,
        "Disk": 0,
        "EngineOptions": {
            "ArbitraryFlags": [],
            "Dns": null,
            "GraphDir": "",
            "Env": [],
            "Ipv6": false,
            "InsecureRegistry": [],
            "Labels": [],
            "LogLevel": "",
            "StorageDriver": "",
            "SelinuxEnabled": false,
            "TlsVerify": true,
            "RegistryMirror": [],
            "InstallURL": "https://get.docker.com"
        },
        "SwarmOptions": {
            "IsSwarm": false,
            "Address": "",
            "Discovery": "",
            "Master": false,
            "Host": "tcp://0.0.0.0:3376",
            "Image": "swarm:latest",
            "Strategy": "spread",
            "Heartbeat": 0,
            "Overcommit": 0,
            "ArbitraryFlags": [],
            "Env": null
        },
        "AuthOptions": {
            "CertDir": "/Users/myuser/.docker/machine/certs",
            "CaCertPath": "/Users/myuser/.docker/machine/certs/ca.pem",
            "CaPrivateKeyPath": "/Users/myuser/.docker/machine/certs/ca-key.pem",
            "CaCertRemotePath": "",
            "ServerCertPath": "/Users/myuser/.docker/machine/machines/default/server.pem",
            "ServerKeyPath": "/Users/myuser/.docker/machine/machines/default/server-key.pem",
            "ClientKeyPath": "/Users/myuser/.docker/machine/certs/key.pem",
            "ServerCertRemotePath": "",
            "ServerKeyRemotePath": "",
            "ClientCertPath": "/Users/myuser/.docker/machine/certs/cert.pem",
            "StorePath": "/Users/myuser/.docker/machine/machines/default"
        }
    },
    "Name": "default",
    "RawDriver": "eyJWQm94TWFuYWdlciI6e30sIklQQWRkcmVzcyI6IjE5Mi4xNjguOTkuMTAwIiwiTWFjaGluZU5hbWUiOiJkZWZhdWx0IiwiU1NIVXNlciI6ImRvY2tlciIsIlNTSFBvcnQiOjYxMDEyLCJTU0hLZXlQYXRoIjoiL1VzZXJzL2RhdmlkZnJhbmNvZXVyLy5kb2NrZXIvbWFjaGluZS9tYWNoaW5lcy9kZWZhdWx0L2lkX3JzYSIsIlN0b3JlUGF0aCI6Ii9Vc2Vycy9kYXZpZGZyYW5jb2V1ci8uZG9ja2VyL21hY2hpbmUiLCJTd2FybU1hc3RlciI6ZmFsc2UsIlN3YXJtSG9zdCI6InRjcDovLzAuMC4wLjA6MzM3NiIsIlN3YXJtRGlzY292ZXJ5IjoiIiwiQ1BVIjoxLCJNZW1vcnkiOjQwOTYsIkRpc2tTaXplIjoyMDAwMCwiQm9vdDJEb2NrZXJVUkwiOiIiLCJCb290MkRvY2tlckltcG9ydFZNIjoiIiwiSG9zdE9ubHlDSURSIjoiMTkyLjE2OC45OS4xLzI0IiwiSG9zdE9ubHlOaWNUeXBlIjoiODI1NDBFTSIsIkhvc3RPbmx5UHJvbWlzY01vZGUiOiJkZW55IiwiTm9TaGFyZSI6ZmFsc2V9"
}
➜  ~  docker inspect 74
[
{
    "Id": "7471b734d7e7e47270511453a04d903c974cba77a2a0d259255355a653f95e04",
    "Created": "2015-11-27T13:23:11.515987776Z",
    "Path": "/docker-entrypoint.sh",
    "Args": [
        "cassandra",
        "-f"
    ],
    "State": {
        "Status": "running",
        "Running": true,
        "Paused": false,
        "Restarting": false,
        "OOMKilled": false,
        "Dead": false,
        "Pid": 1263,
        "ExitCode": 0,
        "Error": "",
        "StartedAt": "2015-11-27T13:23:11.612899257Z",
        "FinishedAt": "0001-01-01T00:00:00Z"
    },
    "Image": "338a92b912e4d5a84c4f399a9475a1476f8226eff85c2592c8e80ba58b13d225",
    "ResolvConfPath": "/mnt/sda1/var/lib/docker/containers/7471b734d7e7e47270511453a04d903c974cba77a2a0d259255355a653f95e04/resolv.conf",
    "HostnamePath": "/mnt/sda1/var/lib/docker/containers/7471b734d7e7e47270511453a04d903c974cba77a2a0d259255355a653f95e04/hostname",
    "HostsPath": "/mnt/sda1/var/lib/docker/containers/7471b734d7e7e47270511453a04d903c974cba77a2a0d259255355a653f95e04/hosts",
    "LogPath": "/mnt/sda1/var/lib/docker/containers/7471b734d7e7e47270511453a04d903c974cba77a2a0d259255355a653f95e04/7471b734d7e7e47270511453a04d903c974cba77a2a0d259255355a653f95e04-json.log",
    "Name": "/pensive_kalam",
    "RestartCount": 0,
    "Driver": "aufs",
    "ExecDriver": "native-0.2",
    "MountLabel": "",
    "ProcessLabel": "",
    "AppArmorProfile": "",
    "ExecIDs": null,
    "HostConfig": {
        "Binds": null,
        "ContainerIDFile": "",
        "LxcConf": [],
        "Memory": 0,
        "MemoryReservation": 0,
        "MemorySwap": 0,
        "KernelMemory": 0,
        "CpuShares": 0,
        "CpuPeriod": 0,
        "CpusetCpus": "",
        "CpusetMems": "",
        "CpuQuota": 0,
        "BlkioWeight": 0,
        "OomKillDisable": false,
        "MemorySwappiness": -1,
        "Privileged": false,
        "PortBindings": {},
        "Links": null,
        "PublishAllPorts": false,
        "Dns": [],
        "DnsOptions": [],
        "DnsSearch": [],
        "ExtraHosts": null,
        "VolumesFrom": null,
        "Devices": [],
        "NetworkMode": "default",
        "IpcMode": "",
        "PidMode": "",
        "UTSMode": "",
        "CapAdd": null,
        "CapDrop": null,
        "GroupAdd": null,
        "RestartPolicy": {
            "Name": "no",
            "MaximumRetryCount": 0
        },
        "SecurityOpt": null,
        "ReadonlyRootfs": false,
        "Ulimits": null,
        "LogConfig": {
            "Type": "json-file",
            "Config": {}
        },
        "CgroupParent": "",
        "ConsoleSize": [
            0,
            0
        ],
        "VolumeDriver": ""
    },
    "GraphDriver": {
        "Name": "aufs",
        "Data": null
    },
    "Mounts": [
        {
            "Name": "2249b03f9a598e5ac3f306983877292baa299c4499c9db77eb9bfcb88fd2f541",
            "Source": "/mnt/sda1/var/lib/docker/volumes/2249b03f9a598e5ac3f306983877292baa299c4499c9db77eb9bfcb88fd2f541/_data",
            "Destination": "/var/lib/cassandra",
            "Driver": "local",
            "Mode": "",
            "RW": true
        }
    ],
    "Config": {
        "Hostname": "7471b734d7e7",
        "Domainname": "",
        "User": "",
        "AttachStdin": false,
        "AttachStdout": true,
        "AttachStderr": true,
        "ExposedPorts": {
            "7000/tcp": {},
            "7001/tcp": {},
            "7199/tcp": {},
            "9042/tcp": {},
            "9160/tcp": {}
        },
        "Tty": false,
        "OpenStdin": false,
        "StdinOnce": false,
        "Env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "CASSANDRA_VERSION=2.1.11",
            "CASSANDRA_CONFIG=/etc/cassandra"
        ],
        "Cmd": [
            "cassandra",
            "-f"
        ],
        "Image": "cassandra:2.1.11",
        "Volumes": {
            "/var/lib/cassandra": {}
        },
        "WorkingDir": "",
        "Entrypoint": [
            "/docker-entrypoint.sh"
        ],
        "OnBuild": null,
        "Labels": {},
        "StopSignal": "SIGTERM"
    },
    "NetworkSettings": {
        "Bridge": "",
        "SandboxID": "e2f074e4b10e67cd7ac22d6e73d50304fc3f0a68d67c7fee6d7f8d647c9eb9b1",
        "HairpinMode": false,
        "LinkLocalIPv6Address": "",
        "LinkLocalIPv6PrefixLen": 0,
        "Ports": {
            "7000/tcp": null,
            "7001/tcp": null,
            "7199/tcp": null,
            "9042/tcp": null,
            "9160/tcp": null
        },
        "SandboxKey": "/var/run/docker/netns/e2f074e4b10e",
        "SecondaryIPAddresses": null,
        "SecondaryIPv6Addresses": null,
        "EndpointID": "63596aa5ec20516d477921fec4197d086b4dd4f1ad25014b5ddf027b82891966",
        "Gateway": "172.17.0.1",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "IPAddress": "172.17.0.2",
        "IPPrefixLen": 16,
        "IPv6Gateway": "",
        "MacAddress": "02:42:ac:11:00:02",
        "Networks": {
            "bridge": {
                "EndpointID": "63596aa5ec20516d477921fec4197d086b4dd4f1ad25014b5ddf027b82891966",
                "Gateway": "172.17.0.1",
                "IPAddress": "172.17.0.2",
                "IPPrefixLen": 16,
                "IPv6Gateway": "",
                "GlobalIPv6Address": "",
                "GlobalIPv6PrefixLen": 0,
                "MacAddress": "02:42:ac:11:00:02"
            }
        }
    }
}
]

I simply ran docker run -it cassandra:2.1.11 and the terminal got stuck, with no way to stop the container. You have to stop the whole VM.

+1

Was able to duplicate issue earlier today on Docker 1.9.1 running Mac OSX 10.11.1 (15B42)

Was able to get around it by installing Docker 1.9.0

_Apologies for lack of information was on my work machine earlier during the day - will provide updated information at later time_

:+1:

Same here with Docker 1.9.1 and OS X 10.11.

For people having this issue

We've so far narrowed this down to not being a _docker_ bug but a kernel issue in combination with AUFS in the kernel that is used by the current boot2docker version; see https://github.com/docker/docker/issues/18180#issuecomment-161832035

  • If you want to stay informed on progress, use the subscribe button on this page. Do not comment if you don't have new information that may help to resolve this issue.
  • If you want to help resolve this, performing a git bisect of the kernel may help https://github.com/docker/docker/issues/18180#issuecomment-161834068
  • remember that each comment will send out more than 2000 e-mails to subscribers, and countless puppies will die :smile:

Just tested Storage Driver: devicemapper (with Server Version: 1.9.1 and kernel 4.2.6), and the bug does _not_ reproduce, so we're still in "strange interaction between some change in the newer kernel and the AUFS patches" land. :disappointed:

Tested, and bug is still present on the fresh 4.1.14 kernel, so we're still sitting on some commit that was backported to 4.1.13 interacting weird with the AUFS patches (and didn't get lucky with it being already fixed in the interim).

I decided to give it the old college try and cloned the boot2docker repo; then modified the aufs commit in the dockerfile to the previous version. So docker 1.9.1 kernel 4.1.13 + previous AUFS version that was shipped before 1.9.1. Compilation is slow on my machine ... is there a docker swarm setup that I can run in conjunction with a git bisect and aggregate the results? that would be sweet.

any way, I will post my results shortly if it works...

update:
4.1.13 + this AUFS commit still exhibit the problem.
ENV AUFS_COMMIT 1724fe65683d126a92c6baeea0b3c7d0306c63ef

I'm not aware of any easy setup to aggregate the results, although one could conceivably be built.

FWIW, https://sources.debian.net/src/ca-certificates-java/jessie/debian/postinst.in/ is the exact script that's running in that package, and https://sources.debian.net/src/ca-certificates-java/jessie/src/main/java/org/debian/security/UpdateCertificates.java/ is the exact Java source that's being executed when we get the hang + defunct + pegged CPU.

Got into related issue (java process hangs) today.

Host environment: Linux lenovo 4.2.0-19-generic #23-Ubuntu SMP Wed Nov 11 11:39:30 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Distro: Ubuntu 15.10
Docker Engine: 1.9.1
Docker Machine: 0.5.0 (04cfa58)

I am following the network multi-host tutorial. Only difference is that I am playing with the oracle/nosql image. That image is based on Oracle Linux and uses OpenJDK.

@brunoborges yes, that could be the same issue, see https://github.com/docker/docker/issues/18500#issuecomment-163334612

@brunoborges just check your boot2docker.iso version. If it is 1.9.1, you could try downgrading to 1.9.0, recreating your machine, and pulling the images once again.
If you go this way, could you write a short report here?

So i got to wondering why this only happens on java, and not any other languages. In one of my previous posts I was able to detail the most basic of reproductions by simply compiling and running

class Foo {
    public static void main(String[] a) {
        System.out.println("hellowerld");
    }
}

for the failure case which resulted in defunct java process
and then

class Foo {
    public static void main(String[] a) {
        System.out.println("hellowerld");
        System.exit(0);
    }
}

for the expected (non defunct) case.

I then tried to reproduce something similar using python. I was unsuccessful-- but I tried. For those interested I was trying to exhibit the last strace output exit_group(0) = ? that was seen from the zombie java process. (This link provided me with a lot of info about python threading / seccomp / etc http://stackoverflow.com/questions/25678139/how-do-you-cleanly-exit-after-enabling-seccomp-in-python )

So off to kernel land: After rebuilding the boot2docker iso, messing with the aufs versions and kernel versions (none of which really made a difference) I got fed up with how slow the compilation process was using numproc=1. So I changed it to 6. ==> note no longer 1 cpu (who only has 1 cpu nowadays?). Suddenly the failure case

class Foo {
    public static void main(String[] a) {
        System.out.println("hellowerld");
    }
}

started working.

Obviously, the next thing to try was to bump it back down to 1 cpu. ==> FAIL. back to a defunct java process.

So then I wanted to explore more about how java shuts down. It's not well defined. but with only 1 cpu this java process was able to be run successfully: (please don't make fun of my horrible java.)

import java.util.Iterator;
import java.util.Set;

class Foo {

    static public final Object a = new Object();

    // Static initializer: registers a shutdown hook, then lists all live threads.
    static {
        final Object aa = a;
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                System.out.println("added one");
                if (aa == null) {
                    System.out.println("out");
                }
            }
        });
        System.out.println("exit");
        Set<Thread> threadSet = Thread.getAllStackTraces().keySet();

        Thread[] threadArray = threadSet.toArray(new Thread[threadSet.size()]);
        for (Thread xxx : threadArray) {
            System.out.println(xxx.toString());
        }
        // System.exit(0);
    }

    static public void main(String[] a) {}
}

Can anyone else please confirm this behavior? << question is now moot

Update: Even with more than one core, a defunct java process can occur. (I was running cassandra-cli and it happened.)

docker-machine ssh myVM

ps -ef:
docker    6606  5863  0 Dec11 ?        00:00:00 /bin/sh /cassandra/bin/cassandra-cli -f /home/foo/my.cli -h 172.17.0.2
docker    6651  6606 99 Dec11 ?        00:41:29 [java] <defunct>
cat /proc/6606/stack
[<ffffffff8106e491>] do_wait+0x1ab/0x23f
[<ffffffff8106e5bc>] SYSC_wait4+0x97/0xb0
[<ffffffff8106d66b>] child_wait_callback+0x0/0x43
[<ffffffff8155466e>] system_call_fastpath+0x12/0x71
[<ffffffffffffffff>] 0xffffffffffffffff

cat /proc/6651/stack
[<ffffffff8106f06c>] do_exit+0x88f/0x8cc
[<ffffffff81075f8d>] signal_wake_up_state+0x23/0x36
[<ffffffff8106f104>] do_group_exit+0x36/0xa6
[<ffffffff8106f180>] __wake_up_parent+0x0/0x1d
[<ffffffff8155466e>] system_call_fastpath+0x12/0x71
[<ffffffffffffffff>] 0xffffffffffffffff

Having the same hanging issue building bitbucket-server - update-ca-certificates works fine but the jdk posthook hangs forever. Only a problem when using the 1.9.1 boot2docker. Switched to RancherOS image and had no problems. OSX 10.10.

With El Capitan, Docker 1.9.1 and Ubuntu 14.04.1 I get : Setting up ca-certificates-java which hangs infinitely.

@stremlenye rolled back to 1.9.0. Still hangs.

@brunoborges docker 1.9.0 or boot2docker.iso 1.9.0?

@stremlenye Docker 1.9.0 ... what are the instructions to get boot2docker.iso 1.9.0 in my system?

@brunoborges Check out this comment above:

https://github.com/docker/docker/issues/18180#issuecomment-160660738

carsten-ulrich-saitow-ag explains how to create a new docker-machine with the 1.9.0 iso using the --virtualbox-boot2docker-url flag. That advice saved my bacon! Once I did that, I could once again install my JRE RPM in my containers.

@mobsy74 @stremlenye tried with boot2docker 1.9.0 and it hangs sometimes.

@brunoborges Thanks for trying that. So I'll stick with 1.8.3 until this bug is fixed.

@stremlenye you mean boot2docker 1.8.3 ?

@brunoborges yes

hi,
i had the same issue.
problem got solved when downgrading the docker tools from 1.9.1 to 1.9.0
https://github.com/docker/toolbox/releases/download/v1.9.0/DockerToolbox-1.9.0.pkg

Enabling multiple cpus (in docker machine) solved this for me.

--virtualbox-cpu-count=4

@heiths can you please share versions of docker tools?

imho, boot2docker should step away from aufs to one of the other available storage drivers. There is a good reason why aufs never made it in to the linux kernel.

@robvanmieghem each driver has its limitations. Aufs is quite stable overall, and overlayfs has some blocking issues (depending on use)

@brunoborges DockerToolbox-1.9.1c -- this has worked for me on windows and osx.

@thaJeztah never said overlayfs is the perfect solution. I do think that btrfs is a good option for boot2docker though, boot2docker is dedicated for running docker containers and besides the fact that btrfs is fully supported in the linux kernel, it is really easy to look at the contents.

There are as many opinions on this as there are combinations of distro and filesystem :) No one solution is going to be perfect for all use-cases so in the spirit of open source and linux I think the best decision to make is to provide better support for multiple choices of distro. Already we have the choice of Boot2Docker or RancherOS and I believe some work was done to re-build boot2docker on a debian distro base. docker-machine will support ubuntu on cloud and bare metal so I'm sure an ubuntu-based vm iso wouldn't be hard to throw together as well as others such as one built on alpine or CoreOS etc. Then for each of those there's the choice of filesystem - again, RancherOS now offers ZFS as an optional install while CoreOS used to use BTRFS by default and I believe it's still an option and as of kernel 3.19 Ubuntu supports OverlayFS out of the box - anyone up for a Snappy Core based b2d image? ;)

Now, if only we could standardise the docker-machine parameter naming and remove the references to 'boot2docker' to reduce confusion - using boot2docker-url parameter to install RancherOS is somewhat unintuitive ;)

@far-blue +1

@heiths +1 . This solved it for me too on OSX with 1.9.1c

Setting CPUs to > 1 avoids the issue for me. 1.9.1c didn't help.

@heiths @fredriksvensson In fact, I had this issue appear randomly in a multi-container environment, and I also tried increasing the number of CPUs (memory too, but that's not the point). A couple of cycles of stop <all>/start <all> showed that the problem had not gone away. I would recommend checking your environment the same way to make sure the solution is stable for you.
/cc @timechanter

Oh, it's definitely not gone. But a 10% chance of a hang versus a 100% chance is at least manageable for the short term.

@heiths --virtualbox-cpu-count=4 also worked for me.

@timechanter +1 Setting CPUs to > 1 has avoided the issue for me at least once; looks like an effective workaround at the moment.

OSX 10.10.5

Uninstalled Docker toolbox 1.9.1. Downgraded to Docker toolbox 1.9.0 worked for me.

Same issue on El Capitan MacOSX

@heiths --virtualbox-cpu-count=4 works for me too.

Happened for me in Windows 7 with Docker Toolbox 1.9.1b and 1.9.1e.

"Setting up ca-certificates-java (20130815ubuntu1)..." - El Capitan MacOSX. Please help, guys!!! I can't fix it

@troian88 downgrade to boot2docker.iso 1.9.0 or 1.8.3.

@troian88, use a docker machine with multiple CPUs.

Can confirm that --virtualbox-cpu-count=2 is a temporary workaround on a hanging Setting up ca-certificates-java with Docker 1.9.1

@externl @stremlenye Thanks.

Looking at LWP stacks, I found that the bug is caused by aufs_destroy_inode.

I'll look further into this.

It seems to be a deadlock related to inode->i_mutex.

# uname -a
Linux suda-PC2 4.2.0-21-generic #25-Ubuntu SMP Wed Dec 2 18:42:25 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

# ps -eLf | grep java
root     23358 23091 23358  0    2 10:48 ?        00:00:00 [java] <defunct>
root     23358 23091 23359 99    2 10:48 ?        00:53:41 [java] <defunct>
root     25679 28603 25679  0    1 11:42 pts/22   00:00:00 grep --color=auto java

# cat /proc/23358/stack # this is not so much helpful
[<ffffffff8107e002>] do_exit+0x822/0xb10
[<ffffffff8107e383>] do_group_exit+0x43/0xb0
[<ffffffff8107e404>] SyS_exit_group+0x14/0x20
[<ffffffff817f0232>] entry_SYSCALL_64_fastpath+0x16/0x75
[<ffffffffffffffff>] 0xffffffffffffffff

# cat /proc/23358/task/23359/stack # seems very helpful
[<ffffffff81183fe5>] generic_file_write_iter+0xf5/0x1e0
[<ffffffff811fc98b>] new_sync_write+0x9b/0xe0
[<ffffffffc061c273>] do_xino_fwrite+0x53/0x90 [aufs]
[<ffffffffc061c2fe>] xino_fwrite.part.27+0xe/0x10 [aufs]
[<ffffffffc061c388>] xino_fwrite+0x88/0xa0 [aufs]
[<ffffffffc063bf8f>] au_xigen_inc+0x5f/0xc0 [aufs]
[<ffffffffc061d0c7>] au_xino_delete_inode+0x177/0x1f0 [aufs]
[<ffffffffc062f336>] au_iinfo_fin+0xc6/0x1b0 [aufs]
[<ffffffffc0617c76>] aufs_destroy_inode+0x16/0x30 [aufs]
[<ffffffff812186ac>] destroy_inode+0x3c/0x60
[<ffffffff812187eb>] evict+0x11b/0x180
[<ffffffff81218a39>] iput+0x199/0x220
[<ffffffff81214155>] __dentry_kill+0x195/0x1f0
[<ffffffff812142e5>] dput+0x135/0x230
[<ffffffff811ff098>] __fput+0x188/0x220
[<ffffffff811ff17e>] ____fput+0xe/0x10
[<ffffffff81098b8b>] task_work_run+0x9b/0xb0
[<ffffffff8107db80>] do_exit+0x3a0/0xb10
[<ffffffff8107e337>] SyS_exit+0x17/0x20
[<ffffffff817f0232>] entry_SYSCALL_64_fastpath+0x16/0x75
[<ffffffffffffffff>] 0xffffffffffffffff



# gdb /usr/lib/debug/boot/vmlinux-4.2.0-21-generic -ex 'l *(generic_file_write_iter+0xf5)'
0xffffffff81183fe5 is in generic_file_write_iter (/build/linux-1vdNXv/linux-4.2.0/mm/filemap.c:2652).

# gdb /usr/lib/debug/lib/modules/4.2.0-21-generic/kernel/fs/aufs/aufs.ko -ex 'l *(aufs_destroy_inode+0x16)'
0xca6 is in aufs_destroy_inode (/build/linux-1vdNXv/linux-4.2.0/fs/aufs/super.c:56).
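The manual steps above (spot the busy LWP in `ps -eLf`, then read its kernel stack) can be sketched as a small filter. This assumes the `ps -eLf` column layout shown above (UID, PID, PPID, LWP, C, ...); the function name and the 90% CPU threshold are arbitrary choices for illustration, not something from the thread.

```shell
#!/bin/sh
# Hypothetical helper: read `ps -eLf` output on stdin and print the
# /proc stack path of every defunct java thread that is burning CPU.
# Columns assumed: UID PID PPID LWP C NLWP STIME TTY TIME CMD
busy_defunct_java_stacks() {
    awk '/\[java\] <defunct>/ && $5 + 0 >= 90 {
        printf "/proc/%s/task/%s/stack\n", $2, $4
    }'
}

# Example (reading the stack files typically requires root):
#   ps -eLf | busy_defunct_java_stacks | xargs cat
```

Fed the `ps -eLf` lines above, this prints `/proc/23358/task/23359/stack`, which is the file that turned out to be the helpful one.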

Note

People are saying that the bug can be avoided when running on multiple CPUs.
However, I can still hit the bug on my physical 4-CPU box (although with < 1% probability).
So --virtualbox-cpu-count=2 does _NOT_ guarantee that you can avoid the bug.

Note that the number of CPUs still matters.
I can deterministically hit the bug when I run taskset 0x1 java (taskset pins the process to particular CPUs).

@AkihiroSuda thanks so much for looking into this. Keep us posted! :heart:

Note that this issue also occurs on Windows 7 when using Docker 1.8.3.

We're seeing the same (exactly the same stack traces as in AkihiroSuda's comment above) on older kernels:
Linux 3.19.0-42-generic #48~14.04.1-Ubuntu SMP x86_64 with Docker version 1.9.1, build a34a1d5

I can confirm @AkihiroSuda's assertion about multiple CPUs -- I hit this on my host as well which has 8 cores.

That AUFS debugging looks really interesting -- perhaps it's worth filing an issue with AUFS and seeing if the AUFS maintainers can help debug? It'd probably be helpful for them if we could reproduce with just AUFS (no Docker), but that's not exactly trivial. :smile:

@AkihiroSuda, I have seen this hang with a non-Java use case. Namely trying to shutdown a forked MongoDB daemon. This does not occur on MongoDB startup or regular usage, but does occur reliably on shutdown.

@jakirkham Thanks, it seems that a specific thread configuration (which tends to appear in Java, MongoDB, and perhaps other programs) is needed to trigger the bug.

BTW, on second thought, maybe AUFS hanging is a "result" of the bug rather than "the cause" of the bug.
I'm looking into this topic: http://www.serverphorums.com/read.php?12,673905
(I did not notice that zap_pid_ns_processes() was also hanging two days ago, because I had used bash as the init process at that time. https://github.com/docker/docker/issues/18180#issuecomment-166186061)

https://github.com/docker/docker/issues/18180#issuecomment-161843456
@andrewgdavis, you're quite right!

The bug seems to be a regression caused by commit 296291cd to the Linux kernel (mm/filemap.c).

I made a Boot2Docker ISO (boot2docker-v1.9.1-fix1.iso) that omits the commit 296291cd: https://github.com/AkihiroSuda/boot2docker/releases/tag/v1.9.1-fix1

Hope it works for everyone. :smiley:

Commit 296291cd produces an infinite -EINTR loop in mm/filemap.c:generic_perform_write(), which fs/aufs/xino.c:do_xino_fwrite() cannot tolerate:

static ssize_t do_xino_fwrite(vfs_writef_t func, struct file *file, void *kbuf,
                  size_t size, loff_t *pos)
{
..
    do {
         /* cannot escape from this loop 
            when func returns -EINTR infinitely! */
        err = func(file, buf.u, size, pos);
    } while (err == -EAGAIN || err == -EINTR);
..
}

As do_xino_fwrite() loops infinitely, zap_pid_ns_processes() (executed in another LWP) cannot return from schedule() when running on a single processor.
That's why we are hitting the bug.
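
To make the failure mode concrete, here is a minimal user-space sketch in Python (not kernel code) of the retry loop quoted above. The names `retry_write`, `always_eintr`, and `make_flaky`, and the `max_spins` guard, are hypothetical; the guard exists only so the demonstration terminates, and the real loop in do_xino_fwrite() has nothing like it.

```python
import errno

def retry_write(func, max_spins=None):
    """Mimic the do/while loop: retry while func() returns -EAGAIN or -EINTR.

    max_spins is a hypothetical guard added for this demo only.
    Returns (last_result, number_of_calls).
    """
    spins = 0
    while True:
        err = func()
        spins += 1
        if err not in (-errno.EAGAIN, -errno.EINTR):
            return err, spins          # normal exit: success or a real error
        if max_spins is not None and spins >= max_spins:
            return err, spins          # demo-only escape hatch

def always_eintr():
    # Stand-in for a write that is interrupted on every attempt
    return -errno.EINTR

def make_flaky(failures, result):
    # Stand-in for a write that is interrupted `failures` times, then succeeds
    state = {"left": failures}
    def write():
        if state["left"] > 0:
            state["left"] -= 1
            return -errno.EINTR
        return result
    return write

print(retry_write(make_flaky(3, 8)))           # terminates after a few retries
print(retry_write(always_eintr, max_spins=1000))  # would never return without the guard
```

With a write that is only occasionally interrupted, the loop is harmless. With one that returns -EINTR every time, only the demo guard stops it, which matches the busy loop described above.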

@gilles-duboscq
You're hitting the bug because Canonical originally backported the commit 296291cd to the kernel 3.19 tree: http://kernel.ubuntu.com/git/ubuntu/ubuntu-vivid.git/commit/?id=6b08592b8acc677d5b9bb7986343fdd6e0ad3303

@AkihiroSuda wow, thank you for finding this! What are the next steps? Should that patch be reverted, or is there a way to improve the patch? Are you considering sending a patch to the kernel upstream?

@AkihiroSuda, your fix works like a charm. Thanks!

@thaJeztah

That's not an easy question.
Without commit 296291cd, sendfile(2) can be unkillable.
I fear this unkillable sendfile can cause a security issue (i.e., process exhaustion attack by an anonymous user) in certain environments.

I'm trying to improve commit 296291cd, but it may take a while.

Anyway, I reported this bug to the kernel bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109971

I also made a Docker container akihirosuda/test18180 for ease of debugging: https://github.com/AkihiroSuda/test18180/tree/v0.0.1

$ docker run -it --rm akihirosuda/test18180
[INFO] Checking whether hitting docker#18180.
<-- hangs up here with commit 296291cd
[INFO] OK. not hitting docker#18180.
[INFO] Checking whether sendfile(2) is killable.
[INFO] If the container hangs up here, you are still facing the bug that linux@296291cd tried to fix.
<-- hangs up here without commit 296291cd
[INFO] OK. sendfile(2) is killable.
<-- No kernel can reach here

@AkihiroSuda hm, alright, sounds like an issue, yes. Thanks so much for the repro-container and research; at least there's a very specific task to work on, hopefully other people will join in, trying to assist finding a solution. Thank you sooooo much for your excellent work so far.

I was hit on OS X El Capitan Darwin Kernel Version 15.2.0
Docker version 1.9.1, build d12ea79c9de6d144ce6bc7ccfe41c507cca6fd35
boot2docker 1.9.1

The downgrade to boot2docker 1.9.0 with the commands below worked for me:

docker-machine rm default
docker-machine create -d virtualbox --virtualbox-boot2docker-url=https://github.com/boot2docker/boot2docker/releases/download/v1.9.0/boot2docker.iso default

@thaJeztah
AUFS is going to support 296291cd.
http://article.gmane.org/gmane.linux.file-systems.aufs.user/5343

So the next step is to wait for the update of AUFS.

You're a hero, @AkihiroSuda! Thanks for working with upstream to get this figured out! :heart:

If anyone wants to apply @AkihiroSuda's fix, this works like a charm:

docker-machine rm default
docker-machine create -d virtualbox --virtualbox-boot2docker-url=https://github.com/AkihiroSuda/boot2docker/releases/download/v1.9.1-fix1/boot2docker-v1.9.1-fix1.iso default

For anyone on Ubuntu 14.04, downgrading to kernel 3.13.0-71 or older should solve the problem. 296291cd was backported after that.

@ebpitts thanks for the tip. Here are the somewhat relevant steps to downgrade the Ubuntu 14.04 kernel to 3.13.0-71 with Docker 1.9.1 installed.

sudo apt-get install linux-image-3.13.0-71-generic
sudo apt-get install linux-generic linux-headers-generic linux-image-generic
sudo reboot

At this point you should have two kernels to pick from during bootloading. However, I was running in a remote Vagrant box over SSH, so no GRUB bootloader for me... so I removed the newer default kernel (3.13.0-74 in my case) as a boot option:

sudo apt-get remove linux-image-3.13.0-74-generic
sudo apt-get install linux-generic linux-headers-generic linux-image-generic

The output of the command had some things about Grub being updated, so you can inspect /boot/grub/grub.cfg and see what the default boot option will be on restart. I seem to have had to do this remove/re-add of the headers, but once the grub.cfg file looked good (3.13.0-71-generic was the only and first boot option) then go ahead and reboot:

sudo reboot

And then, back over SSH into my box:

$ uname -r
3.13.0-71-generic

But now it seemed docker was not working. So, full reinstall requires removal first:

sudo apt-get autoremove --purge docker-engine
sudo rm -rf /var/lib/docker

And, honestly, on my next attempt to run a docker build of the same container that was hanging at the step in the OP's title ("Setting up ca-certificates-java"), it _still_ died, and this time it locked up my machine. Now that I have no SSH access, I will go home and wait till 2016 to see if someone else has a better fix by then =/

So... I cannot confirm that downgrading the kernel to 3.13.0-71 is effective once you have hit this problem on Ubuntu 14.04 + Docker 1.9.1. Yuck.

Yeah, just to double confirm, I ran $ docker run -it --rm akihirosuda/test18180 and it still hangs.

[INFO] Checking whether hitting docker#18180.
........................................................................................
[INFO] OK. not hitting docker#18180.
[INFO] Checking whether sendfile(2) is killable.
[INFO] If the container hangs up here, you are still 
       facing the bug that linux@296291cd tried to fix.

This is after downgrading Ubuntu 14.04 to kernel version 3.13.0-71 with AUFS

$ docker info
Containers: 3
Images: 18
Server Version: 1.9.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 24
 Dirperm1 Supported: false
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.13.0-71-generic
Operating System: Ubuntu 14.04.3 LTS
CPUs: 1
Total Memory: 490 MiB
Name: myrunner
ID: MLBL:bla:blah

@ebpitts are you sure the kernel downgrade on Ubuntu is really the fix?

Interesting, even when I explicitly set the storage driver to devicemapper in /etc/default/docker:

DOCKER_OPTS="--dns 8.8.8.8 --dns 8.8.4.4 --storage-driver=devicemapper"

and restart the docker service, running docker info:

$ docker info
Containers: 1
Images: 16
Server Version: 1.9.1
Storage Driver: devicemapper
 Pool Name: docker-8:1-399761-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 107.4 GB
 Backing Filesystem:
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 2.817 GB
 Data Space Total: 107.4 GB
 Data Space Available: 35.25 GB
 Metadata Space Used: 2.74 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.145 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.77 (2012-10-15)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.13.0-71-generic
Operating System: Ubuntu 14.04.3 LTS
CPUs: 1
Total Memory: 490 MiB
Name: myrunner
ID: MLBL:bla:blah

I still hang in the akihirosuda/test18180:latest test image.
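
When comparing reports like the ones above, it helps to pull out just the fields that matter for this bug (storage driver and kernel version). The following is a rough Python sketch, assuming the `docker info` layout shown in this thread, where sub-fields are indented by one space; `parse_docker_info` and `SAMPLE` are hypothetical names for illustration.

```python
def parse_docker_info(text):
    """Return the top-level 'Key: value' pairs from `docker info` output."""
    info = {}
    for line in text.splitlines():
        # Indented lines are sub-fields of the previous key (e.g. Pool Name)
        if line.startswith(" ") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        info[key.strip()] = value.strip()
    return info

# Abbreviated sample taken from the output quoted above
SAMPLE = """\
Server Version: 1.9.1
Storage Driver: devicemapper
 Pool Name: docker-8:1-399761-pool
Kernel Version: 3.13.0-71-generic
CPUs: 1
"""

info = parse_docker_info(SAMPLE)
print(info["Storage Driver"], info["Kernel Version"])
```

On a real host the text could come from running `docker info` and capturing its standard output.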

I downgraded to Docker 1.8.3 (not as easy without apt-get) on my Ubuntu 14 box using the raw binary, here are the steps... for anyone else... I'm back in business (please note I also downgraded the kernel to 3.13.0-71-generic previously as well, see above)

I downloaded the 1.8.3 binary from https://get.docker.com/builds/Linux/x86_64/docker-1.8.3, moved it to /usr/bin/docker, and made it executable with sudo chmod +x /usr/bin/docker.

Then, I grabbed the raw sysvinit-debian script, commented out the check_init() body and replaced it with simply echo '' and dropped it into /etc/init.d. Then I set it to run on boot startup as root with ln -s /etc/init.d/docker /etc/rc2.d/S99docker, and ran sudo reboot. After that, I'm back running the docker 1.8.3 service on boot from a binary installation. I don't know why these steps aren't really documented on the binary-install page at docker's website. Anyways.

$ service docker status
 * Docker is running

$ docker version
Client:
 Version:      1.8.3
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   f4bf5c7
 Built:        Mon Oct 12 18:01:15 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.8.3
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   f4bf5c7
 Built:        Mon Oct 12 18:01:15 UTC 2015
 OS/Arch:      linux/amd64

$ docker info
Containers: 4
Images: 38
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 46
 Dirperm1 Supported: false
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.13.0-71-generic
Operating System: Ubuntu 14.04.3 LTS
CPUs: 1
Total Memory: 490 MiB
Name: runner
ID: BLAH

Looks all good here - I can run $ docker run -it hello-world correctly. Running akihirosuda/test18180 still hangs actually (???) but I am able to build and run my original container without getting stuck on Setting up ca-certificates-java, which brought me here in the first place.

For reference, I am on Ubuntu 15.04 vivid, Linux 3.19.0-42-generic. Also affected.

Because I cannot use overlay (I need RHEL guests), I created a btrfs partition on a spare drive and mounted it at /var/lib/docker (stop the docker daemon first). Docker is now using btrfs happily, yet @akihirosuda's image still hangs at the second check (weird).

@mikeatlas
Thank you for testing my test18180 image.

The result is not weird.
The second check (Checking whether sendfile(2) is killable.) hangs up in old kernels.
You need a newer kernel with 296291cd to pass the second check.

| AUFS? | Kernel includes 296291cd? | Expected Result |
| --- | --- | --- |
| Y | Y | Hangs(1st check: Checking whether hitting docker#18180.) |
| Y | N | Hangs(2nd check: Checking whether sendfile(2) is killable.) |
| N | Y | Pass |
| N | N | Hangs(2nd check: Checking whether sendfile(2) is killable.) |
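
For anyone scripting around the test image, the four rows of the table above can be written as a small lookup. This is only a sketch of the table as posted; `expected_result` is a hypothetical helper name, and it encodes nothing beyond those four rows.

```python
def expected_result(uses_aufs: bool, kernel_has_296291cd: bool) -> str:
    """Expected outcome of akihirosuda/test18180, per the table above."""
    if not kernel_has_296291cd:
        # Without the commit, sendfile(2) is unkillable regardless of storage driver
        return "hangs at 2nd check (sendfile(2) is not killable)"
    if uses_aufs:
        # With the commit and AUFS, the xino write loop spins
        return "hangs at 1st check (docker#18180)"
    return "pass"

for aufs in (True, False):
    for commit in (True, False):
        print(aufs, commit, expected_result(aufs, commit))
```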

@cfstras Thank you, I'll look into that.

@cfstras, it is reproducible, and I opened #19073.

@mikeatlas RE: https://github.com/docker/docker/issues/18180#issuecomment-168111226

EDIT:
Earlier I was wrong about why docker didn't work after changing kernel versions, but I can confirm installing the extra package for my kernel and then re-installing docker solved this issue for me:

sudo apt-get remove docker-engine
sudo apt-get install linux-image-extra-3.13.0-71-generic
curl -sSL https://get.docker.com/ | sh

@lwcolton interesting, linux-image-extra-3.13.0-71-generic is not a package I thought to install (but I did install the generic extras packages afterward, as noted).... but still, I was under the impression AUFS module is far older than just the relatively recent 3.13.0-71 kernel. Regardless, downgrading to Docker 1.8.3 wasn't too painful either, and if I had to go through the process again, I'd prefer downgrading Docker over downgrading the linux kernel any day of the week.

@dschep Note for others that switching to OverlayFS requires kernel version 3.18+ on Linux, and as quoted on Docker's page, _As promising as OverlayFS is, it is still relatively young. Therefore caution should be taken before using it in production Docker environments._, and, comes with a warning that turning on OverlayFS should be preceded by effectively backing up any images you might have by pushing them _all_ to a registry beforehand.

@mikeatlas I guess this is the biggest limitation so far on OverlayFS: "_Therefore, using yum inside of a container on a Docker host using the overlay storage driver is unlikely to work without implementing workarounds._".

@brunoborges yum is currently being patched to work on overlayfs, so, soon, newer versions of yum should be able to work on overlayfs (there will still be some issues though, so it depends on your use-case/situation if you run into them). Excessive inode usage can be another issue.

I believe I'm experiencing this issue as well. The call trace mentions apparmor, but disabling apparmor makes no difference. Using devicemapper made the problem go away.

dmesg:

[ 2761.400178] INFO: task flake8:4231 blocked for more than 120 seconds.
[ 2761.403014]       Not tainted 3.13.0-74-generic #118-Ubuntu
[ 2761.405419] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2761.408741] flake8          D ffff8807707d3180     0  4231   1798 0x00000000
[ 2761.408745]  ffff8806bcb07c70 0000000000000082 ffff880035b34800 ffff8806bcb07fd8
[ 2761.408748]  0000000000013180 0000000000013180 ffff880035b34800 ffff8806b95054f8
[ 2761.408750]  ffff8806b95054fc ffff880035b34800 00000000ffffffff ffff8806b9505500
[ 2761.408752] Call Trace:
[ 2761.408759]  [<ffffffff81729499>] schedule_preempt_disabled+0x29/0x70
[ 2761.408762]  [<ffffffff8172b305>] __mutex_lock_slowpath+0x135/0x1b0
[ 2761.408765]  [<ffffffff811c903e>] ? lookup_fast+0x14e/0x2c0
[ 2761.408767]  [<ffffffff8172b39f>] mutex_lock+0x1f/0x2f
[ 2761.408770]  [<ffffffff811ca9cd>] do_last+0x2bd/0x1200
[ 2761.408772]  [<ffffffff8131666b>] ? apparmor_file_alloc_security+0x5b/0x180
[ 2761.408776]  [<ffffffff812d8c86>] ? security_file_alloc+0x16/0x20
[ 2761.408779]  [<ffffffff811cde8b>] path_openat+0xbb/0x640
[ 2761.408782]  [<ffffffff8109ac3a>] ? try_to_wake_up+0x1fa/0x2c0
[ 2761.408785]  [<ffffffff811ce4af>] ? getname_flags+0x4f/0x190
[ 2761.408787]  [<ffffffff811cf27a>] do_filp_open+0x3a/0x90
[ 2761.408790]  [<ffffffff811dc0d7>] ? __alloc_fd+0xa7/0x130
[ 2761.408793]  [<ffffffff811bd839>] do_sys_open+0x129/0x280
[ 2761.408795]  [<ffffffff811bd9ae>] SyS_open+0x1e/0x20
[ 2761.408798]  [<ffffffff8173575d>] system_call_fastpath+0x1a/0x1f
root# docker info
Containers: 14
Images: 565
Server Version: 1.9.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 593
 Dirperm1 Supported: false
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.13.0-74-generic
Operating System: Ubuntu 14.04.2 LTS
CPUs: 16
Total Memory: 29.44 GiB
Name: ...
ID: ...
Username: ...
Registry: https://index.docker.io/v1/
WARNING: No swap limit support

ps auxfg:

9013       4195  0.0  0.0 175808 24012 ?        Ssl  Jan08   0:01  \_ /usr/local/bin/python3.4 /usr/local/bin/flake8 .
9013       4224 99.9  0.0      0     0 ?        Zl   Jan08 1042:10  |   \_ [flake8] <defunct>
9013       4230  0.0  0.0      0     0 ?        Z    Jan08   0:00  |   \_ [flake8] <defunct>
root      14058  0.0  0.0 171780 21960 ?        Ssl  03:33   0:00  \_ /usr/local/bin/python3.5 /usr/local/bin/flake8 .
root      14148 99.9  0.0      0     0 ?        Zl   03:33 639:25      \_ [flake8] <defunct>
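
The "blocked for more than 120 seconds" lines are the easiest thing to search for in a long dmesg log. Below is a small Python sketch that extracts them, assuming the message format shown in the dmesg output above; `HUNG_TASK` and `hung_tasks` are hypothetical names.

```python
import re

# Matches e.g. "INFO: task flake8:4231 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

def hung_tasks(dmesg_text):
    """Return (command, pid, seconds) for every hung-task report in the text."""
    return [
        (m["comm"], int(m["pid"]), int(m["secs"]))
        for m in HUNG_TASK.finditer(dmesg_text)
    ]

sample = (
    "[ 2761.400178] INFO: task flake8:4231 blocked for more than 120 seconds.\n"
    "[ 2761.403014]       Not tainted 3.13.0-74-generic #118-Ubuntu\n"
)
print(hung_tasks(sample))
```

A hung-task report alone does not prove this particular bug; it only narrows down which process to inspect further.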

This was fixed in AUFS upstream -- boot2docker has been updated to include the fix (which will go out with the next release), and any non-boot2docker users who are affected should apply the updated AUFS release. :+1:

@tianon do you have any references to the upstream bugs?

http://permalink.gmane.org/gmane.linux.file-systems.aufs.user/5345 is the upstream release announcement -- not sure if there's more discussion than that

http://comments.gmane.org/gmane.linux.file-systems.aufs.user/5337 has more of the background discussion for the issue

thank you!

Nice.

Was the buggy version of AUFS used on Docker Hub?

@tianon "Apply the updated AUFS release" means for non-boot2docker users (everyone running the docker engine _not_ in development on Mac OS X, which b2d is built for, primarily) that they'll have to wait for a Linux kernel update with this AUFS patch... Or ... considering how many installs/people appear to be affected, could simplified/minimal instructions be provided by anyone on how to patch AUFS to 4.1.13+? The guide for 4.1.13+ is definitely non-trivial to read through; patching the linux kernel oneself for this specific fix isn't exactly for the faint of heart.

Just so I understand: if I build this boot2docker.iso, put it in ~/.docker/machine/cache, and create a new VM with docker-machine, that VM will use this new copy of boot2docker?

Technically yes, that would work, but a better option might be to use --virtualbox-boot2docker-url, e.g.:

$ docker-machine create -d virtualbox \
    --virtualbox-boot2docker-url file://$(pwd)/boot2docker.iso \
    newvm

Ah, ok, thanks for clarifying. Well, this seems to be working then.

Sent a request to update AUFS to Canonical: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043

Is there going to be a new release of boot2docker, as 1.9.1 doesn't seem to have the fix in it?

Edit: with @AkihiroSuda's boot2docker image, I was able to continue. Thanks everybody!

Yeah, this fix came after 1.9.1. @tianon said a release is planned. Normally, it goes out at the same time Docker is released. As they tend to follow a 2 week cycle on releases, I expect a release is imminent. In the interim, @AkihiroSuda built a fix that you can use, or you can build your own, which is really straightforward; it just takes time.

Thanks @AkihiroSuda for the image, it works! :)

+1 Debian Wheezy and Ubuntu 15.04

@jakirkham releases are on a 2-month cycle, but we're releasing 1.10-rc1 very soon, and boot2docker will have an rc1 version for that too.

Oh, sorry, must have mixed it up. Thanks for setting me straight, @tiborvass. Did you catch that, @shusso?

@jakirkham I did, thanks :+1:

I was able to create a docker-machine VirtualBox host using this fix:

https://github.com/AkihiroSuda/boot2docker/releases/tag/v1.9.1-fix1

Currently, docker-machine hosts that I create on google compute engine, with the --driver google option, seem to have this issue. There is no option with the google driver to specify a different .iso, so I can't use the fix above on google compute.

Does anyone know of a workaround, or whether Google is aware of the issue, or where I should file a bug report for them?

Is the google docker-machine driver maintained by docker, or by google?

Scanning through the above, a potential workaround looks like it might be the one suggested by @nathanleclaire

$ docker-machine create -d google --engine-storage-driver overlay overlay

There also appears to be a "--google-machine-image" option available for the google driver for docker-machine. The command:

$ gcloud compute images list

Lists the available public images. I notice that a new ubuntu wily has recently been put up.

Just to confirm:

$ docker-machine create -d google --engine-storage-driver overlay overlay

Worked. I will also investigate creating a custom machine image using the fixed boot2docker, and trying to connect that up with docker-machine.

To anyone hitting this on boot2docker, please give the RC over at https://github.com/tianon/boot2docker-legacy/releases/tag/v1.10.0-rc1 a shot. :+1:

@tianon I tested it with the following image: trecloux/docker-java-zombie
And it looks good... but it still hangs with the akihirosuda/test18180 image.

Seriously impressive work @AkihiroSuda You're one persistent bug zapper!!

@trecloux are you using btrfs and getting a hang in sendfile?
If so, it is a known issue: https://github.com/docker/docker/issues/19073

@AkihiroSuda I'm using the v1.10.0-rc1 boot2docker image with aufs :

$ docker info
Containers: 1
Images: 2
Server Version: 1.10.0-rc1
Storage Driver: aufs
 Root Dir: /mnt/sda1/var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 35
 Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.1.15-boot2docker
Operating System: Boot2Docker 1.10.0-rc1 (TCL 6.4.1); master : c4985d5 - Fri Jan 15 19:29:39 UTC 2016
CPUs: 1
Total Memory: 996.2 MiB
Name: b2d10rc1
ID: 34JP:KEQA:O4QJ:U2SE:BO2V:43JG:NL57:ORK7:HHMY:2P4U:2E3V:7B4I
Debug mode (server): true
 File Descriptors: 10
 Goroutines: 22
 System Time: 2016-01-19T08:24:26.145616582Z
 EventsListeners: 0
 Init SHA1:
 Init Path: /usr/local/bin/docker
 Docker Root Dir: /mnt/sda1/var/lib/docker
Username: trecloux
Registry: https://index.docker.io/v1/
Labels:
 provider=virtualbox

Here is the output of your test image :

$ docker run -ti --rm akihirosuda/test18180
[INFO] Checking whether hitting docker#18180.
....................................................................................................
[INFO] OK. not hitting docker#18180.
[INFO] Checking whether sendfile(2) is killable.
[INFO] If the container hangs up here, you are still facing the bug that linux@296291cd tried to fix.
/test.sh: line 22:  1008 Killed                  /sendfile-test

@trecloux It's expected behavior. Nothing is hanging up.

@AkihiroSuda Ok, sorry. And thanks for your effort :-)
So @tianon : 1.10.0-rc1 looks good.

Same problem here. Hangs on setting up ca-certificates, CPU goes nuts...

$ docker version
Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   a34a1d5
 Built:        Fri Nov 20 17:56:04 UTC 2015
 OS/Arch:      darwin/amd64

running MacOSX 10.11.2

@sgoendoer try using @AkihiroSuda's image with docker-machine create --driver virtualbox --virtualbox-boot2docker-url="file:/path_to_the_image" nameofmachine

for non boot2docker users, any idea which kernel version this will be fixed with? 3.13.0-71 seems to work, 3.13.0-74 and 3.13.0-76 seem to be broken...

Same issue here. So there is no easy fix for this?
Is this fixed in the RC version? Trying that one now, although it probably causes other issues.

LATEST QUICK WORKAROUNDS (Update: Jan 21 15:33 UTC)

| Distro | Workaround |
| --- | --- |
| General | Use devicemapper/overlay/btrfs (but it may cause another problem..) |
| Boot2Docker | :white_check_mark: Upgrade to v1.10.0-rc1 |
| Ubuntu 14.04LTS | :arrow_down: Downgrade kernel to 3.13.0-71 or older |
| Ubuntu 15.04 | :arrow_down: Downgrade kernel to 3.19.0-39 or older (:warning: not tested) |
| Ubuntu 15.10 | :arrow_down: Downgrade kernel to 4.2.0-18 or older |
| Debian 7/8 | :arrow_down: Downgrade kernel to version 3.16.7-ckt11 of release 3.16.0 (apt-get install linux-image-3.16.0-4-amd64=3.16.7-ckt11-1+deb8u3) or older |
| Debian 9 | :white_check_mark: (does not support AUFS since kernel 3.18-1~exp1) |
| Gentoo | :white_check_mark: Upgrade to recent ones (:warning: not tested) |
| RHEL/CentOS | :white_check_mark: (does not support AUFS) |
| openSUSE | :white_check_mark: (does not support AUFS) |

Distributors Issue Tickets

| Distro | Status | Issue URL |
| --- | --- | --- |
| Boot2Docker | :white_check_mark: Closed | https://github.com/boot2docker/boot2docker/pull/1113 |
| Ubuntu | :white_medium_square: In Progress | https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043 |
| Debian | Not confirmed yet | https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=812207 |

@AkihiroSuda v1.10.0-rc1 does not fix the zombies for me. Has anyone else got this problem?

root     21996  0.0  0.0      0     0 ?        Ss   08:47   0:00  \_ [bash]
root     23810 99.7  0.0      0     0 ?        Zl   08:50   7:42  |   \_ [phantomjs] <defunct>
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 469
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 28
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f60c9ec3d40}, {0x4438a0, [], SA_RESTORER, 0x7f60c9ec3d40}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
close(3)                                = -1 EBADF (Bad file descriptor)
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD (Child exited) @ 0 (0) ---
wait4(-1, 0x7ffc5ec19e58, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigreturn(0xffffffffffffffff)        = 0
read(0, "", 1)                          = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
exit_group(0)   

Some time after I started strace on the defunct process, the output above appeared, but the zombie still exists afterwards:

root     21996  0.0  0.0      0     0 ?        Ss   08:47   0:00  \_ [bash]
root     23810 99.9  0.0      0     0 ?        Zl   08:50  26:06      \_ [phantomjs] <defunct>

This time it is no longer accessible via strace. I guess it is still considered attached even though it is not.
:~# strace -p 23810

attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
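
The telltale sign in the `ps` listings in this thread is a zombie (state `Z`) that still burns close to 100% CPU. Here is a rough Python sketch that picks such rows out of `ps aux`-style text, assuming the usual eleven-column layout (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND); `spinning_zombies` and the 90% default threshold are assumptions made for illustration.

```python
def spinning_zombies(ps_text, min_cpu=90.0):
    """Return (pid, cpu_percent, command) for zombies using at least min_cpu."""
    hits = []
    for line in ps_text.splitlines():
        fields = line.split(None, 10)   # 10 fixed columns, then the command
        if len(fields) < 11:
            continue
        try:
            pid, cpu = int(fields[1]), float(fields[2])
        except ValueError:
            continue                    # header or malformed row
        stat = fields[7]
        if stat.startswith("Z") and cpu >= min_cpu:
            hits.append((pid, cpu, fields[10].strip()))
    return hits

sample = r"""root     21996  0.0  0.0      0     0 ?        Ss   08:47   0:00  \_ [bash]
root     23810 99.9  0.0      0     0 ?        Zl   08:50  26:06      \_ [phantomjs] <defunct>"""

print(spinning_zombies(sample))
```

An ordinary zombie at 0% CPU is not reported, since that is usually just an unreaped child rather than this bug.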

@wzrdtales
Could you please get LWP stacks as in https://github.com/docker/docker/issues/18180#issuecomment-166186061 ?
Also, what's your cmdline and version of phantomjs?

Sure, phantomjs version: 1.9

cmdline

$builddir/node_modules/phantomjs/lib/phantom/bin/phantomjs $builddir/node_modules/testem/assets/phantom.js http://localhost:7357/3891
:~# cat /proc/21996/stack # bash
[<ffffffff8106fee9>] do_wait+0x1e9/0x260
[<ffffffff81071042>] SyS_wait4+0xa2/0x110
[<ffffffff8106ecd0>] child_wait_callback+0x0/0x70
[<ffffffff810f945a>] zap_pid_ns_processes+0xfa/0x190
[<ffffffff81070b26>] do_exit+0x8e6/0xa80
[<ffffffff81070d46>] do_group_exit+0x46/0xb0
[<ffffffff81070dc7>] SyS_exit_group+0x17/0x20
[<ffffffff8154e50d>] system_call_fast_compare_end+0x10/0x15
[<ffffffffffffffff>] 0xffffffffffffffff

:~# cat /proc/23810/stack #phantomjs
[<ffffffff81070935>] do_exit+0x6f5/0xa80
[<ffffffff81070d46>] do_group_exit+0x46/0xb0
[<ffffffff8107ffd3>] get_signal_to_deliver+0x233/0x610
[<ffffffff81014507>] do_signal+0x67/0xad0
[<ffffffff811bcf38>] new_sync_read+0x78/0xb0
[<ffffffff8101e045>] read_tsc+0x5/0x20
[<ffffffff810d2442>] ktime_get_ts+0x42/0xd0
[<ffffffff811d091e>] poll_select_copy_remaining+0xfe/0x150
[<ffffffff8101501b>] do_notify_resume+0xab/0xc0
[<ffffffff8154e7ca>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff

And also the task stack:

:~# cat /proc/23810/task/23839/stack 
[<ffffffff8114ef6a>] __generic_file_write_iter+0x14a/0x360
[<ffffffff8114f1ca>] generic_file_write_iter+0x4a/0xd0
[<ffffffff811bd0ec>] new_sync_write+0x6c/0xb0
[<ffffffff811bd080>] new_sync_write+0x0/0xb0
[<ffffffff811bd0fb>] new_sync_write+0x7b/0xb0
[<ffffffffa050c377>] xino_fwrite.part.28+0x67/0xb0 [aufs]
[<ffffffffa050c4b5>] xino_fwrite+0x75/0x90 [aufs]
[<ffffffff811fa97a>] fsnotify_clear_marks_by_inode+0x2a/0x110
[<ffffffff811d84b8>] iput+0x48/0x1b0
[<ffffffffa052b780>] au_xigen_inc+0x50/0xa0 [aufs]
[<ffffffffa050d33d>] au_xino_delete_inode+0x1ad/0x220 [aufs]
[<ffffffff811e5143>] __inode_wait_for_writeback+0x63/0xc0
[<ffffffffa051f485>] au_iinfo_fin+0xc5/0x1d0 [aufs]
[<ffffffffa0507cae>] aufs_destroy_inode+0xe/0x30 [aufs]
[<ffffffff811cab10>] do_unlinkat+0x170/0x2c0
[<ffffffff8108d4f1>] task_work_run+0xa1/0xc0
[<ffffffff81015025>] do_notify_resume+0xb5/0xc0
[<ffffffff8154e50d>] system_call_fast_compare_end+0x10/0x15
[<ffffffffffffffff>] 0xffffffffffffffff
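
The frames that tie a stack like the one above to this issue are the `xino_fwrite` ones inside the `aufs` module. A minimal Python sketch for spotting them in `/proc/<pid>/task/<tid>/stack` or dmesg text follows; the regular expression is written against the frame format seen in this thread, and `frames` and `looks_like_18180` are hypothetical names.

```python
import re

# One frame looks like: [<ffffffffa050c4b5>] xino_fwrite+0x75/0x90 [aufs]
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?"
)

def frames(stack_text):
    """Return (symbol, module_or_None) for each frame found in the text."""
    return [(m["sym"], m["mod"]) for m in FRAME.finditer(stack_text)]

def looks_like_18180(stack_text):
    """True if any frame is an aufs xino_fwrite variant."""
    return any(
        mod == "aufs" and "xino_fwrite" in sym
        for sym, mod in frames(stack_text)
    )

sample = """[<ffffffff811bd0fb>] new_sync_write+0x7b/0xb0
[<ffffffffa050c377>] xino_fwrite.part.28+0x67/0xb0 [aufs]
[<ffffffffa050c4b5>] xino_fwrite+0x75/0x90 [aufs]
[<ffffffffffffffff>] 0xffffffffffffffff"""

print(looks_like_18180(sample))
```

A match is only a hint: these frames also appear briefly during normal AUFS operation, so the stack needs to stay there across repeated reads before drawing conclusions.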

Also to add System information:

uname -a
Linux mg_build_server_12 3.16.0-0.bpo.4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u2~bpo70+1 (2016-01-03) x86_64 GNU/Linux

:~# docker info
Containers: 34
 Running: 9
 Paused: 0
 Stopped: 25
Images: 1058
Server Version: 1.10.0-rc1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 1197
 Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Plugins: 
 Volume: local
 Network: null host bridge
Kernel Version: 3.16.0-0.bpo.4-amd64
Operating System: Debian GNU/Linux 7 (wheezy)
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 23.58 GiB

@wzrdtales Are you using Docker (not Boot2Docker) v1.10.0-rc1 on Debian?
It doesn't work because the issue is a bug in the kernel rather than in Docker.

I'm looking into Debian kernels and I'll update the list https://github.com/docker/docker/issues/18180#issuecomment-173436661.

@AkihiroSuda yes, it is directly on Debian. I overlooked the Debian entry in your list, thanks for pointing it out :)

@wzrdtales I updated the list, please use ckt11 kernel.

@AkihiroSuda: FWIW, I believe the recommendation to downgrade to v3.16.7-ckt11 puts you at risk for CVE-2016-0728, which has a known public root escalation. I just pinged https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=812207, but FYI before you downgrade.

@zmerlynn Thank you for pointing that out!

LATEST QUICK WORKAROUNDS (Update: Jan 26 1:49 UTC)

| Distro | Workaround |
| --- | --- |
| General | Use devicemapper/overlay/btrfs (but it may cause another problem..). If you can upgrade AUFS and build the kernel manually, you can also use AUFS v20160111 or later. |
| Boot2Docker | :white_check_mark: Upgrade to v1.10.0-rc1 |
| Ubuntu 14.04LTS | :arrow_down: Downgrade kernel to 3.13.0-71 or older |
| Ubuntu 15.04 | :arrow_down: Downgrade kernel to 3.19.0-39 or older (:warning: not tested) |
| Ubuntu 15.10 | :arrow_down: Downgrade kernel to 4.2.0-18 or older |
| Debian 7/8 | :arrow_down: Downgrade kernel to version 3.16.7-ckt11 of release 3.16.0 (apt-get install linux-image-3.16.0-4-amd64=3.16.7-ckt11-1+deb8u3) or older |
| Debian 9 | :white_check_mark: (does not support AUFS since kernel 3.18-1~exp1) |
| Gentoo | :white_check_mark: Upgrade to recent ones (:warning: not tested) |
| RHEL/CentOS | :white_check_mark: (does not support AUFS) |
| openSUSE | :white_check_mark: (does not support AUFS) |

:warning: Downgrading kernel can be a security risk (e.g., CVE-2016-0728)
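
The Ubuntu rows of the workaround table can be checked against `uname -r` output with a few lines of Python. This is a sketch under the assumption that the ABI number after the dash is comparable only within the same base version; `LAST_GOOD` and `predates_backport` are hypothetical names, and the thresholds are copied from the table (the 15.04 one is marked there as not tested).

```python
import re

# Last kernel ABI listed as unaffected, keyed by base version
LAST_GOOD = {
    "3.13.0": 71,   # Ubuntu 14.04 LTS
    "3.19.0": 39,   # Ubuntu 15.04 (not tested, per the table)
    "4.2.0": 18,    # Ubuntu 15.10
}

def predates_backport(release):
    """release: `uname -r` output such as '3.13.0-71-generic'.

    Returns True or False for kernel series listed in the table,
    and None for anything else (no claim is made).
    """
    m = re.match(r"(\d+\.\d+\.\d+)-(\d+)", release)
    if not m or m.group(1) not in LAST_GOOD:
        return None
    return int(m.group(2)) <= LAST_GOOD[m.group(1)]

for rel in ("3.13.0-71-generic", "3.13.0-74-generic", "4.2.0-25-generic"):
    print(rel, predates_backport(rel))
```

Returning `None` for unknown series is deliberate: Debian and boot2docker kernels use different version schemes, and guessing there would be misleading.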

Distributors Issue Tickets

| Distro | Status | Issue URL |
| --- | --- | --- |
| Boot2Docker | :white_check_mark: Closed | https://github.com/boot2docker/boot2docker/pull/1113 |
| Ubuntu | :white_medium_square: In Progress | https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043 |
| Debian | :white_medium_square: In Progress | https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=812207 |

We also hit this problem: mongod is stuck in R state, running at 100% CPU.

Here's the real trick to get the correct stack trace, which is what led me here:

echo "l" > /proc/sysrq-trigger

From there, you can see that CPU 2 is stuck in an infinite loop in AUFS:

[38841.947453] CPU: 2 PID: 25084 Comm: mongod Not tainted 4.2.0-25-generic #30~14.04.1-Ubuntu
[38841.947454] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[38841.947455] task: ffff88037383cb00 ti: ffff880097afc000 task.ti: ffff880097afc000
[38841.947456] RIP: 0010:[<ffffffff813b6fe0>]  [<ffffffff813b6fe0>] iov_iter_init+0x0/0x40
[38841.947457] RSP: 0018:ffff880097aff920  EFLAGS: 00000246
[38841.947458] RAX: 0000000000002cd0 RBX: ffff88037b289700 RCX: 0000000000000001
[38841.947458] RDX: ffff880097aff928 RSI: 0000000000000001 RDI: ffff880097aff960
[38841.947459] RBP: ffff880097aff998 R08: 0000000000000004 R09: 0000000000000000
[38841.947460] R10: 0000000000000006 R11: 0000000000000005 R12: ffff880097affa70
[38841.947461] R13: ffff880097affa6c R14: ffff88037b289700 R15: ffffffff811ea830
[38841.947462] FS:  00007f7f2acf2b80(0000) GS:ffff88041fd00000(0000) knlGS:0000000000000000
[38841.947463] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[38841.947463] CR2: 00007f47dbdb0000 CR3: 00000000a3995000 CR4: 00000000000006e0
[38841.947464] Stack:
[38841.947465]  ffffffff811ea8a9 ffff880097affa6c 0000000000000004 ffff88037b289700
[38841.947466]  0000000000002cd0 0000000000000000 0000000000000000 0000000000000000
[38841.947467]  0000000000000003 0000000000000000 0000000000000004 ffff880097aff928
[38841.947467] Call Trace:
[38841.947468]  [<ffffffff811ea8a9>] ? new_sync_write+0x79/0xb0
[38841.947469]  [<ffffffffc03fbbe3>] do_xino_fwrite+0x53/0x90 [aufs]
[38841.947470]  [<ffffffffc03fc05e>] xino_fwrite.part.27+0xe/0x10 [aufs]
[38841.947471]  [<ffffffffc03fc15a>] xino_fwrite+0x6a/0x80 [aufs]
[38841.947471]  [<ffffffffc041a634>] au_xigen_inc+0x54/0xa0 [aufs]
[38841.947472]  [<ffffffffc03fceab>] au_xino_delete_inode+0x17b/0x200 [aufs]
[38841.947473]  [<ffffffffc040e167>] au_iinfo_fin+0xc7/0x1c0 [aufs]
[38841.947474]  [<ffffffffc03f7c26>] aufs_destroy_inode+0x16/0x30 [aufs]
[38841.947475]  [<ffffffff8120529c>] destroy_inode+0x3c/0x60
[38841.947476]  [<ffffffff812053db>] evict+0x11b/0x180
[38841.947476]  [<ffffffff81205cb5>] iput+0x175/0x1e0
[38841.947477]  [<ffffffff81200c4d>] __dentry_kill+0x19d/0x1f0
[38841.947478]  [<ffffffff81200e39>] dput+0x199/0x200
[38841.947479]  [<ffffffff811f449a>] path_put+0x1a/0x30
[38841.947480]  [<ffffffff8174dfbd>] unix_release_sock+0x17d/0x2a0
[38841.947480]  [<ffffffff8174e101>] unix_release+0x21/0x40
[38841.947481]  [<ffffffff8169370f>] sock_release+0x1f/0x80
[38841.947482]  [<ffffffff81693782>] sock_close+0x12/0x20
[38841.947483]  [<ffffffff811ecb14>] __fput+0xe4/0x210
[38841.947483]  [<ffffffff811ecc8e>] ____fput+0xe/0x10
[38841.947484]  [<ffffffff8109360b>] task_work_run+0x9b/0xb0
[38841.947485]  [<ffffffff81085a45>] get_signal+0x565/0x600
[38841.947486]  [<ffffffff81014438>] do_signal+0x28/0x9a0
[38841.947487]  [<ffffffff8105d00e>] ? kvm_clock_get_cycles+0x1e/0x20
[38841.947487]  [<ffffffff810e5ede>] ? ktime_get_ts64+0x4e/0xf0
[38841.947488]  [<ffffffff811fe5f9>] ? poll_select_copy_remaining+0xd9/0x120
[38841.947489]  [<ffffffff811ff3bd>] ? SyS_select+0xbd/0xf0
[38841.947490]  [<ffffffff81014e15>] do_notify_resume+0x65/0x80
[38841.947491]  [<ffffffff817bacc4>] int_signal+0x12/0x17
[38841.947492] Code: 6c 83 ea 04 48 83 c7 04 e9 58 ff ff ff b9 6c 6c 00 00 48 83 c7 02 83 ea 02 66 89 4f fe e9 39 ff ff ff 66 0f 1f 84 00 00 00 00 00 <55> 65 48 8b 04 25 44 3b 01 00 48 83 b8 18 c0 ff ff ff 48 89 e5

searching by do_xino_fwrite leads me right here!

I seem to be encountering this issue as well on Debian Stretch where it hangs on setting up certs.

Here's the error message: https://gist.github.com/clball/738feb46094802a1bcf7
Here's version info: https://gist.github.com/clball/494fe8598dd0cdfd6d10
Here's the dockerfile: https://gist.github.com/8778f8db143478d6c8ab

So what is the solution for OSX here, is there a docker update already?

Not yet, but there is a release candidate. ( https://github.com/tianon/boot2docker-legacy/releases/tag/v1.10.0-rc2 )

The solution for me was to change the storage backend.

Add a line to /etc/default/docker
¡¡BE CAREFUL you'll lose your container data!!

DOCKER_OPTS="--storage-driver=devicemapper"

I recommend stopping the docker service, erasing the docker folder under /var/lib, then adding the line and restarting the docker service.
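Before and after switching the backend, it may help to confirm which storage driver the daemon actually reports. The following is a minimal sketch, assuming `docker info` prints a `Storage Driver:` line as in the outputs quoted in this thread; the function name `storage_driver_from_info` is hypothetical and not part of Docker.

```shell
# Hypothetical helper: print the storage driver named in `docker info`
# output read from stdin (expects a "Storage Driver: <name>" line).
storage_driver_from_info() {
  sed -n 's/^Storage Driver: *//p' | head -n 1
}

# Usage (assumes the docker CLI is installed and the daemon is reachable):
#   docker info 2>/dev/null | storage_driver_from_info
```

If this still prints `aufs` after the restart, the new DOCKER_OPTS line was probably not picked up by the service.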

@referup-tarantegui just as a heads up, the devicemapper driver is considered terribly poor for performance unless mounted directly to real physical disks. See
https://jpetazzo.github.io/assets/2015-03-03-not-so-deep-dive-into-docker-storage-drivers.html#43 https://jpetazzo.github.io/assets/2015-03-03-not-so-deep-dive-into-docker-storage-drivers.html#44
and
https://docs.docker.com/engine/userguide/storagedriver/device-mapper-driver/ "Device Mapper and Docker performance"

There is a version B of release candidate 2:

https://github.com/boot2docker/boot2docker/releases/tag/v1.10.0-rc2-b

Updated the table about Ubuntu.

LATEST QUICK WORKAROUNDS (Update: Feb 3 6:30 UTC)

| Distro | Workaround |
| --- | --- |
| General | Use devicemapper/overlay/btrfs (but it may cause another problem..). If you can upgrade AUFS and build the kernel manually, you can also use AUFS v20160111 or later. |
| Boot2Docker | :white_check_mark: Upgrade to v1.10.0-rc3 |
| Ubuntu 14.04LTS | :white_check_mark: Upgrade kernel to 3.13.0-77.121hf1533043v20160201b1 (PPA) |
| Ubuntu 15.04 | :white_check_mark: Upgrade kernel to 3.19.0-49.55hf1533043v20160201b1 (PPA) |
| Ubuntu 15.10 | :white_check_mark: Upgrade kernel to 4.2.0-27.32hf1533043v20160201b1 (PPA) |
| Debian 7/8 | :arrow_down: Downgrade kernel to version 3.16.7-ckt11 of release 3.16.0 (apt-get install linux-image-3.16.0-4-amd64=3.16.7-ckt11-1+deb8u3) or older |
| Debian 9 | :white_check_mark: (does not support AUFS since kernel 3.18-1~exp1) |
| Gentoo | :white_check_mark: Upgrade to recent ones (:warning: not tested) |
| RHEL/CentOS | :white_check_mark: (does not support AUFS) |
| openSUSE | :white_check_mark: (does not support AUFS) |

Distributors Issue Tickets

| Distro | Status | Issue URL |
| --- | --- | --- |
| Boot2Docker | :white_check_mark: Closed | https://github.com/boot2docker/boot2docker/pull/1113 |
| Ubuntu | :white_medium_square: In Progress (ETA: Feb 20) | https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043 |
| Debian | :white_medium_square: In Progress | https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=812207 |

@AkihiroSuda, I think you meant v1.10.0-rc2 for boot2docker or perhaps the link was supposed to go here ( https://github.com/boot2docker/boot2docker/releases/tag/v1.10.0-rc3 ).

@jakirkham Thank you, fixed the link.

rc3 is on the official repo instead of my fork:
https://github.com/boot2docker/boot2docker/releases/tag/v1.10.0-rc3

We investigated the tech debt that forced us to use my fork and found that
it had been fixed since docker-machine 0.5, so we're moving onwards and
upwards. :smile:

btw, for those looking to switch to the chiluk-patched kernels from the Ubuntu PPA listed above: for example, on Ubuntu 14.04 with linux-image-3.13.0-77-generic, your steps would be:

$ sudo apt-get update
$ sudo apt-get install software-properties-common -y
$ sudo add-apt-repository ppa:chiluk/1533043
$ sudo apt-get update
$ sudo apt-get install linux-image-3.13.0-77-generic \
                       linux-image-extra-3.13.0-77-generic -y

Then you'll need to update your /etc/default/grub configuration, then run sudo update-grub, then reboot into the patched new kernel build. If you haven't done this before, here's a guide on how to set a different default kernel in grub.

I can confirm that this issue does not exist on Docker 1.10.0, which fixed my situation on OS X 10.11 as well. Otherwise I was going to downgrade to 1.9.0.

I'm still getting the java hung container/process problem on docker 1.10:

root     30480  0.1  0.0      0     0 ?        Z    16:15   0:00 [update-hosts] <defunct>
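For anyone who wants to check for stuck processes like the one above, here is a minimal sketch that filters `ps aux`-style output for zombie (state `Z`) entries. It assumes the standard `ps aux` column layout, where the process state is the eighth field; the function name `zombie_pids` is made up for illustration.

```shell
# Hypothetical helper: print the PIDs of zombie processes found in
# `ps aux`-style input on stdin (STAT is column 8, PID is column 2).
zombie_pids() {
  awk '$8 ~ /^Z/ { print $2 }'
}

# Usage on the Docker host:
#   ps aux | zombie_pids
```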

@AkihiroSuda I'm trying your Quick Workarounds (thanks!) but I'm not able to install the older kernel on my Debian 8 (jessie) server, I get:

E: Version '3.16.7-ckt11-1+deb8u3' for 'linux-image-3.16.0-4-amd64' was not found

When I try @mikeatlas's suggestions (btw, I had to sudo apt-get install software-properties-common to get sudo add-apt-repository ppa:chiluk/1533043 to work), I get an update failure, which I guess is why the install doesn't work:

$ sudo add-apt-repository ppa:chiluk/1533043
You are about to add the following PPA to your system:
 This ppa contains the proposed fix for 1533043, and I would appreciate testing and results reported back to  LP#1533043.

Thank you,
 More info: https://launchpad.net/~chiluk/+archive/ubuntu/1533043
Press [ENTER] to continue or ctrl-c to cancel adding it

gpg: keyring `/tmp/tmp_j6e2_s5/secring.gpg' created
gpg: keyring `/tmp/tmp_j6e2_s5/pubring.gpg' created
gpg: requesting key E2B6D4A9 from hkp server keyserver.ubuntu.com
gpg: /tmp/tmp_j6e2_s5/trustdb.gpg: trustdb created
gpg: key E2B6D4A9: public key "Launchpad PPA for Dave Chiluk" imported
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)
OK
$ sudo apt-get update
Ign http://ftp.us.debian.org jessie InRelease
Hit http://security.debian.org jessie/updates InRelease
...
Get:15 https://apt.dockerproject.org debian-jessie/main Translation-en [454 B]
Ign https://apt.dockerproject.org debian-jessie/main Translation-en
Err http://ppa.launchpad.net jessie/main amd64 Packages
  404  Not Found
Ign http://ppa.launchpad.net jessie/main Translation-en_US
Ign http://ppa.launchpad.net jessie/main Translation-en
Fetched 8,877 B in 3s (2,935 B/s)
W: Failed to fetch http://ppa.launchpad.net/chiluk/1533043/ubuntu/dists/jessie/main/binary-amd64/Packages  404  Not Found

E: Some index files failed to download. They have been ignored, or old ones used instead.

$ sudo apt-get install linux-image-3.13.0-77-generic \
>                        linux-image-extra-3.13.0-77-generic -y
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package linux-image-3.13.0-77-generic
E: Couldn't find any package by regex 'linux-image-3.13.0-77-generic'
E: Unable to locate package linux-image-extra-3.13.0-77-generic
E: Couldn't find any package by regex 'linux-image-extra-3.13.0-77-generic'

My docker info:

$ docker info
Containers: 98
 Running: 9
 Paused: 0
 Stopped: 89
Images: 1415
Server Version: 1.10.0
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 1371
 Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: null host bridge
Kernel Version: 3.16.0-4-amd64
Operating System: Debian GNU/Linux 8 (jessie)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.6 GiB
Name: r62
ID: VUJF:KPXB:UXL6:TP3G:75CE:WQND:PJGJ:GG45:MCMI:JTV5:Q3IR:6FHC
WARNING: No memory limit support
WARNING: No swap limit support
WARNING: No oom kill disable support
WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
Labels:
 provider=generic

@jamshid thanks for the tip about needing software-properties-common, I updated my post above.

@jamshid: After you add the PPA and do apt-get update, check and see what kernels are available to your machine... It looks like there's a newer build (3.13.0-78) but I don't see it available after running an update myself here. However, here's how you _can_ figure out what kernels are available to install:

$ apt-cache search linux-image-3.13.0-7
[... snip older builds ...]
linux-image-3.13.0-77-generic - Linux kernel image for version 3.13.0 on 64 bit x86 SMP

If you don't see something along the lines of linux-image-3.13.0-77-generic or greater, something else must not be right.

Oh, @jamshid you're running Debian 8? Note above: Downgrade kernel to version 3.16.7-ckt11 of release 3.16.0 (apt-get install linux-image-3.16.0-4-amd64=3.16.7-ckt11-1+deb8u3) or older

apt-get install linux-image-3.16.0-4-amd64=3.16.7-ckt11-1+deb8u3 on Debian 8 gives

Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Version '3.16.7-ckt11-1+deb8u3' for 'linux-image-3.16.0-4-amd64' was not found

An hour of active googling to find the ckt11 package didn't help.
Please, any suggestions on how to downgrade a recent Debian 8 kernel?

apt-cache policy linux-image-3.16.0-4-amd64
linux-image-3.16.0-4-amd64:
  Installed: 3.16.7-ckt20-1+deb8u3
  Candidate: 3.16.7-ckt20-1+deb8u3
  Version table:
 *** 3.16.7-ckt20-1+deb8u3 0
        500 http://security.debian.org/ jessie/updates/main amd64 Packages
        100 /var/lib/dpkg/status
     3.16.7-ckt20-1+deb8u2 0
        500 http://ftp.debian.org/debian/ jessie/main amd64 Packages
        500 http://httpredir.debian.org/debian/ jessie/main amd64 Packages

@davojan You can find the packages installed previously in /var/cache/apt/archives. You should be able to downgrade with dpkg -i <old_package>.deb.
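To see whether an older kernel package is still in the apt cache before attempting the downgrade, something like the sketch below may help. It assumes the kernel packages follow the `linux-image-*.deb` naming seen in this thread; the function name `list_cached_kernel_debs` is hypothetical.

```shell
# Hypothetical helper: list cached kernel image packages in an apt archive
# directory (defaults to /var/cache/apt/archives). Prints nothing if none exist.
list_cached_kernel_debs() {
  dir="${1:-/var/cache/apt/archives}"
  for f in "$dir"/linux-image-*.deb; do
    # The glob stays unexpanded when nothing matches, so check existence.
    [ -e "$f" ] && printf '%s\n' "$f"
  done
  return 0
}

# Usage:
#   list_cached_kernel_debs
#   sudo dpkg -i <one of the listed files>
```

If the list is empty, the cache has likely been cleaned (for example by apt-get clean) and the package has to come from elsewhere.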

Confirmed that installing the new kernel from the PPA fixed the issue for me (Ubuntu 14.04.3 / Kernel 3.13.0-78-generic / Docker 1.9.1 )
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043

Has anyone gotten the downgrade instructions to work with Debian (not Ubuntu)? I'm wondering if the reason my apt-get update fails with below error (after adding the ppa repo):

W: Failed to fetch http://ppa.launchpad.net/chiluk/1533043/ubuntu/dists/jessie/main/binary-amd64/Packages  404  Not Found

is that only ubuntu packages are available on https://launchpad.net/~chiluk/+archive/ubuntu/1533043? Hmm I'm confused, I thought Ubuntu was based on Debian.

@jamshid Ubuntu PPAs do not support Debian.

If 3.16.7-ckt11-1+deb8u3 is unfortunately no longer available, you can patch the latest kernel: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=812207#47

(I can upload the deb package, if someone needs it)

If someone needs the kernel, I have uploaded it here. It is a bit newer than deb8u3, but it does not seem to have the bug; at least I did not run into it while running this for quite a while. Patching the latest kernel is probably the better solution, though. If you need it:
https://wizardtales.com/linux-image-3.16.0-0.bpo.4-amd64_3.16.7-ckt11-1+deb8u6~bpo70+1_amd64.deb

@wzrdtales :+1:

For those like me still struggling to get this sorted, I'd like to point you to TINI as a nice workaround:
https://github.com/krallin/tini

With a few lines in the Dockerfile I got a decent init process capable of reaping zombies.
This makes it possible to avoid the transition to devicemapper.

Cheers,
Francesco

So, I use tini. However, that didn't help me here as the problem I encountered was while the image was being built.

Also, when running a container, I use tini, but this still affected me.

@fflatorre
Thank you for the information, but the zombie issue that tini can solve seems different from this issue.
https://github.com/docker/docker/issues/18180#issuecomment-167042078

Actually, even with tini, we can get a zombie:

FROM java:7
ENV TINI_VERSION v0.9.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
RUN chmod +x /tini
ENTRYPOINT ["/tini", "--"]
CMD ["taskset", "0x1", "java"]

$ docker build -t foobar .
$ docker run -it --rm foobar
Usage: java [-options] class [args...]
           (to execute a class)
...
See http://www.oracle.com/technetwork/java/javase/documentation/index.html for more details.
(hangs up here and becomes a zombie)

@AkihiroSuda @jakirkham I forgot to mention that we are not experiencing this issue when building the image. We build a very basic image, and then the provisioning logic is delegated to a bunch of ansible scripts. During provisioning, one of the processes (kafka) used to hang. TINI so far seems to have mitigated that issue.
I acknowledge it might not be a solution for you; indeed, I'd suggest downgrading it from workaround to placebo :)
Hope we can get it sorted soon.

I had the same issue running Docker 1.9.1 on OSX 10.11.3:

$ docker -v
Docker version 1.9.1, build a34a1d5

Upgrading to the latest Docker Toolbox release fixed it:

$ docker -v
Docker version 1.10.1, build 9e83765

For information, I listed some issues and workarounds related to the AUFS/Overlay/BtrFS/ZFS/devicemapper storage drivers: https://github.com/AkihiroSuda/docker-issues/

Hope this can help those who are interested in #18180 and others..

@AkihiroSuda I tried to follow the link https://launchpad.net/%7Echiluk/+archive/ubuntu/1533043/+packages but I am not allowed to view the page.

See also: https://github.com/docker/toolbox/issues/318#issuecomment-184143546

@schmunk42
I also could not access the page.
I think chiluk is remaking the packages. You can ask him at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043

For people who are having issues with this, here's how to resolve the issue on Ubuntu 14.04 using the -proposed kernel. This will of course not be relevant once the kernel graduates into the main branch.

Before we begin, we can confirm that we're running an affected kernel (i.e. older than 3.19.0-50 on Ubuntu 14.04) by running:

$ uname -r
3.19.0-49-generic

Since we know this, we first need to enable Proposed packages by running:

$ echo "deb http://archive.ubuntu.com/ubuntu/ trusty-proposed restricted main multiverse universe" | sudo tee -a /etc/apt/sources.list
$ echo -e "Package: *\nPin: release a=trusty-proposed\nPin-Priority: 400" | sudo tee -a  /etc/apt/preferences.d/proposed-updates

With that done, let's install the updated kernel:

$ sudo apt-get update
$ sudo apt-get install linux-image-3.19.0-50-generic/trusty-proposed linux-image-extra-3.19.0-50-generic/trusty-proposed

And then let's reboot:

$ sudo shutdown -r now

After the reboot, we can confirm that we are now running the latest kernel:

$ uname -r
3.19.0-50-generic
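The before/after check can also be scripted. Below is a minimal sketch that compares a kernel release string against a minimum using GNU `sort -V` (an assumption: it requires GNU coreutils). The function name `kernel_at_least` is hypothetical, and the threshold is best given with the same flavor suffix (e.g. `-generic`) as the value from `uname -r`.

```shell
# Hypothetical helper: succeed when kernel release $1 is at least $2,
# using GNU sort's version ordering. Both arguments should share the
# same flavor suffix (e.g. "-generic").
kernel_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage on Ubuntu 14.04 with the 3.19 kernel series:
#   if kernel_at_least "$(uname -r)" 3.19.0-50-generic; then
#     echo "kernel includes the fix"
#   else
#     echo "kernel is still affected"
#   fi
```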

Thanks @vpetersson. I am trying to find out what will happen when this version of the kernel is released: will it just overwrite the proposed install, or do you have to do something to go back to normal?

@IainColledge Yes, I would imagine that would be the case, but I'm not entirely sure.

Updated the table about Ubuntu and Debian.

LATEST QUICK WORKAROUNDS

| Distro | Workaround |
| --- | --- |
| General | Use devicemapper/overlay/btrfs (but it may cause another problem..). If you can upgrade AUFS and build the kernel manually, you can also use AUFS v20160111 or later. |
| Boot2Docker | :white_check_mark: Upgrade to v1.10.0 or later |
| Ubuntu 14.04LTS | :white_check_mark: Upgrade kernel to 3.13.0-79.123 or later |
| Ubuntu 15.04 | :white_check_mark: Upgrade kernel to 3.19.0-51.57 or later |
| Ubuntu 15.10 | :white_check_mark: Upgrade kernel to 4.2.0-30.35 or later |
| Debian 7/8 | :arrow_down: Downgrade kernel to version 3.16.7-ckt11 of release 3.16.0 or older (dpkg archive by @wzrdtales) |
| Debian 9 | :white_check_mark: (does not support AUFS since kernel 3.18-1~exp1) |
| Gentoo | :white_check_mark: Upgrade to recent ones (:warning: not tested) |
| RHEL/CentOS | :white_check_mark: (does not support AUFS) |
| openSUSE | :white_check_mark: (does not support AUFS) |

Distributors Issue Tickets

| Distro | Status | Issue URL |
| --- | --- | --- |
| Boot2Docker | :white_check_mark: Closed | https://github.com/boot2docker/boot2docker/pull/1113 |
| Ubuntu | :white_check_mark: Closed | https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043 |
| Debian | :white_medium_square: In Progress | https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=812207 |

One more thing: I listed some known bugs about storage drivers: https://github.com/AkihiroSuda/docker-issues

Just in case someone wants the latest "linux_3.16.7-ckt20-1+deb8u3" Debian kernel with the patches mentioned earlier: I've built it manually, and it's at https://fxposter.org/linux-image-3.16.0-4-amd64_3.16.7-ckt20-1+deb8u3a~test_amd64.deb

amazing! I've been having this problem for a few weeks now, and I guess the fix for Ubuntu was just released yesterday :P

Confirming that the latest 14.04LTS kernel update to 3.19.0-51 puts an end to my java zombies. Thanks!

Debian has now fixed this issue.

LATEST QUICK WORKAROUNDS

| Distro | Workaround |
| --- | --- |
| General | Use devicemapper/overlay/btrfs (but it may cause another problem..). If you can upgrade AUFS and build the kernel manually, you can also use AUFS v20160111 or later. |
| Boot2Docker | :white_check_mark: Upgrade to v1.10.0 or later |
| Ubuntu 14.04LTS | :white_check_mark: Upgrade kernel to 3.13.0-79.123 or later |
| Ubuntu 15.04 | :white_check_mark: Upgrade kernel to 3.19.0-51.57 or later |
| Ubuntu 15.10 | :white_check_mark: Upgrade kernel to 4.2.0-30.35 or later |
| Debian 7 | :white_check_mark: Upgrade kernel to 3.2.73-2+deb7u3 (of linux-image-3.2.0-4-amd64 package) or later |
| Debian 8 | :white_check_mark: Upgrade kernel to 3.16.7-ckt20-1+deb8u4 (of linux-image-3.16.0-4-amd64 package) or later |
| Debian 9 | :white_check_mark: (does not support AUFS since kernel 3.18-1~exp1) |
| Gentoo | :white_check_mark: Upgrade to recent ones (:warning: not tested) |
| RHEL/CentOS | :white_check_mark: (does not support AUFS) |
| openSUSE | :white_check_mark: (does not support AUFS) |

Distributors Issue Tickets

| Distro | Status | Issue URL |
| --- | --- | --- |
| Boot2Docker | :white_check_mark: Closed | https://github.com/boot2docker/boot2docker/pull/1113 |
| Ubuntu | :white_check_mark: Closed | https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1533043 |
| Debian | :white_check_mark: Closed | https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=812207 |

upgrading kernel of 14.04LTS worked for me :+1:

I'm on OSX on Boot2Docker version 1.10.2, build master : 611be10, Docker version 1.10.2, build c3959b1 and first got this from docker-compose:

Recreating docker_preview_1
ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).

Then tried docker kill 38e1e2590dfa but process hangs forever. docker.log:

time="2016-03-09T14:49:13.053004077Z" level=debug msg="Calling POST /v1.21/containers/38e1e2590dfa5d77482b8fbf6b14f01e8d5278622b8e5d7262cd2cdeb777690b/stop"
time="2016-03-09T14:49:13.053058084Z" level=debug msg="POST /v1.21/containers/38e1e2590dfa5d77482b8fbf6b14f01e8d5278622b8e5d7262cd2cdeb777690b/stop?t=10"
time="2016-03-09T14:49:13.053097711Z" level=debug msg="Sending 15 to 38e1e2590dfa5d77482b8fbf6b14f01e8d5278622b8e5d7262cd2cdeb777690b"
time="2016-03-09T14:49:23.053530062Z" level=info msg="Container 38e1e2590dfa5d77482b8fbf6b14f01e8d5278622b8e5d7262cd2cdeb777690b failed to exit within 10 seconds of SIGTERM - using the force"
time="2016-03-09T14:49:23.053720529Z" level=debug msg="Sending 9 to 38e1e2590dfa5d77482b8fbf6b14f01e8d5278622b8e5d7262cd2cdeb777690b"
time="2016-03-09T14:49:33.054082100Z" level=info msg="Container 38e1e2590dfa failed to exit within 10 seconds of kill - trying direct SIGKILL"
time="2016-03-09T14:49:34.254353402Z" level=debug msg="Calling GET /v1.22/containers/json"
time="2016-03-09T14:49:34.254413283Z" level=debug msg="GET /v1.22/containers/json"
time="2016-03-09T14:49:54.293708866Z" level=debug msg="Calling POST /v1.22/containers/38e1e2590dfa/kill"
time="2016-03-09T14:49:54.293752784Z" level=debug msg="POST /v1.22/containers/38e1e2590dfa/kill?signal=KILL"
time="2016-03-09T14:49:54.293802705Z" level=debug msg="Sending 9 to 38e1e2590dfa5d77482b8fbf6b14f01e8d5278622b8e5d7262cd2cdeb777690b"
time="2016-03-09T14:50:04.294276946Z" level=info msg="Container 38e1e2590dfa failed to exit within 10 seconds of kill - trying direct SIGKILL"
time="2016-03-09T14:50:26.678957119Z" level=debug msg="clean 3 unused exec commands"

Just as a note (I know this is closed but not sure if it makes sense to open as a new issue). I was having the same issue on a later version until I switched to devmapper.

$ docker info
Containers: 4
 Running: 3
 Paused: 0
 Stopped: 1
Images: 81
Server Version: 1.12.1
Storage Driver: devicemapper
 Pool Name: docker-8:1-9044034-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 2.726 GB
 Data Space Total: 107.4 GB
 Data Space Available: 96.43 GB
 Metadata Space Used: 4.387 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.143 GB
 Thin Pool Minimum Free Space: 10.74 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Use `--storage-opt dm.thinpooldev` to specify a custom block storage device.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.77 (2012-10-15)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor
Kernel Version: 3.13.0-77-generic
Operating System: Ubuntu 14.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.56 GiB
Name: ravn
ID: L2WX:3RQ7:W6IC:7MY3:M3ZC:7MP2:3ZMP:VHW4:TLXM:VLYO:NNZ5:2FVW
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

@einhverfr The issue is fixed in kernel 3.13.0-79.123 (yours seems to be 3.13.0-77).

Can this issue really be solved with a Kernel upgrade? We are encountering the same problem with Docker 1.9.1 on Ubuntu 14.04 with Kernel 3.13.0-83-generic.

Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5
 Built:        Fri Nov 20 13:12:04 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5
 Built:        Fri Nov 20 13:12:04 UTC 2015
 OS/Arch:      linux/amd64

@martinm82 yes, this issue was a kernel issue. It's possible that something else results in similar behavior, or that there's a regression in the kernel. However, please open a new issue if you're having issues on the current release; keep in mind that docker 1.9.1 is EOL, so it won't be receiving updates anymore.

I am locking the discussion on this issue, because the original issue here was resolved, and I want to prevent this issue from collecting possibly unrelated issues. See this comment for the kernel versions needed to fix this issue: https://github.com/docker/docker/issues/18180#issuecomment-193708192
