Moby: Docker Daemon Hangs under load

Created on 11 Jun 2015  ·  174 Comments  ·  Source: moby/moby

The Docker daemon hangs under heavy load. The scenario is starting/stopping/killing/removing many containers per second - high utilization. Each container exposes one port and is run detached with logging disabled. The container receives an incoming TCP connection, does some work, sends a response, and then exits. An outside process cleans up by killing/removing the container and starting a new one.

I cannot get docker info from an actual hung instance, as once it is hung I can't get docker to run without a reboot. The below info is from one of the instances that has had the problem after a reboot.

We also have instances where something completely locks up and the instance cannot even be ssh'd into. This usually happens sometime after the docker lockup occurs.

Docker Info

Containers: 8
Images: 65
Storage Driver: overlay
Backing Filesystem: extfs
Execution Driver: native-0.2
Kernel Version: 3.18.0-031800-generic
Operating System: Ubuntu 14.04.2 LTS
CPUs: 2
Total Memory: 3.675 GiB
Name:
ID: FAEG:2BHA:XBTO:CNKH:3RCA:GV3Z:UWIB:76QS:6JAG:SVCE:67LH:KZBP
WARNING: No swap limit support

Docker Version

Client version: 1.6.0
Client API version: 1.18
Go version (client): go1.4.2
Git commit (client): 4749651
OS/Arch (client): linux/amd64
Server version: 1.6.0
Server API version: 1.18
Go version (server): go1.4.2
Git commit (server): 4749651
OS/Arch (server): linux/amd64

uname -a
Linux 3.18.0-031800-generic #201412071935 SMP Mon Dec 8 00:36:34 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 14972
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 14972
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

area/daemon area/runtime kind/bug

Most helpful comment

Today I ran into a problem like this: docker ps was frozen and docker was eating 100% CPU.
Then I saw that my datadog-agent was querying Docker in a loop about its containers with this option:

collect_container_size: true

So I had an infinite loop running a very expensive operation (I have more than 10k containers). After I stopped the Datadog Docker integration, the system ran fine again: docker ps worked and docker used 0% CPU.

All 174 comments

would you happen to have a reduced testcase to show this? Perhaps a small Dockerfile for what's in the container and a bash script that does the work of starting/stopping/... the containers?

The container image is on Docker Hub: kinvey/blrunner#v0.3.8

Using the remote API with the following options:

CREATE
Image: 'kinvey/blrunner#v0.3.8'
AttachStdout: false
AttachStderr: false
ExposedPorts: {'7000/tcp': {}}
Tty: false
HostConfig:
PublishAllPorts: true
CapDrop: [
"CHOWN"
"DAC_OVERRIDE"
"FOWNER"
"KILL"
"SETGID"
"SETPCAP"
"NET_BIND_SERVICE"
"NET_RAW"
"SYS_CHROOT"
"MKNOD"
"SETFCAP"
"AUDIT_WRITE"
]
LogConfig:
Type: "none"
Config: {}

START

container.start

REMOVE
force:true
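A rough sketch of this kind of churn, using the docker CLI rather than the raw remote-API calls listed above (the image tag comes from this thread; the worker count and iteration counts are illustrative assumptions, not the exact harness):

#!/bin/bash
# Churn repro sketch: repeatedly create/start a detached container with all
# ports published and no logging, then force-remove it, from several
# concurrent workers to approximate "many containers per second".
IMAGE=kinvey/blrunner:v0.3.8

churn() {
  for i in $(seq 1 500); do
    id=$(docker run -d -P --log-driver=none "$IMAGE")
    # ... the container accepts a TCP connection, does its work, exits ...
    docker rm -f "$id" >/dev/null
  done
}

# several churn loops in parallel
for w in $(seq 1 8); do churn & done
wait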

Hmm, are you seeing excessive resource usage?
The fact that ssh isn't even working makes me suspicious.
It could very well be an inode issue with overlay, too many open FDs, etc.

Not particularly, in terms of excessive resource usage... but the early symptom is Docker hanging completely while other processes hum along happily...

Important to note, we only have 8 containers running at any one time on any one instance.

Captured some stats while docker was no longer responsive:

lsof | wc -l

shows 1025.

However, an error appears several times:
lsof: WARNING: can't stat() overlay file system /var/lib/docker/overlay/aba7820e8cb01e62f7ceb53b0d04bc1281295c38815185e9383ebc19c30325d0/merged
Output information may be incomplete.

Sample output of top:

top - 00:16:53 up 12:22, 2 users, load average: 2.01, 2.05, 2.05
Tasks: 123 total, 1 running, 121 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0.3 us, 0.0 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 3853940 total, 2592920 used, 1261020 free, 172660 buffers
KiB Swap: 0 total, 0 used, 0 free. 1115644 cached Mem

24971 kinvey 20 0 992008 71892 10796 S 1.3 1.9 9:11.93 node
902 root 20 0 1365860 62800 12108 S 0.3 1.6 30:06.10 docker
29901 ubuntu 20 0 27988 6720 2676 S 0.3 0.2 3:58.17 tmux
1 root 20 0 33612 4152 2716 S 0.0 0.1 14:22.00 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.03 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:04.40 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
7 root 20 0 0 0 0 S 0.0 0.0 2:21.81 rcu_sched
8 root 20 0 0 0 0 S 0.0 0.0 0:01.91 rcu_bh

@mjsalinger The setup you're using is unsupported. The reason why it's unsupported is that you're using Ubuntu 14.04 with a custom kernel.

Where does that 3.18.0-031800 kernel come from? Did you notice that this kernel build is outdated? The kernel you're using was built in December of last year.

I'm sorry, but there's nothing to be debugged here. This issue might actually be a kernel bug related to overlay or to some other already fixed kernel bug which is no longer an issue in the latest version of kernel 3.18.

I'm going to close this issue. Please try again with an up to date 3.18 or newer kernel and check if you're running into the problem. Please keep in mind that there are multiple issues opened against overlay and that you'll probably experience problems with overlay even after updating to the latest kernel version and to the latest Docker version.

@unclejack @cpuguy83 @LK4D4 Please reopen this issue. The configuration we are using was specifically recommended by the Docker team and arrived at through experimentation. We've tried newer kernels (3.19+) and ran into a kernel panic bug of some kind, so the advice was to go with a pre-December 3.18 build, because a known kernel bug introduced after that caused a kernel panic we were hitting and which, to my knowledge, has not yet been fixed.

As far as OverlayFS, that was also presented to me as the ideal FS for Docker after experiencing _numerous_ performance problems with AUFS. If this isn't a supported configuration, can someone help me to find a performant, stable configuration that will work for this use case? We've been pushing to get this stable for several months.

@mjsalinger Can you provide the inode usage for the volume that overlay is running on? (df -i /var/lib/docker ought to do).

Thanks for reopening. If the answer is a different kernel, that's fine, I just want to get to a stable scenario.

df -i /var/lib/docker

Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/xvda1 1638400 643893 994507 40% /

Overlay still has a bunch of problems: https://github.com/docker/docker/issues?q=is%3Aopen+is%3Aissue+label%3A%2Fsystem%2Foverlay

I wouldn't use overlay in production. Others have commented on this issue tracker that they're using AUFS in production and that it's been stable for them.

Kernel 3.18 is unsupported on Ubuntu 14.04. Canonical doesn't provide support for that.

AUFS in production is not performant at all and has been anything _but_ stable for us - I would routinely run into I/O bottlenecks, freezes, etc.

See:

#11228, #12758, #12962

Switching to Overlay resolved all of the above issues - we only have this one issue remaining.

also:

http://developerblog.redhat.com/2014/09/30/overview-storage-scalability-docker/
http://qconlondon.com/system/files/presentation-slides/Docker%20Clustering.pdf

It seems that Overlay is being presented as the driver of choice by the community in general. Fine if it's not stable yet, but neither is AUFS and there has to be _some_ way to get the performance and stability I need with Docker. I'm all for trying new things, but in our previous configuration (AUFS on Ubuntu 12.04 and AUFS on Ubuntu 14.04) we could not get either stability or performance. At least with Overlay we get good performance and better stability - but we just need to resolve this one problem.

I also experienced symptoms similar to these running default Ubuntu instances under AUFS (12.04, 14.04, and 15.04).

#10355, #13940

Both of these issues disappeared after switching to the current Kernel and OverlayFS.

@mjsalinger I'd recommend using Ubuntu 14.04 with the latest kernel 3.13 packages. I'm using that myself and I haven't run into any of those problems at all.

@unclejack Tried that and ran into the issues specified above with heavy, high volume usage (creating/destroying lots of containers) and AUFS was incredibly non-performant. So that's not an option.

@mjsalinger Are you using upstart to start docker? What does /etc/init/docker.conf look like?

Yes, using upstart.

/etc/init/docker.conf

description "Docker daemon"

start on (local-filesystems and net-device-up IFACE!=lo)
stop on runlevel [!2345]
limit nofile 524288 1048576
limit nproc 524288 1048576

respawn

pre-start script
        # see also https://github.com/tianon/cgroupfs-mount/blob/master/cgroupfs-mount
        if grep -v '^#' /etc/fstab | grep -q cgroup \
                || [ ! -e /proc/cgroups ] \
                || [ ! -d /sys/fs/cgroup ]; then
                exit 0
        fi
        if ! mountpoint -q /sys/fs/cgroup; then
                mount -t tmpfs -o uid=0,gid=0,mode=0755 cgroup /sys/fs/cgroup
        fi
        (
                cd /sys/fs/cgroup
                for sys in $(awk '!/^#/ { if ($4 == 1) print $1 }' /proc/cgroups); do
                        mkdir -p $sys
                        if ! mountpoint -q $sys; then
                                if ! mount -n -t cgroup -o $sys cgroup $sys; then
                                        rmdir $sys || true
                                fi
                        fi
                done
        )
end script

script
        # modify these in /etc/default/$UPSTART_JOB (/etc/default/docker)
        DOCKER=/usr/bin/$UPSTART_JOB
        DOCKER_OPTS=
        if [ -f /etc/default/$UPSTART_JOB ]; then
                . /etc/default/$UPSTART_JOB
        fi
        exec "$DOCKER" -d $DOCKER_OPTS
end script

# Don't emit "started" event until docker.sock is ready.
# See https://github.com/docker/docker/issues/6647
post-start script
        DOCKER_OPTS=
        if [ -f /etc/default/$UPSTART_JOB ]; then
                . /etc/default/$UPSTART_JOB
        fi
        if ! printf "%s" "$DOCKER_OPTS" | grep -qE -e '-H|--host'; then
                while ! [ -e /var/run/docker.sock ]; do
                        initctl status $UPSTART_JOB | grep -qE "(stop|respawn)/" && exit 1
                        echo "Waiting for /var/run/docker.sock"
                        sleep 0.1
                done
                echo "/var/run/docker.sock is up"
        fi
end script

We are also now seeing the below when running any Docker command on some instances, for example...

sudo docker ps

FATA[0000] Get http:///var/run/docker.sock/v1.18/containers/json?all=1: dial unix /var/run/docker.sock: resource temporarily unavailable. Are you trying to connect to a TLS-enabled
daemon without TLS?

@mjsalinger It's just a crappy error message. In most cases it means the daemon crashed.

@mjsalinger What do the docker daemon logs say during this? /var/log/upstart/docker.log should be the location.

Frozen, no new entries coming in. Here are the last entries in the log:

INFO[46655] -job log(create, 48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46655] -job create() = OK (0)
INFO[46655] POST /containers/48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a/start
INFO[46655] +job start(48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a)
INFO[46655] +job allocate_interface(48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a)
INFO[46655] -job allocate_interface(48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a) = OK (0)
INFO[46655] +job allocate_port(48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a)
INFO[46655] POST /containers/create?Image=kinvey%2Fblrunner%3Av0.3.8&AttachStdout=false&AttachStderr=false&ExposedPorts=&Tty=false&HostConfig=
INFO[46655] +job create()
INFO[46655] DELETE /containers/4d447093f522f1d74f482b2f76c91adfd38b5ad264202b1c8262f05a0edaf187?force=true
INFO[46655] +job rm(4d447093f522f1d74f482b2f76c91adfd38b5ad264202b1c8262f05a0edaf187)
INFO[46656] -job allocate_port(48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a) = OK (0)
INFO[46656] +job log(start, 48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(start, 48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] +job log(create, 7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(create, 7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] -job create() = OK (0)
INFO[46656] +job log(die, 4d447093f522f1d74f482b2f76c91adfd38b5ad264202b1c8262f05a0edaf187, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(die, 4d447093f522f1d74f482b2f76c91adfd38b5ad264202b1c8262f05a0edaf187, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] +job release_interface(4d447093f522f1d74f482b2f76c91adfd38b5ad264202b1c8262f05a0edaf187)
INFO[46656] POST /containers/7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f/start
INFO[46656] +job start(7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f)
INFO[46656] +job allocate_interface(7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f)
INFO[46656] -job allocate_interface(7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f) = OK (0)
INFO[46656] +job allocate_port(7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f)
INFO[46656] -job release_interface(4d447093f522f1d74f482b2f76c91adfd38b5ad264202b1c8262f05a0edaf187) = OK (0)
INFO[46656] DELETE /containers/cb03fc14e5eab2acf01d1d42dec2fc1990cccca69149de2dc97873f87474db9b?force=true
INFO[46656] +job rm(cb03fc14e5eab2acf01d1d42dec2fc1990cccca69149de2dc97873f87474db9b)
INFO[46656] +job log(destroy, 4d447093f522f1d74f482b2f76c91adfd38b5ad264202b1c8262f05a0edaf187, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(destroy, 4d447093f522f1d74f482b2f76c91adfd38b5ad264202b1c8262f05a0edaf187, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] -job rm(4d447093f522f1d74f482b2f76c91adfd38b5ad264202b1c8262f05a0edaf187) = OK (0)
INFO[46656] +job log(die, cb03fc14e5eab2acf01d1d42dec2fc1990cccca69149de2dc97873f87474db9b, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(die, cb03fc14e5eab2acf01d1d42dec2fc1990cccca69149de2dc97873f87474db9b, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] +job release_interface(cb03fc14e5eab2acf01d1d42dec2fc1990cccca69149de2dc97873f87474db9b)
INFO[46656] DELETE /containers/1e8ddec281bd9b5bfe239d0e955874f83d51ffec95c499f88c158639f7445d20?force=true
INFO[46656] +job rm(1e8ddec281bd9b5bfe239d0e955874f83d51ffec95c499f88c158639f7445d20)
INFO[46656] -job allocate_port(7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f) = OK (0)
INFO[46656] +job log(start, 7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(start, 7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] -job start(48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a) = OK (0)
INFO[46656] POST /containers/create?Image=kinvey%2Fblrunner%3Av0.3.8&AttachStdout=false&AttachStderr=false&ExposedPorts=&Tty=false&HostConfig=
INFO[46656] +job create()
INFO[46656] +job log(create, 1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(create, 1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] -job create() = OK (0)
INFO[46656] +job log(die, 1e8ddec281bd9b5bfe239d0e955874f83d51ffec95c499f88c158639f7445d20, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(die, 1e8ddec281bd9b5bfe239d0e955874f83d51ffec95c499f88c158639f7445d20, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] +job release_interface(1e8ddec281bd9b5bfe239d0e955874f83d51ffec95c499f88c158639f7445d20)
INFO[46656] GET /containers/48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a/json
INFO[46656] +job container_inspect(48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a)
INFO[46656] -job container_inspect(48abb699bb6b8aefe042c010d06268d5e13515d616c5ca61f3a4930a325de26a) = OK (0)
INFO[46656] POST /containers/1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830/start
INFO[46656] +job start(1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830)
INFO[46656] +job allocate_interface(1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830)
INFO[46656] -job allocate_interface(1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830) = OK (0)
INFO[46656] +job allocate_port(1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830)
INFO[46656] -job release_interface(cb03fc14e5eab2acf01d1d42dec2fc1990cccca69149de2dc97873f87474db9b) = OK (0)
INFO[46656] -job start(7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f) = OK (0)
INFO[46656] GET /containers/7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f/json
INFO[46656] +job container_inspect(7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f)
INFO[46656] -job release_interface(1e8ddec281bd9b5bfe239d0e955874f83d51ffec95c499f88c158639f7445d20) = OK (0)
INFO[46656] -job container_inspect(7ef9e347b762b4fb34a85508e5d259a609392decf9ffc8488730dbe8e731c84f) = OK (0)
INFO[46656] +job log(destroy, cb03fc14e5eab2acf01d1d42dec2fc1990cccca69149de2dc97873f87474db9b, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(destroy, cb03fc14e5eab2acf01d1d42dec2fc1990cccca69149de2dc97873f87474db9b, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] -job rm(cb03fc14e5eab2acf01d1d42dec2fc1990cccca69149de2dc97873f87474db9b) = OK (0)
INFO[46656] +job log(destroy, 1e8ddec281bd9b5bfe239d0e955874f83d51ffec95c499f88c158639f7445d20, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(destroy, 1e8ddec281bd9b5bfe239d0e955874f83d51ffec95c499f88c158639f7445d20, kinvey/blrunner:v0.3.8) = OK (0)
INFO[46656] -job rm(1e8ddec281bd9b5bfe239d0e955874f83d51ffec95c499f88c158639f7445d20) = OK (0)
INFO[46656] DELETE /containers/4cfeb48701f194cfd40f71d7883d82906d54a538084fa7be6680345e4651aa60?force=true
INFO[46656] +job rm(4cfeb48701f194cfd40f71d7883d82906d54a538084fa7be6680345e4651aa60)
INFO[46656] -job allocate_port(1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830) = OK (0)
INFO[46656] +job log(start, 1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830, kinvey/blrunner:v0.3.8)
INFO[46656] -job log(start, 1ae5798d7aeec4857944b40a27f1a69789323bbe8edb8d67a250150241484830, kinvey/blrunner:v0.3.8) = OK (0)

@cpuguy83 Was the log helpful at all?

@mjsalinger Makes me think there's a deadlock somewhere since there's nothing else indicating an issue.

@cpuguy83 That would make sense given the symptoms. Is there anything I can do to help further trace this issue and where it comes from?

Maybe we can get an strace to see that it's actually hanging on a lock.
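Something along these lines should show whether it's stuck on a lock (sketch only; the SIGUSR1 stack dump depends on the daemon version supporting it, and the log path assumes the upstart setup above):

# Follow all daemon threads; sitting in FUTEX_WAIT with no progress
# points at a lock rather than slow I/O.
strace -f -p $(pidof docker)

# If the daemon supports it, SIGUSR1 makes it dump its goroutine stacks to
# its log, which is more useful than strace for finding the blocked code path.
kill -USR1 $(pidof docker)
tail -n 200 /var/log/upstart/docker.log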

Ok. Will work to see if we can get that. We wanted to try 1.7 first - did that but did not notice any improvement.

@cpuguy83 From one of the affected hosts:

root@<host>:~# strace -q -y -v -p 899
futex(0x134f1b8, FUTEX_WAIT, 0, NULL^C <detached ...>

@cpuguy83 Any ideas?

Seeing the following in 1.7 when containers are not being killed/started. This seems to be a precursor to the problem (note: we didn't see these errors in 1.6, but we did see a volume of dead containers start to build up even though commands were issued to kill/remove them):

Error spawning container: Error: HTTP code is 404 which indicates error: no such container - Cannot start container 09c12c9f72d461342447e822857566923d5532280f9ce25659d1ef3e54794484: Link not found

Error spawning container: Error: HTTP code is 404 which indicates error: no such container - Cannot start container 5e29407bb3154d6f5778676905d112a44596a23fd4a1b047336c3efaca6ada18: Link not found

Error spawning container: Error: HTTP code is 404 which indicates error: no such container - Cannot start container be22e8b24b70e24e5269b860055423236e4a2fca08969862c3ae3541c4ba4966: Link not found

Error spawning container: Error: HTTP code is 404 which indicates error: no such container - Cannot start container c067d14b67be8fb81922b87b66c0025ae5ae1ebed3d35dcb4052155afc4cafb4: Link not found

Error spawning container: Error: HTTP code is 404 which indicates error: no such container - Cannot start container 7f21c4fd8b6620eba81475351e8417b245f86b6014fd7125ba0e37c6684e3e42: Link not found

Error spawning container: Error: HTTP code is 404 which indicates error: no such container - Cannot start container b31531492ab7e767158601c438031c8e9ef0b50b9e84b0b274d733ed4fbe03a0: Link not found

Error spawning container: Error: HTTP code is 500 which indicates error: server error - Cannot start container 477dc3c691a12f74ea3f07af0af732082094418f6825f7c3d320bda0495504a1: iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 32822 -j DNAT --to-destination 172.17.0.44:7000 ! -i docker0: iptables: No chain/target/match by that name.
 (exit status 1)

Error spawning container: Error: HTTP code is 404 which indicates error: no such container - Cannot start container 6965eec076db173f4e3b9a05f06e1c87c02cfe849821bea4008ac7bd0f1e083a: Link not found

Error spawning container: Error: HTTP code is 404 which indicates error: no such container - Cannot start container 7c721dd2b8d31b51427762ac1d0fe86ffbb6e1d7462314fdac3afe1f46863ff1: Link not found

Error spawning container: Error: HTTP code is 500 which indicates error: server error - Cannot start container c908d059631e66883ee1a7302c16ad16df3298ebfec06bba95232d5f204c9c75: iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 32837 -j DNAT --to-destination 172.17.0.47:7000 ! -i docker0: iptables: No chain/target/match by that name.
 (exit status 1)

Error spawning container: Error: HTTP code is 500 which indicates error: server error - Cannot start container c3e11ffb82fe08b8a029ce0a94e678ad46e3d2f3d76bed7350544c6c48789369: iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 32847 -j DNAT --to-destination 172.17.0.48:7000 ! -i docker0: iptables: No chain/target/match by that name.
 (exit status 1)

Tail of docker.log from a recent incident:

INFO[44089] DELETE /containers/4e455d01da8453688dd27cad41fea158757311c0c89f27619a728f272591ef25?force=true
INFO[44089] +job rm(4e455d01da8453688dd27cad41fea158757311c0c89f27619a728f272591ef25)
INFO[44089] DELETE /containers/a608bc1014317b083ac2f32a4c6c85dda65445420775e21d6406ca9146723c96?force=true
INFO[44089] +job rm(a608bc1014317b083ac2f32a4c6c85dda65445420775e21d6406ca9146723c96)
INFO[44089] +job log(die, 4e455d01da8453688dd27cad41fea158757311c0c89f27619a728f272591ef25, kinvey/blrunner:v0.3.8)
INFO[44089] -job log(die, 4e455d01da8453688dd27cad41fea158757311c0c89f27619a728f272591ef25, kinvey/blrunner:v0.3.8) = OK (0)
INFO[44089] +job release_interface(4e455d01da8453688dd27cad41fea158757311c0c89f27619a728f272591ef25)
INFO[44089] -job release_interface(4e455d01da8453688dd27cad41fea158757311c0c89f27619a728f272591ef25) = OK (0)
INFO[44089] +job log(die, a608bc1014317b083ac2f32a4c6c85dda65445420775e21d6406ca9146723c96, kinvey/blrunner:v0.3.8)
INFO[44089] -job log(die, a608bc1014317b083ac2f32a4c6c85dda65445420775e21d6406ca9146723c96, kinvey/blrunner:v0.3.8) = OK (0)
INFO[44089] +job release_interface(a608bc1014317b083ac2f32a4c6c85dda65445420775e21d6406ca9146723c96)
INFO[44089] -job release_interface(a608bc1014317b083ac2f32a4c6c85dda65445420775e21d6406ca9146723c96) = OK (0)
INFO[44092] +job log(destroy, 285274ee9c5b3bfa9fcea4d93b75e7e51949752b8d0eb101a31ea4f9aec5dad6, kinvey/blrunner:v0.3.8)
INFO[44092] -job log(destroy, 285274ee9c5b3bfa9fcea4d93b75e7e51949752b8d0eb101a31ea4f9aec5dad6, kinvey/blrunner:v0.3.8) = OK (0)
INFO[44092] -job rm(285274ee9c5b3bfa9fcea4d93b75e7e51949752b8d0eb101a31ea4f9aec5dad6) = OK (0)
INFO[44092] POST /containers/create?Image=kinvey%2Fblrunner%3Av0.3.8&AttachStdout=false&AttachStderr=false&ExposedPorts=&Tty=false&HostConfig=
INFO[44092] +job create()
INFO[44097] +job log(destroy, 4e455d01da8453688dd27cad41fea158757311c0c89f27619a728f272591ef25, kinvey/blrunner:v0.3.8)
INFO[44097] -job log(destroy, 4e455d01da8453688dd27cad41fea158757311c0c89f27619a728f272591ef25, kinvey/blrunner:v0.3.8) = OK (0)
INFO[44097] -job rm(4e455d01da8453688dd27cad41fea158757311c0c89f27619a728f272591ef25) = OK (0)
INFO[44097] POST /containers/create?Image=kinvey%2Fblrunner%3Av0.3.8&AttachStdout=false&AttachStderr=false&ExposedPorts=&Tty=false&HostConfig=
INFO[44097] +job create()
INFO[44098] +job log(destroy, c80a39f060f200f1aff8ae52538779542437745e4184ed02793f8873adcb9cd4, kinvey/blrunner:v0.3.8)
INFO[44098] -job log(destroy, c80a39f060f200f1aff8ae52538779542437745e4184ed02793f8873adcb9cd4, kinvey/blrunner:v0.3.8) = OK (0)
INFO[44098] -job rm(c80a39f060f200f1aff8ae52538779542437745e4184ed02793f8873adcb9cd4) = OK (0)
INFO[44098] POST /containers/create?Image=kinvey%2Fblrunner%3Av0.3.8&AttachStdout=false&AttachStderr=false&ExposedPorts=&Tty=false&HostConfig=
INFO[44098] +job create()
INFO[44098] +job log(create, 3b9a4635c068989ddb1983aa12460083e874d50fd42c743033ed3a08000eb7e9, kinvey/blrunner:v0.3.8)
INFO[44098] -job log(create, 3b9a4635c068989ddb1983aa12460083e874d50fd42c743033ed3a08000eb7e9, kinvey/blrunner:v0.3.8) = OK (0)
INFO[44098] -job create() = OK (0)
INFO[44098] POST /containers/3b9a4635c068989ddb1983aa12460083e874d50fd42c743033ed3a08000eb7e9/start
INFO[44098] +job start(3b9a4635c068989ddb1983aa12460083e874d50fd42c743033ed3a08000eb7e9)
INFO[44102] +job log(destroy, a608bc1014317b083ac2f32a4c6c85dda65445420775e21d6406ca9146723c96, kinvey/blrunner:v0.3.8)
INFO[44102] -job log(destroy, a608bc1014317b083ac2f32a4c6c85dda65445420775e21d6406ca9146723c96, kinvey/blrunner:v0.3.8) = OK (0)
INFO[44102] -job rm(a608bc1014317b083ac2f32a4c6c85dda65445420775e21d6406ca9146723c96) = OK (0)
INFO[44102] POST /containers/create?Image=kinvey%2Fblrunner%3Av0.3.8&AttachStdout=false&AttachStderr=false&ExposedPorts=&Tty=false&HostConfig=
INFO[44102] +job create()

@cpuguy83 @LK4D4 Any ideas/updates?

I have no idea what to think. It isn't a software deadlock, because even docker info is hanging. Could this be a memory leak?

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
root 20 0 1314832 89688 11568 S 0.3 2.3 65:47.56 docker

That's what the Docker process appears to be using; it doesn't look much like a memory leak to me...

I'll comment on this one to say "us too". We're using Docker for our testing infrastructure, and we're hit by this under the same conditions/scenarios. This is going to impact our ability to use Docker in a meaningful way if it locks up under load.

I'd be happy to contribute troubleshooting results if you would point me at what you'd need as information.

I just ran into the same issue on a CentOS7 host.

$ docker version
Client version: 1.6.2
Client API version: 1.18
Go version (client): go1.4.2
Git commit (client): ba1f6c3/1.6.2
OS/Arch (client): linux/amd64
Server version: 1.6.2
Server API version: 1.18
Go version (server): go1.4.2
Git commit (server): ba1f6c3/1.6.2
OS/Arch (server): linux/amd64
$ docker info
Containers: 7
Images: 672
Storage Driver: devicemapper
 Pool Name: docker-253:2-67171716-pool
 Pool Blocksize: 65.54 kB
 Backing Filesystem: xfs
 Data file: /dev/mapper/vg01-docker--data
 Metadata file: /dev/mapper/vg01-docker--metadata
 Data Space Used: 54.16 GB
 Data Space Total: 59.06 GB
 Data Space Available: 4.894 GB
 Metadata Space Used: 53.88 MB
 Metadata Space Total: 5.369 GB
 Metadata Space Available: 5.315 GB
 Udev Sync Supported: true
 Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Kernel Version: 3.10.0-229.7.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 8
Total Memory: 31.25 GiB

This happened on our CI system. I changed the number of parallel jobs and then started 4 jobs in parallel, so there were 4-6 containers running. Load was about 10 (with some of the 8 cores still idle).

While all containers were running fine, dockerd itself was stuck: docker images would still show images, but all daemon commands like docker ps or docker info stalled.

My strace was similar to the one above:

strace -p 1194
Process 1194 attached
futex(0x11d2838, FUTEX_WAIT, 0, NULL

After a while all the containers were done with their jobs (compiling, testing...), but none of them would "return". It looks as if they themselves were waiting for dockerd.

When I finally killed dockerd, the containers terminated with a message like this:

time="2015-08-05T15:59:32+02:00" level=fatal msg="Post http:///var/run/docker.sock/v1.18/containers/9117731bd16a451b89fd938f569c39761b5f8d6df505256e172738e0593ba125/wait: EOF. Are you trying to connect to a TLS-enabled daemon without TLS?" 

We are seeing the same thing as @filex on Centos 7 in our CI environment

Containers: 1
Images: 19
Storage Driver: devicemapper
 Pool Name: docker--vg-docker--pool
 Pool Blocksize: 524.3 kB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 2.611 GB
 Data Space Total: 32.17 GB
 Data Space Available: 29.56 GB
 Metadata Space Used: 507.9 kB
 Metadata Space Total: 54.53 MB
 Metadata Space Available: 54.02 MB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.10.0-229.11.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 2
Total Memory: 7.389 GiB
Name: ip-10-1-2-234
ID: 5YVL:O32X:4NNA:ICSJ:RSYS:CNCI:6QVC:C5YR:XGR4:NQTW:PUSE:YFTA
Client version: 1.7.1
Client API version: 1.19
Package Version (client): docker-1.7.1-108.el7.centos.x86_64
Go version (client): go1.4.2
Git commit (client): 3043001/1.7.1
OS/Arch (client): linux/amd64
Server version: 1.7.1
Server API version: 1.19
Package Version (server): docker-1.7.1-108.el7.centos.x86_64
Go version (server): go1.4.2
Git commit (server): 3043001/1.7.1
OS/Arch (server): linux/amd64

Same here: docker was not responding to ps, rm, stop, run, info and probably more. A few reboots later, everything was back to normal.

docker info
Containers: 25
Images: 1739
Storage Driver: devicemapper
 Pool Name: docker-9:2-62521632-pool
 Pool Blocksize: 65.54 kB
 Backing Filesystem: extfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 96.01 GB
 Data Space Total: 107.4 GB
 Data Space Available: 11.36 GB
 Metadata Space Used: 110.5 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.037 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.10.0-229.11.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 8
Total Memory: 31.2 GiB
Name: CentOS-70-64-minimal
ID: EM3I:PELO:SBH6:JRVL:AM6C:UM7W:XJWJ:FI5N:JO77:7PMF:S57A:PLAT
docker version
Client version: 1.7.1
Client API version: 1.19
Package Version (client): docker-1.7.1-108.el7.centos.x86_64
Go version (client): go1.4.2
Git commit (client): 3043001/1.7.1
OS/Arch (client): linux/amd64
Server version: 1.7.1
Server API version: 1.19
Package Version (server): docker-1.7.1-108.el7.centos.x86_64
Go version (server): go1.4.2
Git commit (server): 3043001/1.7.1
OS/Arch (server): linux/amd64

I'm on 1.6.2 and 1.8.2 and my Docker setup is also melting under load. It looks like the Docker job queue slowly fills up to the point where any new call takes minutes. Tuning the filesystem and keeping the number of containers small makes things a bit better. I'm still looking into this, searching for patterns.

$ docker version
Client:
 Version:      1.8.2
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   44b3b67
 Built:        Mon Sep 14 23:56:40 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.8.2
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   44b3b67
 Built:        Mon Sep 14 23:56:40 UTC 2015
 OS/Arch:      linux/amd64

Same here. I sure am happy to have found this thread; I was losing my mind for the past two weeks trying to figure out what the problem might be.
I think it's the number of containers being started simultaneously; it looks like a classic case of a deadlock.

In my case it was only a handful of containers being started simultaneously (maybe 3-5), but all of them receive a large input stream through STDIN (of docker run), which may have put extra stress on the daemon.

@filex We have a similar use case, although we did not face these freezes before we migrated to 1.8.2.

Today I ran into a problem like this: docker ps was frozen and docker was eating 100% CPU.
Then I saw that my datadog-agent was querying Docker in a loop about its containers with this option:

collect_container_size: true

So I had an infinite loop running a very expensive operation (I have more than 10k containers). After I stopped the Datadog Docker integration, the system ran fine again: docker ps worked and docker used 0% CPU.
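For anyone hitting the same thing, a rough mitigation sketch (the config file path and service name are assumptions about a typical Datadog agent install of that time, not verified here):

# Turn off per-container size collection in the Datadog Docker check,
# then restart the agent. Adjust the path for your agent version.
sudo sed -i 's/collect_container_size: true/collect_container_size: false/' \
    /etc/dd-agent/conf.d/docker_daemon.yaml
sudo service datadog-agent restart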

I'm experimenting with cAdvisor and it's also melting Docker. Try to reduce the number of dead, exited containers and see if that reduces the load (one-liner sketch below).
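A sketch of that cleanup (assumes a docker CLI new enough to support the status filter):

# Remove all exited containers; run periodically, e.g. from cron.
docker ps -aq -f status=exited | xargs -r docker rm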

Can you please elaborate more on your findings?


@ohade my Docker runs without any freezes 24/7. At any given time it has 60-80 containers running, and I create about 1500 new containers per day. So when I saw that it freezes under this load, and only Docker does (the system has plenty of free memory, I/O, and CPU), I figured it wasn't a common problem.
In the logs (/var/log/upstart/docker.log) I found that something was sending requests to my Docker, and I found out what it was :)
What more can I tell you?

:) I meant: is that agent part of Docker, so that I can turn off its size checking too, or is it an external agent?


@ohade no, it isn't part of Docker. In my opinion this isn't a Docker issue. You should look through your environment for whatever is making heavy requests to Docker.

I'm having this issue as well since I updated to docker 1.8.2

Any progress on this? In our testing infrastructure, I am spinning up a large number of short-lived containers and I am running into this issue more than once a day.

docker version
Client:
 Version:      1.8.2
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   0a8c2e3
 Built:        Thu Sep 10 19:12:52 UTC 2015
 OS/Arch:      linux/amd64
Server:
 Version:      1.8.2
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   0a8c2e3
 Built:        Thu Sep 10 19:12:52 UTC 2015
 OS/Arch:      linux/amd64

Linux version:

$ uname -a
Linux grpc-interop1 3.16.0-0.bpo.4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u3~bpo70+1 (2015-08-08) x86_64 GNU/Linux

We moved to 1.8.3:

# docker version 
Client:
 Version:      1.8.3
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   f4bf5c7
 Built:        Mon Oct 12 06:06:01 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.8.3
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   f4bf5c7
 Built:        Mon Oct 12 06:06:01 UTC 2015
 OS/Arch:      linux/amd64

but it nevertheless kept crashing, at first once every few days and later up to ten times a day. Migrating from CentOS 7 with devicemapper/loopback to Ubuntu 14.04 LTS with AUFS sorted it out.

Also seen this issue https://github.com/docker/docker/issues/12606#issuecomment-149659953

I can reliably reproduce the issue by running Kubernetes's e2e tests in a loop.

I captured a trace log of the daemon; it seems there's a deadlock, with a lot of goroutines hanging on semacquire:

goroutine 8956 [semacquire, 8 minutes]:
sync.(*Mutex).Lock(0xc208961650)
/usr/lib/go/src/sync/mutex.go:66 +0xd3
github.com/docker/docker/daemon.func·028(0xc20861c1e0, 0x0, 0x0)
/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.7.1-r4/work/docker-1.7.1/.gopath/src/github.com/docker/docker/daemon/list.go:84 +0xfc
github.com/docker/docker/daemon.(*Daemon).Containers(0xc2080a50a0, 0xc209ec4820, 0x0, 0x0, 0x0, 0x0, 0x0)
/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.7.1-r4/work/docker-1.7.1/.gopath/src/github.com/docker/docker/daemon/list.go:187 +0x917
github.com/docker/docker/api/server.(*Server).getContainersJSON(0xc208032540, 0xc3f700, 0x4, 0x7f214584f110, 0xc20a306d20, 0xc209d4a9c0, 0xc20a09fa10, 0x0, 0x0)
/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.7.1-r4/work/docker-1.7.1/.gopath/src/github.com/docker/docker/api/server/server.go:562 +0x3ba
github.com/docker/docker/api/server.*Server.(github.com/docker/docker/api/server.getContainersJSON)·fm(0xc3f700, 0x4, 0x7f214584f110, 0xc20a306d20, 0xc209d4a9c0, 0xc20a09fa10, 0x0, 0x0)
/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.7.1-r4/work/docker-1.7.1/.gopath/src/github.com/docker/docker/api/server/server.go:1526 +0x7b
github.com/docker/docker/api/server.func·008(0x7f214584f110, 0xc20a306d20, 0xc209d4a9c0)
/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.7.1-r4/work/docker-1.7.1/.gopath/src/github.com/docker/docker/api/server/server.go:1501 +0xacd
net/http.HandlerFunc.ServeHTTP(0xc208033380, 0x7f214584f110, 0xc20a306d20, 0xc209d4a9c0)
/usr/lib/go/src/net/http/server.go:1265 +0x41
github.com/gorilla/mux.(*Router).ServeHTTP(0xc2080b1090, 0x7f214584f110, 0xc20a306d20, 0xc209d4a9c0)
/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.7.1-r4/work/docker-1.7.1/vendor/src/github.com/gorilla/mux/mux.go:98 +0x297
net/http.serverHandler.ServeHTTP(0xc20804f080, 0x7f214584f110, 0xc20a306d20, 0xc209d4a9c0)
/usr/lib/go/src/net/http/server.go:1703 +0x19a
net/http.(*conn).serve(0xc208743680)
/usr/lib/go/src/net/http/server.go:1204 +0xb57
created by net/http.(*Server).Serve
/usr/lib/go/src/net/http/server.go:1751 +0x35e

Full trace here:
https://gist.github.com/yifan-gu/ac0cbc2a59a7b8c3fe2d

I will try to test on the latest version.

Tried another run, still with 1.7.1, but this time found something more interesting:

goroutine 114 [syscall, 50 minutes]:
syscall.Syscall6(0x3d, 0x514, 0xc2084e74fc, 0x0, 0xc208499950, 0x0, 0x0, 0x44199c, 0x441e22, 0xb28140)
/usr/lib/go/src/syscall/asm_linux_amd64.s:46 +0x5
syscall.wait4(0x514, 0xc2084e74fc, 0x0, 0xc208499950, 0x90, 0x0, 0x0)
/usr/lib/go/src/syscall/zsyscall_linux_amd64.go:124 +0x79
syscall.Wait4(0x514, 0xc2084e7544, 0x0, 0xc208499950, 0x41a768, 0x0, 0x0)
/usr/lib/go/src/syscall/syscall_linux.go:224 +0x60
os.(*Process).wait(0xc2083e2b20, 0xc20848a860, 0x0, 0x0)
/usr/lib/go/src/os/exec_unix.go:22 +0x103
os.(*Process).Wait(0xc2083e2b20, 0xc2084e7608, 0x0, 0x0)
/usr/lib/go/src/os/doc.go:45 +0x3a
os/exec.(*Cmd).Wait(0xc2083c9a40, 0x0, 0x0)
/usr/lib/go/src/os/exec/exec.go:364 +0x23c
github.com/docker/libcontainer.(*initProcess).wait(0xc20822cf30, 0x1b6, 0x0, 0x0)
/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.7.1-r4/work/docker-1.7.1/vendor/src/github.com/docker/libcontainer/process_linux.go:194 +0x3d
github.com/docker/libcontainer.Process.Wait(0xc208374a30, 0x1, 0x1, 0xc20839b000, 0x47, 0x80, 0x127e348, 0x0, 0x127e348, 0x0, ...)
/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.7.1-r4/work/docker-1.7.1/vendor/src/github.com/docker/libcontainer/process.go:60 +0x11d
github.com/docker/libcontainer.Process.Wait·fm(0xc2084e7ac8, 0x0, 0x0)
/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.7.1-r4/work/docker-1.7.1/.gopath/src/github.com/docker/docker/daemon/execdriver/native/driver.go:164 +0x58
github.com/docker/docker/daemon/execdriver/native.(*driver).Run(0xc20813c230, 0xc20832a900, 0xc20822cc00, 0xc208374900, 0x0, 0x41a900, 0x0, 0x0)
/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.7.1-r4/work/docker-1.7.1/.gopath/src/github.com/docker/docker/daemon/execdriver/native/driver.go:170 +0xa0a
github.com/docker/docker/daemon.(*Daemon).Run(0xc2080a5880, 0xc2080a21e0, 0xc20822cc00, 0xc208374900, 0x0, 0xc2084b6000, 0x0, 0x0)
/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.7.1-r4/work/docker-1.7.1/.gopath/src/github.com/docker/docker/daemon/daemon.go:1068 +0x95
github.com/docker/docker/daemon.(*containerMonitor).Start(0xc20853d180, 0x0, 0x0)
/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.7.1-r4/work/docker-1.7.1/.gopath/src/github.com/docker/docker/daemon/monitor.go:138 +0x457
github.com/docker/docker/daemon.*containerMonitor.Start·fm(0x0, 0x0)
/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.7.1-r4/work/docker-1.7.1/.gopath/src/github.com/docker/docker/daemon/container.go:732 +0x39
github.com/docker/docker/pkg/promise.func·001()
/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.7.1-r4/work/docker-1.7.1/.gopath/src/github.com/docker/docker/pkg/promise/promise.go:8 +0x2f
created by github.com/docker/docker/pkg/promise.Go
/build/amd64-usr/var/tmp/portage/app-emulation/docker-1.7.1-r4/work/docker-1.7.1/.gopath/src/github.com/docker/docker/pkg/promise/promise.go:9 +0xfb

I saw a bunch of such goroutines hanging here. If Container.Start() hangs, the container lock is never released, and every subsequent docker ps will hang. (This seems to also be true for v1.9.0-rc1.)

Though I am not sure why Container.Start() hangs, or whether this is the only cause of docker ps hanging.

https://github.com/docker/docker/blob/v1.9.0-rc1/daemon/container.go#L243
https://github.com/docker/docker/blob/v1.9.0-rc1/daemon/container.go#L304
https://github.com/docker/docker/blob/v1.9.0-rc1/daemon/list.go#L113

I think we should avoid holding a mutex across syscalls like these anyway...

That's a major issue for me; is anyone from Docker looking into this?

We have a similar issue under Docker 1.8.2; I suspect there are goroutine or other leaks going on in the daemon.

When put under fast creation/deletion stress, docker ps takes forever to finish, and eventually the Docker daemon becomes unresponsive.

I'm having similar issues with 1.8.2. I tried 1.9 rc2 and had similar issues: lots of hangs, and restarting the docker daemon sometimes corrected things but more often didn't.

I'd be curious to know, when something hangs, how long it hangs before it's killed.
Does it ever return if you just let it wait it out?

Though I didn't time it, I'd say it was at least over twenty minutes. I've started using docker-compose kill vs. stop and it seems better, but time will tell. I don't see anything obvious from the logs either.

We see this too on CentOS 7.1, Docker 1.8.2. My gut tells me it's a devicemapper/loopback problem.

Next on our list is to try this: https://github.com/projectatomic/docker-storage-setup (h/t @ibuildthecloud )

Also experiencing this and it is not under heavy load.

Centos 7

Docker version

Client:
 Version:      1.8.2
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   0a8c2e3
 Built:        Thu Sep 10 19:08:45 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.8.2
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   0a8c2e3
 Built:        Thu Sep 10 19:08:45 UTC 2015
 OS/Arch:      linux/amd64


Docker info

Containers: 4
Images: 40
Storage Driver: devicemapper
 Pool Name: docker-202:1-25190844-pool
 Pool Blocksize: 65.54 kB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 2.914 GB
 Data Space Total: 107.4 GB
 Data Space Available: 81.05 GB
 Metadata Space Used: 4.03 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.143 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.10.0-123.8.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 4
Total Memory: 6.898 GiB
Name: ip-10-50-185-203
ID: EOMR:4G7Y:7H6O:QOXE:WW4Z:3R2G:HLI4:ZMXY:GKF3:YKSK:TIHC:MHZF
[centos@ip-10-50-185-203 ~]$ uname -a
Linux ip-10-50-185-203 3.10.0-123.8.1.el7.x86_64 #1 SMP Mon Sep 22 19:06:58 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

docker images works but docker ps hangs.

Last few lines of strace output:

clock_gettime(CLOCK_MONOTONIC, {2393432, 541406232}) = 0
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
connect(3, {sa_family=AF_LOCAL, sun_path="/run/systemd/journal/socket"}, 30) = 0
epoll_create1(EPOLL_CLOEXEC)            = 4
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2856433208, u64=139812631852600}}) = 0
getsockname(3, {sa_family=AF_LOCAL, NULL}, [2]) = 0
getpeername(3, {sa_family=AF_LOCAL, sun_path="/run/systemd/journal/socket"}, [30]) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 542834304}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 542897330}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 543010460}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 543090332}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 543157219}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 543208557}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 543306537}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 543364486}) = 0
clock_gettime(CLOCK_REALTIME, {1446718224, 108316907}) = 0
mmap(0xc208100000, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xc208100000
mmap(0xc207fe0000, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xc207fe0000
clock_gettime(CLOCK_MONOTONIC, {2393432, 543814528}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 543864338}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 543956865}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 544018495}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 544402150}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 544559595}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 544607443}) = 0
clock_gettime(CLOCK_REALTIME, {1446718224, 109474379}) = 0
epoll_wait(4, {{EPOLLOUT, {u32=2856433208, u64=139812631852600}}}, 128, 0) = 1
clock_gettime(CLOCK_MONOTONIC, {2393432, 544968692}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 545036728}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 545095771}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 545147947}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 545199057}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 545251039}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 545308858}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 545756723}) = 0
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
clock_gettime(CLOCK_REALTIME, {1446718224, 112187655}) = 0
clock_gettime(CLOCK_REALTIME, {1446718224, 112265169}) = 0
clock_gettime(CLOCK_REALTIME, {1446718224, 112345304}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 547677486}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 547743669}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 547801800}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 547864215}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 547934364}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 548042167}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 548133574}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 548209405}) = 0
clock_gettime(CLOCK_REALTIME, {1446718224, 113124453}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 548493023}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 548566472}) = 0
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
futex(0xc208020ed8, FUTEX_WAKE, 1)      = 1
clock_gettime(CLOCK_MONOTONIC, {2393432, 549410983}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 549531015}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 549644468}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 549713961}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 549800266}) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 549864335}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
stat("/root/.docker/config.json", 0xc208052900) = -1 ENOENT (No such file or directory)
stat("/root/.dockercfg", 0xc208052990)  = -1 ENOENT (No such file or directory)
clock_gettime(CLOCK_REALTIME, {1446718224, 115099477}) = 0
clock_gettime(CLOCK_REALTIME, {1446718224, 115153125}) = 0
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 5
setsockopt(5, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
connect(5, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, 23) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 550603891}) = 0
clock_gettime(CLOCK_REALTIME, {1446718224, 115517051}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2856433016, u64=139812631852408}}) = 0
getsockname(5, {sa_family=AF_LOCAL, NULL}, [2]) = 0
getpeername(5, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, [23]) = 0
clock_gettime(CLOCK_MONOTONIC, {2393432, 550961223}) = 0
read(5, 0xc208131000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {2393432, 551138398}) = 0
write(5, "GET /v1.20/containers/json HTTP/"..., 108) = 108
epoll_wait(4, {{EPOLLOUT, {u32=2856433016, u64=139812631852408}}}, 128, 0) = 1
epoll_wait(4, 

@chbatey can you edit your comment, and add the output of docker version?

As for me and mine, color my face red. I ran the daemon in debug and discovered a shared volume that was a cifs mount hanging. Once I took care of that, things are working fine for me now.

@pspierce no problem, thank you for reporting back!

I'm interested to hear if this is still an issue on 1.9 for anyone here

we're also seeing this with 1.8.2 on CentOS 7.1 quite frequently, but only on our high-churn hosts (~2,100 containers/hour). It doesn't seem to impact our identically configured but lower-volume (~300 containers/hour) hosts, so it seems to be some kind of deadlock triggered by lots of concurrent operations. We see it after roughly every 6 hours of high-churn activity, and so far the only solution we've found is to run regular (roughly every 3 hours) daemon restarts that nuke /var/lib/docker/containers and /var/lib/docker/volumes while the daemon is down, in order to bring the host up "fresh" without tossing our images; after that, docker ps and API-driven container-start calls get snappy again (rough sketch of the restart cycle below).
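Roughly, each restart cycle looks like this (sketch only; the systemd unit name and paths match the CentOS 7.1 setup described here, and scheduling it from cron is an assumption):

#!/bin/bash
# "Fresh restart": drop container and volume state but keep images,
# then bring the daemon back up.
systemctl stop docker
rm -rf /var/lib/docker/containers/* /var/lib/docker/volumes/*
systemctl start docker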

we'll give 1.9 a try this coming week and see if it helps at all (fingers crossed!); fwiw, we did not encounter this gradual responsiveness degradation (and eventual deadlock) on 1.6.2.

here's the details re: our current setup:

PRODUCTION [[email protected] ~]$ docker -D info
Containers: 8
Images: 41
Storage Driver: overlay
 Backing Filesystem: xfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.3.0-1.el7.elrepo.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 4
Total Memory: 7.796 GiB
Name: cc-docker01.prod.iad01.treehouse
ID: AB4S:SO4Z:AEOS:MC3H:XPRV:SIH4:VE2N:ELYX:TA5S:QQWP:WDAP:DUKE
Username: xxxxxxxxxxxxxx
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

PRODUCTION [[email protected] ~]$ docker -D version
Client:
 Version:      1.8.2
 API version:  1.20
 Package Version: docker-1.8.2-7.el7.centos.x86_64
 Go version:   go1.4.2
 Git commit:   bb472f0/1.8.2
 Built:        
 OS/Arch:      linux/amd64

Server:
 Version:      1.8.2
 API version:  1.20
 Package Version: 
 Go version:   go1.4.2
 Git commit:   bb472f0/1.8.2
 Built:        
 OS/Arch:      linux/amd64

Also seeing this on Ubuntu 14.04, with only 8 containers running but high disk and CPU load. The containers are restarted as they complete, so there are always 8 running. The daemon dies after anything from a few hundred to a couple of thousand cumulative container runs. I did not see the daemon hang when I was running over loopback, but it's happened twice in the last couple of days using the thinpool. This is on a 40-core workstation with 64 GB of RAM.

Docker Version:
~$ docker version
Client:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5
Built: Fri Nov 20 13:12:04 UTC 2015
OS/Arch: linux/amd64

Server:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5
Built: Fri Nov 20 13:12:04 UTC 2015
OS/Arch: linux/amd64

--- docker images works but docker ps hangs; docker info also hangs. Here's the end of the strace for docker ps:
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, 23) = 0
epoll_create1(EPOLL_CLOEXEC) = 4
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3565442144, u64=140517715498080}}) = 0
getsockname(3, {sa_family=AF_LOCAL, NULL}, [2]) = 0
getpeername(3, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, [23]) = 0
read(3, 0xc208506000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
write(3, "GET /v1.21/containers/json HTTP/"..., 108) = 108
epoll_wait(4, {{EPOLLOUT, {u32=3565442144, u64=140517715498080}}}, 128, 0) = 1
epoll_wait(4,

We had this under 1.7.1, and worked around it very effectively (we saw it every few hours under 1.7.1, but never for 1+ month post workaround):

udevadm control --timeout=300

We are running RHEL7.1. We just did an upgrade to Docker 1.8.2 without any other changes, and our app locked it up within a few hours. Strace:

[pid 4200] open("/sys/fs/cgroup/freezer/system.slice/docker-bcb29dad6848d250df7508f85e78ca9b83d40f0e22011869d89a176e27b7ef87.scope/freezer.state", O_RDONLY|O_CLOEXEC) = 36
[pid 4200] fstat(36, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
[pid 4239] <... select resumed> ) = 0 (Timeout)
[pid 4200] read(36, [pid 4239] select(0, NULL, NULL, NULL, {0, 20} [pid 4200] <... read resumed> "FREEZING\n", 512) = 9
[pid 4200] read(36, "", 1527) = 0
[pid 4200] close(36) = 0
[pid 4200] futex(0x1778e78, FUTEX_WAKE, 1 [pid 4239] <... select resumed> ) = 0 (Timeout)
[pid 5252] <... futex resumed> ) = 0
[pid 4200] <... futex resumed> ) = 1
[pid 5252] futex(0xc2085daed8, FUTEX_WAIT, 0, NULL [pid 4239] select(0, NULL, NULL, NULL, {0, 20} [pid 4200] futex(0xc2085daed8, FUTEX_WAKE, 1 [pid 5252] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 4200] <... futex resumed> ) = 0
[pid 5252] epoll_wait(4, [pid 4200] futex(0x1778e78, FUTEX_WAIT, 0, {0, 958625} [pid 5252] <... epoll_wait resumed> {}, 128, 0) = 0
[pid 5252] futex(0xc2085daed8, FUTEX_WAIT, 0, NULL [pid 4239] <... select resumed> ) = 0 (Timeout)


UPDATE: It seems my hypothesis about interleaved docker run / docker rm was unsound. Doing many simultaneous docker runs alone was sufficient to trigger the behavior. I also tried switching to a dm thin pool, but that didn't help either. My workaround is simply ensuring that multiple containers are never started at the "same" time, i.e. I add gaps of at least ~10-30 seconds between starts; see the sketch below. The daemon has now been running without any problems for over 2 weeks. End of update.
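
A minimal sketch of that serialized-start workaround (the image name, container count, and 20-second gap below are placeholders, not values from the report above):

# start containers one at a time with a fixed gap instead of launching them concurrently
IMAGE=alpine      # placeholder image
GAP=20            # seconds between starts; the report used gaps of ~10-30s
for i in $(seq 1 8); do
  docker run -d "$IMAGE" sh -c "echo $i; sleep 300"
  sleep "$GAP"    # never issue two create/start calls at the "same" time
done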

Just to add one more confirmation: we are seeing this in 1.9.0. We are not even spawning that much, at most 8 containers (1 per core) at a time and no more than ~40 per hour. One thing we have in common with the original report is that we also end up doing a few interleaved invocations of docker run and docker rm -f. My gut feeling is that it's the combination of many concurrent container creates and deletes that triggers the deadlock.

$ docker version
Client:
 Version:      1.9.0
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   76d6bc9
 Built:        Tue Nov  3 18:00:05 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.9.0
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   76d6bc9
 Built:        Tue Nov  3 18:00:05 UTC 2015
 OS/Arch:      linux/amd64
$ docker -D info
Containers: 10
Images: 119
Server Version: 1.9.0
Storage Driver: devicemapper
 Pool Name: docker-253:1-2114818-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 107.4 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 7.234 GB
 Data Space Total: 107.4 GB
 Data Space Available: 42.64 GB
 Metadata Space Used: 10.82 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.137 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.10.0-229.20.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 8
Total Memory: 15.51 GiB

@mjsalinger We have the same problem as you, with the same setup. Did you manage to solve it?

Same issue here

~ # docker info
Containers: 118
Images: 2239
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.2-coreos-r1
Operating System: CoreOS 835.8.0
CPUs: 16
Total Memory: 29.44 GiB
Username: util-user
Registry: https://index.docker.io/v1/
Client:
 Version:      1.8.3
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   cedd534-dirty
 Built:        Tue Dec  1 02:00:58 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.8.3
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   cedd534-dirty
 Built:        Tue Dec  1 02:00:58 UTC 2015
 OS/Arch:      linux/amd64

Our logs are riddled with things like:

time="2016-01-08T21:38:34.735146159Z" level=error msg="Failed to compute size of container rootfs e4565ff148266389cbf70af9f22f9c62aa255bacaec0925a72eab5d0dca9da5f: Error getting container e4565ff148266389cbf70af9f22f9c62aa255bacaec0925a72eab5d0dca9da5f from driver overlay: stat /var/lib/docker/overlay/e4565ff148266389cbf70af9f22f9c62aa255bacaec0925a72eab5d0dca9da5f: no such file or directory"

and

time="2016-01-08T21:42:34.846701169Z" level=error msg="Handler for GET /containers/json returned error: write unix @: broken pipe"
time="2016-01-08T21:42:34.846717812Z" level=error msg="HTTP Error" err="write unix @: broken pipe" statusCode=500

But I'm not sure if any of it has anything to do with the hanging issue.

Including a dump of all the docker daemon's goroutine stacks (obtained by sending SIGUSR1 to the daemon) might be helpful; see the example below.
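
For reference, a minimal way to capture such a dump (assuming a systemd host where the daemon logs to the journal; the process name differs between versions, so adjust the lookup):

# send SIGUSR1 to the daemon; it keeps running and writes a goroutine stack dump to its logs
sudo kill -USR1 "$(pgrep -o -f 'docker daemon|dockerd')"
# then read the dump back out of the daemon logs
sudo journalctl -u docker.service --no-pager | tail -n 2000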

@whosthatknocking didn't even know I could do that, cool.

time="2016-01-08T22:20:16.181468178Z" level=info msg="=== BEGIN goroutine stack dump ===
goroutine 11 [running]:
github.com/docker/docker/pkg/signal.DumpStacks()
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/pkg/signal/trap.go:60 +0x7a
github.com/docker/docker/daemon.func·025()
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/daemon/debugtrap_unix.go:18 +0x6d
created by github.com/docker/docker/daemon.setupDumpStackTrap
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/daemon/debugtrap_unix.go:20 +0x18e

goroutine 1 [chan receive, 33262 minutes]:
main.(*DaemonCli).CmdDaemon(0xc20807d1a0, 0xc20800a020, 0x1, 0x1, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/docker/daemon.go:289 +0x1781
reflect.callMethod(0xc208142060, 0xc20842fce0)
    /usr/lib/go/src/reflect/value.go:605 +0x179
reflect.methodValueCall(0xc20800a020, 0x1, 0x1, 0x1, 0xc208142060, 0x0, 0x0, 0xc208142060, 0x0, 0x452ecf, ...)
    /usr/lib/go/src/reflect/asm_amd64.s:29 +0x36
github.com/docker/docker/cli.(*Cli).Run(0xc20810df80, 0xc20800a010, 0x2, 0x2, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/cli/cli.go:89 +0x38e
main.main()
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/docker/docker.go:69 +0x428

goroutine 5 [syscall]:
os/signal.loop()
    /usr/lib/go/src/os/signal/signal_unix.go:21 +0x1f
created by os/signal.init·1
    /usr/lib/go/src/os/signal/signal_unix.go:27 +0x35

goroutine 17 [syscall, 33262 minutes, locked to thread]:
runtime.goexit()
    /usr/lib/go/src/runtime/asm_amd64.s:2232 +0x1

goroutine 13 [IO wait, 33262 minutes]:
net.(*pollDesc).Wait(0xc2081eb100, 0x72, 0x0, 0x0)
    /usr/lib/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc2081eb1
00, 0x0, 0x0)
    /usr/lib/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).readMsg(0xc2081eb0a0, 0xc2081bd9a0, 0x10, 0x10, 0xc208440020, 0x1000, 0x1000, 0xffffffffffffffff, 0x0, 0x0, ...)
    /usr/lib/go/src/net/fd_unix.go:296 +0x54e
net.(*UnixConn).ReadMsgUnix(0xc2080384b8, 0xc2081bd9a0, 0x10, 0x10, 0xc208440020, 0x1000, 0x1000, 0x51, 0xc2081bd6d4, 0x4, ...)
    /usr/lib/go/src/net/unixsock_posix.go:147 +0x167
github.com/godbus/dbus.(*oobReader).Read(0xc208440000, 0xc2081bd9a0, 0x10, 0x10, 0xc208440000, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/vendor/src/github.com/godbus/dbus/transport_unix.go:21 +0xc5
io.ReadAtLeast(0x7f9ad52e9310, 0xc208440000, 0xc2081bd9a0, 0x10, 0x10, 0x10, 0x0, 0x0, 0x0)
    /usr/lib/go/src/io/io.go:298 +0xf1
io.ReadFull(0x7f9ad52e9310, 0xc208440000, 0xc2081bd9a0, 0x10, 0x10, 0x51, 0x0, 0x0)
    /usr/lib/go/src/io/io.go:316 +0x6d
github.com/godbus/dbus.(*unixTransport).ReadMessage(0xc2081df900, 0xc208115170, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/vendor/src/github.com/godbus/dbus/transport_unix.go:85 +0x1bf
github.com/godbus/dbus.(*Conn).inWorker(0xc208418480)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/vendor/src/github.com/godbus/dbus/conn.go:248 +0x58
created by github.com/godbus/dbus.(*Conn).Auth
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/vendor/src/github.com/godbus/dbus/auth.go:118 +0xe84

goroutine 10 [chan receive, 33262 minutes]:
github.com/docker/docker/api/server.(*Server).ServeApi(0xc208037800, 0xc20807d3d0, 0x1, 0x1, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/api/server/server.go:111 +0x74f
main.func·007()
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/docker/daemon.go:239 +0x5b
created by main.(*DaemonCli).CmdDaemon
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-
1.8.3/work/docker-1.8.3/docker/daemon.go:245 +0xce9

goroutine 14 [chan receive, 33262 minutes]:
github.com/godbus/dbus.(*Conn).outWorker(0xc208418480)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/vendor/src/github.com/godbus/dbus/conn.go:370 +0x58
created by github.com/godbus/dbus.(*Conn).Auth
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/vendor/src/github.com/godbus/dbus/auth.go:119 +0xea1

goroutine 15 [chan receive, 33262 minutes]:
github.com/docker/libnetwork/iptables.signalHandler()
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/vendor/src/github.com/docker/libnetwork/iptables/firewalld.go:92 +0x57
created by github.com/docker/libnetwork/iptables.FirewalldInit
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/vendor/src/github.com/docker/libnetwork/iptables/firewalld.go:48 +0x185

goroutine 50 [chan receive, 33262 minutes]:
database/sql.(*DB).connectionOpener(0xc2081aa0a0)
    /usr/lib/go/src/database/sql/sql.go:589 +0x4c
created by database/sql.Open
    /usr/lib/go/src/database/sql/sql.go:452 +0x31c

goroutine 51 [IO wait]:
net.(*pollDesc).Wait(0xc2081dcd10, 0x72, 0x0, 0x0)
    /usr/lib/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc2081dcd10, 0x0, 0x0)
    /usr/lib/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).readMsg(0xc2081dccb0, 0xc20b38c970, 0x10, 0x10, 0xc20bf3b220, 0x1000, 0x1000, 0xffffffffffffffff, 0x0, 0x0, ...)
    /usr/lib/go/src/net/fd_unix.go:296 +0x54e
net.(*UnixConn).ReadMsgUnix(0xc208038618, 0xc20b38c970, 0x10, 0x10, 0xc20bf3b220, 0x1000, 0x1000, 0x35, 0xc20b38c784, 0x4, ...)
    /usr/lib/go/src/net/unixsock_posix.go:147 +0x167
github.com/godbus/dbus.(*oobReader).Read(0xc20bf3b200, 0xc20b38c970, 0x10, 0x10, 0xc20bf3b200, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/vendor/src/github.com/godbus/dbus/transport_unix.go:21 +0xc5
io.ReadAtLeast(0x7f9ad52e9310, 0xc20bf3b200, 0x
c20b38c970, 0x10, 0x10, 0x10, 0x0, 0x0, 0x0)
    /usr/lib/go/src/io/io.go:298 +0xf1
io.ReadFull(0x7f9ad52e9310, 0xc20bf3b200, 0xc20b38c970, 0x10, 0x10, 0x35, 0x0, 0x0)
    /usr/lib/go/src/io/io.go:316 +0x6d
github.com/godbus/dbus.(*unixTransport).ReadMessage(0xc208176950, 0xc208471470, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/vendor/src/github.com/godbus/dbus/transport_unix.go:85 +0x1bf
github.com/godbus/dbus.(*Conn).inWorker(0xc2080b0fc0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/vendor/src/github.com/godbus/dbus/conn.go:248 +0x58
created by github.com/godbus/dbus.(*Conn).Auth
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/vendor/src/github.com/godbus/dbus/auth.go:118 +0xe84

goroutine 52 [chan receive]:
github.com/godbus/dbus.(*Conn).outWorker(0xc2080b0fc0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/vendor/src/github.com/godbus/dbus/conn.go:370 +0x58
created by github.com/godbus/dbus.(*Conn).Auth
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/vendor/src/github.com/godbus/dbus/auth.go:119 +0xea1

goroutine 53 [chan receive]:
github.com/coreos/go-systemd/dbus.func·001()
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/vendor/src/github.com/coreos/go-systemd/dbus/subscription.go:70 +0x64
created by github.com/coreos/go-systemd/dbus.(*Conn).initDispatch
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/vendor/src/github.com/coreos/go-systemd/dbus/subscription.go:94 +0x11c

goroutine 54 [chan receive]:
github.com/docker/docker/daemon.(*statsCollector).run(0xc20844dad0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/daemon/stats_collector_unix.go:91 +0xb2
created by github.com/docker/docker/daemon.newStatsCollector
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/wo
rk/docker-1.8.3/.gopath/src/github.com/docker/docker/daemon/stats_collector_unix.go:31 +0x116

goroutine 55 [chan receive, 2 minutes]:
github.com/docker/docker/daemon.(*Daemon).execCommandGC(0xc2080908c0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/daemon/exec.go:256 +0x8c
created by github.com/docker/docker/daemon.NewDaemon
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/daemon/daemon.go:736 +0x2358

goroutine 774 [semacquire, 33261 minutes]:
sync.(*Cond).Wait(0xc208460030)
    /usr/lib/go/src/sync/cond.go:62 +0x9e
io.(*pipe).read(0xc208460000, 0xc208c0bc00, 0x400, 0x400, 0x0, 0x0, 0x0)
    /usr/lib/go/src/io/pipe.go:52 +0x303
io.(*PipeReader).Read(0xc208b52750, 0xc208c0bc00, 0x400, 0x400, 0x1f, 0x0, 0x0)
    /usr/lib/go/src/io/pipe.go:134 +0x5b
github.com/docker/docker/pkg/ioutils.(*bufReader).drain(0xc2084600c0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/pkg/ioutils/readers.go:116 +0x10e
created by github.com/docker/docker/pkg/ioutils.NewBufReader
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/pkg/ioutils/readers.go:86 +0x2f3

goroutine 784 [semacquire, 33261 minutes]:
sync.(*Cond).Wait(0xc208460570)
    /usr/lib/go/src/sync/cond.go:62 +0x9e
io.(*pipe).read(0xc208460540, 0xc208963000, 0x400, 0x400, 0x0, 0x0, 0x0)
    /usr/lib/go/src/io/pipe.go:52 +0x303
io.(*PipeReader).Read(0xc208b529d8, 0xc208963000, 0x400, 0x400, 0x1f, 0x0, 0x0)
    /usr/lib/go/src/io/pipe.go:134 +0x5b
github.com/docker/docker/pkg/ioutils.(*bufReader).drain(0xc208460600)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/pkg/ioutils/readers.go:116 +0x10e
created by github.com/docker/docker/pkg/ioutils.NewBufReader
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/wo
rk/docker-1.8.3/.gopath/src/github.com/docker/docker/pkg/ioutils/readers.go:86 +0x2f3

goroutine 757 [semacquire, 33261 minutes]:
sync.(*Cond).Wait(0xc208141cb0)
    /usr/lib/go/src/sync/cond.go:62 +0x9e
github.com/docker/docker/pkg/ioutils.(*bufReader).Read(0xc208141c80, 0xc2089f2000, 0x1000, 0x1000, 0x0, 0x7f9ad52d4080, 0xc20802a0d0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/pkg/ioutils/readers.go:210 +0x158
bufio.(*Reader).fill(0xc208956ae0)
    /usr/lib/go/src/bufio/bufio.go:97 +0x1ce
bufio.(*Reader).ReadSlice(0xc208956ae0, 0x41d50a, 0x0, 0x0, 0x0, 0x0, 0x0)
    /usr/lib/go/src/bufio/bufio.go:295 +0x257
bufio.(*Reader).ReadBytes(0xc208956ae0, 0xc208141c0a, 0x0, 0x0, 0x0, 0x0, 0x0)
    /usr/lib/go/src/bufio/bufio.go:374 +0xd2
github.com/docker/docker/daemon/logger.(*Copier).copySrc(0xc2084def40, 0xf8ac00, 0x6, 0x7f9ad003e420, 0xc208141c80)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/daemon/logger/copier.go:47 +0x96
created by github.com/docker/docker/daemon/logger.(*Copier).Run
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/daemon/logger/copier.go:38 +0x11c

goroutine 842 [chan receive, 33261 minutes]:
github.com/docker/docker/daemon.(*Container).AttachWithLogs(0xc208cc90e0, 0x0, 0x0, 0x7f9ad003e398, 0xc208471e30, 0x7f9ad003e398, 0xc208471e00, 0x100, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/daemon/container.go:934 +0x40d
github.com/docker/docker/daemon.(*Daemon).ContainerAttachWithLogs(0xc2080908c0, 0xc208cc90e0, 0xc208471dd0, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/daemon/attach.go:39 +0x42c
github.com/docker/docker/api/server.(*Server).postContainersAttach(0xc208037800, 0xc208b36007, 0x4, 0x7f9ad52f2870, 0xc20
87fde00, 0xc2087d4410, 0xc208471b90, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/api/server/server.go:1169 +0x5d1
github.com/docker/docker/api/server.*Server.(github.com/docker/docker/api/server.postContainersAttach)·fm(0xc208b36007, 0x4, 0x7f9ad52f2870, 0xc2087fde00, 0xc2087d4410, 0xc208471b90, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/api/server/server.go:1671 +0x7b
github.com/docker/docker/api/server.func·008(0x7f9ad52f2870, 0xc2087fde00, 0xc2087d4410)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/api/server/server.go:1614 +0xc8f
net/http.HandlerFunc.ServeHTTP(0xc2081cc580, 0x7f9ad52f2870, 0xc2087fde00, 0xc2087d4410)
    /usr/lib/go/src/net/http/server.go:1265 +0x41
github.com/gorilla/mux.(*Router).ServeHTTP(0xc20813cd70, 0x7f9ad52f2870, 0xc2087fde00, 0xc2087d4410)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/vendor/src/github.com/gorilla/mux/mux.go:98 +0x297
net/http.serverHandler.ServeHTTP(0xc2082375c0, 0x7f9ad52f2870, 0xc2087fde00, 0xc2087d4410)
    /usr/lib/go/src/net/http/server.go:1703 +0x19a
net/http.(*conn).serve(0xc2087fdd60)
    /usr/lib/go/src/net/http/server.go:1204 +0xb57
created by net/http.(*Server).Serve
    /usr/lib/go/src/net/http/server.go:1751 +0x35e

goroutine 789 [semacquire, 33261 minutes]:
sync.(*WaitGroup).Wait(0xc20882de60)
    /usr/lib/go/src/sync/waitgroup.go:132 +0x169
github.com/docker/docker/daemon.func·017(0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/daemon/container.go:1035 +0x42
github.com/docker/docker/pkg/promise.func·001()
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/pkg/promise/promise.go:8 +0x2f
created by github.com/docker/docker/pkg
/promise.Go
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/pkg/promise/promise.go:9 +0xfb

goroutine 788 [semacquire, 33261 minutes]:
sync.(*Cond).Wait(0xc2084607b0)
    /usr/lib/go/src/sync/cond.go:62 +0x9e
github.com/docker/docker/pkg/ioutils.(*bufReader).Read(0xc208460780, 0xc208a70000, 0x8000, 0x8000, 0x0, 0x7f9ad52d4080, 0xc20802a0d0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/pkg/ioutils/readers.go:210 +0x158
io.Copy(0x7f9ad003e398, 0xc208845860, 0x7f9ad003e420, 0xc208460780, 0x0, 0x0, 0x0)
    /usr/lib/go/src/io/io.go:362 +0x1f6
github.com/docker/docker/daemon.func·016(0xf8abc0, 0x6, 0x7f9ad003e398, 0xc208845860, 0x7f9ad003e3f0, 0xc208460780)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/daemon/container.go:1021 +0x245
created by github.com/docker/docker/daemon.attach
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.8.3/work/docker-1.8.3/.gopath/src/github.com/docker/docker/daemon/container.go:1032 +0x597

goroutine 769 [IO wait, 33261 minutes]:
net.(*pollDesc).Wait(0xc20881de20, 0x72, 0x0, 0x0)
    /usr/lib/go/src/net/fd_poll_runtime.go:84 +0x47
net.(*pollDesc).WaitRead(0xc20881de20, 0x0, 0x0)
    /usr/lib/go/src/net/fd_poll_runtime.go:89 +0x43
net.(*netFD).Read(0xc20881ddc0, 0xc2088ac000, 0x1000, 0x1000, 0x0, 0x7f9ad52d8340, 0xc20880d758)
    /usr/lib/go/src/net/fd_unix.go:242 +0x40f
net.(*conn).Read(0xc208b52360, 0xc2088ac000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
    /usr/lib/go/src/net/net.go:121 +0xdc
net/http.(*liveSwitchReader).Read(0xc208792c28, 0xc2088ac000, 0x1000, 0x1000, 0x2, 0x0, 0x0)
    /usr/lib/go/src/net/http/server.go:214 +0xab
io.(*LimitedReader).Read(0xc20889f960, 0xc2088ac000, 0x1000, 0x1000, 0x2, 0x0, 0x0)
    /usr/lib/go/src/io/io.go:408 +0xce
bufio.(*Reader).fill(0xc2082544e0)
    /usr/lib/go/src/bufio/bufio.go:97 +0x1ce
bufio.(*Reader).ReadSlice(0xc208254
4e0, 0xc20889db0a, 0x0, 0x0, 0x0, 0x0, 0x0)
    /usr/lib/go/src/bufio/bufio.go:295 +0x257
bufio.(*Reader).ReadLine(0xc2082544e0, 0x0, 0x0, 0x0, 0xc208bc2700, 0x0, 0x0)
    /usr/lib/go/src/bufio/bufio.go:324 +0x62
net/textproto.(*Reader).readLineSlice(0xc208845560, 0x0, 0x0, 0x0, 0x0, 0x0)
    /usr/lib/go/src/net/textproto/reader.go:55 +0x9e
net/textproto.(*Reader).ReadLine(0xc208845560, 0x0, 0x0, 0x0, 0x0)
    /u
=== END goroutine stack dump ==="

I just had the hang; docker goroutines are attached: docker-hang.txt

I'm also seeing this repeatedly in the logs after a hang:
kernel: unregister_netdevice: waiting for veth2fb10a9 to become free. Usage count = 1

@thaJeztah possibly, however #5618 seems to only apply to lo and eth0, not the interface that the container was bound to.

The same issue happened when a heavy load of Kubernetes jobs was scheduled. docker ps never returns; even curl --unix-socket /var/run/docker.sock http:/containers/json hangs. I have to restart the docker daemon to recover. What's worse, it takes a couple of minutes to restart the docker daemon!

$ docker version
Client:
 Version:      1.8.3
 API version:  1.20
 Go version:   go1.4.3
 Git commit:   cedd534-dirty
 Built:        Thu Nov 19 23:12:57 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.8.3
 API version:  1.20
 Go version:   go1.4.3
 Git commit:   cedd534-dirty
 Built:        Thu Nov 19 23:12:57 UTC 2015
 OS/Arch:      linux/amd64
$ docker info
Containers: 57
Images: 100
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.2-coreos-r2
Operating System: CoreOS 870.2.0
CPUs: 32
Total Memory: 251.9 GiB
Name: CNPVG50853311
ID: TV5P:6OHQ:QTSN:IF6K:5LPX:TSHS:DEAW:TQSF:QTOT:45NO:JH2X:DMSE
Http Proxy: http://proxy.wdf.sap.corp:8080
Https Proxy: http://proxy.wdf.sap.corp:8080
No Proxy: hyper.cd,anywhere.cd

Seeing something similar in our environment; I've dumped the info below to aid any debugging. The stats are from two high-spec physical RancherOS 4.2 hosts running around 70 containers each. We see the performance of docker degrade; I've not managed to track down the trigger, but a restart of the docker daemon does resolve the issue.

Docker version

Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   994543b
 Built:        Mon Nov 23 06:08:57 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   994543b
 Built:        Mon Nov 23 06:08:57 UTC 2015
 OS/Arch:      linux/amd64

docker info

Containers: 35
Images: 1958
Server Version: 1.9.1
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.3-rancher
Operating System: RancherOS (containerized)
CPUs: 64
Total Memory: 251.9 GiB
Name: PTL-BCA-07
ID: Q5WF:7MOB:37YR:NRZR:P2FE:DVWV:W7XY:Z6OL:TAVC:4KCM:IUFU:LO4C

Stacktrace:
debug2.txt

+1 to this for us as well; we're definitely seeing it more often under higher load, running on CentOS 6.

Docker Version

Client version: 1.7.1
Client API version: 1.19
Go version (client): go1.4.2
Git commit (client): 786b29d
OS/Arch (client): linux/amd64
Server version: 1.7.1
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): 786b29d
OS/Arch (server): linux/amd64

Docker Info

Containers: 1
Images: 55
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.18.23-74.el6.x86_64
Operating System: <unknown>
CPUs: 40
Total Memory: 157.4 GiB
Name: lively-frost
ID: SXAC:IJ45:UY45:FIXS:7MWD:MZAE:XSE5:HV3Z:DU4Z:CYK6:LXPB:A65F

@ssalinas even if this is resolved, it wouldn't help your situation, given that you're running on a distribution we no longer support (CentOS 6) and on a custom kernel (CentOS 6 ships with 2.6.x). This issue won't be resolved for Docker 1.7, because we don't backport changes.

@thaJeztah Obviously there is some issue causing the blockage under heavy load. Can you confirm whether someone on the Docker team has identified it, and whether it has been resolved in version 1.9?

Apart from that, is there any way I can detect that Docker is hanging and no longer creating containers? If so, I could at least automate restarting the daemon and keep things running; see the watchdog sketch below.
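
A rough watchdog along those lines, as a sketch only (the 30-second timeout and the systemd unit name are assumptions, and restarting the daemon also kills running containers on these older versions):

#!/usr/bin/env bash
# consider the daemon hung if a trivial API call doesn't return in time, then restart it;
# run this from cron or a systemd timer
if ! timeout 30 docker ps > /dev/null 2>&1; then
  logger -t docker-watchdog "docker ps timed out; restarting the daemon"
  systemctl restart docker
fi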

We see the issue as well with the following configuration:

Docker 1.8.3 and CoreOS 835.10

docker ps hangs at high churn rates. You can trigger it 100% of the time by creating 200 or more containers as fast as possible (Kubernetes does this).

Same here
repro:

A_LOT=300 # whatever
for i in `seq 1 $A_LOT`; 
  do docker run --rm alpine sh -c "echo $i; sleep 30" & 
done
sleep 5   # give it some time to start spawning containers
docker ps # hangs
core@pph-01 ~ $ docker version
Client:
 Version:      1.8.3
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   cedd534-dirty
 Built:        Tue Feb  2 13:28:10 UTC 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.8.3
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   cedd534-dirty
 Built:        Tue Feb  2 13:28:10 UTC 2016
 OS/Arch:      linux/amd64
Containers: 291
Images: 14
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.2-coreos-r2
Operating System: CoreOS 835.12.0
CPUs: 2
Total Memory: 1.958 GiB
Name: pph-01
ID: P4FX:IMCD:KF5R:MG3Y:XECP:JPKO:IRWM:3MGK:IAIW:UG5S:6KCR:IK2J
Username: pmoust
Registry: https://index.docker.io/v1/

For the record, I ran the above bash script to try to reproduce the issue on my laptop, but nothing bad happened; all was fine. I run Ubuntu 15.10:

$ docker version
Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5
 Built:        Fri Nov 20 13:20:08 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5
 Built:        Fri Nov 20 13:20:08 UTC 2015
 OS/Arch:      linux/amd64

$ docker info
Containers: 1294
Images: 1231
Server Version: 1.9.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 3823
 Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.0-27-generic
Operating System: Ubuntu 15.10
CPUs: 8
Total Memory: 5.809 GiB
Name: pochama
ID: 6VMD:Z57I:QDFF:TBSU:FNSH:Y433:6KDS:2AXU:CVMH:JUKU:TVQQ:7UMR
Username: tomfotherby
Registry: https://index.docker.io/v1/
WARNING: No swap limit support

We have seen multiple cases of this issue on Docker 1.8.3 / CentOS 7. We're using Kubernetes, which can create a large number of containers. Occasionally our docker daemon becomes unresponsive (docker ps hangs for > 30 minutes) and only a restart of the daemon fixes the issue.

However I am unable to reproduce the issue using @pmoust's script.

Containers: 117
Images: 105
Storage Driver: devicemapper
 Pool Name: docker-docker--pool
 Pool Blocksize: 524.3 kB
 Backing Filesystem: xfs
 Data file: 
 Metadata file: 
 Data Space Used: 5.559 GB
 Data Space Total: 21.45 GB
 Data Space Available: 15.89 GB
 Metadata Space Used: 7.234 MB
 Metadata Space Total: 54.53 MB
 Metadata Space Available: 47.29 MB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Library Version: 1.02.107-RHEL7 (2015-10-14)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.10.0-327.4.5.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 2
Total Memory: 3.451 GiB
Name: server
ID: MK35:YN3L:KULL:C4YU:IOWO:6OK2:FHLO:WIYE:CBVE:MZBL:KG3T:OV5T
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Client:
 Version:      1.8.3
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   f4bf5c7
 Built:        Mon Oct 12 06:06:01 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.8.3
 API version:  1.20
 Go version:   go1.4.2
 Git commit:   f4bf5c7
 Built:        Mon Oct 12 06:06:01 UTC 2015
 OS/Arch:      linux/amd64

In @tomfotherby's case and yours, @andrewmichaelsmith, the storage drivers are aufs and devicemapper-xfs respectively.
I can consistently reproduce it when the storage driver is overlay.

I can replicate the hang using the script by @pmoust on CoreOS 835.12

Containers: 227
Images: 1
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.2-coreos-r2
Operating System: CoreOS 835.12.0
CPUs: 2
Total Memory: 3.864 GiB
Name: core-01
ID: WIYE:CGI3:2C32:LURO:MNHU:YQXU:NEU4:LPZG:YJM5:ZN3S:BBPF:3SXM

FWIW we think we may have pinned this down to this devicemapper kernel bug: https://bugzilla.redhat.com/show_bug.cgi?id=1292481

@andrewmichaelsmith is there an AWS AMI available that has the patch in it?

@pmoust yum update on our CentOS 7 AWS boxes has pulled down the new kernel from that bug report (kernel-3.10.0-327.10.1.el7).

edit: Though if you're using overlayfs I don't think this is relevant for you?

+1 on this issue. I am facing it with Docker 1.10.1 on kernel 4.4.1-coreos, on CoreOS Alpha 970.1.0 on AWS. This is causing serious instability in our kubernetes cluster, and my team is contemplating dropping Docker altogether.

Is there any active investigation of this issue?

@gopinatht The people in this issue seem to be affected by an array of different problems. For us it was a devicemapper kernel bug; fixing that resolved the issue, and it had nothing to do with the docker project at all.

@andrewmichaelsmith Thanks for the response. Is there any guidance on how to get to the bottom of the issue? An strace of the docker ps call gives this:

read(5, 0xc20807d000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
write(5, "GET /v1.22/containers/json HTTP/"..., 109) = 109
futex(0x1fe97a0, FUTEX_WAKE, 1) = 1
futex(0x1fe9720, FUTEX_WAKE, 1) = 1
epoll_wait(4, {}, 128, 0) = 0
futex(0x1fea698, FUTEX_WAIT, 0, NULL

This is very similar to several other reports here. What else could I do to get to the bottom of this?

@andrewmichaelsmith I tried it with the patched kernel.

Here are the results https://github.com/coreos/bugs/issues/1117#issuecomment-191190839

@pmoust Thanks for the update.

@andrewmichaelsmith @pmoust Do you have any insights on what we can do next to investigate further? A fix for this is absolutely critical for our team to move forward with a docker-based kubernetes cluster.

@gopinatht I don't work for docker/google or have a particularly vested interest in you using docker or kubernetes, no matter how critical it is for your team ;-)

I'm not even sure what storage driver you're using. To reiterate: we found an issue whereby docker would hang completely when using the devicemapper storage driver. Upgrading our kernel to kernel-3.10.0-327.10.1.el7 resolved it. I can't really add any more to this.

If you're not using devicemapper as your docker storage driver then my findings probably don't mean much for you.

@andrewmichaelsmith Apologies if I seemed to imply that you need to do something about this.

I was just looking for a direction for the investigation, as I am obviously lost here. Thanks for your help so far anyway.

We just got this on Ubuntu with a more recent kernel, 3.13.0-83-generic. Every other operation seems to be fine; the only issues are related to docker ps and the occasional docker inspect.

Docker info:

Containers: 51
 Running: 48
 Paused: 0
 Stopped: 3
Images: 92
Server Version: 1.10.3
Storage Driver: devicemapper
 Pool Name: docker-data-tpool
 Pool Blocksize: 65.54 kB
 Base Device Size: 5.369 GB
 Backing Filesystem: ext4
 Data file:
 Metadata file:
 Data Space Used: 19.14 GB
 Data Space Total: 102 GB
 Data Space Available: 82.86 GB
 Metadata Space Used: 33.28 MB
 Metadata Space Total: 5.369 GB
 Metadata Space Available: 5.335 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Library Version: 1.02.77 (2012-10-15)
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: host bridge null
Kernel Version: 3.13.0-83-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 29.96 GiB
Name: apocalypse.servers.lair.io
ID: Q444:V3WX:RIQJ:5B3T:SYFQ:H2TR:SPVF:U6YE:XNDX:HV2Z:CS7F:DEJJ
WARNING: No swap limit support

strace tail:

futex(0x20844d0, FUTEX_WAIT, 0, NULL)   = 0
futex(0xc82002ea10, FUTEX_WAKE, 1)      = 1
clock_gettime(CLOCK_MONOTONIC, {216144, 494093555}) = 0
clock_gettime(CLOCK_MONOTONIC, {216144, 506740959}) = 0
futex(0xc82002ea10, FUTEX_WAKE, 1)      = 1
clock_gettime(CLOCK_MONOTONIC, {216144, 506835134}) = 0
clock_gettime(CLOCK_MONOTONIC, {216144, 506958105}) = 0
futex(0x20844d0, FUTEX_WAIT, 0

Also, after some time (maybe a couple of minutes) I got the following output in the blocked strace, though it still never returned:

futex(0xc820574110, FUTEX_WAKE, 1)      = 1
futex(0xc820574110, FUTEX_WAKE, 1)      = 1
futex(0xc820574110, FUTEX_WAKE, 1)      = 1
futex(0xc820574110, FUTEX_WAKE, 1)      = 1
futex(0xc820574110, FUTEX_WAKE, 1)      = 1
futex(0xc820574110, FUTEX_WAKE, 1)      = 1
futex(0xc820574110, FUTEX_WAKE, 1)      = 1
futex(0xc820574110, FUTEX_WAKE, 1)      = 1
futex(0xc820574110, FUTEX_WAKE, 1)      = 1
futex(0xc820574110, FUTEX_WAKE, 1)      = 1
futex(0xc820574110, FUTEX_WAKE, 1)      = 1
futex(0xc820574110, FUTEX_WAKE, 1)      = 1
futex(0xc820574110, FUTEX_WAKE, 1)      = 1
futex(0xc820574110, FUTEX_WAKE, 1)      = 1
futex(0xc820574110, FUTEX_WAKE, 1)      = 1
futex(0xc820574110, FUTEX_WAKE, 1)      = 1
futex(0x20844d0, FUTEX_WAIT, 0, NULL

After waiting some more time, the same futex blocks, but this time there's also a 'Resource temporarily unavailable' error:

futex(0x20844d0, FUTEX_WAIT, 0, NULL)   = 0
futex(0xc820574110, FUTEX_WAKE, 1)      = 1
futex(0xc820574110, FUTEX_WAKE, 1)      = 1
clock_gettime(CLOCK_MONOTONIC, {216624, 607690506}) = 0
clock_gettime(CLOCK_MONOTONIC, {216624, 607853434}) = 0
futex(0x2083970, FUTEX_WAIT, 0, {0, 100000}) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(5, {}, 128, 0)               = 0
clock_gettime(CLOCK_MONOTONIC, {216624, 608219882}) = 0
futex(0x2083980, FUTEX_WAKE, 1)         = 1
futex(0xc820574110, FUTEX_WAKE, 1)      = 1
clock_gettime(CLOCK_MONOTONIC, {216624, 608587202}) = 0
clock_gettime(CLOCK_MONOTONIC, {216624, 609140069}) = 0
clock_gettime(CLOCK_MONOTONIC, {216624, 609185048}) = 0
clock_gettime(CLOCK_MONOTONIC, {216624, 609272020}) = 0
futex(0xc82002ea10, FUTEX_WAKE, 1)      = 1
clock_gettime(CLOCK_MONOTONIC, {216624, 616982914}) = 0
sched_yield()                           = 0
clock_gettime(CLOCK_MONOTONIC, {216624, 626726774}) = 0
futex(0x20844d0, FUTEX_WAIT, 0, NULL)   = 0
futex(0x2083970, FUTEX_WAKE, 1)         = 1
futex(0x20838c0, FUTEX_WAKE, 1)         = 1
sched_yield()                           = 0
futex(0x20838c0, FUTEX_WAIT, 2, NULL)   = 0
futex(0x20838c0, FUTEX_WAKE, 1)         = 0
futex(0x20844d0, FUTEX_WAIT, 0, NULL

Note that every time, the strace blocked on the same futex.

@akalipetis Can you send a SIGUSR1 to the process and pull the full stack trace from the daemon logs please?

I'll do this as soon as I encounter it again, @cpuguy83; sorry, I was not aware of that...

(Un)fortunately it happened again; you can find the complete stack trace in the attached file below:

stacktrace.txt

Let me know if you'd like anything else on my side.

/cc @cpuguy83

Much appreciated @akalipetis!

No problem @cpuguy83; let me know if you'd like any other data, or stack traces from future stalls.

Unfortunately, I could not find a way to reproduce this, or identify any pattern common to the hangs.

Since we updated to 1.11.0, running our rspec docker image tests (~10 parallel containers running these tests on a 4-CPU machine) sometimes freezes and fails with a timeout. Docker freezes completely and doesn't respond (e.g. to docker ps). This is happening on a vserver with Debian stretch (btrfs) and on a (Vagrant) Parallels VM with Ubuntu 14.04 (backported kernel 3.19.0-31-generic, ext4).

The filesystem for /var/lib/docker on both servers was cleared (the btrfs volume was recreated) after the first freeze. The freeze happens randomly when running these tests.

Stack trace is attached from both servers:
docker-log.zip

strace from docker-containerd and docker daemons:

# strace -p 21979 -p 22536
Process 21979 attached
Process 22536 attached
[pid 22536] futex(0x219bd90, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 21979] futex(0xf9b170, FUTEX_WAIT, 0, NULL

Docker info (Debian stretch)

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 1096
Server Version: 1.11.0
Storage Driver: btrfs
 Build Version: Btrfs v4.4
 Library Version: 101
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host bridge null
Kernel Version: 4.4.0-1-amd64
Operating System: Debian GNU/Linux stretch/sid
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 11.74 GiB
Name: slave.webdevops.io
ID: WU2P:YOYC:VP2F:N6DE:BINB:B6YO:2HJO:JKZA:HTP3:SIDA:QOJZ:PJ2E
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Username: webdevopsslave
Registry: https://index.docker.io/v1/
WARNING: No memory limit support
WARNING: No swap limit support
WARNING: No kernel memory limit support
WARNING: No oom kill disable support

Docker info (Ubuntu 14.04 with backported kernel)

Client:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:34:23 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:34:23 2016
 OS/Arch:      linux/amd64
root@DEV-VM:/var/lib/docker# docker info
Containers: 11
 Running: 1
 Paused: 0
 Stopped: 10
Images: 877
Server Version: 1.11.0
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 400
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 3.19.0-31-generic
Operating System: Ubuntu 14.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 3.282 GiB
Name: DEV-VM
ID: KCQP:OGCT:3MLX:TAQD:2XG6:HBG2:DPOM:GJXY:NDMK:BXCK:QEIT:D6KM
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/

Docker version (Debian stretch)

Client:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:22:26 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:22:26 2016
 OS/Arch:      linux/amd64

Docker version (Ubuntu 14.04 with backported kernel)

Client:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:34:23 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.0
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   4dc5990
 Built:        Wed Apr 13 18:34:23 2016
 OS/Arch:      linux/amd64

@mblaschke you want to use -f with strace for Go programs, AFAIK; see the example below.
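
For example (a sketch only; the PID lookup is illustrative, since the daemon process is docker daemon on 1.11.x and dockerd on 1.12+):

# -f follows all threads, which matters for Go programs like the daemon,
# where the interesting activity is rarely on the main thread
sudo strace -f -p "$(pgrep -o -f 'docker daemon|dockerd')"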

I can confirm this bug for Ubuntu 14.04, but I'm not sure whether it is really the problem on Debian stretch with the 4.4 kernel. Still trying to get more information.

The issue on Debian stretch with Docker 1.10.3 was systemd's process limit (it was 512; we raised it to 8192), and our Docker container tests now run without issues. 1.11.0 is still not working and the rspec tests still freeze, but docker ps remains responsive on Debian stretch with the 4.4 kernel, so I think that's a different issue.
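
In case it helps others, a sketch of one way to raise that limit, assuming the limit in question is the docker unit's TasksMax (the drop-in file name is arbitrary, and this requires a systemd version that supports TasksMax):

# create a drop-in override raising the task limit for the docker unit
sudo mkdir -p /etc/systemd/system/docker.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/docker.service.d/tasksmax.conf
[Service]
TasksMax=8192
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker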

Created a new issue #22124 to track the report from @mblaschke that may be something new.

also running into this I believe

$ uname -a
Linux ip-10-15-42-103.ec2.internal 4.4.6-coreos #2 SMP Sat Mar 26 03:25:31 UTC 2016 x86_64 Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz GenuineIntel GNU/Linux
$ docker version
Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.5.3
 Git commit:   9894698
 Built:
 OS/Arch:      linux/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.5.3
 Git commit:   9894698
 Built:
 OS/Arch:      linux/amd64
$ docker info
Containers: 198
Images: 196
Server Version: 1.9.1
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: journald
Kernel Version: 4.4.6-coreos
Operating System: CoreOS 991.2.0 (Coeur Rouge)
CPUs: 2
Total Memory: 3.862 GiB
$ sudo strace -q -y -v -f docker ps
<snip>
[pid 26889] connect(5<socket:[18806726]>, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, 23) = 0
[pid 26889] clock_gettime(CLOCK_REALTIME, {1462217998, 860739240}) = 0
[pid 26889] epoll_ctl(4<anon_inode:[eventpoll]>, EPOLL_CTL_ADD, 5<socket:[18806726]>, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=914977888, u64=140334676407392}}) = 0
[pid 26889] getsockname(5<socket:[18806726]>, {sa_family=AF_LOCAL, NULL}, [2]) = 0
[pid 26889] getpeername(5<socket:[18806726]>, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, [23]) = 0
[pid 26889] read(5<socket:[18806726]>,  <unfinished ...>
[pid 26893] <... futex resumed> )       = 1
[pid 26889] <... read resumed> 0xc820015000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
[pid 26893] futex(0xc8200e9790, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 26891] futex(0xc820026a10, FUTEX_WAIT, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 26891] epoll_wait(4<anon_inode:[eventpoll]>, {{EPOLLOUT, {u32=914978080, u64=140334676407584}}, {EPOLLOUT, {u32=914977888, u64=140334676407392}}}, 128, 0) = 2
[pid 26891] epoll_wait(4<anon_inode:[eventpoll]>,  <unfinished ...>
[pid 26889] write(5<socket:[18806726]>, "GET /v1.21/containers/json HTTP/"..., 88 <unfinished ...>
[pid 26890] <... select resumed> )      = 0 (Timeout)
[pid 26889] <... write resumed> )       = 88
[pid 26891] <... epoll_wait resumed> {{EPOLLOUT, {u32=914977888, u64=140334676407392}}}, 128, -1) = 1
[pid 26891] epoll_wait(4<anon_inode:[eventpoll]>,  <unfinished ...>
[pid 26890] clock_gettime(CLOCK_MONOTONIC, {1194110, 935071149}) = 0
[pid 26890] clock_gettime(CLOCK_REALTIME, {1462217998, 861179188}) = 0
[pid 26890] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 26890] clock_gettime(CLOCK_MONOTONIC, {1194110, 935216184}) = 0
[pid 26890] clock_gettime(CLOCK_REALTIME, {1462217998, 861327424}) = 0
[pid 26890] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 26890] clock_gettime(CLOCK_MONOTONIC, {1194110, 935376601}) = 0
[pid 26890] clock_gettime(CLOCK_REALTIME, {1462217998, 861479813}) = 0
[pid 26890] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 26890] clock_gettime(CLOCK_MONOTONIC, {1194110, 935543531}) = 0
[pid 26890] clock_gettime(CLOCK_REALTIME, {1462217998, 861646718}) = 0
[pid 26890] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 26890] clock_gettime(CLOCK_MONOTONIC, {1194110, 935709999}) = 0
[pid 26890] clock_gettime(CLOCK_REALTIME, {1462217998, 861817062}) = 0
[pid 26890] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 26890] clock_gettime(CLOCK_MONOTONIC, {1194110, 935872149}) = 0
[pid 26890] clock_gettime(CLOCK_REALTIME, {1462217998, 861975102}) = 0
[pid 26890] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 26890] clock_gettime(CLOCK_MONOTONIC, {1194110, 936046201}) = 0
[pid 26890] clock_gettime(CLOCK_REALTIME, {1462217998, 862149543}) = 0
[pid 26890] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 26890] clock_gettime(CLOCK_MONOTONIC, {1194110, 936215534}) = 0
[pid 26890] clock_gettime(CLOCK_REALTIME, {1462217998, 862318597}) = 0
[pid 26890] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 26890] clock_gettime(CLOCK_MONOTONIC, {1194110, 936384887}) = 0
[pid 26890] clock_gettime(CLOCK_REALTIME, {1462217998, 862488231}) = 0
[pid 26890] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 26890] clock_gettime(CLOCK_MONOTONIC, {1194110, 936547503}) = 0
[pid 26890] clock_gettime(CLOCK_REALTIME, {1462217998, 862650775}) = 0
[pid 26890] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 26890] clock_gettime(CLOCK_MONOTONIC, {1194110, 936708047}) = 0
[pid 26890] clock_gettime(CLOCK_REALTIME, {1462217998, 862810981}) = 0
[pid 26890] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 26890] clock_gettime(CLOCK_MONOTONIC, {1194110, 936875834}) = 0
[pid 26890] clock_gettime(CLOCK_REALTIME, {1462217998, 862978790}) = 0
[pid 26890] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 26890] clock_gettime(CLOCK_MONOTONIC, {1194110, 937049520}) = 0
[pid 26890] clock_gettime(CLOCK_REALTIME, {1462217998, 863161620}) = 0
[pid 26890] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 26890] clock_gettime(CLOCK_MONOTONIC, {1194110, 937216897}) = 0
[pid 26890] clock_gettime(CLOCK_REALTIME, {1462217998, 863319694}) = 0
[pid 26890] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 26890] clock_gettime(CLOCK_MONOTONIC, {1194110, 937382999}) = 0
[pid 26890] clock_gettime(CLOCK_REALTIME, {1462217998, 863485851}) = 0
[pid 26890] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 26890] clock_gettime(CLOCK_MONOTONIC, {1194110, 937549477}) = 0
[pid 26890] clock_gettime(CLOCK_REALTIME, {1462217998, 863652283}) = 0
[pid 26890] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 26890] clock_gettime(CLOCK_MONOTONIC, {1194110, 937722463}) = 0
[pid 26890] clock_gettime(CLOCK_REALTIME, {1462217998, 863833602}) = 0
[pid 26890] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 26890] clock_gettime(CLOCK_MONOTONIC, {1194110, 937888537}) = 0
[pid 26889] futex(0xc8200e9790, FUTEX_WAKE, 1 <unfinished ...>
[pid 26890] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 26889] <... futex resumed> )       = 1
[pid 26893] <... futex resumed> )       = 0
[pid 26890] <... clock_gettime resumed> {1462217998, 864010029}) = 0
[pid 26890] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 26889] futex(0x1df2f50, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 26893] futex(0xc8200e9790, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 26890] <... select resumed> )      = 0 (Timeout)
[pid 26890] clock_gettime(CLOCK_MONOTONIC, {1194110, 938059687}) = 0
[pid 26890] futex(0x1df2400, FUTEX_WAIT, 0, {60, 0}

@mwhooker Unfortunately an strace of the docker client is not really going to help much since we need to see what's happening on the daemon.

Please use SIGUSR1 on the stuck daemon to have it spit out a stack trace to its logs.

Here's another SIGUSR1 dump of a stuck docker daemon:

https://gist.github.com/mwhooker/e837f08370145d183e661c03a5b9d07e

strace again finds the daemon in FUTEX_WAIT

I can actually connect to the host this time, so I can perform additional debugging, though I will kill the daemon soon.

ping @cpuguy83 ^^

I have the same issue on 1.11.1 (4.2.5-300.fc23), Overlay/EXT4. It happens when I run a Celery worker, load it up with jobs and then try to stop it.

I can't then exec anything inside it:
rpc error: code = 2 desc = "oci runtime error: exec failed: exit status 1"

So I guess part of the container has been destroyed, but the teardown cannot finish. I've confirmed that Celery has been killed within the container, so I'm not sure what's holding it.

Edit:
Celery hasn't been killed completely; there's a process stuck in the D state, so I guess a reboot is the only option.

I ran into the same issue under high load (containers being created/deleted at a high rate):

  • CoreOS 1068.8.0
  • Docker 1.10.3
  • uname -a:
Linux coreos_k8s_28 4.6.3-coreos #2 SMP Mon Jul 18 06:10:39 UTC 2016 x86_64 Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz GenuineIntel GNU/Linux
  • kubernetes 1.2.0

Here is my gist of docker debug logs (sent SIGUSR1 to docker daemon): https://gist.github.com/ntquyen/409376bc0d8418a31023c1546241e190

Same as @ntquyen: same OS, kernel, and docker version.

Repro still stands https://github.com/docker/docker/issues/13885#issuecomment-181811082

In my case, it might be because the machines were running at >90% disk usage. I allocated more disk space to them and they have been running quite stably since, even under high load.

@mwhooker it looks like your daemon is hung in a netlink call to delete an interface.
What daemon options do you have set up?

Note that when I say "hung", it's not really freezing the whole daemon, but any API call that needs to access that container object will also hang, because it will be stuck waiting for the container mutex, which in turn is stuck in the netlink call.

Looking at various things we can do to mitigate these types of issues.

@pmoust Can you provide a stack trace from the daemon? (send SIGUSR1 to a stuck daemon and collect the trace from the daemon logs)
Also include the options you started the daemon with.

Thanks!

@ntquyen I think the issue you hit is related to #22732 (a slightly different repro, but still centered around image deletion), which is fixed in 1.11.2.
Did you only hit the issue once?

@cpuguy83 thanks for getting back to me. here's how we run the daemon

/usr/bin/docker daemon --icc=false --storage-driver=overlay --log-driver=journald --host=fd://

Not sure it's related but we've also noticed a bunch of these messages in our logs

Jul 19 10:22:55 ip-10-0-37-191.ec2.internal systemd-networkd[852]: veth0adae8a: Removing non-existent address: fe80::e855:98ff:fe3f:8b2c/64 (valid forever)

(edit: also these)

[19409.536106] unregister_netdevice: waiting for veth2304bd1 to become free. Usage count = 1

@mwhooker the bottom message is probably where docker is getting stuck, but I'm surprised you are getting this without using --userland-proxy=false.

@cpuguy83 Interesting you mention that; we are running hyperkube with --proxy-mode=userspace. Is it possible that's causing it?

@mwhooker based on the flag description, it doesn't seem like it.
Basically, --userland-proxy=false enables hairpin mode on the container's bridge interface, which seems to trigger some condition in the kernel where, when we add/remove a lot of containers (or do a bunch in parallel), it causes a pretty bad freeze (and the error message you see).

You could check if your bridge interfaces are in hairpin mode.

@cpuguy83 this is the docker command from a separate set of servers that also experiences this issue

docker daemon --host=fd:// --icc=false --storage-driver=overlay --log-driver=journald --exec-opt native.cgroupdriver=systemd --bip=10.2.122.1/24 --mtu=8951 --ip-masq=true --selinux-enable

and I can see that these veth interfaces are in hairpin mode

cat /sys/devices/virtual/net/docker0/brif/veth*/hairpin_mode
1
1

I wonder if this is the cause of at least one of the docker hanging issues we're seeing.

(fwiw we're using userspace proxying on kube-proxy because of an interaction with flannel https://github.com/kubernetes/kubernetes/issues/20391)

@cpuguy83 Unfortunately, I'm using docker 1.10 on the latest stable CoreOS and cannot upgrade to docker 1.12. The issue was not solved completely (as I had thought) after I increased the disk space.

Same here.

  • Docker: 1.12
  • Kernel: 4.4.3
  • Distro: ubuntu

Daemon:

    # strace -q -y -v -p `pidof dockerd`
    futex(0x2a69be8, FUTEX_WAIT, 0, NULL
    futex(0xc820e1b508, FUTEX_WAKE, 1)      = 1
    futex(0x2a69be8, FUTEX_WAIT, 0, NULL)   = 0
    futex(0xc8224b0508, FUTEX_WAKE, 1)      = 1
    futex(0x2a69be8, FUTEX_WAIT, 0, NULL)   = 0
    futex(0x2a69be8, FUTEX_WAIT, 0, NULL

Client:

    # strace docker ps
    execve("/usr/bin/docker", ["docker", "ps"], [/* 24 vars */]) = 0
    brk(0)                                  = 0x2584000
    ...
    stat("/root/.docker/config.json", {st_mode=S_IFREG|0600, st_size=91, ...}) = 0
    openat(AT_FDCWD, "/root/.docker/config.json", O_RDONLY|O_CLOEXEC) = 3
    read(3, "{\n\t\"auths\": {\n\t\t\"https://quay.io"..., 512) = 91
    close(3)                                = 0
    ioctl(0, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
    ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
    futex(0xc820068108, FUTEX_WAKE, 1)      = 1
    futex(0xc820068108, FUTEX_WAKE, 1)      = 1
    socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
    setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
    connect(3, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, 23) = 0
    epoll_create1(EPOLL_CLOEXEC)            = 4
    epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3136698616, u64=140182279305464}}) = 0
    getsockname(3, {sa_family=AF_LOCAL, NULL}, [2]) = 0
    getpeername(3, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, [23]) = 0
    futex(0xc820032d08, FUTEX_WAKE, 1)      = 1
    read(3, 0xc820356000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
    write(3, "GET /v1.24/containers/json HTTP/"..., 95) = 95
    futex(0x132aca8, FUTEX_WAIT, 0, NULL

This is still happening here on Debian when running rspec tests:

Kernel: 4.4.0-28-generic
Docker: 1.12.2-0~xenial

The current workaround is to go back to Docker 1.10.3-0~wily, which is the last version that has been stable for us.

@mblaschke @Bregor Can you post a stacktrace please?

This has been happening in our environment for a while now:

$ docker info
Containers: 20
 Running: 6
 Paused: 0
 Stopped: 14
Images: 57
Server Version: 1.11.1
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge null host
Kernel Version: 3.18.27
Operating System: Debian GNU/Linux 8 (jessie)
OSType: linux
Architecture: x86_64
CPUs: 24
Total Memory: 125.8 GiB
Name: appdocker414-sjc1
ID: ZTWC:53NH:VKUZ:5FZK:SLZN:YPI4:ICK2:W7NF:UIGD:P2NQ:RHUD:PS6Y
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support

Kernel: 3.18.27

Stackstrace: https://gist.github.com/dmyerscough/6948218a228ff69dd6e309f8de0f0261

@dmyerscough This seems to be https://github.com/docker/docker/issues/22732 . Fixed in 1.11.2 and 1.12

We run a simple container with docker run --rm every minute, and the Docker service deadlocks every now and then until it's restarted.

docker -D info
Containers: 6
 Running: 2
 Paused: 0
 Stopped: 4
Images: 6
Server Version: 1.11.2
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host bridge null
Kernel Version: 4.7.3-coreos-r2
Operating System: CoreOS 1185.3.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.8 GiB
Name: <hostname>
ID: GFHQ:Y6IV:XCQI:Y2NA:PCQN:DEII:OSPZ:LENL:OCFU:6CBI:MDFV:EMD7
Docker Root Dir: /var/lib/docker
Debug mode (client): true
Debug mode (server): false
Username: <username>
Registry: https://index.docker.io/v1/

Last message before it hung:

Nov 04 16:42:01 dockerd[1447]: time="2016-11-04T16:42:01Z" level=error msg="containerd: deleting container" error="wait: no child processes"

@liquid-sky do you have more information? Perhaps the daemon logs show something useful, or a stack trace? Also note that we don't have official packages for CoreOS; given that they distribute a modified package/configuration, it may be worth reporting there as well.

@thaJeztah Unfortunately the journal has rotated since the process last logged anything. When I strace it, I see that it's hanging on a mutex lock.

$ ps -ef | grep -w 148[9] | head -1
root      1489     1  0 Nov02 ?        00:20:35 docker daemon --host=fd:// --insecure-registry=<registry_address> --selinux-enabled
sudo strace -e verbose=all -v -p 1489
Process 1489 attached
futex(0x22509c8, FUTEX_WAIT, 0, NULL
$ sudo strace docker ps
.....
clock_gettime(CLOCK_REALTIME, {1478601605, 8732879}) = 0
clock_gettime(CLOCK_REALTIME, {1478601605, 9085011}) = 0
clock_gettime(CLOCK_REALTIME, {1478601605, 9242006}) = 0
socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
connect(3, {sa_family=AF_LOCAL, sun_path="/run/systemd/journal/socket"}, 30) = 0
epoll_create1(EPOLL_CLOEXEC)            = 4
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3317119944, u64=140066495609800}}) = 0
getsockname(3, {sa_family=AF_LOCAL, NULL}, [2]) = 0
getpeername(3, {sa_family=AF_LOCAL, sun_path="/run/systemd/journal/socket"}, [30]) = 0
stat("/root/.docker/config.json", 0xc8203b6fa8) = -1 ENOENT (No such file or directory)
stat("/root/.dockercfg", 0xc8203b7078)  = -1 ENOENT (No such file or directory)
stat("/bin/docker-credential-secretservice", 0xc8203b7148) = -1 ENOENT (No such file or directory)
stat("/sbin/docker-credential-secretservice", 0xc8203b7218) = -1 ENOENT (No such file or directory)
stat("/usr/bin/docker-credential-secretservice", 0xc8203b72e8) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/docker-credential-secretservice", 0xc8203b73b8) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/docker-credential-secretservice", 0xc8203b7488) = -1 ENOENT (No such file or directory)
stat("/usr/local/sbin/docker-credential-secretservice", 0xc8203b7558) = -1 ENOENT (No such file or directory)
stat("/opt/bin/docker-credential-secretservice", 0xc8203b7628) = -1 ENOENT (No such file or directory)
ioctl(0, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
futex(0xc8202c9108, FUTEX_WAKE, 1)      = 1
clock_gettime(CLOCK_REALTIME, {1478601605, 11810493}) = 0
clock_gettime(CLOCK_REALTIME, {1478601605, 11888495}) = 0
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 5
setsockopt(5, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
connect(5, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, 23) = 0
clock_gettime(CLOCK_REALTIME, {1478601605, 12378242}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3317119752, u64=140066495609608}}) = 0
getsockname(5, {sa_family=AF_LOCAL, NULL}, [2]) = 0
getpeername(5, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, [23]) = 0
futex(0xc820026d08, FUTEX_WAKE, 1)      = 1
read(5, 0xc8203d6000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
write(5, "GET /v1.23/containers/json HTTP/"..., 89) = 89
futex(0xc820026d08, FUTEX_WAKE, 1)      = 1
futex(0x22509c8, FUTEX_WAIT, 0, NULL

And so does every containerd service:

ps -ef | grep container[d]
root      1513  1489  0 Nov02 ?        00:00:53 containerd -l /var/run/docker/libcontainerd/docker-containerd.sock --runtime runc --start-timeout 2m
root      2774  1513  0 Nov02 ?        00:00:00 containerd-shim a90d4642fd88ab38c66a733e2cef8f427533e736d14d48743d42f55dec62447f /var/run/docker/libcontainerd/a90d4642fd88ab38c66a733e2cef8f427533e736d14d48743d42f55dec62447f runc
root      3946  1513  0 Nov02 ?        00:00:00 containerd-shim c8903c4a137fbb297efc3fcf2c69d746e94431f22c7fdf1a46ff7c69d04ffb0d /var/run/docker/libcontainerd/c8903c4a137fbb297efc3fcf2c69d746e94431f22c7fdf1a46ff7c69d04ffb0d runc
root      4783  1513  0 Nov02 ?        00:03:36 containerd-shim d8c2203adfc26f7d11a62d9d90ddf97f04c458f72855ee1987ed1af911a2ab55 /var/run/docker/libcontainerd/d8c2203adfc26f7d11a62d9d90ddf97f04c458f72855ee1987ed1af911a2ab55 runc
root     16684  1513  0 Nov02 ?        00:00:00 containerd-shim 4d62424ca8cceb29c877bf129cd46341a53e191c9858b93aca3d5cbcfaaa1876 /var/run/docker/libcontainerd/4d62424ca8cceb29c877bf129cd46341a53e191c9858b93aca3d5cbcfaaa1876 runc
root     16732  1513  0 Nov02 ?        00:03:24 containerd-shim 2f8e2a858306322c10aa7823c92f22133f1c5e5f267ce61e542c1d8bd537b121 /var/run/docker/libcontainerd/2f8e2a858306322c10aa7823c92f22133f1c5e5f267ce61e542c1d8bd537b121 runc
root     20902  1513  0 Nov02 ?        00:00:05 containerd-shim 58572e7fab122d593bdb096b0dd33551c22ce50a0c51d6662bc0c7b3d3bf9248 /var/run/docker/libcontainerd/58572e7fab122d593bdb096b0dd33551c22ce50a0c51d6662bc0c7b3d3bf9248 runc
$ sudo strace -T -e verbose=all -v -p 1513
Process 1513 attached
futex(0x1028dc8, FUTEX_WAIT, 0, NULL

@liquid-sky strace on a normal dockerd will also show futex(0x22509c8, FUTEX_WAIT, 0, NULL, which shows nothing meaningful, I think... you'd better add -f to strace so it tracks all the threads.

I think this might be related to this issue in the sense that we run a container every minute with --rm and Docker hangs on that particular container; however, some nodes where docker ps is hanging do not have the unregistered loopback device message.

Just in case, a snippet of strace -f on the Docker daemon:

[pid  1507] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid  1766] write(23, "\2\0\0\0\0\0\0\321I1108 10:48:12.429657   "..., 217) = 217
[pid  1507] <... select resumed> )      = 0 (Timeout)
[pid  1766] futex(0xc822075d08, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  1507] clock_gettime(CLOCK_MONOTONIC, {514752, 140621361}) = 0
[pid  1875] <... epoll_wait resumed> {{EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=1298424080, u64=140528333381904}}}, 128, -1) = 1
[pid  1507] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1875] epoll_wait(6,  <unfinished ...>
[pid  1507] <... clock_gettime resumed> {1478602092, 431376727}) = 0
[pid  1507] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid  1514] <... read resumed> "I1108 10:48:12.431656    12 mast"..., 32768) = 1674
[pid  1507] <... select resumed> )      = 0 (Timeout)
[pid  1514] futex(0xc822075d08, FUTEX_WAKE, 1 <unfinished ...>
[pid  1507] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid  1766] <... futex resumed> )       = 0
[pid  1514] <... futex resumed> )       = 1
[pid  1766] select(0, NULL, NULL, NULL, {0, 100} <unfinished ...>
[pid  1514] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1507] <... clock_gettime resumed> {514752, 142476308}) = 0
[pid  1514] <... clock_gettime resumed> {1478602092, 432632124}) = 0
[pid  1507] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1514] write(45, "{\"log\":\"I1108 10:48:12.431656   "..., 349 <unfinished ...>
[pid  1507] <... clock_gettime resumed> {1478602092, 432677401}) = 0
[pid  1514] <... write resumed> )       = 349
[pid  1507] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid  1514] clock_gettime(CLOCK_REALTIME, {1478602092, 432812895}) = 0
[pid  1766] <... select resumed> )      = 0 (Timeout)
[pid  1514] write(45, "{\"log\":\"I1108 10:48:12.431763   "..., 349 <unfinished ...>
[pid  1507] <... select resumed> )      = 0 (Timeout)
[pid  1514] <... write resumed> )       = 349
[pid  1507] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid  1514] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1507] <... clock_gettime resumed> {514752, 142831922}) = 0
[pid  1514] <... clock_gettime resumed> {1478602092, 432948135}) = 0
[pid  1507] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1514] write(45, "{\"log\":\"I1108 10:48:12.431874   "..., 349 <unfinished ...>
[pid  1507] <... clock_gettime resumed> {1478602092, 432989394}) = 0
[pid  1514] <... write resumed> )       = 349
[pid  1507] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid  1766] read(44, "I1108 10:48:12.432255    12 mast"..., 32768) = 837
[pid  1514] clock_gettime(CLOCK_REALTIME, {1478602092, 433114452}) = 0
[pid  1766] read(44,  <unfinished ...>
[pid  1514] write(45, "{\"log\":\"I1108 10:48:12.431958   "..., 349) = 349
[pid  1507] <... select resumed> )      = 0 (Timeout)
[pid  1514] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1507] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid  1514] <... clock_gettime resumed> {1478602092, 433272783}) = 0
[pid  1507] <... clock_gettime resumed> {514752, 143176397}) = 0
[pid  1514] write(45, "{\"log\":\"I1108 10:48:12.432035   "..., 349 <unfinished ...>
[pid  1507] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1514] <... write resumed> )       = 349
[pid  1507] <... clock_gettime resumed> {1478602092, 433350170}) = 0
[pid  1514] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1507] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid  1514] <... clock_gettime resumed> {1478602092, 433416138}) = 0
[pid  1514] write(45, "{\"log\":\"I1108 10:48:12.432126   "..., 349) = 349
[pid  1514] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1507] <... select resumed> )      = 0 (Timeout)
[pid  1514] <... clock_gettime resumed> {1478602092, 433605782}) = 0
[pid  1507] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid  1514] write(45, "{\"log\":\"I1108 10:48:12.432255   "..., 349 <unfinished ...>
[pid  1507] <... clock_gettime resumed> {514752, 143537029}) = 0
[pid  1514] <... write resumed> )       = 349
[pid  1507] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1514] clock_gettime(CLOCK_REALTIME, {1478602092, 433748730}) = 0
[pid  1507] <... clock_gettime resumed> {1478602092, 433752245}) = 0
[pid  1514] write(45, "{\"log\":\"I1108 10:48:12.432343   "..., 348 <unfinished ...>
[pid  1507] futex(0xc8211b2d08, FUTEX_WAKE, 1 <unfinished ...>
[pid  1514] <... write resumed> )       = 348
[pid  1507] <... futex resumed> )       = 1
[pid 10488] <... futex resumed> )       = 0
[pid 10488] write(23, "\2\0\0\0\0\0\t\317I1108 10:48:12.431656   "..., 2519 <unfinished ...>
[pid  1514] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1507] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 10488] <... write resumed> )       = 2519
[pid 10488] futex(0xc8211b2d08, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  1875] <... epoll_wait resumed> {{EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=1298424080, u64=140528333381904}}}, 128, -1) = 1
[pid  1514] <... clock_gettime resumed> {1478602092, 434094221}) = 0
[pid  1514] write(45, "{\"log\":\"I1108 10:48:12.432445   "..., 349) = 349
[pid  1514] futex(0xc820209908, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  1507] <... select resumed> )      = 0 (Timeout)
[pid  1507] clock_gettime(CLOCK_MONOTONIC, {514752, 144370753}) = 0
[pid  1507] futex(0x224fe10, FUTEX_WAIT, 0, {60, 0} <unfinished ...>
[pid  1875] epoll_wait(6,  <unfinished ...>
[pid  1683] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  1683] clock_gettime(CLOCK_MONOTONIC, {514752, 292709791}) = 0
[pid  1683] futex(0x224fe10, FUTEX_WAKE, 1) = 1
[pid  1507] <... futex resumed> )       = 0

I am using devicemapper with loopback on CentOS 7.

Not sure what happened, but dmsetup udevcookies & dmsetup udevcomplete <cookie> works for me.

@thaJeztah could you help explain what happened?
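
For anyone hitting the same state, a rough sketch of that workaround (the cookie value is whatever dmsetup udevcookies reports on your host, not a fixed number):

    dmsetup udevcookies                # list outstanding udev cookies (semaphores)
    dmsetup udevcomplete <cookie>      # release the cookie a stuck operation is waiting on
    dmsetup udevcomplete_all           # or release all of them, as used later in this thread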

I'm repeating what has been commented many times before in here: if you encounter a hang, please send a SIGUSR1 signal to the daemon process and copy the trace that is written to the logs. Strace output is not useful for debugging hangs; it is likely from a client process and usually doesn't even show whether the process is stuck or just sleeping.
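
A minimal sketch of how to capture that trace, assuming a systemd host where the daemon process is dockerd (on 1.11.x the process may still be named docker daemon, so adjust the pidof target accordingly):

    kill -USR1 "$(pidof dockerd)"      # the daemon dumps all goroutine stacks to its log
    journalctl -u docker.service --since "5 minutes ago" > docker-stacktrace.txt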

I have a USR1 dump as @tonistiigi asked for with a hang-under-load.

$ docker info
Containers: 18
 Running: 15
 Paused: 0
 Stopped: 3
Images: 232
Server Version: 1.12.3
Storage Driver: devicemapper
 Pool Name: vg_ex-docker--pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 107.4 GB
 Backing Filesystem: xfs
 Data file: 
 Metadata file: 
 Data Space Used: 132.8 GB
 Data Space Total: 805.3 GB
 Data Space Available: 672.5 GB
 Metadata Space Used: 98.46 MB
 Metadata Space Total: 4.001 GB
 Metadata Space Available: 3.903 GB
 Thin Pool Minimum Free Space: 80.53 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Library Version: 1.02.107-RHEL7 (2016-06-09)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null bridge host overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 3.10.0-327.36.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 56
Total Memory: 251.6 GiB
Name: server.masked.example.com
ID: Z7J7:CLEG:KZPK:TNEI:QLYL:JBTO:XNM4:NX2X:KPQC:VHPC:6SFS:G4GR
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8

docker.txt

From the provided log, it seems it is stuck on dm_udev_wait:

goroutine 2747 [syscall, 12 minutes, locked to thread]:
github.com/docker/docker/pkg/devicemapper._Cfunc_dm_udev_wait(0xc80d4d887c, 0xc800000000)
\t??:0 +0x41
github.com/docker/docker/pkg/devicemapper.dmUdevWaitFct(0xd4d887c, 0x44288e)
\t/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper_wrapper.go:222 +0x22
github.com/docker/docker/pkg/devicemapper.UdevWait(0xc821c7e8f8, 0x0, 0x0)
\t/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:259 +0x3b
github.com/docker/docker/pkg/devicemapper.activateDevice(0xc82021a880, 0x1e, 0xc8202f0cc0, 0x57, 0x9608, 0x1900000000, 0x0, 0x0, 0x0, 0x0)
\t/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:771 +0x8b8
github.com/docker/docker/pkg/devicemapper.ActivateDevice(0xc82021a880, 0x1e, 0xc8202f0cc0, 0x57, 0x9608, 0x1900000000, 0x0, 0x0)
\t/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:732 +0x78
github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).activateDeviceIfNeeded(0xc8200b6340, 0xc8208e0a40, 0xc8200b6300, 0x0, 0x0)
\t/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:539 +0x58d
github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).MountDevice(0xc8200b6340, 0xc820a00200, 0x40, 0xc820517ab0, 0x61, 0x0, 0x0, 0x0, 0x0)
\t/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:2269 +0x2cd
github.com/docker/docker/daemon/graphdriver/devmapper.(*Driver).Get(0xc82047bf40, 0xc820a00200, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
\t/root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/driver.go:185 +0x62f
github.com/docker/docker/daemon/graphdriver.(*NaiveDiffDriver).Get(0xc8201c2680, 0xc820a00200, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
\t<autogenerated>:30 +0xa6

similar issues #27900 #27543

@AaronDMarasco-VSI Have you tried dmsetup udevcomplete_all?

@coolljt0725 No, sorry, after I dumped the log for reporting, I restarted the daemon without playing with it any more.

@tonistiigi, full stack trace from Docker daemon after SIGUSR1 can be found here

The trace from @liquid-sky seems to be blocked on a netlink socket:

goroutine 13038 [syscall, 9838 minutes, locked to thread]:
syscall.Syscall6(0x2d, 0x16, 0xc820fc3000, 0x1000, 0x0, 0xc8219bc028, 0xc8219bc024, 0x0, 0x300000002, 0x66b5a6)
    /usr/lib/go1.6/src/syscall/asm_linux_amd64.s:44 +0x5
syscall.recvfrom(0x16, 0xc820fc3000, 0x1000, 0x1000, 0x0, 0xc8219bc028, 0xc8219bc024, 0x427289, 0x0, 0x0)
    /usr/lib/go1.6/src/syscall/zsyscall_linux_amd64.go:1712 +0x9e
syscall.Recvfrom(0x16, 0xc820fc3000, 0x1000, 0x1000, 0x0, 0x1000, 0x0, 0x0, 0x0, 0x0)
    /usr/lib/go1.6/src/syscall/syscall_unix.go:251 +0xb9
github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(0xc820c6c5e0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/nl/nl_linux.go:341 +0xae
github.com/vishvananda/netlink/nl.(*NetlinkRequest).Execute(0xc820dded20, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/nl/nl_linux.go:228 +0x20b
github.com/vishvananda/netlink.LinkSetMasterByIndex(0x7f1e88c67c20, 0xc820e00630, 0x3, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/link_linux.go:231 +0x3b0
github.com/vishvananda/netlink.LinkSetMaster(0x7f1e88c67c20, 0xc820e00630, 0xc8219bc550, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/link_linux.go:205 +0xb7
github.com/docker/libnetwork/drivers/bridge.addToBridge(0xc820c78830, 0xb, 0x177bf88, 0x7, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/docker/libnetwork/drivers/bridge/bridge.go:782 +0x275
github.com/docker/libnetwork/drivers/bridge.(*driver).CreateEndpoint(0xc820316640, 0xc8201f6840, 0x40, 0xc821879500, 0x40, 0x7f1e88c67bd8, 0xc820f38580, 0xc821a87bf0, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/docker/libnetwork/drivers/bridge/bridge.go:1004 +0x1436
github.com/docker/libnetwork.(*network).addEndpoint(0xc820f026c0, 0xc8209f2300, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/docker/libnetwork/network.go:749 +0x21b
github.com/docker/libnetwork.(*network).CreateEndpoint(0xc820f026c0, 0xc820eb1909, 0x6, 0xc8210af160, 0x4, 0x4, 0x0, 0x0, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/docker/libnetwork/network.go:813 +0xaa7
github.com/docker/docker/daemon.(*Daemon).connectToNetwork(0xc82044e480, 0xc820e661c0, 0x177aea8, 0x6, 0xc820448c00, 0x0, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/.gopath/src/github.com/docker/docker/daemon/container_operations.go:539 +0x39f
github.com/docker/docker/daemon.(*Daemon).allocateNetwork(0xc82044e480, 0xc820e661c0, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/.gopath/src/github.com/docker/docker/daemon/container_operations.go:401 +0x373
github.com/docker/docker/daemon.(*Daemon).initializeNetworking(0xc82044e480, 0xc820e661c0, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/.gopath/src/github.com/docker/docker/daemon/container_operations.go:680 +0x459
github.com/docker/docker/daemon.(*Daemon).containerStart(0xc82044e480, 0xc820e661c0, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/.gopath/src/github.com/docker/docker/daemon/start.go:125 +0x219
github.com/docker/docker/daemon.(*Daemon).ContainerStart(0xc82044e480, 0xc820ed61d7, 0x40, 0x0, 0x0, 0x0)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/.gopath/src/github.com/docker/docker/daemon/start.go:75 +0x575

cc @aboch

As others have suggested, this is likely a side effect of something very bad happening in the kernel.
We are adding https://github.com/docker/libnetwork/pull/1557 to attempt to mitigate this situation.

Maybe I have another two or three stack traces with some error messages (taken from the journalctl syslog).
I'm running some rspec/serverspec tests for docker images, and in this state docker seems not to respond. A test is stuck for more than 20 minutes (it will retry 5 times if it fails).
Each image's tests normally finish in less than 5 minutes; the test run has now been going for at least 30 to 60 minutes.

https://gist.github.com/mblaschke/d5d38cb34f6178e50e465c6e1f02131c

uname -a:
Linux webdevops.infogene.fr 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:39:52 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

docker info:
Containers: 5
 Running: 1
 Paused: 0
 Stopped: 4
Images: 1208
Server Version: 1.12.3
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 449
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-47-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 7
Total Memory: 11.73 GiB
Name: webdevops.infogene.fr
ID: DGNM:G3MV:ZMMV:HY6K:UPLU:CMEV:VHCA:RVY3:CRV7:FS5M:KMVQ:MQO5
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

Here also the result output from serverspec:

https://gist.github.com/mblaschke/21a521367a2a2b71301607482647a748

@mblaschke I looked over your traces (https://gist.github.com/tonistiigi/0fb0abfb068a89975c072b68e6ed07ce for a better view). I can't find anything suspicious in there, though. All long-running goroutines are from open io copies, which is normal if there are running containers or execs, as these goroutines don't hold any locks. From the traces I would expect that other commands like docker ps or docker exec/stop were not blocked in your case, and that the hang you saw was only that you expected a specific container or exec to finish but it didn't. This may be related to either the application itself hanging on something or containerd not sending us any events about the container (cc @mlaventure).

The warnings you have in the logs should be fixed with https://github.com/docker/containerd/pull/351 . They should only cause log spam and not be anything serious. Because debug logging is not enabled, I can't see whether any suspicious commands were sent to the daemon. There don't seem to be any meaningful log entries for minutes before you took the stack trace.

The same code works with Docker 1.10.3, but not after Docker 1.11.x. The serverspec tests fail randomly with timeouts.
I have the feeling that it happens more often when we run the tests in parallel (e.g. running 2 or 4 tests on a 4-core CPU) than in serial mode, but the result is the same: there is no test run that completes successfully.
With Docker 1.11 the whole docker daemon was hanging; since Docker 1.12 only the exec fails or runs into a timeout.

@mblaschke I had a look at the trace too, it really looks like the exec is just not finishing with its IO.

What exec program are you executing? Does it fork new processes within its own session id?

We are executing (docker exec) ip addr on a regular basis in all containers to find their IP addresses assigned by https://github.com/rancher/rancher , which are unfortunately not part of the docker metadata (yet); a rough sketch of that loop follows below.

We are seeing the hanging problem on a regular basis within our production cluster.
With ip addr there should not be much magic with processes running within their own session id, right?
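
Roughly, that pattern looks like the following loop (a hypothetical approximation; the exact command and any Rancher-specific filtering are assumptions, not taken from the poster):

    # enumerate running containers and ask each one for its addresses
    for c in $(docker ps -q); do
        docker exec "$c" ip -o -4 addr show
    done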

@mlaventure
We're using serverspec/rspec for testing and are executing very simple tests (file checks, simple command checks, simple application checks). Most of the time even simple file permission checks are failing.

Tests are here (but they need some environment settings for execution):
https://github.com/webdevops/Dockerfile/tree/develop/tests/serverspec

you could try it with our code base:

  1. Run make requirements to fetch all the Python (pip) and Ruby modules
  2. Run bin/console docker:pull --threads=auto (fetches ~180 images from the hub)
  3. Run bin/console test:serverspec --threads=auto to run all tests in parallelized mode, or bin/console test:serverspec for serial mode

@GameScripting I'm getting confused now :sweat_smile: . Are you referring to the same context that @mblaschke is running from?

If not, which docker version are you using?

And to answer your question, no, it's unlikely that ip addr would do anything like this.

Which image are you using? What is the exact docker exec command being used?

Sorry, it was not my intention to make this more confusing.
I am not referring to the same context; I do not use the test suite he was referring to. I wanted to provide more (hopefully useful) context and/or info on what we are doing that might cause the issue.

The main issue with resolving this bug is that no one has yet been able to come up with stable, reproducible steps to trigger the hang.

It seems like @mblaschke found something, so he is able to reliably trigger the bug.

I use CoreOS stable (1185.3.0) with Docker 1.11.2.
I run a watch with docker exec against my Redis container to check some variables. At least once a day the Docker daemon hangs.

But I've found a workaround until you find a solution. I use the ctr utility from containerd (https://github.com/docker/containerd) to start a process inside the running container. It can also be used to start and stop containers. I think it is integrated into docker 1.11.2.
Unfortunately it has another bug on CoreOS that I reported here: https://github.com/docker/containerd/issues/356#event-874038823 (they have since fixed it), so you need to regularly clean /tmp, but at least it has been working for me for a week in production.

For example, the following docker exec command:

docker exec -i container_name /bin/sh -c 'echo "Hello"'

can be translated to ctr:

/usr/bin/ctr --address=/run/docker/libcontainerd/docker-containerd.sock containers exec --id=${id_of_the_running_container} --pid=any_unique_process_name -a --cwd=/ /bin/sh -c 'echo "Hello"'

So you can translate scripts temporarily to ctr as a workaround.

@mblaschke The commands you posted did fail for me, but they don't look like docker failures: https://gist.github.com/tonistiigi/86badf5a41dff3fe53bd68d8e83e4ec4 Could you enable debug logs? The master build also stores more debug data about daemon internals and allows tracing containerd as well with SIGUSR1 signals. Because we are tracking a stuck process, even ps aux could help.
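
For reference, one way to enable that debug logging (a sketch assuming a systemd host running Docker 1.12+, where the daemon binary is dockerd):

    # either start the daemon with debug logging enabled:
    dockerd -D
    # or set "debug": true in /etc/docker/daemon.json and restart the daemon:
    systemctl restart docker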

@tonistiigi
I forgot the alpine blacklist (it currently has build problems), so please run: bin/console test:serverspec --threads=auto --blacklist=:alpine

Sorry :(

@mblaschke Still doesn't look like a docker issue. It is hanging on docker exec <id> /usr/bin/php -i. If I trace the image being used and start it manually, I see that this image doesn't have real php installed:

root@254424aecc57:/# php -v
HipHop VM 3.11.1 (rel)
Compiler: 3.11.1+dfsg-1ubuntu1
Repo schema: 2f678922fc70b326c82e56bedc2fc106c2faca61

And this HipHopVM doesn't support -i but just blocks.

@tonistiigi
I was hunting this bug across the last few Docker versions for a long time and didn't see it in our test suite. We've fixed it, thanks and sorry /o

I've searched the build logs and found one test failure (a random issue, not in the hhvm tests) when using 1.12.3. We will continue to stress Docker and try to find the issue.

@coolljt0725 I just came back to work to find Docker hung again. docker ps hung in one session, and sudo service docker stop hung as well. Opened a third session:

$ sudo lvs
  LV          VG     Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool vg_ex  twi-aot--- 750.00g             63.49  7.88                            
$ sudo dmsetup udevcomplete_all
This operation will destroy all semaphores with keys that have a prefix 3405 (0xd4d).
Do you really want to continue? [y/n]: y
2 semaphores with keys prefixed by 3405 (0xd4d) destroyed. 0 skipped.

As soon as that completed, the other two sessions unblocked. I was checking disk space because that had been a problem once before, so it was the first place I looked. The 2 semaphores may have been the two calls I had, but there were many hung docker calls from my Jenkins server, which is why I was poking around at all...

An example of the hung output after I ran the dmsetup call; the docker run below had been started days ago:

docker run -td -v /opt/Xilinx/:/opt/Xilinx/:ro -v /opt/Modelsim/:/opt/Modelsim/:ro -v /data/jenkins_workspace_modelsim/workspace/examples_hdl/APPLICATION/bias/HDL_PLATFORM/modelsim_pf/TARGET_OS/7:/build --name jenkins-examples_hdl-APPLICATION--bias--HDL_PLATFORM--modelsim_pf--TARGET_OS--7-546 jenkins/build:v3-C7 /bin/sleep 10m
docker: An error occurred trying to connect: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/create?name=jenkins-examples_hdl-APPLICATION--bias--HDL_PLATFORM--modelsim_pf--TARGET_OS--7-546: EOF.
See 'docker run --help'.

The trace from @Calpicow seems to be stuck on devicemapper, but it doesn't look like the udev_wait case. @coolljt0725 @rhvgoyal

goroutine 488 [syscall, 10 minutes, locked to thread]:
github.com/docker/docker/pkg/devicemapper._C2func_dm_task_run(0x7fbffc010f80, 0x0, 0x0, 0x0)
    ??:0 +0x47
github.com/docker/docker/pkg/devicemapper.dmTaskRunFct(0x7fbffc010f80, 0xc800000001)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper_wrapper.go:96 +0x21
github.com/docker/docker/pkg/devicemapper.(*Task).run(0xc8200266d8, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:155 +0x37
github.com/docker/docker/pkg/devicemapper.SuspendDevice(0xc821b07380, 0x55, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:627 +0x99
github.com/docker/docker/pkg/devicemapper.CreateSnapDevice(0xc821013f80, 0x25, 0x1a, 0xc821b07380, 0x55, 0x17, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:759 +0x92
github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).createRegisterSnapDevice(0xc8203be9c0, 0xc82184d1c0, 0x40, 0xc821a78f40, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:860 +0x557
github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).AddDevice(0xc8203be9c0, 0xc82184d1c0, 0x40, 0xc8219c47c0, 0x40, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:1865 +0x81f
github.com/docker/docker/daemon/graphdriver/devmapper.(*Driver).Create(0xc8200ff770, 0xc82184d1c0, 0x40, 0xc8219c47c0, 0x40, 0x0, 0x0, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/driver.go:124 +0x5f
github.com/docker/docker/daemon/graphdriver.(*NaiveDiffDriver).Create(0xc820363940, 0xc82184d1c0, 0x40, 0xc8219c47c0, 0x40, 0x0, 0x0, 0x0, 0x0)
    <autogenerated>:24 +0xaa
github.com/docker/docker/layer.(*layerStore).Register(0xc820363980, 0x7fc02faa3898, 0xc821a6ae00, 0xc821ace370, 0x47, 0x0, 0x0, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/layer/layer_store.go:266 +0x382
github.com/docker/docker/distribution/xfer.(*LayerDownloadManager).makeDownloadFunc.func1.1(0xc8210cd740, 0xc8210cd680, 0x7fc02fb6b758, 0xc820f422d0, 0xc820ee15c0, 0xc820ee1680, 0xc8210cd6e0, 0xc820e8e0b0)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/distribution/xfer/download.go:316 +0xc01
created by github.com/docker/docker/distribution/xfer.(*LayerDownloadManager).makeDownloadFunc.func1
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/distribution/xfer/download.go:341 +0x191

@Calpicow Do you have a dmesg log? And can you show the output of dmsetup status?

My docker ps hangs, but other commands like info/restart/images work just fine.

I have a huge SIGUSR1 dump below as well (didn't fit here). https://gist.github.com/ahmetalpbalkan/34bf40c02a78e319eaf5710acb15cf9a

  • docker Server Version: 1.11.2
  • Storage Driver: overlay
  • Kernel Version: 4.7.3-coreos-r3
  • Operating System: CoreOS 1185.5.0 (MoreOS)

It looks like I have a ton (like 700) of these goroutines:

...
goroutine 1149 [chan send, 1070 minutes]:
github.com/vishvananda/netlink.LinkSubscribe.func2(0xc821159d40, 0xc820fa4c60)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/link_linux.go:898 +0x2de
created by github.com/vishvananda/netlink.LinkSubscribe
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/link_linux.go:901 +0x107

goroutine 442 [chan send, 1129 minutes]:
github.com/vishvananda/netlink.LinkSubscribe.func2(0xc8211eb380, 0xc821095920)
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/link_linux.go:898 +0x2de
created by github.com/vishvananda/netlink.LinkSubscribe
    /build/amd64-usr/var/tmp/portage/app-emulation/docker-1.11.2-r5/work/docker-1.11.2/vendor/src/github.com/vishvananda/netlink/link_linux.go:901 +0x107
...

@ahmetalpbalkan You look blocked waiting on a netlink socket to return.
This is a bug in the kernel, but 1.12.5 should at least have a timeout on this netlink socket.
If I had to guess, you'd have something in your dmesg output like device_count = 1; waiting for <interface> to become free
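
A quick way to check for that kind of message (a sketch; the exact wording of the kernel message varies between kernel versions):

    dmesg | grep -i "waiting for .* to become free"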

@cpuguy83 yeah, I saw https://github.com/coreos/bugs/issues/254, which looked similar to my case; however, I don't see those "waiting" messages in the kernel logs that you and that person mentioned.

It looks like 1.12.5 has not even hit the CoreOS alpha stream yet. Is there a kernel/docker version I can downgrade to and have this working?

@ahmetalpbalkan Yay, another kernel bug.
This is why we introduced the timeout on the netlink socket... in all likelihood you won't be able to start/stop any new containers, but at least Docker won't be blocked.

Is it known what exactly the bug is? Was the kernel bug reported upstream? And is there even a kernel version where this bug has been fixed?

@GameScripting The issue will have to be reported to whichever distro it was produced on, and as you can see, we have more than one issue causing the same effect here as well.

Here's another one with Docker v1.12.3

dump
dmesg
dmsetup

Relevant syslog:

Jan  6 01:41:19 ip-100-64-32-70 kernel: INFO: task kworker/u31:1:91 blocked for more than 120 seconds.
Jan  6 01:41:19 ip-100-64-32-70 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  6 01:41:19 ip-100-64-32-70 kernel: kworker/u31:1   D ffff880201fc98e0     0    91      2 0x00000000
Jan  6 01:41:19 ip-100-64-32-70 kernel: Workqueue: kdmremove do_deferred_remove [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: ffff88020141bcf0 0000000000000046 ffff8802044ce780 ffff88020141bfd8
Jan  6 01:41:20 ip-100-64-32-70 kernel: ffff88020141bfd8 ffff88020141bfd8 ffff8802044ce780 ffff880201fc98d8
Jan  6 01:41:20 ip-100-64-32-70 kernel: ffff880201fc98dc ffff8802044ce780 00000000ffffffff ffff880201fc98e0
Jan  6 01:41:20 ip-100-64-32-70 kernel: Call Trace:
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff8163b959>] schedule_preempt_disabled+0x29/0x70
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff81639655>] __mutex_lock_slowpath+0xc5/0x1c0
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff81638abf>] mutex_lock+0x1f/0x2f
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa0392e9d>] __dm_destroy+0xad/0x340 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa03947e3>] dm_destroy+0x13/0x20 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa0398d6d>] dm_hash_remove_all+0x6d/0x130 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa039b50a>] dm_deferred_remove+0x1a/0x20 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa0390dae>] do_deferred_remove+0xe/0x10 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff8109d5fb>] process_one_work+0x17b/0x470
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff8109e3cb>] worker_thread+0x11b/0x400
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff81645818>] ret_from_fork+0x58/0x90
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Jan  6 01:41:20 ip-100-64-32-70 kernel: INFO: task dockerd:31587 blocked for more than 120 seconds.
Jan  6 01:41:20 ip-100-64-32-70 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  6 01:41:20 ip-100-64-32-70 kernel: dockerd         D 0000000000000000     0 31587      1 0x00000080
Jan  6 01:41:20 ip-100-64-32-70 kernel: ffff8800e768fab0 0000000000000086 ffff880034215c00 ffff8800e768ffd8
Jan  6 01:41:20 ip-100-64-32-70 kernel: ffff8800e768ffd8 ffff8800e768ffd8 ffff880034215c00 ffff8800e768fbf0
Jan  6 01:41:20 ip-100-64-32-70 kernel: ffff8800e768fbf8 7fffffffffffffff ffff880034215c00 0000000000000000
Jan  6 01:41:20 ip-100-64-32-70 kernel: Call Trace:
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff8163a879>] schedule+0x29/0x70
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff81638569>] schedule_timeout+0x209/0x2d0
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff8108e4cd>] ? mod_timer+0x11d/0x240
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff8163ac46>] wait_for_completion+0x116/0x170
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff810b8c10>] ? wake_up_state+0x20/0x20
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff810ab676>] __synchronize_srcu+0x106/0x1a0
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff810ab190>] ? call_srcu+0x70/0x70
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff81219e3f>] ? __sync_blockdev+0x1f/0x40
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff810ab72d>] synchronize_srcu+0x1d/0x20
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa039318d>] __dm_suspend+0x5d/0x220 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa0394c9a>] dm_suspend+0xca/0xf0 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa0399fe0>] ? table_load+0x380/0x380 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa039a174>] dev_suspend+0x194/0x250 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa0399fe0>] ? table_load+0x380/0x380 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa039aa25>] ctl_ioctl+0x255/0x500 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff8112482d>] ? call_rcu_sched+0x1d/0x20
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffffa039ace3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff811f1e75>] do_vfs_ioctl+0x2e5/0x4c0
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff8128bbee>] ? file_has_perm+0xae/0xc0
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff81640d01>] ? __do_page_fault+0xb1/0x450
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff811f20f1>] SyS_ioctl+0xa1/0xc0
Jan  6 01:41:20 ip-100-64-32-70 kernel: [<ffffffff816458c9>] system_call_fastpath+0x16/0x1b

@Calpicow Thanks, yours looks like devicemapper has stalled.

github.com/docker/docker/pkg/devicemapper._C2func_dm_task_run(0x7fd3a40231b0, 0x7fd300000000, 0x0, 0x0)
    ??:0 +0x4c
github.com/docker/docker/pkg/devicemapper.dmTaskRunFct(0x7fd3a40231b0, 0xc821a91620)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper_wrapper.go:96 +0x75
github.com/docker/docker/pkg/devicemapper.(*Task).run(0xc820345838, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:155 +0x37
github.com/docker/docker/pkg/devicemapper.SuspendDevice(0xc8219c8600, 0x5a, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:648 +0x99
github.com/docker/docker/pkg/devicemapper.CreateSnapDevice(0xc821a915f0, 0x25, 0x21, 0xc8219c8600, 0x5a, 0x1f, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/pkg/devicemapper/devmapper.go:780 +0x92
github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).createRegisterSnapDevice(0xc820433040, 0xc821084080, 0x40, 0xc8210842c0, 0x140000000, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:861 +0x550
github.com/docker/docker/daemon/graphdriver/devmapper.(*DeviceSet).AddDevice(0xc820433040, 0xc821084080, 0x40, 0xc82025b5e0, 0x45, 0x0, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/deviceset.go:1887 +0xa5c
github.com/docker/docker/daemon/graphdriver/devmapper.(*Driver).Create(0xc82036af00, 0xc821084080, 0x40, 0xc82025b5e0, 0x45, 0x0, 0x0, 0x0, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/driver.go:131 +0x6f
github.com/docker/docker/daemon/graphdriver/devmapper.(*Driver).CreateReadWrite(0xc82036af00, 0xc821084080, 0x40, 0xc82025b5e0, 0x45, 0x0, 0x0, 0x0, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/graphdriver/devmapper/driver.go:126 +0x86
github.com/docker/docker/daemon/graphdriver.(*NaiveDiffDriver).CreateReadWrite(0xc82021c500, 0xc821084080, 0x40, 0xc82025b5e0, 0x45, 0x0, 0x0, 0x0, 0x0, 0x0)
    <autogenerated>:28 +0xbe
github.com/docker/docker/layer.(*layerStore).CreateRWLayer(0xc82021c580, 0xc82154f980, 0x40, 0xc82025b2c0, 0x47, 0x0, 0x0, 0xc820981380, 0x0, 0x0, ...)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/layer/layer_store.go:476 +0x5a9
github.com/docker/docker/daemon.(*Daemon).setRWLayer(0xc820432ea0, 0xc8216d12c0, 0x0, 0x0)

Can you open a separate issue with all the details?
Thanks!

#30003
