Fabric: Reboot broken on Ubuntu 16.04 hosts

Created on 18 Jul 2016  ·  21 Comments  ·  Source: fabric/fabric

The built-in reboot() function has been working perfectly on both Ubuntu 14.04 and FreeBSD 10.x hosts, but it is broken on Ubuntu 16.04 hosts.

What is happening on Ubuntu 14.04:
I receive output like this and the system reboots; after the reboot, Fabric reconnects.

[ubuntu] out:
[ubuntu] out:
[ubuntu] out: Broadcast message from root@ubuntu
[ubuntu] out:
[ubuntu] out:   (/dev/pts/0) at 15:02 ...
[ubuntu] out:
[ubuntu] out:
[ubuntu] out:
[ubuntu] out:
[ubuntu] out: The system is going down for reboot NOW!
[ubuntu] out:
[ubuntu] out:

What is happening on Ubuntu 16.04:

  1. There is no output at all from the command.
  2. The system actually starts rebooting (still no output in Fabric)
  3. The system finishes rebooting, but Fabric doesn't realise it: it does not reconnect, and there is still no output.
  4. Fabric just sits there waiting seemingly forever.

If I press the Enter key in this state, Fabric actually continues, but first shows this message:

No handlers could be found for logger "paramiko.transport"
Warning: sudo() received nonzero return code -1 while executing 'reboot'!

I am using this code for reboot:

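import time
from fabric.api import reboot, settings

def reboot_():
    # warn_only keeps Fabric from aborting if the reboot command reports failure
    with settings(warn_only=True):
        print('rebooting')
        start_time = time.time()
        # wait: how long (in seconds) Fabric keeps trying to reconnect
        reboot(wait=1200)
        print('reboot took: {} seconds'.format(time.time() - start_time))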

Labels: Bug, Core, Needs investigation

All 21 comments

It is exactly the same with run('reboot')

It being the same with a manual run is unsurprising - clearly something changed regarding Ubuntu's handling of reboot, SSH connections, etc.

Nothing obvious springs to mind, but reboot() (Fab's, not Linux's) is pretty basic - it simply calls sudo('reboot'), and temporarily tweaks Fabric's general reconnection settings so it can handle reconnecting after a nontrivial reboot sequence (versus the default, which would give up pretty quickly).

See https://github.com/fabric/fabric/blob/c0224a52df59821f21a8c0bd47ce15e42c2046a4/fabric/operations.py#L1244 - you might want to try tweaking that.

Also try enabling Paramiko's logging (see bottom of our troubleshooting page - http://www.fabfile.org/troubleshooting.html) as it might yield a clue.

Actually, on second thought, it sounds like Ubuntu's reboot is somehow never exiting or submitting an exit code to Fabric's execution handlers (run/sudo), since you note that sudo is what gets mad when you mash Enter after waiting.

If you look at the reboot() code, it expects the sudo('reboot') call to exit eventually, so that it can A) wait a bit and B) initiate reconnection.

The fact that, on Fabric's end, execution is just hanging out within the sudo means something remotely is violating that expectation. Kind of strange. _Maybe_ a bug in Fabric itself, but feels more like bad behavior on the remote end. (P.S.: which fabric version(s) are you seeing this on?)

Offhand thought - we could perhaps set timeout= on the sudo, then except TimeoutException: pass around it. This would ensure that even in this (strange) situation, we default to trying a reconnect.

Only downside would be the case where reboot is actually hanging and the system is not truly rebooting, but it's not like we'd make things any _worse_ for that case by the above change - the infinite hang would just happen on the connection loop instead of within the sudo.
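
The idea above (let the reboot command time out, then try to reconnect regardless) can be sketched without Fabric at all. This is a minimal sketch under stated assumptions: `run_reboot`, `reconnect`, and `CommandTimedOut` are hypothetical stand-ins for the remote command call, the reconnection step, and whatever timeout exception the command runner raises. None of them are Fabric names.

```python
import time


class CommandTimedOut(Exception):
    """Hypothetical stand-in for the timeout error a command runner raises."""


def reboot_and_reconnect(run_reboot, reconnect, settle_seconds=5, sleep=time.sleep):
    """Issue a reboot, tolerate a timed-out command, then always reconnect.

    run_reboot and reconnect are caller-supplied callables (assumptions for
    this sketch). Returns True if the reboot command timed out.
    """
    timed_out = False
    try:
        run_reboot()
    except CommandTimedOut:
        # The remote side never reported an exit status; carry on anyway.
        timed_out = True
    # Give the host a moment to actually start going down.
    sleep(settle_seconds)
    reconnect()
    return timed_out
```

As noted above, if the host is not really rebooting, this only moves the wait from the command call into the reconnect step.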

Another really strange change in behaviour on Ubuntu 16.04 is the following. When I run poweroff in an SSH session, the machine does power off, but the SSH session hangs! There is no way to Ctrl + C, or Ctrl + D, or anything. All I can do is wait a _lot_, then ssh aborts with:
packet_write_wait: Connection to 192.168.56.11: Broken pipe

I'm really not into the deep pockets of SSH connection handling, but this might be exactly the same issue as with reboot.

I've just run into broken reboot (fresh up-to-date Ubuntu 16.04 on AWS, Fabric==1.12.0) but in a different way. For me it just throws:

Fatal error: sudo() received nonzero return code -1 while executing!

Requested: reboot
Executed: sudo -S -p 'sudo password:'  /bin/bash -l -c "reboot"

Running sudo reboot in terminal by hand works (host reboots).

May be worth noting:

$ readlink /sbin/reboot 
/bin/systemctl
$ readlink /sbin/shutdown
/bin/systemctl
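
The same symlink check can be done from Python, for example before deciding which reboot command to send. This is only a sketch using the standard library; the helper names `symlink_target` and `points_at_systemctl` are made up for illustration.

```python
import os


def symlink_target(path):
    """Return the immediate target of a symlink, or None if path is not one."""
    if os.path.islink(path):
        return os.readlink(path)
    return None


def points_at_systemctl(path):
    """True if path is a symlink whose target's basename is 'systemctl'."""
    target = symlink_target(path)
    return target is not None and os.path.basename(target) == "systemctl"
```

On the host shown above, `points_at_systemctl("/sbin/reboot")` would be expected to return True, since readlink reports /bin/systemctl there.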

And another weird thing. I've changed the rebooting code to use aws-cli and after its call (which takes ~1sec, seems like it's asynchronous) I run sudo('add-apt-repository --yes ppa:nginx/stable'). It has always worked, but now after reboot it returned -1 too:

sudo: add-apt-repository --yes ppa:nginx/stable

Fatal error: sudo() received nonzero return code -1 while executing!

Requested: add-apt-repository --yes ppa:nginx/stable
Executed: sudo -S -p 'sudo password:'  /bin/bash -l -c "add-apt-repository --yes ppa:nginx/stable"

Then I tried to make Fabric reconnect by adding fabric.network.disconnect_all(). It resulted in a password prompt (why??):

[...] sudo: add-apt-repository --yes ppa:nginx/stable
[...] Login password for 'ubuntu': 

And it started to work only after I added e.g. time.sleep(60 * 3) after reboot. Which is obviously a poor band-aid, and now I'm puzzled how to properly handle the password problem. Looks like it's related to this issue.

The problem seems to be that "reboot" is now sometimes "too fast": the connection goes away before the status of the command gets back over the ssh connection.

(Tip: If you're at a frozen ssh connection as a result: type \n~. aka enter, tilde, period. That's the default ssh escape character, then the disconnect command for ssh. If you just try ctrl-c or ctrl-d, ssh tries to pass that to the process running on the other side.)

One solution is to use shutdown -r +1, which will schedule the reboot for the next minute, and then wait a minute for it to start, and then start trying to re-connect. Admittedly, waiting a minute is not great.
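
The "wait, then start trying to re-connect" part can be sketched with a plain TCP poll from the client side. This is a minimal sketch, not something Fabric provides: `wait_for_port` and its parameters are hypothetical names, and an open port 22 only shows that something is accepting connections, not that login will succeed.

```python
import socket
import time


def wait_for_port(host, port=22, total_wait=300.0, interval=5.0, connect_timeout=5.0):
    """Poll until a TCP connection to host:port succeeds or total_wait elapses.

    Returns True as soon as a connection is accepted, False on giving up.
    """
    deadline = time.time() + total_wait
    while True:
        try:
            sock = socket.create_connection((host, port), timeout=connect_timeout)
        except (socket.error, socket.timeout):
            # Refused, unreachable, or timed out: the host is not back yet.
            pass
        else:
            sock.close()
            return True
        if time.time() >= deadline:
            return False
        time.sleep(interval)
```

With shutdown -r +1, the caller would first sleep for the scheduled minute and only then call `wait_for_port`; otherwise the poll can succeed immediately against the host that has not gone down yet.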

A hacky thing to try: shutdown -r +0 should be equivalent to reboot, but in my limited tests of Ubuntu-16.04 running in VirtualBox, it tends to give a fraction of a second longer, showing the next shell prompt just before disconnecting a manual ssh session.

This is probably a duplicate of #1444.

If the init daemon is switched to upstart, reboot works as expected. It looks like systemd is killing sshd immediately.

There was a bug in the Debian/Ubuntu systemd package that, on shutdown, stopped the network service before the SSH one, so everything hung.
It was fixed in the latest point release. I don't know the status of the Ubuntu package.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=751636

I also had issues with reboot() in some of my scripts. I found that when connecting with a password, the reboot worked correctly, but when using keyfile authentication, the connection hung (and the reboot was done).

The ubuntu bug https://bugs.launchpad.net/ubuntu/+source/openssh/+bug/1645002 is marked as fixed in 16.10, but not yet in 16.04, and unclear when it will be.

The current behavior for me is that paramiko/fabric instantly detect that the ssh connection was closed, but it's before paramiko/fabric sees the reboot command to have completed. At least it doesn't hang indefinitely as in the original report.

Fatal error: sudo() received nonzero return code -1 while executing!
...
Aborting.

Plain reboot() did that consistently for me in a handful of tests against AWS EC2 and a local virtualbox VM. (I always used keyfile auth.)

I've found a short and elegant workaround, as I suggested without as much detail above:

reboot(command="shutdown -r +0")

That worked as expected for me (in my handful of tests against AWS EC2 and local virtualbox VM, all running up-to-date ubuntu 16.04). Note that "shutdown -r now" behaved like "reboot" and did not seem to work.

I took a quick look at the FreeBSD and OpenBSD man pages, and it looks like they have a shutdown command that supports those parameters. I suspect that the command "shutdown -r +0" would work for pretty much any unix system on which "reboot" worked. So it could be considered as the new default command, or as a documentation update. (But I'd be interested to see a report of a test on a BSD system first.)

shutdown -r +0 isn't enough for us. Since reboot doesn't accept a manual timeout, I've even tried something like:

from time import sleep
from fabric.api import sudo
from fabric.exceptions import NetworkError

try:
    sudo("shutdown -r +0", timeout=300)
except NetworkError:
    pass
# in case the sudo times out during reboot
sleep(15)

Despite all of this hand waving, the next command hangs indefinitely. Is it possible that the connection pool is holding onto (and using) the dead connection? If so, is there a workaround? Can I temporarily reduce the connection-level timeout?

Indeed, you need to replace the existing connection, the way reboot() does:

https://github.com/fabric/fabric/blob/1.13.2/fabric/operations.py#L1289-L1294
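
To illustrate why a cached connection keeps being used after the host has gone away, here is a toy model of a lazily populated per-host cache. It is not Fabric's implementation: `ToyConnectionCache` and `opener` are hypothetical names, and the real cache linked above may differ in detail.

```python
class ToyConnectionCache(dict):
    """Toy model of a lazily populated, per-host connection cache.

    Not Fabric's implementation; `opener` is a hypothetical factory that
    returns a new connection object for a host string.
    """

    def __init__(self, opener):
        super(ToyConnectionCache, self).__init__()
        self._opener = opener

    def __getitem__(self, host):
        # Lazily open on first use, then keep handing back the same object,
        # even if the remote end has since gone away.
        if host not in self:
            self[host] = self._opener(host)
        return dict.__getitem__(self, host)

    def connect(self, host):
        """Discard whatever is cached for host and open a new connection."""
        self[host] = self._opener(host)
        return dict.__getitem__(self, host)
```

In this model, nothing ever checks whether the cached object is still alive, so the next command after a reboot reuses the dead connection unless something explicitly replaces the entry.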

Apologies for reviving an old issue. I can also confirm that this problem happens when attempting to reboot an LXC container. @ploxiln's suggestion of using command="shutdown -r +0" did work for us.

Confirming this error on a fresh install of FreeBSD 11.1 with bash installed:

reboot(wait=1) results in:

Fatal error: sudo() received nonzero return code -1 while executing!

Requested: reboot
Executed: sudo -S -p 'sudo password:'  /usr/local/bin/bash -l -c "reboot"

Aborting.
Traceback (most recent call last):
…
    raise env.abort_exception(msg)
hosts.FabricException: sudo() received nonzero return code -1 while executing!

I ended up needing this to get things going after reading @ambsw-technology's and @ploxiln's comments. I'm running against an Ubuntu 16.04 LTS server (from a Windows client).

import time
import fabric.state
from fabric.api import env, sudo
sudo('shutdown -r +0')
time.sleep(30)
fabric.state.connections.connect(env.host_string)

FYI, I still see this against 18.04.2 LTS servers.

Is there any fix for this? I'm also getting this issue with 16.04.
