Ansible: Ansible hangs forever while executing playbook with no information on what is going on

Created on 15 Sep 2017  ·  95 Comments  ·  Source: ansible/ansible

ISSUE TYPE

  • Bug Report
COMPONENT NAME


ansible-playbook

ANSIBLE VERSION
ansible 2.3.1.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = Default w/o overrides
  python version = 2.7.5 (default, Aug  2 2016, 04:20:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]
CONFIGURATION
OS / ENVIRONMENT

SUMMARY

When running an Ansible playbook against all servers (1964 servers), it hangs somewhere in the middle of execution. I used -vvvvvvvvvv to trace the problem, but it doesn't print ANYTHING. The previous host finishes, the message about it finishing is the last thing I see in the console, and then there is NOTHING. It just hangs. It doesn't even print the FQDN or name of the server that is next in the queue, and it doesn't say what it is waiting for. Just nothing.

In ps I see 2 processes. Running strace shows:

parent:
Process 22170 attached
select(0, NULL, NULL, NULL, {0, 599}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
wait4(28917, 0x7ffd44d0bf34, WNOHANG, NULL) = 0
wait4(28917, 0x7ffd44d0bf64, WNOHANG, NULL) = 0

...
repeats forever

child:
Process 28917 attached
epoll_wait(40,

STEPS TO REPRODUCE

# This playbook is used to test if server works or not

---
- hosts: run-tools.homecredit.net all:!ignored
  tasks:
  - command: echo works

Looks like a bug to me; it should at least tell me what it's trying to do that causes the hang.

affects_2.3 bug hang core

Most helpful comment

It doesn't matter whether the hang on the target host is caused by fuse, NFS, or anything else; it's still an Ansible bug in the sense that one target host should not be able to lock up the whole play for all other hosts.

There must be some timeout or internal watchdog that kills the playbook on misbehaving and broken machines. One broken machine out of 5000 shouldn't break Ansible for all 5000 machines.

All 95 comments

You can enable debug mode to get more information than -v* gives you. To do this, just export ANSIBLE_DEBUG=1.

We would probably need some more info to help you with this, which hopefully the debug logs will provide. That many hosts does seem like a lot to connect to concurrently; are you running in parallel or in blocks of hosts?
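For reference, a sketch of such a debug run (the playbook name and log file are placeholders, not taken from the report):

```shell
# Turn on Ansible's internal debug output for a single run and keep a copy on disk.
ANSIBLE_DEBUG=1 ansible-playbook -vvvv site.yml 2>&1 | tee debug.log
```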

needs_info

Hmm, what are you setting forks to? Hitting lots of hosts at once might be taking up a lot of resources - worth checking /var/log/messages too.

You might like to try wait_for_connection as a way of testing that all your hosts are available. It's nice because it is aware of the connection type, so it works even if you have a mixed inventory of, say, Windows and Linux hosts.
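A minimal sketch of that suggestion (the host pattern and the 30-second budget are assumptions, not from the reporter's setup):

```yaml
# Probe reachability up front so unreachable hosts fail fast and drop out
# of the play, instead of stalling a later task.
- hosts: all
  gather_facts: false
  tasks:
    - name: verify the connection before doing real work
      wait_for_connection:
        timeout: 30
```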

Hi, it connects to at most 10 hosts at the same time; this is the config:

[defaults]
host_key_checking = False
timeout = 20
retry_files_save_path = /home/ansible/retry/
# Do not put more jobs here, or ssh will fail
# Anything more than 10 kills this poor server
forks = 10
remote_user = root
log_path=/var/log/ansible.log

[ssh_connection]
retries=3
pipelining=True

I managed to find a "block of hosts" that contains the one which is probably causing this hang, so I am now removing more and more hosts from that temporary file in the hope of identifying the one that causes this.
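That narrowing-down can also be done without editing files, by slicing the host pattern with --limit; a hedged sketch (test.yml and the group name are placeholders):

```shell
# Bisect the suspect group: run the play against each half and see which one hangs.
# Ansible host patterns support positional slices like group[0:49].
ansible-playbook test.yml --limit 'suspect_group[0:49]'
ansible-playbook test.yml --limit 'suspect_group[50:99]'
```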

My rough guess is that it maybe hangs on a DNS query? It's possible that one of these hosts doesn't exist and a misconfigured DNS server makes the query take forever. I have no idea, but I am fairly sure that no detailed information is available even in the debug log; I will try to provide it soonish. I see some debug lines, then I see a green ok: [hostname], and then nothing forever, not a single debug or verbose line; it just hangs.

After a lot of debugging and testing servers one by one, I figured out which one was causing this. Its DNS is OK, it responds, and I can ssh there, but there is some problem:

 12502 1506511086.24143: starting run
 12502 1506511086.42742: Loading CacheModule 'memory' from /usr/lib/python2.7/site-packages/ansible/plugins/cache/memory.py
 12502 1506511086.63981: Loading CallbackModule 'default' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/default.py
 12502 1506511086.64073: Loading CallbackModule 'actionable' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/actionable.py (found_in_cache=False, class_only=True)
 12502 1506511086.64109: Loading CallbackModule 'context_demo' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/context_demo.py (found_in_cache=False, class_only=True)
 12502 1506511086.64146: Loading CallbackModule 'debug' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/debug.py (found_in_cache=False, class_only=True)
 12502 1506511086.64166: Loading CallbackModule 'default' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/default.py (found_in_cache=False, class_only=True)
 12502 1506511086.64226: Loading CallbackModule 'dense' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/dense.py (found_in_cache=False, class_only=True)
 12502 1506511086.67297: Loading CallbackModule 'foreman' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/foreman.py (found_in_cache=False, class_only=True)
 12502 1506511086.68313: Loading CallbackModule 'hipchat' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/hipchat.py (found_in_cache=False, class_only=True)
 12502 1506511086.68376: Loading CallbackModule 'jabber' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/jabber.py (found_in_cache=False, class_only=True)
 12502 1506511086.68409: Loading CallbackModule 'json' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/json.py (found_in_cache=False, class_only=True)
 12502 1506511086.68476: Loading CallbackModule 'junit' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/junit.py (found_in_cache=False, class_only=True)
 12502 1506511086.68511: Loading CallbackModule 'log_plays' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/log_plays.py (found_in_cache=False, class_only=True)
 12502 1506511086.68598: Loading CallbackModule 'logentries' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/logentries.py (found_in_cache=False, class_only=True)
 12502 1506511086.68675: Loading CallbackModule 'logstash' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/logstash.py (found_in_cache=False, class_only=True)
 12502 1506511086.68789: Loading CallbackModule 'mail' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/mail.py (found_in_cache=False, class_only=True)
 12502 1506511086.68823: Loading CallbackModule 'minimal' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/minimal.py (found_in_cache=False, class_only=True)
 12502 1506511086.68855: Loading CallbackModule 'oneline' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/oneline.py (found_in_cache=False, class_only=True)
 12502 1506511086.68892: Loading CallbackModule 'osx_say' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/osx_say.py (found_in_cache=False, class_only=True)
 12502 1506511086.68932: Loading CallbackModule 'profile_tasks' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/profile_tasks.py (found_in_cache=False, class_only=True)
 12502 1506511086.68973: Loading CallbackModule 'selective' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/selective.py (found_in_cache=False, class_only=True)
 12502 1506511086.69006: Loading CallbackModule 'skippy' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/skippy.py (found_in_cache=False, class_only=True)
 12502 1506511086.69063: Loading CallbackModule 'slack' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/slack.py (found_in_cache=False, class_only=True)
 12502 1506511086.69186: Loading CallbackModule 'syslog_json' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/syslog_json.py (found_in_cache=False, class_only=True)
 12502 1506511086.69221: Loading CallbackModule 'timer' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/timer.py (found_in_cache=False, class_only=True)
 12502 1506511086.69256: Loading CallbackModule 'tree' from /usr/lib/python2.7/site-packages/ansible/plugins/callback/tree.py (found_in_cache=False, class_only=True)
 12502 1506511086.69296: in VariableManager get_vars()
 12502 1506511086.71899: Loading FilterModule 'core' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/core.py
 12502 1506511086.71980: Loading FilterModule 'ipaddr' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/ipaddr.py
 12502 1506511086.72396: Loading FilterModule 'json_query' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/json_query.py
 12502 1506511086.72433: Loading FilterModule 'mathstuff' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/mathstuff.py
 12502 1506511086.72655: Loading TestModule 'core' from /usr/lib/python2.7/site-packages/ansible/plugins/test/core.py
 12502 1506511086.72845: Loading TestModule 'files' from /usr/lib/python2.7/site-packages/ansible/plugins/test/files.py
 12502 1506511086.72880: Loading TestModule 'mathstuff' from /usr/lib/python2.7/site-packages/ansible/plugins/test/mathstuff.py
 12502 1506511086.73542: done with get_vars()
 12502 1506511086.73603: in VariableManager get_vars()
 12502 1506511086.73665: Loading FilterModule 'core' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/core.py (found_in_cache=True, class_only=False)
 12502 1506511086.73686: Loading FilterModule 'ipaddr' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/ipaddr.py (found_in_cache=True, class_only=False)
 12502 1506511086.73702: Loading FilterModule 'json_query' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/json_query.py (found_in_cache=True, class_only=False)
 12502 1506511086.73717: Loading FilterModule 'mathstuff' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/mathstuff.py (found_in_cache=True, class_only=False)
 12502 1506511086.73765: Loading TestModule 'core' from /usr/lib/python2.7/site-packages/ansible/plugins/test/core.py (found_in_cache=True, class_only=False)
 12502 1506511086.73783: Loading TestModule 'files' from /usr/lib/python2.7/site-packages/ansible/plugins/test/files.py (found_in_cache=True, class_only=False)
 12502 1506511086.73799: Loading TestModule 'mathstuff' from /usr/lib/python2.7/site-packages/ansible/plugins/test/mathstuff.py (found_in_cache=True, class_only=False)
 12502 1506511086.73921: done with get_vars()

PLAY [all:!ignored] **********************************************************************************************************************************************************************************************************************************
 12502 1506511086.81512: Loading StrategyModule 'linear' from /usr/lib/python2.7/site-packages/ansible/plugins/strategy/linear.py
 12502 1506511086.82241: getting the remaining hosts for this loop
 12502 1506511086.82263: done getting the remaining hosts for this loop
 12502 1506511086.82282: building list of next tasks for hosts
 12502 1506511086.82297: getting the next task for host in-terminal01.prod.domain.tld
 12502 1506511086.82315: done getting next task for host in-terminal01.prod.domain.tld
 12502 1506511086.82330:  ^ task is: TASK: Gathering Facts
 12502 1506511086.82345:  ^ state is: HOST STATE: block=0, task=0, rescue=0, always=0, run_state=ITERATING_SETUP, fail_state=FAILED_NONE, pending_setup=True, tasks child state? (None), rescue child state? (None), always child state? (None), did rescue? False, did start at task? False
 12502 1506511086.82360: done building task lists
 12502 1506511086.82376: counting tasks in each state of execution
 12502 1506511086.82390: done counting tasks in each state of execution:
        num_setups: 1
        num_tasks: 0
        num_rescue: 0
        num_always: 0
 12502 1506511086.82405: advancing hosts in ITERATING_SETUP
 12502 1506511086.82418: starting to advance hosts
 12502 1506511086.82432: getting the next task for host in-terminal01.prod.domain.tld
 12502 1506511086.82448: done getting next task for host in-terminal01.prod.domain.tld
 12502 1506511086.82462:  ^ task is: TASK: Gathering Facts
 12502 1506511086.82478:  ^ state is: HOST STATE: block=0, task=0, rescue=0, always=0, run_state=ITERATING_SETUP, fail_state=FAILED_NONE, pending_setup=True, tasks child state? (None), rescue child state? (None), always child state? (None), did rescue? False, did start at task? False
 12502 1506511086.82492: done advancing hosts to next task
 12502 1506511086.83078: getting variables
 12502 1506511086.83098: in VariableManager get_vars()
 12502 1506511086.83157: Loading FilterModule 'core' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/core.py (found_in_cache=True, class_only=False)
 12502 1506511086.83176: Loading FilterModule 'ipaddr' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/ipaddr.py (found_in_cache=True, class_only=False)
 12502 1506511086.83193: Loading FilterModule 'json_query' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/json_query.py (found_in_cache=True, class_only=False)
 12502 1506511086.83209: Loading FilterModule 'mathstuff' from /usr/lib/python2.7/site-packages/ansible/plugins/filter/mathstuff.py (found_in_cache=True, class_only=False)
 12502 1506511086.83247: Loading TestModule 'core' from /usr/lib/python2.7/site-packages/ansible/plugins/test/core.py (found_in_cache=True, class_only=False)
 12502 1506511086.83263: Loading TestModule 'files' from /usr/lib/python2.7/site-packages/ansible/plugins/test/files.py (found_in_cache=True, class_only=False)
 12502 1506511086.83281: Loading TestModule 'mathstuff' from /usr/lib/python2.7/site-packages/ansible/plugins/test/mathstuff.py (found_in_cache=True, class_only=False)
 12502 1506511086.83394: done with get_vars()
 12502 1506511086.83421: done getting variables
 12502 1506511086.83438: sending task start callback, copying the task so we can template it temporarily
 12502 1506511086.83453: done copying, going to template now
 12502 1506511086.83470: done templating
 12502 1506511086.83484: here goes the callback...

TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************************
 12502 1506511086.83515: sending task start callback
 12502 1506511086.83530: entering _queue_task() for in-terminal01.prod.domain.tld/setup
 12502 1506511086.83694: worker is 1 (out of 1 available)
 12502 1506511086.83772: exiting _queue_task() for in-terminal01.prod.domain.tld/setup
 12502 1506511086.83840: done queuing things up, now waiting for results queue to drain
 12502 1506511086.83859: waiting for pending results...
 12511 1506511086.84236: running TaskExecutor() for in-terminal01.prod.domain.tld/TASK: Gathering Facts
 12511 1506511086.84316: in run()
 12511 1506511086.84379: calling self._execute()
 12511 1506511086.85119: Loading Connection 'ssh' from /usr/lib/python2.7/site-packages/ansible/plugins/connection/ssh.py
 12511 1506511086.85207: Loading ShellModule 'csh' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/csh.py
 12511 1506511086.85264: Loading ShellModule 'fish' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/fish.py
 12511 1506511086.85346: Loading ShellModule 'powershell' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/powershell.py
 12511 1506511086.85381: Loading ShellModule 'sh' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/sh.py
 12511 1506511086.85423: Loading ShellModule 'sh' from /usr/lib/python2.7/site-packages/ansible/plugins/shell/sh.py (found_in_cache=True, class_only=False)
 12511 1506511086.85479: Loading ActionModule 'normal' from /usr/lib/python2.7/site-packages/ansible/plugins/action/normal.py
 12511 1506511086.85500: starting attempt loop
 12511 1506511086.85516: running the handler
 12511 1506511086.85572: ANSIBALLZ: Using lock for setup
 12511 1506511086.85589: ANSIBALLZ: Acquiring lock
 12511 1506511086.85606: ANSIBALLZ: Lock acquired: 29697296
 12511 1506511086.85624: ANSIBALLZ: Creating module
 12511 1506511087.15592: ANSIBALLZ: Writing module
 12511 1506511087.15653: ANSIBALLZ: Renaming module
 12511 1506511087.15678: ANSIBALLZ: Done creating module
 12511 1506511087.15765: _low_level_execute_command(): starting
 12511 1506511087.15786: _low_level_execute_command(): executing: /bin/sh -c '/usr/bin/python && sleep 0'
 12511 1506511087.16363: Sending initial data
 12511 1506511089.69186: Sent initial data (103646 bytes)
 12511 1506511089.69246: stderr chunk (state=3):
>>>mux_client_request_session: session request failed: Session open refused by peer
ControlSocket /home/ansible/.ansible/cp/170b9dc5f6 already exists, disabling multiplexing
<<<

Here it hangs forever. I replaced this company's domain with "domain.tld"; the actual domain name is different.

This is still a bug in Ansible, at least in the sense that Ansible should produce a verbose ERROR message letting the user know which host is causing the hang-up, so that the host can be removed from the inventory file without a complex investigation of what went wrong.

I am also running a large inventory (2400+ hosts), seeing similar errors, and going through the same lengthy troubleshooting in the hope of finding a log verbosity that points me in a better direction.

ansible 2.4.1.0
python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]

Seeing it get to the end of a playbook and never make the futex and close calls. From the strace of a job that shows the problem:

21:39:26 stat("/var/lib/awx/projects/<job>/host_vars", 0x7ffd4ce0dfd0) = -1 ENOENT (No such file or directory)
21:39:26 stat("/var/lib/awx/.cache/facts/<host>", 0x7ffd4ce0dae0) = -1 ENOENT (No such file or directory)
21:39:26 stat("/var/lib/awx/.cache/facts/<host>", 0x7ffd4ce0dcf0) = -1 ENOENT (No such file or directory)
21:39:26 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
21:39:26 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
... infinite Timeout

From a successful job:

21:39:39 stat("/var/lib/awx/projects/<job>/host_vars", 0x7ffd4ce0dfd0) = -1 ENOENT (No such file or directory)
21:39:39 stat("/var/lib/awx/.cache/facts/<host>", 0x7ffd4ce0dae0) = -1 ENOENT (No such file or directory)
21:39:39 stat("/var/lib/awx/.cache/facts/<host>", 0x7ffd4ce0dcf0) = -1 ENOENT (No such file or directory)
21:39:39 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
21:39:39 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
21:39:39 futex(0x7fb178001390, FUTEX_WAKE_PRIVATE, 1) = 1
21:39:39 futex(0x1e9d090, FUTEX_WAKE_PRIVATE, 1) = 1
21:39:39 futex(0x1e9d090, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
21:39:39 futex(0x1e9d090, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
21:39:39 futex(0x1e9d090, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
21:39:39 futex(0x1e9d090, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
21:39:39 futex(0x7f15690, FUTEX_WAKE_PRIVATE, 1) = 1
21:39:39 futex(0x1e9d090, FUTEX_WAKE_PRIVATE, 1) = 1
21:39:39 futex(0x711f9e0, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
21:39:39 close(7)                       = 0
... continues to close and wrap up

@benapetr / @jmighion how did you guys solve the underlying problem? I am trying to figure out a workaround when issues like this arise.

There are multiple reasons why this can happen. I never said I found a solution. You can, in theory, run the playbook by hand in a loop on each host separately to figure out which one is causing problems, and then you need to investigate that host.

A good example of a problem that can cause this is a disconnected NFS mount, which hangs even the "df" command.
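A quick way to check a target for such a wedged mount is to bound df with coreutils `timeout`, which turns an indefinite block into a detectable non-zero exit (124 on timeout); a sketch, with the 5-second budget and the path chosen arbitrarily:

```shell
# A hung NFS/fuse mount makes df block forever in uninterruptible I/O;
# `timeout` kills it after 5 seconds so the check always returns.
timeout 5 df -P / >/dev/null && echo "mounts OK" || echo "df blocked or failed"
```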

Sorry, I don't remember finding a solution. I think ours was hanging on a device we were trying to connect to, but I honestly can't say. We're not hitting the problem anymore, though.

https://github.com/ansible/ansible/issues/30411#issuecomment-360766621

> A good example of a problem that can cause this is a disconnected NFS mount, which hangs even the "df" command.

That was exactly my problem. No ansible.cfg parameter (neither timeout nor fact_caching_timeout) helped me interrupt the process. Thank you so much!
IMO this should result in an error/timeout on the control machine.

I had a similar issue. I set pipelining from True to False, ran the playbook (successfully after this change), and then set it back to True, after which subsequent playbook runs also worked correctly.
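As a side note, pipelining can also be toggled for a single run via the environment instead of editing ansible.cfg (the playbook name is a placeholder):

```shell
# ANSIBLE_SSH_PIPELINING overrides [ssh_connection] pipelining for this run only.
ANSIBLE_SSH_PIPELINING=False ansible-playbook site.yml
```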

Edit: @benapetr: Thanks! This was actually the underlying issue. A folder is mounted over SSHFS through a reverse SSH tunnel (for pushing from the Ansible control machine). After manually connecting over SSH to the target machine and unmounting it with umount -l ..., the problem went away.

Looks like I have a similar problem with Ansible and an NFS mount on the target: lsof, df, ansible... all hanging.
Setting pipelining to false in the ansible.cfg did not help.

@jeroenflvr: In my case the SSHFS mount hangs (also when invoking the mount command manually). I switched to a CIFS/samba mount over the SSH reverse tunnel and now it works without hanging. This seems to be partly fuse-related, but not Ansible-related.

It doesn't matter whether the hang on the target host is caused by fuse, NFS, or anything else; it's still an Ansible bug in the sense that one target host should not be able to lock up the whole play for all other hosts.

There must be some timeout or internal watchdog that kills the playbook on misbehaving and broken machines. One broken machine out of 5000 shouldn't break Ansible for all 5000 machines.
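For what it's worth, later Ansible releases grew a partial answer to this: Ansible 2.10 added a per-task `timeout` keyword (in seconds), which fails the task on the stuck host while the play continues on the others. A minimal sketch, assuming Ansible 2.10 or newer:

```yaml
# The task is killed after 30 seconds and reported as failed for that host;
# the remaining hosts in the play keep running.
- hosts: all
  gather_facts: false
  tasks:
    - name: command that may block on a stale mount
      command: df -P
      timeout: 30
```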

@benapetr yes, this has to be tackled some day; not as a bug, but as a design issue. Such hangs are the big plague, and no single "fix" will resolve them for real. I have used it occasionally since 2011, and IMO it'll always be like this until someone finally accepts that something has to be done at a higher level.

Same issue for me.
I have six macOS hosts and I run:
ansible-playbook tcmacagents.yml -f 6
It hangs on the first step, "Gathering Facts", always on the same host (tcmacagent5):
```
TASK [Gathering Facts] *************************************
ok: [tcmacagent1]
ok: [tcmacagent2]
ok: [tcmacagent6]
ok: [tcmacagent3]
ok: [tcmacagent4]
```

strace:

```
[pid 24072] select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
[pid 24072] wait4(24211, 0x7ffc608410a4, WNOHANG, NULL) = 0
[pid 24072] wait4(24211, 0x7ffc608410d4, WNOHANG, NULL) = 0
[pid 24072] select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
[pid 24072] wait4(24211, 0x7ffc608410a4, WNOHANG, NULL) = 0
[pid 24072] wait4(24211, 0x7ffc608410d4, WNOHANG, NULL) = 0
[pid 24072] select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
[pid 24072] wait4(24211, 0x7ffc608410a4, WNOHANG, NULL) = 0
[pid 24072] wait4(24211, 0x7ffc608410d4, WNOHANG, NULL) = 0
[pid 24072] select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
[pid 24072] wait4(24211, 0x7ffc608410a4, WNOHANG, NULL) = 0
[pid 24072] wait4(24211, 0x7ffc608410d4, WNOHANG, NULL) = 0
[pid 24072] select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
[pid 24072] wait4(24211, 0x7ffc608410a4, WNOHANG, NULL) = 0
[pid 24072] wait4(24211, 0x7ffc608410d4, WNOHANG, NULL) = 0
[pid 24072] select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
```

How can I fix it? What is wrong with my tcmacagent5?

# ansible --version
ansible 2.5.0

If it is a mount and it is not usable (touch a file or something to check), remount it. In my particular situation I always reverted a VM to a snapshot for testing purposes, and the NFS handle didn't match within this VM, so it was in a stale state.

We have also been experiencing Ansible hanging without any error message when running playbooks against 23 hosts with forks=25. We use Ansible 2.2.3.0, but have the same problem with Ansible 2.5.3. We run our playbooks from macOS 10.12.6, using Python 2.7.15. With forks=5 I don't see the problem. We also don't see the problem when using transport=paramiko (and forks=25).

We have the same issue. Playbooks randomly hang during execution with no error messages. After pressing Ctrl+C the playbook continues.

This is also a problem for me, running ansible 2.16.1. I think it mostly happens during the fact-gathering stage, when df hangs (in my case). There should be a mechanism to collect facts with the option of opting out of some parts of it (?)... just a thought.

I have the same problem, some kind of timeout (basically pausing the playbook without an error).
I am using the kubectl Ansible connection plugin to connect to my Kubernetes pods and run a playbook against those hosts (3 pods). It's very slow, taking about 8 seconds to run each module on a remote host, and pausing in the middle of the playbook without any error or message is kind of weird.

How can such a serious issue remain unresolved for so long?

Nobody has time to work on a fix?

@pillarsdotnet: I eventually switched to a CIFS/samba mount instead and there are no issues with hanging.

I'm seeing this problem with no network mounts at ALL. Just plain-vanilla AWS instances running Ubuntu 18.

OK, after checking which ssh session it was stuck at, I found out that in my case it was yum going on forever because it failed to connect to mirrors. Of course any other hanging process can/will do the same. The real problem to solve is timeouts in general.

@strarsis you can also simply use a soft mount instead of a hard mount (which tells it to block! on IO if the target is missing), but this is not the point. Like @tuxick says, the problem is that there is no error handling for remote timeouts. Manually fixing trivial timeouts in the surrounding env so the automation can manage it is ... WHY DOES ONE EVEN DISCUSS THAT?
(The "df" part of gathering could fail legitimately, or the "yum" module could fail... even a system could fail, if we can't come up with anything less sad. But what cannot happen is a hang of _the whole run_ against an env.)

Yeah, if I can't figure out a workaround, I'm gonna officially recommend we switch away from ansible and toward a more mature automation system.

https://docs.ansible.com/ansible/2.5/user_guide/playbooks_async.html even uses yum as an example. This doesn't solve all problems, but it will help in cases where you know things might 'hang', like NFS mounts and yum.

@tuxick So in general, I should replace this:

- name: Something that might hang
  module:
    param: data
  option: value

with this:

- name: Something that might hang
  block:
    - module:
        param: data
      option: value
      async: '{{ async_timeout|default(1000) }}'
      register: 'something_that_might_hang'
    - async_status:
        jid: '{{ something_that_might_hang.ansible_job_id }}'
      register: '{{ something_that_might_hang_result }}'
      retries: '{{ async_retries|default(30) }}'
      until: 'something_that_might_hang_result is defined and
              "finished" in something_that_might_hang_result and
               something_that_might_hang_result.finished'
      when: 'something_that_might_hang.ansible_job_id is defined'

I stuck with a plain "async: 300" for now, but sure, it can look nicer :). I'm wondering if some form of timeout should be built into modules that might suffer from this kind of problem. The only ones I ever saw go on forever have been NFS and yum anyway.

Most of the hangs I'm seeing are with pip.

Actually, the above pattern is failing, with an error indicating that the when clause could not be evaluated because the variable in the register clause is undefined.

(updated pattern...)

A simple way to avoid hangs, provided you can identify potentially hanging tasks.

Replace this:

- name: Something that might hang
  module:
    param: data
  option: value

with this:

- name: Something that might hang
  module:
    param: data
  option: value
  async: '{{ async_timeout|default(1000) }}'
  poll: '{{ async_poll|default(10) }}'

I have to agree; we have faced this problem a few times when we have 1000+ servers in the inventory, and it locks up with bad clients. Each module must have a timeout so that Ansible just moves on to the next host or task if something is stuck for xx minutes. Every time, we have to trim down the number of hosts in the inventory and run jobs over and over. By the time we reach the last set, the issue on the bad host has resolved itself and we are not able to figure out what went wrong.

I'm having the same issue with facts with ansible 2.5.1.
I used a limited fact subset, gather_subset = network, which is enough for me as I only need IP, kernel, and OS information to set my group_vars; that seems to be OK.
But it seems some modules use lsof — zypper, I think. I had a playbook which was stuck; looking at the processes, I found lsof -n -FpcuLRftkn0, and after killing it the playbook resumed :(

When it comes to updating a large number of servers, it's a real pain.
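The fact-limiting approach above can also be written as an explicit setup call; a sketch (the subset names are the standard ones; hardware is the subset that enumerates mounts and is the usual hang):

```yaml
# Skip automatic fact gathering, then collect only a cheap subset;
# '!all' drops everything optional, and network adds back what group_vars need.
- hosts: all
  gather_facts: false
  tasks:
    - name: gather a minimal fact subset
      setup:
        gather_subset:
          - '!all'
          - network
```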

So we are looking into a few things that might help. #47684 adds a new event for callbacks to show 'which host' a task is being started on; you can then compare against 'finished' hosts to figure out which one hung.

Another line of attack is #49398, which attempts to deal with the primary suspect/culprit in most cases: querying mount information. This hopes both to make the process faster and to disown blocked checks, reporting back either with the full information or with the reason why we weren't able to get it (timeout or actual error).

Hi,

this problem exists in ansible 2.6 as well

ansible 2.6.11
  config file = None
  configured module search path = [u'/home/pkolze/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /home/pkolze/.virtualenvs/ansible26/local/lib/python2.7/site-packages/ansible
  executable location = /home/pkolze/.virtualenvs/ansible26/bin/ansible
  python version = 2.7.13 (default, Sep 26 2018, 18:42:22) [GCC 6.3.0 20170516]

On my side, the problem was caused by the OOM killer:
[ 2507.235957] Out of memory: Kill process 14022 (ansible-playboo) score 393 or sacrifice child
[ 2507.238181] Killed process 14022 (ansible-playboo) total-vm:408724kB, anon-rss:175508kB, file-rss:732kB

definitely not related to ansible
my 2 cents

When I run a command with wait_for and a timeout, the timeout doesn't work if the connection with the host is broken. In this example, it has been running for two hours, and it should have died after 60 minutes. The connection with the host breaks due to a reboot. It would be good for the timeout to handle a broken connection, unless there's another option.

  - name: "Wait until Softnas Update completes. This can take 15-45 mins"
    wait_for:
      path: /tmp/softnas-update.status
      search_regex: "^OK. SoftNAS software update to version.*completed at.*$"
      timeout: 3600
    when: update_file.stat.exists and softnas_runupdate

I wrote a role to deal with long-running tasks here: https://github.com/pillarsdotnet/ansible-watch.git

Thanks for sharing, thats pretty cool!

@pillarsdotnet It seems ansible doesn't support include_tasks with until (https://github.com/ansible/ansible/issues/17098). How did you make it work in the file /tasks/again.yml? (I tested with ansible 2.7.8)

@lucasbasquerotto -- I guess it doesn't work as I expected. Must re-think this. Thanks.

This comment is enlightening:

It looks like in Ansible, if one needs to do something complex, one should write an action plugin. This is what we are going to do.
Here one has many examples:
https://github.com/ansible/ansible/tree/devel/lib/ansible/plugins/action

So I guess I should convert loop.yml into an action plugin that calls async_status and shell.

@pillarsdotnet I was able to make it work by making a loop in again.yml (using range, between 0 and the number of retries) and adding a wait time in loop.yml (to _sleep_ in each iteration):

_again.yml:_

- name: 'retries'
  set_fact:
    watch_retries: '{{ watch_timeout / watch_poll }}'

- name: 'checking {{ watch_job }} status until finished'
  include_tasks: 'loop.yml'
  loop: "{{ range(0, watch_retries | int, 1) | list }}"

_loop.yml:_

...

- wait_for:
    timeout: '{{ watch_poll | int }}'
  when: not watch_status.finished

...

Just make sure that the number of retries is not too big, otherwise it might return a memory-related error (ansible expands every include before actually running the loop). So if the timeout is 3600, make poll something like 10 (instead of 1) so that the include is done 360 times (instead of 3600).

Another thing is that if the long running job is run with some user, the async_status call in loop.yml must be the same user (it's important to keep this in mind when running with become: yes).
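
To illustrate that point (a sketch; the script path and task names are hypothetical): async job status files live under that user's ~/.ansible_async on the remote host, so if the job was started with become, the status check must use become too:

```yaml
- name: Start a long-running job as root (hypothetical script)
  command: /usr/local/bin/long-job.sh
  async: 3600
  poll: 0
  become: yes
  register: long_job

- name: Poll it with the same user that started it
  async_status:
    jid: "{{ long_job.ansible_job_id }}"
  become: yes   # must match the identity of the task above
  register: job_result
  until: job_result.finished
  retries: 360
  delay: 10
```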

Just stumbled upon this issue when trying to use the script module for one of my provisioning scripts. There was no problem in earlier uses of the script module (in the same playbook), but for some reason this one hung forever.

Then I tried copying the script to /usr/bin/ and running it with the shell module instead, which worked.

A small mitigation: #53819 makes it possible to see which hosts are being run.

Same issue when using ansible in docker.

ansible 2.7.8
  config file = None
  configured module search path = ['/home/jenkins/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.7/site-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.7.2 (default, Mar  5 2019, 06:22:51) [GCC 6.3.0 20170516]

I am facing this issue when deploying to exact same virtual machines (they are all clones of the same machine). Some of them get the public repo cloned, while others just hang

I have the same problem with a single task. Here is the output with ANSIBLE_DEBUG=1 (slightly redacted):

< TASK [myrole : Set Admnistrator SSH key] >
 ---------------------------------------------------- 
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

task path: /home/kvaps/git/myrepo/myproject/roles/myrole/tasks/myrole.yaml:54
 29952 1559645685.71733: sending task start callback
 29952 1559645685.71745: entering _queue_task() for stage/raw
 29952 1559645685.72714: worker is 1 (out of 1 available)
 29952 1559645685.72762: exiting _queue_task() for stage/raw
 29952 1559645685.72783: done queuing things up, now waiting for results queue to drain
 29952 1559645685.72794: waiting for pending results...
 30001 1559645685.73159: running TaskExecutor() for stage/TASK: myrole : Set Admnistrator SSH key
 30001 1559645685.73825: in run() - task 10e7c6f3-602f-cf0b-3453-000000000030
 30001 1559645685.74352: calling self._execute()
 30001 1559645685.75753: Loading TestModule 'core' from /usr/lib/python3.7/site-packages/ansible/plugins/test/core.py (found_in_cache=True, class_only=False)
 30001 1559645685.75836: Loading TestModule 'files' from /usr/lib/python3.7/site-packages/ansible/plugins/test/files.py (found_in_cache=True, class_only=False)
 30001 1559645685.75904: Loading TestModule 'mathstuff' from /usr/lib/python3.7/site-packages/ansible/plugins/test/mathstuff.py (found_in_cache=True, class_only=False)
 30001 1559645685.76498: Loading FilterModule 'core' from /usr/lib/python3.7/site-packages/ansible/plugins/filter/core.py (found_in_cache=True, class_only=False)
 30001 1559645685.76653: Loading FilterModule 'ipaddr' from /usr/lib/python3.7/site-packages/ansible/plugins/filter/ipaddr.py (found_in_cache=True, class_only=False)
 30001 1559645685.76768: Loading FilterModule 'json_query' from /usr/lib/python3.7/site-packages/ansible/plugins/filter/json_query.py (found_in_cache=True, class_only=False)
 30001 1559645685.76894: Loading FilterModule 'k8s' from /usr/lib/python3.7/site-packages/ansible/plugins/filter/k8s.py (found_in_cache=True, class_only=False)
 30001 1559645685.76996: Loading FilterModule 'mathstuff' from /usr/lib/python3.7/site-packages/ansible/plugins/filter/mathstuff.py (found_in_cache=True, class_only=False)
 30001 1559645685.77076: Loading FilterModule 'network' from /usr/lib/python3.7/site-packages/ansible/plugins/filter/network.py (found_in_cache=True, class_only=False)
 30001 1559645685.77162: Loading FilterModule 'urls' from /usr/lib/python3.7/site-packages/ansible/plugins/filter/urls.py (found_in_cache=True, class_only=False)
 30001 1559645685.77279: Loading FilterModule 'urlsplit' from /usr/lib/python3.7/site-packages/ansible/plugins/filter/urlsplit.py (found_in_cache=True, class_only=False)
 30001 1559645685.83815: trying /usr/lib/python3.7/site-packages/ansible/plugins/lookup
 30001 1559645685.84568: Loaded config def from plugin (lookup/pipe)
 30001 1559645685.84578: Loading LookupModule 'pipe' from /usr/lib/python3.7/site-packages/ansible/plugins/lookup/pipe.py
 30001 1559645685.86089: Loaded config def from plugin (lookup/file)
 30001 1559645685.86108: Loading LookupModule 'file' from /usr/lib/python3.7/site-packages/ansible/plugins/lookup/file.py
 30001 1559645685.86125: File lookup term: /home/kvaps/git/myrepo/myproject/environments/stage/secrets/ssh_keys/administrator/id_rsa.pub
 30001 1559645685.86928: trying /usr/lib/python3.7/site-packages/ansible/plugins/connection
 30001 1559645685.87110: Loading Connection 'ssh' from /usr/lib/python3.7/site-packages/ansible/plugins/connection/ssh.py (found_in_cache=True, class_only=False)
 30001 1559645685.87329: trying /usr/lib/python3.7/site-packages/ansible/plugins/shell
 30001 1559645685.87513: Loading ShellModule 'sh' from /usr/lib/python3.7/site-packages/ansible/plugins/shell/sh.py (found_in_cache=True, class_only=False)
 30001 1559645685.87623: Loading ShellModule 'sh' from /usr/lib/python3.7/site-packages/ansible/plugins/shell/sh.py (found_in_cache=True, class_only=False)
 30001 1559645685.88879: Loading ActionModule 'raw' from /usr/lib/python3.7/site-packages/ansible/plugins/action/raw.py (searched paths: /usr/lib/python3.7/site-packages/ansible/plugins/action:/usr/lib/python3.7/site-packages/ansible/plugins/action/__pycache__) (found_in_cache=True, class_only=False)
 30001 1559645685.89061: starting attempt loop
 30001 1559645685.89151: running the handler
 30001 1559645685.89317: _low_level_execute_command(): starting
 30001 1559645685.89413: _low_level_execute_command(): executing: set user sshkey Administrator "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQChKqNqqtE+Va3WCSzf7QGRkackZ0dwX0RYF1r6QoVEqf2sXNa07GNjPyJ9lmUM3d6Az41pI5aJt6JVIxQfihaz4JNuoN5HqiZe/RCZ/ztBEY2UkVJfMH/PnFqNhBc1Y73DYkfA2N2BU3daju9Pah4sTwt4pDlAoZImyxw4gnJ0M7Z5hWtKIV/nK/5/FU5+CB9cMPQQB6BqmHRPk85SuyVOiCOYC1sseC6rafBSwM5/1IbHNVEDL/+scfJnmRnQSlAjytxz0jIXpkPCXC0AXDYpjElYsCPxyM/9JDqjiQ5HZ+WVl3Ou+oXZ67ag7eacIZGxTcAfb3dzw0+FDCZdOMhn root@admin"
<myhost.example.org> ESTABLISH SSH CONNECTION FOR USER: Administrator
<myhost.example.org> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o 'IdentityFile="/home/kvaps/git/myrepo/myproject/environments/stage/secrets/ssh_keys/administrator/id_rsa"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="Administrator"' -o ConnectTimeout=10 -o Ciphers=+aes128-cbc -o ControlPath=/home/kvaps/.ansible/cp/5dbae0ef82 -tt myhost.example.org 'set user sshkey Administrator "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQChKqNqqtE+Va3WCSzf7QGRkackZ0dwX0RYF1r6QoVEqf2sXNa07GNjPyJ9lmUM3d6Az41pI5aJt6JVIxQfihaz4JNuoN5HqiZe/RCZ/ztBEY2UkVJfMH/PnFqNhBc1Y73DYkfA2N2BU3daju9Pah4sTwt4pDlAoZImyxw4gnJ0M7Z5hWtKIV/nK/5/FU5+CB9cMPQQB6BqmHRPk85SuyVOiCOYC1sseC6rafBSwM5/1IbHNVEDL/+scfJnmRnQSlAjytxz0jIXpkPCXC0AXDYpjElYsCPxyM/9JDqjiQ5HZ+WVl3Ou+oXZ67ag7eacIZGxTcAfb3dzw0+FDCZdOMhn root@admin"'
 30001 1559645685.96164: stdout chunk (state=2):
>>>set user sshkey Administrator "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQChsiesjhJ9Vyo9WPxMvaytVkX3TAOCiAqMk1ay8D5UlbC6eKHU4fPbnLIt1dkU87+WMak4NAx8l/QurqrzimvUIVwtc6LLtvCj1ZY1GVDu9z3XFPno6xp63lQHMDGnf1jb0PalNWd86tlpR6uU/48BvzPutpZ86Hgp7cFBf7C94j33dO87/rnVdItUVgCIM+W4VtToEMZfMj8V9qKuOX9KW16z0MYBAqMKcY+jUVI6tpD1R+ltuuG+qG8omHG1RTAR5IpyrWYFDOfSk79N87gbZljaYAsdHfuc1f2mJLTZ5lmqT0ansOpqHwWXXxfieo8xKDplFyy6nb5HfJNE8qFDuuJkHwwnRqtjUTDomHYrS/+OKSXWcrOKYBzmgRZO0sRiZT+OJ6kL3nykzKPXOpzoJA09cUrmszaFSyixJZJlibXHjJ4b6dan0c0rIoBb7dQlzSiJw3G57lr8aj9LcjeqbYXrZDSPG7qy37azIyW65O514ZfosxQbjYaMZojfbAs= root@admin"

strace:

epoll_wait(10, [], 2, 7976)             = 0
wait4(31685, 0x7fff339a4c74, WNOHANG, NULL) = 0
epoll_wait(10, [], 2, 12000)            = 0
wait4(31685, 0x7fff339a4c74, WNOHANG, NULL) = 0
epoll_wait(10, [], 2, 12000)            = 0
wait4(31685, 0x7fff339a4c74, WNOHANG, NULL) = 0
epoll_wait(10, [], 2, 12000)            = 0
wait4(31685, 0x7fff339a4c74, WNOHANG, NULL) = 0

ansible:

ansible 2.8.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/home/kvaps/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 3.7.3 (default, Mar 26 2019, 21:43:19) [GCC 8.2.1 20181127]

I use raw module

BTW, I fixed this by adding this to ansible.cfg:

[ssh_connection]
control_path = none

+label affects_2.7

The same thing is happening in #57780 - ansible-playbook is waiting for shell module to finish on the remote host(s) and then it hangs forever when this action takes too long and numerous of "Cannot allocate memory" appear shortly before that. strace also shows the same thing as reported by other people here (WNOHANG).

What I don't understand is why certain timeouts don't kick in, e.g.:

connect_timeout
command_timeout
accelerate_timeout
accelerate_connect_timeout
accelerate_daemon_timeout

This bug has been opened for over 640 days and it is still happening.

My two cents:

For me, Ansible hangs because I configure and enable UFW somewhere in one of my included playbooks.

When UFW is enabled, Ubuntu servers break open sockets after a certain amount of time. So Ansible keeps executing other tasks and then suddenly hangs later, about 15s after enabling UFW.

The solution is to configure Ubuntu to keep connections alive when enabling the firewall.
I've found the solution here: https://github.com/ansible/ansible/issues/45446

It fixes the problem for me:

- name: Configure the kernel to keep connections alive when enabling the firewall
  sysctl:
    name: net.netfilter.nf_conntrack_tcp_be_liberal
    value: 1
    state: present
    sysctl_set: yes
    reload: yes

- name: Enable ufw
  ufw: state=enabled

I fixed it by setting transport = paramiko in ansible.cfg

I am still facing the same issue with openstack-ansible.
I tried all of the above ansible.cfg combinations but none of them helped :(

# ansible --version
  configured module search path = [u'/etc/ansible/roles/config_template/library', u'/etc/ansible/roles/plugins/library', u'/etc/ansible/roles/ceph-ansible/library']
  ansible python module location = /opt/ansible-runtime/lib/python2.7/site-packages/ansible
  executable location = /opt/ansible-runtime/bin/ansible
  python version = 2.7.5 (default, Jun 20 2019, 20:27:34) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]

traces:

# ansible -m setup 192.168.56.102 -vvv
Variable files: "-e @/etc/openstack_deploy/user_secrets.yml -e @/etc/openstack_deploy/user_variables.yml "
ansible 2.7.9
  config file = None
  configured module search path = [u'/etc/ansible/roles/config_template/library', u'/etc/ansible/roles/plugins/library', u'/etc/ansible/roles/ceph-ansible/library']
  ansible python module location = /opt/ansible-runtime/lib/python2.7/site-packages/ansible
  executable location = /opt/ansible-runtime/bin/ansible
  python version = 2.7.5 (default, Jun 20 2019, 20:27:34) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]
No config file found; using defaults
/opt/openstack-ansible/inventory/dynamic_inventory.py did not meet host_list requirements, check plugin documentation if this is unexpected
Parsed /opt/openstack-ansible/inventory/dynamic_inventory.py inventory source with script plugin
/opt/openstack-ansible/inventory/inventory.ini did not meet host_list requirements, check plugin documentation if this is unexpected
/opt/openstack-ansible/inventory/inventory.ini did not meet script requirements, check plugin documentation if this is unexpected
/opt/openstack-ansible/inventory/inventory.ini did not meet yaml requirements, check plugin documentation if this is unexpected
Parsed /opt/openstack-ansible/inventory/inventory.ini inventory source with ini plugin
/etc/openstack_deploy/inventory.ini did not meet host_list requirements, check plugin documentation if this is unexpected
/etc/openstack_deploy/inventory.ini did not meet script requirements, check plugin documentation if this is unexpected
/etc/openstack_deploy/inventory.ini did not meet yaml requirements, check plugin documentation if this is unexpected
Parsed /etc/openstack_deploy/inventory.ini inventory source with ini plugin
META: ran handlers
Using module file /opt/ansible-runtime/lib/python2.7/site-packages/ansible/modules/system/setup.py
<192.168.56.102> ESTABLISH SSH CONNECTION FOR USER: None
<192.168.56.102> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=5 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ServerAliveInterval=64 -o ServerAliveCountMax=1024 -o Compression=no -o TCPKeepAlive=yes -o VerifyHostKeyDNS=no -o ForwardX11=no -o ForwardAgent=yes -T -o ControlPath=none 192.168.56.102 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''

appreciated any help
thanks

@rakeshz11

Seeing the same problem here as of this morning when running ansible via packer. Literally no idea what changed, but this morning it's broken 👍

One thing I noticed on Ubuntu 18.04 (tested on a $5 droplet at Digital Ocean) is that if I install ansible with pip3 (using python3, though maybe it happens with python2 and pip too), ansible hangs for a long time in some cases (not always, but if I kept executing playbooks it happened more than once).

The weird thing is that once ansible hung, even after pressing Ctrl+C and running the playbook again, it hung immediately, and it hung even on ansible --version. Weirder still, even python --version hung (though echo "something" and ls worked fine). Even if I disconnected from SSH and reconnected, it kept hanging for some minutes (and then worked again for a while).

In the end I uninstalled ansible (2.8) with pip3 and installed it with apt, and it hasn't happened since.

Hey guys.

In my case Ansible was hanging in "gathering facts" because of two conflicting entries in my known_hosts. I found out when I tried to ssh manually:

Warning: the RSA host key for '****' differs from the key for the IP address '****'
Offending key for IP in /****/known_hosts:168
Matching host key in /****/known_hosts:368
Are you sure you want to continue connecting (yes/no)? ^C

After removing these two entries and running ssh manually again (in order to put the right entry in known_hosts), Ansible did not hang anymore.
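
If you want to guard against this class of hang from within a play (a sketch; whether you trust ssh-keyscan output on first contact is your call), the known_hosts module can pin one consistent entry per host so ssh never prompts:

```yaml
- name: Pin the current host key so ssh never prompts interactively
  known_hosts:
    name: "{{ inventory_hostname }}"
    key: "{{ lookup('pipe', 'ssh-keyscan -t rsa ' ~ inventory_hostname) }}"
    state: present
  delegate_to: localhost
```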

Hope it helps!

this is an embarrassing bug but this "company" does not give a single shit anyway

Really helpful information @gophobic ... thanks for your input. 👍

Our way of working around this problem is to run playbooks with -u (username) and to have a bash script running that scans for open ssh sessions older than a specified amount of time, killing the PID when we can assume the host is causing the playbook to hang. I usually make this happen during the gathering-facts stage or some other non-impactful dummy play that just checks the connection. If all problematic hosts get killed off this way, they simply fail early and the playbook continues through the rest of the plays, since every host that would cause hanging has already failed.

I explored the ANSIBLE_SHOW_PER_HOST_START=True option, which may help identify which hosts are causing an issue, but I would still prefer to be able to start the playbook, know that it will finish, and then deal with any failures after the fact.

Would it be possible to implement a pre-check to make sure Ansible can connect to each host and fail right away if not?
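
Something close to that pre-check can be approximated today (a sketch, not a built-in feature): a first play that probes every host with a short timeout, so unreachable hosts fail out before the real plays start:

```yaml
- hosts: all
  gather_facts: false
  tasks:
    - name: Fail fast on unreachable hosts
      wait_for_connection:
        timeout: 30   # hosts that cannot connect fail here instead of hanging later

- hosts: all
  tasks:
    - name: Real work only runs on hosts that passed the probe
      command: echo works
```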

Our way of working around this problem is to run playbooks using -u (username) and to have a bash script running that scans and kills open ssh sessions that exist after a specified amount of time and kill the PID when we can assume the host is causing the playbook to hang. I usually make this happen during the gathering facts stage or some other very non-impactful dummy play to check connection. If all problematic hosts get killed off using this method, they simply fail on the playbook and it will continue on through the rest of the plays as all hosts that would cause hanging are already failed.

Doing the same, but scraping the /proc filesystem to find the ssh process start time, then killing after n seconds. Unfortunately that then triggers SSH_RETRIES, so the kill script needs to run several times.

We run against an inventory of several thousand hosts, so it is impractical to enable further debugging to narrow down the issue. Reducing the inventory to a few hundred hosts still exhibits the issue. Like others have stated, toggling mux, changing ControlMaster parameters, pipelining, and paramiko have not helped.

In our environment, we see hung ssh mux processes; strace shows poll syscalls, but the processes, if not killed, remain active for hours or days. The ansible supervisor process really needs to implement a hard stop, or a hard kill timeout.
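
Until such a hard kill timeout exists, one mitigation (a sketch; the interval values are arbitrary) is to make the ssh client itself give up on a dead peer via keepalive options, e.g. in group_vars:

```yaml
# group_vars/all.yml
# ssh exits after ~90s of silence (3 probes x 30s) instead of polling forever
ansible_ssh_common_args: '-o ServerAliveInterval=30 -o ServerAliveCountMax=3'
```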

I believe this can also be due to an oversight in the implementation of SSH connection retries (and lacking/bad documentation as far as I could see).
SSH connections in ansible are also influenced by your local .ssh/config or your host's /etc/ssh/ssh_config in addition to the standard ansible.cfg files.
I discovered this when trying to diagnose hangs due to the wait_for_connection module.
As long as the underlying SSH command is executing, nothing is printed in the logs (even at the verbosity levels shown below), because ansible is actually stuck waiting for SSH to give back control. It seems to be a synchronous exec().

Ansible details:

❯ ansible-playbook --version
ansible-playbook 2.9.0
  config file = None
  configured module search path = ['~/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/Cellar/ansible/2.9.0/libexec/lib/python3.7/site-packages/ansible
  executable location = /usr/local/bin/ansible-playbook
  python version = 3.7.5 (default, Nov  1 2019, 02:16:23) [Clang 11.0.0 (clang-1100.0.33.8)]

The following will run for 10m, sources .ssh/config: ConnectionAttempts 10, no/default ansible.cfg:

❯ cat ans-test-conn.yml
- hosts: all
  gather_facts: false
  tasks:
      - name: Wait for 60s
        wait_for_connection:
            timeout: 60

❯ ansible-playbook -vvvvvvvv -i8.8.8.8, ans-test-conn.yml
[...]
<8.8.8.8> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s
  -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey
  -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=~/.ansible/cp/28faeff4e9 8.8.8.8 '/bin/sh -c '"'"'echo ~ && sleep 0'"'"''
[...]
>>> elapsed time 11m12s

The following will run for 30m, sources .ssh/config: ConnectionAttempts 10, SSH_ANSIBLE_RETRIES=3:

❯ cat ans-test-conn.yml
- hosts: all
  gather_facts: false
  tasks:
      - name: Wait for 60s
        wait_for_connection:
            timeout: 60

❯ export SSH_ANSIBLE_RETRIES=3
❯ ansible-playbook -vvvvvvvv -i8.8.8.8, ans-test-conn.yml
[...]
<8.8.8.8> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s
  -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey
  -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=~/.ansible/cp/28faeff4e9 8.8.8.8 '/bin/sh -c '"'"'echo ~ && sleep 0'"'"''
<8.8.8.8> (255, b'', b'OpenSSH_7.9p1, LibreSSL 2.7.3\r\ndebug1: Reading configuration data ~/.ssh/config\r\n
  [...]
  debug1: connect to address 8.8.8.8 port 22: Operation timed out\r\nssh: connect to host 8.8.8.8 port 22: Operation timed out\r\n')
<8.8.8.8> ssh_retry: attempt: 1, ssh return code is 255. cmd ([b'ssh', b'-vvv', b'-C', b'-o', b'ControlMaster=auto', b'-o', b'ControlPersist=60s',
  b'-o', b'KbdInteractiveAuthentication=no', b'-o', b'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey',
  b'-o', b'PasswordAuthentication=no', b'-o', b'ConnectTimeout=10', b'-o', b'ControlPath=~/.ansible/cp/28faeff4e9',
  b'8.8.8.8', b"/bin/sh -c 'echo ~ && sleep 0'"]...), pausing for 0 seconds
<8.8.8.8> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s
  -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey
  -o PasswordAuthentication=no -o ConnectTimeout=10 -o ControlPath=~/.ansible/cp/28faeff4e9 8.8.8.8 '/bin/sh -c '"'"'echo ~ && sleep 0'"'"''
[...]
>>> elapsed time 30m43s

The following will actually retry for the time we specify (1 initial attempt + 3 retries of 60s each, roughly 4m total), because we are overwriting both ConnectTimeout and ConnectionAttempts (not sure if ControlMaster=no is really required, but I was basically reimplementing wait_for_connection and did not want to clash with the persistent jump-host ControlMaster):

❯ cat ans-test-conn-2.yml
- hosts: all
  gather_facts: false
  tasks:
      - name: Wait for 60s
        connection: local
        shell: "ssh -oConnectTimeout=60s -oControlMaster=no -oConnectionAttempts=1 {{ inventory_hostname }} true"
        register: ssh_conn_result
        retries: 3
        delay: 1
        until: ssh_conn_result is not failed
[..]
<8.8.8.8> EXEC /bin/sh -c 'rm -f -r ~/.ansible/tmp/ansible-tmp-1573639556.634926-33012718610740/ > /dev/null 2>&1 && sleep 0'
FAILED - RETRYING: Wait for 60s (2 retries left).Result was: {
    "attempts": 2,
    "changed": true,
    "cmd": "ssh -oConnectTimeout=60s -oControlMaster=no -oConnectionAttempts=1 8.8.8.8 true",
    "delta": "0:01:00.052729",
    "end": "2019-11-13 11:06:56.841326",
    "invocation": {
        "module_args": {
            "_raw_params": "ssh -oConnectTimeout=60s -oControlMaster=no -oConnectionAttempts=1 8.8.8.8 true",
            "_uses_shell": true,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true,
            "warn": true
        }
    },
    "msg": "non-zero return code",
    "rc": 255,
    "retries": 4,
    "start": "2019-11-13 11:05:56.788597",
    "stderr": "ssh: connect to host 8.8.8.8 port 22: Operation timed out",
    "stderr_lines": [
        "ssh: connect to host 8.8.8.8 port 22: Operation timed out"
    ],
    "stdout": "",
    "stdout_lines": []
}
[..]
>>> elapsed time 4m5s

1 initial attempt + 3 retries => 4m.

The last script will only work if you're not behind a jump-host with custom command (as long as SSH handles the connection itself).
If you have a jump-host to connect to the destination and have a custom proxy_command, you will need to raise delay to 59-60s in order to "retry every minute", otherwise the proxied command on the jump-host may fail immediately, and delay: 1 will make ansible run all the attempts as fast as it can.

I finally switched over to the ansible-pull method.


ssh -o "ConnectTimeout 60" user@host
or
echo "ConnectTimeout 60" >> /etc/ssh/ssh_config

I also faced this issue. In my case there was a ~/.ssh/config file in which I had declared something like this:

Host *.domain.tld
  ForwardAgent yes
  User user1
  IdentityFile /home/me/id_rsa

After removing the file's content, everything started to work.

In one of our projects we use ad-hoc commands. It took us some time to figure out that it was an Ansible bug. The main reason is that the setup module works correctly when we execute:

ansible -m setup $IP

but when we execute it programmatically:

/bin/sh -c 'ansible -m setup $IP'

It hangs forever on some machines.

It is a big deal for us because we use the setup module to retrieve info about the system and then use it to install the correct packages and configure the machine.

We would avoid putting in a "hard" timeout, and instead try to programmatically solve some of the most common issues which prevent setup from working.

Since this is a well-known and "aged" issue, I think it would be useful to write a document about it. Is there such a document? Are the maintainers willing to write it? If the answers are both no, then I will write it myself and ask you all for contributions.

I am facing a similar problem. I am running a raw command against Cisco and Juniper routers. It hangs for some of the devices only; it works for 90% of them. Can someone help? Attaching the debug output after running the command.
This is how I am running the command:
ansible all -i "$DEVICES," -m raw -a "$COMMAND" -c paramiko --extra-vars "ansible_user=$ans_user ansible_password=$ans_password"

show_hang.txt

I'm doing something like this:

- name: update system
  yum:
    name: '*'
    state: latest # noqa 403
  async: 600
  poll: 0   # fire and forget; without this the task blocks instead of backgrounding
  register: yum_sleeper

- name: 'YUM - check on async task'
  async_status:
    jid: "{{ yum_sleeper.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 30

Thanks @tuxick. Looks like you are doing retries.
But in my case it just hangs; I can keep retrying with the same result.
I got around it by adding timeout 45 in front of the ansible command, but that doesn't solve the underlying problem:
timeout 45 ansible all -i "$DEVICES," -m raw -a "$COMMAND" -c paramiko --extra-vars "ansible_user=$ans_user ansible_password=$ans_password"

Why is it hanging? It hangs at:
<10.121.0.3> SSH: EXEC sshpass -d8 ssh -vvv -C -o ControlPersist=120s -o StrictHostKeyChecking=no -o 'User="user"' -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/5992772463 -tt 'show interfaces descriptions'

Can someone help by looking at the logs, or have you had a similar experience? Thanks a lot.

I had the same problem; sftp kept stalling.
Probably, most of these cases come down to MTU misconfiguration.
It isn't rare for Ansible to run on some sort of overlay network such as a VLAN or VXLAN. In the latter case, it is necessary to configure the VTEP MTU value properly so that frames can be fragmented.
I hit this error on a VXLAN whose VTEP MTU was 1500, the same as the underlay network's MTU of 1500. So I configured an MTU of 1450 on the VTEP (50 bytes are needed for the VXLAN encapsulation).

In my setup the VTEP interface is eth100@eth100p, so:

ip link set mtu 1450 eth100
ip link set mtu 1450 eth100p

This resolved my problem. I hope it helps.
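The 50-byte figure above can be checked against the VXLAN header stack. A quick sketch of the arithmetic (standard header sizes, not measured values):

```shell
# VXLAN overhead: outer Ethernet (14) + outer IPv4 (20) + UDP (8) + VXLAN header (8)
overhead=$((14 + 20 + 8 + 8))
echo "overhead: $overhead bytes"
echo "VTEP MTU for a 1500-byte underlay: $((1500 - overhead))"
```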

Hitting this now with an inventory of ~200 hosts. This is the second time I've hit it, and it's a very painful experience. Stalling the entire playbook because of one host is pretty bad. We should be able to declare some maximum time. Every host in my environment takes <1s to gather facts, so a 30s timeout would fix this for me.

In this case I think one host is experiencing this: SSH responds, but login then hangs due to a missing /home on that host.
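There is no true per-task wall-clock limit in core Ansible, but two config knobs get close for the fact-gathering case. A sketch (the values are illustrative, and note that neither catches a shell that hangs after login succeeds):

```ini
# ansible.cfg
[defaults]
# connection timeout handed to the connection plugin (e.g. ssh ConnectTimeout)
timeout = 30
# timeout for the fact-gathering (setup) modules
gather_timeout = 30
```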

@halsafar: Yes, I always thought Ansible runs the playbook for each host independently and in parallel. The playbook run on one host shouldn't disturb that of another.

Where it is not necessary, do NOT gather facts; otherwise gather just what you need.

So in playbooks you have something like:

- hosts: hostgroup
  gather_facts: no
  tasks:
  - name: Collect only network facts 
    setup:
      gather_subset:
        - '!all'
        - network

and you gather just what you _really_ need, with the added benefit of speeding up the whole process.

In ad-hoc command you have something like:

ansible <PATTERN> -m setup -a "gather_subset=!all,network"

You can also set this in the config file.

References

Note

Please double-check the YAML syntax; I have no time to validate it right now. Let me know and I will update it.


It does help, but some modules rely on the filesystem layer of the system, and if you have an NFS mount hanging, they will get stuck.

We experience this bug on a daily basis on a pool of about 40K hosts. It comes and goes based on the hosts' state. So far the only real workaround is to shard the ansible jobs into smaller batches in our orchestration (we're using http://screwdriver.cd). This minimizes the pain, but it is just that: a workaround.

Our playbook is for auditing; it is very minimal and optimized for speed, with no fact gathering, etc.:

  • copy a file to hosts
  • execute that file
  • copy results from the host

There is a fundamental issue in that specified timeout values are not respected. For example, a program on the client may be launched but then sit sleeping because it has become deadlocked. Or ssh may not work as expected: the ssh connection (apparently) succeeds, but the remote shell never finishes opening due to a host-level issue. This last condition can be observed even by hand, where the ssh client does not respect its own timeout either when one tries to ssh to the bad client.

This gets to be a pretty big blocker as the scale of what you're working on increases. My team is under pressure to make this work, which means I need to get some hack working or switch to other job-control software. It's disheartening that this bug has been open so long.

@jgurtz
At that scale, you have already answered for yourself what the solution will be.
But also, go back through the last year or so of this ticket's comments: there HAVE been one or two people over the last two years with very useful insights. If I were in your shoes, I'd look for those as an interim solution. The root issue, though, will probably not go away.

Thanks for the response. I have referenced this thread several times over the last year and AFAICT there are mitigations but no workarounds for this issue. If I have missed something groundbreaking, please mention it.

I see some folks in this thread experiencing this issue with orders of magnitude fewer hosts, and apparently with enterprise switches/routers too. I sure wish I had known of this bug before I started.

In almost all my cases, -vvvv showed enough to pin down the ssh issues relating to authentication, passwords, keys, etc. I wouldn't claim this fits others, but for me it was true. It would be good if ansible provided more detail on authentication issues without having to rerun with -vvvv, though...

Fixing individual connection issues is not the way to go. Once you get to
1000s of systems there will always be some problematic ones. Back in
'14 I simply had 2 preconditioning playbooks that did connectivity
tests, and only the survivors were then used for the real inventory. But
that also means that my inventory was not dynamic. Not much point :-)

But, what I wanted to point at is...:

For me the most interesting post in the thread was the one from Fabio. If I
understand it well, it gives deterministic timeout behaviour:

The following will actually run for the time we specify, 60s, because we are
overwriting both ConnectTimeout and ConnectionAttempts (I'm not sure whether
ControlMaster=no is really required, but I was basically reimplementing
wait_for_connection and did not want to clash with the persistent jump-host
ControlMaster):
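A sketch of such an override, reconstructed only from the description above (the 60s figure comes from the comment; placing the options in `ansible_ssh_common_args` under `vars` is an assumption):

```yaml
- name: Wait for the host with a deterministic ~60s ceiling
  wait_for_connection:
    timeout: 60
  vars:
    ansible_ssh_common_args: >-
      -o ConnectTimeout=60
      -o ConnectionAttempts=1
      -o ControlMaster=no
```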


This doesn't fix why things hang, but it can provide a way to keep those hosts from stopping a play run: #69284

If it hangs on sshpass (and ControlPersist parameter is used) please check https://sourceforge.net/p/sshpass/patches/11/

+1, we are facing exactly the same issue whenever dealing with larger inventories; absolutely no clue why Ansible hangs or which host it is stuck on... can we please at least have an option that shows some details of the problematic host whenever it hangs...

@Sreenadh8 running with -vvv will show you the hosts it attempts to connect to. It's not great, but you can contrast that with the 'results' output to find out which hosts it is stuck on.

I was hitting problems along these lines with only a handful of target hosts, and managed to identify two plausible root causes:

  • concurrent Ansible invocations fighting over the ControlPersist files
  • Ansible attempting to reuse the existing SSH connection after the remote host had been restarted and then not timing out properly

The main fix seemed to be having the potentially concurrent invocations define distinct locations for ControlPath, but we also started explicitly deleting the persistence files when we had good reason to believe they'd no longer be valid.

Unfortunately, debugging this kind of issue is super painful, as there are a lot of potential underlying SSH issues that manifest as "Ansible hangs on an SSH call without timing out" :(
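For the "concurrent invocations fighting over ControlPersist files" case, each run can be pointed at its own socket directory via the environment. A sketch (the directory naming is arbitrary; `ANSIBLE_SSH_CONTROL_PATH_DIR` is the environment variable behind the ssh plugin's `control_path_dir` setting):

```shell
# give this invocation a private directory for ControlPersist sockets,
# keyed on the shell PID so concurrent runs never share sockets
export ANSIBLE_SSH_CONTROL_PATH_DIR="/tmp/ansible-cp-$$"
mkdir -p "$ANSIBLE_SSH_CONTROL_PATH_DIR"
echo "control sockets will live in $ANSIBLE_SSH_CONTROL_PATH_DIR"
# ansible-playbook site.yml   # would now use the private socket dir
```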

@ncoghlan meta: reset_connection should handle removing stale ControlPersist files; I recommend you use it after the reboot action.
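A minimal sketch of that suggestion (task names and placement are illustrative):

```yaml
- name: Restart the remote host
  reboot:

- name: Drop any stale ControlPersist socket before the next task
  meta: reset_connection
```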

It's not only ssh issues; many times ssh is fine, but sshpass or the process on the other end hangs, takes too long, or blocks. We continually try to chip away at the possible ways it can hang, but this will probably never be 100% solved... we're just trying to get as close to that as possible.

Have you checked the patch I mentioned a few posts ago? We hit the same (or a similar) issue in our product, and it turned out to be an sshpass bug, a kind of race condition. The probability of hitting it seemed to be correlated with the load on the host where ansible was running. In our case we were deploying to many hosts with ansible, which caused a bigger load and a higher probability of encountering the problem. But I can imagine hitting it with a smaller number of hosts if the ansible host had a higher CPU load for some other reason.
Please check https://sourceforge.net/p/sshpass/patches/11/ where we put a more detailed explanation of the issue, then build and try the fixed version of sshpass.

The cause can be related to apt problems on IPv6 networks; disabling IPv6 or preferring IPv4 worked for me. See forum discussions about that.
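For the apt case specifically, one common way to prefer IPv4 (the file name is arbitrary; `Acquire::ForceIPv4` is a standard apt option):

```
# /etc/apt/apt.conf.d/99force-ipv4
Acquire::ForceIPv4 "true";
```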

Hi,

I am Reshma. I have a somewhat similar issue: my ansible module run just gets stuck, gives no failed/skipped status, and does nothing on the targeted host. If anyone could help with this, it would be great. If this is not the correct forum, please point me to the forum where I can seek help.

Details of the system:
Target host: SUSE 12 SP5
python-xml-2.7.17-28.51.1.x86_64
rpm-4.11.2-16.21.1.x86_64
zypper 1.13.51

Ansible Command:
ANSIBLE_DEBUG=1 ansible -i /etc/ansible/hosts webservers -k -m zypper -a 'name=git-core state=present' -v

Debug output:
3676 1600772716.28160: _low_level_execute_command(): executing: /bin/sh -c 'echo PLATFORM; uname; echo FOUND; command -v '"'"'/usr/bin/python'"'"'; command -v '"'"'python3.7'"'"'; command -v '"'"'python3.6'"'"'; command -v '"'"'python3.5'"'"'; command -v '"'"'python2.7'"'"'; command -v '"'"'python2.6'"'"'; command -v '"'"'/usr/libexec/platform-python'"'"'; command -v '"'"'/usr/bin/python3'"'"'; command -v '"'"'python'"'"'; echo ENDFOUND && sleep 0'
3676 1600772716.37768: stdout chunk (state=2):

PLATFORM
<<<

3676 1600772716.37947: stdout chunk (state=3):

Linux
FOUND
/usr/bin/python
<<<

3676 1600772716.37989: stdout chunk (state=3):

/usr/bin/python3.6
/usr/bin/python2.7
/usr/bin/python3
/usr/bin/python
ENDFOUND
<<<

3676 1600772716.38378: stderr chunk (state=3):

<<<

3676 1600772716.38424: stdout chunk (state=3):

<<<

3676 1600772716.38484: _low_level_execute_command() done: rc=0, stdout=PLATFORM
Linux
FOUND
/usr/bin/python
/usr/bin/python3.6
/usr/bin/python2.7
/usr/bin/python3
/usr/bin/python
ENDFOUND
, stderr=
3676 1600772716.38530 [10.237.222.108]: found interpreters: [u'/usr/bin/python', u'/usr/bin/python3.6', u'/usr/bin/python2.7', u'/usr/bin/python3', u'/usr/bin/python']
3676 1600772716.38609: _low_level_execute_command(): starting
3676 1600772716.38650: _low_level_execute_command(): executing: /bin/sh -c '/usr/bin/python && sleep 0'
3676 1600772716.39569: Sending initial data
3676 1600772716.39658: Sent initial data (1234 bytes)
3676 1600772716.51494: stdout chunk (state=3):

{"osrelease_content": "NAME=\"SLES\"\nVERSION=\"12-SP5\"\nVERSION_ID=\"12.5\"\nPRETTY_NAME=\"SUSE Linux Enterprise Server 12 SP5\"\nID=\"sles\"\nANSI_COLOR=\"0;32\"\nCPE_NAME=\"cpe:/o:suse:sles:12:sp5\"\n", "platform_dist_result": ["SuSE", "12", "x86_64"]}
<<<

3676 1600772716.51914: stderr chunk (state=3):

<<<

3676 1600772716.51960: stdout chunk (state=3):

<<<

3676 1600772716.52014: _low_level_execute_command() done: rc=0, stdout={"osrelease_content": "NAME=\"SLES\"\nVERSION=\"12-SP5\"\nVERSION_ID=\"12.5\"\nPRETTY_NAME=\"SUSE Linux Enterprise Server 12 SP5\"\nID=\"sles\"\nANSI_COLOR=\"0;32\"\nCPE_NAME=\"cpe:/o:suse:sles:12:sp5\"\n", "platform_dist_result": ["SuSE", "12", "x86_64"]}
, stderr=
3676 1600772716.52196: ANSIBALLZ: using cached module: /root/.ansible/tmp/ansible-local-3668A5bmKd/ansiballz_cache/zypper-ZIP_DEFLATED
3676 1600772716.52622: transferring module to remote /root/.ansible/tmp/ansible-tmp-1600772715.9-203940692219450/AnsiballZ_zypper.py
3676 1600772716.53785: Sending initial data
3676 1600772716.53856: Sent initial data (138 bytes)
3676 1600772716.63011: stdout chunk (state=3):

sftp> put /root/.ansible/tmp/ansible-local-3668A5bmKd/tmpduCRmY /root/.ansible/tmp/ansible-tmp-1600772715.9-203940692219450/AnsiballZ_zypper.py
<<<

3676 1600772716.65358: stderr chunk (state=3):

<<<

3676 1600772716.65404: stdout chunk (state=3):

<<<

3676 1600772716.65466: done transferring module to remote
3676 1600772716.65532: _low_level_execute_command(): starting
3676 1600772716.65568: _low_level_execute_command(): executing: /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1600772715.9-203940692219450/ /root/.ansible/tmp/ansible-tmp-1600772715.9-203940692219450/AnsiballZ_zypper.py && sleep 0'
3676 1600772716.75915: stderr chunk (state=2):

<<<

3676 1600772716.76007: stdout chunk (state=2):

<<<

3676 1600772716.76066: _low_level_execute_command() done: rc=0, stdout=, stderr=
3676 1600772716.76099: _low_level_execute_command(): starting
3676 1600772716.76141: _low_level_execute_command(): executing: /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1600772715.9-203940692219450/AnsiballZ_zypper.py && sleep 0'

I've found that just killing the stuck process on the remote via ssh gives back the stdout. I find that's quicker than writing workarounds that won't be used in the final playbook anyway:

kill -9 PID

try it

This is an issue with the proxy settings on the SUSE target machine, so add the proxy on the target machine as described in this link: https://www.suse.com/support/kb/doc/?id=000017441

Because of the proxy issue, the zypper module is not able to install.

