Ansible: Abort Execution if a host is unreachable

Posted 2019-06-20 00:40

Summary: A better way to abort an Ansible playbook immediately if any host is unreachable.

Is there a way to abort an Ansible playbook if any one of the hosts is unreachable? What I find is that if Ansible cannot reach a host, it still continues and executes all the plays and tasks in the playbook.

In all my playbooks I specify a max_fail_percentage of 0, but in this case Ansible does not complain, since all the hosts that are reachable can execute all the plays.

I currently have a simple but hacky solution, and I would like to know whether there is a better answer.

My current solution:

Ansible gathers facts for all hosts as the first step of running a playbook, and it cannot gather them for a host it cannot reach. So I put a simple play at the very beginning of my playbook that uses a fact from every host. If a host is unreachable, that task fails with an "undefined variable" error. The task is just a dummy and always passes when all hosts are reachable.

My example is below:

- name: Check Ansible connectivity to all hosts
  hosts: host_all
  user: "{{ remote_user }}"
  sudo: "{{ sudo_required }}"
  sudo_user: root
  connection: ssh # or paramiko
  max_fail_percentage: 0
  tasks:
    - name: check connectivity to hosts (Dummy task)
      shell: echo " {{ hostvars[item]['ansible_hostname'] }}"
      with_items: groups['host_all']
      register: cmd_output

    - name: debug ...
      debug: var=cmd_output

If a host is unreachable, you get an error like the following:

TASK: [c.. ***************************************************** 
fatal: [172.22.191.160] => One or more undefined variables: 'dict object' has no attribute 'ansible_hostname'
fatal: [172.22.191.162] => One or more undefined variables: 'dict object' has no attribute 'ansible_hostname'

FATAL: all hosts have already failed -- aborting

5 answers
劳资没心,怎么记你
#2 · 2019-06-20 01:16

Inspired by other answers.

Using ansible-playbook 2.7.8.

Checking whether each required host has any ansible_facts feels more explicit to me.

# my-playbook.yml
- hosts: myservers
  tasks:
    - name: Check ALL hosts are reacheable before doing the release
      fail:
        msg: >
          [REQUIRED] ALL hosts to be reachable, so flagging {{ inventory_hostname }} as failed,
          because host {{ item }} has no facts, meaning it is UNREACHABLE.
      when: "hostvars[item].ansible_facts|list|length == 0"
      with_items: "{{ groups.myservers }}"

    - debug:
        msg: "Will only run if all hosts are reachable"

$ ansible-playbook -i my-inventory.yml my-playbook.yml

PLAY [myservers] *************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] *********************************************************************************************************************************************************************************************************
fatal: [my-host-03]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: Could not resolve hostname my-host-03: Name or service not known", "unreachable": true}
fatal: [my-host-04]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: Could not resolve hostname my-host-04: Name or service not known", "unreachable": true}
ok: [my-host-02]
ok: [my-host-01]

TASK [Check ALL hosts are reacheable before doing the release] ********************************************************************************************************************************************************************************************************************
failed: [my-host-01] (item=my-host-03) => {"changed": false, "item": "my-host-03", "msg": "[REQUIRED] ALL hosts to be reachable, so flagging my-host-01 as failed, because host my-host-03 has no facts, meaning it is UNREACHABLE."}
failed: [my-host-01] (item=my-host-04) => {"changed": false, "item": "my-host-04", "msg": "[REQUIRED] ALL hosts to be reachable, so flagging my-host-01 as failed, because host my-host-04 has no facts, meaning it is UNREACHABLE."}
failed: [my-host-02] (item=my-host-03) => {"changed": false, "item": "my-host-03", "msg": "[REQUIRED] ALL hosts to be reachable, so flagging my-host-02 as failed, because host my-host-03 has no facts, meaning it is UNREACHABLE."}
failed: [my-host-02] (item=my-host-04) => {"changed": false, "item": "my-host-04", "msg": "[REQUIRED] ALL hosts to be reachable, so flagging my-host-02 as failed, because host my-host-04 has no facts, meaning it is UNREACHABLE."}
skipping: [my-host-01] => (item=my-host-01)
skipping: [my-host-01] => (item=my-host-02)
skipping: [my-host-02] => (item=my-host-01)
skipping: [my-host-02] => (item=my-host-02)
        to retry, use: --limit @./my-playbook.retry

PLAY RECAP *********************************************************************************************************************************************************************************************************************
my-host-01 : ok=1    changed=0    unreachable=0    failed=1
my-host-02 : ok=1    changed=0    unreachable=0    failed=1
my-host-03 : ok=0    changed=0    unreachable=1    failed=0
my-host-04 : ok=0    changed=0    unreachable=1    failed=0
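
The `when` condition above boils down to one rule: a host whose ansible_facts is empty never answered fact gathering. Here is a minimal Python sketch of that rule, run outside Ansible and assuming a plain dict shaped like hostvars. The function name `hosts_without_facts` and the sample data are hypothetical, made up for illustration only.

```python
def hosts_without_facts(hostvars, group_hosts):
    """Return the hosts in group_hosts whose ansible_facts is missing or empty."""
    return [
        host for host in group_hosts
        if not hostvars.get(host, {}).get("ansible_facts")
    ]


# Hypothetical sample data mirroring the run above: two hosts gathered facts,
# two were unreachable and therefore have an empty ansible_facts mapping.
sample_hostvars = {
    "my-host-01": {"ansible_facts": {"hostname": "my-host-01"}},
    "my-host-02": {"ansible_facts": {"hostname": "my-host-02"}},
    "my-host-03": {"ansible_facts": {}},
    "my-host-04": {"ansible_facts": {}},
}

unreachable = hosts_without_facts(sample_hostvars, list(sample_hostvars))
print(unreachable)
```

Every reachable host evaluates this same check against every group member, which is why each unreachable host is reported once per reachable host in the log.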
ら.Afraid
#3 · 2019-06-20 01:25

Alternatively, this looks simpler and more expressive:

- hosts: myservers
  become: true

  pre_tasks:
    - name: Check ALL hosts are reachable before doing the release
      assert:
        that:
          - ansible_play_hosts == groups.myservers
        fail_msg: One or more hosts are UNREACHABLE
        success_msg: ALL hosts are REACHABLE, go on
      run_once: yes

  roles:
    - deploy

https://github.com/ansible/ansible/issues/18782#issuecomment-319409529
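
The assertion compares two lists, so it passes only when every member of the group is still in the play. As a rough Python sketch of the same comparison, run outside Ansible, a difference between the two lists also names the hosts that dropped out. The function name `missing_hosts` and the sample values are hypothetical.

```python
def missing_hosts(play_hosts, group_hosts):
    """Return the group members absent from the active play hosts, in group order."""
    active = set(play_hosts)
    return [host for host in group_hosts if host not in active]


# Hypothetical values: four hosts in the group, two still active in the play.
group = ["my-host-01", "my-host-02", "my-host-03", "my-host-04"]
play = ["my-host-01", "my-host-02"]

all_reachable = play == group        # the same equality the assert relies on
dropped = missing_hosts(play, group)
print(all_reachable, dropped)
```

Inside a playbook, the same list could presumably be produced with the `difference` filter, for example `groups.myservers | difference(ansible_play_hosts)`, and included in fail_msg.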

在下西门庆
#4 · 2019-06-20 01:29

You could be a bit more explicit about the check:

- fail: msg="Abort if hosts are unreachable"
  when: "'ansible_hostname' not in hostvars[item]"
  with_items: "{{ groups['all'] }}"

I thought you could make a callback plugin to achieve this. Something like:

class CallbackModule(object):
    def runner_on_unreachable(self, host, res):
        raise Exception("Aborting due to unreachable host " + host)

However, I can't find a good way to abort the entire playbook from that callback: the exception doesn't work, the return value is ignored, and while you could probably abuse self.playbook to stop things, I can see no public API for it.

劳资没心,怎么记你
#5 · 2019-06-20 01:34

You can combine any_errors_fatal: true or max_fail_percentage: 0 with gather_facts: false, and then run a task that will fail if the host is offline. Something like this at the top of the playbook should do what you need:

- hosts: all
  gather_facts: false
  max_fail_percentage: 0
  tasks:
    - action: ping

A bonus is that this also works with the -l SUBSET option for limiting matching hosts.

不美不萌又怎样
#6 · 2019-06-20 01:38

I found a way to use a callback to abort the play as soon as fact gathering has completed.

Setting _play_hosts to an empty list leaves no hosts for the play to continue with.

class CallbackModule(object):

    def runner_on_unreachable(self, host, res):
        # Truncate the play_hosts to an empty set to fail quickly
        self.play._play_hosts = []

The result is something like:

PLAY [test] *******************************************************************

GATHERING FACTS ***************************************************************
fatal: [haderp] => SSH Error: ssh: Could not resolve hostname haderp: nodename nor servname provided, or not known
It is sometimes useful to re-run the command using -vvvv, which prints SSH debug output to help diagnose the issue.
ok: [derp]

TASK: [set a fact] ************************************************************
FATAL: no hosts matched or all hosts have already failed -- aborting


PLAY RECAP ********************************************************************
       to retry, use: --limit @/Users/jkeating/foo.yaml.retry

derp                       : ok=1    changed=0    unreachable=0    failed=0
haderp                     : ok=0    changed=0    unreachable=1    failed=0