I have a series of numbered files, each to be processed separately by a different server. Each file was made using Linux split and then compressed with xz to save transfer time.
split_001 split_002 split_003 ... split_030
How can I push these files out to a group of 30 servers with ansible? It does not matter which server gets which file so long as they each have a single unique file.
I have been using a bash script, but I am looking for a better solution, hopefully using Ansible. After the copy, I plan to run a shell command that uses at to start the computation, which takes several hours or days.
scp -oStrictHostKeyChecking=no bt_5869_001.xz usr13@<ip>:/data/
scp -oStrictHostKeyChecking=no bt_5869_002.xz usr13@<ip>:/data/
scp -oStrictHostKeyChecking=no bt_5869_003.xz usr13@<ip>:/data/
...
http://docs.ansible.com/ansible/copy_module.html
# copy file but iterate through each of the split files
- copy: src=/mine/split_001.xz dest=/data/split_001.xz
- copy: src=/mine/compute dest=/data/ owner=root mode=0755
- copy: src=/mine/start.sh dest=/data/ owner=root mode=0755
- shell: xz -d /data/*.xz
- shell: at -f /data/start.sh now
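For example, you can derive a zero-padded index from each host's position in play_hosts and use it to pick the file: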
tasks:
  # play_hosts.index() is 0-based, but the files are numbered from 001, so add 1
  - set_fact:
      padded_host_index: "{{ '{0:03d}'.format(play_hosts.index(inventory_hostname) + 1) }}"
  - copy: src=/mine/split_{{ padded_host_index }}.xz dest=/data/
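The host-to-file mapping can be checked outside Ansible, since the template uses Python's own str.format. The following is a minimal sketch in plain Python, not part of the playbook. The function name file_for_host and the host addresses are hypothetical, chosen only for illustration. List indexes start at 0 while the files start at 001, so the sketch adds 1 to the index.

```python
def file_for_host(play_hosts, inventory_hostname):
    """Return the split file a host would receive, based on its position in the play."""
    # list.index() is 0-based; the files are numbered from 001, so shift by one.
    index = play_hosts.index(inventory_hostname) + 1
    return "split_{0:03d}.xz".format(index)


# Hypothetical inventory of 30 hosts, used only to exercise the mapping.
hosts = ["10.0.0.%d" % n for n in range(1, 31)]
mapping = {host: file_for_host(hosts, host) for host in hosts}

for host in hosts[:3]:
    print(host, "->", mapping[host])
```

Each host gets exactly one file as long as the play targets the same 30 hosts every time; a host dropped from the play shifts the index of every host after it.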
You can do this with Ansible. However, this seems like the wrong general approach to me.
You have a number of jobs. You need them each to be processed, and you don't care which server processes which job as long as they only process each job once (and ideally do the whole batch as efficiently as possible). This is precisely the situation a distributed queueing system is designed to work in.
You'll have workers running on each server and one master node (which may run on one of the servers) that knows about all of the workers. When you need to add tasks, you queue them up with the master, and the master distributes them to workers as they become available, so you don't have to worry about having the same number of servers as jobs.
There are many, many options for this, including beanstalkd, Celery, Gearman, and SQS. You'll have to do the legwork to find out which one works best for your situation. But this is definitely the architecture best suited to your problem.
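The master/worker idea can be sketched in a single process with Python's standard library. This is only an illustration of the pattern, not how beanstalkd, Celery, Gearman, or SQS are actually used: threads stand in for servers, queue.Queue stands in for the master, and the names run_batch and server-a, server-b, server-c are hypothetical.

```python
import queue
import threading


def run_batch(jobs, worker_names):
    """Hand each job to whichever worker is free; return a list of (job, worker)."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    done = []
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                job = pending.get_nowait()  # each job is handed out exactly once
            except queue.Empty:
                return  # nothing left to do
            # Placeholder for the real computation on this job.
            with lock:
                done.append((job, name))

    threads = [threading.Thread(target=worker, args=(n,)) for n in worker_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


jobs = ["split_%03d.xz" % i for i in range(1, 31)]
# Three workers for thirty jobs: the counts do not need to match.
result = run_batch(jobs, ["server-a", "server-b", "server-c"])
print(len(result), "jobs processed")
```

The point of the sketch is that workers pull work when they are free, so a fast server simply takes more jobs and nobody has to assign files to servers in advance.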