How can I pass multiple sub-commands to xargs when

Posted 2019-08-06 04:10

Question:

I'm trying to write a bash script that processes a large directory tree and rsyncs it in multiple parallel streams. From other research on this site I constructed the following. The script is meant to be run as:

program.sh /input/location /output/location $threads

The key line in my script is

cd $1; find . -depth \( -type d -printf \""%p/\"\n" \) | xargs -n1 -P$3 -I% rsync -lptgoDds --delete --backup --backup-dir=$INCREMENTALS/$DATE/$1 % $2/%

The idea of the above is to find all the directories at a certain point and then pass them to parallel instances of an rsync command to copy the data from $1 to $2.

The issue I am having is that rsync may have to create nested folders that don't yet exist, depending on the order in which the inputs are processed (at least I think that's the cause of the errors I'm seeing). To get around this I thought I could issue two commands from the xargs call: the first would make the directory, the second would start the rsync.

Something like this:

cd $1; find . -depth \( -type d -printf \""%p/\"\n" \) | xargs -n1 -P$3 -I% 'mkdir -p %;rsync -lptgoDdsv --delete % $2/%;'

But that doesn't seem to work either.

So having played with this for some time now, I'm reaching out for help. :-)

Answer 1:

You cannot pass a compound command in single quotes to xargs like that, because xargs treats the whole quoted string as the name of a single program to run. But you can pass it to sh -c, and have xargs run that.

cd $1; find . -depth \( -type d -printf \""%p/\"\n" \) | xargs -n1 -P$3 -I% sh -c 'mkdir -p %;rsync -lptgoDdsv --delete % $2/%;'

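(I have no way to test this, so it may still need some adaptation. In particular, $2 sits inside single quotes, so the inner sh will not see the script's second argument unless it is exported or passed along explicitly.)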
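One way to get the values into the inner shell is to hand them over as positional parameters instead of substituting them into the command string. The following is only a minimal sketch under assumptions: the temporary directories and folder names are made up for the demo, and the rsync call is replaced by an echo so that it runs on a machine without rsync. It also avoids -I, so directory names containing spaces or quotes never become part of the shell code.

```shell
#!/bin/bash
# Sketch: pass the destination and each directory to sh -c as positional
# parameters rather than splicing them into the command text.
src=$(mktemp -d)     # hypothetical source tree for the demo
dest=$(mktemp -d)    # hypothetical destination
mkdir -p "$src/a/b c" "$src/d"

cd "$src" || exit 1
# xargs appends one directory per invocation, so inside the quoted script
# $1 is the destination and $2 is the directory; the word "sh" only fills $0.
# The echo stands in for the real rsync command.
find . -type d -print0 |
  xargs -0 -n1 -P2 sh -c 'mkdir -p "$1/$2" && echo "would sync $2/ to $1/$2"' sh "$dest"
```

Because the jobs run in parallel, the order of the echoed lines is not fixed. In a real script the echo would be swapped for the rsync invocation, with "$2"/ as the source and "$1/$2" as the target.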



Answer 2:

The rsync example in the GNU Parallel manual http://www.gnu.org/software/parallel/man.html#example__parallelizing_rsync is pretty close to what you want:

cd src-dir; find . -type f -size +100000 | parallel -v ssh fooserver mkdir -p /dest-dir/{//}\;rsync -Havessh {} fooserver:/dest-dir/{}

Adapted to your script, this ought to work:

cd $1; find . -depth -type d | parallel -P$3 mkdir -p $INCREMENTALS/$DATE/$1 $2/{}\; rsync -lptgoDds --delete --backup --backup-dir=$INCREMENTALS/$DATE/$1 {} $2/{}

If GNU Parallel is not packaged for your system, this should install it in 10 seconds:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

To learn more: Watch the intro video for a quick introduction: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial (man parallel_tutorial). Your command line will love you for it.



Answer 3:

Just to come back and re-post what I think is the answer. I had to use a shell invocation to do what I needed, and after a lot of trial and error it turned out that passing the values down to the sub-shell is pretty simple: once the variables are exported, they are available to the sub-shells, and it works like a charm. Here's my current script.

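#!/bin/bash
set -x   # trace each command as it runs

# Exported so that the sh -c sub-shells started by xargs can see them.
export INCREMENTALS="/var/backup/data"
export DATE=$(date +%F)
export SRCDIR=$1
export TARGETDIR=$2
export THREADS=$3

# Stop if the source directory cannot be entered, rather than syncing from the wrong place.
cd "$SRCDIR" || exit 1
find . -type d -print0 | xargs -0 -n1 -P"$THREADS" -I {} sh -c 'echo "$TARGETDIR"/"{}"; mkdir -p "$TARGETDIR"/"{}"; rsync -lptgoDdXvz --delete --backup --backup-dir="$INCREMENTALS/$DATE/.$SRCDIR" "{}"/ "$TARGETDIR"/"{}"'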

To run the script you use this sequence:

rsync.sh /from/dir /to/dir 20

The first two parameters are the source and target directories; the "20" is the number of parallel rsync processes you want to run.

So this way you can push many parallel rsyncs, to the point of exhausting the machine. The only gotcha I've found is that if some directories hold many thousands of files, the parallelism falls apart: all the other jobs finish and you are stuck waiting on the longest one. I'm trying to work out more of a spray approach for round II.
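One possible direction for that, not necessarily the approach that will end up in round II, is to feed xargs batches of files instead of whole directories, so a single huge directory gets spread over several workers. The sketch below only demonstrates the batching: the demo tree, the file names and the batch size of 3 are assumptions, and an echo stands in for the copy command. Per-directory options such as --delete would not carry over unchanged to a file-level copy.

```shell
#!/bin/bash
# Sketch: batch by file rather than by directory.
src=$(mktemp -d)                   # hypothetical demo tree
mkdir -p "$src/big" "$src/small"
for i in 1 2 3 4 5 6 7; do : > "$src/big/f$i"; done
: > "$src/small/g1"

cd "$src" || exit 1
# -n3 caps every batch at 3 files; inside sh -c the batch arrives as "$@".
# A real run would replace the echo with a file-level copy command.
batches=$(find . -type f -print0 |
  xargs -0 -n3 -P4 sh -c 'echo "batch of $# files: $*"' sh | wc -l)
echo "number of batches: $((batches))"
```

With eight files and a batch size of 3 this reports three batches, and the seven files in big/ are split across them instead of being tied to one worker.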

My only other issue now is that memory consumption goes up over time. I have a funny feeling there's a leak that's not related to my script, but I'm concerned I may have some unbounded element in it that is causing constantly increasing memory use. Still, that's another problem to solve, unrelated to this one.

The net-net answer was to 'export' the variables; then the sub-shells see their content correctly and it works really well.
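The effect of export can be checked in isolation. This is a small sketch; DEMO_TARGET is a made-up variable name and /to/dir is just a placeholder value.

```shell
#!/bin/bash
# Sketch: a plain shell variable is not passed to child processes,
# an exported one is.
unset DEMO_TARGET                  # hypothetical name, cleared for the demo
DEMO_TARGET=/to/dir                # set, but not exported yet
before=$(sh -c 'printf "%s" "$DEMO_TARGET"')
export DEMO_TARGET
after=$(sh -c 'printf "%s" "$DEMO_TARGET"')
printf 'before export: [%s]\nafter export: [%s]\n' "$before" "$after"
```

The first line prints empty brackets and the second prints [/to/dir], which is why the single-quoted sh -c command in the script can use $TARGETDIR and friends once they are exported.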



Tags: bash find xargs