Shuffle multiple files in same order

Posted 2019-02-27 07:30

Question:

Setup:

I have 50 files, each with 25000 lines.

To-do:

I need to shuffle all of them "in the same order". E.g.:

If before shuffle:

File 1  File 2  File 3
A       A       A
B       B       B
C       C       C

then after shuffle I should get:

File 1  File 2  File 3
B       B       B
C       C       C
A       A       A

i.e. corresponding rows in all the files should be shuffled in the same order.

Also, the shuffle should be deterministic, i.e. if I give File A as input, it should always produce the same shuffled output.

I can write a Java program to do it, and probably a script too. Something like: shuffle the numbers 1 to 25000 and store them in a file, say shuffle_order; then process one file at a time, reordering its rows according to shuffle_order. But is there a better/quicker way to do this?

Please let me know if more info is needed.

Answer 1:

The following uses only basic bash commands. The principle is:

  • generate a random order (numbers)
  • order all files in this order

The code:

#!/bin/bash
case "$#" in
    0) echo "Usage: $0 files....." ; exit 1;;
esac

ORDER="./.rand.$$"
trap "rm -f $ORDER;exit" 1 2
count=$(grep -c '^' "$1")   # number of lines in the first file

odcount=$((count * 4))      # read 4 random bytes per line
# pair a random u4 number with each line number, sort by the random key,
# keep only the line numbers: a random permutation of 1..count
paste -d" " <(od -A n -N $odcount -t u4 /dev/urandom | grep -o '[0-9]*') <(seq -w $count) |\
    sort -k1n | cut -d " " -f2 > $ORDER

#if your system has the "shuf" command you can replace the above 3 lines with a simple
#seq -w $count | shuf > $ORDER

for file in "$@"
do
    # decorate each line with its random rank, sort, then strip the rank
    paste -d' ' $ORDER $file | sort -k1n | cut -d' ' -f2-  > "$file.rand"
done

echo "the order is in the file $ORDER"  # remove this line
#rm -f $ORDER                           # and uncomment this line
                                        # if you don't need to preserve the order

paste -d "  " *.rand   #remove this line - it is only for showing test result

from the input files:

A  B  C
--------
a1 a2 a3
b1 b2 b3
c1 c2 c3
d1 d2 d3
e1 e2 e3
f1 f2 f3
g1 g2 g3
h1 h2 h3
i1 i2 i3
j1 j2 j3

will produce A.rand, B.rand and C.rand with content like the following (one possible order):

g1 g2 g3
e1 e2 e3
b1 b2 b3
c1 c2 c3
f1 f2 f3
j1 j2 j3
d1 d2 d3
h1 h2 h3
i1 i2 i3
a1 a2 a3

Real test - generating 50 files with 25k lines:

line="Consequatur qui et qui. Mollitia expedita aut excepturi modi. Enim nihil et laboriosam sit a tenetur."
for n in $(seq -w 50)
do
    seq -f "$line %g" 25000 >file.$n
done

Running the script:

bash sorter.sh file.??

Result on my notebook:

real     1m13.404s
user     0m56.127s
sys      0m5.143s
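One caveat: /dev/urandom (and plain shuf) yield a different order on every run, while the question asks for a deterministic shuffle. A sketch of a drop-in replacement for the order-generating lines of the script above, using awk's seeded rand() — the seed 42 is arbitrary, and the exact sequence depends on the awk implementation, so determinism holds per machine:

```shell
# Deterministic order: replace the od/urandom pipeline with a seeded awk.
# Same seed + same awk implementation => same permutation on every run.
count=$(grep -c '^' "$1")
awk -v seed=42 -v n="$count" \
    'BEGIN { srand(seed); for (i = 1; i <= n; i++) print rand() "\t" i }' |
    sort -k1,1g | cut -f2 > $ORDER
```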


Answer 2:

Probably very inefficient, but try the script below:

#!/bin/bash

# one shared random order for every file
# (hardcoded to the known line count of 25000)
arr=( $(seq 25000 | shuf) )

for file in files*; do
    index=0
    # decorate each line with its random rank, sort numerically, strip the rank
    new=$(while IFS= read -r line; do
        echo "${arr[$index]} $line"
        (( index++ ))
    done < "$file" | sort -n | sed 's/^[0-9]\+ //')
    echo "$new" > "$file"
done