Setup:
I have 50 files, each with 25000 lines.
To-do:
I need to shuffle all of them "in the same order".
E.g.:
If before shuffle:
File 1 File 2 File 3
A A A
B B B
C C C
then after shuffle I should get:
File 1 File 2 File 3
B B B
C C C
A A A
i.e. corresponding rows in the files should be shuffled in the same order.
Also, the shuffle should be deterministic, i.e. if I give File A as input, it should always produce the same shuffled output.
I can write a Java program to do it, or probably a script too. Something like: shuffle the numbers between 1 and 25000 and store the result in a file, say shuffle_order. Then simply process one file at a time and order the existing rows according to shuffle_order. But is there a better/quicker way to do this?
Please let me know if more info is needed.
The following uses only basic bash commands. The principle is:
- generate a random order (numbers)
- order all files in this order
The code:
#!/bin/bash
case "$#" in
    0) echo "Usage: $0 files....." ; exit 1;;
esac

ORDER="./.rand.$$"
trap 'rm -f "$ORDER"; exit' 1 2

# number of lines, taken from the first file (all files must have the same count)
count=$(grep -c '^' "$1")
# 4 random bytes (one unsigned 32-bit number) per line
odcount=$((count * 4))
paste -d" " <(od -A n -N "$odcount" -t u4 /dev/urandom | grep -o '[0-9][0-9]*') <(seq -w "$count") |\
    sort -k1,1n | cut -d " " -f2 > "$ORDER"
#if your system has the "shuf" command you can replace the above 3 lines with a simple
#seq -w $count | shuf > $ORDER

# prefix every line with its key, sort on the key, then strip the key again
for file in "$@"
do
    paste -d' ' "$ORDER" "$file" | sort -k1,1n | cut -d' ' -f2- > "$file.rand"
done

echo "the order is in the file $ORDER" # remove this line
#rm -f "$ORDER" # and uncomment this
# if you don't need to preserve the order
paste -d " " *.rand #remove this line - it is only for showing test result
from the input files:
A B C
--------
a1 a2 a3
b1 b2 b3
c1 c2 c3
d1 d2 d3
e1 e2 e3
f1 f2 f3
g1 g2 g3
h1 h2 h3
i1 i2 i3
j1 j2 j3
the script will create A.rand, B.rand and C.rand
with, for example, the following content:
g1 g2 g3
e1 e2 e3
b1 b2 b3
c1 c2 c3
f1 f2 f3
j1 j2 j3
d1 d2 d3
h1 h2 h3
i1 i2 i3
a1 a2 a3
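The question also asks for a deterministic shuffle, whereas the script above draws a fresh order from /dev/urandom on every run, so the result is only repeatable while the saved order file is kept. A minimal sketch of one way to get a repeatable order, assuming an awk whose rand() is reproducible for a given srand() seed (this holds within one awk implementation, not necessarily across different ones). The function name make_order and the seed value are hypothetical, not part of the script above:

```shell
#!/bin/bash
# make_order N SEED: print a permutation of 1..N, one number per line.
# The same N and SEED give the same permutation (with the same awk).
make_order() {
    awk -v n="$1" -v seed="$2" 'BEGIN {
        srand(seed)
        for (i = 1; i <= n; i++) a[i] = i
        # Fisher-Yates shuffle: swap a[i] with a random a[j], 1 <= j <= i
        for (i = n; i > 1; i--) {
            j = int(rand() * i) + 1
            t = a[i]; a[i] = a[j]; a[j] = t
        }
        for (i = 1; i <= n; i++) print a[i]
    }'
}

# Hypothetical usage inside the script, instead of the od/paste/sort lines:
#   make_order "$count" 12345 > "$ORDER"
```

The numbers are not zero-padded as with seq -w, which should not matter here because the later sort compares the keys numerically.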
real testing - generating 50 files with 25k lines each
line="Consequatur qui et qui. Mollitia expedita aut excepturi modi. Enim nihil et laboriosam sit a tenetur."
for n in $(seq -w 50)
do
    seq -f "$line %g" 25000 > "file.$n"
done
running the script
bash sorter.sh file.??
result on my notebook
real 1m13.404s
user 0m56.127s
sys 0m5.143s
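The script above sorts every file once. A possible alternative, not timed here, is to place each line directly at its target position with awk, holding one file in memory at a time. This is only a sketch, assuming the order file contains a permutation of 1..N with one number per line, as the script produces. The name apply_order is hypothetical:

```shell
#!/bin/bash
# apply_order ORDER_FILE DATA_FILE: print DATA_FILE so that input line i
# ends up at the output position given on line i of ORDER_FILE.
apply_order() {
    awk '
        # first file: remember the target position of every line
        NR == FNR { pos[FNR] = $1 + 0; next }
        # second file: store each line at its target position
        { out[pos[FNR]] = $0; n = FNR }
        END { for (i = 1; i <= n; i++) print out[i] }
    ' "$1" "$2"
}

# Hypothetical usage inside the loop of the script:
#   apply_order "$ORDER" "$file" > "$file.rand"
```

This is meant to give the same result as the paste, sort and cut pipeline for the same order file, since both move line i to the position given by its key.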
This is probably very inefficient, but try the following:
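#!/bin/bash
# one shuffled key per line; the same keys are reused for every file
arr=( $(seq 25000 | shuf) )
for file in files*; do
    index=0
    # prefix each line with its key, sort numerically on the key, strip the key
    # (IFS= and -r keep leading whitespace and backslashes intact)
    new=$(while IFS= read -r line; do
        printf '%s %s\n' "${arr[$index]}" "$line"
        (( index++ ))
    done < "$file" | sort -k1,1n | sed 's/^[0-9]* //')
    # note: this overwrites the original file
    printf '%s\n' "$new" > "$file"
done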