How to get n random “paragraphs” (groups of ordere

Posted 2019-08-20 07:44

Question:

I have a file (originally compressed) with a known structure: the lines come in ordered groups of 4, and the first line of each group starts with the character "@". I want to randomly select n of these groups (half of them) in the most efficient way, preferably in bash or another Unix tool.

My attempt in Python is:

import random
import subprocess

path = "origin.txt.gz"
unzipped_path = "origin_unzipped.txt"
new_path = "/home/labs/amit/diklag/subset.txt"
# Decompress to a plain-text file first.
subprocess.getoutput("""gunzip -c %s > %s  """ % (path, unzipped_path))
with open(unzipped_path) as f:
  lines = f.readlines()
  # Select half of the 4-line groups.
  subset_size = round((len(lines)/4) * 0.5)
  # Indices of the first line of each selected group.
  l = random.sample(list(range(0, len(lines), 4)),subset_size)
  selected_lines = [line for i in l for line in list(range(i,i+4))]
  new_lines = [lines[i] for i in selected_lines]
  with open(new_path,'w+') as f2:
    f2.writelines(new_lines)

Can you help me find another (and faster) way to do it? Right now it takes ~10 seconds to run this code.

Answer 1:

The following scripts might be helpful. They are, however, untested, as we do not have an example file:

attempt 1 (awk and shuf) :

#!/usr/bin/env bash
count=30
path="origin.txt.gz"
new_path="subset.txt"
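# Assumes that only the first line of each group starts with "@".
# Count the groups, then print the groups whose index was drawn by shuf.
nrec=$(gunzip -c "$path" | awk '/^@/{c++} END{print c}')
awk '(NR==FNR){a[$1]=1;next}
     !/^@/{next}
     ((++c) in a) { print; for(i=1;i<4;i++) { getline; print } }' \
   <(shuf -i 1-"$nrec" -n "$count") <(gunzip -c "$path") > "$new_path"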

attempt 2 (sed and shuf) :

#!/usr/bin/env bash
count=30
path="origin.txt.gz"
new_path="subset.txt"
gunzip -c "$path" | sed ':a;N;$!ba;s/\n/__END_LINE__/g;s/__END_LINE__@/\n@/g' \
   | shuf -n "$count" | sed 's/__END_LINE__/\n/g' > "$new_path"

In this example, the first sed command reads the whole input and replaces every newline with the string __END_LINE__, except where the newline is followed by @. Each group therefore ends up on a single line. The shuf command then picks $count random lines (that is, groups) from that list. Finally, the second sed command turns every __END_LINE__ back into a newline.
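For comparison with the question's script, the same record-level idea can be sketched in Python: treat each group as one unit, sample the units, and write them back out. This is only a minimal sketch, assuming every group is exactly four lines long; the function names `read_records` and `sample_records` and the `seed` parameter are hypothetical and not part of the original post. It reads the gzip file directly, so no intermediate unzipped file is needed, but it still holds all records in memory.

```python
import gzip
import random
from itertools import islice


def read_records(path, lines_per_record=4):
    """Yield each group of `lines_per_record` lines as one string."""
    with gzip.open(path, "rt") as handle:
        while True:
            record = list(islice(handle, lines_per_record))
            if not record:
                break
            yield "".join(record)


def sample_records(path, out_path, fraction=0.5, seed=None):
    """Write a random `fraction` of the records in `path` to `out_path`.

    Returns the number of records written.
    """
    records = list(read_records(path))
    rng = random.Random(seed)
    chosen = rng.sample(range(len(records)), round(len(records) * fraction))
    with open(out_path, "w") as out:
        # Sort the chosen indices so the output keeps the input order.
        for index in sorted(chosen):
            out.write(records[index])
    return len(chosen)
```

A call such as `sample_records("origin.txt.gz", "subset.txt")` would write half of the groups. Unlike the shuf pipeline, this sketch keeps the selected groups in their original order.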

attempt 3 (awk) :

Create a file called subset.awk containing :

# Uniform(m) :: returns a random integer such that
#    1 <= Uniform(m) <= m
function Uniform(m) { return 1+int(m * rand()) }

# KnuthShuffle(m) :: creates a random permutation of the range [1,m]
function KnuthShuffle(m,   i,j,k) {
    for (i = 1; i <= m  ; i++) { permutation[i] = i }
    # Fisher-Yates: swap position i with a random position j in [1,i]
    for (i = m; i >= 2; i--) {
        j = Uniform(i)
        k = permutation[i]
        permutation[i] = permutation[j]
        permutation[j] = k
    }
}

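# Records are split on "\n@"; a multi-character RS needs gawk or mawk.
BEGIN{RS="\n@"; srand() }
{a[NR]=$0}
END{ KnuthShuffle(NR);
    sub(/^@/,"",a[1])      # only the first record keeps its leading "@"
    sub(/\n$/,"",a[NR])    # the last record may keep the file's final newline
    for(r = 1; r <= count && r <= NR; r++) {
        print "@"a[permutation[r]]
     }
}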

And then you can run :

$ gunzip -c <file.gz> | awk -v count=30 -f subset.awk > <output.txt>
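The awk script shuffles all record indices but only uses the first count of them. A possible variation, shown here in Python as a minimal sketch rather than as the answer's method, is a partial Fisher-Yates shuffle that stops after count swaps. The function name `partial_shuffle_sample` and its `rng` parameter are hypothetical.

```python
import random


def partial_shuffle_sample(items, count, rng=random):
    """Return `count` distinct items chosen uniformly at random.

    Runs only the first `count` steps of a Fisher-Yates shuffle on a copy,
    so the input itself is left unchanged.
    """
    pool = list(items)
    count = min(count, len(pool))
    for i in range(count):
        # Pick j uniformly from the not-yet-fixed positions i..len(pool)-1.
        j = rng.randrange(i, len(pool))
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:count]
```

For example, `partial_shuffle_sample(range(1, nrec + 1), 30)` would give 30 distinct record numbers, where `nrec` stands for the number of groups in the file. In Python itself, `random.sample` already covers this use case; the sketch only makes the shuffle step explicit.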