Question:
I want to randomly select 3000 lines from sample.file, which contains 8000 lines. I'd like to do this with awk or from the command line. How can I do that?
Answer 1:
If you have GNU sort, it's easy:
sort -R FILE | head -n3000
If you have GNU shuf, it's even easier:
shuf -n3000 FILE
(Note that sort -R sorts by a hash of each line, so duplicate lines end up grouped together; shuf does a true shuffle.)
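For the question's concrete numbers, assuming GNU coreutils, that could look like (the output file name is just an example):
shuf -n3000 sample.file > sample3000.file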
Answer 2:
awk 'BEGIN { srand() }
{ a[NR] = $0 }                               # buffer every line, keyed by line number
END { for (i = 1; i <= 3000; i++) {          # draw 3000 random indices
        x = int(rand() * NR) + 1
        print a[x]
      } }' yourFile
Note that this samples with replacement, so the same line can be printed more than once (answer 3 fixes that).
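If the file is too large to buffer entirely in memory, reservoir sampling (not from the answers above; a sketch in plain POSIX awk) keeps only 3000 lines around and samples without replacement:
awk -v k=3000 '
BEGIN { srand() }
NR <= k { r[NR] = $0; next }       # fill the reservoir with the first k lines
{ j = int(rand() * NR) + 1         # pick a uniform slot in 1..NR
  if (j <= k) r[j] = $0 }          # replace it with probability k/NR
END { for (i = 1; i <= k; i++) print r[i] }
' sample.file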
Answer 3:
Fixed as per Glenn's comment:
awk 'BEGIN {
  a = 8000; l = 3000
  srand(); nr[x]                   # x is unset, so this seeds nr with a dummy "" element
  while (length(nr) <= l)          # loop until l real line numbers sit beside the dummy
    nr[int(rand() * a) + 1]        # merely referencing an element creates it
}
NR in nr                           # print every input line whose number was drawn
' infile
P.S. Passing an array to the length built-in function is not portable, you've been warned :)
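A portable variant (a sketch) avoids calling length on an array by keeping an explicit counter, under the same assumptions of 8000 input lines and 3000 samples:
awk 'BEGIN {
  a = 8000; l = 3000
  srand()
  while (c < l) {
    x = int(rand() * a) + 1
    if (!(x in nr)) { nr[x]; c++ } # count only line numbers not drawn before
  }
}
NR in nr
' infile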
Answer 4:
You can use a combination of awk, sort, head/tail and sed to do this, such as with:
pax$ seq 1 100 | awk '
...$ BEGIN {srand()}
...$ {print rand() " " $0}
...$ ' | sort | head -5 | sed 's/[^ ]* //'
57
25
80
51
72
which, as you can see, selects five random lines from the one hundred generated by seq 1 100.
The awk trick prefixes every line in the file with a random number and a space, in the format "0.237788 ", then sort (obviously) sorts on that random number. You then use head (or tail, if you don't have head) to get the first (or last) N lines. Finally, sed strips off the random number and the space at the start of each line.
For your specific case, you could use something like (on one line):
awk 'BEGIN {srand()} {print rand() " " $0}' file8000.txt | sort | tail -3000 | sed 's/[^ ]* //' > file3000.txt
Answer 5:
I used these commands, and got what I wanted:
awk 'BEGIN {srand()} {print rand() " " $0}' examples/data_text.txt | sort -n | tail -n 80 | awk '{printf "%1d %s %s\n",$2, $3, $4}' > examples/crossval.txt
which in fact randomly selects 80 lines from the input file (the trailing awk drops the random prefix and reprints the original fields, assuming three fields per line).
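A more general form of the same pipeline, not tied to three-field lines, could drop the random prefix with cut instead (a sketch; file names are illustrative):
awk 'BEGIN {srand()} {print rand(), $0}' input.txt | sort -n | tail -n 80 | cut -d' ' -f2- > output.txt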
Answer 6:
In PowerShell:
Get-Content myfile | Get-Random -Count 3000
or shorter:
gc myfile | random -c 3000
Answer 7:
In case you only need approximately 3000 lines, this is an easy method:
awk -v N=`wc -l < FILE` 'BEGIN {srand()} rand() < 3000/N' FILE
The part between the backticks (`) gives the number of lines in the file, so each line is kept with probability 3000/N; seeding with srand() keeps awk from selecting the same lines on every run.
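If you need exactly 3000 lines while still reading the file once, in order, classic selection sampling (Knuth's Algorithm S) does it; a sketch in the same spirit as the answer above:
awk -v n=`wc -l < FILE` -v m=3000 '
BEGIN { srand() }
rand() < m/n { print; m-- }        # keep this line; one fewer still needed
{ n-- }                            # one fewer line remaining either way
' FILE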
Answer 8:
For a huge file that I didn't want to shuffle, this worked out well and pretty fast:
sed -u -n 'l1p;l2p; ... ;l1000p;l1000q'
The -u option reduces buffering, l1, l2, ..., l1000 stand for random, sorted line numbers obtained from R (Python or Perl would do just as well), and the final l1000q makes sed quit as soon as the last wanted line has been printed.
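One way to generate such a sed script in the shell, assuming GNU shuf and, say, 1000 lines wanted out of 8000 (all names illustrative):
lines=$(shuf -i 1-8000 -n 1000 | sort -n)                          # random, sorted line numbers
last=$(printf '%s\n' "$lines" | tail -n 1)                         # highest selected line number
script=$(printf '%s\n' "$lines" | sed 's/$/p/' | paste -sd ';' -)  # "l1p;l2p;...;l1000p"
sed -u -n "${script};${last}q" sample.file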