I want to shuffle the lines of a text file randomly and create a new file. The file may have several thousand lines.

How can I do that with `cat`, `awk`, `cut`, etc.?
This bash function has minimal dependencies (only `sort` and `bash`):
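The function itself isn't reproduced in this extract; a sketch honoring the same constraint (nothing beyond `bash` and `sort`) might look like this:

```bash
shuffle() {
    # Decorate each line with a random numeric key, sort on the key,
    # then strip the key again in pure bash - only bash and sort needed.
    # ($RANDOM is 15-bit; two are concatenated to reduce key collisions.)
    local line tab=$'\t'
    while IFS= read -r line; do
        printf '%s%s%s\n' "$RANDOM$RANDOM" "$tab" "$line"
    done |
    sort -n |
    while IFS= read -r line; do
        printf '%s\n' "${line#*"$tab"}"
    done
}
```

Usage: `shuffle < input.txt > shuffled.txt`.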
I use a tiny perl script, which I call "unsort":
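The script itself isn't shown in this extract; the core of such an "unsort" is typically a one-liner around `List::Util::shuffle` (a Fisher-Yates shuffle), for example:

```bash
# Sketch of an "unsort": slurp all lines (from files or stdin), shuffle, print.
perl -MList::Util=shuffle -e 'print shuffle(<>);' input.txt > shuffled.txt
```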
I've also got a NULL-delimited version, called "unsort0" ... handy for use with find -print0 and so on.
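A hypothetical NUL-delimited variant, again assuming Perl (`-0` sets the input record separator to NUL):

```bash
# Shuffle NUL-delimited records, e.g. from find -print0, so that
# filenames containing newlines survive intact.
find . -type f -print0 |
    perl -0 -MList::Util=shuffle -e 'print shuffle(<>);' |
    xargs -0 -n 1 echo
```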
PS: Voted up `shuf` too; I had no idea that was there in coreutils these days ... the above may still be useful if your system doesn't have `shuf`.
This is a Python script that I saved as rand.py in my home folder:
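The script isn't reproduced in this extract; a minimal stand-in for it, driven from the shell, could be:

```bash
# Hypothetical equivalent of rand.py: read all lines from stdin,
# shuffle them in place, and write them back out.
python -c '
import random, sys
lines = sys.stdin.readlines()
random.shuffle(lines)
sys.stdout.writelines(lines)
' < input.txt > shuffled.txt
```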
On Mac OSX, `sort -R` and `shuf` are not available, so you can define an equivalent alias in your `bash_profile`, for example:
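The answer's exact alias isn't preserved in this extract; one hypothetical stand-in, assuming Perl (preinstalled on OS X):

```bash
# Hypothetical shuf replacement for ~/.bash_profile.
alias shuf="perl -MList::Util=shuffle -e 'print shuffle(<>);'"
```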
Not mentioned as of yet:

1. The `unsort` util. Its syntax is somewhat playlist oriented.
2. `msort` can shuffle by line, but it's usually overkill.
3. An `awk` script, sketched below.
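The answer's exact `awk` script isn't reproduced in this extract; a self-contained sketch using the standard Fisher-Yates algorithm:

```bash
awk '
BEGIN { srand() }               # seed the RNG (time-based, once per run)
{ lines[NR] = $0 }              # buffer all input lines
END {
    for (i = NR; i > 1; i--) {  # Fisher-Yates shuffle of the buffer
        j = int(rand() * i) + 1 # uniform random index in 1..i
        tmp = lines[i]; lines[i] = lines[j]; lines[j] = tmp
    }
    for (i = 1; i <= NR; i++)
        print lines[i]
}' input.txt > shuffled.txt
```

Feeding it `seq 5`, for example, outputs the numbers 1 through 5 in some random order.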
This answer complements the many great existing answers in the following ways:

- The existing answers are packaged into flexible shell functions:
  - The functions take not only stdin input, but alternatively also filename arguments.
  - The functions handle SIGPIPE in the usual way (quiet termination with exit code 141), as opposed to breaking noisily. This is important when piping the function output to a pipe that is closed early, such as when piping to `head`.
- A performance comparison is made.
An `awk` + `sort` + `cut` combo, adapted from the OP's own answer:
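A sketch of that combo as a function (my reconstruction of the pattern, not necessarily the answer's exact code):

```bash
# Decorate-sort-undecorate: awk prefixes each line with a random key,
# sort orders by that key, cut strips it. "$@" lets the function accept
# filename arguments or, if none are given, read stdin. All three
# utilities terminate quietly on SIGPIPE (exit status 141).
shuf_awk() {
    awk 'BEGIN { srand(); OFMT = "%.17f" } { print rand(), $0 }' "$@" |
        sort -k1,1n |
        cut -d ' ' -f2-
}
```

Note that POSIX `srand()` seeds from the current time in whole seconds, so two runs started within the same second can produce the same order.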
Performance comparison:

Note: These numbers were obtained on a late-2012 iMac with a 3.2 GHz Intel Core i5 and a Fusion Drive, running OSX 10.10.3. While timings will vary with the OS used, machine specs, and the `awk` implementation used (e.g., the BSD `awk` version used on OSX is usually slower than GNU `awk` and especially `mawk`), this should provide a general sense of relative performance.

The input file is a 1-million-line file produced with `seq -f 'line %.0f' 1000000`.

Times are listed in ascending order (fastest first):
- `shuf`: 0.090s
- Ruby (2.0.0): 0.289s
- Perl (5.18.2): 0.589s
- Python: 1.342s with Python 2.7.6; 2.407s (!) with Python 3.4.2
- `awk` + `sort` + `cut` combo: 3.003s with BSD `awk`; 2.388s with GNU `awk` (4.1.1); 1.811s with `mawk` (1.3.4)

For further comparison, the solutions not packaged as functions above:
- `sort -R` (not a true shuffle if there are duplicate input lines): 10.661s - allocating more memory doesn't seem to make a difference
- Scala: 24.229s
- `bash` loops + `sort`: 32.593s
Conclusions:

- Use `shuf`, if you can - it's the fastest by far.
- Use the `awk` + `sort` + `cut` combo as a last resort; which `awk` implementation you use matters (`mawk` is faster than GNU `awk`, BSD `awk` is slowest).
- Avoid `sort -R`, `bash` loops, and Scala.
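For reference, typical `shuf` invocations (GNU coreutils):

```bash
# Write a random permutation of input.txt's lines to a new file.
shuf input.txt > shuffled.txt

# Same, using shuf's own output option.
shuf -o shuffled.txt input.txt
```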