Randomly shuffle the lines of a 3-million-line file

Posted 2019-02-04 10:58

Everything is in the title. I'm wondering if anyone knows a quick way, with reasonable memory demands, of randomly shuffling all the lines of a 3-million-line file. I guess it is not possible with a simple vim command, so I'm looking for a simple script, e.g. in Python. I tried using Python's random number generator, but did not manage to find a simple way to do it.

9 Answers
在下西门庆
#2 · 2019-02-04 11:52

If you do not want to load everything into memory and sort it there, you have to store the lines on disk while shuffling. That will be very slow.

Here is a very simple, stupid and slow version. Note that it may take a surprising amount of disk space, and it will be very slow. I ran it with 300,000 lines, and it took several minutes. 3 million lines could very well take an hour. So: do it in memory. Really. It's not that big.

import os
import random
import shutil
import tempfile

tempdir = tempfile.mkdtemp()
print(tempdir)

files = []
# Split the input into one temporary file per line:
with open('/tmp/sorted.txt', 'rt') as infile:
    for counter, line in enumerate(infile):
        outfilename = os.path.join(tempdir, '%09i.txt' % counter)
        with open(outfilename, 'wt') as outfile:
            outfile.write(line)
        files.append(outfilename)

# Pick the per-line files in random order and concatenate them:
with open('/tmp/random.txt', 'wt') as outfile:
    while files:
        index = random.randint(0, len(files) - 1)
        filename = files.pop(index)
        with open(filename, 'rt') as linefile:
            outfile.write(linefile.read())

shutil.rmtree(tempdir)

Another version would be to store the lines in an SQLite database and pull them out of that database in random order. That is probably going to be faster than this.
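
A minimal sketch of that SQLite variant, assuming the same /tmp/sorted.txt input and /tmp/random.txt output as the script above (the database path and table name are made up for illustration):

import sqlite3

# Store every line in an on-disk SQLite database, then read them
# back in random order. ORDER BY RANDOM() lets SQLite handle the
# shuffling, spilling to temporary storage if needed, so the lines
# never all sit in Python's memory at once.
conn = sqlite3.connect('/tmp/lines.db')
conn.execute('CREATE TABLE lines (line TEXT)')
with open('/tmp/sorted.txt', 'rt') as infile:
    conn.executemany('INSERT INTO lines VALUES (?)',
                     ((line,) for line in infile))
conn.commit()

with open('/tmp/random.txt', 'wt') as outfile:
    for (line,) in conn.execute('SELECT line FROM lines ORDER BY RANDOM()'):
        outfile.write(line)
conn.close()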

Summer. ? 凉城
#3 · 2019-02-04 11:59
import random

# Decorate each line with a random sort key, sort on the key, then
# write the lines back out in the resulting (random) order.
with open('the_file', 'r') as source:
    data = [(random.random(), line) for line in source]
data.sort()
with open('another_file', 'w') as target:
    for _, line in data:
        target.write(line)

That should do it. 3 million lines will fit into most machines' memory unless the lines are HUGE (over 512 characters).
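
A sketch of an equivalent, more direct in-memory approach, using random.shuffle instead of the decorate-sort pattern above (same hypothetical file names):

import random

# Read all lines into memory, shuffle them in place, write them out.
with open('the_file', 'r') as source:
    lines = source.readlines()
random.shuffle(lines)
with open('another_file', 'w') as target:
    target.writelines(lines)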

Emotional °昔
#4 · 2019-02-04 12:00

I just tried this on a file with 4.3 million lines, and the fastest option was the shuf command on Linux. Use it like this:

shuf huge_file.txt -o shuffled_lines_huge_file.txt

It took 2-3 seconds to finish.
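
If you need to drive this from Python (as the question asks), a minimal sketch that shells out to shuf via subprocess, assuming GNU coreutils is installed:

import subprocess

# Delegate the shuffling to GNU shuf, which is typically much faster
# than a pure-Python solution for files of this size.
subprocess.run(
    ['shuf', 'huge_file.txt', '-o', 'shuffled_lines_huge_file.txt'],
    check=True,
)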
