How to shuffle a text file on disk in Python

2019-07-27 11:02发布

问题:

I am working with a text file of about 12*10^6 rows which is stored on my hard disk. The structure of the file is:

data|data|data|...|data\n
data|data|data|...|data\n
data|data|data|...|data\n
...
data|data|data|...|data\n

There's no header, and there's no id to uniquely identify the rows.

Since I want to use it for machine learning purposes, I need to make sure that there's no order in the text file which may affect the stochastic learning.

Usually I upload such kind of files into memory, and I shuffle them before rewriting them to disk. Unfortunately this time it is not possible, due to the size of the file, so I have to manage the shuffling directly on disk(assume I don't have problem with the disk space). Any idea about how to effectively (with lowest possible complexity, i.e. writing to the disk) manage such task with Python?

回答1:

All but one of these ideas use O(N) memory—but if you use an array.array or numpy.ndarray we're talking around N*4 bytes, which is significantly smaller than the whole file. (I'll use a plain list for simplicity; if you need help converting to a more compact type, I can show that too.)


Using a temporary database and an index list:

with contextlib.closing(dbm.open('temp.db', 'n')) as db:
    with open(path) as f:
        for i, line in enumerate(f):
            db[str(i)] = line
    linecount = i
    shuffled = random.shuffle(range(linecount))
    with open(path + '.shuffled', 'w') as f:
        for i in shuffled:
            f.write(db[str(i)])
os.remove('temp.db')

This is 2N single-line disk operations, and 2N single-dbm-key disk operations, which should be 2NlogN single-disk-disk-operation-equivalent operations, so the total complexity is O(NlogN).


If you use a relational database like sqlite3 instead of a dbm, you don't even need the index list, because you can just do this:

SELECT * FROM Lines ORDER BY RANDOM()

This has the same time complexity as the above, and the space complexity is O(1) instead of O(N)—in theory. In practice, you need an RDBMS that can feed you a row at a time from a 100M row set without storing that 100M on either side.


A different option, without using a temporary database—in theory O(N**2), but in practice maybe faster if you happen to have enough memory for the line cache to be helpful:

with open(path) as f:
    linecount = sum(1 for _ in f)
shuffled = random.shuffle(range(linecount))
with open(path + '.shuffled', 'w') as f:
    for i in shuffled:
        f.write(linecache.getline(path, i))

Finally, by doubling the size of the index list, we can eliminate the temporary disk storage. But in practice, this might be a lot slower, because you're doing a lot more random-access reads, which drives aren't nearly as good at.

with open(path) as f:
    linestarts = [f.tell() for line in f]
    lineranges = zip(linestarts, linestarts[1:] + [f.tell()])
    shuffled = random.shuffle(lineranges)
    with open(path + '.shuffled', 'w') as f1:
        for start, stop in shuffled:
            f.seek(start)
            f1.write(f.read(stop-start))


回答2:

This is a suggestion based on my comment above. It relies on having the compressed lines still being able to fit into memory. If that is not the case, the other solutions will be required.

import zlib
from random import shuffle

def heavy_shuffle(filename_in, filename_out):
    with open(filename_in, 'r') as f:
        zlines = [zlib.compress(line, 9) for line in f]
    shuffle(zlines)
    with open(filename_out, 'w') as f:
        for zline in zlines:
            f.write(zlib.decompress(zline) + '\n')

My experience has been that zlib is fast, while bz2 offers better compression, so you may want to compare.

Also, if you can get away with chunking, say, n lines together, doing so is likely to lift your compression ratio.


I was wondering about the likelihood of useful compression, so here's an IPython experiment. I don't know what your data looks like, so I just went with floats (as strings) rounded to 3 places and strung together with pipes:

Best-case scenario (e.g. many rows have all same digits):

In [38]: data = '0.000|'*200

In [39]: len(data)
Out[39]: 1200

In [40]: zdata = zlib.compress(data, 9)

In [41]: print 'zlib compression ratio: ',1.-1.*len(zdata)/len(data)
zlib compression ratio:  0.98

In [42]: bz2data = bz2.compress(data, 9)

In [43]: print 'bz2 compression ratio: ',1.-1.*len(bz2data)/len(data)
bz2 compression ratio:  0.959166666667

As expected, best-case is really good, >95% compression ratio.

Worst-case scenario (randomized data):

In [44]: randdata = '|'.join(['{:.3f}'.format(x) for x in np.random.randn(200)])

In [45]: zdata = zlib.compress(randdata, 9)

In [46]: print 'zlib compression ratio: ',1.-1.*len(zdata)/len(data)
zlib compression ratio:  0.5525

In [47]: bz2data = bz2.compress(randdata, 9)

In [48]: print 'bz2 compression ratio: ',1.-1.*len(bz2data)/len(data)
bz2 compression ratio:  0.5975

Surprisingly, worst-case is not too bad ~60% compression ratio, but likely to be problematic if you only have 8 GB of memory (60% of 15 GB is 9 GB).



回答3:

Assuming that disk space is not not a problem with you, I am creating multiple files to hold the data.

import random
import os

PMSize = 100 #Lesser value means using more primary memory
shuffler = lambda x: open(x, 'w')
shufflers = [shuffler('file'+str(x)) for x in range(PMSize)]

with open('filename') as file:
    for line in file:
        i = random.randint(0, len(shufflers)-1)
        shufflers[i].write(line)

with open('filename', 'w') as file:
    for file in shufflers:
        newfile.write(file.read())

for file in shufflers:
    os.remove(file)

Your Memory complexity will be controlled by PMSize. Time complexity will be around O(N + PMSize).



回答4:

This problem can be thought as a problem of efficient memory pages management to reduce swap file I/O. Let your buffer buf be a list of contigous chunks of file you would like to be stored into the output file. Let a contigous chunk of file be a list of a fixed amount of whole lines.

Now, generate a random sequence and remap the returned values to appriopriate chunk numbers and line offsets inside that chunk.

This operation leaves you with a sequence of numbers [1..num of chunks] which can be described as a sequence of accesses to memory fragments contained in pages of numbers between [1..num of chunks]. For online variantion (like in real OS), there is no optimal strategy to this problem, but since you know the actual sequence of pages' referencing there is an optimal solution which can be found here.

What's the gain from this approach? Pages which are going to be used most often are least reread from the HDD meaning less I/O operations to read the data. Also, considering your chunk size is big enough to minimize page swapping compared to memory footprint, a lot of the times following lines of the output file would be taken from the same chunk stored in memory (or any other, but not yet swapped to the drive) rather than reread from the drive.

Maybe it's not the easiest solution (though the optimal page swapping algorithm is easy to write), it could be a fun exercise to do, wouldn't it?