I am working with a text file of about 12*10^6 rows which is stored on my hard disk. The structure of the file is:
data|data|data|...|data\n
data|data|data|...|data\n
data|data|data|...|data\n
...
data|data|data|...|data\n
There's no header, and there's no id to uniquely identify the rows.
Since I want to use it for machine learning purposes, I need to make sure that there's no order in the text file which may affect the stochastic learning.
Usually I load this kind of file into memory and shuffle it before rewriting it to disk. Unfortunately, this time that is not possible due to the size of the file, so I have to manage the shuffling directly on disk (assume I don't have a problem with disk space). Any idea how to effectively manage such a task with Python (with the lowest possible complexity, i.e. in terms of writes to disk)?
All but one of these ideas use O(N) memory, but if you use an array.array or numpy.ndarray we're talking around N*4 bytes, which is significantly smaller than the whole file. (I'll use a plain list for simplicity; if you need help converting to a more compact type, I can show that too.)

Using a temporary database and an index list:
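A minimal sketch of that approach, using the standard library's dbm module as the temporary database (function and file names here are made up):

```python
import dbm
import random

def shuffle_with_tempdb(src, dst, tmpdb="lines.db"):
    # Pass 1: copy every line into a temporary dbm database keyed by line number.
    with dbm.open(tmpdb, "n") as db, open(src, "rb") as f:
        count = 0
        for line in f:
            db[str(count)] = line
            count += 1

    # The index list: one entry per line, shuffled in memory.
    indexes = list(range(count))
    random.shuffle(indexes)

    # Pass 2: write the lines back out in shuffled order, one dbm lookup each.
    with dbm.open(tmpdb, "r") as db, open(dst, "wb") as out:
        for i in indexes:
            out.write(db[str(i)])
    # (Remember to delete the temporary database file(s) afterwards.)
```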
This is 2N single-line disk operations and 2N single-dbm-key disk operations, which should amount to roughly 2N log N single-disk-operation-equivalent operations, so the total complexity is O(N log N).
If you use a relational database like sqlite3 instead of a dbm, you don't even need the index list, because you can just ask the database to hand the rows back in random order, as in the sketch below.

This has the same time complexity as the above, and the space complexity is O(1) instead of O(N), in theory. In practice, you need an RDBMS that can feed you a row at a time from a 100M-row set without storing that 100M on either side.
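Concretely, the sqlite3 variant could look something like this (the table and file names are illustrative):

```python
import sqlite3

def shuffle_with_sqlite(src, dst, tmpdb="lines.sqlite"):
    con = sqlite3.connect(tmpdb)
    con.execute("CREATE TABLE lines (line TEXT)")
    with open(src) as f:
        con.executemany("INSERT INTO lines VALUES (?)", ((line,) for line in f))
    con.commit()

    # Let the database hand the rows back in random order and stream them out.
    with open(dst, "w") as out:
        for (line,) in con.execute("SELECT line FROM lines ORDER BY RANDOM()"):
            out.write(line)
    con.close()
```

Whether this stays O(1) in space in practice depends on how the engine implements ORDER BY RANDOM(); that is exactly the caveat above.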
A different option, without using a temporary database: in theory it's O(N**2), but in practice it may be faster if you happen to have enough memory for the line cache to be helpful.
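A sketch of that option: the helper below rescans the file for every requested line, which is where the theoretical O(N**2) comes from; when memory allows, the standard library's linecache module offers the same kind of per-line access backed by an in-memory cache, which is what could make this fast in practice.

```python
import random

def read_nth_line(path, n):
    # Worst case: scan from the top of the file for every single output line.
    with open(path) as f:
        for i, line in enumerate(f):
            if i == n:
                return line

def shuffle_without_tempdb(src, dst):
    # Count the lines, shuffle the index list in memory, then fetch lines on demand.
    with open(src) as f:
        linecount = sum(1 for _ in f)
    order = list(range(linecount))
    random.shuffle(order)
    with open(dst, "w") as out:
        for n in order:
            out.write(read_nth_line(src, n))
```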
Finally, by doubling the size of the index list, we can eliminate the temporary disk storage. But in practice, this might be a lot slower, because you're doing a lot more random-access reads, which drives aren't nearly as good at.
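A sketch of that variant: the index list now stores each line's byte offset instead of just its number (that is where the extra memory goes), and the shuffled output is produced by seeking straight into the original file, so no temporary copy of the data ever hits the disk.

```python
import random

def shuffle_by_offsets(src, dst):
    # Pass 1: record the byte offset at which every line starts.
    offsets = []
    with open(src, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)

    random.shuffle(offsets)

    # Pass 2: seek to each start offset in shuffled order and copy that line out.
    with open(src, "rb") as f, open(dst, "wb") as out:
        for pos in offsets:
            f.seek(pos)
            out.write(f.readline())
```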
This is a suggestion based on my comment above. It relies on the compressed lines still being able to fit into memory; if that is not the case, the other solutions will be required.
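A minimal sketch of the idea, assuming zlib and compressing each line on its own:

```python
import random
import zlib

def shuffle_compressed_in_memory(src, dst):
    # Hold every line in memory, but compressed, so a file that wouldn't fit raw might fit.
    with open(src, "rb") as f:
        compressed = [zlib.compress(line) for line in f]

    random.shuffle(compressed)

    with open(dst, "wb") as out:
        for blob in compressed:
            out.write(zlib.decompress(blob))
```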
My experience has been that zlib is fast, while bz2 offers better compression, so you may want to compare.
Also, if you can get away with chunking, say, n lines together, doing so is likely to lift your compression ratio.
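For example, a quick way to gauge the effect of chunking (the sizes here are arbitrary):

```python
import random
import zlib

# One hundred sample rows of pipe-separated floats.
rows = [("|".join("%.3f" % random.uniform(0, 1000) for _ in range(100)) + "\n").encode()
        for _ in range(100)]

per_line = sum(len(zlib.compress(r)) for r in rows)   # each line compressed alone
per_chunk = len(zlib.compress(b"".join(rows)))        # all 100 lines compressed together
print(per_line, per_chunk)
```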
I was wondering about the likelihood of useful compression, so here's an IPython experiment. I don't know what your data looks like, so I just went with floats (as strings) rounded to 3 places and strung together with pipes:
Best-case scenario (e.g. many rows have all the same digits):
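For instance, a sketch with every field identical (the exact figures from the original run aren't reproduced here):

```python
import zlib

# Best case: every field in the row is the same value, so the row is extremely repetitive.
row = ("|".join(["123.456"] * 100) + "\n").encode()
packed = zlib.compress(row)
print(len(row), len(packed), 1 - len(packed) / len(row))
```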
As expected, the best case is really good: a >95% compression ratio.
Worst-case scenario (randomized data):
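And the same sketch with every field drawn at random:

```python
import random
import zlib

# Worst case: every field is a different random float, so there is little repetition.
row = ("|".join("%.3f" % random.uniform(0, 1000) for _ in range(100)) + "\n").encode()
packed = zlib.compress(row)
print(len(row), len(packed), 1 - len(packed) / len(row))
```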
Surprisingly, the worst case is not too bad: ~60% compression ratio, meaning the compressed data is still around 60% of the original size, which is likely to be problematic if you only have 8 GB of memory (60% of 15 GB is 9 GB).
This problem can be thought of as a problem of efficient memory-page management to reduce swap-file I/O. Let your buffer buf be a list of contiguous chunks of the file which you would like to have written to the output file, where a contiguous chunk is a list of a fixed number of whole lines.

Now, generate a random sequence of line indices and remap the returned values to the appropriate chunk numbers and line offsets inside those chunks.
This operation leaves you with a sequence of numbers in [1..num of chunks], which can be read as a sequence of accesses to memory fragments contained in pages numbered [1..num of chunks]. For the online variant (as in a real OS) there is no optimal strategy for this problem, but since you know the actual sequence of page references in advance, an optimal solution exists and can be found here.

What's the gain from this approach? The pages that are going to be used most often are re-read from the HDD the least, meaning fewer I/O operations to read the data. Also, provided your chunk size is big enough to minimize page swapping relative to the memory footprint, much of the time consecutive lines of the output file will be taken from the same chunk kept in memory (or from another chunk that has not yet been swapped back to the drive) rather than re-read from the drive.
Maybe it's not the easiest solution (though the optimal page-swapping algorithm itself is easy to write), but it could be a fun exercise to do, wouldn't it?
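For what it's worth, a minimal sketch of that exercise: the chunk size, the number of resident chunks and all names are made up, and the eviction rule is the classic offline-optimal one (evict the chunk whose next use lies farthest in the future).

```python
import random
from collections import deque

CHUNK_LINES = 10_000    # whole lines per chunk (tune to your memory)
MAX_RESIDENT = 64       # how many chunks buf may hold at once

def shuffle_with_optimal_paging(src, dst):
    # Pass 1: remember where every chunk starts and how many lines there are.
    chunk_starts = [0]
    total_lines = 0
    with open(src, "rb") as f:
        pos = 0
        for line in f:
            total_lines += 1
            pos += len(line)
            if total_lines % CHUNK_LINES == 0:
                chunk_starts.append(pos)

    # Random output order, remapped to (chunk number, line offset inside the chunk).
    order = list(range(total_lines))
    random.shuffle(order)
    accesses = [(i // CHUNK_LINES, i % CHUNK_LINES) for i in order]

    # For each chunk, the queue of times at which it will be needed, so the
    # "farthest next use" rule can be evaluated cheaply.
    future = {}
    for t, (chunk, _) in enumerate(accesses):
        future.setdefault(chunk, deque()).append(t)

    def load_chunk(f, chunk):
        f.seek(chunk_starts[chunk])
        return [f.readline() for _ in range(CHUNK_LINES)]

    buf = {}  # resident chunks: chunk number -> list of its lines
    with open(src, "rb") as f, open(dst, "wb") as out:
        for chunk, offset in accesses:
            future[chunk].popleft()  # this access is happening now
            if chunk not in buf:
                if len(buf) >= MAX_RESIDENT:
                    # Evict the resident chunk whose next use is farthest away
                    # (or that is never needed again).
                    victim = max(buf, key=lambda c: future[c][0] if future[c] else float("inf"))
                    del buf[victim]
                buf[chunk] = load_chunk(f, chunk)
            out.write(buf[chunk][offset])
```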
Assuming that disk space is not a problem for you, I am creating multiple files to hold the data.
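A minimal sketch of that approach, assuming PMSize is the number of bytes you can afford to hold in memory at once: scatter the lines at random across enough bucket files that each bucket fits in memory, then shuffle each bucket in memory and concatenate them.

```python
import os
import random

PMSize = 512 * 1024 * 1024   # in-memory budget per bucket, in bytes (assumption)

def external_shuffle(src, dst, workdir="buckets"):
    os.makedirs(workdir, exist_ok=True)
    nbuckets = os.path.getsize(src) // PMSize + 1
    paths = [os.path.join(workdir, "bucket_%d.txt" % i) for i in range(nbuckets)]
    buckets = [open(p, "w") for p in paths]

    # Pass 1: deal every line into a uniformly random bucket file.
    with open(src) as f:
        for line in f:
            random.choice(buckets).write(line)
    for b in buckets:
        b.close()

    # Pass 2: each bucket should now fit in memory, so shuffle it there and append it.
    with open(dst, "w") as out:
        for p in paths:
            with open(p) as b:
                lines = b.readlines()
            random.shuffle(lines)
            out.writelines(lines)
            os.remove(p)
```

Every line is written twice (once into its bucket, once into the output), so the disk work stays linear in N.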
Your memory complexity will be controlled by PMSize. Time complexity will be around O(N + PMSize).