I have a huge text file (~1GB) and sadly the text editor I use won't read such a large file. However, if I can just split it into two or three parts I'll be fine, so as an exercise I wanted to write a program in Python to do it.
What I think I want the program to do is find the size of the file, divide that number into parts, and for each part, read up to that point in chunks, writing to a filename.nnn output file, then read up to the next line break and write that, then close the output file, etc. Obviously the last output file just copies to the end of the input file.
Can you help me with the key filesystem-related parts: getting the file size, reading and writing in chunks, and reading to a line break?
I'll be writing this code test-first, so there's no need to give me a complete answer, unless it's a one-liner ;-)
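The three pieces asked about here all map onto standard-library calls. A minimal sketch (the filename is just a placeholder):

    import os

    size = os.path.getsize("file.txt")   # size of the file in bytes

    with open("file.txt", "rb") as f:
        chunk = f.read(64 * 1024)        # read one 64 KB chunk
        tail = f.readline()              # read up to (and including) the next line break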
While Ryan Ginstrom's answer is correct, it does take longer than it should (as he has already noted). Here's a way to circumvent the multiple calls to itertools.islice by successively iterating over the open file descriptor:
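The code itself isn't reproduced in this excerpt; a minimal sketch of that approach, with illustrative names, could look like this:

    def split_by_lines(filename, lines_per_file):
        # One pass over a single open file object: each output file
        # drains the next lines_per_file lines from the same iterator,
        # so nothing is ever re-read from the start.
        with open(filename) as src:
            dst, part = None, 0
            for i, line in enumerate(src):
                if i % lines_per_file == 0:
                    if dst:
                        dst.close()
                    part += 1
                    dst = open(f"{filename}.{part:03d}", "w")
                dst.write(line)
            if dst:
                dst.close()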
I've written the program and it seems to work fine, so thanks to Kamil Kisiel for getting me started.
(Note that FileSizeParts() is a function not shown here)
Later I may get round to doing a version that does a binary read to see if it's any quicker.
This generator method is a (slow) way to get a slice of lines without blowing up your memory.
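Assuming the generator in question is islice-based, a sketch (not the original code) might be:

    from itertools import islice

    def getlines(fname, start, stop):
        # Lazily yields lines[start:stop] without loading the whole
        # file. Calling this once per output chunk rescans the file
        # from the top each time, which is the slowness the answer
        # above works around.
        with open(fname) as f:
            yield from islice(f, start, stop)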
Linux has a split command:

    split -l 100000 file.txt

would split file.txt into files of 100,000 lines each.
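(On GNU split, -b splits by byte size instead, e.g. split -b 100M file.txt, which is closer to the size-based split the question describes.)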
Don't forget seek() and mmap() for random access to files.
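For example, mmap makes it easy to jump to an arbitrary byte offset and then extend the split point to the next line break. A sketch, with illustrative filenames:

    import mmap

    with open("file.txt", "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        target = len(mm) // 2                    # rough midpoint in bytes
        pos = mm.find(b"\n", target)             # next line break at or after it
        cut = pos + 1 if pos != -1 else len(mm)  # split just past that newline
        with open("file.txt.001", "wb") as out:
            out.write(mm[:cut])
        with open("file.txt.002", "wb") as out:
            out.write(mm[cut:])
        mm.close()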
usage: split.py filename splitsizeinkb
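The script itself isn't included in this excerpt; a sketch matching that usage line (splitting by size in kilobytes, then reading on to the next line break so no line is cut in half) might look like:

    import sys

    def split(filename, kb_per_part):
        part_size = kb_per_part * 1024
        with open(filename, "rb") as src:
            part = 0
            while True:
                chunk = src.read(part_size)   # read one part-sized chunk
                if not chunk:
                    break
                chunk += src.readline()       # continue to the next line break
                part += 1
                with open(f"{filename}.{part:03d}", "wb") as dst:
                    dst.write(chunk)

    if __name__ == "__main__":
        split(sys.argv[1], int(sys.argv[2]))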