I need to update the last line of several files, each larger than 2 GB, made up of lines of text that cannot be read with readlines(). Currently, it works fine by looping through the file line by line. However, I am wondering whether there is a compiled library that can achieve this more efficiently? Thanks!
Current approach
myfile = open("large.XML")
for line in myfile:
    do_something()
If this is really something line based (where a true XML parser isn't necessarily the best solution), mmap can help here.

mmap the file, then call .rfind('\n') on the resulting object (possibly with adjustments to handle the file ending with a newline when you really want the non-empty line before it, not the empty "line" following it). You can then slice out the final line alone. If you need to modify the file in place, you can resize the file to shave off (or add) a number of bytes corresponding to the difference between the line you sliced and the new line, then write back the new line. This avoids reading or writing any more of the file than you need.

Example code (please comment if I made a mistake):
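A minimal sketch of the steps above, assuming a non-empty file and a system where mmap's resize is supported. The function name replace_last_line and the transform callback are hypothetical names chosen for illustration, not part of any library:

```python
import mmap


def replace_last_line(path, transform, encoding="utf-8"):
    """Rewrite only the final line of the file at `path`, in place.

    `transform` is a hypothetical callback: it receives the old last line
    as str (without its line ending) and returns the replacement as str.
    Assumes the file is non-empty and the result is not a zero-byte file.
    """
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
            # Stopping the search at len(mm) - 1 skips a trailing newline,
            # so we find the newline *before* the last non-empty line.
            # The + 1 steps past it and maps rfind's -1 (one-line file) to 0.
            start = mm.rfind(b"\n", 0, len(mm) - 1) + 1

            # Split the tail into the line itself and its line ending,
            # so the original ending (if any) is preserved.
            tail = mm[start:]
            body = tail.rstrip(b"\r\n")
            ending = tail[len(body):]

            new_tail = transform(body.decode(encoding)).encode(encoding) + ending

            # Grow or shrink the mapping (and the file) to fit, then write.
            mm.resize(start + len(new_tail))
            mm[start:] = new_tail
```

For instance, replace_last_line("large.XML", lambda old: "</root>") would swap the final line for </root> while keeping whatever line ending the file already had.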
Apparently on some systems (e.g. OS X) without mremap, mm.resize won't work, so to support those systems, you'd probably split the with (so the mmap closes before the file object), and use file object based seeks, writes and truncates to fix up the file. The following example includes my previously mentioned Python 3.1 and earlier specific adjustment to use contextlib.closing for completeness:

The advantages of mmap over any other approach are that it avoids reading or writing any more of the file than you need, and that rfind means you can let Python do the work of finding the newline quickly at the C layer (in CPython); explicit seeks and reads of a file object could match the "only read a page or so" behavior, but you'd have to hand-implement the search for the newline.

Caveat: This approach will not work (at least, not without modification to avoid mapping more than 2 GB, and to handle resizing when the whole file might not be mapped) if you're on a 32 bit system and the file is too large to map into memory. On most 32 bit systems, even in a newly spawned process, you only have 1-2 GB of contiguous address space available; in certain special cases, you might have as much as 3-3.5 GB of user virtual addresses (though you'll lose some of the contiguous space to the heap, stack, executable mapping, etc.).
mmap doesn't require much physical RAM, but it needs contiguous address space; one of the huge benefits of a 64 bit OS is that you stop worrying about virtual address space in all but the most ridiculous cases, so mmap can solve problems in the general case that it couldn't handle without added complexity on a 32 bit OS. Most modern computers are 64 bit at this point, but it's definitely something to keep in mind if you're targeting 32 bit systems (and on Windows, even if the OS is 64 bit, users may have installed a 32 bit version of Python by mistake, so the same problems apply). Here's yet one more example that works (assuming the last line isn't 100+ MB long) on 32 bit Python (omitting closing and imports for brevity) even for huge files:

Update: Use ShadowRanger's answer. It's much shorter and more robust.
For posterity:
Read the last N bytes of the file and search backwards for the newline.
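That idea can be sketched with plain file-object seeks, which never map the file and so do not depend on address space or on mmap resizing. This is an illustration under stated assumptions, not the original code: find_last_line_start and replace_last_line_seek are hypothetical names, chunk_size is an arbitrary choice for N, and lines are assumed to end in \n (or \r\n):

```python
import os


def find_last_line_start(f, chunk_size=4096):
    """Return the byte offset where the last non-empty line begins.

    `f` must be a binary file object opened for reading.
    """
    f.seek(0, os.SEEK_END)
    size = f.tell()
    if size == 0:
        return 0

    # If the file ends with a newline, leave it out of the search so we
    # find the newline that precedes the last line instead.
    f.seek(size - 1)
    pos = size - 1 if f.read(1) == b"\n" else size

    # Walk backwards one chunk at a time until a newline shows up.
    while pos > 0:
        read_from = max(0, pos - chunk_size)
        f.seek(read_from)
        chunk = f.read(pos - read_from)
        idx = chunk.rfind(b"\n")
        if idx != -1:
            return read_from + idx + 1
        pos = read_from
    return 0  # no newline found: the whole file is one line


def replace_last_line_seek(path, new_line, chunk_size=4096):
    """Replace the last line of `path` with `new_line` (bytes).

    `new_line` is written verbatim, so include a trailing newline if wanted.
    """
    with open(path, "r+b") as f:
        start = find_last_line_start(f, chunk_size)
        f.seek(start)
        f.truncate(start)  # drop the old last line
        f.write(new_line)  # append the replacement
```

Only the tail of the file is read, one chunk at a time, so the cost depends on the length of the last line rather than on the size of the file.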