I need to parse a file that could be many GBs in size. I would like to do this in C. Can anyone suggest any methods to accomplish this?
The file that I need to open and parse is a hard drive dump that I get from my Mac's hard drive. However, I plan on running my program inside of 64-bit Ubuntu 10.04. Also, given the large file size, the more optimized the method the better.
On both *nix and Windows, there are extensions to the I/O routines that deal with file offsets that will support sizes larger than 2GB or 4GB. Naturally, the underlying file system must also support a file that large; on Windows, NTFS does, for instance, but FAT doesn't. This is generally known as "large file support".
The two routines that are most critical for these purposes are fseeko() and ftello() (the off_t variants of fseek() and ftell()), so that you can do random access to the whole file. Otherwise, the ordinary fopen(), fread() and friends can do sequential access to any size of file, as long as the underlying OS and stdio implementation support large files.
Assuming you're on a linux/bsd/mac/notwindows 64-bit system (and seriously, who isn't these days?), mmap performs extremely well. It essentially lets you map a whole file into a process's address space and lets the kernel handle caching/paging for you.
And if you MUST use windows, here's the same concept, but made by the friendly folks at Redmond. Note that for either of these, you will want to be running on a 64-bit system as the ABSOLUTE largest file you can map on a 32-bit system is ~4GB.
Define the macro -D_FILE_OFFSET_BITS=64 on the compile line, or #define _FILE_OFFSET_BITS 64 before any system include, for all relevant sources (preferably the entire project). This common macro is provided automatically by several common build systems. Then use off_t (which will be 64-bit now) wherever the API requires it.
In addition to RBerteig's and Matt's answers:

If you enable the 64-bit I/O support correctly and carefully for all the files in your project (for which the methods are system dependent), you don't have to worry about integer overflow if you use the correct types, I think. off_t should then be the correct choice to position your file pointer.

If all else fails, go with the exact-width C99 types if you make assumptions about the width of the type. Using int or long is almost always the wrong thing to do; they are too compiler/platform dependent. Use int64_t (or int_fast64_t if you don't have that).
Depending on the Chomsky level of the format, there may be several free and commercial toolkits to create parsers for it. But I think the real problem you have is how to 'handle' several GBs of data.

Do you want all of the data in memory simultaneously?

One way is to write parts of the file out to disk in temporary files when they are not in use. A simple fread()/fwrite() of structs, plus some clever ref-counted 'on demand' loading and writing, can do this.