How to read a huge file into a buffer

Posted 2019-06-01 19:21

Question:

I have some code to read a file:

FILE* file = fopen(fileName.c_str(), "r");
assert(file != NULL);
const size_t BUF_SIZE = 10 * 1024 * 1024;
char* buf = new char[BUF_SIZE];
string contents;
size_t n;
while ((n = fread(buf, 1, BUF_SIZE, file)) > 0)  // fread() returns the number of bytes read
{
    contents.append(buf, n);  // append exactly the bytes read; buf is not NUL-terminated
}
assert(!ferror(file));

I know the size of the file in advance, so I allocate a buffer to store the file's contents in this line:

char* buf = new char[BUF_SIZE];

If the file I need to read is very large, for example several GB, it's impossible to allocate that much memory to store its contents. And in other cases I don't know in advance how much data I will need to read. What should I do?

Answer 1:

First, something you should know is that there are already layers of buffering in the C runtime and often the OS under that. If you're adding yet another buffering layer for no reason, you're probably just slowing things down.
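
As a minimal sketch of that point (the file name and the 1 MB size here are my own illustration): stdio already buffers reads internally, and setvbuf() lets you resize that built-in buffer instead of stacking another one of your own on top:

#include <cstdio>

int main()
{
    FILE* file = fopen("input.txt", "r");              // hypothetical file name
    static char ioBuffer[1 << 20];                     // 1 MB buffer handed to stdio
    setvbuf(file, ioBuffer, _IOFBF, sizeof ioBuffer);  // must precede any I/O on the stream

    // Even character-at-a-time reads now refill from the 1 MB buffer.
    int c;
    long bytes = 0;
    while ((c = fgetc(file)) != EOF)
        bytes++;
    fclose(file);
    printf("%ld bytes\n", bytes);
}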

(You might find my article "where the printf rubber meets the road" illustrative just in the sense of a case of seeing what some of glibc looks like under the hood, for instance.)

Second: don't synchronously read gigantic files into contiguous blocks of memory like that. Sometimes it's fine in quick-and-dirty code that you're going to run once and throw away, but it's not a technique to use in real programs that you might subject others to, who may have input of arbitrary size.
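
For contrast, a minimal sketch of the streaming alternative (processChunk() is a hypothetical consumer of my own invention): read a fixed-size chunk, handle it, and move on, so memory use stays bounded no matter how big the file is.

#include <cstdio>

void processChunk(const char* data, size_t length);   // hypothetical consumer

void streamFile(const char* path)
{
    FILE* file = fopen(path, "rb");
    char buf[64 * 1024];                              // fixed 64 KB working buffer
    size_t got;
    while ((got = fread(buf, 1, sizeof buf, file)) > 0)
        processChunk(buf, got);                       // only this chunk is in memory
    fclose(file);
}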

If there's truly no way to process the input without having access to all of it, and you want to treat a file of arbitrary size as if it were fully loaded, then you can learn about Memory Mapped Files. This gets you some help from the operating system.

But if every time you run your program you have to read a large file whose contents you have unpredictable needs for, that does sound like a job for a database. Instead of talking to the data with fread()/fwrite() calls, get it loaded into a database and talk to it in terms of queries and updates, where other people have already addressed much of the complexity.
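
As one concrete sketch of that idea (SQLite is my choice here; the answer names no particular database, and the file name and table layout are hypothetical): load the data once, then let the query engine do the scanning.

#include <sqlite3.h>
#include <cstdio>

int main()
{
    sqlite3* db;
    sqlite3_open("records.db", &db);   // hypothetical database file

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS records(id INTEGER PRIMARY KEY, line TEXT);",
        NULL, NULL, NULL);

    // ... bulk-load the file's contents once (INSERTs inside a transaction) ...

    // Later runs only touch the rows they actually need:
    sqlite3_stmt* stmt;
    sqlite3_prepare_v2(db, "SELECT line FROM records WHERE id = ?;", -1, &stmt, NULL);
    sqlite3_bind_int(stmt, 1, 42);
    if (sqlite3_step(stmt) == SQLITE_ROW)
        printf("%s\n", (const char*)sqlite3_column_text(stmt, 0));
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}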



Answer 2:

The easiest way to handle huge file input, is to map the file into memory. This will look as if you had loaded the entire file into a single huge buffer, but it does not require the OS to actually hold all this data in memory at once - the data can be read in a lazy fashion, and the OS is free to just reuse memory pages from the mapping without even swapping data back to disk.

Under Linux, the call is mmap(); there is something similar under Windows, but I don't know what it's called. The mmap() function is used like this:

#include <fcntl.h>      // open()
#include <sys/mman.h>   // mmap(), munmap()
#include <unistd.h>     // lseek(), close()

int file = open(path, O_RDONLY);                // Open the file.
off_t fileLength = lseek(file, 0, SEEK_END);    // Get its size.

// Map its contents into memory. mmap() returns void*, so cast it in C++.
const char* contents = static_cast<const char*>(
    mmap(NULL, fileLength, PROT_READ, MAP_SHARED, file, 0));

close(file);    // The file can be closed right away; the mapping is not affected.

Inspect the file in any way you want, for example by counting lines:

off_t lineCount = 0;
for (off_t i = 0; i < fileLength; i++)
    if (contents[i] == '\n')
        lineCount++;

Finally, you should clean up the mapping using

munmap(const_cast<char*>(contents), fileLength);

I have left out error handling to avoid obfuscating the code, but of course, you will need to handle any errors produced by any of these calls.
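
For completeness, a minimal sketch of the checks being alluded to, continuing from the snippet above (each call signals failure through its return value and errno):

int file = open(path, O_RDONLY);
if (file < 0) { perror("open"); return; }

off_t fileLength = lseek(file, 0, SEEK_END);
if (fileLength == (off_t)-1) { perror("lseek"); close(file); return; }

void* mapping = mmap(NULL, fileLength, PROT_READ, MAP_SHARED, file, 0);
if (mapping == MAP_FAILED) { perror("mmap"); close(file); return; }
const char* contents = static_cast<const char*>(mapping);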


Of course, mmap()ing a file is most advantageous on a 64-bit OS: the size of the mapping is limited by the size of the virtual address space. Consequently, you won't be able to mmap() a 5 GB file in one piece on a 32-bit OS, which is no problem on a 64-bit OS.
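
On a 32-bit system you can still process such a file by mapping one window at a time, something like this sketch (the window size and processWindow() are my own illustration; the mmap() offset must stay a multiple of the page size, which a power-of-two window size guarantees):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <algorithm>

void processWindow(const char* data, size_t length);   // hypothetical consumer

void scanFile(const char* path)
{
    int file = open(path, O_RDONLY);
    off_t fileLength = lseek(file, 0, SEEK_END);
    const off_t windowSize = 256 * 1024 * 1024;        // 256 MB, a multiple of the page size

    for (off_t offset = 0; offset < fileLength; offset += windowSize)
    {
        size_t length = (size_t)std::min(windowSize, fileLength - offset);
        void* window = mmap(NULL, length, PROT_READ, MAP_SHARED, file, offset);
        processWindow((const char*)window, length);    // only this window is mapped
        munmap(window, length);
    }
    close(file);
}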



Tags: c++ file bigdata