Is it possible to read a file without loading it into memory?

Posted 2019-08-04 17:22

I want to read a file but it is too big to load it completely into memory.

Is there a way to read it without loading it into memory? Or is there a better solution?

Tags: c, file, memory
3 Answers
姐就是有狂的资本
#2 · 2019-08-04 17:57

One way to do this, if the problem is RAM, not virtual address space, is memory mapping the file, either via mmap on POSIX systems, or CreateFileMapping/MapViewOfFile on Windows.

That can get you what looks like a raw array of the file bytes, but with the OS responsible for paging the contents in (and writing them back to disk if you alter them) as you go. When mapped read-only, it's quite similar to just malloc-ing a block of memory and fread-ing to populate it, but:

  1. It's lazy: for a 1 GB file, you're not waiting 5-30 seconds for the whole thing to be read in before you can work with any part of it; instead, you pay only for each page on access (and sometimes the OS will pre-read in the background, so you don't even have to wait on the per-page load).
  2. It responds better under memory pressure; if you run out of memory, the OS can just drop clean pages from memory without writing them to swap, knowing it can page them back in from the golden copy in the file whenever they're needed; with malloc-ed memory, it has to write it out to swap, increasing disk traffic at a time when you're likely oversubscribed on the disk already

Performance-wise, this can be slightly slower under default settings (since, without memory pressure, reading the whole file up front mostly guarantees it will be in memory when asked for, while random access to a memory-mapped file is likely to trigger on-demand page faults to populate each page on first access). However, you can use posix_madvise with POSIX_MADV_WILLNEED (POSIX systems) or PrefetchVirtualMemory (Windows 8 and higher) to provide a hint that the entire file will be needed, causing the system to (usually) page it in in the background, even as you're accessing it. On POSIX systems, other advice values can be used for more granular hinting when paging the whole file in at once isn't necessary (or possible); e.g., POSIX_MADV_SEQUENTIAL, if you're reading the file data in order from beginning to end, usually triggers more aggressive prefetch of subsequent pages, increasing the odds that they're in memory by the time you get to them.

By doing so, you get the best of both worlds: you can begin accessing the data almost immediately, with a delay only on accessing pages not yet paged in, while the OS pre-loads the pages for you in the background, so you eventually run at full speed (and you remain more resilient to memory pressure, since the OS can simply drop clean pages rather than writing them to swap first).

The main limitation here is virtual address space. If you're on a 32 bit system, you're likely limited to (depending on how fragmented the existing address space is) 1-3 GB of contiguous address space, which means you'd have to map the file in chunks, and can't have on-demand random access to any point in the file at any time without additional system calls. Thankfully, on 64 bit systems, this limitation rarely comes up; even the most limiting 64 bit systems (Windows 7) provide 8 TB of user virtual address space per process, far larger than the vast, vast majority of files you're likely to encounter (and later versions increase the cap to 128 TB).
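
As a rough illustration (a minimal sketch, not code from this answer; the file name test.txt and the byte-summing loop are made-up placeholders), a read-only mapping with a sequential-access hint on a POSIX system might look like this:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *filename = "test.txt";  // placeholder file name

    int fd = open(filename, O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); return 1; }
    if (st.st_size == 0) { close(fd); return 0; }  // mmap rejects zero-length mappings

    // Map the whole file read-only; the OS pages it in on demand.
    unsigned char *data = mmap(NULL, (size_t)st.st_size, PROT_READ,
                               MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);  // the mapping remains valid after the descriptor is closed

    // Hint that we'll read front to back, so the OS prefetches ahead.
    posix_madvise(data, (size_t)st.st_size, POSIX_MADV_SEQUENTIAL);

    // Use the file as if it were an in-memory array.
    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += data[i];
    printf("sum of bytes: %lu\n", sum);

    munmap(data, (size_t)st.st_size);
    return 0;
}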

一夜七次
#3 · 2019-08-04 18:03

I need the content to do a checksum, so I need the complete message

Many checksum libraries support incremental updates to the checksum. For example, GLib has g_checksum_update(). So you can read the file one block at a time with fread and update the checksum as you read.

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <stdlib.h>
#include <glib.h>

int main(void) {
    char filename[] = "test.txt";

    // Create a SHA256 checksum
    GChecksum *sum = g_checksum_new(G_CHECKSUM_SHA256);
    if( sum == NULL ) {
        fprintf(stderr, "Could not create checksum.\n");
        exit(1);
    }

    // Open the file we'll be checksumming.
    FILE *fp = fopen( filename, "rb" );
    if( fp == NULL ) {
        fprintf(stderr, "Could not open %s: %s.\n", filename, strerror(errno));
        exit(1);
    }

    // Read one buffer full at a time (BUFSIZ is from stdio.h)
    // and update the checksum.    
    unsigned char buf[BUFSIZ];
    size_t size_read = 0;
    while( (size_read = fread(buf, 1, sizeof(buf), fp)) != 0 ) {
        // Update the checksum
        g_checksum_update(sum, buf, (gssize)size_read);
    }
    fclose(fp);

    // Print the checksum.
    printf("%s %s\n", g_checksum_get_string(sum), filename);

    g_checksum_free(sum);
    return 0;
}
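
Assuming GLib's development files are installed, the program can be built with pkg-config supplying the compiler and linker flags (the source file name test.c is an assumption):

$ gcc test.c $(pkg-config --cflags --libs glib-2.0) -o test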

We can check that it works by comparing the result with sha256sum.

$ ./test
0c46af5bce717d706cc44e8c60dde57dbc13ad8106a8e056122a39175e2caef8 test.txt
$ sha256sum test.txt 
0c46af5bce717d706cc44e8c60dde57dbc13ad8106a8e056122a39175e2caef8  test.txt
你好瞎i
#4 · 2019-08-04 18:17

I want to read a file but it is too big to load it completely into memory.

Be aware that, in practice, files are an abstraction (and so, in some sense, an illusion) provided by your operating system through file systems. Read Operating Systems: Three Easy Pieces (freely downloadable) to learn more about OSes. Files can be quite big (even if most of them are small), e.g. many dozens of gigabytes on current laptops or desktops (and many terabytes on servers, perhaps more).

You don't define what you mean by memory, and the C11 standard n1570 uses that word in a different way, speaking of memory locations in §3.14 and of memory management functions in §7.22.3...

In practice, a process has its virtual address space, related to virtual memory.

On many operating systems (notably Linux and other POSIX systems) you can change the virtual address space with mmap(2) and related system calls, and you could use memory-mapped files.

Is there a way to read it without loading it into memory?

Of course, you can read and write partial chunks of a file (e.g. using fread, fwrite, fseek, or the lower-level system calls read(2), write(2), lseek(2), ...). For performance reasons, it is better to use large buffers (several kilobytes at least). In practice, most checksums (and cryptographic hash functions) can be computed chunkwise, over a very long stream of data.

Many libraries are built on top of such primitives (doing direct I/O by chunks). For example, the sqlite database library is able to handle database files of many terabytes (more than the available RAM). And you could use an RDBMS (they are programs coded in C or C++).

So of course you can deal with files larger than available RAM and read or write them by chunks (or "records"), and this has been true since at least the 1960s. I would even say that intuitively, files can (usually) be much larger than RAM, but smaller than a single disk (however, even this is not always true; some file systems are able to span several physical disks, e.g. using LVM techniques).

(On my Linux desktop with 32 GB of RAM, the largest file is 69 GB, on an ext4 filesystem with 669 GB available out of 780 GB total, and in the past I have had files above 100 GB.)

You might find it worthwhile to use some database like sqlite (or to be a client of some RDBMS like PostgreSQL, etc.), or you could be interested in libraries for indexed files like gdbm. Of course you can also do direct I/O operations (e.g. fseek then fread or fwrite, or lseek then read or write, or pread(2) or pwrite(2) ...).
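
As a small sketch of that last option (the file name test.txt and the 8192-byte offset are both made up for illustration), pread reads a chunk at an explicit offset without moving any file position:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("test.txt", O_RDONLY);  // placeholder file name
    if (fd == -1) { perror("open"); return 1; }

    // Read up to 4 KiB starting at byte offset 8192. pread takes the
    // offset as an argument, so no lseek is needed, and concurrent
    // threads can safely read different chunks of the same descriptor.
    char buf[4096];
    ssize_t n = pread(fd, buf, sizeof(buf), 8192);
    if (n == -1) { perror("pread"); close(fd); return 1; }

    printf("read %zd bytes at offset 8192\n", n);
    close(fd);
    return 0;
}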
