I want to read a file, but it is too big to load completely into memory.
Is there a way to read it without loading it into memory? Or is there a better solution?
One way to do this, if the problem is RAM rather than virtual address space, is memory-mapping the file, either via mmap on POSIX systems, or CreateFileMapping/MapViewOfFile on Windows.

That can get you what looks like a raw array of the file bytes, but with the OS responsible for paging the contents in (and writing them back to disk if you alter them) as you go. When mapped read-only, it's quite similar to just malloc-ing a block of memory and fread-ing to populate it, except that under memory pressure the OS can simply drop clean pages of the mapping and re-read them from the file later, whereas to reclaim malloc-ed memory it has to write it out to swap, increasing disk traffic at a time when you're likely oversubscribed on the disk already.

Performance-wise, this can be slightly slower under default settings (since, without memory pressure, reading the whole file in mostly guarantees it will be in memory when asked for, while random access to a memory-mapped file is likely to trigger on-demand page faults to populate each page on first access), though you can use posix_madvise with POSIX_MADV_WILLNEED (POSIX systems) or PrefetchVirtualMemory (Windows 8 and higher) to provide a hint that the entire file will be needed, causing the system to (usually) page it in in the background, even as you're accessing it. On POSIX systems, other advice hints can be used for more granular hinting when paging the whole file in at once isn't necessary (or possible); e.g. using POSIX_MADV_SEQUENTIAL if you're reading the file data in order from beginning to end usually triggers more aggressive prefetch of subsequent pages, increasing the odds that they're in memory by the time you get to them. By doing so, you get the best of both worlds: you can begin accessing the data almost immediately, with a delay on accessing pages not yet paged in, but the OS will be pre-loading the pages for you in the background, so you eventually run at full speed (while still being more resilient to memory pressure, since the OS can just drop clean pages rather than writing them to swap first).

The main limitation here is virtual address space. If you're on a 32-bit system, you're likely limited to (depending on how fragmented the existing address space is) 1-3 GB of contiguous address space, which means you'd have to map the file in chunks and can't have on-demand random access to any point in the file at any time without additional system calls. Thankfully, on 64-bit systems this limitation rarely comes up; even the most limiting 64-bit systems (Windows 7) provide 8 TB of user virtual address space per process, far larger than the vast majority of files you're likely to encounter (and later versions increase the cap to 128 TB).
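A minimal sketch of the POSIX variant is shown below, assuming a 64-bit system so the whole file fits in the address space; the file name handling and the newline-counting loop are just illustrative, and a Windows version would use CreateFileMapping/MapViewOfFile (plus PrefetchVirtualMemory for the hint) instead:

```c
/* Sketch: map a file read-only and hint sequential access (POSIX only).
 * Error handling is minimal; counting newlines is just an example use. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* Map the whole file; assumes it fits in the address space (64-bit). */
    unsigned char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Hint that we will read sequentially, so the kernel prefetches ahead. */
    posix_madvise(data, st.st_size, POSIX_MADV_SEQUENTIAL);

    /* Example use: count newlines without ever reading the whole file in. */
    long lines = 0;
    for (off_t i = 0; i < st.st_size; i++)
        if (data[i] == '\n')
            lines++;
    printf("%ld lines\n", lines);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```

Because the mapping is read-only, its pages stay clean, so under memory pressure the kernel can simply evict them and fault them back in from the file later instead of writing them to swap.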
Many checksum libraries support incremental updates to the checksum. For example, GLib has g_checksum_update(). So you can read the file a block at a time with fread and update the checksum as you read, and you can check that it works by comparing the result with sha256sum.
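A sketch of that approach, assuming GLib 2.x and an arbitrary 64 KiB buffer size, might look like this:

```c
/* Sketch: chunked SHA-256 with GLib's GChecksum.
 * Build with: gcc sha.c $(pkg-config --cflags --libs glib-2.0) */
#include <glib.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2) { g_printerr("usage: %s FILE\n", argv[0]); return 1; }

    FILE *fp = fopen(argv[1], "rb");
    if (!fp) { perror("fopen"); return 1; }

    GChecksum *sum = g_checksum_new(G_CHECKSUM_SHA256);
    guchar buf[64 * 1024];                    /* read 64 KiB at a time */
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        g_checksum_update(sum, buf, n);       /* feed each chunk to the hash */

    printf("%s  %s\n", g_checksum_get_string(sum), argv[1]);

    g_checksum_free(sum);
    fclose(fp);
    return 0;
}
```

Running this on a file and running sha256sum on the same file should print the same digest.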
Be aware that, in practice, files are an abstraction (so somehow an illusion) provided by your operating system through file systems. Read Operating Systems: Three Easy Pieces (freely downloadable) to learn more about operating systems. Files can be quite big (even if most of them are small), e.g. many dozens of gigabytes on current laptops or desktops (and many terabytes on servers, and perhaps more).

You don't define what "memory" is, and the C11 standard n1570 uses that word in a different way, speaking of memory locations in §3.14 and of memory management functions in §7.22.3...

In practice, a process has its virtual address space, related to virtual memory.

On many operating systems, notably Linux and other POSIX systems, you can change that virtual address space with mmap(2) and related system calls, and you could use memory-mapped files.
Of course, you can read and write partial chunks of a file (e.g. using fread, fwrite, and fseek, or the lower-level system calls read(2), write(2), lseek(2), ...). For performance reasons, prefer large buffers (at least several kilobytes); a sketch of such a chunked read is given below. In practice, most checksums (and cryptographic hash functions) can be computed chunkwise, over a very long stream of data.
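For illustration, here is a small sketch that reads one 64 KiB chunk of a large file at an arbitrary offset with fseek and fread; the file name, offset, and buffer size are made up for the example:

```c
/* Sketch: read a 64 KiB chunk of a large file at an arbitrary offset.
 * "huge.dat", the offset, and the buffer size are illustrative only. */
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("huge.dat", "rb");
    if (!fp) { perror("fopen"); return 1; }

    static char buf[64 * 1024];               /* 64 KiB buffer */
    long offset = 1024L * 1024L * 1024L;      /* start reading at the 1 GiB mark */

    /* For offsets larger than a long can hold, use fseeko/off_t (POSIX)
       or _fseeki64 (Windows) instead of fseek. */
    if (fseek(fp, offset, SEEK_SET) != 0) { perror("fseek"); fclose(fp); return 1; }

    size_t n = fread(buf, 1, sizeof buf, fp);
    printf("read %zu bytes at offset %ld\n", n, offset);

    fclose(fp);
    return 0;
}
```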
Many libraries are built on top of such primitives (doing direct I/O by chunks). For example, the sqlite database library is able to handle database files of many terabytes (more than the available RAM). And you could use an RDBMS (they are software written in C or C++).

So of course you can deal with files larger than the available RAM and read or write them by chunks (or "records"), and this has been true since at least the 1960s. I would even say that, intuitively, files can (usually) be much larger than RAM, but smaller than a single disk (however, even this is not always true; some file systems are able to span several physical disks, e.g. using LVM techniques).

(On my Linux desktop with 32 GB of RAM, the largest file is 69 GB, on an ext4 filesystem with 669 GB available out of 780 GB total, and I have had files above 100 GB in the past.)
You might find it worthwhile to use some database like sqlite (or be a client of some RDBMS like PostgreSQL, etc.), or you could be interested in libraries for indexed files like gdbm. Of course, you can also do direct I/O operations (e.g. fseek then fread or fwrite, lseek then read or write, or pread(2) and pwrite(2) ...); a small sketch using pread follows.
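As an illustration, here is a minimal sketch of a positioned read with pread(2) (POSIX); unlike lseek followed by read, it does not move the file offset, which is convenient when several threads read from the same descriptor. The file name and offset below are made up:

```c
/* Sketch: positioned read with pread(2) on POSIX systems.
 * "huge.dat" and the offset are illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("huge.dat", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char buf[8192];
    off_t offset = 1024 * 1024;    /* read 8 KiB starting at the 1 MiB mark */

    ssize_t n = pread(fd, buf, sizeof buf, offset);
    if (n < 0) { perror("pread"); close(fd); return 1; }
    printf("read %zd bytes at offset %lld\n", n, (long long)offset);

    close(fd);
    return 0;
}
```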