Processing large files (1GB+) in C++ by reading them into memory. The baseline used std::ifstream::read() into a std::vector<char>, which required allocating a buffer equal to the file size, waiting for the entire read to complete before processing, and doubling memory usage if the file content needed to persist alongside processed results.
Using mmap() (POSIX) or CreateFileMapping (Windows) to map the file directly into the process's virtual address space. The OS loads pages on demand as they're accessed, and the kernel's page cache serves as a shared buffer — no explicit allocation or copying needed.
Key advantages:
- Zero-copy access: File data is accessed directly from the kernel page cache. No
read()syscall per chunk, no userspace buffer. - Lazy loading: Only pages actually touched are loaded from disk. For sparse access patterns (e.g., searching a large file), this avoids loading irrelevant sections.
- Memory efficiency: Multiple processes mapping the same file share physical pages. The OS can evict pages under memory pressure without the application managing a cache.
For sequential processing of a 2GB CSV file, mmap was 1.4x faster than fread with 64KB buffers and used 50% less resident memory (RSS) because untouched pages were never loaded.
- mmap for random small reads on HDD: On spinning disks, mmap's page-fault-driven I/O generates random seeks per fault. Explicit
read()withposix_fadvise(POSIX_FADV_RANDOM)performed better by issuing larger batched reads. - mmap on 32-bit systems: The 4GB virtual address limit makes mapping large files impossible. Use
read()with streaming.
#include <sys/mman.h>
#include <fcntl.h>
int fd = open("large_file.csv", O_RDONLY);
struct stat st;
fstat(fd, &st);
// Map entire file; OS handles paging
const char* data = static_cast<const char*>(
mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
madvise((void*)data, st.st_size, MADV_SEQUENTIAL); // hint for readahead
// Process directly — no buffer allocation
for (size_t i = 0; i < st.st_size; i++) {
if (data[i] == '\n') { /* process line */ }
}
munmap((void*)data, st.st_size);
close(fd);POSIX systems (Linux, macOS). On Linux, madvise hints (MADV_SEQUENTIAL, MADV_WILLNEED, MADV_HUGEPAGE) significantly affect performance. Tested on Linux 6.1, ext4 filesystem, NVMe SSD.