Skip to content

Latest commit

 

History

History
49 lines (33 loc) · 2.46 KB

File metadata and controls

49 lines (33 loc) · 2.46 KB

Memory-Mapped I/O for Large File Processing

Problem

Processing large files (1GB+) in C++ by reading them into memory. The baseline used std::ifstream::read() into a std::vector<char>, which required allocating a buffer equal to the file size, waiting for the entire read to complete before processing, and doubling memory usage if the file content needed to persist alongside processed results.

What Worked

Using mmap() (POSIX) or CreateFileMapping (Windows) to map the file directly into the process's virtual address space. The OS loads pages on demand as they're accessed, and the kernel's page cache serves as a shared buffer — no explicit allocation or copying needed.

Key advantages:

  1. Zero-copy access: File data is accessed directly from the kernel page cache. No read() syscall per chunk, no userspace buffer.
  2. Lazy loading: Only pages actually touched are loaded from disk. For sparse access patterns (e.g., searching a large file), this avoids loading irrelevant sections.
  3. Memory efficiency: Multiple processes mapping the same file share physical pages. The OS can evict pages under memory pressure without the application managing a cache.

For sequential processing of a 2GB CSV file, mmap was 1.4x faster than fread with 64KB buffers and used 50% less resident memory (RSS) because untouched pages were never loaded.

What Didn't Work

  • mmap for random small reads on HDD: On spinning disks, mmap's page-fault-driven I/O generates random seeks per fault. Explicit read() with posix_fadvise(POSIX_FADV_RANDOM) performed better by issuing larger batched reads.
  • mmap on 32-bit systems: The 4GB virtual address limit makes mapping large files impossible. Use read() with streaming.

Code Example

#include <sys/mman.h>
#include <fcntl.h>

int fd = open("large_file.csv", O_RDONLY);
struct stat st;
fstat(fd, &st);

// Map entire file; OS handles paging
const char* data = static_cast<const char*>(
    mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
madvise((void*)data, st.st_size, MADV_SEQUENTIAL); // hint for readahead

// Process directly — no buffer allocation
for (size_t i = 0; i < st.st_size; i++) {
    if (data[i] == '\n') { /* process line */ }
}

munmap((void*)data, st.st_size);
close(fd);

Environment

POSIX systems (Linux, macOS). On Linux, madvise hints (MADV_SEQUENTIAL, MADV_WILLNEED, MADV_HUGEPAGE) significantly affect performance. Tested on Linux 6.1, ext4 filesystem, NVMe SSD.