update design doc regarding block manager

guycipher · guycipher · commit 71833f1a0f74 · 2026-04-20T22:04:09.000-04:00
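
Reviewer sketch (not part of the commit): the paragraph added below describes a lock-free extension step in which a CAS on `preallocated_size` claims the right to grow the file by one chunk. The following is a minimal, in-memory illustration of that idea using C11 atomics. The names `bm_sketch_t`, `sketch_reserve`, `sketch_extend_file`, and `PREALLOC_CHUNK` are hypothetical and are not TidesDB's actual API. The real extension call (`tdb_preallocate_extent()` per the doc) is replaced by a counter so the logic can be exercised without touching a file.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed chunk size for this sketch; the doc states 64 MB chunks. */
#define PREALLOC_CHUNK (64ULL * 1024 * 1024)

/* Hypothetical, simplified state; not the real block manager struct. */
typedef struct {
    _Atomic uint64_t write_offset;      /* next free byte, handed out atomically */
    _Atomic uint64_t preallocated_size; /* size the file has been claimed up to */
    _Atomic uint64_t extend_calls;      /* counts simulated extension calls */
} bm_sketch_t;

/* Stand-in for the platform extension call (fallocate and friends).
 * Here it only counts invocations. */
static void sketch_extend_file(bm_sketch_t *bm, uint64_t new_size) {
    (void)new_size;
    atomic_fetch_add(&bm->extend_calls, 1);
}

/* Reserve `len` bytes and return the start offset of the reservation.
 * If the reservation ends past the claimed size, try to claim the next
 * chunk boundary with a CAS; only the winner performs the extension. */
static uint64_t sketch_reserve(bm_sketch_t *bm, uint64_t len) {
    assert(len > 0);
    uint64_t start = atomic_fetch_add(&bm->write_offset, len);
    uint64_t end = start + len;
    uint64_t cur = atomic_load(&bm->preallocated_size);

    while (end > cur) {
        /* Round the needed size up to a whole number of chunks. */
        uint64_t target =
            ((end + PREALLOC_CHUNK - 1) / PREALLOC_CHUNK) * PREALLOC_CHUNK;
        if (atomic_compare_exchange_weak(&bm->preallocated_size, &cur, target)) {
            sketch_extend_file(bm, target);
            break;
        }
        /* A failed CAS reloads `cur`; the loop re-checks whether another
         * thread has already claimed enough space. */
    }
    return start;
}
```

In this sketch the new size is published before the extension call runs, so a concurrent writer may briefly write past the real EOF. That is still correct, only slower for that one write, and whether the real implementation orders these steps the same way is an assumption here.
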
diff --git a/src/content/docs/getting-started/how-does-tidesdb-work.md b/src/content/docs/getting-started/how-does-tidesdb-work.md
@@ -680,6 +680,8 @@ The block manager provides a lock-free, append-only file abstraction with atomic
 
 Writers use `pread`/`pwrite` for position-independent I/O, allowing concurrent reads and writes without locks. Block writes use `pwritev` to combine the header, data, and footer into a single scatter-gather syscall (3 syscalls down to 1), improving sequential and parallel write throughput by 2 to 2.5x. These POSIX functions are abstracted through `compat.h` for cross-platform support (Windows uses `ReadFile`/`WriteFile` with `OVERLAPPED` structures). The file size is tracked atomically in memory to avoid syscalls. Blocks use atomic reference counting: callers must call `block_manager_block_release()` when done, and blocks free when refcount reaches zero. Durability operations use `fdatasync` (also abstracted via `compat.h`).
 
+Concurrent writers face a second contention point inside the kernel: filesystems take a per-inode write lock on any operation that advances the file's logical EOF (`i_rwsem` on Linux ext4, the vnode write lock on macOS APFS, the file-extension lock on NTFS). With many threads appending to the same block manager, this lock serializes them regardless of how cleanly the userland code hands out disjoint offsets. To keep the kernel out of the way, the block manager preallocates the file in 64 MB chunks (`BLOCK_MANAGER_PREALLOC_CHUNK`) ahead of writes via a cross-platform `tdb_preallocate_extent()` helper in `compat.h` (`fallocate` on Linux, `F_PREALLOCATE` plus `ftruncate` on macOS, `FILE_ALLOCATION_INFO` plus `FILE_END_OF_FILE_INFO` on Windows, `posix_fallocate` elsewhere). Subsequent `pwrite` calls land within the already-extended EOF and take only the cheaper read path on the inode lock. The extension itself is lock-free: a CAS on `preallocated_size` claims the right to extend, and racing claimants at worst issue a redundant idempotent `fallocate`. On clean close, `block_manager_close` truncates the file back to the actual data extent, so the trailing zeros exist only while the file is open. After a crash, `block_manager_validate_last_block` distinguishes a preallocation tail (all-zero suffix, legitimate) from real corruption (non-zero garbage past the data) by combining a forward scan with a trailing-zero check: strict mode accepts the former and rejects the latter, while permissive mode truncates either. This yields 1.6 to 2x higher multi-writer throughput on small-block workloads (the WAL case) at the cost of bounded space amplification (one preallocation chunk per open file at runtime, zero at rest).
+
 Block reads use two `pread` syscalls: one for the 8-byte header (size plus checksum) and one for the data payload directly into the final allocation, avoiding intermediate buffer copies. The fused `block_manager_cursor_read_and_advance()` operation combines read and cursor advance into a single call, using the block size from the just-read block to compute the next position without a redundant `pread`. Cursors also cache block sizes from previous operations, allowing `cursor_read_partial()` to skip the size lookup when the cache is valid. These optimizations reduce syscall overhead on the hot read path.
 
 Block manager cursors enable sequential and random access. Cursors maintain current position and can move forward, backward, or jump to specific offsets. The `cursor_read_partial()` operation reads only the first N bytes of a block, useful for reading headers without loading large values.
@@ -688,7 +690,7 @@ The system supports strict and permissive validation modes. WAL files use permis
 
 The block format provides layered protection against silent data corruption, whether from media degradation, controller firmware bugs, or bit flips in the storage path. Each block stores an xxHash32 checksum computed over the data payload at write time. On every read, the checksum is recomputed and compared against the stored value; any mismatch causes the read to fail immediately rather than return corrupt data. The block size field is stored twice, once in the header and once in the footer, so a single-bit corruption in either copy can be detected by cross-validation during backward cursor traversal. The footer magic number (0x42445442, "BTDB") acts as a high-entropy sentinel: random corruption is unlikely to produce it, so its absence reliably identifies torn writes and partial flushes. During recovery, permissive validation uses this structure to walk forward through WAL blocks, accepting blocks whose footer magic and header/footer size agree, and truncating at the first inconsistency. Forward cursor operations that encounter a checksum failure can call `block_manager_cursor_skip_corrupt()`, which distinguishes partial writes (footer magic absent, block extent known from the size field) from genuine corruption (footer magic present but data checksum fails), advancing past the former and rejecting the latter. The combination of per-block checksums, redundant size fields, and magic sentinels means that any single-point corruption, whether it hits the data, the metadata, or the framing, is detected before it can propagate to the application layer. SSTables inherit this protection directly since their klog and vlog files are block manager files. WAL files add an additional layer: entries are deserialized with bounds checking on every varint and field offset, so a corrupt WAL entry that passes the block checksum (for example, valid bytes rearranged by a controller bug) still fails deserialization rather than silently loading garbage into the memtable.
 
-TidesDB uses block managers for all persistent storage: WAL files, klog files, and vlog files. The atomic offset allocation enables concurrent flush and compaction workers to write to different files simultaneously. The reference counting prevents use-after-free when multiple readers access the same SSTable.
+TidesDB uses block managers for all persistent storage: WAL files, klog files, and vlog files. The atomic offset allocation combined with file preallocation enables concurrent flush and compaction workers to write to different files simultaneously, and lets multiple writers share a single file (notably the WAL) without serializing on the kernel's per-inode write lock. The reference counting prevents use-after-free when multiple readers access the same SSTable.
 
 ### Bloom Filter