# FlowDB Page Manager Design

## Core Responsibilities
1. Track pin counts for allocated pages to prevent premature eviction from buffer pool
2. Initialize pin count to 0 when a new page is allocated by the disk manager
3. Provide thread-safe pin/unpin operations with overflow/underflow protection
4. Do NOT manage free lists or disk allocation (handled by disk manager)
5. Do NOT track dirty state (handled by buffer pool)

## Data Structures
### Page Descriptor Table
- `pin_counts: Vec<AtomicU32>` - Vector of atomic pin counters indexed by page number
- `length: AtomicUsize` - Current valid length of pin_counts vector
- `resize_mutex: Mutex<()>` - Mutex for vector resizing operations

### Per-Page State (stored in pin_counts[page_no])
- `pin_count: u32` - Number of times page is currently pinned (0 = not pinned)
  - Range: 0 to u32::MAX-1 (u32::MAX reserved for error detection)
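
As a sketch, the descriptor table above could be declared as follows; `PageManager` and `with_capacity` are illustrative names, not part of the codebase:

```rust
use std::sync::Mutex;
use std::sync::atomic::{AtomicU32, AtomicUsize};

// Illustrative container for the fields described above.
struct PageManager {
    pin_counts: Vec<AtomicU32>, // one atomic pin counter per page
    length: AtomicUsize,        // valid prefix length of pin_counts
    resize_mutex: Mutex<()>,    // serializes growth of pin_counts
}

impl PageManager {
    // Pre-size the table; every counter starts at 0 (not pinned).
    fn with_capacity(n: usize) -> Self {
        PageManager {
            pin_counts: (0..n).map(|_| AtomicU32::new(0)).collect(),
            length: AtomicUsize::new(n),
            resize_mutex: Mutex::new(()),
        }
    }
}
```

Note that in safe Rust, growing `pin_counts` while other threads read its slots concurrently requires storage whose elements never move; this sketch only shows the fields, not the growth mechanics.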

## Memory Layout Decisions
- Each pin counter: 4 bytes (AtomicU32)
- Vector grows exponentially (doubling) when resizing to minimize amortized cost
- Cache alignment: Each AtomicU32 naturally aligned to 4-byte boundary
- False sharing mitigation: Pages accessed concurrently are likely spaced far apart in vector (page numbers differ significantly)
- No padding between elements - maximizes pin counts per cache line (typically 64 bytes = 16 counters)
- Tradeoff: Vector resizing requires mutex acquisition, potentially blocking concurrent allocations. Chosen because allocations are infrequent (<1% of operations) versus pin/unpin, which are extremely frequent.

## Concurrency Model
### Pin Operation (thread-safe, lock-free for hot path)
1. Read current vector length via `length.load(Acquire)`
2. If page_no < length: proceed to step 4
3. Else:
   - Acquire resize_mutex
   - Recheck length (double-checked locking)
   - If still insufficient: resize vector to max(page_no+1, length*2), new elements initialized to 0
   - Release resize_mutex
4. Load pin counter: `pin_counts[page_no].load(Acquire)`
5. If value == u32::MAX-1: return PinOverflow error (incrementing would produce the reserved sentinel u32::MAX)
6. Attempt CAS: `compare_exchange_weak(old, old+1, Acquire, Relaxed)`
7. On failure: retry from step 4
8. On success: return Ok(())
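
The pin steps above can be sketched as a CAS loop over a single counter slot; the `PinError` type and `pin` function name are illustrative:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

#[derive(Debug, PartialEq)]
enum PinError {
    PinOverflow,
}

// Steps 4-8: lock-free increment with overflow protection.
// `counter` is one slot of pin_counts; u32::MAX stays reserved.
fn pin(counter: &AtomicU32) -> Result<(), PinError> {
    let mut old = counter.load(Ordering::Acquire); // step 4
    loop {
        if old >= u32::MAX - 1 {
            return Err(PinError::PinOverflow); // step 5: would hit sentinel
        }
        // Step 6: publish old+1 only if no other thread raced us.
        match counter.compare_exchange_weak(old, old + 1, Ordering::Acquire, Ordering::Relaxed) {
            Ok(_) => return Ok(()),  // step 8
            Err(cur) => old = cur,   // step 7: retry with the fresh value
        }
    }
}
```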

### Unpin Operation (thread-safe, lock-free)
1. Read current vector length via `length.load(Acquire)`
2. If page_no >= length: return PageNotAllocated error (should not happen if caller validated)
3. Load pin counter: `pin_counts[page_no].load(Acquire)`
4. If value == 0: return NotPinned error
5. Attempt CAS: `compare_exchange_weak(old, old-1, Release, Relaxed)` (Release so the unpinner's page writes happen-before a subsequent eviction)
6. On failure: retry from step 3
7. On success: return Ok(())
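
The unpin loop is the mirror image, guarding against underflow instead of overflow; again, `UnpinError` and `unpin` are illustrative names:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

#[derive(Debug, PartialEq)]
enum UnpinError {
    NotPinned,
}

// Steps 3-7: lock-free decrement with underflow protection.
fn unpin(counter: &AtomicU32) -> Result<(), UnpinError> {
    let mut old = counter.load(Ordering::Acquire); // step 3
    loop {
        if old == 0 {
            return Err(UnpinError::NotPinned); // step 4: never go below 0
        }
        // Step 5: Release so this thread's page writes happen-before
        // any later eviction that observes the dropped pin.
        match counter.compare_exchange_weak(old, old - 1, Ordering::Release, Ordering::Relaxed) {
            Ok(_) => return Ok(()), // step 7
            Err(cur) => old = cur,  // step 6: retry with the fresh value
        }
    }
}
```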

### Vector Resizing (requires mutex)
- Only triggered when accessing page_no >= current length
- Multiple threads may contend for mutex during allocation bursts
- After mutex acquisition, recheck length to avoid redundant resizing
- New elements zero-initialized (valid initial state)
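
A minimal sketch of the double-checked resize, simplified for safe Rust by keeping the vector itself behind the mutex (the document's lock-free slot reads would additionally require storage whose slots never move, e.g. chunked allocation); `Table` and `ensure_capacity` are illustrative names:

```rust
use std::sync::Mutex;
use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering};

struct Table {
    slots: Mutex<Vec<AtomicU32>>, // safe-Rust simplification: Vec lives under the lock
    length: AtomicUsize,          // published valid length, for the fast path
}

impl Table {
    fn ensure_capacity(&self, page_no: usize) {
        // Fast path: current length already covers page_no.
        if page_no < self.length.load(Ordering::Acquire) {
            return;
        }
        let mut slots = self.slots.lock().unwrap(); // acquire resize_mutex
        let cur = self.length.load(Ordering::Acquire); // recheck (double-checked locking)
        if page_no >= cur {
            // Grow to max(page_no + 1, length * 2); new slots zero-initialized.
            let new_len = (page_no + 1).max(cur * 2);
            slots.resize_with(new_len, || AtomicU32::new(0));
            self.length.store(new_len, Ordering::Release);
        }
    }
}
```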

## Failure Scenarios
### Crash During Pin
- **Scenario**: Power failure while a pin is in flight or immediately after it completes
- **Effect**: The pin count was incremented in memory only; pin counts are never persisted
- **Recovery**: On restart, all pin counts reset to 0
- **Correctness**: Safe because no transactions survive a crash; any page pinned at crash time belonged to a now-aborted transaction
- **Tradeoff**: Loses pin count precision but maintains the safety invariant (no false positives for pinned pages)

### Crash During Unpin
- **Scenario**: Power failure while an unpin is in flight or immediately after it completes
- **Effect**: The pin count was decremented in memory only; pin counts are never persisted
- **Recovery**: On restart, all pin counts reset to 0
- **Correctness**: Safe because underflow protection makes over-unpinning impossible; any undercount at crash time belonged to an aborted transaction's cleanup

### Vector Resize Failure
- **Scenario**: OOM during vector resize
- **Effect**: Pin operation returns an error
- **Correctness**: Caller must handle it as an allocation failure (propagate up)
- **Tradeoff**: Fail-stop rather than corrupting state; simpler than trying to preserve partial state

### Pin Count Overflow
- **Scenario**: More than u32::MAX-1 concurrent pins on the same page
- **Effect**: Pin operation returns PinOverflow error
- **Correctness**: Prevents undefined behavior from wraparound
- **Tradeoff**: The limit (~4 billion pins) is arbitrary but practically unlimited; real systems hit other limits first

## Interaction With Other Components
### Buffer Pool -> Page Manager
- **Calls**: `pin_page(page_no)` before reading/modifying a page; `unpin_page(page_no)` after finishing with it
- **Invariants**:
  - Buffer pool must call unpin for every successful pin
  - Buffer pool must not access a page after unpin without re-pinning
  - Page manager guarantees: if pin count > 0, the page cannot be deallocated by the disk manager
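
One way the buffer pool might enforce the unpin-for-every-pin invariant is an RAII guard; `PinGuard` is hypothetical and elides the overflow/underflow checks shown earlier:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical RAII wrapper: dropping the guard performs the unpin,
// so every successful pin is paired with exactly one unpin even on
// early return or panic unwinding.
struct PinGuard<'a> {
    counter: &'a AtomicU32,
}

impl<'a> PinGuard<'a> {
    fn pin(counter: &'a AtomicU32) -> PinGuard<'a> {
        counter.fetch_add(1, Ordering::Acquire); // simplified: no overflow check
        PinGuard { counter }
    }
}

impl Drop for PinGuard<'_> {
    fn drop(&mut self) {
        self.counter.fetch_sub(1, Ordering::Release); // the paired unpin
    }
}
```

With this pattern, forgetting the unpin becomes impossible by construction rather than a caller obligation.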

### Disk Manager -> Page Manager
- **Calls**: `init_page(page_no)` immediately after allocating a new page
- **Invariants**:
  - Disk manager must call init_page before making the page available for allocation
  - Page manager guarantees: init_page sets the pin count to 0 (clean state)
  - Disk manager must not allocate a page with a non-zero pin count

### Recovery Manager -> Page Manager
- **Implicit**: On startup, the recovery manager replays the log
- **Effect**: The recovery manager's page accesses trigger pin/unpin via the buffer pool
- **Invariants**:
  - After a crash, all pin counts are 0 (initialized by the page manager on restart)
  - The recovery process correctly pins pages during log replay
  - No special handling needed in the page manager for recovery

## Rejected Designs
### Combined Pin/Dirty/Free State in Single AtomicU64
- **Reason**: Required persisting state across crashes for correctness
- **Problem**: Page manager state (in memory) is not recoverable after a crash without complex logging
- **Tradeoff**: Simpler concurrent operations but unacceptable recovery complexity

### Separate Free List Tracking in Page Manager
- **Reason**: Duplicated responsibility with the disk manager
- **Problem**: Required synchronizing two free lists (memory + disk)
- **Tradeoff**: Slightly faster allocation checks but complex crash consistency

### Lock-Based Per-Page Pinning
- **Reason**: A mutex per page would cause excessive memory usage
- **Problem**: 100M pages would require ~800MB just for mutexes (assuming 8 bytes each)
- **Tradeoff**: Simpler locking but prohibitive memory overhead

### Hazard Pointers for Page Reclamation
- **Reason**: Overkill for the pin counting use case
- **Problem**: Complexity not justified by access patterns
- **Tradeoff**: Wait-free pinning but significantly increased implementation complexity