# FlowDB Page Manager Design

## Core Responsibilities
1. Track pin counts for allocated pages to prevent premature eviction from buffer pool
2. Initialize pin count to 0 when a new page is allocated by the disk manager
3. Provide thread-safe pin/unpin operations with overflow/underflow protection
4. Do NOT manage free lists or disk allocation (handled by disk manager)
5. Do NOT track dirty state (handled by buffer pool)

## Data Structures
### Page Descriptor Table
- `pin_counts: Vec<AtomicU32>` - Vector of atomic pin counters indexed by page number
- `length: AtomicUsize` - Current valid length of pin_counts vector
- `resize_mutex: Mutex<()>` - Mutex for vector resizing operations

### Per-Page State (stored in pin_counts[page_no])
- `pin_count: u32` - Number of times page is currently pinned (0 = not pinned)
  - Range: 0 to u32::MAX-1 (u32::MAX reserved for error detection)
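
As a sketch, the descriptor table above could be declared as follows; `PageManager` and `with_capacity` are illustrative names, not part of the codebase:

```rust
use std::sync::Mutex;
use std::sync::atomic::{AtomicU32, AtomicUsize};

// Illustrative container for the fields described above.
struct PageManager {
    pin_counts: Vec<AtomicU32>, // one atomic pin counter per page
    length: AtomicUsize,        // valid prefix length of pin_counts
    resize_mutex: Mutex<()>,    // serializes growth of pin_counts
}

impl PageManager {
    // Pre-size the table; every counter starts at 0 (not pinned).
    fn with_capacity(n: usize) -> Self {
        PageManager {
            pin_counts: (0..n).map(|_| AtomicU32::new(0)).collect(),
            length: AtomicUsize::new(n),
            resize_mutex: Mutex::new(()),
        }
    }
}
```

Note that in safe Rust, growing `pin_counts` while other threads read its slots concurrently requires storage whose elements never move; this sketch only shows the fields, not the growth mechanics.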

## Memory Layout Decisions
- Each pin counter: 4 bytes (AtomicU32)
- Vector grows exponentially (doubling) when resizing to minimize amortized cost
- Cache alignment: Each AtomicU32 naturally aligned to 4-byte boundary
- False sharing mitigation: Pages accessed concurrently are likely spaced far apart in vector (page numbers differ significantly)
- No padding between elements - maximizes pin counts per cache line (typically 64 bytes = 16 counters)
- Tradeoff: Vector resizing requires mutex acquisition, potentially blocking concurrent allocations. Chosen because allocations are infrequent (<1% of operations) versus pin/unpin, which are extremely frequent.

## Concurrency Model
### Pin Operation (thread-safe, lock-free for hot path)
1. Read current vector length via `length.load(Acquire)`
2. If page_no < length: proceed to step 4
3. Else:
   - Acquire resize_mutex
   - Recheck length (double-checked locking)
   - If still insufficient: resize vector to max(page_no+1, length*2), new elements initialized to 0
   - Release resize_mutex
4. Load pin counter: `pin_counts[page_no].load(Acquire)`
5. If value == u32::MAX-1: return PinOverflow error (incrementing would produce the reserved sentinel u32::MAX)
6. Attempt CAS: `compare_exchange_weak(old, old+1, Acquire, Relaxed)`
7. On failure: retry from step 4
8. On success: return Ok(())
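
The pin steps above can be sketched as a CAS loop over a single counter slot; the `PinError` type and `pin` function name are illustrative:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

#[derive(Debug, PartialEq)]
enum PinError {
    PinOverflow,
}

// Steps 4-8: lock-free increment with overflow protection.
// `counter` is one slot of pin_counts; u32::MAX stays reserved.
fn pin(counter: &AtomicU32) -> Result<(), PinError> {
    let mut old = counter.load(Ordering::Acquire); // step 4
    loop {
        if old >= u32::MAX - 1 {
            return Err(PinError::PinOverflow); // step 5: would hit sentinel
        }
        // Step 6: publish old+1 only if no other thread raced us.
        match counter.compare_exchange_weak(old, old + 1, Ordering::Acquire, Ordering::Relaxed) {
            Ok(_) => return Ok(()),  // step 8
            Err(cur) => old = cur,   // step 7: retry with the fresh value
        }
    }
}
```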

### Unpin Operation (thread-safe, lock-free)
1. Read current vector length via `length.load(Acquire)`
2. If page_no >= length: return PageNotAllocated error (should not happen if caller validated)
3. Load pin counter: `pin_counts[page_no].load(Acquire)`
4. If value == 0: return NotPinned error
5. Attempt CAS: `compare_exchange_weak(old, old-1, Release, Relaxed)` (Release so the unpinner's page writes happen-before a subsequent eviction)
6. On failure: retry from step 3
7. On success: return Ok(())
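
The unpin loop is the mirror image, guarding against underflow instead of overflow; again, `UnpinError` and `unpin` are illustrative names:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

#[derive(Debug, PartialEq)]
enum UnpinError {
    NotPinned,
}

// Steps 3-7: lock-free decrement with underflow protection.
fn unpin(counter: &AtomicU32) -> Result<(), UnpinError> {
    let mut old = counter.load(Ordering::Acquire); // step 3
    loop {
        if old == 0 {
            return Err(UnpinError::NotPinned); // step 4: never go below 0
        }
        // Step 5: Release so this thread's page writes happen-before
        // any later eviction that observes the dropped pin.
        match counter.compare_exchange_weak(old, old - 1, Ordering::Release, Ordering::Relaxed) {
            Ok(_) => return Ok(()), // step 7
            Err(cur) => old = cur,  // step 6: retry with the fresh value
        }
    }
}
```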

### Vector Resizing (requires mutex)
- Only triggered when accessing page_no >= current length
- Multiple threads may contend for mutex during allocation bursts
- After mutex acquisition, recheck length to avoid redundant resizing
- New elements zero-initialized (valid initial state)
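
A minimal sketch of the double-checked resize, simplified for safe Rust by keeping the vector itself behind the mutex (the document's lock-free slot reads would additionally require storage whose slots never move, e.g. chunked allocation); `Table` and `ensure_capacity` are illustrative names:

```rust
use std::sync::Mutex;
use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering};

struct Table {
    slots: Mutex<Vec<AtomicU32>>, // safe-Rust simplification: Vec lives under the lock
    length: AtomicUsize,          // published valid length, for the fast path
}

impl Table {
    fn ensure_capacity(&self, page_no: usize) {
        // Fast path: current length already covers page_no.
        if page_no < self.length.load(Ordering::Acquire) {
            return;
        }
        let mut slots = self.slots.lock().unwrap(); // acquire resize_mutex
        let cur = self.length.load(Ordering::Acquire); // recheck (double-checked locking)
        if page_no >= cur {
            // Grow to max(page_no + 1, length * 2); new slots zero-initialized.
            let new_len = (page_no + 1).max(cur * 2);
            slots.resize_with(new_len, || AtomicU32::new(0));
            self.length.store(new_len, Ordering::Release);
        }
    }
}
```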

## Failure Scenarios
### Crash During Pin
- **Scenario**: Power failure while a pin is in flight or immediately after it completes
- **Effect**: The pin count was incremented in memory only; pin counts are never persisted
- **Recovery**: On restart, all pin counts reset to 0
- **Correctness**: Safe because no transactions survive a crash; any page pinned at crash time belonged to a now-aborted transaction
- **Tradeoff**: Loses pin count precision but maintains the safety invariant (no false positives for pinned pages)

### Crash During Unpin
- **Scenario**: Power failure while an unpin is in flight or immediately after it completes
- **Effect**: The pin count was decremented in memory only; pin counts are never persisted
- **Recovery**: On restart, all pin counts reset to 0
- **Correctness**: Safe because underflow protection makes over-unpinning impossible; any undercount at crash time belonged to an aborted transaction's cleanup

### Vector Resize Failure
- **Scenario**: OOM during vector resize
- **Effect**: Pin operation returns an error
- **Correctness**: Caller must handle it as an allocation failure (propagate up)
- **Tradeoff**: Fail-stop rather than corrupting state; simpler than trying to preserve partial state

### Pin Count Overflow
- **Scenario**: More than u32::MAX-1 concurrent pins on the same page
- **Effect**: Pin operation returns PinOverflow error
- **Correctness**: Prevents undefined behavior from wraparound
- **Tradeoff**: The limit (~4 billion pins) is arbitrary but practically unlimited; real systems hit other limits first

## Interaction With Other Components
### Buffer Pool -> Page Manager
- **Calls**: `pin_page(page_no)` before reading/modifying a page; `unpin_page(page_no)` after finishing with it
- **Invariants**:
  - Buffer pool must call unpin for every successful pin
  - Buffer pool must not access a page after unpin without re-pinning
  - Page manager guarantees: if pin count > 0, the page cannot be deallocated by the disk manager
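
One way the buffer pool might enforce the unpin-for-every-pin invariant is an RAII guard; `PinGuard` is hypothetical and elides the overflow/underflow checks shown earlier:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical RAII wrapper: dropping the guard performs the unpin,
// so every successful pin is paired with exactly one unpin even on
// early return or panic unwinding.
struct PinGuard<'a> {
    counter: &'a AtomicU32,
}

impl<'a> PinGuard<'a> {
    fn pin(counter: &'a AtomicU32) -> PinGuard<'a> {
        counter.fetch_add(1, Ordering::Acquire); // simplified: no overflow check
        PinGuard { counter }
    }
}

impl Drop for PinGuard<'_> {
    fn drop(&mut self) {
        self.counter.fetch_sub(1, Ordering::Release); // the paired unpin
    }
}
```

With this pattern, forgetting the unpin becomes impossible by construction rather than a caller obligation.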

### Disk Manager -> Page Manager
- **Calls**: `init_page(page_no)` immediately after allocating a new page
- **Invariants**:
  - Disk manager must call init_page before making the page available for allocation
  - Page manager guarantees: init_page sets the pin count to 0 (clean state)
  - Disk manager must not allocate a page with a non-zero pin count

### Recovery Manager -> Page Manager
- **Implicit**: On startup, the recovery manager replays the log
- **Effect**: The recovery manager's page accesses trigger pin/unpin via the buffer pool
- **Invariants**:
  - After a crash, all pin counts are 0 (initialized by the page manager on restart)
  - The recovery process correctly pins pages during log replay
  - No special handling needed in the page manager for recovery

## Rejected Designs
### Combined Pin/Dirty/Free State in Single AtomicU64
- **Reason**: Required persisting state across crashes for correctness
- **Problem**: Page manager state (in memory) is not recoverable after a crash without complex logging
- **Tradeoff**: Simpler concurrent operations but unacceptable recovery complexity

### Separate Free List Tracking in Page Manager
- **Reason**: Duplicated responsibility with the disk manager
- **Problem**: Required synchronizing two free lists (memory + disk)
- **Tradeoff**: Slightly faster allocation checks but complex crash consistency

### Lock-Based Per-Page Pinning
- **Reason**: A mutex per page would cause excessive memory usage
- **Problem**: 100M pages would require ~800MB just for mutexes (assuming 8 bytes each)
- **Tradeoff**: Simpler locking but prohibitive memory overhead

### Hazard Pointers for Page Reclamation
- **Reason**: Overkill for the pin counting use case
- **Problem**: Complexity not justified by access patterns
- **Tradeoff**: Wait-free pinning but significantly increased implementation complexity