
Commit 4bea9b5

Author: hellodebojeet
Commit message: first commit
0 parents

96 files changed

Lines changed: 855 additions & 0 deletions


.github/workflows/bench.yml

Whitespace-only changes.

.github/workflows/ci.yml

Whitespace-only changes.

.github/workflows/fuzz.yml

Whitespace-only changes.

.github/workflows/release.yml

Whitespace-only changes.

AGENT.md

Whitespace-only changes.

BENCHMARK.md

Whitespace-only changes.

BENCHMARK_PAGE_MANAGER.md

Lines changed: 130 additions & 0 deletions
# Page Manager Benchmark Plan

## Benchmark Goals

Measure the performance characteristics of the PageManager under realistic database workloads, focusing on:

1. Throughput (operations per second) for pin/unpin operations
2. Latency distribution (P50, P95, P99) for pin and unpin operations
3. Memory usage and allocation overhead
4. Scalability with increasing concurrent threads

## Test Scenarios

### 1. Sequential Workload

**Description**: Simulates a table scan workload where pages are accessed in sequential order.

- Threads pin pages in increasing order (page 0, 1, 2, ...), then unpin them in the same order
- Represents OLAP scan workloads and sequential index traversals
- **Metrics**:
  - Throughput: ops/sec (pin+unpin pairs)
  - Latency: average and tail latency per operation
  - Cache efficiency: measured via hardware counters (if available) or inferred from latency
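For illustration, a minimal `testing.B` sketch of this scenario in its single-goroutine form (a multi-threaded variant would use `RunParallel`, as in the random workload below). `NewPageManager` and the `Pin`/`Unpin` signatures are assumptions matching the API sketched in PAGE_MANAGER_DESIGN.md, not the committed code:

```go
package pagemanager

import "testing"

// Sketch of the sequential-scan scenario. NewPageManager, Pin, and Unpin
// are assumed names for the API sketched in PAGE_MANAGER_DESIGN.md.
func BenchmarkSequentialScan(b *testing.B) {
	const numPages = 1 << 16
	pm := NewPageManager(numPages) // pre-size so no resizing occurs mid-run
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		page := uint64(i) % numPages // pages 0, 1, 2, ... in order
		if err := pm.Pin(page); err != nil {
			b.Fatal(err)
		}
		if err := pm.Unpin(page); err != nil {
			b.Fatal(err)
		}
	}
}
```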
### 2. Random Workload

**Description**: Simulates random index lookups or OLTP workloads.

- Each thread repeatedly:
  1. Selects a random page number within the currently allocated range
  2. Pins the page
  3. Performs a dummy operation (to simulate work)
  4. Unpins the page
- **Metrics**:
  - Throughput: ops/sec
  - Latency: P50, P95, P99 for pin and unpin separately
  - Contention: measured via thread stall time or retry rates in CAS loops
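A hedged sketch of this loop using `testing.B.RunParallel` and the deterministic `golang.org/x/exp/rand` generator named in the implementation note below; `NewPageManager` remains a hypothetical constructor:

```go
package pagemanager

import (
	"sync/atomic"
	"testing"

	"golang.org/x/exp/rand"
)

// Sketch of the random workload: each goroutine picks a random page in
// the allocated range, pins it, does token work, and unpins it.
func BenchmarkRandomPinUnpin(b *testing.B) {
	const numPages = 1 << 20
	pm := NewPageManager(numPages)
	var seed atomic.Uint64
	b.RunParallel(func(pb *testing.PB) {
		rng := rand.New(rand.NewSource(seed.Add(1))) // deterministic per-goroutine stream
		var sink uint64
		for pb.Next() {
			page := rng.Uint64n(numPages) // step 1: random page in allocated range
			if err := pm.Pin(page); err != nil { // step 2
				b.Error(err) // Fatal is not safe off the main test goroutine
				return
			}
			sink += page // step 3: dummy work standing in for page access
			if err := pm.Unpin(page); err != nil { // step 4
				b.Error(err)
				return
			}
		}
		_ = sink
	})
}
```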
### 3. High-Concurrency Workload

**Description**: Stress test with many threads competing for the same pages.

- Fixed set of hot pages (e.g., 10 pages)
- A large number of threads (e.g., 2x, 4x, or 8x the hardware thread count) repeatedly:
  1. Pin a random hot page
  2. Immediately unpin it
- **Metrics**:
  - Throughput: ops/sec as thread count increases
  - Latency: tail latency (P99) under contention
  - Scalability: throughput vs. thread count curve
  - Retry rate: percentage of CAS operations that fail and require a retry
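The retry-rate metric can be collected by counting failed CAS attempts directly in the hot loop. A sketch with hypothetical counters that would exist only in an instrumented build (the overflow check is elided for brevity):

```go
package pagemanager

import "sync/atomic"

// Hypothetical instrumentation: total CAS attempts and failed attempts.
var casAttempts, casRetries atomic.Uint64

// pinCounted is an instrumented pin loop (overflow check elided).
func pinCounted(c *atomic.Uint32) {
	for {
		old := c.Load()
		casAttempts.Add(1)
		if c.CompareAndSwap(old, old+1) {
			return
		}
		casRetries.Add(1) // lost the race; loop and retry
	}
}

// retryRate is reported at the end of the run: failures / attempts.
func retryRate() float64 {
	return float64(casRetries.Load()) / float64(casAttempts.Load())
}
```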
### 4. Allocation Workload

**Description**: Measures performance during active page allocation.

- A background thread continuously allocates new pages (calling EnsureCapacity/AllocPage)
- Foreground threads perform random pin/unpin operations on existing pages
- **Metrics**:
  - Allocation throughput: pages allocated/sec
  - Impact on foreground latency: compared to a baseline without allocation
  - Memory usage: total memory consumed by the pin count array

### 5. Long-Running Stability Test

**Description**: Tests for memory leaks and gradual performance degradation.

- Mixed workload of pin/unpin and allocation for an extended period (e.g., 1 hour)
- **Metrics**:
  - Memory growth over time
  - Throughput stability (coefficient of variation)
  - Any increase in latency or retry rates
## Expected Bottlenecks

### 1. Seqlock Contention

- **Where**: During concurrent EnsureCapacity calls, or when writers block readers
- **Impact**: Under high allocation rates, writers may cause readers to retry frequently
- **Mitigation Considered**:
  - Using sequence locks minimizes the time writers block readers
  - Allocation is infrequent (<1% of ops), so the impact should be low
- **Validation**: Measure retry rates in allocation-heavy scenarios

### 2. Cache Line Ping-Ponging

- **Where**: When multiple threads pin/unpin pages that share cache lines
- **Impact**: False sharing degrades performance significantly in random workloads
- **Mitigation**:
  - Cache-line padding in the paddedPin structure
- **Validation**: Compare padded vs. unpadded versions under the random workload
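The paddedPin structure referenced above does not appear in this excerpt; a plausible shape, assuming one counter per 64-byte cache line, would be:

```go
package pagemanager

import "sync/atomic"

// Hypothetical paddedPin layout: each pin counter occupies its own
// 64-byte cache line so that neighboring pages never false-share.
type paddedPin struct {
	count atomic.Uint32
	_     [60]byte // pad the 4-byte counter to a full 64-byte line
}
```

The cost is 16x the memory of the unpadded layout (64 bytes per page instead of 4), which is why the design doc defaults to no padding and the benchmark compares both variants.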
### 3. CAS Retry Loops

- **Where**: In Pin/Unpin operations under high contention
- **Impact**: High retry rates waste CPU cycles and increase latency
- **Mitigation**:
  - Exponential backoff in retry loops (not implemented; would add complexity)
- **Validation**: Measure retry rates vs. thread count; if they exceed 10%, consider backoff

### 4. Memory Allocation During Resize

- **Where**: During EnsureCapacity, when the array needs to grow
- **Impact**: Stop-the-world pause during the copy; allocates a large new array
- **Mitigation**:
  - Geometric growth minimizes resize frequency
  - The copy is O(n), but n is the number of pages, not tuples
- **Validation**: Measure 99th-percentile latency spikes during allocation bursts
## Optimization Strategy Based on Results

### If False Sharing Dominates (evident in the random workload):

- Consider adaptive padding: only pad regions with high access density
- Use page access histograms to identify hot regions and apply padding selectively
- Tradeoff: increased complexity for better memory utilization

### If Seqlock Contention Is High (allocation-heavy workloads):

- Batch allocations: allow the disk manager to allocate multiple pages at once
- Use per-CPU or per-thread allocation buffers to reduce global allocation frequency
- Tradeoff: increased memory overhead but reduced contention

### If CAS Retry Rates Are High (>15%):

- Implement lightweight backoff in Pin/Unpin loops, as sketched below:
  - First retry: immediate
  - Second retry: yield
  - Third+: short sleep (futex wait)
- Tradeoff: slightly higher latency at low contention but much better scalability
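A minimal sketch of that ladder. Go exposes no portable futex wait, so a short `time.Sleep` stands in for the third rung:

```go
package pagemanager

import (
	"runtime"
	"time"
)

// backoff implements the proposed retry ladder.
func backoff(attempt int) {
	switch {
	case attempt <= 0: // first retry: immediate, stay on the CPU
	case attempt == 1: // second retry: yield to another goroutine
		runtime.Gosched()
	default: // third and later: brief sleep off the CPU
		time.Sleep(time.Microsecond)
	}
}
```

The CAS loops in Pin/Unpin would call `backoff(retries)` just before re-reading the counter.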
### If Allocation Latency Spikes Are Noticeable:

- Pre-allocate page ranges based on workload forecasts
- Use huge pages for the pin count array to reduce TLB pressure
- Tradeoff: complexity vs. reduced latency variance

## Success Criteria

1. Sequential workload: >1M ops/sec with <10µs P99 latency
2. Random workload: >500K ops/sec with <50µs P99 latency at 64 threads
3. Allocation impact: <5% throughput degradation on the foreground workload during allocation bursts
4. Memory overhead: <10 bytes per page (including padding) for typical page counts (<1M pages)
5. No memory leaks or growing latency in long-running tests

## Implementation Note

These benchmarks would be implemented using Go's testing/benchmark framework with:

- `testing.B` for microbenchmarks
- Custom harnesses for multi-threaded scenarios
- `runtime/pprof` for latency measurements
- `golang.org/x/exp/rand` for deterministic random workloads
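For the P50/P95/P99 numbers, a custom harness would record per-operation timings and compute percentiles after the run. A minimal sketch using the nearest-rank method (per-worker sample collection and merging assumed; `samples` must be non-empty):

```go
package pagemanager

import (
	"sort"
	"time"
)

// percentiles computes nearest-rank latency percentiles from a slice of
// per-operation samples collected by the benchmark workers.
func percentiles(samples []time.Duration) (p50, p95, p99 time.Duration) {
	sort.Slice(samples, func(i, j int) bool { return samples[i] < samples[j] })
	at := func(q float64) time.Duration {
		return samples[int(q*float64(len(samples)-1))]
	}
	return at(0.50), at(0.95), at(0.99)
}
```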

LICENSE

Whitespace-only changes.

Makefile

Whitespace-only changes.

PAGE_MANAGER_DESIGN.md

Lines changed: 127 additions & 0 deletions
# FlowDB Page Manager Design

## Core Responsibilities

1. Track pin counts for allocated pages to prevent premature eviction from buffer pool
2. Initialize pin count to 0 when a new page is allocated by the disk manager
3. Provide thread-safe pin/unpin operations with overflow/underflow protection
4. Do NOT manage free lists or disk allocation (handled by disk manager)
5. Do NOT track dirty state (handled by buffer pool)

## Data Structures

### Page Descriptor Table

- `pin_counts: Vec<AtomicU32>` - Vector of atomic pin counters indexed by page number
- `length: AtomicUsize` - Current valid length of pin_counts vector
- `resize_mutex: Mutex<()>` - Mutex for vector resizing operations

### Per-Page State (stored in pin_counts[page_no])

- `pin_count: u32` - Number of times page is currently pinned (0 = not pinned)
- Range: 0 to u32::MAX-1 (u32::MAX reserved for error detection)
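The table above uses Rust-style notation. Since the benchmark plan targets Go tooling, the sketches in this document use a rough Go equivalent; field names are illustrative, not the committed API:

```go
package pagemanager

import (
	"sync"
	"sync/atomic"
)

// PageManager is a Go rendering of the page descriptor table above.
type PageManager struct {
	pinCounts   []atomic.Uint32 // one counter per page, indexed by page number
	length      atomic.Int64    // current valid length of pinCounts
	resizeMutex sync.Mutex      // serializes vector resizing
}

// NewPageManager pre-sizes the table for an initial page count.
func NewPageManager(numPages int) *PageManager {
	pm := &PageManager{pinCounts: make([]atomic.Uint32, numPages)}
	pm.length.Store(int64(numPages))
	return pm
}
```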
## Memory Layout Decisions

- Each pin counter: 4 bytes (AtomicU32)
- Vector grows exponentially (doubling) when resizing to minimize amortized cost
- Cache alignment: each AtomicU32 is naturally aligned to a 4-byte boundary
- False sharing mitigation: pages accessed concurrently are likely spaced far apart in the vector (their page numbers differ significantly)
- No padding between elements - maximizes pin counts per cache line (typically 64 bytes = 16 counters)
- Tradeoff: vector resizing requires mutex acquisition, potentially blocking concurrent allocations. Chosen because allocations are infrequent (<1% of operations), whereas pin/unpin operations are extremely frequent.

## Concurrency Model
### Pin Operation (thread-safe, lock-free for hot path)

1. Read current vector length via `length.load(Acquire)`
2. If page_no < length: proceed to step 4
3. Else:
   - Acquire resize_mutex
   - Recheck length (double-checked locking)
   - If still insufficient: resize vector to max(page_no+1, length*2), new elements initialized to 0
   - Release resize_mutex
4. Load pin counter: `pin_counts[page_no].load(Acquire)`
5. If value == u32::MAX: return PinOverflow error
6. Attempt CAS: `compare_exchange_weak(old, old+1, Acquire, Relaxed)`
7. On failure: retry from step 4
8. On success: return Ok(())
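A hedged Go rendering of steps 4-8 (the lock-free hot path; steps 1-3 are factored into the `ensureCapacity` helper sketched under Vector Resizing below). Go's sync/atomic is sequentially consistent, which is at least as strong as the Acquire/Relaxed orderings specified here:

```go
package pagemanager

import (
	"errors"
	"math"
)

var ErrPinOverflow = errors.New("pin count overflow")

// Pin increments the page's pin counter with a CAS retry loop.
// Assumes pinCounts is stable while in use; see the resize caveat below.
func (pm *PageManager) Pin(pageNo uint64) error {
	pm.ensureCapacity(pageNo) // steps 1-3 (see Vector Resizing)
	c := &pm.pinCounts[pageNo]
	for {
		old := c.Load() // step 4
		if old == math.MaxUint32 {
			return ErrPinOverflow // step 5: sentinel reserved for error detection
		}
		if c.CompareAndSwap(old, old+1) { // step 6
			return nil // step 8
		}
		// step 7: CAS lost a race with a concurrent pin/unpin; retry
	}
}
```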
### Unpin Operation (thread-safe, lock-free)

1. Read current vector length via `length.load(Acquire)`
2. If page_no >= length: return PageNotAllocated error (should not happen if caller validated)
3. Load pin counter: `pin_counts[page_no].load(Acquire)`
4. If value == 0: return NotPinned error
5. Attempt CAS: `compare_exchange_weak(old, old-1, Acquire, Relaxed)`
6. On failure: retry from step 3
7. On success: return Ok(())
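The matching Go sketch, mirroring the pin path but refusing to underflow:

```go
package pagemanager

import "errors"

var (
	ErrPageNotAllocated = errors.New("page not allocated")
	ErrNotPinned        = errors.New("page is not pinned")
)

// Unpin decrements the pin counter, with underflow protection.
func (pm *PageManager) Unpin(pageNo uint64) error {
	if pageNo >= uint64(pm.length.Load()) {
		return ErrPageNotAllocated // steps 1-2
	}
	c := &pm.pinCounts[pageNo]
	for {
		old := c.Load() // step 3
		if old == 0 {
			return ErrNotPinned // step 4: underflow protection
		}
		if c.CompareAndSwap(old, old-1) { // step 5
			return nil // step 7
		}
		// step 6: retry from the load
	}
}
```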
### Vector Resizing (requires mutex)

- Only triggered when accessing page_no >= current length
- Multiple threads may contend for mutex during allocation bursts
- After mutex acquisition, recheck length to avoid redundant resizing
- New elements zero-initialized (valid initial state)
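The same logic in the Go sketch, with the double-checked locking called out. One caveat the prose glosses over: copying counters into a new array, and republishing the slice, races with concurrent CAS updates, which is what the seqlock in BENCHMARK_PAGE_MANAGER.md is for; this sketch assumes growth happens while the affected pages are quiescent:

```go
package pagemanager

import "sync/atomic"

// ensureCapacity grows pinCounts to cover pageNo using double-checked
// locking. Sketch only; see the caveat above about live counters.
func (pm *PageManager) ensureCapacity(pageNo uint64) {
	if pageNo < uint64(pm.length.Load()) {
		return // fast path: no lock taken
	}
	pm.resizeMutex.Lock()
	defer pm.resizeMutex.Unlock()
	old := uint64(pm.length.Load())
	if pageNo < old {
		return // another thread resized while we waited (recheck)
	}
	newLen := old * 2 // geometric growth: max(page_no+1, length*2)
	if newLen <= pageNo {
		newLen = pageNo + 1
	}
	grown := make([]atomic.Uint32, newLen) // zeroed = unpinned, a valid state
	for i := range pm.pinCounts[:old] {
		grown[i].Store(pm.pinCounts[i].Load()) // racy if pages are active
	}
	pm.pinCounts = grown // real code would publish via an atomic pointer
	pm.length.Store(int64(newLen))
}
```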
## Failure Scenarios

### Crash During Pin

- **Scenario**: Power failure after the CAS succeeds but before the operation completes
- **Effect**: Pin count may be incremented in memory but not persisted
- **Recovery**: On restart, all pin counts reset to 0
- **Correctness**: Safe because no transactions survive a crash; any page pinned at crash time belonged to an aborted transaction
- **Tradeoff**: Loses pin count precision but maintains the safety invariant (no false positives for pinned pages)

### Crash During Unpin

- **Scenario**: Power failure after the CAS succeeds but before the operation completes
- **Effect**: Pin count may be decremented in memory but not persisted
- **Recovery**: On restart, all pin counts reset to 0
- **Correctness**: Safe because over-unpinning is impossible (underflow is prevented); any undercount at crash time belonged to an aborted transaction's cleanup

### Vector Resize Failure

- **Scenario**: OOM during vector resize
- **Effect**: The pin operation returns an error
- **Correctness**: Caller must handle it as an allocation failure (propagate up)
- **Tradeoff**: Fail-stop rather than corrupting state; simpler than trying to preserve partial state

### Pin Count Overflow

- **Scenario**: More than u32::MAX-1 concurrent pins on the same page
- **Effect**: The pin operation returns a PinOverflow error
- **Correctness**: Prevents undefined behavior from wraparound
- **Tradeoff**: Arbitrary limit (about 4 billion pins) chosen as practically unlimited; real systems hit other limits first

## Interaction With Other Components

### Buffer Pool -> Page Manager

- **Calls**:
  - `pin_page(page_no)` before reading/modifying a page
  - `unpin_page(page_no)` after finishing with a page
- **Invariants**:
  - Buffer pool must call unpin for every successful pin
  - Buffer pool must not access a page after unpin without re-pinning
  - Page manager guarantees: if the pin count > 0, the page cannot be deallocated by the disk manager

### Disk Manager -> Page Manager

- **Calls**: `init_page(page_no)` immediately after allocating a new page
- **Invariants**:
  - Disk manager must call init_page before making the page available for allocation
  - Page manager guarantees: init_page sets the pin count to 0 (clean state)
  - Disk manager must not allocate a page with a non-zero pin count

### Recovery Manager -> Page Manager

- **Implicit**: On startup, the recovery manager replays the log
- **Effect**: The recovery manager's page accesses trigger pin/unpin via the buffer pool
- **Invariants**:
  - After a crash, all pin counts are 0 (initialized by the page manager on restart)
  - The recovery process correctly pins pages during log replay
  - No special handling is needed in the page manager for recovery

## Rejected Designs

### Combined Pin/Dirty/Free State in a Single AtomicU64

- **Reason**: Would have required persisting state across crashes for correctness
- **Problem**: Page manager state (in memory) is not recoverable after a crash without complex logging
- **Tradeoff**: Simpler concurrent operations but unacceptable recovery complexity

### Separate Free List Tracking in the Page Manager

- **Reason**: Duplicated responsibility with the disk manager
- **Problem**: Would have required synchronizing two free lists (memory + disk)
- **Tradeoff**: Slightly faster allocation checks but complex crash consistency

### Lock-Based Per-Page Pinning

- **Reason**: A mutex per page would cause excessive memory usage
- **Problem**: 100M pages would require ~800MB just for mutexes (assuming 8 bytes each)
- **Tradeoff**: Simpler locking but prohibitive memory overhead

### Hazard Pointers for Page Reclamation

- **Reason**: Overkill for the pin counting use case
- **Problem**: Complexity not justified by the access patterns
- **Tradeoff**: Wait-free pinning but significantly increased implementation complexity
