src/content/docs/getting-started/how-does-tidesdb-work.md
Lines changed: 5 additions & 5 deletions
@@ -130,7 +130,7 @@ Repeatable Read remembers every key it read, along with the version it saw. At c
Snapshot Isolation detects write-write conflicts only, with first-committer-wins. It keeps no read set; its commit aborts if another transaction wrote one of its keys after its snapshot began. It deliberately allows write skew — two transactions reading overlapping data and writing disjoint keys — because that matches the textbook definition, under which snapshot isolation requires only write-write conflict detection.
-
Serializable adds read-write conflict tracking on top of snapshot isolation, implementing serializable snapshot isolation (SSI). Only Repeatable Read and Serializable allocate a read set; once that set passes 64 entries it is backed by an xxHash table for O(1) conflict checks. At commit the engine examines all concurrent transactions: if transaction T read a key that another transaction T′ wrote, it marks an outgoing conflict on T and an incoming conflict on T′. A transaction carrying both an incoming and an outgoing conflict is a pivot in a "dangerous structure," and its commit aborts. This is a deliberately simplified SSI: it detects pivots but builds no precedence graph and does no cycle detection, so it can occasionally abort a transaction that was in fact serializable.
+
Serializable adds read-write conflict tracking on top of snapshot isolation, implementing serializable snapshot isolation (SSI). Only Repeatable Read and Serializable allocate a read set; once that set passes 64 entries it is backed by an xxHash table for O(1) conflict checks. At commit the engine examines other concurrent serializable transactions: if transaction T read a key that another transaction T′ wrote, it marks an outgoing conflict on T and an incoming conflict on T′. A transaction carrying both an incoming and an outgoing conflict is a pivot in a "dangerous structure," and its commit aborts. This is a deliberately simplified SSI: it detects pivots but builds no precedence graph and does no cycle detection, so it can occasionally abort a transaction that was in fact serializable.
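The pivot rule described above can be sketched in a few lines of C. This is an illustration only, assuming a hypothetical `sketch_txn_t` record with plain string keys and a linear scan; it is not TidesDB's internal structure, and it leaves out the hash-backed read set and the snapshot timestamps.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified transaction record (not TidesDB's real struct). */
typedef struct {
    const char **read_keys;
    size_t n_reads;
    const char **write_keys;
    size_t n_writes;
    bool has_in_conflict;  /* another transaction read a key this one wrote */
    bool has_out_conflict; /* this transaction read a key another one wrote */
} sketch_txn_t;

static bool key_in(const char **keys, size_t n, const char *k) {
    for (size_t i = 0; i < n; i++) {
        if (strcmp(keys[i], k) == 0) return true;
    }
    return false;
}

/* If `reader` read any key that `writer` wrote, record the rw-conflict:
 * outgoing on the reader, incoming on the writer. */
static void mark_rw_conflict(sketch_txn_t *reader, sketch_txn_t *writer) {
    for (size_t i = 0; i < reader->n_reads; i++) {
        if (key_in(writer->write_keys, writer->n_writes, reader->read_keys[i])) {
            reader->has_out_conflict = true;
            writer->has_in_conflict = true;
            return;
        }
    }
}

/* Returns true if `t` may commit, false if it is a pivot (both flags set).
 * No precedence graph and no cycle detection, so false aborts are possible. */
static bool ssi_can_commit(sketch_txn_t *t, sketch_txn_t **concurrent, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (concurrent[i] == t) continue;
        mark_rw_conflict(t, concurrent[i]);
        mark_rw_conflict(concurrent[i], t);
    }
    return !(t->has_in_conflict && t->has_out_conflict);
}
```

With the classic write-skew shape (both transactions read `a` and `b`, one writes `a`, the other writes `b`), the committing transaction ends up with both flags set and is refused. A transaction with only an outgoing conflict is allowed through.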
### Transactions Across Column Families
@@ -197,7 +197,7 @@ The L0 stall bounds the queue of frozen memtables, but not the active memtable t
Level 1 is watched alongside L0 because a high L1 count means compaction is falling behind, and a compaction backlog eventually starves flushing too (flushers wait on compaction to free space). Throttling on L1 therefore acts as a leading indicator, applying pressure before L0 becomes critical and heading off a cascade.
-
The per-column-family signals above cannot, by themselves, prevent an out-of-memory condition when many column families fill up at once. So a separate global guard runs in the reaper thread every 100ms. It sums all the memory the database is using — active and immutable memtables, in-flight transaction buffers, compaction scratch space, bloom filters, block indexes, and caches — and divides by a resolved limit (`max_memory_usage`, default half of system RAM, never less than 5%). The resulting pressure level is graduated: normal below 60%, elevated to 75%, high to 95%, critical above. The write path reads this level with one atomic load per commit, so it costs nothing at normal pressure. As pressure climbs, the response escalates: at elevated, the flush threshold tightens and the current family is flushed proactively; at high, the current family is force-flushed and the reaper force-flushes the largest non-flushing family; at critical, writes block entirely until the reaper brings pressure down (timing out after 10 seconds with `TDB_ERR_BUSY`), while the reaper force-flushes every non-flushing family and aggressively compacts the one with the most SSTables. In unified mode, where one memtable is shared, the reaper rotates that single memtable instead of iterating empty per-CF ones. As a last line of defense, an OS-level check polls real free memory every few seconds and forces the level to critical if free RAM drops below 5%, catching consumption that TidesDB's own accounting cannot see.
+
The per-column-family signals above cannot, by themselves, prevent an out-of-memory condition when many column families fill up at once. So a separate global guard runs in the reaper thread every 100ms. It sums all the memory the database is using — active and immutable memtables, in-flight transaction buffers, compaction scratch space, bloom filters, block indexes, and caches — and divides by a resolved limit (`max_memory_usage`, default 75% of system RAM, never less than 5%). The resulting pressure level is graduated: normal below 60%, elevated to 75%, high to 95%, critical above. The write path reads this level with one atomic load per commit, so it costs nothing at normal pressure. As pressure climbs, the response escalates: at elevated, the flush threshold tightens and the current family is flushed proactively; at high, the current family is force-flushed, the reaper force-flushes the largest non-flushing family, and it aggressively compacts the family with the most SSTables; at critical, writes block entirely until the reaper brings pressure down (timing out after 10 seconds with `TDB_ERR_BUSY`), while the reaper force-flushes every non-flushing family. In unified mode, where one memtable is shared, the reaper rotates that single memtable instead of iterating empty per-CF ones. As a last line of defense, an OS-level check polls real free memory every few seconds and forces the level to critical if free RAM drops below 5%, catching consumption that TidesDB's own accounting cannot see.
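The graduated levels and the single atomic read on the write path might look roughly like the sketch below. The names (`classify_pressure`, `reaper_publish`, `write_path_read`) are hypothetical, and the treatment of the exact boundary values (whether 60% counts as normal or elevated, for instance) is an assumption, since the text gives only the ranges.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef enum {
    PRESSURE_NORMAL,
    PRESSURE_ELEVATED,
    PRESSURE_HIGH,
    PRESSURE_CRITICAL
} pressure_level_t;

/* Map used/limit onto the four levels. Boundary handling is assumed. */
static pressure_level_t classify_pressure(uint64_t used_bytes, uint64_t limit_bytes) {
    if (limit_bytes == 0) return PRESSURE_CRITICAL; /* defensive: no budget at all */
    double ratio = (double)used_bytes / (double)limit_bytes;
    if (ratio < 0.60) return PRESSURE_NORMAL;
    if (ratio < 0.75) return PRESSURE_ELEVATED;
    if (ratio < 0.95) return PRESSURE_HIGH;
    return PRESSURE_CRITICAL;
}

/* Shared level: written by the reaper, read by every commit. */
static _Atomic int g_pressure_level = PRESSURE_NORMAL;

/* Reaper side: recompute and publish (every 100ms in the text). */
static void reaper_publish(uint64_t used_bytes, uint64_t limit_bytes) {
    atomic_store_explicit(&g_pressure_level,
                          (int)classify_pressure(used_bytes, limit_bytes),
                          memory_order_release);
}

/* Write path: one atomic load per commit. */
static pressure_level_t write_path_read(void) {
    return (pressure_level_t)atomic_load_explicit(&g_pressure_level,
                                                  memory_order_acquire);
}
```

Keeping the classification on the reaper side means a commit never does the division itself; it only branches on an integer that is already in cache most of the time.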
The point of the whole scheme is smooth degradation. Increasing the write-buffer size trades flush frequency against memory used during stalls; raising the stall threshold trades memory for burst tolerance; adding flush workers drains the queue faster; and `max_memory_usage` caps the whole envelope. The right settings depend on the write pattern, the available memory, and the disk — but in every case the system slows down gradually as it approaches its limits, rather than swinging between full speed and a dead stop.
## The Read Path
@@ -348,7 +348,7 @@ The work that does not happen on the caller's thread happens here (Figure 7). Fl
-
Flush workers (default 2) take frozen memtables off the queue and write them to SSTables, in parallel across column families. Compaction workers (default 2) merge SSTables across levels, in parallel across families, and fan out within a single round through sub-compaction. The sync worker (1 thread, started only if any WAL uses interval sync) periodically fsyncs the WALs configured for it; it finds the smallest configured interval, sleeps that long, and syncs each due WAL. Column families on interval sync also force an explicit fsync at structural boundaries — when a memtable rotates, and during every sorted-run creation and merge — which preserves durability while still batching ordinary writes.
+
Flush workers (default auto, min of CPU count and 4) take frozen memtables off the queue and write them to SSTables, in parallel across column families. Compaction workers (default 2) merge SSTables across levels, in parallel across families, and fan out within a single round through sub-compaction. The sync worker (1 thread, started only if any WAL uses interval sync) periodically fsyncs the WALs configured for it; it finds the smallest configured interval, sleeps that long, and syncs each due WAL. Column families on interval sync also force an explicit fsync at structural boundaries — when a memtable rotates, and during every sorted-run creation and merge — which preserves durability while still batching ordinary writes.
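The sync worker's loop (find the smallest interval, sleep, sync whatever is due) can be sketched as two small helpers. Everything here is hypothetical: the `sketch_wal_t` record, the microsecond units, and the convention that an interval of 0 means the WAL is not on interval sync. The actual `fsync` call is left as a comment so the logic can be read on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-WAL bookkeeping; interval_us == 0 means "not interval sync". */
typedef struct {
    uint64_t interval_us;
    uint64_t last_sync_us;
    int sync_count; /* stands in for the real fsync side effect */
} sketch_wal_t;

/* Smallest configured interval, or 0 if no WAL uses interval sync. */
static uint64_t min_sync_interval(const sketch_wal_t *wals, size_t n) {
    uint64_t min = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t iv = wals[i].interval_us;
        if (iv > 0 && (min == 0 || iv < min)) min = iv;
    }
    return min;
}

/* Sync every WAL whose interval has elapsed; returns how many were synced. */
static size_t sync_due_wals(sketch_wal_t *wals, size_t n, uint64_t now_us) {
    size_t synced = 0;
    for (size_t i = 0; i < n; i++) {
        if (wals[i].interval_us == 0) continue;
        if (now_us - wals[i].last_sync_us >= wals[i].interval_us) {
            /* a real worker would call fsync() on the WAL's descriptor here */
            wals[i].last_sync_us = now_us;
            wals[i].sync_count++;
            synced++;
        }
    }
    return synced;
}
```

A worker thread would sleep for `min_sync_interval(...)` between calls to `sync_due_wals(...)`, so a WAL with a longer interval is simply skipped on the wake-ups where it is not yet due.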
The reaper (1 thread) runs a maintenance loop every 100ms and is the system's general groundskeeper. Each cycle it sweeps the deferred-free list, retries flushes that were deferred under the concurrency cap, services any compaction triggers that arrived while a compaction was already running, recomputes global memory pressure and acts on it, and evicts idle SSTable file handles when too many are open. The memory-pressure response was described with [Backpressure](#backpressure-and-flow-control); the two pieces of bookkeeping unique to the reaper are worth a word each.
@@ -407,13 +407,13 @@ The bloom false-positive rate, 1% by default, balances memory against effectiven
Memtable size trades flush frequency against recovery time and memory. Larger memtables flush less often but lengthen recovery and use more memory; smaller ones flush more (more SSTables, more compaction) but recover faster. The 64MB default holds about a million small pairs and flushes every few seconds under moderate load. Doubling it halves flush frequency but raises level-1-to-level-2 amplification, since each flush produces a larger table that takes longer to merge.
-
Worker counts default to two flush and two compaction threads, which give cross-family parallelism at modest cost. More threads help with many active families but cost memory (each buffers 64KB blocks) and descriptors (two per table in flight). The device dominates the choice: on a spinning disk, several concurrent compactors cause head seeks that destroy throughput; on NVMe, more workers help. So 1–2 workers for HDD, 4–8 for NVMe.
+
Worker counts default to auto flush threads (the CPU count, capped at 4) and two compaction threads, which give cross-family parallelism at modest cost. More threads help with many active families but cost memory (each buffers 64KB blocks) and descriptors (two per table in flight). The device dominates the choice: on a spinning disk, several concurrent compactors cause head seeks that destroy throughput; on NVMe, more workers help. So 1–2 workers for HDD, 4–8 for NVMe.
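As a rough illustration of the "auto" default for flush threads, the helper below resolves a configured value against a CPU count. The function name and the convention that a configured value of 0 (or less) means "auto" are assumptions made for this sketch, not the documented configuration API.

```c
/* Hypothetical resolver: configured <= 0 is treated as "auto". */
static int resolve_flush_workers(int configured, long cpu_count) {
    if (configured > 0) return configured; /* an explicit setting wins */
    if (cpu_count < 1) cpu_count = 1;      /* CPU detection failed: use one worker */
    return cpu_count < 4 ? (int)cpu_count : 4; /* the CPU count, capped at 4 */
}
```

On Linux and macOS the CPU count could be obtained with `sysconf(_SC_NPROCESSORS_ONLN)`. A 2-core machine would then resolve to 2 flush workers and a 16-core machine to 4, while an explicit setting such as 8 for an NVMe deployment is passed through unchanged.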
## Operational Considerations
A TidesDB instance is safe for many threads in one process but exclusive to a single process: only one process may open a database directory at a time. Exclusivity is a non-blocking file lock taken during open — if another process holds it, open returns `TDB_ERR_LOCKED` at once rather than waiting. The locking primitive is chosen per platform for correct semantics: `fcntl` locks on macOS and BSD (which, unlike `flock`, are not inherited across `fork`, with the owning PID written to the lock file so a same-process double-open is caught), OFD locks on modern Linux, and `LockFileEx` on Windows, with retries on signal interruption so a stray signal cannot spuriously fail the lock.
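For readers unfamiliar with non-blocking advisory locks, here is a minimal POSIX sketch of the `fcntl` variant. It is not TidesDB's code: the function name and return convention are hypothetical, it covers neither the OFD-lock nor the `LockFileEx` path, and it omits the PID bookkeeping that catches a same-process double-open.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to take an exclusive, non-blocking lock on `path`.
 * Returns the open descriptor (>= 0) on success, -1 if another process
 * holds the lock, and -2 on any other error. Hypothetical convention. */
static int try_lock_file(const char *path) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -2;

    struct flock fl = {0};
    fl.l_type = F_WRLCK;    /* exclusive lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;           /* 0 means "to end of file": the whole file */

    int rc;
    do {
        rc = fcntl(fd, F_SETLK, &fl); /* F_SETLK fails at once instead of waiting */
    } while (rc == -1 && errno == EINTR); /* retry if a signal interrupted the call */

    if (rc == -1) {
        int held_elsewhere = (errno == EACCES || errno == EAGAIN);
        close(fd);
        return held_elsewhere ? -1 : -2;
    }
    return fd; /* the lock is released when this descriptor is closed */
}
```

Because classic `fcntl` locks belong to the process rather than to the descriptor, a second call from the same process would also succeed, which is why the text describes writing the owning PID into the lock file.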
-
Memory use per family comes from a few structures: the active memtable is configurable (default 64MB) and the immutable queue is that size times its depth (usually 1–2); the block cache is shared across families (default 64MB total); bloom filters cost about 10 bits per key and block indexes about 32 bytes per block. A family with 10M keys across 100 SSTables therefore runs around 150MB plus its share of the cache. The `max_memory_usage` cap (default auto, resolving to half of system RAM, never clamped below 5%) bounds the aggregate across all families, which is what prevents an out-of-memory condition in many-family deployments where per-family limits cannot.
+
Memory use per family comes from a few structures: the active memtable is configurable (default 64MB) and the immutable queue is that size times its depth (usually 1–2); the block cache is shared across families (default 64MB total); bloom filters cost about 10 bits per key and block indexes about 32 bytes per block. A family with 10M keys across 100 SSTables therefore runs around 150MB plus its share of the cache. The `max_memory_usage` cap (default auto, resolving to 75% of system RAM, never clamped below 5%) bounds the aggregate across all families, which is what prevents an out-of-memory condition in many-family deployments where per-family limits cannot.
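The "around 150MB" figure can be reproduced with simple arithmetic. The sketch below uses the per-structure costs quoted above (10 bits per key, 32 bytes per block) and adds two assumptions of its own: an immutable queue depth of 1, and a hypothetical total of 10,000 blocks across the 100 SSTables, since the text does not state a block count.

```c
#include <stdint.h>

/* Rough per-family estimate, excluding the shared block cache.
 * n_blocks is an assumed input; the text gives no block count. */
static uint64_t estimate_cf_memory(uint64_t memtable_bytes, unsigned queue_depth,
                                   uint64_t n_keys, uint64_t n_blocks) {
    uint64_t memtables = memtable_bytes * (1u + queue_depth); /* active + immutable queue */
    uint64_t bloom = n_keys * 10 / 8;                         /* ~10 bits per key */
    uint64_t block_index = n_blocks * 32;                     /* ~32 bytes per block */
    return memtables + bloom + block_index;
}
```

With a 64MB memtable (67,108,864 bytes), queue depth 1, 10M keys, and the assumed 10,000 blocks, the function returns 147,037,728 bytes, in line with the quoted estimate. The memtables dominate; the bloom filters contribute 12.5MB and the block indexes well under 1MB at this block count.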
Three operational limits interact at the margins. When writes outpace compaction, backpressure stalls them once the flush queue passes its threshold, trading occasional latency spikes for bounded memory. Because SSTables are immutable, space is reclaimed only after a compaction finishes and deletes its inputs, so a compaction can briefly need double the space of the level it rewrites; the engine checks free space before starting one. And because each SSTable holds two descriptors open, a working set larger than the open-file budget makes the reaper thrash; an operator who wants a bigger resident set can raise the process's descriptor ceiling before opening the database, after which the engine sizes its budget to fit. The raise is opt-in and a partial failure is non-fatal.
-
- `TDB_ERR_INVALID_ARGS` if `cf` or either key pointer is NULL, or sizes are zero
+
- `TDB_ERR_INVALID_ARGS` if `cf` is NULL, if both `start_key` and `end_key` are NULL, or if a non-NULL key has size zero (a single NULL key is allowed and leaves that side of the range unbounded)
+
- `TDB_ERR_LOCKED` if another compaction is already running on the column family
- Standard I/O and memory error codes if the merge cannot complete
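The updated argument rule for `TDB_ERR_INVALID_ARGS` can be read as the check below. This is a sketch of the rule as worded in the list, not the library's source: the function name, parameter list, and the numeric values of the `SKETCH_*` codes are placeholders, and `cf` is treated as an opaque pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder codes; TidesDB's real TDB_ERR_* values are not shown in this diff. */
enum { SKETCH_OK = 0, SKETCH_ERR_INVALID_ARGS = -1 };

/* Hypothetical validator mirroring the documented rule. */
static int validate_range_args(const void *cf,
                               const uint8_t *start_key, size_t start_key_size,
                               const uint8_t *end_key, size_t end_key_size) {
    if (cf == NULL) return SKETCH_ERR_INVALID_ARGS;
    /* at least one bound must be given */
    if (start_key == NULL && end_key == NULL) return SKETCH_ERR_INVALID_ARGS;
    /* a bound that is given must be non-empty */
    if (start_key != NULL && start_key_size == 0) return SKETCH_ERR_INVALID_ARGS;
    if (end_key != NULL && end_key_size == 0) return SKETCH_ERR_INVALID_ARGS;
    return SKETCH_OK; /* a single NULL key means "unbounded on that side" */
}
```

Under this reading, passing only `start_key` describes a range from that key to the end of the column family, and passing only `end_key` describes everything up to that key.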
### Purge Column Family
@@ -2272,7 +2274,7 @@ TidesDB uses separate thread pools for flush and compaction operations. Understa