Commit fbe34fe

Merge pull request #515 from tidesdb/tdb935
update design doc, building, doc and c reference for tidesdb v935
2 parents c788613 + 154cde4 commit fbe34fe

3 files changed

Lines changed: 14 additions & 11 deletions

src/content/docs/getting-started/how-does-tidesdb-work.md

Lines changed: 5 additions & 5 deletions
@@ -130,7 +130,7 @@ Repeatable Read remembers every key it read, along with the version it saw. At c
130130

131131
Snapshot Isolation detects write-write conflicts only, with first-committer-wins. It keeps no read set; its commit aborts if another transaction wrote one of its keys after its snapshot began. It deliberately allows write skew — two transactions reading overlapping data and writing disjoint keys — because that matches the textbook definition, under which snapshot isolation requires only write-write conflict detection.
132132

133-
Serializable adds read-write conflict tracking on top of snapshot isolation, implementing serializable snapshot isolation (SSI). Only Repeatable Read and Serializable allocate a read set; once that set passes 64 entries it is backed by an xxHash table for O(1) conflict checks. At commit the engine examines all concurrent transactions: if transaction T read a key that another transaction T′ wrote, it marks an outgoing conflict on T and an incoming conflict on T′. A transaction carrying both an incoming and an outgoing conflict is a pivot in a "dangerous structure," and its commit aborts. This is a deliberately simplified SSI: it detects pivots but builds no precedence graph and does no cycle detection, so it can occasionally abort a transaction that was in fact serializable.
133+
Serializable adds read-write conflict tracking on top of snapshot isolation, implementing serializable snapshot isolation (SSI). Only Repeatable Read and Serializable allocate a read set; once that set passes 64 entries it is backed by an xxHash table for O(1) conflict checks. At commit the engine examines other concurrent serializable transactions: if transaction T read a key that another transaction T′ wrote, it marks an outgoing conflict on T and an incoming conflict on T′. A transaction carrying both an incoming and an outgoing conflict is a pivot in a "dangerous structure," and its commit aborts. This is a deliberately simplified SSI: it detects pivots but builds no precedence graph and does no cycle detection, so it can occasionally abort a transaction that was in fact serializable.
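
The pivot rule in the updated paragraph can be illustrated with a minimal sketch. Every name here (`sketch_txn_t`, `sketch_mark_rw_conflict`, `sketch_is_pivot`) is hypothetical and not taken from the TidesDB source, and a linear key scan stands in for the hash-backed read set the text describes. It only shows how an incoming and an outgoing flag combine into an abort decision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical transaction record for illustration only; this is not
 * TidesDB's actual structure. */
typedef struct {
    const char **read_keys;   /* keys this transaction read */
    size_t num_reads;
    const char **write_keys;  /* keys this transaction wrote */
    size_t num_writes;
    bool has_incoming;        /* someone read a key this transaction wrote */
    bool has_outgoing;        /* this transaction read a key someone wrote */
} sketch_txn_t;

/* Linear scan of the write set (the real engine is described as using a
 * hash table once the set grows; that is omitted here). */
bool sketch_wrote_key(const sketch_txn_t *t, const char *key)
{
    for (size_t i = 0; i < t->num_writes; i++) {
        if (strcmp(t->write_keys[i], key) == 0) return true;
    }
    return false;
}

/* If `reader` read any key that `writer` wrote, mark an outgoing conflict
 * on the reader and an incoming conflict on the writer. */
void sketch_mark_rw_conflict(sketch_txn_t *reader, sketch_txn_t *writer)
{
    assert(reader != NULL && writer != NULL);
    for (size_t i = 0; i < reader->num_reads; i++) {
        if (sketch_wrote_key(writer, reader->read_keys[i])) {
            reader->has_outgoing = true;
            writer->has_incoming = true;
            return;
        }
    }
}

/* A transaction carrying both flags is a pivot and would abort at commit. */
bool sketch_is_pivot(const sketch_txn_t *t)
{
    return t->has_incoming && t->has_outgoing;
}
```

In the classic write-skew case (two transactions that each read keys `a` and `b`, one writing `a` and the other writing `b`), checking both directions marks both transactions as pivots. Flagging a pivot without checking for a cycle is what lets this approach abort a schedule that was in fact serializable.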
134134

135135
### Transactions Across Column Families
136136

@@ -197,7 +197,7 @@ The L0 stall bounds the queue of frozen memtables, but not the active memtable t
197197

198198
Level 1 is watched alongside L0 because a high L1 count means compaction is falling behind, and a compaction backlog eventually starves flushing too (flushers wait on compaction to free space). Throttling on L1 therefore acts as a leading indicator, applying pressure before L0 becomes critical and heading off a cascade.
199199

200-
The per-column-family signals above cannot, by themselves, prevent an out-of-memory condition when many column families fill up at once. So a separate global guard runs in the reaper thread every 100ms. It sums all the memory the database is using — active and immutable memtables, in-flight transaction buffers, compaction scratch space, bloom filters, block indexes, and caches — and divides by a resolved limit (`max_memory_usage`, default half of system RAM, never less than 5%). The resulting pressure level is graduated: normal below 60%, elevated to 75%, high to 95%, critical above. The write path reads this level with one atomic load per commit, so it costs nothing at normal pressure. As pressure climbs, the response escalates: at elevated, the flush threshold tightens and the current family is flushed proactively; at high, the current family is force-flushed and the reaper force-flushes the largest non-flushing family; at critical, writes block entirely until the reaper brings pressure down (timing out after 10 seconds with `TDB_ERR_BUSY`), while the reaper force-flushes every non-flushing family and aggressively compacts the one with the most SSTables. In unified mode, where one memtable is shared, the reaper rotates that single memtable instead of iterating empty per-CF ones. As a last line of defense, an OS-level check polls real free memory every few seconds and forces the level to critical if free RAM drops below 5%, catching consumption that TidesDB's own accounting cannot see.
200+
The per-column-family signals above cannot, by themselves, prevent an out-of-memory condition when many column families fill up at once. So a separate global guard runs in the reaper thread every 100ms. It sums all the memory the database is using — active and immutable memtables, in-flight transaction buffers, compaction scratch space, bloom filters, block indexes, and caches — and divides by a resolved limit (`max_memory_usage`, default 75% of system RAM, never less than 5%). The resulting pressure level is graduated: normal below 60%, elevated to 75%, high to 95%, critical above. The write path reads this level with one atomic load per commit, so it costs nothing at normal pressure. As pressure climbs, the response escalates: at elevated, the flush threshold tightens and the current family is flushed proactively; at high, the current family is force-flushed, the reaper force-flushes the largest non-flushing family, and it aggressively compacts the family with the most SSTables; at critical, writes block entirely until the reaper brings pressure down (timing out after 10 seconds with `TDB_ERR_BUSY`), while the reaper force-flushes every non-flushing family. In unified mode, where one memtable is shared, the reaper rotates that single memtable instead of iterating empty per-CF ones. As a last line of defense, an OS-level check polls real free memory every few seconds and forces the level to critical if free RAM drops below 5%, catching consumption that TidesDB's own accounting cannot see.
201201

202202
The point of the whole scheme is smooth degradation. Increasing the write-buffer size trades flush frequency against memory used during stalls; raising the stall threshold trades memory for burst tolerance; adding flush workers drains the queue faster; and `max_memory_usage` caps the whole envelope. The right settings depend on the write pattern, the available memory, and the disk — but in every case the system slows down gradually as it approaches its limits, rather than swinging between full speed and a dead stop.
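
The graduated levels described in this hunk can be sketched as a small classifier. The enum and function names are hypothetical, and the handling of exact boundary values (for example, a ratio of exactly 0.95) is an assumption, since the text only gives the ranges informally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; TidesDB's internal identifiers may differ. */
typedef enum {
    SKETCH_PRESSURE_NORMAL,
    SKETCH_PRESSURE_ELEVATED,
    SKETCH_PRESSURE_HIGH,
    SKETCH_PRESSURE_CRITICAL
} sketch_pressure_t;

/* Map used/limit to a pressure level: normal below 60%, elevated up to 75%,
 * high up to 95%, critical above. Boundary values fall into the higher
 * level here, which is an assumption. */
sketch_pressure_t sketch_classify_pressure(size_t used_bytes, size_t limit_bytes)
{
    assert(limit_bytes > 0);
    double ratio = (double)used_bytes / (double)limit_bytes;
    if (ratio < 0.60) return SKETCH_PRESSURE_NORMAL;
    if (ratio < 0.75) return SKETCH_PRESSURE_ELEVATED;
    if (ratio < 0.95) return SKETCH_PRESSURE_HIGH;
    return SKETCH_PRESSURE_CRITICAL;
}
```

A writer would then branch on the returned level. Per the text, the engine publishes the level so that the write path needs only one atomic load per commit; that part is not modeled here.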
203203
## The Read Path
@@ -348,7 +348,7 @@ The work that does not happen on the caller's thread happens here (Figure 7). Fl
348348
<img src="/design-diags/07_background_workers.png" alt="Figure 7. Background worker pools.">
349349
</div>
350350

351-
Flush workers (default 2) take frozen memtables off the queue and write them to SSTables, in parallel across column families. Compaction workers (default 2) merge SSTables across levels, in parallel across families, and fan out within a single round through sub-compaction. The sync worker (1 thread, started only if any WAL uses interval sync) periodically fsyncs the WALs configured for it; it finds the smallest configured interval, sleeps that long, and syncs each due WAL. Column families on interval sync also force an explicit fsync at structural boundaries — when a memtable rotates, and during every sorted-run creation and merge — which preserves durability while still batching ordinary writes.
351+
Flush workers (default auto, min of CPU count and 4) take frozen memtables off the queue and write them to SSTables, in parallel across column families. Compaction workers (default 2) merge SSTables across levels, in parallel across families, and fan out within a single round through sub-compaction. The sync worker (1 thread, started only if any WAL uses interval sync) periodically fsyncs the WALs configured for it; it finds the smallest configured interval, sleeps that long, and syncs each due WAL. Column families on interval sync also force an explicit fsync at structural boundaries — when a memtable rotates, and during every sorted-run creation and merge — which preserves durability while still batching ordinary writes.
352352

353353
The reaper (1 thread) runs a maintenance loop every 100ms and is the system's general groundskeeper. Each cycle it sweeps the deferred-free list, retries flushes that were deferred under the concurrency cap, services any compaction triggers that arrived while a compaction was already running, recomputes global memory pressure and acts on it, and evicts idle SSTable file handles when too many are open. The memory-pressure response was described with [Backpressure](#backpressure-and-flow-control); the two pieces of bookkeeping unique to the reaper are worth a word each.
354354

@@ -407,13 +407,13 @@ The bloom false-positive rate, 1% by default, balances memory against effectiven
407407

408408
Memtable size trades flush frequency against recovery time and memory. Larger memtables flush less often but lengthen recovery and use more memory; smaller ones flush more (more SSTables, more compaction) but recover faster. The 64MB default holds about a million small pairs and flushes every few seconds under moderate load. Doubling it halves flush frequency but raises level-1-to-level-2 amplification, since each flush produces a larger table that takes longer to merge.
409409

410-
Worker counts default to two flush and two compaction threads, which give cross-family parallelism at modest cost. More threads help with many active families but cost memory (each buffers 64KB blocks) and descriptors (two per table in flight). The device dominates the choice: on a spinning disk, several concurrent compactors cause head seeks that destroy throughput; on NVMe, more workers help. So 1–2 workers for HDD, 4–8 for NVMe.
410+
Worker counts default to auto flush threads (the CPU count, capped at 4) and two compaction threads, which give cross-family parallelism at modest cost. More threads help with many active families but cost memory (each buffers 64KB blocks) and descriptors (two per table in flight). The device dominates the choice: on a spinning disk, several concurrent compactors cause head seeks that destroy throughput; on NVMe, more workers help. So 1–2 workers for HDD, 4–8 for NVMe.
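
As a rough illustration of the new default, the resolution of `num_flush_threads = 0` to `min(cpu_count, 4)` might look like the sketch below. The function name is hypothetical, how the engine obtains the CPU count is not shown, and the floor of one thread is an assumption rather than something the diff states.

```c
#include <assert.h>

/* Hypothetical helper: resolve the configured flush thread count.
 * 0 means "auto" and resolves to min(cpu_count, 4), as the updated docs say. */
int sketch_resolve_flush_threads(int configured, int cpu_count)
{
    assert(configured >= 0);
    if (configured > 0) return configured;   /* explicit setting wins */
    int n = cpu_count < 4 ? cpu_count : 4;   /* auto: cap at 4 */
    return n < 1 ? 1 : n;                    /* assumption: never fewer than 1 */
}
```

With this rule, an 8-core machine resolves to 4 flush threads and a 2-core machine to 2, while an explicit setting is used as given.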
411411

412412
## Operational Considerations
413413

414414
A TidesDB instance is safe for many threads in one process but exclusive to a single process: only one process may open a database directory at a time. Exclusivity is a non-blocking file lock taken during open — if another process holds it, open returns `TDB_ERR_LOCKED` at once rather than waiting. The locking primitive is chosen per platform for correct semantics: `fcntl` locks on macOS and BSD (which, unlike `flock`, are not inherited across `fork`, with the owning PID written to the lock file so a same-process double-open is caught), OFD locks on modern Linux, and `LockFileEx` on Windows, with retries on signal interruption so a stray signal cannot spuriously fail the lock.
415415

416-
Memory use per family comes from a few structures: the active memtable is configurable (default 64MB) and the immutable queue is that size times its depth (usually 1–2); the block cache is shared across families (default 64MB total); bloom filters cost about 10 bits per key and block indexes about 32 bytes per block. A family with 10M keys across 100 SSTables therefore runs around 150MB plus its share of the cache. The `max_memory_usage` cap (default auto, resolving to half of system RAM, never clamped below 5%) bounds the aggregate across all families, which is what prevents an out-of-memory condition in many-family deployments where per-family limits cannot.
416+
Memory use per family comes from a few structures: the active memtable is configurable (default 64MB) and the immutable queue is that size times its depth (usually 1–2); the block cache is shared across families (default 64MB total); bloom filters cost about 10 bits per key and block indexes about 32 bytes per block. A family with 10M keys across 100 SSTables therefore runs around 150MB plus its share of the cache. The `max_memory_usage` cap (default auto, resolving to 75% of system RAM, never clamped below 5%) bounds the aggregate across all families, which is what prevents an out-of-memory condition in many-family deployments where per-family limits cannot.
417417

418418
Three operational limits interact at the margins. When writes outpace compaction, backpressure stalls them once the flush queue passes its threshold, trading occasional latency spikes for bounded memory. Because SSTables are immutable, space is reclaimed only after a compaction finishes and deletes its inputs, so a compaction can briefly need double the space of the level it rewrites; the engine checks free space before starting one. And because each SSTable holds two descriptors open, a working set larger than the open-file budget makes the reaper thrash; an operator who wants a bigger resident set can raise the process's descriptor ceiling before opening the database, after which the engine sizes its budget to fit. The raise is opt-in and a partial failure is non-fatal.
419419
## On-Disk Format

src/content/docs/reference/building.md

Lines changed: 1 addition & 0 deletions
@@ -469,6 +469,7 @@ TidesDB provides several CMake options to customize the build:
469469
| `TIDESDB_BUILD_TESTS` | Build test suite | `ON` |
470470
| `BUILD_SHARED_LIBS` | Build shared libraries instead of static | `ON` (Unix), `OFF` (Windows) |
471471
| `ENABLE_READ_PROFILING` | Enable read profiling instrumentation | `OFF` |
472+
| `TIDESDB_WARN_MAYBE_UNINIT` | Enable `-Wmaybe-uninitialized` (GCC only; requires an optimized build) | `OFF` |
472473
| `TIDESDB_WITH_SNAPPY` | Build with Snappy compression support | `ON` (`OFF` on SunOS) |
473474
| `TIDESDB_WITH_LZ4` | Build with LZ4 compression support | `ON` |
474475
| `TIDESDB_WITH_ZSTD` | Build with Zstandard compression support | `ON` |

src/content/docs/reference/c.md

Lines changed: 8 additions & 6 deletions
@@ -195,7 +195,7 @@ tidesdb_finalize();
195195
```c
196196
tidesdb_config_t config = {
197197
.db_path = "./mydb",
198-
.num_flush_threads = 2, /* Flush thread pool size (default: 2) */
198+
.num_flush_threads = 2, /* Flush thread pool size (default: 0 = auto, min(cpu_count, 4)) */
199199
.num_compaction_threads = 2, /* Compaction thread pool size (default: 2) */
200200
.log_level = TDB_LOG_INFO, /* Log level: TDB_LOG_DEBUG, TDB_LOG_INFO, TDB_LOG_WARN, TDB_LOG_ERROR, TDB_LOG_FATAL, TDB_LOG_NONE */
201201
.block_cache_size = 64 * 1024 * 1024, /* 64MB global block cache (default: 64MB) */
@@ -525,6 +525,7 @@ if (tidesdb_rename_column_family(db, "old_name", "new_name") != 0)
525525

526526
**Return values**
527527
- `TDB_SUCCESS` · Rename completed successfully
528+
- `TDB_ERR_INVALID_ARGS` · `db`, `old_name`, or `new_name` is NULL
528529
- `TDB_ERR_NOT_FOUND` · Column family with `old_name` doesn't exist
529530
- `TDB_ERR_EXISTS` · Column family with `new_name` already exists
530531
- `TDB_ERR_IO` · Failed to rename directory on disk
@@ -774,7 +775,7 @@ if (tidesdb_get_db_stats(db, &db_stats) == 0)
774775
| `unified_next_cf_index` | `uint32_t` | Next CF prefix index to assign in unified mode |
775776
| `unified_wal_generation` | `uint64_t` | Current generation number of the unified WAL |
776777
| `object_store_enabled` | `int` | 1 if an object store connector is attached |
777-
| `object_store_connector` | `const char*` | Name of the object store connector ("fs", "s3", or NULL) |
778+
| `object_store_connector` | `const char*` | Name of the object store connector ("fs", "s3", "unknown", or NULL when no store is attached) |
778779
| `local_cache_bytes_used` | `size_t` | Bytes currently used by the local SSTable cache (object store mode) |
779780
| `local_cache_bytes_max` | `size_t` | Local cache capacity in bytes (object store mode) |
780781
| `local_cache_num_files` | `int` | Number of SSTable files resident in the local cache |
@@ -813,8 +814,8 @@ if (tidesdb_get_cache_stats(db, &cache_stats) == 0)
813814
printf("Cache enabled: yes\n");
814815
printf("Total entries: %zu\n", cache_stats.total_entries);
815816
printf("Total bytes: %.2f MB\n", cache_stats.total_bytes / (1024.0 * 1024.0));
816-
printf("Hits: %lu\n", cache_stats.hits);
817-
printf("Misses: %lu\n", cache_stats.misses);
817+
printf("Hits: %" PRIu64 "\n", cache_stats.hits);
818+
printf("Misses: %" PRIu64 "\n", cache_stats.misses);
818819
printf("Hit rate: %.1f%%\n", cache_stats.hit_rate * 100.0);
819820
printf("Partitions: %zu\n", cache_stats.num_partitions);
820821
}
@@ -2152,7 +2153,8 @@ if (tidesdb_compact_range(cf, start, sizeof(start) - 1, end, sizeof(end) - 1) !=
21522153
**Return values**
21532154

21542155
- `TDB_SUCCESS` on success
2155-
- `TDB_ERR_INVALID_ARGS` if `cf` or either key pointer is NULL, or sizes are zero
2156+
- `TDB_ERR_INVALID_ARGS` if `cf` is NULL, if both `start_key` and `end_key` are NULL, or if a non-NULL key has size zero (a single NULL key is allowed and means an unbounded bound on that side)
2157+
- `TDB_ERR_LOCKED` if another compaction is already running on the column family
21562158
- Standard I/O and memory error codes if the merge cannot complete
21572159

21582160
### Purge Column Family
@@ -2272,7 +2274,7 @@ TidesDB uses separate thread pools for flush and compaction operations. Understa
22722274
```c
22732275
tidesdb_config_t config = {
22742276
.db_path = "./mydb",
2275-
.num_flush_threads = 4, /* Flush thread pool size (default: 2) */
2277+
.num_flush_threads = 4, /* Flush thread pool size (default: 0 = auto, min(cpu_count, 4)) */
22762278
.num_compaction_threads = 4, /* Compaction thread pool size (default: 2) */
22772279
.max_concurrent_flushes = 0, /* 0 = auto-match num_flush_threads (recommended) */
22782280
.log_level = TDB_LOG_INFO,
