Commit 39b4947

design doc update, tidesql reference update and c reference update to reflect tidesdb minor v9.1.0
1 parent f37c95f commit 39b4947

3 files changed

Lines changed: 98 additions & 1 deletion

src/content/docs/getting-started/how-does-tidesdb-work.md

Lines changed: 22 additions & 0 deletions
@@ -377,6 +377,14 @@ When a cached block has a pre-built key offset index (the indexed block format p
Iterator seek operations cache memtable sources (active memtable, immutable memtables, and transaction write buffer) on the iterator at creation time rather than recreating them on every seek call. This eliminates per-seek overhead of allocating source structs, initializing skip list cursors, traversing to the first entry, and creating initial key-value pairs. The active memtable is pinned with `try_ref` during iterator creation to prevent a concurrent rotation plus flush from freeing the memtable between the atomic load and the merge source creation. The pin is released after the merge source takes its own internal reference. Immutable memtables are snapshotted via the lock-free RCU snapshot mechanism with per-item `try_ref` for the same reason. The cached sources are repositioned to the target key on each seek using the existing cursor seek operations. A pre-allocated temporary source array on the iterator avoids malloc/free of the source list on every seek as well. Combined with the SSTable source cache (which persists across seeks via `cached_sources`), this means the hot seek path performs zero memory allocations.
#### Zero-Copy Memtable Merge Sources
SSTable merge sources expose their current key-value pair as a borrowed pointer into pinned block data via `source->inline_kv` with the `TDB_KV_FLAG_BORROWED` flag, which avoids allocating a fresh key-value pair on every cursor step. Memtable and unified-memtable merge sources use the same pattern. The skip list cursor returns key and value pointers into stable node memory, and the iterator pins the memtable (active via `try_ref`, immutable via refcount) for its entire lifetime, so the borrowed pointers remain valid until the next advance.
The merge heap materialises a stable owned copy into the iterator's double-buffered `pop_buf` arena only when the caller actually retains the popped entry. Discards inside the tombstone-skip loop do not trigger that materialisation.
The tombstone-skip loop itself is consolidated across forward and backward iteration into a single helper. When the heap's top entry is a visible tombstone, the helper copies the tombstone's key into a stable stack buffer (with a heap fallback for keys larger than `TDB_PREFIXED_KEY_STACK_MAX`) and then advances every other source whose current entry matches that key. The forward path uses `tidesdb_merge_heap_pop_discard`, which moves the top source's cursor forward without materialising into `pop_buf`, so each skipped tombstone costs one cursor step and zero key or value copies. Copying the tombstone key onto the stack before the skip loop prevents subsequent pops inside the loop from reusing the same `pop_buf` slot and overwriting the tombstone-key pointer that the comparator still depends on.
## Compaction

### Strategy
@@ -517,6 +525,20 @@ If a source encounters corruption while its cursor is advancing, the `tidesdb_me
Large values (those meeting or exceeding the value log threshold) flow through compaction rather than being copied byte-for-byte. The system reads the value from the source value log, recompresses it according to the current column family configuration (which may differ from the original compression setting), and writes the recompressed value to the destination value log. This allows compression settings to evolve over time without requiring a full database rebuild.
### Single-Delete and Pair Cancellation
A regular tombstone written by `tidesdb_txn_delete` has to be carried forward through every compaction until it reaches the largest active level, because any level below the compaction output could still contain an older put for the same key that the tombstone is masking. Dropping the tombstone earlier would re-expose that stale put. Workloads that insert each key once and then delete it once therefore pay a latency tax on reads: every range scan over the deleted region walks across the accumulated tombstones until a compaction at the bottom level finally reaps them.
`tidesdb_txn_single_delete` lets the caller opt out of that conservatism for keys that satisfy a simple contract: between any two single-deletes on the same key, and between the start of the key's history and its first single-delete, the key has been put at most once. With that promise the engine is free to drop a put and its matching single-delete together at the first compaction that sees both in the same merge input, regardless of level. Reads treat a single-delete exactly like any other tombstone; the difference lives entirely in the compaction merge.
The single-delete subtype is a second flag bit (`TDB_KV_FLAG_SINGLE_DELETE`) carried alongside `TDB_KV_FLAG_TOMBSTONE` in the kv-pair flag byte. The byte is already persisted by both the klog-block and B+tree SSTable formats, so the extra bit does not change the on-disk layout; older binaries that do not examine the single-delete bit still see a tombstone and treat the entry correctly. The bit is preserved through the write path: the WAL encodes it next to the existing tombstone bit, the skip list carries an equivalent `SKIP_LIST_FLAG_SINGLE_DELETE` bit on each version, memtable flush stamps it onto the flushed SSTable entry, and merge sources surface it on the kv-pair they expose to compaction.
Pair cancellation fires during the merge emit phase. The merge heap delivers same-key versions in descending sequence order, so the first entry popped for a key is the newest surviving version. The emit loop buffers that first-for-key entry as `pending` and only resolves its fate when the next distinct key arrives. While pending is held, any same-key entries popped behind it are dropped silently (the existing dedup rule). When the pending entry is a single-delete and the next older same-key version is a live put rather than another tombstone, the pair is flagged for cancellation. On resolve, a pending single-delete that paired with a put is dropped outright; a pending regular tombstone that did not pair follows the existing rules (dropped only when merging into the largest level); and unexpired live entries are emitted normally. The same lookahead runs in every emit site: the B+tree writer used by full-preemptive merges into the largest level, the klog-block inline loop of the full-preemptive merge, the klog-block inline loop of the dividing merge, and the klog-block inline loop of the partitioned merge.
The partitioned merge's inline loop has a mid-loop SSTable-split on `file_max` that is awkward to restructure around a one-step buffer, so it uses a narrower peek-based variant: when a popped single-delete's key matches the next top-of-heap source's current key and that source has a live put, the single-delete is dropped immediately and the existing same-key dedup sweeps the put on the next iteration. The net effect is the same for the dominant case where the put and the single-delete arrive adjacent in the merge input.
Calling `tidesdb_txn_single_delete` on a key that has been put more than once since the last single-delete is a contract violation; the engine cannot detect it, and the result is that only the most recent put is masked while older puts remain visible. Callers that cannot guarantee the contract must use `tidesdb_txn_delete` instead.
### Summary

TidesDB's compaction is a multi-faceted algorithm that employs three distinct merge policies, each optimized for different scenarios within the LSM-tree lifecycle. These policies work in concert with Dynamic Capacity Adaptation to automatically scale the tree structure up or down as data volume changes.

src/content/docs/reference/c.md

Lines changed: 35 additions & 1 deletion
@@ -1198,6 +1198,39 @@ tidesdb_txn_commit(txn);
tidesdb_txn_free(txn);
```

### Single-Delete
`tidesdb_txn_single_delete` writes a tombstone with the same read semantics as `tidesdb_txn_delete`, but carries a caller-provided promise that lets compaction drop the put and the tombstone together as soon as both appear in the same merge input, rather than carrying the tombstone forward until it reaches the largest active level.
The contract is that between any two single-deletes on the same key, and between the start of the key's history and its first single-delete, the key has been put **at most once**. The engine does not and cannot verify this at runtime; violating the contract can leave older puts visible after the single-delete and is a bug in the caller.
This is the right choice for workloads that insert each key exactly once and then delete it exactly once (classic insert-benchmark patterns, secondary-index entries on columns that are never updated, log-style tables with scheduled purges). It is **not** safe for tables that issue repeated updates to the same key.
```c
tidesdb_column_family_t *cf = tidesdb_get_column_family(db, "my_cf");
if (!cf) return -1;

tidesdb_txn_t *txn = NULL;
tidesdb_txn_begin(db, &txn);

const uint8_t *key = (uint8_t *)"mykey";
tidesdb_txn_single_delete(txn, cf, key, 5);

tidesdb_txn_commit(txn);
tidesdb_txn_free(txn);
```
Signature:
```c
int tidesdb_txn_single_delete(tidesdb_txn_t *txn,
                              tidesdb_column_family_t *cf,
                              const uint8_t *key,
                              size_t key_size);
```
Returns `TDB_SUCCESS` on success or a negative error code on failure. When in doubt, prefer `tidesdb_txn_delete`.
### Multi-Operation Transaction

```c
@@ -1865,7 +1898,8 @@ if (tidesdb_flush_memtable(cf) != 0)
- Graceful shutdown · Flush pending data before closing the database

**Behavior**
- Rotates the column family's active memtable and enqueues the rotated memtable for flush regardless of its current size (no write-buffer threshold gate)
- In unified-memtable mode the shared memtable is rotated through the unified flush path, so the call behaves the same whether the database is in per-CF or unified-memtable mode
- Returns immediately (non-blocking) -- flush runs asynchronously in background threads
- If flush is already running for the column family, the call succeeds but doesn't queue duplicate work
- Thread-safe -- can be called concurrently from multiple threads

src/content/docs/reference/tidesql.md

Lines changed: 41 additions & 0 deletions
@@ -365,6 +365,46 @@ Statements that touch many rows such as `LOAD DATA INFILE`, multi row `INSERT`,
The mid commit logic is shared between INSERT, UPDATE, and DELETE via a single `maybe_bulk_commit` helper, so the batching threshold and the iterator plus dup cache invalidation policy are identical across the three paths.


## Single-Delete Optimization
DELETE on a TidesDB table writes a tombstone into every column family the row touches: the primary row CF plus one CF per secondary, full-text, or spatial index. Regular tombstones have to be carried forward through every compaction until they reach the largest active level, because any level below could still contain an older put of the same key that the tombstone is masking. Insert-then-delete workloads (event streams, log tables, TTL-style purges, the classic iibench benchmark) pile these tombstones at the low end of the key space where DELETE range scans start, and the scan CPU climbs linearly with the backlog until compaction catches up.
The TidesDB library's single-delete primitive (`tidesdb_txn_single_delete`) lets compaction drop a put and its matching tombstone together the first time both appear in the same merge input, regardless of level. The caller's contract is "at most one put between single-deletes on the same key (or between the start of the key's history and its first single-delete)". For reads, a single-delete behaves exactly like a regular tombstone.
The plugin splits this across two behaviours:
### Secondary-index single-delete (automatic)
Every secondary index entry -- `(col_values, pk)` for a regular index, `(term, pk)` for a FULLTEXT index, `(hilbert_value, pk)` for a SPATIAL index -- is written exactly once per row lifetime and deleted exactly once, across every path the plugin takes: `INSERT`, `UPDATE` (which delete-plus-put when the indexed columns change), `DELETE`, `REPLACE INTO`, `INSERT ... ON DUPLICATE KEY UPDATE`. The same `(composite, pk)` bytes never see a second put without an intervening delete, so the single-delete contract holds by construction of the index key layout.
The plugin therefore uses `tidesdb_txn_single_delete` for every secondary-index delete automatically. No configuration, no user flag, no workload assumption. This alone covers three of the four tombstones per deleted row on a table with three secondary indexes.
### Primary-CF single-delete (opt-in per session)
The primary row CF is different. `UPDATE t SET non_pk_col = ...` writes a fresh row at the same `data_key(pk)`, producing a put-over-put. `REPLACE INTO` on a table without secondary indexes takes a short-circuit path that overwrites the primary row silently for the same reason. Under either pattern, dropping a primary-CF put and its later single-delete together at compaction can re-expose an older put -- a silent correctness problem the engine cannot detect from the outside.
Primary-CF single-delete is therefore behind the session variable `tidesdb_single_delete_primary`, default OFF. Enabling it is the caller's explicit promise that:
- The session performs no `UPDATE` on non-PK columns of TidesDB tables.
- The session performs no `REPLACE INTO` or `INSERT ... ON DUPLICATE KEY UPDATE` that hits the line-5143 silent-overwrite path on a table without secondary indexes.
- New rows with a given PK are always preceded by a `DELETE` of that PK (append-only or insert-then-delete).
Enable it only when the workload is known to fit this shape. Typical safe cases:
```sql
-- classic insert-then-delete (event stream, TTL purge, iibench-shape)
SET SESSION tidesdb_single_delete_primary = 1;
INSERT INTO events (...) VALUES ...; -- monotonic PK
DELETE FROM events WHERE ts < NOW() - INTERVAL 1 HOUR;
```
Leave it OFF for any session that may issue `UPDATE` on a non-PK column, `REPLACE INTO` on a no-secondary table, or `INSERT ... ON DUPLICATE KEY UPDATE` on a no-secondary table. Setting the variable ON in those scenarios can leak older row versions through reads after a compaction.
### When to expect a benefit
The larger the tombstone backlog at the scan head of your DELETE statements, the more the single-delete pair-cancellation helps. On iibench-shaped insert-then-delete workloads with three secondary indexes, enabling both the automatic secondary-index single-delete and the primary-CF session variable typically cuts the `max_d` sawtooth peak by 60 to 95 percent, depending on how long deletes have been running against the same key range without compaction catching up. Workloads with no DELETE see no benefit -- and no risk either, since the secondary-index path only changes behaviour on DELETE and UPDATE.
## Table Options

TidesDB exposes a rich set of per-table options that control the underlying column family's behavior. These are specified as table-level options in `CREATE TABLE` and are baked into the column family at creation time. They appear in `SHOW CREATE TABLE` output.
@@ -1185,6 +1225,7 @@ The engine exposes several system variables that control TidesDB's runtime behav
|----------|---------|-------------|
| `tidesdb_ttl` | 0 | Per-session TTL in seconds applied to INSERT/UPDATE; 0 means use the table-level default. Can be set with `SET SESSION` or `SET STATEMENT` |
| `tidesdb_skip_unique_check` | OFF | Skip uniqueness checks on primary key and unique secondary indexes during INSERT. Only safe when the application guarantees no duplicates (e.g., bulk loads with monotonic PKs) |
| `tidesdb_single_delete_primary` | OFF | Use single-delete semantics on the primary row CF for this session's DELETEs. See [Single-Delete Optimization](#single-delete-optimization) |
| `tidesdb_default_compression` | LZ4 | Default compression algorithm (NONE, SNAPPY, LZ4, ZSTD, LZ4_FAST) |
| `tidesdb_default_write_buffer_size` | 128 MB | Default write buffer size in bytes |
| `tidesdb_default_bloom_filter` | ON | Default bloom filter setting |
