The `--allocator` flag only affects `libtidesdb.so`'s internal allocations (memtable, klog/vlog buffers, compaction scratch, txn ops). `mariadbd`'s allocator is unchanged. For a process-wide swap, also `LD_PRELOAD` the allocator at server startup:
Note that `--rebuild-plugin` does not rebuild `libtidesdb.so`, so changing `--allocator` requires a full install run (omit `--rebuild-plugin`) to take effect. Verify the linkage with:
@@ -186,11 +206,23 @@ The optimizer is aware of these indexes. The engine reports cost estimates based
186
206
187
207
The engine supports Index Condition Pushdown for secondary index scans. When the optimizer pushes a WHERE condition down to the storage engine, the engine evaluates it on the index key columns before performing the expensive primary key point-lookup into the data column family. Index entries that fail the condition are skipped without touching the data CF at all. This is the same pattern used by InnoDB - the engine decodes the index key columns into the record buffer and calls MariaDB's `handler_index_cond_check()` evaluator. ICP is supported for indexes on integer types (`TINYINT`, `SMALLINT`, `MEDIUMINT`, `INT`, `BIGINT`), temporal types (`DATE`, `DATETIME`, `TIMESTAMP`, `YEAR`), and fixed-length `CHAR`/`BINARY` columns with binary or latin1 charset. For indexes on multi-byte charset string columns (e.g., `utf8mb4`), the engine falls through to the standard PK-lookup path.
188
208
209
+
### Multi-Range Read (MRR)
210
+
211
+
The engine implements a custom MRR path for point-lookup batches such as `WHERE col IN (v1, v2, ..., vN)` on a primary or full-key unique index. When every range the optimizer hands the engine is a full-key point equality (`UNIQUE_RANGE | EQ_RANGE`) and there are at least two ranges, the engine buffers them, converts each key into comparable bytes, and sorts by those bytes so the LSM sees a monotone stream of seeks — much friendlier to the block cache and the merge-heap than N scattered seeks in user-supplied order. Primary-key lookups bypass the iterator entirely via `fetch_row_by_pk`; secondary-index lookups reuse a single cached iterator and do one seek per entry. Ranges whose rows have been deleted concurrently are silently skipped.
212
+
213
+
The engine deliberately declines MRR in three cases, falling back to the base handler's default implementation:
214
+
215
+
- Single-range scans (`count < 2`) — MRR has no sorting win for one key, and the eq_ref path is where pessimistic row locking engages.
216
+
- Non-point ranges — true `BETWEEN`/`<`/`>` scans stay on `read_range_first`.
217
+
- Partitioned tables — `ha_partition` already dispatches MRR across children using its own DS-MRR logic.
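The eligibility rules and the sort step described above can be modelled in a few lines. This is only a sketch: the flag values, the `KeyRange` type, the `plan_mrr` function, and the signed 32-bit key encoding are all assumptions made for illustration, not the engine's actual code.

```python
from dataclasses import dataclass

# Placeholder flag values for this sketch; the real constants live in MariaDB's headers.
UNIQUE_RANGE = 0x1
EQ_RANGE = 0x2
POINT_LOOKUP = UNIQUE_RANGE | EQ_RANGE


@dataclass
class KeyRange:
    key: int    # hypothetical: a single signed 32-bit key column
    flags: int


def comparable_bytes(value: int) -> bytes:
    """Encode a signed 32-bit int so that bytewise order matches numeric order."""
    # Flipping the sign bit sorts negatives below positives; big-endian puts
    # the most significant byte first.
    return ((value & 0xFFFFFFFF) ^ 0x80000000).to_bytes(4, "big")


def plan_mrr(ranges, partitioned=False):
    """Return seek keys in ascending byte order, or None to decline MRR."""
    if partitioned or len(ranges) < 2:
        return None  # partitioned table or single-range scan
    if any((r.flags & POINT_LOOKUP) != POINT_LOOKUP for r in ranges):
        return None  # at least one range is not a full-key point equality
    return sorted(comparable_bytes(r.key) for r in ranges)
```

Sorting on the encoded bytes rather than on the SQL values is what makes the seek order match the on-disk key order.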
218
+
189
219
190
220
## Auto-Increment
191
221
192
222
Auto-increment works in a similar way to InnoDB. The engine calls MariaDB's built-in `update_auto_increment()` mechanism during `write_row()`. Rather than calling `index_last()` on every INSERT (which would create and destroy a TidesDB merge-heap iterator each time), the engine maintains an in-memory atomic counter on the shared table descriptor. The counter is seeded once at table open time by seeking to the last key in the primary key column family, and is atomically incremented via a CAS loop on each INSERT - making auto-increment assignment O(1). When a user inserts an explicit value larger than the current counter, `write_row()` bumps the counter to match.
193
223
224
+
`TRUNCATE TABLE` and `ALTER TABLE ... AUTO_INCREMENT=N` both reset the counter via the engine's `reset_auto_increment` handler hook — the next generated ID equals `N` (or `1` after a bare `TRUNCATE`). This applies to both user-defined AUTO_INCREMENT columns and hidden-PK tables.
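The counter behaviour described above can be modelled as follows. This is a hedged sketch, not the engine's code: Python has no native compare-and-swap, so a lock emulates the atomic word, and the class name, method names, and the choice to store the last assigned value are assumptions.

```python
import threading


class AutoIncCounter:
    """Model of the per-table counter; stores the last value handed out."""

    def __init__(self, last_pk: int = 0):
        # Seeded once at table open from the last key in the primary key CF.
        self._last = last_pk
        self._lock = threading.Lock()  # stands in for the atomic word

    def _load(self) -> int:
        with self._lock:
            return self._last

    def _cas(self, expected: int, desired: int) -> bool:
        """Emulated compare-and-swap: store desired only if the value is unchanged."""
        with self._lock:
            if self._last != expected:
                return False
            self._last = desired
            return True

    def next_value(self) -> int:
        """CAS loop that claims the next ID without scanning the index."""
        while True:
            cur = self._load()
            if self._cas(cur, cur + 1):
                return cur + 1

    def observe_explicit(self, value: int) -> None:
        """Bump the counter when an INSERT supplies a larger explicit value."""
        while True:
            cur = self._load()
            if value <= cur or self._cas(cur, value):
                return

    def reset(self, n: int = 1) -> None:
        """Model of the reset hook: the next generated ID will be n."""
        with self._lock:
            self._last = n - 1
```

For example, a counter seeded with `last_pk=10` hands out 11 next; after an explicit insert of 50 it hands out 51; after `reset()` it hands out 1 again.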
225
+
194
226
```sql
195
227
CREATE TABLE tickets (
196
228
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
@@ -253,7 +285,7 @@ For workloads that depend on InnoDB-style row-level serialization, TidesDB provi
253
285
SET GLOBAL tidesdb_pessimistic_locking = ON;
254
286
```
255
287
256
-
When enabled, the engine acquires per-row locks on primary key values for all write-intent statements: `SELECT ... FOR UPDATE`, `UPDATE`, `DELETE`, and `INSERT`. Locks are held until the transaction commits or rolls back. A second transaction that attempts to access the same primary key value will block until the first transaction releases its lock, rather than proceeding optimistically and failing at commit time.
288
+
When enabled, the engine acquires row-level locks on primary key values for every write-intent statement, such as `SELECT ... FOR UPDATE`, `UPDATE`, `DELETE`, and `INSERT`. Locks are held until the transaction commits or rolls back. A second transaction that attempts to access the same primary key value will block until the first transaction releases its lock, rather than proceeding optimistically and failing at commit time.
257
289
258
290
The lock manager uses a partitioned hash table with 65,536 partitions, each protected by its own mutex. Primary key bytes are hashed (XXH3) to a partition, and each partition maintains a chain of lock entries keyed by the full comparable PK bytes. This gives per-row granularity without a global bottleneck. Lock entries are created on demand and persist in the hash table for the lifetime of the server, so repeated access to the same key reuses the existing entry without allocation.
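A minimal model of this layout is sketched below. The class and method names are hypothetical, a dict stands in for each partition's chain of entries, and BLAKE2b replaces XXH3 only because it ships with Python's standard library.

```python
import hashlib
import threading

NUM_PARTITIONS = 65_536  # partition count stated in the text above


class LockEntry:
    def __init__(self, pk: bytes):
        self.pk = pk        # full comparable PK bytes
        self.owner = None   # transaction holding the lock, if any


class Partition:
    def __init__(self):
        self.mutex = threading.Lock()  # one mutex per partition
        self.entries = {}              # stands in for the chain of lock entries


class LockTable:
    def __init__(self):
        self.partitions = [Partition() for _ in range(NUM_PARTITIONS)]

    def partition_for(self, pk: bytes) -> Partition:
        # Assumption: any well-mixed 64-bit hash works for this model.
        digest = hashlib.blake2b(pk, digest_size=8).digest()
        return self.partitions[int.from_bytes(digest, "big") % NUM_PARTITIONS]

    def entry_for(self, pk: bytes) -> LockEntry:
        """Create the entry on first use; later calls reuse the same object."""
        part = self.partition_for(pk)
        with part.mutex:
            entry = part.entries.get(pk)
            if entry is None:
                entry = part.entries[pk] = LockEntry(pk)
            return entry
```

Because only the partition that owns a key is locked, lookups for unrelated keys rarely contend with each other.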
259
291
@@ -287,6 +319,10 @@ COMMIT;
287
319
288
320
Before blocking on a held lock, the engine performs wait-for-graph cycle detection. It follows the chain of `waiting_on` pointers from the current lock's owner through any locks they are waiting on, up to a depth of 100 hops. If the chain leads back to the requesting transaction, a cycle exists and the engine returns `ER_LOCK_DEADLOCK` (ERROR 1213) immediately instead of blocking. The victim transaction can retry. This is the same error code and semantics that InnoDB uses for deadlocks.
289
321
322
+
The graph walk is performed without holding the current partition's mutex. The engine publishes its wait intent under the mutex, drops the mutex, walks the graph using atomic loads on the `owner_trx` and `waiting_on` fields (which are `std::atomic` precisely for this reason), then reacquires the mutex and rechecks state. This avoids serializing every locker on a partition behind a walk of up to 100 hops and lets other threads proceed while detection runs. Lock entries and transaction objects are never freed during normal operation, so pointer stability across the walk is guaranteed.
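The bounded walk can be sketched as below. This is an illustrative model under stated assumptions: the `Lock` and `Trx` classes and the `would_deadlock` function are hypothetical, plain attributes replace the atomic loads, and the caller is assumed to have already handled the case where the requester itself holds the wanted lock.

```python
MAX_HOPS = 100  # depth limit stated in the text above


class Lock:
    def __init__(self):
        self.owner_trx = None   # transaction currently holding this lock


class Trx:
    def __init__(self, name: str):
        self.name = name
        self.waiting_on = None  # lock this transaction is blocked on, if any


def would_deadlock(requester: Trx, wanted: Lock) -> bool:
    """Follow owner -> waiting_on links; True if they lead back to requester."""
    lock = wanted
    for _ in range(MAX_HOPS):
        if lock is None:
            return False        # the chain ends at a transaction that is not waiting
        owner = lock.owner_trx
        if owner is None:
            return False        # lock was released while we walked
        if owner is requester:
            return True         # cycle: blocking here could never be resolved
        lock = owner.waiting_on
    return False                # hop limit reached without finding a cycle
```

If transaction B holds `l2` and waits on `l1`, which A holds, then `would_deadlock(A, l2)` is true, and A would receive the deadlock error instead of blocking.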
323
+
324
+
A waiter that receives `KILL QUERY` during its wait is woken promptly. The `kill_query` handlerton callback broadcasts on the owning lock's condition variable, the wait loop observes `thd_killed()` on its next iteration, and the statement returns `HA_ERR_LOCK_WAIT_TIMEOUT` instead of hanging until the holder commits.
325
+
290
326
```sql
291
327
-- Connection A:
292
328
BEGIN;
@@ -322,6 +358,12 @@ CREATE TABLE ledger (
322
358
323
359
The available isolation levels are `READ_UNCOMMITTED`, `READ_COMMITTED`, `REPEATABLE_READ`, `SNAPSHOT`, and `SERIALIZABLE`.
324
360
361
+
### Bulk DML Batching
362
+
363
+
Statements that touch many rows, such as `LOAD DATA INFILE`, multi-row `INSERT`, `INSERT ... SELECT`, and `UPDATE` or `DELETE` over a range, keep the TidesDB transaction from growing unbounded by committing mid-statement in fixed-size batches. The engine hooks MariaDB's `start_bulk_insert`, `start_bulk_update`, and `start_bulk_delete` callbacks, counts the row operations (data write plus secondary index maintenance) against `TIDESDB_BULK_INSERT_BATCH_OPS` (50,000 ops), and at each threshold commits the current transaction and resets it with `READ_COMMITTED` for the next batch. This keeps statement memory bounded and keeps the transaction under `TDB_MAX_TXN_OPS` regardless of statement size. Autocommit semantics are preserved, so a failure rolls back only the current batch, and the statement as a whole reports the first error encountered.
364
+
365
+
The mid-statement commit logic is shared between INSERT, UPDATE, and DELETE via a single `maybe_bulk_commit` helper, so the batching threshold and the invalidation policy for the iterator and dup cache are identical across the three paths.
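The threshold check can be modelled roughly as follows. Only the names `maybe_bulk_commit` and `TIDESDB_BULK_INSERT_BATCH_OPS` come from the text; `FakeTxn`, `BulkStatement`, `record_row`, and the rule of one operation per secondary index entry are assumptions made for the sketch.

```python
TIDESDB_BULK_INSERT_BATCH_OPS = 50_000  # threshold stated in the text above


class FakeTxn:
    """Hypothetical stand-in for a TidesDB transaction handle."""

    def __init__(self, isolation: str):
        self.isolation = isolation
        self.committed = False

    def commit(self) -> None:
        self.committed = True


class BulkStatement:
    def __init__(self, isolation: str = "REPEATABLE_READ",
                 batch_ops: int = TIDESDB_BULK_INSERT_BATCH_OPS):
        self.txn = FakeTxn(isolation)
        self.batch_ops = batch_ops
        self.ops = 0       # operations in the current batch
        self.commits = 0   # mid-statement commits so far

    def record_row(self, secondary_indexes: int = 0) -> None:
        # Assumed accounting: one data write plus one op per index entry.
        self.ops += 1 + secondary_indexes
        self.maybe_bulk_commit()

    def maybe_bulk_commit(self) -> bool:
        """Commit and start a fresh READ_COMMITTED batch at the threshold."""
        if self.ops < self.batch_ops:
            return False
        self.txn.commit()
        self.txn = FakeTxn("READ_COMMITTED")
        self.ops = 0
        self.commits += 1
        return True
```

With a threshold of 6 and two operations per row, the third row triggers the first commit and the counter starts again from zero.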
366
+
325
367
326
368
## Table Options
327
369
@@ -795,7 +837,7 @@ CREATE TABLE docs (
795
837
) ENGINE=TIDESDB;
796
838
```
797
839
798
-
`M` controls the number of neighbors per node in the MHNSW graph (default 6, range 3-200). Higher values improve recall at the cost of slower inserts and more memory. `DISTANCE` selects the distance metric: `euclidean` (default) or `cosine`.
840
+
`M` controls the number of neighbors per node in the MHNSW graph (default 6, range 3 to 200). Higher values improve recall at the cost of slower inserts and more memory. `DISTANCE` selects the distance metric, either `euclidean` (default) or `cosine`.
799
841
800
842
### DML Support
801
843
@@ -1276,12 +1318,26 @@ Renaming a table renames all associated column families, including secondary ind
1276
1318
1277
1319
Dropping a table drops the main data CF and then enumerates and drops all index CFs that share the table's naming prefix. The operation is idempotent: if a CF does not exist, the engine simply continues.
1278
1320
1321
+
`DROP DATABASE` is wired through the engine's handlerton `drop_database` callback. MariaDB invokes it after removing the `.frm` files for the database; the engine then enumerates every column family whose name starts with `<db_name>__` and drops each one (data CFs plus their `__idx_*` secondary-index CFs), force-removes the on-disk directories, and purges schema-CF entries for the database (object-store mode only). This prevents orphaned column families from accumulating on disk when a database is dropped.
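The prefix match is the part that decides which column families are dropped. A small sketch, assuming a hypothetical `cfs_to_drop` helper and made-up CF names that follow the `<db_name>__` and `__idx_*` patterns from the text:

```python
def cfs_to_drop(db_name: str, all_cfs: list[str]) -> list[str]:
    """Select every column family that belongs to the given database."""
    # The double underscore is part of the prefix, so "mydb" does not match "mydb2__...".
    prefix = f"{db_name}__"
    return [cf for cf in all_cfs if cf.startswith(prefix)]


# Hypothetical CF names, used only to illustrate the match.
example = ["mydb__events", "mydb__events__idx_by_ts", "mydb2__events", "other__t"]
selected = cfs_to_drop("mydb", example)
```

Here `selected` contains the data CF and its secondary-index CF for `mydb`, while the CFs of `mydb2` and `other` are left alone.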
1322
+
1279
1323
```sql
1280
1324
RENAME TABLE events TO event_log;
1281
1325
TRUNCATE TABLE event_log;
1282
1326
DROP TABLE event_log;
1327
+
DROP DATABASE mydb; -- drops every TidesDB CF under mydb__*
1283
1328
```
1284
1329
1330
+
## Server Lifecycle Hooks
1331
+
1332
+
The engine wires several handlerton callbacks so that TidesDB cooperates with MariaDB's lifecycle and durability guarantees:
1333
+
1334
+
| Callback | Purpose |
|----------|---------|
| `flush_logs` | `FLUSH LOGS` (and `mariadb-backup`'s pre-copy step) syncs the TidesDB WAL so on-disk copies are a consistent snapshot. |
| `panic` | On signal-driven shutdown paths MariaDB may call `panic(HA_PANIC_CLOSE)` instead of the normal deinit; the engine performs an orderly `tidesdb_close()` there so pending commits are flushed. |
| `pre_shutdown` | Lets background threads quiesce before the deinit path begins; syncs the unified WAL so compactions in flight don't get killed mid-write. |
| `kill_query` | `KILL QUERY <id>` wakes any waiter blocked in `row_lock_acquire` and, in combination with `thd_killed()` checks scattered through the scan loops (`rnd_next`, `index_next`, `index_prev`, `index_next_same`, `spatial_scan_next`, `ft_read`), promptly terminates long-running statements with `HA_ERR_ABORTED_BY_USER`. |