| 1 | +--- |
| 2 | +title: TidesDB Table Access Method Reference |
| 3 | +description: Official reference for the TidesDB table access method for PostgreSQL. |
| 4 | +--- |
| 5 | + |
| 6 | +<div class="no-print"> |
| 7 | + |
| 8 | +If you want to download the source of this document, you can find it [here](https://github.com/tidesdb/tidesdb.github.io/blob/master/src/content/docs/reference/tidespg.md). |
| 9 | + |
| 10 | +<hr/> |
| 11 | + |
| 12 | +</div> |
| 13 | + |
| 14 | +TidesPG is a PostgreSQL Table Access Method extension that plugs TidesDB in as an alternative to PostgreSQL's built-in heap storage. |
| 15 | + |
| 16 | +`CREATE TABLE ... USING tidesdb` gives you a PG table whose rows live in TidesDB column families. Writes go through TidesDB's WAL, reads go through its block cache and (where configured) its bloom filters, and compaction/flushing is handled by TidesDB's background threads. |
| 17 | + |
| 18 | +## Why? |
| 19 | + |
| 20 | +Postgres's heap is excellent, but it's one storage model: page-based, with full tuple versions and a vacuum-driven reclamation story. LSM trees occupy a different point in the tradeoff space: cheap writes, tunable read amplification, good compression, and background compaction that reclaims space without a separate vacuum pass. TidesPG lets you pick that model per-table without giving up anything else about Postgres (SQL, transactions at the statement level, indexes, constraints, the planner). |
| 21 | + |
| 22 | +## When to pick tidesdb (vs. heap) |
| 23 | + |
| 24 | +LSM storage earns its keep in specific, well-understood situations. Use `USING tidesdb` for a table when: |
| 25 | + |
| 26 | +- Your writes are durable and frequent. With `synchronous_commit = on` + `tidesdb.sync_mode = full` (the default is `none`), LSM's sequential WAL beats heap's per-page fsync by a wide margin, often many times over. The gap grows with the commit rate because every fsync you save compounds. |
| 27 | +- On-disk space matters. LZ4 compression (default) typically buys a 40–60% smaller footprint than heap on the same rows. ZSTD buys more at a higher CPU cost. A smaller footprint also means more of the working set fits in the OS page cache. |
| 28 | +- You store large values. tidesdb streams anything over `tidesdb.klog_value_threshold` (default 512 B) into its own value-log, so there's no TOAST table, no TOAST pointer indirection, no decompression round-trip on read. This suits JSONB blobs, bytea, and long text columns. |
| 29 | +- The workload is insert-dominant and mostly ordered. Append-heavy tables (event streams, audit logs, time-series, change-data-capture) sort well under a 48-bit monotonic row counter; sequential scans stay tight and compaction rarely moves the same data twice. |
| 30 | +- You want to avoid VACUUM tuning. Dead rows are reclaimed by LSM compaction running on background threads. No table bloat tracking, no `vacuum_*` autovacuum knobs for that table, no anti-wraparound scares. |
| 31 | +- You have many tables with mixed hotness. Each tidesdb table is a separate column family with its own bloom/index/compression config; cold tables get squeezed hard, hot tables get more memtable and larger block cache. The built-in `tidesdb_cf_stats('my_table')` lets you see per-table amplification. |
| 32 | + |
| 33 | +Stick with heap when: |
| 34 | + |
| 35 | +- The workload is read-dominant on an in-RAM working set. Heap's `ctid -> page -> slot` is hard to beat when everything's already in shared_buffers. Our bench shows heap leading on pure SELECT TPS (~80% headroom over tidesdb on point lookups). |
| 36 | +- The workload is UPDATE-heavy. Heap's HOT update path keeps the same TID; tidesdb does delete+insert and relies on compaction to reclaim. Write amplification is therefore higher on tidesdb for small updates that feel in-place on heap. |
| 37 | +- You have high concurrent write fan-in. TidesDB takes a process-exclusive advisory lock on its directory, so every extra PG backend that touches a tidesdb table serializes. That is fine for modest concurrency but pathological at hundreds of parallel writers. (Fixable with a background-worker fronting design, on the roadmap.) |
| 38 | +- You rely on `PREPARE TRANSACTION`. tidespg refuses it with `ERRCODE_FEATURE_NOT_SUPPORTED`; use heap tables for any relation that will participate in an external-coordinator 2PC. |
| 39 | +- The tables are small enough that heap overhead is invisible. Sub-megabyte lookup tables don't benefit from compression and pay the LSM's skiplist + block-cache indirection cost. |
| 40 | + |
| 41 | +Per-table granularity is the real sell. You can mix `USING tidesdb` and the default heap in the same database, same transaction, same query. Put the 2 TB event log on tidesdb, keep the 50 kB reference tables on heap. |
| 42 | + |
| 43 | +## Requirements |
| 44 | + |
| 45 | +- PostgreSQL *18 or newer* (uses the PG 18 TableAM surface: the `ReadStream`-flavored `scan_analyze_next_block`, the new five-arg `scan_bitmap_next_tuple`, `pg_noreturn`, and `rs_base.st.rs_tbmiterator`) |
| 46 | +- TidesDB installed so that `<tidesdb/tidesdb.h>` is on the include path and `-ltidesdb` resolves |
| 47 | +- A C11-capable compiler and GNU make (standard PGXS requirements) |
| 48 | + |
| 49 | +## Build & install |
| 50 | + |
| 51 | +```bash |
| 52 | +make |
| 53 | +sudo make install |
| 54 | +``` |
| 55 | + |
| 56 | +If `pg_config` isn't on your `PATH`, point PGXS at it explicitly: |
| 57 | + |
| 58 | +```bash |
| 59 | +make PG_CONFIG=/usr/local/pgsql/bin/pg_config |
| 60 | +sudo make install PG_CONFIG=/usr/local/pgsql/bin/pg_config |
| 61 | +``` |
| 62 | + |
| 63 | +If TidesDB is installed somewhere non-standard: |
| 64 | + |
| 65 | +```bash |
| 66 | +make TIDESDB_CFLAGS="-I/opt/tidesdb/include" \ |
| 67 | + TIDESDB_LIBS="-L/opt/tidesdb/lib -ltidesdb -Wl,-rpath,/opt/tidesdb/lib" |
| 68 | +``` |
| 69 | + |
| 70 | +Then in `psql`: |
| 71 | + |
| 72 | +```sql |
| 73 | +CREATE EXTENSION tidesdb; |
| 74 | +``` |
| 75 | + |
| 76 | +## Usage |
| 77 | + |
| 78 | +```sql |
| 79 | +CREATE TABLE events ( |
| 80 | + id bigint, |
| 81 | + ts timestamptz, |
| 82 | + payload jsonb |
| 83 | +) USING tidesdb; |
| 84 | + |
| 85 | +INSERT INTO events VALUES (1, now(), '{"hello": "world"}'); |
| 86 | + |
| 87 | +-- Indexes work normally; they live in PG's regular index storage and |
| 88 | +-- point at tidesdb TIDs. |
| 89 | +CREATE INDEX ON events (id); |
| 90 | +CREATE INDEX ON events (ts); |
| 91 | + |
| 92 | +SELECT * FROM events WHERE id = 1; |
| 93 | +``` |
| 94 | + |
| 95 | +To make `tidesdb` the default for new tables in a session: |
| 96 | + |
| 97 | +```sql |
| 98 | +SET default_table_access_method = 'tidesdb'; |
| 99 | +``` |
| 100 | + |
| 101 | +## Architecture |
| 102 | + |
| 103 | +### Storage layout |
| 104 | + |
| 105 | +- One TidesDB handle per PG backend, rooted at `$PGDATA/tidesdb/`, opened lazily on the first TidesPG operation in a backend and closed via `on_proc_exit`. TidesDB's advisory lock on the db directory means only one backend can hold it at a time; `TidesPG_GetDB` retries on `TDB_ERR_LOCKED` with a short backoff to ride through PG's fork-per-backend handoff. |
| 106 | +- One column family per PG relation, named `r_<relfilenumber>`. Using `relfilenumber` (not `relname`) means renames are free and rewrites (TRUNCATE, CLUSTER, some ALTERs) swap storage cleanly by picking up a fresh relfilenumber. |
| 107 | +- Unified memtable mode is on by default (`tidesdb.unified_memtable`), so all CFs share one WAL and one in-memory skiplist, which fits multi-table Postgres transactions and lowers WAL write amplification. |
| 108 | + |
| 109 | +### Key encoding |
| 110 | + |
| 111 | +Each live tuple gets a unique 48-bit row counter, packed into an `ItemPointer` as: |
| 112 | + |
| 113 | +- `BlockNumber` = counter ÷ 1024 |
| 114 | +- `OffsetNumber` = (counter mod 1024) + 1 |
| 115 | + |
| 116 | +(Offset 0 is reserved in PG, hence the `+ 1`. 1024 stays well under `MaxOffsetNumber`, and packing 1024 consecutive counters into each block gives bitmap scans and sample scans, which work in (block, offset) space, meaningful block locality.) |
| 117 | + |
| 118 | +The TidesDB key is the 6-byte big-endian encoding of that counter. Big-endian means TidesDB's default `memcmp` comparator orders keys numerically, so forward iteration matches insertion order. |
| 119 | + |
| 120 | +The counter is persisted per-CF under a reserved 1-byte key (`0x00`). Because 1 byte ≠ 6 bytes, scans reject the counter key by length check without a special case. See [Row-counter allocation](#row-counter-allocation) below for the reservation scheme. |
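
The packing and key encoding described above can be sketched in plain C. This is a minimal illustration, not the extension's code: `SketchTid` stands in for PG's `ItemPointerData`, and every function name here is hypothetical. One caveat the sketch makes explicit: PG's `BlockNumber` is 32 bits wide, so this particular packing only covers counters whose block number fits below `0xFFFFFFFF`; how the extension treats larger counters is not stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ROWS_PER_BLOCK 1024u   /* counters per synthetic block */
#define KEY_LEN 6              /* 48-bit counter stored big-endian */

/* Hypothetical stand-in for PG's ItemPointerData. */
typedef struct {
    uint32_t block;    /* BlockNumber */
    uint16_t offset;   /* OffsetNumber, 1-based because 0 is reserved */
} SketchTid;

/* Pack a row counter into (block, offset). */
static inline SketchTid counter_to_tid(uint64_t counter)
{
    SketchTid tid;
    uint64_t block = counter / ROWS_PER_BLOCK;

    assert(block < UINT32_MAX);  /* 0xFFFFFFFF is InvalidBlockNumber in PG */
    tid.block = (uint32_t) block;
    tid.offset = (uint16_t) (counter % ROWS_PER_BLOCK + 1);
    return tid;
}

/* Inverse of counter_to_tid. */
static inline uint64_t tid_to_counter(SketchTid tid)
{
    assert(tid.offset >= 1 && tid.offset <= ROWS_PER_BLOCK);
    return (uint64_t) tid.block * ROWS_PER_BLOCK + (uint64_t) (tid.offset - 1);
}

/* Encode the low 48 bits of the counter, most significant byte first. */
static inline void counter_to_key(uint64_t counter, uint8_t key[KEY_LEN])
{
    for (int i = 0; i < KEY_LEN; i++)
        key[i] = (uint8_t) (counter >> (8 * (KEY_LEN - 1 - i)));
}

/* Decode a 6-byte big-endian key back into a counter. */
static inline uint64_t key_to_counter(const uint8_t key[KEY_LEN])
{
    uint64_t counter = 0;

    for (int i = 0; i < KEY_LEN; i++)
        counter = (counter << 8) | key[i];
    return counter;
}

/* Bytewise comparison; for big-endian keys this matches numeric order. */
static inline int key_cmp(const uint8_t a[KEY_LEN], const uint8_t b[KEY_LEN])
{
    return memcmp(a, b, KEY_LEN);
}

/* Row keys are exactly 6 bytes; the reserved counter key is 1 byte. */
static inline int is_row_key(size_t key_len)
{
    return key_len == KEY_LEN;
}
```

For instance, counter 1025 lands in block 1 at offset 2, and the key for 255 compares below the key for 256 under `memcmp`, which is the property that keeps forward iteration in insertion order.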
| 121 | + |
| 122 | +### Tuple layout |
| 123 | + |
| 124 | +Each row is stored as a raw `MinimalTuple`, the same wire form PG uses for slot materialization, starting with the 4-byte `t_len`. There is no tidespg-specific header, no magic sentinel, and no `xmin/xmax`, because visibility is handled by TidesDB's own MVCC. |
| 125 | + |
| 126 | +TidesDB values can be arbitrarily large, so we don't need PG's TOAST machinery, and the extension's `relation_needs_toast_table` callback returns `false`. Large JSONB / text columns go through TidesDB's own value-log (`vlog`), which stores any value over `klog_value_threshold` (default 512 B) out of line. |
| 127 | + |
| 128 | +### MVCC and transactions |
| 129 | + |
| 130 | +MVCC is delegated to TidesDB. Each backend holds at most one live TidesDB transaction, opened lazily on the first TidesPG operation in a PG transaction and committed / rolled back via `RegisterXactCallback`. PG subtransactions map to TidesDB savepoints named by `SubTransactionId`; if a subxact starts before we've opened the per-xact txn, the savepoints are replayed on the first lazy open so later `ABORT SUB` callbacks still find the right rollback point. |
| 131 | + |
| 132 | +Visibility, write-write conflict detection, and tombstone reclamation all ride on TidesDB's `commit_seq` / `snapshot_seq` machinery. Deletes call `tidesdb_txn_delete` (not payload rewrites with an `xmax` marker), so dead rows are cleaned up by TidesDB's own compaction; no separate VACUUM pass is required for space reclamation. |
| 133 | + |
| 134 | +Isolation mapping (PG -> TidesDB, configurable): |
| 135 | + |
| 136 | +| PG level | TidesDB level | |
| 137 | +|---------------------|--------------------------------------------| |
| 138 | +| `READ UNCOMMITTED` | `tidesdb.rc_isolation` (default `read_committed`) | |
| 139 | +| `READ COMMITTED` | `tidesdb.rc_isolation` (default `read_committed`) | |
| 140 | +| `REPEATABLE READ` | `TDB_ISOLATION_REPEATABLE_READ` | |
| 141 | +| `SERIALIZABLE` | `TDB_ISOLATION_SERIALIZABLE` | |
| 142 | + |
| 143 | +No TidesDB level matches PG's per-statement snapshot semantics exactly. `read_committed` is the closest in spirit (both allow non-repeatable reads) but TidesDB refreshes snapshots per-read rather than per-statement, so a statement that reads the same row twice may see different versions. Flip `tidesdb.rc_isolation` to `snapshot` to buy xact-level consistency plus write-write conflict detection, at the cost of stricter-than-PG-RC behavior. |
| 144 | + |
| 145 | +### Scan path and index fetches |
| 146 | + |
| 147 | +`scan_getnextslot` drives a `tidesdb_iter_t` over the backend's per-xact TidesDB transaction and does not open its own. Entries are filtered only by key length (the reserved 1-byte counter key is skipped), and visibility is handled below the iterator. Direction changes re-seek rather than trusting `prev` from an ambiguous position. |
| 148 | + |
| 149 | +`index_fetch_tuple` is a point `tidesdb_txn_get` on the same per-xact transaction, so index lookups see the backend's own uncommitted writes and benefit from TidesDB's snapshot consistency without per-fetch txn overhead. When the caller passes a `SnapshotDirty` (ON CONFLICT, exclusion-constraint checks), we explicitly populate `xmin / xmax / speculativeToken = 0`, because PG uses those fields to decide whether to wait on another transaction, and leaving them uninitialized causes `check_exclusion_or_unique_constraint` to livelock on stack garbage. |
| 150 | + |
| 151 | +### Parallel sequential scans |
| 152 | + |
| 153 | +Parallel scans share a `ParallelBlockTableScanDescData` whose `phs_nallocated` atomic we repurpose as a chunk claim. On its first call, each participant computes a chunk size from the CF's high-water counter and claims a range via `pg_atomic_fetch_add_u64`. The iterator is positioned at the chunk start via `tidesdb_iter_seek`; iteration stops once the current key leaves the chunk, at which point the participant claims the next chunk. This distributes work dynamically (no straggler starvation) without needing any up-front partitioning. |
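
The chunk-claim loop can be illustrated with C11 atomics. This is a sketch under stated assumptions rather than the extension's implementation: `atomic_fetch_add` stands in for `pg_atomic_fetch_add_u64`, the `SharedScan` and `Chunk` types and `claim_chunk` are hypothetical, and the chunk size is taken as given instead of being derived from the high-water counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared scan state; next_unclaimed plays the role of the
 * repurposed phs_nallocated atomic. */
typedef struct {
    _Atomic uint64_t next_unclaimed;  /* first counter nobody has claimed */
    uint64_t high_water;              /* one past the last counter to scan */
    uint64_t chunk_size;              /* counters handed out per claim */
} SharedScan;

/* A claimed counter range [start, end). */
typedef struct {
    uint64_t start;
    uint64_t end;
} Chunk;

/* Claim the next chunk. Returns false once the range is exhausted.
 * Each caller gets a disjoint range because fetch-add is atomic. */
static inline bool claim_chunk(SharedScan *scan, Chunk *out)
{
    uint64_t start;

    assert(scan->chunk_size > 0);
    start = atomic_fetch_add(&scan->next_unclaimed, scan->chunk_size);
    if (start >= scan->high_water)
        return false;

    out->start = start;
    out->end = start + scan->chunk_size;
    if (out->end > scan->high_water)
        out->end = scan->high_water;  /* clamp the final, partial chunk */
    return true;
}
```

A participant would seek its iterator to `start`, read keys until one reaches `end`, and then call `claim_chunk` again. Fast workers simply come back for more chunks, which is where the dynamic balancing comes from.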
| 154 | + |
| 155 | +### Bitmap heap scans and TABLESAMPLE |
| 156 | + |
| 157 | +Bitmap scans consume `rs_base.st.rs_tbmiterator`, and each `TBMIterateResult` gives us a synthetic page (our 1024-counter block). Lossy pages expand to all 1024 offsets, while exact pages use `tbm_extract_page_tuple`. Either way we fall through to point `tidesdb_txn_get` per offset, yielding only the offsets that resolve to live rows. |
| 158 | + |
| 159 | +`TABLESAMPLE` uses the same (block, offset) vocabulary. `scan_sample_next_block` asks the TSM routine for a synthetic block in `[0, high_water / 1024)`; `scan_sample_next_tuple` loops the TSM routine's offset picker against the block, skipping `NOT_FOUND` gaps. |
| 160 | + |
| 161 | +### ANALYZE |
| 162 | + |
| 163 | +`scan_analyze_next_block` reports the whole CF as a single virtual block (returns true once, then false). `scan_analyze_next_tuple` drains the iterator into the caller for reservoir sampling. Under unified-memtable mode per-CF stats can read zero until a flush, so `relation_size` / `estimate_rel_size` fall back to the reserved-counter high-water mark times a small per-row constant, which gives ANALYZE's block sampler a non-zero block count to iterate. |
| 164 | + |
| 165 | +### Row-counter allocation |
| 166 | + |
| 167 | +Counters are handed out from a per-backend in-memory reservation. When a chunk is exhausted (default `tidesdb.counter_chunk_size = 1024`), we round-trip to TidesDB under `TDB_ISOLATION_SNAPSHOT` to claim the next chunk, and concurrent reservers serialize via write-write conflict on the counter key. The persisted counter records the next unclaimed counter. Crashed / exited backends "leak" the unused tail of their chunk, but the 48-bit counter space absorbs that effortlessly. |
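
The reservation scheme can be sketched as follows. This is a simplified, single-threaded model, not the extension's code: the `PersistedCounter` struct stands in for the value stored under the reserved counter key, the plain read-modify-write in `claim_next_chunk` stands in for the `TDB_ISOLATION_SNAPSHOT` transaction (so conflict handling and retries are not modeled), and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the persisted per-CF counter:
 * it records the next counter that no backend has claimed. */
typedef struct {
    uint64_t next_unclaimed;
} PersistedCounter;

/* Per-backend in-memory reservation covering counters [next, end). */
typedef struct {
    uint64_t next;
    uint64_t end;
} Reservation;

/* Claim a fresh chunk from the persisted counter. In the real system this
 * step is a transactional round-trip; here it is a plain update. */
static inline void claim_next_chunk(PersistedCounter *pc, Reservation *r,
                                    uint64_t chunk_size)
{
    assert(chunk_size > 0);
    r->next = pc->next_unclaimed;
    r->end = r->next + chunk_size;
    pc->next_unclaimed = r->end;
}

/* Hand out one row counter, refilling the reservation when it runs dry. */
static inline uint64_t next_row_counter(PersistedCounter *pc, Reservation *r,
                                        uint64_t chunk_size)
{
    if (r->next == r->end)
        claim_next_chunk(pc, r, chunk_size);
    return r->next++;
}
```

With a chunk size of 4 and two backends starting from an empty reservation, the first backend draws 0 through 3 while the second draws from 4 upward. If the second backend exits after using only counter 4, counters 5 through 7 are never issued, which is the "leak" described above.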
| 168 | + |
| 169 | +## Testing |
| 170 | + |
| 171 | +```bash |
| 172 | +make installcheck |
| 173 | +``` |
| 174 | + |
| 175 | +This runs the regression tests under `test/sql/` and compares against `test/expected/`. You need a running PostgreSQL cluster whose `pg_config` matches the one you installed against. |
| 176 | + |
| 177 | +## Configuration |
| 178 | + |
| 179 | +All TidesDB tuning knobs are surfaced as `tidesdb.*` GUCs. `SIGHUP` settings reload on `pg_reload_conf()` and are picked up by new backends on their next `tidesdb_open`; `USERSET` settings take effect for any CF created after the change. Existing CFs carry their own config persisted on disk; changing a CF-level GUC does not retroactively reconfigure them. |
| 180 | + |
| 181 | +Database-level (`PGC_SIGHUP`, applied per-backend at `tidesdb_open()`): |
| 182 | + |
| 183 | +| GUC | Default | Notes | |
| 184 | +|-----|---------|-------| |
| 185 | +| `tidesdb.block_cache_size_mb` | 64 | Shared block cache size | |
| 186 | +| `tidesdb.num_flush_threads` | 2 | | |
| 187 | +| `tidesdb.num_compaction_threads` | 2 | | |
| 188 | +| `tidesdb.max_open_sstables` | 256 | | |
| 189 | +| `tidesdb.max_memory_usage_mb` | 0 | 0 = auto | |
| 190 | +| `tidesdb.unified_memtable` | `on` | Shared memtable+WAL across CFs | |
| 191 | +| `tidesdb.log_level` | `warn` | `debug`/`info`/`warn`/`error`/`fatal`/`none` | |
| 192 | +| `tidesdb.log_to_file` | `on` | `$PGDATA/tidesdb/LOG` | |
| 193 | +| `tidesdb.log_truncate_mb` | 24 | | |
| 194 | + |
| 195 | +Column-family-level (applied to newly-created CFs, `PGC_USERSET`): |
| 196 | + |
| 197 | +| GUC | Default | Notes | |
| 198 | +|-----|---------|-------| |
| 199 | +| `tidesdb.use_btree` | `off` | B+tree klog; faster point lookups (index fetches) | |
| 200 | +| `tidesdb.compression` | `lz4` | `none`/`snappy`/`lz4`/`lz4_fast`/`zstd` | |
| 201 | +| `tidesdb.enable_bloom_filter` | `on` | | |
| 202 | +| `tidesdb.bloom_fpr` | 0.01 | | |
| 203 | +| `tidesdb.enable_block_indexes` | `on` | | |
| 204 | +| `tidesdb.index_sample_ratio` | 1 | | |
| 205 | +| `tidesdb.block_index_prefix_len` | 16 | | |
| 206 | +| `tidesdb.klog_value_threshold` | 512 | Values > this go to vlog (no TOAST) | |
| 207 | +| `tidesdb.write_buffer_size_mb` | 128 | Drives both per-CF and unified memtable size | |
| 208 | +| `tidesdb.level_size_ratio` | 10 | | |
| 209 | +| `tidesdb.min_levels` | 5 | | |
| 210 | +| `tidesdb.sync_mode` | `none` | `none`/`full`/`interval` | |
| 211 | +| `tidesdb.sync_interval_us` | 128000 | Used when `sync_mode = interval` | |
| 212 | + |
| 213 | +Process-local: |
| 214 | + |
| 215 | +| GUC | Default | Notes | |
| 216 | +|-----|---------|-------| |
| 217 | +| `tidesdb.counter_chunk_size` | 1024 | Row counters reserved per TidesDB round-trip | |
| 218 | +| `tidesdb.rc_isolation` | `read_committed` | TidesDB isolation used for PG RC. Closest match by name, though TidesDB RC refreshes per-read vs PG's per-statement. Flip to `snapshot` for xact-level consistency and write-write conflict detection. | |
| 219 | +| `tidesdb.open_max_retries` | 20 | Retries on `TDB_ERR_LOCKED` during `tidesdb_open` | |
| 220 | +| `tidesdb.open_retry_delay_ms` | 50 | Delay between those retries | |
| 221 | + |
| 222 | +### Memory allocator |
| 223 | + |
| 224 | +tidespg does not plug Postgres's `palloc` / `pfree` into TidesDB's allocator hook (`tidesdb_init`), because TidesDB has flush / compaction / sync background threads and Postgres memory contexts are not thread-safe. TidesDB runs on its own libc `malloc` by default. Users who want a faster / contention-friendlier allocator can rebuild TidesDB with `-DTIDESDB_WITH_MIMALLOC=ON`, `-DTIDESDB_WITH_TCMALLOC=ON`, or `-DTIDESDB_WITH_JEMALLOC=ON`, which are all thread-safe drop-in replacements. tidespg's own allocations stay on `palloc` (per-xact memory context), so the memory lifetime of everything the extension owns is bounded. |
| 225 | + |
| 226 | +## Inspecting TidesDB state |
| 227 | + |
| 228 | +Three SQL-callable functions expose TidesDB's internal counters, useful for tuning. |
| 229 | + |
| 230 | +```sql |
| 231 | +SELECT * FROM tidesdb_cf_stats('events'); -- per-CF: levels, keys, cache, btree |
| 232 | +SELECT * FROM tidesdb_db_stats(); -- cluster-wide memory / queues / CFs |
| 233 | +SELECT * FROM tidesdb_cache_stats(); -- block cache hits / misses / hit_rate |
| 234 | +``` |
| 235 | + |