Commit 953859e

PS-10596 [DOCS] - Improve documentation on MyRocks variables (part 3) 8.0
modified: docs/variables.md
1 parent 6417009 commit 953859e

1 file changed

Lines changed: 244 additions & 30 deletions

docs/variables.md

@@ -1314,14 +1314,80 @@ non-debug builds.
13141314
| Scope | Global |
13151315
| Data type | String |
13161316

1317-
The dafault value is:
1317+
This variable defines the default settings for the default column family, where
1318+
MyRocks stores data unless a table or index uses a dedicated column family.
1319+
1320+
#### How the option works
1321+
1322+
Instead of exposing every RocksDB tuning knob as its own MySQL variable, MyRocks
1323+
accepts a semicolon-separated list of parameters in RocksDB shorthand and passes
1324+
them to the engine.
1325+
1326+
**Scope:** These settings apply to every table that uses the default column
1327+
family.
1328+
1329+
**Syntax:** For example,
1330+
`write_buffer_size=64M;target_file_size_base=32M`.
1331+
1332+
On startup, the server applies this option to all existing column families. The option is read-only at runtime.
1333+
1334+
#### Commonly configured parameters
1335+
1336+
* `write_buffer_size` — Size of a single memtable. When the limit is reached, the memtable is frozen and scheduled for flush to an SST (Sorted String Table)
1337+
file.
1338+
1339+
* `max_write_buffer_number` — Maximum number of memtables that can accumulate in memory (one active, others waiting to flush). Raising `max_write_buffer_number` helps absorb
1340+
bursts of writes.
1341+
1342+
* `max_bytes_for_level_base` — Total size limit for level 1 of the LSM (Log-Structured Merge) tree; the level-1 limit influences how large subsequent levels become.
1343+
1344+
* `target_file_size_base` — Target size for a single SST file at level 1. Combined with level size limits, `target_file_size_base` affects how many files exist per level.
1345+
1346+
* `compression_per_level` — Compression algorithm per level (for example LZ4, ZSTD) to balance CPU and disk space.
1347+
1348+
* `block_based_table_factory` — Nested settings for blocks: Bloom filters, index types, block cache behavior.
1349+
1350+
* `level0_file_num_compaction_trigger` — How many L0 (level 0) files trigger a compaction.
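
As a rough illustration of how `write_buffer_size` and `max_write_buffer_number` interact, the following Python sketch estimates an upper bound on memtable memory for one column family. The helper names and the value `4` are hypothetical, and the sketch assumes that size suffixes such as `M` are binary multiples.

```python
def parse_size(text):
    """Convert a size such as '64M' to bytes (assumes binary multiples)."""
    units = {"K": 1024, "M": 1024**2, "G": 1024**3}
    text = text.strip().upper()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def memtable_budget(write_buffer_size, max_write_buffer_number):
    """Upper bound on memtable bytes for one column family (simplified)."""
    return parse_size(write_buffer_size) * max_write_buffer_number


# Hypothetical settings: 64M memtables, at most 4 of them in memory.
budget = memtable_budget("64M", 4)
print(budget // 1024**2, "MiB")  # prints "256 MiB"
```

The estimate applies per column family, so a server with several column families should multiply accordingly.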
1351+
1352+
#### Benefits of tuning
1353+
1354+
Tuning gives centralized control over compaction style, memory, and
1355+
I/O (input/output) parallelism. Adjusting the `rocksdb_default_cf_options`
1356+
string for the hardware (SSD versus HDD) is the
1357+
primary way to optimize MyRocks throughput.
1358+
1359+
The default varies by MyRocks version but generally balances LZ4 compression
1360+
with moderate buffer sizes (for example, 64 MB memtables). The default value
1361+
is:
13181362

13191363
```default
1320-
block_based_table_factory= {cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio;compression=kLZ4Compression;bottommost_compression=kLZ4Compression;
1364+
block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio;compression=kLZ4Compression;bottommost_compression=kLZ4Compression;
13211365
```
13221366

1323-
Specifies the default column family options for MyRocks. On startup, the server applies this option to all existing column families. This option is
1324-
read-only at runtime.
1367+
#### Breakdown of the main components
1368+
1369+
1. **Block-based table options** — How data is laid out and cached inside SST
1370+
(Sorted String Table) files:
1371+
1372+
* `cache_index_and_filter_blocks=1` — Forces the index and Bloom filter data into the RocksDB block cache instead of pinning them outside the cache, for better control of total memory.
1373+
1374+
* `filter_policy=bloomfilter:10:false` — Bloom filter with 10 bits per key. The `false` value turns off `use_block_based_builder`, which selects the newer, more efficient full filter format instead of the older block-based filter.
1375+
1376+
* `whole_key_filtering=1` — Adds the entire key to the Bloom filter, which gives the most accurate filtering for point lookups.
1377+
1378+
2. **Compaction and layout:** `level_compaction_dynamic_level_bytes=true`
1379+
adjusts per-level byte limits from the bottom level, reducing space
1380+
amplification and making sizing more self-tuning.
1381+
`compaction_pri=kMinOverlappingRatio` first compacts the files whose overlap
1382+
with the next level is smallest relative to their own size, which reduces write amplification.
1383+
1384+
3. **Read optimization:** `optimize_filters_for_hits=true` omits Bloom filters
1385+
on the bottommost level, where lookups for existing keys usually end,
1386+
reducing filter memory and storage.
1387+
1388+
4. **Compression:** `compression=kLZ4Compression` and
1389+
`bottommost_compression=kLZ4Compression` use LZ4 for low CPU overhead and
1390+
solid general-purpose compression.
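
To make the structure of the default string easier to inspect, the following Python sketch splits an options string into a nested dictionary. It is a minimal illustration, not the parser RocksDB uses; the function names are hypothetical, and it assumes that nested groups are always wrapped in `{...}`.

```python
def split_top_level(text):
    """Split on ';' while ignoring semicolons inside {...} groups."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == ";" and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return [p.strip() for p in parts if p.strip()]


def parse_cf_options(text):
    """Parse 'k=v;k={k=v;...}' into a (possibly nested) dictionary."""
    result = {}
    for item in split_top_level(text):
        key, _, value = item.partition("=")
        key, value = key.strip(), value.strip()
        if value.startswith("{") and value.endswith("}"):
            result[key] = parse_cf_options(value[1:-1])
        else:
            result[key] = value
    return result


DEFAULT = (
    "block_based_table_factory={cache_index_and_filter_blocks=1;"
    "filter_policy=bloomfilter:10:false;whole_key_filtering=1};"
    "level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;"
    "compaction_pri=kMinOverlappingRatio;compression=kLZ4Compression;"
    "bottommost_compression=kLZ4Compression;"
)

options = parse_cf_options(DEFAULT)
for name, value in options.items():
    print(name, "=", value)
```

Printing the result shows six top-level entries, with the three block-based table settings grouped under `block_based_table_factory`.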
13251391

13261392
### `rocksdb_delayed_write_rate`
13271393

@@ -1759,22 +1825,48 @@ This variable controls whether to write and check RocksDB file-level checksums.
17591825
| Data type | Numeric |
17601826
| Default | 1 |
17611827

1762-
Specifies whether to sync on every transaction commit,
1763-
similar to [innodb_flush_log_at_trx_commit :octicons-link-external-16:](https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit).
1764-
Enabled by default, which ensures ACID compliance.
1828+
Specifies whether the RocksDB Write-Ahead Log (WAL) is synchronized to disk
1829+
on every transaction commit, similar to
1830+
[innodb_flush_log_at_trx_commit :octicons-link-external-16:](https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit).
1831+
1832+
By default, the setting is enabled (`1`), which ensures ACID compliance by
1833+
guaranteeing that committed transactions are durable even in the event of a
1834+
crash. Choosing less strict values can improve performance at the cost of
1835+
durability.
1836+
1837+
#### Possible values
1838+
1839+
The variable accepts `0`, `1`, or `2`; the following describes each value:
1840+
1841+
* **`0` (Do not sync on commit)**
1842+
1843+
Compared with `1`, which waits for a durable WAL sync on every commit, and with `2`, which still writes the WAL on each commit but defers durable sync to a background thread, `0` does not flush or sync the WAL on commit. That removes the most commit-time I/O of the three settings, so you usually get the highest throughput and lowest commit latency, but you also accept the weakest durability: after a crash, recently committed work may be missing or the database may be inconsistent, often by a wider margin than the roughly once-per-second window commonly associated with `2`, and far beyond what `1` allows. The outcomes are as follows.
1844+
1845+
* Leaving the WAL unflushed and unsynced on transaction commit.
1846+
1847+
* Minimizing commit-time I/O relative to `1` and `2`.
17651848

1766-
Possible values:
1849+
* Risking extensive data loss or inconsistency after a crash compared with stricter settings.
17671850

1768-
* `0`: Do not sync on transaction commit.
1769-
This provides better performance, but may lead to data inconsistency
1770-
in case of a crash.
1851+
* **`1` (Sync on every commit) [Default]**
17711852

1772-
* `1`: Sync on every transaction commit.
1773-
This is set by default and recommended
1774-
as it ensures data consistency,
1775-
but reduces performance.
1853+
Compared with `0`, which does not flush or sync the WAL on commit, and with `2`, which writes the WAL on each commit but batches durable sync, `1` makes every commit wait until the WAL is durably on disk (typically a full sync such as `fsync`) before the commit returns. That is the usual choice when a successful commit must survive a crash: you get the strongest durability and ACID guarantees of the three settings. The tradeoff is the most synchronous disk work per commit, so commit latency and sustained write throughput are often lower than with `0` or `2` when commits are frequent or when disk sync is slow. The outcomes are as follows.
17761854

1777-
* `2`: Sync every second.
1855+
* Writing and syncing the WAL to disk at each transaction commit.
1856+
1857+
* Ensuring full durability and ACID compliance for committed work.
1858+
1859+
* Incurring the highest per-commit I/O and typically the slowest commits of the three settings.
1860+
1861+
* **`2` (Sync in background, typically once per second)**
1862+
1863+
With `1`, each commit waits until the WAL is durably on disk (typically a full sync such as `fsync`) before the commit returns. With `2`, each commit still writes the WAL, but the session usually does not wait for that durable sync; a background thread performs syncs on a schedule (for example, about once per second). So individual commits can return faster than with `1`, because they skip the per-commit sync wait, at the cost of possibly losing the last second of commits after a crash. The outcomes are as follows.
1864+
1865+
* Recording each commit in the WAL without blocking the commit on a full durable sync every time.
1866+
1867+
* Balancing performance and durability.
1868+
1869+
* Risking the loss of up to about one second of committed transactions after a crash.
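
The durability windows described above can be compared with a small Python model. This is a deliberately simplified sketch and not server behavior: it assumes that with `2` a background sync runs at every whole multiple of `sync_interval` seconds, and that with `0` no sync point exists at all (a worst case). The function name and timestamps are hypothetical.

```python
def commits_at_risk(commit_times, crash_time, mode, sync_interval=1.0):
    """Return the commit timestamps that a crash at crash_time could lose.

    Simplified model (an assumption, not the server implementation):
      mode 1: every commit is synced before it returns, so nothing is at risk.
      mode 2: a background sync runs at every multiple of sync_interval.
      mode 0: no sync is modeled, so every commit is treated as at risk.
    """
    done = [t for t in commit_times if t <= crash_time]
    if mode == 1:
        return []
    if mode == 2:
        last_sync = (crash_time // sync_interval) * sync_interval
        return [t for t in done if t > last_sync]
    if mode == 0:
        return done
    raise ValueError("mode must be 0, 1, or 2")


commits = [0.2, 0.9, 1.4, 2.1, 2.6]  # hypothetical commit times in seconds
for mode in (0, 1, 2):
    print(mode, commits_at_risk(commits, crash_time=2.7, mode=mode))
```

In this model only the commits made after the last background sync (at 2.0 seconds) are exposed with `2`, while `1` exposes none.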
17781870

17791871
### `rocksdb_flush_memtable_on_analyze`
17801872

@@ -1815,10 +1907,34 @@ This provides better accuracy, but may reduce performance.
18151907
| Dynamic | Yes |
18161908
| Scope | Global |
18171909
| Data type | Numeric |
1818-
| Default | 60000000 |
1910+
| Default | 60000000 (60 seconds) |
1911+
1912+
This variable determines how long (in microseconds) MyRocks caches statistics
1913+
gathered from the memtables for the query optimizer. When the optimizer
1914+
evaluates a query, it needs row-count estimates; data not yet flushed to disk
1915+
requires scanning memtables for accurate statistics.
1916+
1917+
#### How it works
1918+
1919+
**The cache:** To avoid the CPU cost of re-scanning memtables for every query,
1920+
MyRocks stores the results in a cache.
18191921

1820-
Specifies for how long the cached value of memtable statistics should
1821-
be used instead of computing it every time during the query plan analysis.
1922+
**The timer:** This variable defines the expiration of that cache.
1923+
1924+
The default is `60000000` microseconds (60 seconds).
1925+
1926+
Specifies for how long the cached value of memtable statistics should be used
1927+
instead of computing it on every query plan analysis.
1928+
1929+
#### Key trade-offs
1930+
1931+
**Higher value (for example, several minutes):** Improves performance in
1932+
high-query-rate environments by reducing how often statistics collection runs.
1933+
The optimizer may use stale data if the table is being updated rapidly.
1934+
1935+
**Lower value (for example, 1 second):** Gives the optimizer a near-real-time
1936+
view of the data and can yield better plans on volatile workloads, at the cost
1937+
of more CPU use during query optimization.
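
The cache-and-timer behavior can be sketched as a time-to-live cache. The Python class below is only an illustration of the idea, not MyRocks code; the class name, the `compute` callback (standing in for a memtable scan), and the injectable clock are all hypothetical.

```python
import time


class MemtableStatsCache:
    """Cache one statistics value for ttl_us microseconds."""

    def __init__(self, compute, ttl_us=60_000_000, clock=time.monotonic):
        self._compute = compute            # expensive call, e.g. a memtable scan
        self._ttl_s = ttl_us / 1_000_000   # the variable is given in microseconds
        self._clock = clock
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache is empty or expired: recompute and restart the timer.
            self._value = self._compute()
            self._expires_at = now + self._ttl_s
        return self._value


# Demonstration with a fake clock so that no real waiting is needed.
scans = []
fake_now = [0.0]
cache = MemtableStatsCache(lambda: scans.append(1) or len(scans),
                           clock=lambda: fake_now[0])
first = cache.get()    # triggers the first scan
fake_now[0] = 59.0
second = cache.get()   # still cached
fake_now[0] = 60.0
third = cache.get()    # expired, triggers a second scan
print(first, second, third)
```

A larger `ttl_us` means fewer scans and staler estimates; a smaller one means the opposite.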
18221938

18231939
### `rocksdb_force_flush_memtable_and_lzero_now`
18241940

@@ -2387,10 +2503,32 @@ Allowed range is up to `64`.
23872503
| Data type | Numeric |
23882504
| Default | 2 GB |
23892505

2390-
Specifies the maximum total size of WAL (write-ahead log) files,
2391-
after which memtables are flushed.
2392-
Default value is `2 GB`
2393-
The allowed range is up to `9223372036854775807`.
2506+
This setting limits the total disk space consumed by Write-Ahead Log (WAL)
2507+
files across all column families. The limit helps prevent log files from
2508+
exhausting disk capacity.
2509+
2510+
Specifies the maximum total size of WAL files, after which memtables are
2511+
flushed. Default value is `2 GB`. The allowed range is up to
2512+
`9223372036854775807`.
2513+
2514+
#### How it works
2515+
2516+
**The trigger:** When the combined size of all WAL files exceeds this
2517+
threshold, RocksDB identifies the oldest logs and forces a flush of their
2518+
associated memtables to SST files.
2519+
2520+
**The result:** Once the data is safely in an SST file, the corresponding
2521+
WAL files are deleted or archived, bringing total usage back under the
2522+
limit.
2523+
2524+
#### Key trade-offs
2525+
2526+
**Higher limit:** Improves write performance by allowing larger, infrequent
2527+
flushes. Disk usage increases and recovery time after a crash
2528+
lengthens (more log data to replay).
2529+
2530+
**Lower limit:** Keeps disk footprint small and recovery fast, but may
2531+
cause frequent forced flushes, which can throttle write throughput.
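
The trigger and its result can be modeled in a few lines of Python. This sketch assumes that WAL files are tracked oldest first and that flushing the memtables behind a log makes that log removable right away; the function name and the sizes are hypothetical.

```python
from collections import deque


def enforce_wal_limit(wal_sizes, max_total_wal_size):
    """Release the oldest WAL files until the total is within the limit.

    wal_sizes is ordered oldest first. Returns (remaining, flushed), where
    flushed lists the sizes of the logs whose memtables would be flushed.
    """
    remaining = deque(wal_sizes)
    total = sum(remaining)
    flushed = []
    while remaining and total > max_total_wal_size:
        oldest = remaining.popleft()
        flushed.append(oldest)
        total -= oldest
    return list(remaining), flushed


# Hypothetical WAL sizes in MB against a 2048 MB (2 GB) limit.
remaining, flushed = enforce_wal_limit([900, 800, 700], 2048)
print(remaining, flushed)  # prints "[800, 700] [900]"
```

With a lower limit the loop runs more often and releases more logs, which matches the forced-flush cost described above.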
23942532

23952533
### `rocksdb_merge_buf_size`
23962534

@@ -2547,7 +2685,37 @@ The dafault value is `ON` which means this variable is enabled.
25472685
| Data type | Unsigned Integer |
25482686
| Default | 0 |
25492687

2550-
The variable was implemented in [Percona Server for MySQL 8.0.27-18](release-notes/Percona-Server-8.0.27-18.md). Maximum memory to use when sorting an unmaterialized group for partial indexes. The 0(zero) value is defined as no limit.
2688+
The variable was implemented in [Percona Server for MySQL 8.0.27-18](release-notes/Percona-Server-8.0.27-18.md).
2689+
2690+
This variable sets the memory threshold (in bytes) for MyRocks to perform an
2691+
in-memory sort when a query is only partially satisfied by an index.
2692+
2693+
**The default: `0` (uncapped)**
2694+
2695+
When set to `0`, the memory limit is effectively removed.
2696+
2697+
**The result:** MyRocks may use as much RAM as needed to perform the sort
2698+
in-memory.
2699+
2700+
**The benefit:** Maximum performance for partial index scans by avoiding slow
2701+
disk-based filesorts.
2702+
2703+
**The risk:** Without a cap, a large query, or many concurrent queries, could
2704+
consume all available system memory, potentially leading to an out-of-memory
2705+
(OOM) crash.
2706+
2707+
#### Why change it
2708+
2709+
Setting this to a non-zero value (for example, `16777216` for 16 MB) introduces
2710+
a safety governor.
2711+
2712+
**Control:** MyRocks uses the optimized in-memory sort path only if the
2713+
result set fits within the defined memory budget.
2714+
2715+
**Stability:** If a sort requires more than the cap, MyRocks falls back to a
2716+
standard filesort. That path avoids unbounded memory use and protects overall
2717+
server stability, but affected queries often take longer to complete because
2718+
sorting uses disk (or temp files) instead of staying entirely in memory.
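
The decision described above reduces to a single comparison. The following Python sketch is a simplified model of it; the function name and the byte estimates are hypothetical, and the real server decides based on the actual size of the group being sorted.

```python
def choose_sort_path(estimated_bytes, max_mem):
    """Pick the sort path for a partial-index group (simplified model)."""
    if max_mem == 0 or estimated_bytes <= max_mem:
        return "in-memory"   # 0 means no limit
    return "filesort"        # over the cap: fall back to the slower path


CAP = 16777216  # 16 MB, as in the example above
print(choose_sort_path(8 * 1024**2, CAP))
print(choose_sort_path(32 * 1024**2, CAP))
print(choose_sort_path(32 * 1024**2, 0))
```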
25512719

25522720
### `rocksdb_pause_background_work`
25532721

@@ -3291,9 +3459,34 @@ Disabled by default.
32913459

32923460
The variable was implemented in [Percona Server for MySQL 8.0.33-25](release-notes/8.0.33-25.md).
32933461

3294-
If enabled, this variable uses HyperClockCache instead of default LRUCache for RocksDB.
3462+
This setting replaces the standard LRU (Least Recently Used) block cache with
3463+
a lock-free HyperClockCache implementation.
32953464

3296-
This variable is disabled (OFF) by default.
3465+
If enabled, MyRocks uses HyperClockCache instead of the default LRUCache for
3466+
RocksDB. The variable is disabled (`OFF`) by default.
3467+
3468+
#### Key benefits
3469+
3470+
**High concurrency:** Intended for many-core systems (16+ cores). Reduces the
3471+
global lock bottleneck found in traditional LRU caches.
3472+
3473+
**CPU efficiency:** Uses a clock algorithm instead of a linked list, avoiding
3474+
expensive memory writes and synchronization on every cache hit.
3475+
3476+
#### Trade-offs
3477+
3478+
**Performance:** Can offer significantly higher throughput under heavy read or
3479+
scan workloads.
3480+
3481+
**Memory:** Uses a fixed-size hash table, which can have slightly higher
3482+
per-entry memory overhead than a standard LRU cache.
3483+
3484+
**Precision:** Approximate LRU ordering is less precise but faster to maintain.
3485+
3486+
#### When to use
3487+
3488+
Enable this variable on high core-count servers, or when CPU profiling shows
3489+
high mutex contention within the RocksDB block cache.
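
To show why a clock algorithm is cheaper on cache hits than a linked-list LRU, here is a minimal single-threaded CLOCK (second-chance) cache in Python. It illustrates the general technique only: HyperClockCache itself is a lock-free, far more elaborate design, and the class below (a hypothetical name, assuming a capacity of at least 1) does not reflect its implementation.

```python
class ClockCache:
    """Fixed-capacity cache with CLOCK (second-chance) eviction."""

    def __init__(self, capacity):
        self.capacity = capacity   # assumed to be >= 1
        self.slots = []            # each slot is [key, value, referenced_bit]
        self.index = {}            # key -> slot position
        self.hand = 0

    def get(self, key):
        pos = self.index.get(key)
        if pos is None:
            return None
        self.slots[pos][2] = True  # a hit only sets a bit; nothing is reordered
        return self.slots[pos][1]

    def put(self, key, value):
        pos = self.index.get(key)
        if pos is not None:
            self.slots[pos][1] = value
            self.slots[pos][2] = True
            return
        if len(self.slots) < self.capacity:
            self.index[key] = len(self.slots)
            self.slots.append([key, value, False])
            return
        # Sweep: clear referenced bits until an unreferenced slot is found.
        while self.slots[self.hand][2]:
            self.slots[self.hand][2] = False
            self.hand = (self.hand + 1) % self.capacity
        victim_key = self.slots[self.hand][0]
        del self.index[victim_key]
        self.slots[self.hand] = [key, value, False]
        self.index[key] = self.hand
        self.hand = (self.hand + 1) % self.capacity


cache = ClockCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # marks "a" as recently used
cache.put("c", 3)     # evicts "b", the first unreferenced slot
print(cache.get("a"), cache.get("b"), cache.get("c"))
```

A hit writes one flag instead of moving a list node, which is the property that reduces synchronization cost; the price is that eviction order only approximates true LRU.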
32973490

32983491
### `rocksdb_use_io_uring`
32993492

@@ -3456,10 +3649,31 @@ Allowed range is up to `9223372036854775807`.
34563649
| Data type | Boolean |
34573650
| Default | ON |
34583651

3459-
Specifies whether the bloomfilter should use the whole key for filtering
3460-
instead of just the prefix.
3461-
Enabled by default.
3462-
Make sure that lookups use the whole key for matching.
3652+
The `rocksdb_whole_key_filtering` variable determines whether the Bloom filter
3653+
stores a hash of the entire key or just the prefix. The option is part of
3654+
RocksDB `BlockBasedTableOptions` and is enabled (`ON`) by default in MyRocks.
3655+
3656+
Specifies whether the Bloom filter should use the whole key for filtering
3657+
instead of just the prefix. Make sure that lookups use the whole key for
3658+
matching when whole-key filtering is enabled.
3659+
3660+
#### How it works
3661+
3662+
* **Enabled (default):** Both the whole key and the prefix are added to the Bloom
3663+
filter. Storing both yields the most accurate filtering for point lookups (for
3664+
example, `WHERE pk = 10`), so the engine can skip SST files that definitely do
3665+
not contain the key.
3666+
3667+
* **Disabled:** Only the prefix is stored in the Bloom filter. Because there are
3668+
typically fewer unique prefixes than unique keys, Bloom filters are much
3669+
smaller, saving significant memory.
3670+
3671+
#### The trade-off
3672+
3673+
Disabling whole-key filtering suits memory-constrained
3674+
environments or workloads dominated by prefix scans. Point lookups see a
3675+
higher false positive rate—the database may occasionally read from disk
3676+
because the prefix matched even though the full key did not.
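
The difference between the two settings can be illustrated with a small Python model. To isolate the effect of whole-key versus prefix filtering, the sketch uses an exact set as a stand-in for the Bloom filter, so it has no hash-collision false positives. The class name, the key format, and the prefix length of 5 are hypothetical.

```python
class KeyFilter:
    """Exact-set stand-in for a Bloom filter (no hash collisions)."""

    def __init__(self, prefix_len, whole_key_filtering=True):
        self.prefix_len = prefix_len
        self.whole_key_filtering = whole_key_filtering
        self.entries = set()

    def add(self, key):
        # The prefix is always stored; the whole key only when enabled.
        self.entries.add(("prefix", key[: self.prefix_len]))
        if self.whole_key_filtering:
            self.entries.add(("key", key))

    def may_contain(self, key):
        """Point-lookup check: False means the file can be skipped."""
        if self.whole_key_filtering:
            return ("key", key) in self.entries
        return ("prefix", key[: self.prefix_len]) in self.entries


enabled = KeyFilter(prefix_len=5, whole_key_filtering=True)
disabled = KeyFilter(prefix_len=5, whole_key_filtering=False)
for key in ["user1:001", "user1:002", "user2:001"]:
    enabled.add(key)
    disabled.add(key)

missing = "user1:999"  # shares a prefix with stored keys but was never added
print(enabled.may_contain(missing), len(enabled.entries))
print(disabled.may_contain(missing), len(disabled.entries))
```

With whole-key filtering the missing key is rejected at the cost of more filter entries; with prefix-only filtering the filter is smaller, but the lookup proceeds to read the file.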
34633677

34643678
### `rocksdb_write_batch_flush_threshold`
34653679
