Commit 3ab5d8b
[opt](variant) Reduce sparse variant parse memory (#63970)
## Proposed changes
This PR reduces Variant import memory usage for sparse dynamic keys.
- Parse plain dynamic non-doc Variant JSON into doc-value KV during
storage parse instead of eagerly expanding every path into parse-time
subcolumns.
- Keep the eager subcolumn parse path for cases that still depend on
parse-time path/type metadata: nested group, deprecated flatten nested,
predefined typed paths, and parent inverted index columns.
- For legacy multi-subcolumn `ColumnVariant` blocks that already reach
the segment writer, append them into a doc-value intermediate when the
writer buffer is still root-only. This avoids copying thousands of
sparse subcolumns into the writer append buffer.
- Stream large doc-value materialization sets one path at a time when
selected materialized doc paths exceed 64, instead of holding all
materialized sparse subcolumns in memory.
- Gate serialized Variant doc-value block payloads by BE exec version,
so WAL blocks written with `be_exec_version=9` replay correctly.
- Add focused Release-gated BE UT/perf coverage. The perf tests stay
skipped by default unless `DORIS_RUN_VARIANT_WRITE_PERF_TEST=1` is set.
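
The two decisions described in the list above (which parse path to take, and when to stream materialization) can be sketched as below. This is a minimal illustration under stated assumptions: `VariantColumnTraits`, `ParseMode`, `choose_parse_mode`, and `should_stream_materialization` are hypothetical names, not identifiers from the Doris source. Only the four eager-path conditions and the threshold of 64 come from this description.

```cpp
#include <cstddef>

// Hypothetical flags describing a Variant column. The field names are
// illustrative and are not taken from the Doris code base.
struct VariantColumnTraits {
    bool has_nested_group = false;
    bool uses_deprecated_flatten_nested = false;
    bool has_predefined_typed_paths = false;
    bool has_parent_inverted_index = false;
};

enum class ParseMode { DocValueKV, EagerSubcolumns };

// Columns that still depend on parse-time path/type metadata keep the eager
// subcolumn path; plain dynamic non-doc columns go to doc-value KV.
inline ParseMode choose_parse_mode(const VariantColumnTraits& t) {
    const bool needs_eager = t.has_nested_group || t.uses_deprecated_flatten_nested ||
                             t.has_predefined_typed_paths || t.has_parent_inverted_index;
    return needs_eager ? ParseMode::EagerSubcolumns : ParseMode::DocValueKV;
}

// Threshold from the description: stream one path at a time once the number
// of selected materialized doc paths exceeds 64.
constexpr std::size_t kStreamingPathThreshold = 64;

inline bool should_stream_materialization(std::size_t selected_doc_paths) {
    return selected_doc_paths > kStreamingPathThreshold;
}
```

With these definitions, a default-constructed `VariantColumnTraits` selects `ParseMode::DocValueKV`, and exactly 64 selected paths are still materialized together rather than streamed.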
## Problem
The first block of the downloaded WAL was written with
`be_exec_version=9`. It deserializes as a root-only Variant column:
```text
source_rows=9234 source_subcolumns=1 source_sparse_entries=0 source_doc_value_entries=0
```
So this real WAL does not reproduce the 1000-subcolumn writer-buffer
shape directly. It did expose the old-version doc-value serialization
compatibility issue, which is fixed by the version gate.
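
A version gate of this kind might look like the sketch below. The cutoff constant is an assumption: this description only says that blocks written with `be_exec_version=9` must replay correctly, so the value `10` and the names `kHypotheticalDocValueMinExecVersion` and `block_may_contain_doc_value_payload` are placeholders for illustration, not the real Doris definitions.

```cpp
#include <cstdint>

// Assumption: the first exec version whose serialized blocks may carry a
// doc-value payload. The real cutoff is not stated here; 10 is a placeholder
// chosen only because version 9 has to be treated as an old-format block.
constexpr std::int32_t kHypotheticalDocValueMinExecVersion = 10;

// A reader would consult this before trying to decode a doc-value payload,
// so that older blocks are deserialized with the previous layout.
inline bool block_may_contain_doc_value_payload(std::int32_t be_exec_version) {
    return be_exec_version >= kHypotheticalDocValueMinExecVersion;
}
```

Under this assumption, a WAL block tagged with version 9 is read without looking for a doc-value payload, which matches the root-only column shown above.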
Full first-block Release memory result:
```text
cir20431_wal_variant_memory rows=9234 source_rows=9234 source_subcolumns=1 source_sparse_entries=0 source_doc_value_entries=0 source_allocated=67452928 legacy_append_allocated=67649536 doc_value_append_allocated=67518464 doc_vs_legacy=0.998062
```
## Release Writer Perf
Release BE UT build (`BUILD_TYPE_UT=Release`, `-O3 -DNDEBUG`). Best of 3
measured runs after 1 warmup. JSON generation and JSON-to-Variant parse
are excluded; the measured window covers conversion, append, finish,
data/index writes, and file close.
```text
variant_write_perf case=sparse_keys rows=8192 paths_per_row=32 max_subcolumns=2 legacy_us=156634 kv_us=28341 kv_vs_legacy=0.180938 legacy_input_allocated=148013056 kv_input_allocated=10747904 legacy_append_buffer_allocated=148013056 optimized_append_buffer_bytes=6346316 kv_append_buffer_bytes=6346316 legacy_footer_columns=4 kv_footer_columns=4 legacy_materialized=2 kv_materialized=2 legacy_sparse=1 kv_sparse=1 legacy_doc_value=0 kv_doc_value=0 legacy_file_size=245659 kv_file_size=254370
variant_write_perf case=dense_keys rows=8192 paths_per_row=32 max_subcolumns=32 legacy_us=25043 kv_us=17450 kv_vs_legacy=0.696802 legacy_input_allocated=4980736 kv_input_allocated=10747904 legacy_append_buffer_allocated=4980736 optimized_append_buffer_bytes=5816320 kv_append_buffer_bytes=5816320 legacy_footer_columns=34 kv_footer_columns=34 legacy_materialized=32 kv_materialized=32 legacy_sparse=1 kv_sparse=1 legacy_doc_value=0 kv_doc_value=0 legacy_file_size=12963 kv_file_size=12963
```
Interpretation:
- Sparse-key shape: writer append buffer drops from `148013056` bytes to
`6346316` bytes, and write time is `0.181x` legacy.
- Dense-key shape: all 32 paths are materialized, so memory is roughly
comparable (`4980736` bytes legacy vs `5816320` bytes KV) and write time
is `0.697x` legacy in this Release run.
- Both cases report `legacy_doc_value=0 kv_doc_value=0`, with identical
footer column counts in each case. The non-doc path does not persist
both doc-value and materialized subcolumns; doc-value is used as an
intermediate before writer-side materialization/sparse writing.
## Testing
Current head: `e27bdc09407033191d7d9770dda1ed60d2bb55ef`
- `clang-format` on modified C++ files before commit.
- `git diff --check`
- `env BUILD_TYPE_UT=Release DORIS_RUN_VARIANT_WRITE_PERF_TEST=1
DORIS_CIR20431_WAL_FILE=/tmp/cir20431_wal/walbak/1_1778071896877_16715020621810688_group_commit_b94fdc3cd7568b18_30cfcd56d2cfeb8f
DORIS_CLANG_HOME=/mnt/disk1/claude-max/ldb_toolchain20
PATH=/mnt/disk1/claude-max/ldb_toolchain20/bin:$PATH ./run-be-ut.sh
--run
--filter='VariantColumnWriterReaderTest.test_legacy_subcolumn_append_as_doc_value_buffer:VariantColumnWriterReaderTest.test_storage_parse_kv_write_materialized_and_sparse:VariantColumnWriterReaderTest.test_cir20431_wal_doc_value_buffer_memory:VariantColumnWriterReaderTest.test_storage_parse_kv_write_perf'`
- `env BUILD_TYPE_UT=Release
DORIS_CIR20431_WAL_FILE=/tmp/cir20431_wal/walbak/1_1778071896877_16715020621810688_group_commit_b94fdc3cd7568b18_30cfcd56d2cfeb8f
DORIS_CIR20431_WAL_ROWS=9234
DORIS_CLANG_HOME=/mnt/disk1/claude-max/ldb_toolchain20
PATH=/mnt/disk1/claude-max/ldb_toolchain20/bin:$PATH ./run-be-ut.sh
--run
--filter='VariantColumnWriterReaderTest.test_cir20431_wal_doc_value_buffer_memory'`
The Release test script used its default parallelism (`PARALLEL -- 39`);
no manual `-j` was passed.

1 parent dc7b50c commit 3ab5d8b
8 files changed
Lines changed: 1716 additions & 105 deletions
File tree
- be
- src
- common
- exec/common
- storage/segment
- variant
- test/storage/segment