[fix](wal) Fix WAL replay loading 0 rows due to columns being marked as missing#62658
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
There was a problem hiding this comment.
Pull request overview
This PR fixes WAL replay returning 0 rows by ensuring WAL readers correctly report existing WAL columns during the unified init_reader(ReaderInitContext*) flow, preventing TableFormatReader::on_before_init_reader() from incorrectly treating them as “missing”.
Changes:
- Move
_tuple_descriptorinitialization (fromWalInitContext::output_tuple_descriptor) into_open_file_reader()so it is available beforeon_before_init_reader()callsget_columns(). - Enhance
_get_columns_impl()to populatename_to_typefor columns present in the WAL header, mapped viaTupleDescriptorslot unique IDs.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Status WalReader::_open_file_reader(ReaderInitContext* ctx) { | ||
| auto* wal_ctx = checked_context_cast<WalInitContext>(ctx); | ||
| _tuple_descriptor = wal_ctx->output_tuple_descriptor; |
There was a problem hiding this comment.
wal_ctx->output_tuple_descriptor is dereferenced later via _tuple_descriptor->slots() (both in _do_get_next_block and in _get_columns_impl). If it is ever null, this will crash and/or silently reintroduce the “all columns missing” behavior. Please add an explicit validation here (return an error Status with a clear message) instead of relying on implicit invariants.
| // Report WAL columns so on_before_init_reader does not mark them as missing. | ||
| if (_tuple_descriptor) { | ||
| for (auto* slot_desc : _tuple_descriptor->slots()) { | ||
| if (_column_pos_map.contains(slot_desc->col_unique_id())) { | ||
| name_to_type->emplace(slot_desc->col_name(), | ||
| slot_desc->get_data_type_ptr()); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
This change alters how WAL schema is reported to TableFormatReader::on_before_init_reader() (via _get_columns_impl), which was previously causing runtime “missing column” behavior. Please add/enable a unit/regression test that reproduces the 0-rows WAL replay issue and asserts the reader no longer marks present WAL columns as missing (e.g., via missing_cols() or by verifying the output block row count is preserved).
8611798 to
95e3fed
Compare
|
/review |
|
run buildall |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
There was a problem hiding this comment.
I traced the WAL replay path end-to-end (WalReader -> TableFormatReader::on_before_init_reader -> FileScanner::_set_fill_or_truncate_columns -> http_stream(format="wal")). The code change does fix the immediate bug: _tuple_descriptor is now initialized before on_before_init_reader() calls get_columns(), so WAL columns are no longer all marked missing and later overwritten by on_fill_missing_columns().
Blocking issue:
- This regression is still untested. The only scanner-based WAL reader coverage nearby is in
be/test/format/wal/wal_manager_test.cpp, and the relevant cases are currently disabled. Because the bug sits in the unifiedGenericReader/TableFormatReaderinit flow, we need an automated test that exercisesFileScanner + FORMAT_WAL(or an equivalent replay path) and proves rows are returned instead of being replaced with NULL/default values.
Critical checkpoints:
- Goal / correctness: The patch appears to accomplish the immediate replay fix for the reviewed path, but that behavior is not proven by a new automated test in this PR.
- Small / focused: Yes. The code change is minimal and directly targets the initialization-order bug.
- Concurrency: No new concurrency or locking risk was introduced here.
- Lifecycle / init order: This is the key improvement in the patch; moving
_tuple_descriptorsetup beforeon_before_init_reader()matches theGenericReader::init_reader()template-method order. - Config changes: None.
- Compatibility / storage format: None. No WAL format change is introduced.
- Parallel paths: I did not find another active WAL reader init path in BE that still depends on the broken ordering.
- Special conditions: The new
_tuple_descriptor-guarded schema population is straightforward and matches the current replay flow. - Test coverage: Missing for the exact regression; existing nearby scanner tests are disabled.
- Observability: No additional logging or metrics are needed for this localized fix.
- Transaction / persistence / data correctness: No transaction or persistence protocol changes. The data-correctness fix itself looks right for preventing replayed rows from being nulled out.
- Performance: Negligible impact; only a one-time slot scan during schema discovery.
Please add or re-enable a WAL scanner/replay regression test before merging.
| @@ -131,6 +131,14 @@ Status WalReader::_get_columns_impl(std::unordered_map<std::string, DataTypePtr> | |||
| } catch (const std::invalid_argument& e) { | |||
| return Status::InvalidArgument("Invalid format, {}", e.what()); | |||
| } | |||
There was a problem hiding this comment.
Please add a regression test for this path. This patch fixes the on_before_init_reader() / get_columns() ordering bug, but the nearby scanner-based WAL tests in be/test/format/wal/wal_manager_test.cpp are currently disabled, and this PR does not add replacement coverage. Without a FileScanner + FORMAT_WAL test (or equivalent replay coverage), this exact all columns marked missing -> replay writes NULL/defaults regression can slip back in unnoticed.
|
skip check_coverage |
WalReader::_get_columns_impl did not populate the name_to_type output
parameter. This caused TableFormatReader::on_before_init_reader to mark
all WAL columns as missing, and on_after_read_block then replaced valid
deserialized data with NULL default values via on_fill_missing_columns.