Skip to content

[fix](wal) Fix WAL replay loading 0 rows due to columns being marked as missing#62658

Merged
hello-stephen merged 1 commit intoapache:masterfrom
mymeiyi:fix-replay-wal-0421
Apr 21, 2026
Merged

[fix](wal) Fix WAL replay loading 0 rows due to columns being marked as missing#62658
hello-stephen merged 1 commit intoapache:masterfrom
mymeiyi:fix-replay-wal-0421

Conversation

@mymeiyi
Copy link
Copy Markdown
Contributor

@mymeiyi mymeiyi commented Apr 21, 2026

WalReader::_get_columns_impl did not populate the name_to_type output
parameter. This caused TableFormatReader::on_before_init_reader to mark
all WAL columns as missing, and on_after_read_block then replaced valid
deserialized data with NULL default values via on_fill_missing_columns.

Copilot AI review requested due to automatic review settings April 21, 2026 02:55
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes WAL replay returning 0 rows by ensuring WAL readers correctly report existing WAL columns during the unified init_reader(ReaderInitContext*) flow, preventing TableFormatReader::on_before_init_reader() from incorrectly treating them as “missing”.

Changes:

  • Move _tuple_descriptor initialization (from WalInitContext::output_tuple_descriptor) into _open_file_reader() so it is available before on_before_init_reader() calls get_columns().
  • Enhance _get_columns_impl() to populate name_to_type for columns present in the WAL header, mapped via TupleDescriptor slot unique IDs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +44 to +46
Status WalReader::_open_file_reader(ReaderInitContext* ctx) {
auto* wal_ctx = checked_context_cast<WalInitContext>(ctx);
_tuple_descriptor = wal_ctx->output_tuple_descriptor;
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wal_ctx->output_tuple_descriptor is dereferenced later via _tuple_descriptor->slots() (both in _do_get_next_block and in _get_columns_impl). If it is ever null, this will crash and/or silently reintroduce the “all columns missing” behavior. Please add an explicit validation here (return an error Status with a clear message) instead of relying on implicit invariants.

Copilot uses AI. Check for mistakes.
Comment on lines +134 to +142
// Report WAL columns so on_before_init_reader does not mark them as missing.
if (_tuple_descriptor) {
for (auto* slot_desc : _tuple_descriptor->slots()) {
if (_column_pos_map.contains(slot_desc->col_unique_id())) {
name_to_type->emplace(slot_desc->col_name(),
slot_desc->get_data_type_ptr());
}
}
}
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change alters how WAL schema is reported to TableFormatReader::on_before_init_reader() (via _get_columns_impl), which was previously causing runtime “missing column” behavior. Please add/enable a unit/regression test that reproduces the 0-rows WAL replay issue and asserts the reader no longer marks present WAL columns as missing (e.g., via missing_cols() or by verifying the output block row count is preserved).

Copilot uses AI. Check for mistakes.
@mymeiyi mymeiyi force-pushed the fix-replay-wal-0421 branch from 8611798 to 95e3fed Compare April 21, 2026 03:03
@mymeiyi
Copy link
Copy Markdown
Contributor Author

mymeiyi commented Apr 21, 2026

/review

@mymeiyi
Copy link
Copy Markdown
Contributor Author

mymeiyi commented Apr 21, 2026

run buildall

@github-actions
Copy link
Copy Markdown
Contributor

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added approved Indicates a PR has been approved by one committer. reviewed labels Apr 21, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR approved by anyone and no changes requested.

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I traced the WAL replay path end-to-end (WalReader -> TableFormatReader::on_before_init_reader -> FileScanner::_set_fill_or_truncate_columns -> http_stream(format="wal")). The code change does fix the immediate bug: _tuple_descriptor is now initialized before on_before_init_reader() calls get_columns(), so WAL columns are no longer all marked missing and later overwritten by on_fill_missing_columns().

Blocking issue:

  • This regression is still untested. The only scanner-based WAL reader coverage nearby is in be/test/format/wal/wal_manager_test.cpp, and the relevant cases are currently disabled. Because the bug sits in the unified GenericReader / TableFormatReader init flow, we need an automated test that exercises FileScanner + FORMAT_WAL (or an equivalent replay path) and proves rows are returned instead of being replaced with NULL/default values.

Critical checkpoints:

  • Goal / correctness: The patch appears to accomplish the immediate replay fix for the reviewed path, but that behavior is not proven by a new automated test in this PR.
  • Small / focused: Yes. The code change is minimal and directly targets the initialization-order bug.
  • Concurrency: No new concurrency or locking risk was introduced here.
  • Lifecycle / init order: This is the key improvement in the patch; moving _tuple_descriptor setup before on_before_init_reader() matches the GenericReader::init_reader() template-method order.
  • Config changes: None.
  • Compatibility / storage format: None. No WAL format change is introduced.
  • Parallel paths: I did not find another active WAL reader init path in BE that still depends on the broken ordering.
  • Special conditions: The new _tuple_descriptor-guarded schema population is straightforward and matches the current replay flow.
  • Test coverage: Missing for the exact regression; existing nearby scanner tests are disabled.
  • Observability: No additional logging or metrics are needed for this localized fix.
  • Transaction / persistence / data correctness: No transaction or persistence protocol changes. The data-correctness fix itself looks right for preventing replayed rows from being nulled out.
  • Performance: Negligible impact; only a one-time slot scan during schema discovery.

Please add or re-enable a WAL scanner/replay regression test before merging.

@@ -131,6 +131,14 @@ Status WalReader::_get_columns_impl(std::unordered_map<std::string, DataTypePtr>
} catch (const std::invalid_argument& e) {
return Status::InvalidArgument("Invalid format, {}", e.what());
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a regression test for this path. This patch fixes the on_before_init_reader() / get_columns() ordering bug, but the nearby scanner-based WAL tests in be/test/format/wal/wal_manager_test.cpp are currently disabled, and this PR does not add replacement coverage. Without a FileScanner + FORMAT_WAL test (or equivalent replay coverage), this exact all columns marked missing -> replay writes NULL/defaults regression can slip back in unnoticed.

@hello-stephen
Copy link
Copy Markdown
Contributor

skip check_coverage

@hello-stephen hello-stephen merged commit 369180e into apache:master Apr 21, 2026
32 of 33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants