Skip to content

[CORE-6913] Cloud Storage Scrubber: Fix false positive for replaced segments#30062

Merged
oleiman merged 2 commits intodevfrom
ci/core-6913/internal-scrub
Apr 9, 2026
Merged

[CORE-6913] Cloud Storage Scrubber: Fix false positive for replaced segments#30062
oleiman merged 2 commits intodevfrom
ci/core-6913/internal-scrub

Conversation

@oleiman
Copy link
Copy Markdown
Member

@oleiman oleiman commented Apr 3, 2026

The scrub false-positive filter in process_anomalies() only checked whether a segment with the same offset range existed in the manifest. A compacted reupload produces a replacement segment at the same offset range but with a different name (different size). When GC deleted the old segment from cloud storage while the scrubber was still referencing a stale manifest, the filter kept the anomaly because the offset range still matched—even though the current segment at that range was a different (replacement) object that existed in cloud storage.

Compare generate_remote_segment_name() for the manifest entry and the reported-missing segment so that replacements with the same offset range but different identity are correctly recognized as false positives.

Fixes CORE-6913.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

Improvements

  • Improves false positive detection in the cloud storage scrubber to filter out compacted reuploads.

@oleiman oleiman self-assigned this Apr 3, 2026
Copilot AI review requested due to automatic review settings April 3, 2026 03:32
@oleiman oleiman mentioned this pull request Apr 3, 2026
7 tasks
@oleiman oleiman added the claude-review Adding this label to a PR will trigger a workflow to review the code using claude. label Apr 3, 2026
Comment thread src/v/cloud_storage/partition_manifest.cc Outdated
@claude
Copy link
Copy Markdown

claude Bot commented Apr 3, 2026

Code Review

Summary: Clean, well-motivated fix. The root cause analysis is sound — the old filter only checked offset-range existence, which is insufficient when a compacted reupload produces a replacement segment at the same range but with a different identity (encoded in the segment name via size_bytes). Comparing generate_remote_segment_name() is the right approach since it already encodes the distinguishing fields (base_offset, committed_offset, size_bytes, term) for v2/v3 name formats.

Logic correctness: Verified both erase_if predicates. The three cases are handled correctly:

  1. Offset range no longer in manifest → erase (false positive) ✅
  2. Offset range matches but segment name differs (replacement) → erase (false positive) ✅
  3. Offset range and name both match → keep (real anomaly) ✅

Test: The new test_filtering_of_compacted_reupload test is thorough — it exercises the exact race condition described in the PR (scrubber sees stale manifest → compacted reupload replaces segment → GC deletes original → filter must recognize false positive). The assertion that generate_remote_segment_name actually differs between original and compacted segment is a nice sanity check.

One minor item: Left an inline comment — segment_with_offset_range_exists() is now unused and can be removed.

Overall this looks good to merge.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a cloud storage scrubber false-positive where a missing-segment anomaly could be retained after a compacted reupload replaced the segment at the same offset range but with a different remote identity.

Changes:

  • Update partition_manifest::process_anomalies() to treat “same offset range but different remote segment name” as a false positive (filters the anomaly).
  • Apply the same replacement-aware filtering to both missing_segments and segment_metadata_anomalies.
  • Add a unit test reproducing the compacted reupload race and asserting the anomaly is filtered.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/v/cloud_storage/partition_manifest.cc Enhances anomaly filtering by comparing generated remote segment names (identity) in addition to offset range matching.
src/v/cloud_storage/tests/anomalies_detector_test.cc Adds a test case covering the compacted reupload replacement scenario to prevent regressions.

Comment thread src/v/cloud_storage/partition_manifest.cc Outdated
@vbotbuildovich
Copy link
Copy Markdown
Collaborator

CI test results

test results on build#82714
test_status test_class test_method test_arguments test_kind job_url passed reason test_history
FAIL level_zero_gc_mt_test start_pause_start_cycle unit https://buildkite.com/redpanda/redpanda/builds/82714#019d5167-16a2-4f26-a9fa-23f191cae713 0/1

The scrub false-positive filter in process_anomalies() only checked
whether a segment with the same offset range existed in the manifest.
A compacted reupload produces a replacement segment at the same
offset range but with a different name (different size). When GC
deleted the old segment from cloud storage while the scrubber was
still referencing a stale manifest, the filter kept the anomaly
because the offset range still matched—even though the current
segment at that range was a different (replacement) object that
existed in cloud storage.

Compare generate_remote_segment_name() for the manifest entry and
the reported-missing segment so that replacements with the same
offset range but different identity are correctly recognized as
false positives.

Fixes CORE-6913.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
@oleiman oleiman force-pushed the ci/core-6913/internal-scrub branch from 5cfa54c to 00d8e4f Compare April 8, 2026 03:23
@oleiman oleiman requested a review from Copilot April 8, 2026 03:23
@oleiman oleiman added claude-review Adding this label to a PR will trigger a workflow to review the code using claude. and removed claude-review Adding this label to a PR will trigger a workflow to review the code using claude. labels Apr 8, 2026
Comment thread src/v/cloud_storage/partition_manifest.h
@claude
Copy link
Copy Markdown

claude Bot commented Apr 8, 2026

Code Review

Overall: This is a clean, well-motivated fix. LGTM with one minor nit.

Summary

The fix correctly addresses the false-positive race between the scrubber and compacted segment reuploads. The old segment_with_offset_range_exists only compared offset range, so when a compacted reupload replaced a segment at the same offset range (but with different size), the filter failed to recognize the anomaly as a false positive. The new segment_with_same_identity_exists compares generate_remote_segment_name() which encodes size_bytes (for v2/v3 format), correctly distinguishing the replacement.

What I verified

  • Logic correctness: Traced through all erase_if call sites. The predicate logic is correct:
    • Genuinely missing segment (same identity still in manifest) → anomaly kept ✅
    • Segment merged into a new range → anomaly removed as false positive ✅
    • Compacted reupload (same offset range, different size) → anomaly removed as false positive ✅
  • No regressions for the merge case: The committed_offset check is still in the early return, so merged segments (different offset range) are still filtered correctly.
  • Performance: generate_remote_segment_name creates small strings via ssx::sformat but is only called inside anomaly filtering which processes a small number of entries. No concern.

Tests

Both new test cases are well-structured and clearly document the race scenario. Good use of BOOST_REQUIRE_NE to assert the segment names actually differ — this makes the test independent of name format internals.

Minor nit

Left one inline comment about the doc comment on segment_with_same_identity_exists — for sname_format::v1, the generated name doesn't include size_bytes or committed_offset, so the "identity" comparison is weaker than documented. Not a real concern since v1 is legacy, but worth a comment clarification.

Test for race between scrubber and compacted segment reupload:
1. Create manifest with 3 segments, remove the middle one
   from cloud storage so the detector reports it missing
2. Replace it in the manifest with a compacted version at
   the same offset range but different size_bytes
3. Assert generate_remote_segment_name() differs for the
   original vs compacted segment (v2/v3 names encode size)
4. Call process_anomalies() and assert the anomaly is
   filtered out as a false positive

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
@oleiman oleiman force-pushed the ci/core-6913/internal-scrub branch from 00d8e4f to 34185af Compare April 8, 2026 03:27
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@oleiman oleiman requested a review from a team April 9, 2026 22:09
@oleiman oleiman merged commit cb396de into dev Apr 9, 2026
19 checks passed
@oleiman oleiman deleted the ci/core-6913/internal-scrub branch April 9, 2026 23:42
@vbotbuildovich
Copy link
Copy Markdown
Collaborator

/backport v26.1.x

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

/backport v25.3.x

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

/backport v25.2.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/redpanda claude-review Adding this label to a PR will trigger a workflow to review the code using claude.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants