
[v26.1.x] group/tx/stm: limit local snapshot to max_removable_local_log_offset#30080

Merged
bharathv merged 2 commits into redpanda-data:v26.1.x from vbotbuildovich:backport-pr-30071-v26.1.x-497 on Apr 9, 2026


@vbotbuildovich
Collaborator

Backport of PR #30071

bharathv added 2 commits April 6, 2026 18:50
The group_tx_tracker_stm snapshot could capture open transaction state
(begin_offsets/producer_states) at an offset where the corresponding
commit batch had not yet been written. If compaction later removed that
commit batch — which is allowed once max_removable_offset advances past
it — the open transaction could never be resolved on restart, permanently
blocking max_removable_offset and preventing further compaction.

The sequence:
1. Snapshot taken while a tx is open (fence at F, snapshot offset >= F)
2. Tx commits at offset C, max_removable advances past C
3. Compaction removes the commit batch at C
4. On restart, snapshot loads stale open tx at F, replay cannot find
   the commit -> max_removable stuck at prev(F) forever
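
The four steps above can be traced with a small toy model. This is only an illustrative sketch, not the STM's actual code: the `Batch` type, the `replay` and `max_removable` helpers, and the concrete offsets (F = 1, C = 3) are all hypothetical, and compaction is reduced to "drop the commit batch".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Batch:
    offset: int
    kind: str     # "data", "fence" (tx begin), or "commit"
    pid: int = 0  # producer id; only meaningful for fence/commit


def replay(batches, open_txs=None, start_offset=0):
    """Apply batches at or after start_offset; return {pid: fence_offset}."""
    state = dict(open_txs or {})
    for b in batches:
        if b.offset < start_offset:
            continue
        if b.kind == "fence":
            state.setdefault(b.pid, b.offset)
        elif b.kind == "commit":
            state.pop(b.pid, None)
    return state


def max_removable(open_txs, last_applied):
    """prev(earliest open fence), or last_applied when no tx is open."""
    return min(open_txs.values()) - 1 if open_txs else last_applied


F, C = 1, 3  # hypothetical fence and commit offsets
log = [Batch(0, "data"), Batch(F, "fence", pid=7), Batch(2, "data"),
       Batch(C, "commit", pid=7), Batch(4, "data")]

# 1. Snapshot taken at offset 2 while the tx is open (snapshot offset >= F).
snapshot_offset = 2
snapshot_txs = replay([b for b in log if b.offset <= snapshot_offset])

# 2. The tx commits at C; the live STM lets max_removable advance past C.
live_max_removable = max_removable(replay(log), last_applied=4)

# 3. Compaction removes the commit batch at C.
compacted = [b for b in log if b.offset != C]

# 4. Restart: load the stale snapshot and replay only what follows it.
restored = replay(compacted, snapshot_txs, start_offset=snapshot_offset + 1)
stuck_at = max_removable(restored, last_applied=4)
print(live_max_removable, restored, stuck_at)
```

In this model the restored state still holds the transaction fenced at F, so `stuck_at` equals prev(F) and no later batch can move it.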

Fix: snapshot at max_removable_local_log_offset with an empty
transactions map. Since this STM's sole purpose is tracking open
transactions for max_removable_local_log_offset, and closed transactions
leave no state, all meaningful state can be reconstructed from log
replay. Open transactions are re-discovered from fence batches in the
log, which are guaranteed to be present since compaction is bounded by
max_removable while the STM is live.
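
A minimal sketch of the fixed snapshot rule, under the same kind of toy assumptions: batches are plain `(offset, kind, pid)` tuples, and `replay`, `max_removable`, and `take_snapshot` are hypothetical helpers, not functions from the Redpanda code base. It shows only the case where the transaction is still open at restart, which is the case the paragraph above describes.

```python
def replay(batches, start_offset=0):
    """batches: (offset, kind, pid) tuples; returns {pid: fence_offset}."""
    open_txs = {}
    for offset, kind, pid in batches:
        if offset < start_offset:
            continue
        if kind == "fence":
            open_txs.setdefault(pid, offset)
        elif kind == "commit":
            open_txs.pop(pid, None)
    return open_txs


def max_removable(open_txs, last_applied):
    """prev(earliest open fence), or last_applied when no tx is open."""
    return min(open_txs.values()) - 1 if open_txs else last_applied


def take_snapshot(open_txs, last_applied):
    """Snapshot at max_removable and persist no transaction state."""
    return {"offset": max_removable(open_txs, last_applied), "open_txs": {}}


log = [(0, "data", None), (1, "fence", 7), (2, "data", None)]
live = replay(log)                          # tx 7 is open, fence at offset 1
snap = take_snapshot(live, last_applied=2)  # lands just before the fence

# Restart: start from the empty snapshot state and replay everything after it.
# The fence is still in the log because compaction could not pass max_removable.
recovered = replay(log, start_offset=snap["offset"] + 1)

# The commit arrives later; replay resolves the tx and max_removable advances.
log.append((3, "commit", 7))
after_commit = max_removable(
    replay(log, start_offset=snap["offset"] + 1), last_applied=3
)
```

Because the snapshot never stores an open transaction, there is nothing in it that can go stale; the open transaction is rebuilt from the fence batch on every restart.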

Also adds a regression test that reproduces the scenario by taking a
snapshot during an open tx, committing, compacting, re-persisting the
stale snapshot, and restarting.

To fix existing setups that have stale snapshots, this commit also bumps
supported_local_snapshot_version. The bump invalidates saved snapshots on
upgrade, so the STM applies everything from the log and writes correct
snapshots with the newer logic the next time it snapshots.
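
The effect of the version bump can be pictured roughly as follows. This is a hedged sketch: the dict-shaped snapshot, the `load_snapshot` helper, the version numbers, and the choice to reject any mismatched version are assumptions made for illustration, not the actual on-disk format or loading code.

```python
SUPPORTED_LOCAL_SNAPSHOT_VERSION = 2  # hypothetical value after the bump


def load_snapshot(snapshot):
    """Return (replay_start_offset, open_txs) for a persisted snapshot or None."""
    if snapshot is None or snapshot["version"] != SUPPORTED_LOCAL_SNAPSHOT_VERSION:
        # Unsupported or missing snapshot: discard it and replay the whole log.
        return 0, {}
    return snapshot["offset"] + 1, dict(snapshot["open_txs"])


# A snapshot written before the upgrade, carrying a stale open transaction.
stale = {"version": 1, "offset": 2, "open_txs": {7: 1}}
# A snapshot written by the newer logic: no transaction state at all.
fresh = {"version": 2, "offset": 0, "open_txs": {}}

print(load_snapshot(stale))
print(load_snapshot(fresh))
```

Under these assumptions the pre-upgrade snapshot is ignored entirely, so its stale open transaction never reaches the in-memory state.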

(cherry picked from commit 45fff0d)
@vbotbuildovich vbotbuildovich added this to the v26.1.x-next milestone Apr 6, 2026
@vbotbuildovich vbotbuildovich added the kind/backport PRs targeting a stable branch label Apr 6, 2026
@vbotbuildovich vbotbuildovich requested a review from bharathv April 6, 2026 18:50
@bharathv bharathv marked this pull request as draft April 6, 2026 18:55
@bharathv
Contributor

bharathv commented Apr 6, 2026

Converting to draft temporarily so the main change gets some bake time, please don't merge.

@vbotbuildovich
Collaborator Author

CI test results

test results on build#82788
| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FLAKY(PASS) | ShadowLinkingReplicationTests | test_auto_prefix_trimming | {"source_cluster_spec": {"cluster_type": "redpanda"}, "with_failures": true} | integration | https://buildkite.com/redpanda/redpanda/builds/82788#019d6432-fec1-45ef-b4ce-cd0f42dbedfb | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0021, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_auto_prefix_trimming |
| FLAKY(PASS) | JavaCompressionTest | test_upgrade_java_compression | {"compression_type": "zstd"} | integration | https://buildkite.com/redpanda/redpanda/builds/82788#019d6432-fec1-45ef-b4ce-cd0f42dbedfb | 19/21 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0016, p0=0.0314, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3917, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=JavaCompressionTest&test_method=test_upgrade_java_compression |

@bharathv bharathv marked this pull request as ready for review April 9, 2026 16:28
@bharathv bharathv enabled auto-merge April 9, 2026 16:28
@bharathv bharathv merged commit 10a9442 into redpanda-data:v26.1.x Apr 9, 2026
20 checks passed
@tyson-redpanda tyson-redpanda modified the milestones: v26.1.x-next, v26.1.3 Apr 9, 2026

Labels

area/build area/redpanda kind/backport PRs targeting a stable branch
