Skip to content

fix(amber): bound region termination retry instead of looping forever#5737

Open
Ma77Ball wants to merge 14 commits into
apache:mainfrom
Ma77Ball:fix/region-termination-bounded-retry
Open

fix(amber): bound region termination retry instead of looping forever#5737
Ma77Ball wants to merge 14 commits into
apache:mainfrom
Ma77Ball:fix/region-termination-bounded-retry

Conversation

@Ma77Ball

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

  • RegionExecutionCoordinator.terminateWorkersWithRetry retried worker termination with no limit, so a worker that never finished draining (its EndWorker kept failing) left the region's termination future unresolved and the workflow never reached COMPLETED, surfacing as an opaque ~1 minute timeout in DataProcessingSpec with no indication of the stuck region or workers.
  • Bounded the retry by a new maxTerminationAttempts budget (default 150, about 30s at the existing 200ms delay); on exhaustion the termination future fails with a descriptive error naming the region and the still-unterminated workers, instead of retrying indefinitely.
  • Made maxTerminationAttempts and killRetryDelay constructor parameters with production defaults so the behavior is unit-testable without long waits.
  • Scope note: this is a fail-fast/diagnosability change (it converts a silent hang into a fast, explicit failure, matching the pattern in fix(amber): surface writer-thread failure as FatalError instead of silent hang #4683), not a guaranteed elimination of the underlying termination flake.

Any related issues, documentation, discussions?

Closes: #5614

How was this PR tested?

  • Run sbt "WorkflowExecutionService/testOnly *RegionExecutionCoordinatorSpec" (under JDK 17); expect 3 tests succeeded, 0 failed.
  • New test give up with a descriptive error once the EndWorker retry budget is exhausted: forces EndWorker to always fail, then asserts the completion future fails with IllegalStateException containing "could not be terminated after 3 attempts", the coordinator is not marked completed, exactly 3 EndWorker calls were made, and the worker actor ref is retained.
  • Existing tests in the same spec (gracefulStop ordering, transient-failure retry-then-succeed) still pass, confirming the normal and transient paths are unchanged.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

@github-actions

github-actions Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

⚠️ Benchmark changes need a look

🟢 0 better · 🔴 5 worse · ⚪ 10 noise (<±5%) · 0 without baseline

Compared against main e270f83 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

config throughput MB/s latency max Δ latest / 7d
🔴 bs=10 sw=10 sl=64 412 0.251 22,362/39,001/39,001 us 🔴 +23.5% / 🔴 +11.5%
bs=100 sw=10 sl=64 830 0.507 118,277/144,869/144,869 us ⚪ within ±5% / 🔴 -6.9%
bs=1000 sw=10 sl=64 966 0.589 1,038,474/1,063,384/1,063,384 us ⚪ within ±5% / 🔴 -7.3%
Baseline details

Latest main e270f83 from same runner

config metric PR latest main 7d avg Δ latest Δ 7d
bs=10 sw=10 sl=64 throughput 412 tuples/sec 460 tuples/sec 410.82 tuples/sec -10.4% +0.3%
bs=10 sw=10 sl=64 MB/s 0.251 MB/s 0.28 MB/s 0.251 MB/s -10.4% +0.1%
bs=10 sw=10 sl=64 p50 22,362 us 21,057 us 23,785 us +6.2% -6.0%
bs=10 sw=10 sl=64 p95 39,001 us 31,568 us 34,980 us +23.5% +11.5%
bs=10 sw=10 sl=64 p99 39,001 us 31,568 us 34,980 us +23.5% +11.5%
bs=100 sw=10 sl=64 throughput 830 tuples/sec 840 tuples/sec 891.94 tuples/sec -1.2% -6.9%
bs=100 sw=10 sl=64 MB/s 0.507 MB/s 0.513 MB/s 0.544 MB/s -1.2% -6.9%
bs=100 sw=10 sl=64 p50 118,277 us 115,741 us 112,277 us +2.2% +5.3%
bs=100 sw=10 sl=64 p95 144,869 us 149,686 us 139,802 us -3.2% +3.6%
bs=100 sw=10 sl=64 p99 144,869 us 149,686 us 139,802 us -3.2% +3.6%
bs=1000 sw=10 sl=64 throughput 966 tuples/sec 986 tuples/sec 1,041 tuples/sec -2.0% -7.2%
bs=1000 sw=10 sl=64 MB/s 0.589 MB/s 0.602 MB/s 0.635 MB/s -2.2% -7.3%
bs=1000 sw=10 sl=64 p50 1,038,474 us 1,012,624 us 972,714 us +2.6% +6.8%
bs=1000 sw=10 sl=64 p95 1,063,384 us 1,065,211 us 1,023,057 us -0.2% +3.9%
bs=1000 sw=10 sl=64 p99 1,063,384 us 1,065,211 us 1,023,057 us -0.2% +3.9%
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,485.95,200,128000,412,0.251,22362.38,39000.94,39000.94
1,100,10,64,20,2409.21,2000,1280000,830,0.507,118276.71,144868.57,144868.57
2,1000,10,64,20,20709.53,20000,12800000,966,0.589,1038473.52,1063383.95,1063383.95

@codecov-commenter

codecov-commenter commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.47619% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.22%. Comparing base (e270f83) to head (bcfb688).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...ecture/scheduling/RegionExecutionCoordinator.scala 90.47% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5737      +/-   ##
============================================
- Coverage     54.39%   54.22%   -0.18%     
+ Complexity     2860     2845      -15     
============================================
  Files          1107     1102       -5     
  Lines         42767    42598     -169     
  Branches       4599     4587      -12     
============================================
- Hits          23264    23098     -166     
+ Misses        18152    18150       -2     
+ Partials       1351     1350       -1     
Flag Coverage Δ *Carryforward flag
access-control-service 70.44% <ø> (ø) Carriedforward from f2d5737
agent-service 34.36% <ø> (ø) Carriedforward from f2d5737
amber 56.18% <90.47%> (-0.01%) ⬇️
computing-unit-managing-service 1.65% <ø> (ø) Carriedforward from f2d5737
config-service 56.06% <ø> (-1.30%) ⬇️ Carriedforward from f2d5737
file-service 57.36% <ø> (-1.24%) ⬇️ Carriedforward from f2d5737
frontend 48.01% <ø> (-0.30%) ⬇️ Carriedforward from f2d5737
pyamber 90.09% <ø> (-0.11%) ⬇️ Carriedforward from f2d5737
python 90.76% <ø> (ø) Carriedforward from f2d5737
workflow-compiling-service 58.69% <ø> (ø) Carriedforward from f2d5737

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions

Copy link
Copy Markdown
Contributor

Automated Reviewer Suggestions

Based on the git blame history of the changed files, we recommend the following reviewers:

  • Contributors with relevant context: @aglinxinyuan, @Yicong-Huang, @kunwp1
    You can notify them by mentioning @aglinxinyuan, @Yicong-Huang, @kunwp1 in a comment.

@Ma77Ball Ma77Ball marked this pull request as ready for review June 23, 2026 10:12
@Ma77Ball

Copy link
Copy Markdown
Contributor Author

/request-review @aglinxinyuan

@Ma77Ball Ma77Ball force-pushed the fix/region-termination-bounded-retry branch from f16b303 to 7ec4e83 Compare June 23, 2026 10:15
@github-actions github-actions Bot requested a review from aglinxinyuan June 23, 2026 10:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky DataProcessingSpec intermittently fails the amber CI job

2 participants