fix(amber): bound region termination retry instead of looping forever#5737
fix(amber): bound region termination retry instead of looping forever#5737Ma77Ball wants to merge 14 commits into
Conversation
…or user to see and fix
…nation-bounded-retry
|
| config | throughput | MB/s | latency | max Δ latest / 7d | |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 412 | 0.251 | 22,362/39,001/39,001 us | 🔴 +23.5% / 🔴 +11.5% |
| ⚪ | bs=100 sw=10 sl=64 | 830 | 0.507 | 118,277/144,869/144,869 us | ⚪ within ±5% / 🔴 -6.9% |
| ⚪ | bs=1000 sw=10 sl=64 | 966 | 0.589 | 1,038,474/1,063,384/1,063,384 us | ⚪ within ±5% / 🔴 -7.3% |
Baseline details
Latest main e270f83 from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 412 tuples/sec | 460 tuples/sec | 410.82 tuples/sec | -10.4% | +0.3% |
| bs=10 sw=10 sl=64 | MB/s | 0.251 MB/s | 0.28 MB/s | 0.251 MB/s | -10.4% | +0.1% |
| bs=10 sw=10 sl=64 | p50 | 22,362 us | 21,057 us | 23,785 us | +6.2% | -6.0% |
| bs=10 sw=10 sl=64 | p95 | 39,001 us | 31,568 us | 34,980 us | +23.5% | +11.5% |
| bs=10 sw=10 sl=64 | p99 | 39,001 us | 31,568 us | 34,980 us | +23.5% | +11.5% |
| bs=100 sw=10 sl=64 | throughput | 830 tuples/sec | 840 tuples/sec | 891.94 tuples/sec | -1.2% | -6.9% |
| bs=100 sw=10 sl=64 | MB/s | 0.507 MB/s | 0.513 MB/s | 0.544 MB/s | -1.2% | -6.9% |
| bs=100 sw=10 sl=64 | p50 | 118,277 us | 115,741 us | 112,277 us | +2.2% | +5.3% |
| bs=100 sw=10 sl=64 | p95 | 144,869 us | 149,686 us | 139,802 us | -3.2% | +3.6% |
| bs=100 sw=10 sl=64 | p99 | 144,869 us | 149,686 us | 139,802 us | -3.2% | +3.6% |
| bs=1000 sw=10 sl=64 | throughput | 966 tuples/sec | 986 tuples/sec | 1,041 tuples/sec | -2.0% | -7.2% |
| bs=1000 sw=10 sl=64 | MB/s | 0.589 MB/s | 0.602 MB/s | 0.635 MB/s | -2.2% | -7.3% |
| bs=1000 sw=10 sl=64 | p50 | 1,038,474 us | 1,012,624 us | 972,714 us | +2.6% | +6.8% |
| bs=1000 sw=10 sl=64 | p95 | 1,063,384 us | 1,065,211 us | 1,023,057 us | -0.2% | +3.9% |
| bs=1000 sw=10 sl=64 | p99 | 1,063,384 us | 1,065,211 us | 1,023,057 us | -0.2% | +3.9% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,485.95,200,128000,412,0.251,22362.38,39000.94,39000.94
1,100,10,64,20,2409.21,2000,1280000,830,0.507,118276.71,144868.57,144868.57
2,1000,10,64,20,20709.53,20000,12800000,966,0.589,1038473.52,1063383.95,1063383.95…7Ball/texera into fix/region-termination-bounded-retry
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5737 +/- ##
============================================
- Coverage 54.39% 54.22% -0.18%
+ Complexity 2860 2845 -15
============================================
Files 1107 1102 -5
Lines 42767 42598 -169
Branches 4599 4587 -12
============================================
- Hits 23264 23098 -166
+ Misses 18152 18150 -2
+ Partials 1351 1350 -1
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Automated Reviewer SuggestionsBased on the
|
|
/request-review @aglinxinyuan |
f16b303 to
7ec4e83
Compare
What changes were proposed in this PR?
RegionExecutionCoordinator.terminateWorkersWithRetryretried worker termination with no limit, so a worker that never finished draining (itsEndWorkerkept failing) left the region's termination future unresolved and the workflow never reached COMPLETED, surfacing as an opaque ~1 minute timeout inDataProcessingSpecwith no indication of the stuck region or workers.maxTerminationAttemptsbudget (default 150, about 30s at the existing 200ms delay); on exhaustion the termination future fails with a descriptive error naming the region and the still-unterminated workers, instead of retrying indefinitely.maxTerminationAttemptsandkillRetryDelayconstructor parameters with production defaults so the behavior is unit-testable without long waits.Any related issues, documentation, discussions?
Closes: #5614
How was this PR tested?
sbt "WorkflowExecutionService/testOnly *RegionExecutionCoordinatorSpec"(under JDK 17); expect 3 tests succeeded, 0 failed.give up with a descriptive error once the EndWorker retry budget is exhausted: forcesEndWorkerto always fail, then asserts the completion future fails withIllegalStateExceptioncontaining "could not be terminated after 3 attempts", the coordinator is not marked completed, exactly 3EndWorkercalls were made, and the worker actor ref is retained.Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.8 in compliance with ASF