fix: keep rate-limited records retrying#703
Conversation
Code Review: PR #703 — fix: keep rate-limited records retryingSummaryThis PR fixes a correctness bug in The change is small, well-scoped, and accompanied by a focused regression test. The mechanism — tracking the last error per deferred task in a FindingsCorrectness
Risk: tight loop / livelock under sustained rate limitingThe outer dispatch loop will now Recommendations to harden:
Logging
Style / minor
Test coverage
Project conventions
VerdictApprove with comments. The fix is correct, narrowly scoped, and addresses a real data-loss bug. The main concern is operational: under sustained provider rate-limiting, the new |
Greptile SummaryThis PR fixes a data-loss bug where records hitting HTTP 429 after the salvage round cap were silently dropped. Rate-limited exhausted tasks are now kept in
|
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py | Core change: rate-limited exhausted tasks are kept in _deferred instead of dropped, and new pacing/throttling logic retries them with backoff. Logic is sound; one pre-existing warning log is now emitted per salvage cycle instead of once. |
| packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py | Adds five regression tests: delayed-not-dropped 429s, paced resalvage cycling, mixed retryable exhaustion, zero salvage rounds rejection, and request-cooldown pacing. Coverage is thorough and assertions are appropriate. |
| architecture/dataset-builders.md | Single-sentence documentation update reflecting that salvage-exhausted 429 tasks are deferred rather than dropped. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Task dispatched] --> B{Fails?}
B -- No --> C[Complete / checkpoint]
B -- Yes --> D{Retryable?}
D -- No --> E[Drop row]
D -- Yes --> F[Append to _deferred]
F --> G[_salvage_stalled_row_groups]
G --> H[_salvage_rounds up to max]
H --> I{Still failing after all rounds?}
I -- No --> C
I -- Yes: rate-limit --> J[Preserve in _deferred]
I -- Yes: other retryable --> E
J --> K{has rate-limited deferred?}
K -- Yes --> L[_wait_before_rate_limit_resalvage]
L --> G
K -- No --> M{Early shutdown?}
M -- Yes --> N[break]
M -- No --> O[Normal loop continues]
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py:1072-1080
**Unthrottled early-shutdown warning per retry cycle**
Before this PR the early-shutdown branch always `break`ed, so `logger.warning("Early shutdown triggered …")` fired exactly once. With the new `continue`, this line executes on every salvage cycle until all rate-limited tasks clear. At the minimum fallback delay of 50 ms that is ~20 log entries per second; with a real 429 back-off of several seconds the rate is lower, but a sustained rate-limit period can still produce hundreds of repetitions.
The PR explicitly adds throttling for `_record_rate_limit_preservations` (firing on the first preservation and every `RATE_LIMIT_PRESERVATION_WARNING_INTERVAL` cycles thereafter). The same pattern should be applied here — e.g. guard the warning with a flag that is set on the first entry into the early-shutdown block, so it fires once and the repeating cycles are silent.
Reviews (10): Last reviewed commit: "Merge branch 'main' into codex/fix-429-r..." | Re-trigger Greptile
b16957b to
f6de784
Compare
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
f6de784 to
e1c4896
Compare
|
Update summary after reviewing the automated feedback:
Verification run on the latest head:
|
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
|
Nice work on this one, @eric-tramel — the scheduler change is nicely targeted and the regression coverage hits the important behavioral edges. SummaryThis PR changes async scheduler salvage so exhausted FindingsWarnings — Worth addressing
Suggestions — Take it or leave it
What Looks Good
VerdictNeeds changes — I’d address the wall-clock-sensitive scheduler test before merge. The docs update is a light nit, but worth keeping close to the behavior change. This review was generated by an AI assistant. |
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
|
MkDocs preview: https://b6c9e2ef.dd-docs-preview.pages.dev Fern preview: https://nvidia-preview-pr-703.docs.buildwithfern.com/nemo/datadesigner
|
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
nabinchha
left a comment
There was a problem hiding this comment.
Nice follow-up, @eric-tramel — the latest updates address the prior scheduler-review concerns cleanly.
Summary
This PR keeps salvage-exhausted ModelRateLimitError tasks deferred so sustained 429s delay records instead of dropping them, while preserving the existing drop behavior for other exhausted retryable failures. I re-reviewed the latest head after the test and docs adjustments; the implementation matches the PR intent.
Findings
No findings. The prior timing-sensitive pacing test has been made deterministic with an async gate, and the dataset-builder architecture note now documents the new rate-limit preservation behavior.
What Looks Good
- The deferred-error tracking stays tightly scoped to scheduler internals and is cleaned up on success, checkpoint, and non-rate-limit exhaustion paths.
- The pacing path uses request-admission cooldown when available, with a small fallback delay, so preserved 429 work avoids the hot-loop risk raised earlier.
- The tests now cover delayed-not-dropped 429s, early shutdown interaction, mixed retryable exhaustion, cleanup invariants, and cooldown-based delay selection.
Verdict
Ship it — ready to merge from my side.
This review was generated by an AI assistant.
📋 Summary
This PR prevents async scheduler rate-limit backpressure from turning records into dropped rows. Records that keep hitting HTTP 429 /
ModelRateLimitErrorafter the finite salvage round cap now remain deferred and continue retrying, so they are delayed instead of lost.🔗 Related Issue
N/A
🔄 Changes
ModelRateLimitErrortasks in the deferred queue instead of dropping their row or row group.salvage_max_rounds >= 1so preserved 429 work always has retry attempts.🧪 Testing
make testpasses (not run; focused engine coverage below)Focused checks run:
uv run --group dev pytest packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.pyuv run --group dev pytest packages/data-designer-engine/tests/engine/column_generators/generators/test_custom.py -k "retryable_model_errors_pass_through"uv run --group dev ruff check packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.pyuv run --group dev ruff format --check packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.pygit diff --check✅ Checklist