Skip to content

fix(batch): retry R2 upload on transient failure in BatchPayloadProce…#1

Open
erberkson wants to merge 1 commit intobefore-retryfrom
demo-retry
Open

fix(batch): retry R2 upload on transient failure in BatchPayloadProce…#1
erberkson wants to merge 1 commit intobefore-retryfrom
demo-retry

Conversation

@erberkson
Copy link
Copy Markdown
Owner

…ssor (triggerdotdev#3331)

A single "fetch failed" from the object store was aborting the entire batch stream with no retry. Added p-retry (3 attempts, 500ms-2s backoff) around ploadPacketToObjectStore so transient network errors self-heal server-side instead of propagating to the SDK.

Closes #

✅ Checklist

  • I have followed every step in the contributing guide
  • The PR title follows the convention.
  • I ran and tested the code works

Testing

[Describe the steps you took to test this change]


Changelog

[Short description of what has changed]


Screenshots

[Screenshots]

💯

…ssor (triggerdotdev#3331)

A single "fetch failed" from the object store was aborting the entire
batch stream with no retry. Added p-retry (3 attempts, 500ms-2s backoff)
around ploadPacketToObjectStore so transient network errors self-heal
server-side instead of propagating to the SDK.
@matrixreview
Copy link
Copy Markdown

matrixreview bot commented Apr 12, 2026

🔴 MatrixReview — RED

🔎 = doc-backed finding  ·  💭 = AI suggestion  ·  📖 = doc citation  ·  📝 = PR location

Findings: 8 (10 doc-backed)

🔴 SECURITY — 2 findings (2 doc-backed) · expand 🔽
  • 🔎 [SECURITY] The PR introduces a new dependency 'p-retry' which is not listed in the security documentation's pinned dependencies (requirements_security_section.md). This could introduce a supply chain risk if ...

    Read more · expand 🔽

    ...the package is malicious, has known vulnerabilities, or is not vetted.

    📖 *requirements_security_section.md lines 1-150* 📝 `apps/webapp/app/runEngine/concerns/batchPayloads.server.ts line 2`
  • 🔎 [BUG] The error handling in the retry logic may expose internal error details to the client. The error message from the failed upload is propagated directly to the client in the 500 response, which could...

    Read more · expand 🔽

    ... leak internal object store paths or implementation details.

    - *Also flagged by: ARCHITECTURE* 📖 *README_security_section.md lines 1-3* 📝 `apps/webapp/app/routes/api.v3.batches.$batchId.items.ts line 107`
🟡 ARCHITECTURE — 3 findings (3 doc-backed) · expand 🔽
  • 🔎 [ARCHITECTURE] The PR adds a direct dependency on 'p-retry' package in the BatchPayloadProcessor, but the architecture documentation shows that retry logic should be handled through the existing retry mechanisms ...

    Read more · expand 🔽

    ...in the system. The Supervisor architecture (CLAUDE.md) mentions retry management as part of the task execution lifecycle, and the RunEngine architecture (README_architecture_section.md) shows retry management as a core system responsibility. Adding ad-hoc retry logic at the payload processing level violates the centralized retry management pattern.

    📖 *CLAUDE.md lines 1-15, README_architecture_section.md lines 1-30* 📝 `apps/webapp/app/runEngine/concerns/batchPayloads.server.ts line 107-136`
  • 🔎 [PERFORMANCE] The retry configuration uses fixed 3 attempts with 500ms-2s exponential backoff, which could significantly increase latency for batch processing under transient failures. The batch queue stress tes...

    Read more · expand 🔽

    ...t documentation shows extensive performance testing with various configurations, but this retry logic wasn't part of those tests. Adding up to 4.5 seconds of additional latency (500ms + 1000ms + 2000ms) per failed upload could impact batch completion times.

    📖 *batch-queue-stress-test-plan.md lines 100-150* 📝 `apps/webapp/app/runEngine/concerns/batchPayloads.server.ts line 119-125`
  • 🔎 [ARCHITECTURE] The test file imports and tests implementation details (p-retry behavior) rather than testing the public contract. The architecture documentation emphasizes testing through public APIs and contract...

    Read more · expand 🔽

    ...s. The test verifies specific retry counts and timing behavior which couples tests to implementation details.

    📖 *runsReplicationBenchmark_architecture_section.md lines 1-30* 📝 `apps/webapp/test/engine/batchPayloads.test.ts line 61-77`

🟢 LEGAL — No issues found

🟡 STYLE — 1 findings (1 doc-backed) · expand 🔽
  • 🔎 [STYLE] Test file location does not follow the recommended test execution pattern. The style guide states to prefer running a single test file from within its directory, but the test file is placed in `app...

    Read more · expand 🔽

    ...s/webapp/test/engine/` instead of being colocated with the source file.

    📖 *AGENTS_style_section.md lines 2-5* 📝 `apps/webapp/test/engine/batchPayloads.test.ts line 1`
🔴 ONBOARDING — 2 findings (2 doc-backed) · expand 🔽
  • 🔎 [POLICY_VIOLATION] Missing required documentation update for the server change. According to RELEASE.md, server-only changes require a .server-changes/ file. While a .server-changes/batch-r2-upload-retry.md file ...

    Read more · expand 🔽

    ...was added, the CONTRIBUTING_onboarding_section.md (v8) specifies that the file must include a one-line description in the body. The current body is a multi-line paragraph, not a concise one-line description.

    📖 *CONTRIBUTING_onboarding_section.md lines 1-30* 📝 `.server-changes/batch-r2-upload-retry.md line 1`
  • 🔎 [CHORE] PR description checklist is not filled out. The 'I have followed every step in the contributing guide' and 'I ran and tested the code works' checkboxes are unchecked, and the 'Testing', 'Changelog'...

    Read more · expand 🔽

    ..., and 'Screenshots' sections contain placeholder text.

    📖 *CONTRIBUTING_onboarding_section.md lines 1-8*

👆 Click expand on any gate above to see full findings with evidence and citations.


Powered by MatrixReview · Report incorrect finding

⚙️ Generate Fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants