pdd fix: workflow retries deterministically failed steps across cycles, wasting cost and time #661

@gltanaka

Summary

The pdd fix workflow orchestrator has no memory of step outcomes across cycles. When a step fails (e.g., due to an LLM provider timeout), the workflow retries the exact same step in subsequent cycles with nothing changed: no different provider, no skip logic, no fallback strategy. The result is predictable, repeated waste.

Evidence

During pdd fix on promptdriven/pdd_cloud#600 (Job ID: 86M6dCRHG7LS95gwt41H), the E2E test step (Step 2) produced no usable E2E result in any of the 3 cycles, and failed with the same provider timeout in Cycles 2 and 3:

Cycle 1, Step 2: Completed, but noted "E2E Tests — not run (requires dev server + browsers)"
Cycle 2, Step 2: "FAILED: All agent providers failed: anthropic: Timeout expired"
Cycle 3, Step 2: Timed out again — Step 3 notes: "Steps 2-3 timed out, so I performed my own root cause assessment"

The LLM provider timed out in Cycles 2-3 (a #492-class issue). But the workflow blindly re-attempted Step 2 each cycle with the same provider configuration, the same environment, and no knowledge that the previous attempt had failed. Nothing changed between retries, so the same failure was the expected outcome.

Note: The original version of this issue incorrectly attributed the failure to missing Playwright/browser infrastructure. The actual error is LLM provider timeouts (anthropic: Timeout expired), not a missing dev server. Cycle 1's Step 2 completed fine and correctly identified E2E tests couldn't run in the environment.

Impact

  • Duration: ~15-20 minutes wasted per failed step retry. The two Step 2 timeouts (Cycles 2 and 3) added ~30-40 minutes to the 127-minute total, roughly a quarter to a third of the run.
  • Cost: The pdd fix job cost $16.02 total; a meaningful portion was consumed by retried timeouts that could never succeed.
  • No value: Repeated failures produced zero additional signal beyond what Cycle 1 already established.

Root Cause

The pdd fix orchestrator (e2e_fix workflow) tracks step_outputs per cycle in its state, but does not use prior cycle outcomes to inform the current cycle's strategy. Each cycle starts fresh with no awareness of what already failed.

From the workflow state:

"step_outputs": {
    "1": "All 15 unit tests pass...",
    "2": "FAILED: All agent providers failed: anthropic: Timeout expired"
}

This state is recorded but never read back to skip or adapt Step 2 in the next cycle.
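
A minimal sketch of what reading that state back could look like. It assumes the per-cycle step_outputs can be viewed as a mapping from cycle number to the dict shown above, and that failures are recorded with a "FAILED:" prefix as in the excerpt. The name prior_failure and the history layout are hypothetical, not taken from the pdd codebase.

```python
from typing import Dict, Optional

# Assumption: failed steps are recorded with this prefix, as in the excerpt above.
FAILED_PREFIX = "FAILED:"


def prior_failure(history: Dict[int, Dict[str, str]], cycle: int, step: str) -> Optional[str]:
    """Return the failure message recorded for `step` in the previous cycle, or None."""
    previous_outputs = history.get(cycle - 1, {})
    output = previous_outputs.get(step, "")
    return output if output.startswith(FAILED_PREFIX) else None


# Hypothetical history: cycle number -> step_outputs for that cycle.
history = {
    2: {
        "1": "All 15 unit tests pass...",
        "2": "FAILED: All agent providers failed: anthropic: Timeout expired",
    }
}

# Before running Step 2 in Cycle 3, check what happened to it in Cycle 2.
print(prior_failure(history, cycle=3, step="2"))
print(prior_failure(history, cycle=3, step="1"))
```

With a lookup like this, the orchestrator would at least know at the start of Cycle 3 that Step 2 had already failed with a provider timeout.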

Expected Behavior

  1. Carry forward step failure context: If Step N failed with a provider timeout in Cycle K, the orchestrator should know this in Cycle K+1 and either skip the step or try a different provider/strategy.
  2. Distinguish transient vs. deterministic failures: A provider timeout with the same provider config in the same environment is likely to recur. Don't retry it identically.
  3. Accept partial verification: When unit tests pass but E2E steps can't execute (for any reason), the workflow should be able to proceed with a "unit tests sufficient" fallback rather than blocking on E2E every cycle.
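
The three points above could be combined into one decision function, sketched below under stated assumptions. Attempt, Action, next_action, and the provider name "other-provider" are all hypothetical illustrations. The HALT case (stop and report rather than retry identically) is one possible policy, not something this issue prescribes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Action(Enum):
    RUN = "run"    # run the step with the returned provider
    SKIP = "skip"  # accept partial verification ("unit tests sufficient")
    HALT = "halt"  # stop and report instead of retrying identically


@dataclass(frozen=True)
class Attempt:
    """Hypothetical record of how a step went in the previous cycle."""
    provider: str
    error: Optional[str]  # None means the step succeeded


def next_action(previous: Optional[Attempt], providers: List[str],
                unit_tests_passed: bool) -> Tuple[Action, Optional[str]]:
    if not providers:
        raise ValueError("at least one provider is required")
    # No recorded failure: run normally with the default (first) provider.
    if previous is None or previous.error is None:
        return Action.RUN, providers[0]
    # The step failed before: change something by picking a provider not yet tried.
    untried = [p for p in providers if p != previous.provider]
    if untried:
        return Action.RUN, untried[0]
    # Nothing can be changed: fall back to unit-test results if they passed.
    if unit_tests_passed:
        return Action.SKIP, None
    return Action.HALT, None


failed = Attempt(provider="anthropic", error="Timeout expired")
print(next_action(failed, ["anthropic", "other-provider"], unit_tests_passed=True))
print(next_action(failed, ["anthropic"], unit_tests_passed=True))
```

The point of the design is that an identical retry is never chosen after a recorded failure: either the provider changes, the step is skipped on the strength of passing unit tests, or the run stops.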

Reproduction

# Any pdd fix job where an LLM provider times out on a step
# The timeout will be retried identically on every subsequent cycle
pdd fix https://github.com/promptdriven/pdd_cloud/issues/600

Related

  • promptdriven/pdd_cloud#600 — the bug being fixed when this was discovered
  • promptdriven/pdd_cloud#492 — LLM provider auth/timeout failures (the underlying cause of individual step timeouts)
