pdd fix: workflow retries deterministically failed steps across cycles, wasting cost and time

## Summary

The `pdd fix` workflow orchestrator has no memory of step outcomes across cycles. When a step fails (e.g., due to an LLM provider timeout), the workflow retries the exact same step in subsequent cycles with nothing changed — no different provider, no skip logic, no fallback strategy. This causes predictable, repeated waste.

## Evidence

During `pdd fix` on promptdriven/pdd_cloud#600 (Job ID: `86M6dCRHG7LS95gwt41H`), the E2E test step (Step 2) produced no E2E result in any of the 3 cycles, and timed out in Cycles 2 and 3:

- **Cycle 1, Step 2:** Completed, but noted "E2E Tests — not run (requires dev server + browsers)"
- **Cycle 2, Step 2:** `"FAILED: All agent providers failed: anthropic: Timeout expired"`
- **Cycle 3, Step 2:** Timed out again. Step 3 notes: "Steps 2-3 timed out, so I performed my own root cause assessment"

The LLM provider timed out in Cycles 2-3 (a #492-class issue). But the workflow blindly re-attempted Step 2 in each cycle with the same provider configuration and the same environment, with no knowledge that the previous attempt had failed. Nothing changed between retries, so the same failure was all but guaranteed.

**Note:** The original version of this issue incorrectly attributed the failure to missing Playwright/browser infrastructure. The actual error is LLM provider timeouts (`anthropic: Timeout expired`), not a missing dev server. Cycle 1's Step 2 completed fine and correctly identified E2E tests couldn't run in the environment.

## Impact

- **Duration:** ~15-20 minutes wasted per timed-out attempt. The two Step 2 timeouts (Cycles 2 and 3) added ~30-40 minutes to the 127-minute total.
- **Cost:** The `pdd fix` job cost $16.02 total; a meaningful portion was consumed by retried timeouts that could never succeed.
- **No value:** Repeated failures produced zero additional signal beyond what Cycle 1 already established.

## Root Cause

The `pdd fix` orchestrator (`e2e_fix` workflow) tracks `step_outputs` per cycle in its state, but does not use prior cycle outcomes to inform the current cycle's strategy. Each cycle starts fresh with no awareness of what already failed.

From the workflow state:
```json
"step_outputs": {
    "1": "All 15 unit tests pass...",
    "2": "FAILED: All agent providers failed: anthropic: Timeout expired"
}
```

This state is recorded but never read back to skip or adapt Step 2 in the next cycle.
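As a rough illustration of what "reading the state back" could look like, here is a minimal sketch. It assumes the previous cycle's `step_outputs` is available as a plain dict and that a leading `FAILED:` (as in the excerpt above) marks every failed step. The helper name `failed_steps` is hypothetical and is not part of the existing orchestrator.

```python
from typing import Dict, List

# Prefix seen in the recorded state above; assumed here to mark every failure.
FAILED_PREFIX = "FAILED:"


def failed_steps(step_outputs: Dict[str, str]) -> List[str]:
    """Return the ids of steps whose recorded output marks a failure."""
    return sorted(
        (step for step, output in step_outputs.items()
         if output.startswith(FAILED_PREFIX)),
        key=int,  # step ids are numeric strings in the recorded state
    )


# State shaped like the excerpt from the workflow state.
previous_cycle = {
    "1": "All 15 unit tests pass...",
    "2": "FAILED: All agent providers failed: anthropic: Timeout expired",
}
print(failed_steps(previous_cycle))
```

With a lookup like this at the start of each cycle, the orchestrator would at least know that Step 2 failed last time before deciding how to run it again.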

## Expected Behavior

1. **Carry forward step failure context:** If Step N failed with a provider timeout in Cycle K, the orchestrator should know this in Cycle K+1 and either skip the step or try a different provider/strategy.
2. **Distinguish transient vs. deterministic failures:** A provider timeout with the same provider config in the same environment is likely to recur. Don't retry it identically.
3. **Accept partial verification:** When unit tests pass but E2E steps can't execute (for any reason), the workflow should be able to proceed with a "unit tests sufficient" fallback rather than blocking on E2E every cycle.
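
One possible shape for points 1 and 2 is sketched below. This is an illustration under stated assumptions, not the orchestrator's actual design: `Action`, `StepRecord`, `is_provider_timeout`, and `plan_step` are hypothetical names, and the sketch assumes the provider used for each attempt is recorded alongside the step output.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    RUN = "run"                  # run the step normally
    USE_FALLBACK = "fallback"    # re-run with a different provider
    SKIP = "skip"                # an identical retry is unlikely to succeed


@dataclass
class StepRecord:
    output: str    # step output recorded in the previous cycle
    provider: str  # provider used for that attempt (hypothetical field)


def is_provider_timeout(output: str) -> bool:
    """Match the failure string seen in the job: 'FAILED: ... Timeout expired'."""
    return output.startswith("FAILED:") and "Timeout expired" in output


def plan_step(previous: Optional[StepRecord],
              current_provider: str,
              fallback_provider: Optional[str]) -> Action:
    """Decide how to handle a step given its outcome in the previous cycle."""
    if previous is None or not previous.output.startswith("FAILED:"):
        return Action.RUN  # no recorded failure to react to
    if not is_provider_timeout(previous.output):
        return Action.RUN  # other failure kinds are not classified in this sketch
    if previous.provider != current_provider:
        return Action.RUN  # configuration changed, so the retry is not identical
    if fallback_provider is not None:
        return Action.USE_FALLBACK
    return Action.SKIP


timed_out = StepRecord(
    output="FAILED: All agent providers failed: anthropic: Timeout expired",
    provider="anthropic",
)
print(plan_step(timed_out, "anthropic", None))
```

A `SKIP` result is where point 3 would apply: the workflow could record that E2E verification was skipped and proceed on the unit test results instead of blocking.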

## Reproduction

```bash
# Any pdd fix job where an LLM provider times out on a step
# The timeout will be retried identically on every subsequent cycle
pdd fix https://github.com/promptdriven/pdd_cloud/issues/600
```

## Related

- promptdriven/pdd_cloud#600 — the bug being fixed when this was discovered
- promptdriven/pdd_cloud#492 — LLM provider auth/timeout failures (the underlying cause of individual step timeouts)
