Summary
Two coupled issues that together make full-pipeline regeneration unreliable when the LLM rewords any error message.
(a) On regeneration, the LLM keeps all contractual behavior (function signatures, exception types, numeric results) but rewords all five ValueError messages, e.g., "Matrix multiplication shapes incompatible" instead of "Incompatible shapes for matrix multiplication." PDD-generated tests assert on exact wording via pytest.raises(match=...), so all five tests fail simultaneously.
(b) Inspection of .pdd/backups/linalg_engine/20260511_124217/ shows the fix loop was making progress: iteration 3 (code_3_0_1_0.py) correctly restored matmul's error string, and that test passed. However, the LLM addresses roughly one failure per iteration, and the breaker fires at 5 consecutive fix operations. With five or more simultaneous wording mismatches, the loop cannot converge before it is killed.
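The failure mode in (a) can be illustrated with a minimal sketch. pytest.raises(match=...) checks its pattern with re.search against the exception's string form, so the sketch uses re directly and runs without pytest. The helper name message_matches is hypothetical; the two messages and the alternation pattern are the ones quoted in this report.

```python
import re

# The two wordings quoted in this report.
OLD_MSG = "Incompatible shapes for matrix multiplication."
NEW_MSG = "Matrix multiplication shapes incompatible"

# Pattern suggested in this report as a wording-tolerant alternative.
TOLERANT = r"shape|incompatible|dimension"


def message_matches(pattern: str, message: str) -> bool:
    """Hypothetical helper mimicking pytest.raises(match=pattern):
    the pattern is applied with re.search to the exception message."""
    return re.search(pattern, message) is not None


# Exact-wording pattern: passes before regeneration, fails after rewording.
exact_before = message_matches(OLD_MSG, OLD_MSG)
exact_after = message_matches(OLD_MSG, NEW_MSG)

# Tolerant pattern: passes for both wordings.
tolerant_before = message_matches(TOLERANT, OLD_MSG)
tolerant_after = message_matches(TOLERANT, NEW_MSG)
```

Only the exact-wording check is sensitive to the rewording; the exception type and the condition that triggers it are unchanged in both cases.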
Reproduction
Run A: pdd --local --force --verbose sync --skip-tests --skip-verify --no-steer linalg_engine
→ Success, 61.88s, $0.0202
Run B: pdd --local --force --verbose sync --no-steer linalg_engine
→ Failed, 160.62s, $0.1175
→ "Detected 5 consecutive fix operations. Breaking infinite fix loop."
Evidence: evidence/bug9_full_sync.log, .pdd/backups/linalg_engine/20260511_124217/{code_1..code_3}.py showing per-iteration progress on the wording fix.
Expected
A regeneration that mismatches N error strings should be able to converge in N iterations.
Suggested fix (either is sufficient; both would be better)
(a) Make the breaker progress-sensitive: don't terminate if the failure count strictly decreased in the prior iteration. Cap by total iterations or wall-clock instead.
(b) Update the test-generation prompt to discourage exact-wording pytest.raises(match="...") assertions in favor of pattern-based matches (match=r"shape|incompatible|dimension") or assertions on exc.args/exception type alone. PDD's own prompting guidance advises this ("Prefer observable behavior over private implementation details"), but PDD's test generator doesn't follow it.
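One possible shape for the progress-sensitive breaker in (a) is sketched below. This is not PDD's actual implementation: the class name ProgressAwareBreaker, its fields, and the default thresholds are all hypothetical, chosen only to illustrate resetting the stall counter on a strict decrease in failures while still capping total iterations and wall-clock time.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProgressAwareBreaker:
    """Hypothetical breaker: stop on stall, iteration cap, or time cap."""
    initial_failures: int        # failing tests before the first fix attempt
    max_stalled: int = 5         # consecutive fix iterations without improvement
    max_total: int = 20          # hard cap on fix iterations (illustrative value)
    max_seconds: float = 600.0   # wall-clock cap (illustrative value)
    _stalled: int = 0
    _total: int = 0
    _previous: int = field(init=False)
    _started: float = field(default_factory=time.monotonic)

    def __post_init__(self) -> None:
        self._previous = self.initial_failures

    def record(self, failure_count: int) -> Optional[str]:
        """Record one fix iteration; return a stop reason or None to continue."""
        self._total += 1
        if failure_count < self._previous:
            self._stalled = 0        # strict decrease counts as progress
        else:
            self._stalled += 1       # no improvement this iteration
        self._previous = failure_count

        if failure_count == 0:
            return "converged"
        if self._stalled >= self.max_stalled:
            return "stalled"
        if self._total >= self.max_total:
            return "iteration cap"
        if time.monotonic() - self._started > self.max_seconds:
            return "time cap"
        return None


# Five wording mismatches fixed one per iteration: never counted as stalled.
progressing = ProgressAwareBreaker(initial_failures=5)
progress_reasons = [progressing.record(n) for n in (4, 3, 2, 1, 0)]

# Five iterations with no improvement: the breaker fires as it does today.
stuck = ProgressAwareBreaker(initial_failures=5)
stuck_reasons = [stuck.record(n) for n in (5, 5, 5, 5, 5)]
```

Under these assumptions, a loop that fixes one mismatch per iteration reaches zero failures in N iterations, while a loop that makes no headway is still cut off after five attempts.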