feat: post success-path step comments via trusted credentials for change/test/checkup/e2e_fix by Serhan-Asad · Pull Request #1117 · promptdriven/pdd

Serhan-Asad · 2026-05-20T22:03:26Z

Summary

The GitHub App's worker container installs the CLI from promptdriven/pdd, but only pdd bug already had reliable orchestrator-owned visible per-step comments. pdd change / pdd test / pdd checkup / pdd fix step prompts could still rely on agent-side shell comments, which fails across providers when the model subprocess cannot access GitHub credentials.

This PR moves success-path step comments into the orchestrators and posts them with trusted PDD credentials. If an agent omits a <step_report> block, the orchestrator now posts a short fallback visible comment and keeps the raw output in workflow state.

Acceptance criteria mapping (from #1109)

pdd bug: already works upstream via existing _maybe_post_step_comment — unchanged
pdd change: wired here, including the 13-step flow and iterated review loop
pdd test: wired here, including fractional step 5.5 and the iterated manual loop
pdd checkup: wired here, including fractional 6.1/6.2/6.3 and the iterated fix-verify loop
pdd fix: wired here through agentic_e2e_fix_orchestrator, cycle-keyed across the 11-step loop
pdd sync: already uses an orchestrator-owned rolling visible PATCH comment rather than per-step comments

What changed

Helpers: pdd/agentic_common.py now exposes:

extract_step_report(text) -> Optional[str]
post_step_comment_once(...) -> bool
normalize_step_comments_state(raw) -> Set[int]

Orchestrators: change, test, checkup, and e2e_fix hydrate persisted step-comment state on resume, post each successful step once with trusted credentials, persist sorted posted-step keys, and log-and-continue if comment posting fails.

Fallback behavior: Missing <step_report> no longer suppresses visible progress. The orchestrator posts: _Step N completed; no <step_report> block returned by agent. Raw output retained in workflow state._

Prompt specs: Updated the matching *_python.prompt files so generated implementations preserve the trusted-comment and fallback behavior.

Tests

Local:

pytest tests/test_agentic_common.py::TestPostStepCommentOnce tests/test_agentic_change_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_test_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_checkup_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_e2e_fix_orchestrator.py::TestTrustedStepCommentPosting -q — 19 passed
PDD_PATH=$PWD pytest tests/test_agentic_common.py tests/test_agentic_change_orchestrator.py tests/test_agentic_test_orchestrator.py tests/test_agentic_checkup_orchestrator.py tests/test_agentic_e2e_fix_orchestrator.py -q — 934 passed
git diff --check — clean

GitHub Actions on latest commit 2f1e641ad:

Run Unit Tests — passed
Public CLI Regression — passed
Package Preprocess Smoke — passed
heal — passed

Supersedes

Closes #1109

Add three public helpers so orchestrators can own per-step `## Step N/T:` GitHub issue comments via trusted PDD credentials, instead of asking step prompts to shell out to `gh issue comment` (which is unreliable across providers, especially Gemini's sandboxed shell that cannot read GH_TOKEN). - `extract_step_report(text) -> Optional[str]` — public alias of the existing `_extract_step_report`; returns the LAST `<step_report>...</step_report>` block, DOTALL + case-insensitive, whitespace-stripped. - `post_step_comment_once(*, repo_owner, repo_name, issue_number, step_num, body, posted_steps, cwd) -> bool` — keyword-only, idempotent. Returns True immediately if `step_num` already in `posted_steps`; otherwise sanitizes/truncates the body via the existing `_sanitize_comment_body`, shells out to `gh issue comment`, and on success mutates `posted_steps` in place. Never raises; transient failures are logged and surfaced as False. - `normalize_step_comments_state(raw) -> Set[int]` — coerces any persisted shape (None, list/tuple/set/frozenset, legacy bug-orchestrator dict, malformed) into a `Set[int]` so resume is robust across state-file generations. Rejects `bool` (legacy flag value) and `float` (caller must project composite keys). Bug-orchestrator path (already shipping success-comment posting on upstream main) is untouched — these helpers cover the change/test/checkup/e2e_fix orchestrators, wired in a follow-up commit. architecture.json grows three function entries for the new public helpers under the agentic_common interface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…entials Provider-side `gh issue comment` posts from step prompts broke across non-Anthropic providers, leaving change/test/checkup/e2e_fix workflows silent from the GitHub UI even though Cloud Run logs showed progress. Wire the per-step `## Step N/T:` posting into the orchestrator path behind trusted PDD credentials. Step prompts emit `<step_report>...</step_report>` blocks; the orchestrator extracts, sanitizes, truncates, and posts once per step via `gh issue comment`, mutating a `Set[int]` of posted-step keys for resume idempotency. Wiring per orchestrator: - **change** (13 steps + iterated review loop 11/12): linear steps 1-10 keyed by `step_num`; review iterations keyed by `iteration * 100 + step_num` (so a second-iteration step 11 doesn't dedupe against the first); step 13 keyed by 13. All hard-stop / abort / failure early-returns persist `state["step_comments"]` before saving. - **test** (18 steps with fractional 5.5 + iterated manual loop 6-11): composite-keyed via `iteration * 10000 + int(round(step_num * 10))` so 5.5 -> 55, step 8 of iteration 2 -> 20080. `save_state` grows `description` and `iteration` kwargs to thread the description through. - **checkup** (8 steps with fractional 6.1/6.2/6.3 + iterated fix-verify loop): same composite encoding as test. Step comments persist through `_build_state(..., step_comments=...)`. - **e2e_fix** (11 steps × cycle loop): cycle-keyed encoding `current_cycle * 10000 + step_num`. `current_cycle` is normalized to >=1 before the step loop. All three state-save sites (main inner loop, KeyboardInterrupt, fatal Exception) persist `sorted(step_comments_set)`. Tests: - 24 new unit tests in `test_agentic_common.py` cover the three helpers — extract success/missing/empty/multi/case/multiline; normalize across every persisted shape (None, list, tuple, set, frozenset, legacy dict, malformed, garbage); once idempotency, successful post, gh-failure, missing-gh, secret redaction, oversize truncation, exception non-raise. - Three orchestrator-level tests per command: success-path posts, no-`<step_report>` skips, helper-exception is log-and-continue. - `tests/test_post_step_comment_once_real.py` — gated by `PDD_RUN_REAL_LLM_TESTS=1` and `PDD_REAL_GH_ISSUE_URL`. Posts a real comment, asserts idempotency on second call, then deletes the comment via `gh api`. Bug-orchestrator path (already shipping success-comment posting upstream via `_maybe_post_step_comment`) is untouched. `agentic_sync_runner` is out of scope — its single rolling PATCH'd comment is not per-step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Serhan-Asad · 2026-05-21T01:11:32Z

Staging1 end-to-end verification

Built worker image from this PR's HEAD (2f1e641a) and deployed to all three
staging1 services (worker, solving-worker, executor-job). Then ran a baseline on
the pre-PR pinned SHA (5c077752) and the same workflows on the new image.

Build / unit

Local unit tests (Python 3.12, PDD_PATH=$PWD): 934 passed, 0 failed.
New helper tests (TestPostStepCommentOnce + four TestTrustedStepCommentPosting): 19 passed.
Cloud Build of Dockerfile.worker pinned to 2f1e641a: SUCCESS in ~6 min.
One worktree-only Dockerfile tweak required: git clone … && git checkout SHA
on the prompt-repo clone can't reach an unmerged PR SHA, so added a
git fetch origin "+refs/pull/*/head:refs/pulls/*" fallback. Benign for
merged SHAs; not committed to pdd_cloud main.

Bug reproduction (pre-PR worker, `5c077752`)

pdd change with pdd-gemini-flash + pdd-effort-low on
promptdriven/test_repo#3213:
posted Steps 1–8 in agent format (## Step N: …), then 22 minutes silent
while the executor's Step 9 was still running. Matches the user-visible
silence symptom from #1092 / #1103 (which were on the same pinned SHA in
production). Cancelled before Step 9 finished, so I observed the symptom but
not the underlying credential-failure mechanism directly — that piece is
anchored to the upstream production evidence.

Fix verification (PR worker, `2f1e641a`)

pdd change on test_repo#3214
— same labels, same shape of issue. Orchestrator-side trusted post path is
immediately visible:

[00:07:36Z] ## Step 1/13: Search for duplicate issues       ← ORCH ✓
[00:09:01Z] ## Step 2/13: Check if already implemented      ← ORCH ✓
[00:10:59Z] ## Step 3/13: Research to clarify specifications ← ORCH ✓
[00:12:52Z] ## Step 4/13: Verify requirements are clear     ← ORCH ✓
[00:15:18Z] ## Step 5/13: Analyze documentation changes needed ← ORCH ✓

The agent ALSO continues to post its ## Step N: … headers (per-step prompts
unchanged), so end-users see ~2 visible comments per step. That's cosmetic, not
a correctness issue — the second comment comes from a credential path that
cannot fail silently.

Cross-command spot-check

pdd checkup on test_repo#3215:

[00:33:21Z] ## Step 1/8: Discovering project structure and tech stack ← ORCH ✓
[00:36:28Z] ## Step 2/8: Auditing dependencies                         ← ORCH ✓
[00:38:29Z] ## Step 3/8: Running build/compile checks                  ← ORCH ✓
[00:43:12Z] ## Step 4/8: Checking cross-module interfaces              ← ORCH ✓

Headers match the orchestrator's step_definitions (not the agent's free-form
titles), confirming post_step_comment_once is firing from the checkup
orchestrator code added by this PR.

Regression analysis

pdd/agentic_bug_orchestrator.py is unchanged in this PR.
agentic_common.py changes are purely additive (new public helpers; the
pre-existing _extract_step_report is preserved and extract_step_report
is a new public alias).
No code in pdd_cloud extensions/github_pdd_app/src/ or tests/ imports the
new helpers — no integration-side breakage risk there.
Live executor flow on both staging runs (#3214, #3215) handled secrets,
branch checkout, credential waterfall, and step-comment posting cleanly with
the new image.

Gap acknowledgement (not live-tested)

These rely on the PR's unit-test coverage, not staging E2E:

Review loop Steps 11/12 with iteration*100 + step_num keying (the most
novel code in the PR).
Step 13 PR-creation comment site.
Fallback "Step N completed; no <step_report> block returned" path —
Gemini consistently emitted <step_report> blocks in steps 1–5 of #3214 so
the fallback never fired.

The iterated fix-verify loop on agentic_checkup_orchestrator (Steps 6.1/6.2/6.3)
will be partially live-tested if #3215 reaches Step 6.

Verdict

Safe to merge from this verification's perspective. The trusted-credential
safety net is real, exercises cleanly on two of the four touched orchestrators,
unit tests pass, and pdd-bug is strictly untouched.

Two suggested follow-ups (not blockers):

Update individual step prompts to drop the model-side gh issue comment
call now that the orchestrator owns the success comment, eliminating the
duplicate-comment cosmetics.
Quick staging exercise on pdd test and pdd e2e_fix before they hit
production — unit tests cover the helpers but those two orchestrators
weren't part of this scope.

gltanaka

I think the underlying change is needed for #1109: moving user-visible step progress out of model-side gh issue comment calls and into the orchestrators is the right direction.

I would not merge this PR as-is. Required changes:

Resolve the current merge conflicts. GitHub reports the PR as CONFLICTING, and git merge-tree origin/main origin/pr/1117 shows conflicts in pdd/agentic_checkup_orchestrator.py and tests/test_agentic_e2e_fix_orchestrator.py.
Fix the pdd change review-loop posting path. In pdd/agentic_change_orchestrator.py, the new post_step_comment_once(...) calls for Step 11 and Step 12 run immediately after run_agentic_task without checking s11_success / s12_success (pdd/agentic_change_orchestrator.py:2005 and :2043 in this PR). That means a provider failure or failed review/fix task can still post a visible “Step completed” fallback and mark the composite key as posted in state["step_comments"]. A later resume/retry can then suppress the real successful report for that same iteration key. Please gate these success comments on the step success flag, and add regression tests proving failed Step 11/12 outputs do not post or mark success comments.
Align e2e_fix with the claimed 11-step coverage. The PR introduces ## Step N/11 comments, but the actual posting hook is only inside the inner for step_num in range(1, 10) loop (pdd/agentic_e2e_fix_orchestrator.py:2483-2505). The post-success Step 11 cleanup and Step 10 CI validation paths (:2903-2943) do not use the trusted step-comment path, and the new tests only assert call_count >= 1, so this gap is not covered. Either add trusted visible progress for Steps 10 and 11, or narrow the scope/header/tests so the PR does not claim complete 11-step pdd fix coverage.

I ran the targeted PR tests locally from the PR head:

pytest tests/test_agentic_common.py::TestPostStepCommentOnce tests/test_agentic_change_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_test_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_checkup_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_e2e_fix_orchestrator.py::TestTrustedStepCommentPosting -q

Result: 19 passed.

Resolves conflicts with main's checkup hardening (round-3/4/5 Step 7 gate, pr_head_sha state, --no-fix gate, final-checkup head-SHA safety). Files touched: - pdd/agentic_checkup_orchestrator.py: keep both branches' import additions (_sanitize_comment_body / post_pr_comment / post_step_comment from main; normalize_step_comments_state / post_step_comment_once from this branch). _save_state, _build_state: keep both new params (pr_head_sha, step_comments). --no-fix step 7 call site: preserve main's nofix_step7_output capture AND this branch's description=desc7 argument. - tests/test_agentic_e2e_fix_orchestrator.py: additive append of both branches' new test classes (TestTrustedStepCommentPosting from here, TestFinalCheckupForwardsCwd / TestFinalCheckupHeadSafetyChecks from main). - tests/test_agentic_checkup_orchestrator.py: trusted-step-comment tests now wrap mock outputs in ALL_ISSUES_FIXED so main's _step7_passed JSON gate doesn't reject them. 945 tests pass across the conflict-touched files (test_agentic_common.py, *_change_orchestrator.py, *_test_orchestrator.py, *_checkup_orchestrator.py, *_e2e_fix_orchestrator.py).

…N gate accepts them Post-merge fix-up: main's Step 7 gate (`_step7_passed`) parses a structured JSON verdict and rejects step 7 outputs that don't include it. The trusted step-comment tests added by this PR were written against the pre-gate behavior and used `"All Issues Fixed"` alone. Wrap them with the `ALL_ISSUES_FIXED` helper defined at the top of the test module so the mocks pass the new gate. All three TestTrustedStepCommentPosting tests now pass against the merged checkup orchestrator.

1. Gate trusted Step 11/12 posts on success flag (agentic_change_orchestrator). Previously the change orchestrator called post_step_comment_once inside the review loop unconditionally after run_agentic_task, even when s11_success / s12_success was False. A failed task would post a "Step N completed; no <step_report> block" fallback comment AND mark the composite key (review_iteration * 100 + step_num) as posted. A later resume/retry of the same iteration would then dedup against that failed key and silently suppress the real successful report. Gate both posts behind their success flag. Add two regression tests (test_step11_failure_does_not_post_or_mark_iter_key, test_step12_failure_does_not_post_or_mark_iter_key) that prove a failed Step 11/12 does not post anything and does not burn keys 111 / 112. 2. Extend trusted step-comment coverage to Steps 10 + 11 (agentic_e2e_fix_orchestrator) so the PR's "## Step N/11:" claim matches reality. The original PR only posted from the linear `for step_num in range(1, 10)` loop (steps 1-9). Step 10 (CI validation) and Step 11 (cleanup) ran after the loop and never invoked post_step_comment_once. Add a trusted post for Step 10 immediately after run_ci_validation_loop returns ci_success=True, keyed by current_cycle * 10000 + 10. Add a trusted post for Step 11 immediately after _run_step11_code_cleanup returns, keyed by current_cycle * 10000 + 11, with a body that reports the actual cleanup file delta (or a "no additional files modified" summary when the cleanup pass found nothing to do or reverted itself). Both posts use the orchestrator-format "## Step N/11:" header so downstream tooling can recognize them as orchestrator-owned. Add two regression tests (test_step10_ci_path_posts_trusted_comment, test_step11_cleanup_path_posts_trusted_comment) that prove the composite keys 10010 / 10011 appear in mock_post_once.call_args_list with the right header on the happy path. 949 tests pass across the affected files (test_agentic_common.py, test_agentic_change_orchestrator.py, test_agentic_test_orchestrator.py, test_agentic_checkup_orchestrator.py, test_agentic_e2e_fix_orchestrator.py).

Serhan-Asad · 2026-05-21T02:24:29Z

Thanks for the review — addressed all three.

1. Merge conflicts resolved. Brought origin/main in via merge commit a93cecf519. Conflicts were in pdd/agentic_checkup_orchestrator.py (5 hunks, all additive — kept both branches' import additions, both new state fields pr_head_sha + step_comments, and preserved main's nofix_step7_output capture while keeping this branch's description=desc7 arg) and tests/test_agentic_e2e_fix_orchestrator.py (one large hunk, both branches appended new test classes — kept both). One follow-up commit b3b02aa76 wrapped the new TestTrustedStepCommentPosting mock outputs in main's ALL_ISSUES_FIXED so the _step7_passed JSON gate accepts them.

2. Gated Step 11/12 trusted posts on s11_success / s12_success (agentic_change_orchestrator.py). A failed task would otherwise post a fallback "Step N completed" comment AND mark state["step_comments"] with the composite iteration key, suppressing the real successful report on resume. Added two regression tests:

test_step11_failure_does_not_post_or_mark_iter_key — asserts that s11_success=False produces no post_step_comment_once call for key 111 (review_iteration=1, step=11).
test_step12_failure_does_not_post_or_mark_iter_key — same shape for key 112.

3. Aligned e2e_fix orchestrator with the claimed 11-step ## Step N/11: coverage (agentic_e2e_fix_orchestrator.py).

Step 10 (CI validation): trusted post added immediately after run_ci_validation_loop returns ci_success=True, keyed current_cycle * 10000 + 10, body header ## Step 10/11: CI validation.
Step 11 (cleanup): trusted post added at the call site immediately after _run_step11_code_cleanup returns, keyed current_cycle * 10000 + 11, header ## Step 11/11: Code cleanup. Body reports the actual cleanup file delta (pre_cleanup_files vs post_cleanup_files), or a "no additional files modified" summary when cleanup found nothing to do or reverted itself when verification failed.

Strengthened the existing tests with two new ones that assert the specific composite keys appear in mock_post_once.call_args_list with the right header:

test_step10_ci_path_posts_trusted_comment — happy-path pattern from test_final_checkup_pr_gate_runs_after_ci_success, asserts key 10010 is posted with ## Step 10/11: header.
test_step11_cleanup_path_posts_trusted_comment — same pattern, asserts key 10011 with ## Step 11/11: header.

Test results locally (Python 3.12, PDD_PATH=$PWD):

PDD_PATH=$PWD pytest tests/test_agentic_common.py \
    tests/test_agentic_change_orchestrator.py \
    tests/test_agentic_test_orchestrator.py \
    tests/test_agentic_checkup_orchestrator.py \
    tests/test_agentic_e2e_fix_orchestrator.py -q

949 passed

Trusted-step-comment tests alone:

PDD_PATH=$PWD pytest \
    tests/test_agentic_common.py::TestPostStepCommentOnce \
    tests/test_agentic_change_orchestrator.py::TestTrustedStepCommentPosting \
    tests/test_agentic_test_orchestrator.py::TestTrustedStepCommentPosting \
    tests/test_agentic_checkup_orchestrator.py::TestTrustedStepCommentPosting \
    tests/test_agentic_e2e_fix_orchestrator.py::TestTrustedStepCommentPosting -q

23 passed  (was 19, +4 new regression tests)

Commits in this push:

a93cecf519 — merge: bring main into fix/orchestrator-step-comments-upstream
b3b02aa76 — test(checkup): wrap step-comment mocks in ALL_ISSUES_FIXED so the JSON gate accepts them
d98173c900 — fix(orchestrators): address Greg's PR feat: post success-path step comments via trusted credentials for change/test/checkup/e2e_fix #1117 review (round 2)

CI rerunning on d98173c900.

…g + step 10/11 trusted posts PDD prompts are the source of truth for the generated orchestrator code. After the code changes that gate Step 11/12 trusted posts on ``s11_success``/``s12_success`` (change orchestrator) and add Step 10 (CI validation) + Step 11 (cleanup) trusted posts via the same cycle-keyed encoding (e2e_fix orchestrator), the matching prompt specs needed to spell out the new contract — otherwise the next ``pdd sync``/regen would re-emit code that drops the gate and the new post sites. agentic_change_orchestrator_python.prompt: - Posting protocol §4 (review loop): add a REQUIRED success-gate clause stating that ``post_step_comment_once`` for Step 11 and Step 12 MUST be guarded by ``s11_success`` / ``s12_success``. Spell out the resume/dedup failure mode so a future regen does not weaken the gate to "log-and-continue". agentic_e2e_fix_orchestrator_python.prompt: - Narrow the in-loop posting protocol to "steps 1-9" so it is no longer ambiguous whether the helper covers the post-loop steps. - Add explicit Step 10 (CI validation) post protocol gated on ``ci_success=True``, keyed ``current_cycle * 10000 + 10``, with the ``## Step 10/11:`` header. - Add explicit Step 11 (cleanup) post protocol that snapshots ``changed_files`` before ``_run_step11_code_cleanup`` and computes the cleanup file delta for the body, keyed ``current_cycle * 10000 + 11``, header ``## Step 11/11:``. - Add a separate REQUIRED success-gate clause for the in-loop posts so failed steps don't burn composite keys (mirror of the change orchestrator gate). No code change in this commit — the code already implements all of the above. This is the prompt-side of the same contract so the ``architecture.json`` interface declaration and the generated code stay consistent.

gltanaka

Thanks for the update. The previous blockers are mostly addressed: the PR is now mergeable, the Step 11/12 review-loop posts are gated on success, and Step 10/11 e2e_fix comment sites were added with tests. CI is green, and I also reran the trusted-step-comment subset locally on the updated head:

PDD_PATH=$PWD pytest tests/test_agentic_common.py::TestPostStepCommentOnce tests/test_agentic_change_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_test_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_checkup_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_e2e_fix_orchestrator.py::TestTrustedStepCommentPosting -q

Result: 23 passed.

One remaining required change before I would merge:

pdd/agentic_e2e_fix_orchestrator.py posts trusted Step 11 cleanup and Step 10 CI comments outside the main inner-loop save path, but those post sites only mutate the in-memory step_comments_set. They do not persist the updated state["step_comments"] / workflow state before planned failure returns.

Concrete failure modes:

Step 11 cleanup posts key current_cycle * 10000 + 11, then run_ci_validation_loop returns ci_success=False and the function returns False at the CI failure path. The existing workflow state still only has the earlier inner-loop keys, so a resume/retry can repost Step 11.
Step 10 CI posts key current_cycle * 10000 + 10, then _run_final_checkup_on_pr(...) returns False and the function returns before clearing state. The saved state again lacks the Step 10 key, so a resume/retry can repost Step 10 (and likely Step 11 too).

This violates the PR/issue requirement that visible step comments be idempotent across resume/retry. Please persist the updated step_comments after successful Step 10/11 trusted posts on any path where workflow state may remain for a later retry, and add regression coverage for the CI-failure / final-checkup-failure cases so those keys survive in saved state.

…d posts Round-3 review surfaced a remaining idempotency hole on the post-loop trusted-comment sites. Failure modes addressed: 1. Step 11 cleanup post followed by ``run_ci_validation_loop`` returning ``ci_success=False``. The orchestrator returns at the CI-failure path without touching workflow state, so saved state still only has the inner-loop keys. Resume rehydrates ``step_comments_set`` from disk, misses the Step 11 key ``current_cycle * 10000 + 11``, and re-posts the comment. 2. Step 10 CI post followed by ``_run_final_checkup_on_pr`` returning ``checkup_success=False``. Same shape: the failure return bypasses state save, the Step 10 key ``current_cycle * 10000 + 10`` is not on disk, and resume re-posts (and likely re-runs Step 11 too). Both violate the PR/issue contract that visible step comments be idempotent across resume/retry. Fix: * After each post-loop Step 10/11 trusted ``post_step_comment_once`` call, mutate ``state_data["step_comments"] = sorted(step_comments_set)`` (also refresh ``total_cost``, ``changed_files``, ``last_saved_at``) and call ``save_workflow_state(...)`` BEFORE the next code path that may fail-and-return. Both saves live inside the same ``try/except`` block that guards the post, so a transient persistence failure logs-and-continues rather than crashing the workflow. Tests added (regression coverage for the two failure modes above): * ``test_step11_post_persists_step_comments_before_ci_failure_return`` - mocks ``run_ci_validation_loop`` to return ``ci_success=False``, asserts ``success=False`` propagates, AND asserts ``mock_save`` was called with ``state["step_comments"]`` containing key ``10011``. * ``test_step10_post_persists_step_comments_before_checkup_failure_return`` - mocks the final checkup to return ``checkup_success=False``, asserts ``success=False`` propagates AND asserts ``mock_save`` was called with ``state["step_comments"]`` containing key ``10010``. Both tests use a ``post_step_comment_once`` side-effect that faithfully mirrors the real helper's ``posted_steps.add(step_num)`` behavior — otherwise a bare ``return_value=True`` mock would leave the in-memory set empty and silently pass the assertion. Prompt: * ``agentic_e2e_fix_orchestrator_python.prompt`` gains a REQUIRED post-loop persistence clause that spells out exactly which save state fields to refresh, why the save must be inline (CI/ final-checkup failure paths don't touch state), and that the save must live inside the same try/except as the post. 951 tests pass across the affected files (test_agentic_common.py, test_agentic_change_orchestrator.py, test_agentic_test_orchestrator.py, test_agentic_checkup_orchestrator.py, test_agentic_e2e_fix_orchestrator.py).

Serhan-Asad · 2026-05-21T03:56:48Z

Addressed in edb818235.

The fix:

After each post-loop Step 10 / Step 11 trusted post_step_comment_once call, the orchestrator now refreshes state_data["step_comments"] = sorted(step_comments_set) (alongside the running total_cost, changed_files, last_saved_at) and calls save_workflow_state(...) inline — BEFORE the next code path that may return False, ... without touching state. Both new save calls live inside the same try/except block that guards the post, so a transient persistence failure logs-and-continues rather than aborting the workflow.

That closes both of the failure modes you described:

run_ci_validation_loop returning ci_success=False after the Step 11 cleanup post: the Step 11 key (current_cycle * 10000 + 11) is now on disk before the CI-failure return.
_run_final_checkup_on_pr returning checkup_success=False after the Step 10 CI post: the Step 10 key (current_cycle * 10000 + 10) is on disk before the final-checkup-failure return.

Regression tests added (live in tests/test_agentic_e2e_fix_orchestrator.py::TestTrustedStepCommentPosting):

test_step11_post_persists_step_comments_before_ci_failure_return — mocks run_ci_validation_loop to return ci_success=False, asserts success=False propagates AND that save_workflow_state was called at least once with state["step_comments"] containing key 10011.
test_step10_post_persists_step_comments_before_checkup_failure_return — mocks the final checkup to return checkup_success=False, asserts success=False propagates AND that save_workflow_state was called at least once with state["step_comments"] containing key 10010.

Both tests use a post_step_comment_once side_effect that mirrors the real helper's posted_steps.add(step_num) behavior — a bare return_value=True mock would have left the in-memory set empty and silently passed.

Prompt updated too (agentic_e2e_fix_orchestrator_python.prompt) so pdd sync-regenerated code keeps the inline save: a "REQUIRED post-loop persistence" clause names the failure paths, the state fields to refresh, and the requirement that the save live inside the same try/except as the post.

Local test results from PR head edb818235:

PDD_PATH=$PWD pytest tests/test_agentic_common.py \
    tests/test_agentic_change_orchestrator.py \
    tests/test_agentic_test_orchestrator.py \
    tests/test_agentic_checkup_orchestrator.py \
    tests/test_agentic_e2e_fix_orchestrator.py -q

951 passed

PDD_PATH=$PWD pytest \
    tests/test_agentic_common.py::TestPostStepCommentOnce \
    tests/test_agentic_change_orchestrator.py::TestTrustedStepCommentPosting \
    tests/test_agentic_test_orchestrator.py::TestTrustedStepCommentPosting \
    tests/test_agentic_checkup_orchestrator.py::TestTrustedStepCommentPosting \
    tests/test_agentic_e2e_fix_orchestrator.py::TestTrustedStepCommentPosting -q

25 passed  (was 23 — +2 new persistence regression tests)

Catch-up merge after main advanced 6 commits since the previous merge (``a93cecf519``). No content conflicts expected — this preserves the linear feature history while keeping the PR fast-forwardable.

gltanaka

The prior round-3 fresh-run failure cases are now covered, and the updated trusted-step-comment subset passes locally (25 passed). CI is also green. I still would not merge yet because the new persistence code has a resume-path hole.

Remaining required change:

pdd/agentic_e2e_fix_orchestrator.py now persists post-loop Step 10/11 keys by mutating state_data["step_comments"] after the trusted post. That works only when state_data was created by the inner for step_num in range(1, 10) loop. On resumed success paths, the inner loop can be skipped entirely:

cached Step 9 SUCCESS_FALL_THROUGH sets success = True before the outer loop (_resume_deferred_action == "SUCCESS_FALL_THROUGH"), then while current_cycle <= max_cycles and not success does not run;
resume_terminal_success also sets success = True before the loop.

In those cases state_data has not been initialized in this invocation (it is first assigned inside the inner loop). The post-loop Step 11 / Step 10 comments can still be posted, but the subsequent state_data["step_comments"] = ... raises UnboundLocalError inside the broad try/except, gets logged as a post_step_comment_once failed warning, and the key is not persisted. A later retry can repost the same visible comment, which reopens the idempotency problem for exactly the resume/retry paths this state is meant to protect.

Please build or refresh a post-loop state object independently of the inner-loop local before these Step 10/11 saves, and add regression coverage for a resumed cached-success path (for example: loaded state at last_completed_step=9 with Step 9 success, independent re-verification succeeds, then CI or final checkup fails) asserting keys 10011 / 10010 are persisted.

…tence saves Round-4 review surfaced a resume-path hole in the round-3 persistence fix. The inner ``for step_num in range(1, 10)`` loop is where the orchestrator's ``state_data`` dict was first assigned, but the SUCCESS_FALL_THROUGH and resume-terminal-success resume paths both set ``success = True`` BEFORE the outer ``while`` loop runs. On those paths the inner loop is skipped entirely, ``state_data`` is never assigned in this invocation, and the round-3 mutation lines (``state_data["step_comments"] = ...``) would raise ``UnboundLocalError`` inside the broad try/except around the trusted post. The exception got swallowed as a ``post_step_comment_once failed`` warning and the composite key ``current_cycle * 10000 + 10`` (or ``+ 11``) was never written to disk — a later retry rehydrated ``step_comments_set`` empty and re-posted the same visible step comment. Fix: * Introduce ``_build_post_loop_state(...)`` at module scope. It takes only function-scope locals that are unconditionally initialized at orchestrator entry (``step_outputs``, ``step_comments_set``, ``changed_files``, ``total_cost``, ``model_used``, ``dev_unit_states``, ``skipped_steps``, ``github_comment_id``, ``initial_file_hashes``, ``initial_sha``) and returns the same shape as the inner-loop save. * Replace both post-loop ``state_data["step_comments"] = ...`` mutations with a call to ``_build_post_loop_state`` whose result is handed directly to ``save_workflow_state``. The save works the same on the linear-execution path AND on the SUCCESS_FALL_THROUGH / resume-terminal-success resume paths. Tests added (regression coverage for the resume hole): * ``test_step11_persists_on_success_fall_through_resume_path`` — ``load_workflow_state`` returns a state with ``last_completed_step=9`` and a cached ALL_TESTS_PASS Step 9 output, forcing the SUCCESS_FALL_THROUGH path. Mocks ``run_ci_validation_loop`` to ``ci_success=False``, asserts the Step 11 composite key ``10011`` is in at least one ``save_workflow_state`` call's ``state["step_comments"]``. * ``test_step10_persists_on_success_fall_through_resume_path`` — same resume setup, mocks the final checkup to fail. Asserts key ``10010`` is persisted before the failure return. Both tests would fail on pre-fix code because the round-3 code mutated the never-assigned inner-loop ``state_data`` local. Prompt updated (``agentic_e2e_fix_orchestrator_python.prompt``) so ``pdd sync``-regenerated code stays resume-safe: a new "Resume-safe state construction (REQUIRED)" clause names the failure mode, explicitly forbids mutating the inner-loop ``state_data`` local, and suggests the ``_build_post_loop_state`` helper shape. 953 tests pass across the affected files (test_agentic_common.py, test_agentic_change_orchestrator.py, test_agentic_test_orchestrator.py, test_agentic_checkup_orchestrator.py, test_agentic_e2e_fix_orchestrator.py).

Serhan-Asad · 2026-05-21T04:52:43Z

Addressed in 6db04d5fe. You're right — the round-3 fix mutated the inner-loop state_data local, which is undefined on the SUCCESS_FALL_THROUGH / resume-terminal-success paths because the inner for step_num in range(1, 10) loop never runs.

The fix:

Introduced a module-level helper _build_post_loop_state(...) that constructs the workflow-state dict from function-scope locals only — step_outputs, step_comments_set, changed_files, total_cost, model_used, dev_unit_states, skipped_steps, github_comment_id, initial_file_hashes, initial_sha — all of which ARE unconditionally initialized at orchestrator entry. Both post-loop Step 10 / Step 11 saves now go through this helper and hand the result directly to save_workflow_state(...), so the save shape is identical across the linear-execution path, the SUCCESS_FALL_THROUGH resume path, and the resume-terminal-success path. The broad try/except still wraps the post, but it no longer needs to silently absorb an UnboundLocalError.

Regression tests added (live in tests/test_agentic_e2e_fix_orchestrator.py::TestTrustedStepCommentPosting):

test_step11_persists_on_success_fall_through_resume_path — load_workflow_state returns state with last_completed_step=9 and cached ALL_TESTS_PASS Step 9 output (the canonical SUCCESS_FALL_THROUGH trigger). Independent re-verification passes via _verify_tests_independently mock. Then run_ci_validation_loop returns ci_success=False. Asserts success=False propagates AND that at least one save_workflow_state call carries key 10011 in its state["step_comments"]. Would fail on pre-fix code because the round-3 mutation lines hit UnboundLocalError, the broad except swallowed it, and the key never made it to disk.
test_step10_persists_on_success_fall_through_resume_path — same resume setup, mocks the final checkup to return checkup_success=False. Asserts key 10010 is persisted before the failure return.

Both tests use the same posted_steps.add(step_num) side_effect as the round-3 tests so the in-memory set state matches the real helper.

Prompt updated too (agentic_e2e_fix_orchestrator_python.prompt) so pdd sync regen stays resume-safe: a new "Resume-safe state construction (REQUIRED)" clause names the failure mode, explicitly forbids mutating the inner-loop state_data local from post-loop sites, and suggests the _build_post_loop_state(...) helper shape (matching the signature in this PR).

Local test results from PR head 6db04d5fe:

PDD_PATH=$PWD pytest tests/test_agentic_common.py \
    tests/test_agentic_change_orchestrator.py \
    tests/test_agentic_test_orchestrator.py \
    tests/test_agentic_checkup_orchestrator.py \
    tests/test_agentic_e2e_fix_orchestrator.py -q

953 passed

PDD_PATH=$PWD pytest \
    tests/test_agentic_common.py::TestPostStepCommentOnce \
    tests/test_agentic_change_orchestrator.py::TestTrustedStepCommentPosting \
    tests/test_agentic_test_orchestrator.py::TestTrustedStepCommentPosting \
    tests/test_agentic_checkup_orchestrator.py::TestTrustedStepCommentPosting \
    tests/test_agentic_e2e_fix_orchestrator.py::TestTrustedStepCommentPosting -q

27 passed  (was 25 — +2 new resume-path regression tests)

gltanaka

The latest update addresses the remaining resume-path idempotency blocker.

I verified that the post-loop Step 10/11 saves now use _build_post_loop_state(...) instead of mutating the inner-loop state_data local, so the SUCCESS_FALL_THROUGH resume path can persist keys 10010 / 10011 even when the inner loop is skipped. The prompt contract was updated to preserve that behavior on regeneration, and the new regression tests cover the resume cases.

I reran the trusted-step-comment subset locally on the latest PR head:

PDD_PATH=$PWD pytest tests/test_agentic_common.py::TestPostStepCommentOnce tests/test_agentic_change_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_test_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_checkup_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_e2e_fix_orchestrator.py::TestTrustedStepCommentPosting -q

Result: 27 passed.

GitHub checks are green and the PR is mergeable. Approved.

1. Gate trusted Step 11/12 posts on success flag (agentic_change_orchestrator). Previously the change orchestrator called post_step_comment_once inside the review loop unconditionally after run_agentic_task, even when s11_success / s12_success was False. A failed task would post a "Step N completed; no <step_report> block" fallback comment AND mark the composite key (review_iteration * 100 + step_num) as posted. A later resume/retry of the same iteration would then dedup against that failed key and silently suppress the real successful report. Gate both posts behind their success flag. Add two regression tests (test_step11_failure_does_not_post_or_mark_iter_key, test_step12_failure_does_not_post_or_mark_iter_key) that prove a failed Step 11/12 does not post anything and does not burn keys 111 / 112. 2. Extend trusted step-comment coverage to Steps 10 + 11 (agentic_e2e_fix_orchestrator) so the PR's "## Step N/11:" claim matches reality. The original PR only posted from the linear `for step_num in range(1, 10)` loop (steps 1-9). Step 10 (CI validation) and Step 11 (cleanup) ran after the loop and never invoked post_step_comment_once. Add a trusted post for Step 10 immediately after run_ci_validation_loop returns ci_success=True, keyed by current_cycle * 10000 + 10. Add a trusted post for Step 11 immediately after _run_step11_code_cleanup returns, keyed by current_cycle * 10000 + 11, with a body that reports the actual cleanup file delta (or a "no additional files modified" summary when the cleanup pass found nothing to do or reverted itself). Both posts use the orchestrator-format "## Step N/11:" header so downstream tooling can recognize them as orchestrator-owned. Add two regression tests (test_step10_ci_path_posts_trusted_comment, test_step11_cleanup_path_posts_trusted_comment) that prove the composite keys 10010 / 10011 appear in mock_post_once.call_args_list with the right header on the happy path. 949 tests pass across the affected files (test_agentic_common.py, test_agentic_change_orchestrator.py, test_agentic_test_orchestrator.py, test_agentic_checkup_orchestrator.py, test_agentic_e2e_fix_orchestrator.py).

…r-step-comments-upstream feat: post success-path step comments via trusted credentials for change/test/checkup/e2e_fix

Serhan-Asad and others added 3 commits May 20, 2026 15:02

fix(orchestrators): post fallback step comments

2f1e641

Serhan-Asad marked this pull request as ready for review May 20, 2026 22:42

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

gltanaka requested changes May 21, 2026

View reviewed changes

Serhan-Asad added 3 commits May 20, 2026 19:02

gltanaka requested changes May 21, 2026

View reviewed changes

merge: bring main into fix/orchestrator-step-comments-upstream

1f5662d

Catch-up merge after main advanced 6 commits since the previous merge (``a93cecf519``). No content conflicts expected — this preserves the linear feature history while keeping the PR fast-forwardable.

gltanaka requested changes May 21, 2026

View reviewed changes

gltanaka approved these changes May 21, 2026

View reviewed changes

Merge branch 'main' into fix/orchestrator-step-comments-upstream

98c3490

gltanaka merged commit 9cfce9e into promptdriven:main May 21, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: post success-path step comments via trusted credentials for change/test/checkup/e2e_fix#1117

feat: post success-path step comments via trusted credentials for change/test/checkup/e2e_fix#1117
gltanaka merged 11 commits into
promptdriven:mainfrom
Serhan-Asad:fix/orchestrator-step-comments-upstream

Serhan-Asad commented May 20, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot left a comment

Uh oh!

Serhan-Asad commented May 21, 2026

Uh oh!

gltanaka left a comment

Uh oh!

Serhan-Asad commented May 21, 2026

Uh oh!

gltanaka left a comment

Uh oh!

Serhan-Asad commented May 21, 2026

Uh oh!

gltanaka left a comment

Uh oh!

Serhan-Asad commented May 21, 2026

Uh oh!

gltanaka left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Serhan-Asad commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Acceptance criteria mapping (from #1109)

What changed

Tests

Supersedes

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Serhan-Asad commented May 21, 2026

Staging1 end-to-end verification

Build / unit

Bug reproduction (pre-PR worker, 5c077752)

Fix verification (PR worker, 2f1e641a)

Cross-command spot-check

Regression analysis

Gap acknowledgement (not live-tested)

Verdict

Uh oh!

gltanaka left a comment

Choose a reason for hiding this comment

Uh oh!

Serhan-Asad commented May 21, 2026

Uh oh!

gltanaka left a comment

Choose a reason for hiding this comment

Uh oh!

Serhan-Asad commented May 21, 2026

Uh oh!

gltanaka left a comment

Choose a reason for hiding this comment

Uh oh!

Serhan-Asad commented May 21, 2026

Uh oh!

gltanaka left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Serhan-Asad commented May 20, 2026 •

edited

Loading

Bug reproduction (pre-PR worker, `5c077752`)

Fix verification (PR worker, `2f1e641a`)