Skip to content

feat: post success-path step comments via trusted credentials for change/test/checkup/e2e_fix#1117

Merged
gltanaka merged 11 commits into
promptdriven:mainfrom
Serhan-Asad:fix/orchestrator-step-comments-upstream
May 21, 2026
Merged

feat: post success-path step comments via trusted credentials for change/test/checkup/e2e_fix#1117
gltanaka merged 11 commits into
promptdriven:mainfrom
Serhan-Asad:fix/orchestrator-step-comments-upstream

Conversation

@Serhan-Asad
Copy link
Copy Markdown
Collaborator

@Serhan-Asad Serhan-Asad commented May 20, 2026

Summary

The GitHub App's worker container installs the CLI from promptdriven/pdd, but only pdd bug already had reliable orchestrator-owned visible per-step comments. pdd change / pdd test / pdd checkup / pdd fix step prompts could still rely on agent-side shell comments, which fails across providers when the model subprocess cannot access GitHub credentials.

This PR moves success-path step comments into the orchestrators and posts them with trusted PDD credentials. If an agent omits a <step_report> block, the orchestrator now posts a short fallback visible comment and keeps the raw output in workflow state.

Acceptance criteria mapping (from #1109)

  • pdd bug: already works upstream via existing _maybe_post_step_comment — unchanged
  • pdd change: wired here, including the 13-step flow and iterated review loop
  • pdd test: wired here, including fractional step 5.5 and the iterated manual loop
  • pdd checkup: wired here, including fractional 6.1/6.2/6.3 and the iterated fix-verify loop
  • pdd fix: wired here through agentic_e2e_fix_orchestrator, cycle-keyed across the 11-step loop
  • pdd sync: already uses an orchestrator-owned rolling visible PATCH comment rather than per-step comments

What changed

Helpers: pdd/agentic_common.py now exposes:

  • extract_step_report(text) -> Optional[str]
  • post_step_comment_once(...) -> bool
  • normalize_step_comments_state(raw) -> Set[int]

Orchestrators: change, test, checkup, and e2e_fix hydrate persisted step-comment state on resume, post each successful step once with trusted credentials, persist sorted posted-step keys, and log-and-continue if comment posting fails.

Fallback behavior: Missing <step_report> no longer suppresses visible progress. The orchestrator posts: _Step N completed; no <step_report> block returned by agent. Raw output retained in workflow state._

Prompt specs: Updated the matching *_python.prompt files so generated implementations preserve the trusted-comment and fallback behavior.

Tests

Local:

  • pytest tests/test_agentic_common.py::TestPostStepCommentOnce tests/test_agentic_change_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_test_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_checkup_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_e2e_fix_orchestrator.py::TestTrustedStepCommentPosting -q — 19 passed
  • PDD_PATH=$PWD pytest tests/test_agentic_common.py tests/test_agentic_change_orchestrator.py tests/test_agentic_test_orchestrator.py tests/test_agentic_checkup_orchestrator.py tests/test_agentic_e2e_fix_orchestrator.py -q — 934 passed
  • git diff --check — clean

GitHub Actions on latest commit 2f1e641ad:

  • Run Unit Tests — passed
  • Public CLI Regression — passed
  • Package Preprocess Smoke — passed
  • heal — passed

Supersedes

Closes #1109

Serhan-Asad and others added 3 commits May 20, 2026 15:02
Add three public helpers so orchestrators can own per-step
`## Step N/T:` GitHub issue comments via trusted PDD credentials,
instead of asking step prompts to shell out to `gh issue comment`
(which is unreliable across providers, especially Gemini's
sandboxed shell that cannot read GH_TOKEN).

- `extract_step_report(text) -> Optional[str]` — public alias of
  the existing `_extract_step_report`; returns the LAST
  `<step_report>...</step_report>` block, DOTALL + case-insensitive,
  whitespace-stripped.
- `post_step_comment_once(*, repo_owner, repo_name, issue_number,
  step_num, body, posted_steps, cwd) -> bool` — keyword-only,
  idempotent. Returns True immediately if `step_num` already in
  `posted_steps`; otherwise sanitizes/truncates the body via the
  existing `_sanitize_comment_body`, shells out to `gh issue
  comment`, and on success mutates `posted_steps` in place. Never
  raises; transient failures are logged and surfaced as False.
- `normalize_step_comments_state(raw) -> Set[int]` — coerces any
  persisted shape (None, list/tuple/set/frozenset, legacy
  bug-orchestrator dict, malformed) into a `Set[int]` so resume
  is robust across state-file generations. Rejects `bool` (legacy
  flag value) and `float` (caller must project composite keys).

Bug-orchestrator path (already shipping success-comment posting on
upstream main) is untouched — these helpers cover the
change/test/checkup/e2e_fix orchestrators, wired in a follow-up
commit.

architecture.json grows three function entries for the new public
helpers under the agentic_common interface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…entials

Provider-side `gh issue comment` posts from step prompts broke
across non-Anthropic providers, leaving change/test/checkup/e2e_fix
workflows silent from the GitHub UI even though Cloud Run logs
showed progress.

Wire the per-step `## Step N/T:` posting into the orchestrator
path behind trusted PDD credentials. Step prompts emit
`<step_report>...</step_report>` blocks; the orchestrator
extracts, sanitizes, truncates, and posts once per step via
`gh issue comment`, mutating a `Set[int]` of posted-step keys for
resume idempotency.

Wiring per orchestrator:

- **change** (13 steps + iterated review loop 11/12): linear steps
  1-10 keyed by `step_num`; review iterations keyed by
  `iteration * 100 + step_num` (so a second-iteration step 11
  doesn't dedupe against the first); step 13 keyed by 13. All
  hard-stop / abort / failure early-returns persist
  `state["step_comments"]` before saving.
- **test** (18 steps with fractional 5.5 + iterated manual loop
  6-11): composite-keyed via
  `iteration * 10000 + int(round(step_num * 10))` so 5.5 -> 55,
  step 8 of iteration 2 -> 20080. `save_state` grows `description`
  and `iteration` kwargs to thread the description through.
- **checkup** (8 steps with fractional 6.1/6.2/6.3 + iterated
  fix-verify loop): same composite encoding as test. Step
  comments persist through `_build_state(..., step_comments=...)`.
- **e2e_fix** (11 steps × cycle loop): cycle-keyed encoding
  `current_cycle * 10000 + step_num`. `current_cycle` is normalized
  to >=1 before the step loop. All three state-save sites (main
  inner loop, KeyboardInterrupt, fatal Exception) persist
  `sorted(step_comments_set)`.

Tests:

- 24 new unit tests in `test_agentic_common.py` cover the three
  helpers — extract success/missing/empty/multi/case/multiline;
  normalize across every persisted shape (None, list, tuple, set,
  frozenset, legacy dict, malformed, garbage); once idempotency,
  successful post, gh-failure, missing-gh, secret redaction,
  oversize truncation, exception non-raise.
- Three orchestrator-level tests per command: success-path posts,
  no-`<step_report>` skips, helper-exception is log-and-continue.
- `tests/test_post_step_comment_once_real.py` — gated by
  `PDD_RUN_REAL_LLM_TESTS=1` and `PDD_REAL_GH_ISSUE_URL`. Posts a
  real comment, asserts idempotency on second call, then deletes
  the comment via `gh api`.

Bug-orchestrator path (already shipping success-comment posting
upstream via `_maybe_post_step_comment`) is untouched.
`agentic_sync_runner` is out of scope — its single rolling
PATCH'd comment is not per-step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Serhan-Asad Serhan-Asad marked this pull request as ready for review May 20, 2026 22:42
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@Serhan-Asad
Copy link
Copy Markdown
Collaborator Author

Staging1 end-to-end verification

Built worker image from this PR's HEAD (2f1e641a) and deployed to all three
staging1 services (worker, solving-worker, executor-job). Then ran a baseline on
the pre-PR pinned SHA (5c077752) and the same workflows on the new image.

Build / unit

  • Local unit tests (Python 3.12, PDD_PATH=$PWD): 934 passed, 0 failed.
  • New helper tests (TestPostStepCommentOnce + four TestTrustedStepCommentPosting): 19 passed.
  • Cloud Build of Dockerfile.worker pinned to 2f1e641a: SUCCESS in ~6 min.
  • One worktree-only Dockerfile tweak required: git clone … && git checkout SHA
    on the prompt-repo clone can't reach an unmerged PR SHA, so added a
    git fetch origin "+refs/pull/*/head:refs/pulls/*" fallback. Benign for
    merged SHAs; not committed to pdd_cloud main.

Bug reproduction (pre-PR worker, 5c077752)

pdd change with pdd-gemini-flash + pdd-effort-low on
promptdriven/test_repo#3213:
posted Steps 1–8 in agent format (## Step N: …), then 22 minutes silent
while the executor's Step 9 was still running. Matches the user-visible
silence symptom from #1092 / #1103 (which were on the same pinned SHA in
production). Cancelled before Step 9 finished, so I observed the symptom but
not the underlying credential-failure mechanism directly — that piece is
anchored to the upstream production evidence.

Fix verification (PR worker, 2f1e641a)

pdd change on test_repo#3214
— same labels, same shape of issue. Orchestrator-side trusted post path is
immediately visible:

[00:07:36Z] ## Step 1/13: Search for duplicate issues       ← ORCH ✓
[00:09:01Z] ## Step 2/13: Check if already implemented      ← ORCH ✓
[00:10:59Z] ## Step 3/13: Research to clarify specifications ← ORCH ✓
[00:12:52Z] ## Step 4/13: Verify requirements are clear     ← ORCH ✓
[00:15:18Z] ## Step 5/13: Analyze documentation changes needed ← ORCH ✓

The agent ALSO continues to post its ## Step N: … headers (per-step prompts
unchanged), so end-users see ~2 visible comments per step. That's cosmetic, not
a correctness issue — the second comment comes from a credential path that
cannot fail silently.

Cross-command spot-check

pdd checkup on test_repo#3215:

[00:33:21Z] ## Step 1/8: Discovering project structure and tech stack ← ORCH ✓
[00:36:28Z] ## Step 2/8: Auditing dependencies                         ← ORCH ✓
[00:38:29Z] ## Step 3/8: Running build/compile checks                  ← ORCH ✓
[00:43:12Z] ## Step 4/8: Checking cross-module interfaces              ← ORCH ✓

Headers match the orchestrator's step_definitions (not the agent's free-form
titles), confirming post_step_comment_once is firing from the checkup
orchestrator code added by this PR.

Regression analysis

  • pdd/agentic_bug_orchestrator.py is unchanged in this PR.
  • agentic_common.py changes are purely additive (new public helpers; the
    pre-existing _extract_step_report is preserved and extract_step_report
    is a new public alias).
  • No code in pdd_cloud extensions/github_pdd_app/src/ or tests/ imports the
    new helpers — no integration-side breakage risk there.
  • Live executor flow on both staging runs (#3214, #3215) handled secrets,
    branch checkout, credential waterfall, and step-comment posting cleanly with
    the new image.

Gap acknowledgement (not live-tested)

These rely on the PR's unit-test coverage, not staging E2E:

  • Review loop Steps 11/12 with iteration*100 + step_num keying (the most
    novel code in the PR).
  • Step 13 PR-creation comment site.
  • Fallback "Step N completed; no <step_report> block returned" path —
    Gemini consistently emitted <step_report> blocks in steps 1–5 of #3214 so
    the fallback never fired.

The iterated fix-verify loop on agentic_checkup_orchestrator (Steps 6.1/6.2/6.3)
will be partially live-tested if #3215 reaches Step 6.

Verdict

Safe to merge from this verification's perspective. The trusted-credential
safety net is real, exercises cleanly on two of the four touched orchestrators,
unit tests pass, and pdd-bug is strictly untouched.

Two suggested follow-ups (not blockers):

  1. Update individual step prompts to drop the model-side gh issue comment
    call now that the orchestrator owns the success comment, eliminating the
    duplicate-comment cosmetics.
  2. Quick staging exercise on pdd test and pdd e2e_fix before they hit
    production — unit tests cover the helpers but those two orchestrators
    weren't part of this scope.

Copy link
Copy Markdown
Contributor

@gltanaka gltanaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the underlying change is needed for #1109: moving user-visible step progress out of model-side gh issue comment calls and into the orchestrators is the right direction.

I would not merge this PR as-is. Required changes:

  1. Resolve the current merge conflicts. GitHub reports the PR as CONFLICTING, and git merge-tree origin/main origin/pr/1117 shows conflicts in pdd/agentic_checkup_orchestrator.py and tests/test_agentic_e2e_fix_orchestrator.py.

  2. Fix the pdd change review-loop posting path. In pdd/agentic_change_orchestrator.py, the new post_step_comment_once(...) calls for Step 11 and Step 12 run immediately after run_agentic_task without checking s11_success / s12_success (pdd/agentic_change_orchestrator.py:2005 and :2043 in this PR). That means a provider failure or failed review/fix task can still post a visible “Step completed” fallback and mark the composite key as posted in state["step_comments"]. A later resume/retry can then suppress the real successful report for that same iteration key. Please gate these success comments on the step success flag, and add regression tests proving failed Step 11/12 outputs do not post or mark success comments.

  3. Align e2e_fix with the claimed 11-step coverage. The PR introduces ## Step N/11 comments, but the actual posting hook is only inside the inner for step_num in range(1, 10) loop (pdd/agentic_e2e_fix_orchestrator.py:2483-2505). The post-success Step 11 cleanup and Step 10 CI validation paths (:2903-2943) do not use the trusted step-comment path, and the new tests only assert call_count >= 1, so this gap is not covered. Either add trusted visible progress for Steps 10 and 11, or narrow the scope/header/tests so the PR does not claim complete 11-step pdd fix coverage.

I ran the targeted PR tests locally from the PR head:

pytest tests/test_agentic_common.py::TestPostStepCommentOnce tests/test_agentic_change_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_test_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_checkup_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_e2e_fix_orchestrator.py::TestTrustedStepCommentPosting -q

Result: 19 passed.

Resolves conflicts with main's checkup hardening (round-3/4/5 Step 7
gate, pr_head_sha state, --no-fix gate, final-checkup head-SHA safety).

Files touched:
- pdd/agentic_checkup_orchestrator.py: keep both branches' import additions
  (_sanitize_comment_body / post_pr_comment / post_step_comment from main;
  normalize_step_comments_state / post_step_comment_once from this branch).
  _save_state, _build_state: keep both new params (pr_head_sha, step_comments).
  --no-fix step 7 call site: preserve main's nofix_step7_output capture AND
  this branch's description=desc7 argument.
- tests/test_agentic_e2e_fix_orchestrator.py: additive append of both
  branches' new test classes (TestTrustedStepCommentPosting from here,
  TestFinalCheckupForwardsCwd / TestFinalCheckupHeadSafetyChecks from main).
- tests/test_agentic_checkup_orchestrator.py: trusted-step-comment tests
  now wrap mock outputs in ALL_ISSUES_FIXED so main's _step7_passed JSON
  gate doesn't reject them.

945 tests pass across the conflict-touched files
(test_agentic_common.py, *_change_orchestrator.py, *_test_orchestrator.py,
*_checkup_orchestrator.py, *_e2e_fix_orchestrator.py).
…N gate accepts them

Post-merge fix-up: main's Step 7 gate (`_step7_passed`) parses a structured
JSON verdict and rejects step 7 outputs that don't include it. The trusted
step-comment tests added by this PR were written against the pre-gate
behavior and used `"All Issues Fixed"` alone. Wrap them with the
`ALL_ISSUES_FIXED` helper defined at the top of the test module so the
mocks pass the new gate.

All three TestTrustedStepCommentPosting tests now pass against the merged
checkup orchestrator.
1. Gate trusted Step 11/12 posts on success flag (agentic_change_orchestrator).

   Previously the change orchestrator called post_step_comment_once
   inside the review loop unconditionally after run_agentic_task, even
   when s11_success / s12_success was False. A failed task would post a
   "Step N completed; no <step_report> block" fallback comment AND mark
   the composite key (review_iteration * 100 + step_num) as posted. A
   later resume/retry of the same iteration would then dedup against
   that failed key and silently suppress the real successful report.

   Gate both posts behind their success flag. Add two regression tests
   (test_step11_failure_does_not_post_or_mark_iter_key,
   test_step12_failure_does_not_post_or_mark_iter_key) that prove a
   failed Step 11/12 does not post anything and does not burn keys 111
   / 112.

2. Extend trusted step-comment coverage to Steps 10 + 11
   (agentic_e2e_fix_orchestrator) so the PR's "## Step N/11:" claim
   matches reality.

   The original PR only posted from the linear `for step_num in
   range(1, 10)` loop (steps 1-9). Step 10 (CI validation) and Step 11
   (cleanup) ran after the loop and never invoked post_step_comment_once.

   Add a trusted post for Step 10 immediately after run_ci_validation_loop
   returns ci_success=True, keyed by current_cycle * 10000 + 10.

   Add a trusted post for Step 11 immediately after _run_step11_code_cleanup
   returns, keyed by current_cycle * 10000 + 11, with a body that reports
   the actual cleanup file delta (or a "no additional files modified"
   summary when the cleanup pass found nothing to do or reverted itself).

   Both posts use the orchestrator-format "## Step N/11:" header so
   downstream tooling can recognize them as orchestrator-owned.

   Add two regression tests
   (test_step10_ci_path_posts_trusted_comment,
   test_step11_cleanup_path_posts_trusted_comment) that prove the
   composite keys 10010 / 10011 appear in mock_post_once.call_args_list
   with the right header on the happy path.

949 tests pass across the affected files (test_agentic_common.py,
test_agentic_change_orchestrator.py, test_agentic_test_orchestrator.py,
test_agentic_checkup_orchestrator.py, test_agentic_e2e_fix_orchestrator.py).
@Serhan-Asad
Copy link
Copy Markdown
Collaborator Author

Thanks for the review — addressed all three.

1. Merge conflicts resolved. Brought origin/main in via merge commit a93cecf519. Conflicts were in pdd/agentic_checkup_orchestrator.py (5 hunks, all additive — kept both branches' import additions, both new state fields pr_head_sha + step_comments, and preserved main's nofix_step7_output capture while keeping this branch's description=desc7 arg) and tests/test_agentic_e2e_fix_orchestrator.py (one large hunk, both branches appended new test classes — kept both). One follow-up commit b3b02aa76 wrapped the new TestTrustedStepCommentPosting mock outputs in main's ALL_ISSUES_FIXED so the _step7_passed JSON gate accepts them.

2. Gated Step 11/12 trusted posts on s11_success / s12_success (agentic_change_orchestrator.py). A failed task would otherwise post a fallback "Step N completed" comment AND mark state["step_comments"] with the composite iteration key, suppressing the real successful report on resume. Added two regression tests:

  • test_step11_failure_does_not_post_or_mark_iter_key — asserts that s11_success=False produces no post_step_comment_once call for key 111 (review_iteration=1, step=11).
  • test_step12_failure_does_not_post_or_mark_iter_key — same shape for key 112.

3. Aligned e2e_fix orchestrator with the claimed 11-step ## Step N/11: coverage (agentic_e2e_fix_orchestrator.py).

  • Step 10 (CI validation): trusted post added immediately after run_ci_validation_loop returns ci_success=True, keyed current_cycle * 10000 + 10, body header ## Step 10/11: CI validation.
  • Step 11 (cleanup): trusted post added at the call site immediately after _run_step11_code_cleanup returns, keyed current_cycle * 10000 + 11, header ## Step 11/11: Code cleanup. Body reports the actual cleanup file delta (pre_cleanup_files vs post_cleanup_files), or a "no additional files modified" summary when cleanup found nothing to do or reverted itself when verification failed.

Strengthened the existing tests with two new ones that assert the specific composite keys appear in mock_post_once.call_args_list with the right header:

  • test_step10_ci_path_posts_trusted_comment — happy-path pattern from test_final_checkup_pr_gate_runs_after_ci_success, asserts key 10010 is posted with ## Step 10/11: header.
  • test_step11_cleanup_path_posts_trusted_comment — same pattern, asserts key 10011 with ## Step 11/11: header.

Test results locally (Python 3.12, PDD_PATH=$PWD):

PDD_PATH=$PWD pytest tests/test_agentic_common.py \
    tests/test_agentic_change_orchestrator.py \
    tests/test_agentic_test_orchestrator.py \
    tests/test_agentic_checkup_orchestrator.py \
    tests/test_agentic_e2e_fix_orchestrator.py -q

949 passed

Trusted-step-comment tests alone:

PDD_PATH=$PWD pytest \
    tests/test_agentic_common.py::TestPostStepCommentOnce \
    tests/test_agentic_change_orchestrator.py::TestTrustedStepCommentPosting \
    tests/test_agentic_test_orchestrator.py::TestTrustedStepCommentPosting \
    tests/test_agentic_checkup_orchestrator.py::TestTrustedStepCommentPosting \
    tests/test_agentic_e2e_fix_orchestrator.py::TestTrustedStepCommentPosting -q

23 passed  (was 19, +4 new regression tests)

Commits in this push:

CI rerunning on d98173c900.

…g + step 10/11 trusted posts

PDD prompts are the source of truth for the generated orchestrator
code. After the code changes that gate Step 11/12 trusted posts on
``s11_success``/``s12_success`` (change orchestrator) and add Step 10
(CI validation) + Step 11 (cleanup) trusted posts via the same
cycle-keyed encoding (e2e_fix orchestrator), the matching prompt
specs needed to spell out the new contract — otherwise the next
``pdd sync``/regen would re-emit code that drops the gate and the
new post sites.

agentic_change_orchestrator_python.prompt:
- Posting protocol §4 (review loop): add a REQUIRED success-gate
  clause stating that ``post_step_comment_once`` for Step 11 and
  Step 12 MUST be guarded by ``s11_success`` / ``s12_success``.
  Spell out the resume/dedup failure mode so a future regen does not
  weaken the gate to "log-and-continue".

agentic_e2e_fix_orchestrator_python.prompt:
- Narrow the in-loop posting protocol to "steps 1-9" so it is no
  longer ambiguous whether the helper covers the post-loop steps.
- Add explicit Step 10 (CI validation) post protocol gated on
  ``ci_success=True``, keyed ``current_cycle * 10000 + 10``, with the
  ``## Step 10/11:`` header.
- Add explicit Step 11 (cleanup) post protocol that snapshots
  ``changed_files`` before ``_run_step11_code_cleanup`` and computes
  the cleanup file delta for the body, keyed
  ``current_cycle * 10000 + 11``, header ``## Step 11/11:``.
- Add a separate REQUIRED success-gate clause for the in-loop posts
  so failed steps don't burn composite keys (mirror of the change
  orchestrator gate).

No code change in this commit — the code already implements all of
the above. This is the prompt-side of the same contract so the
``architecture.json`` interface declaration and the generated code
stay consistent.
Copy link
Copy Markdown
Contributor

@gltanaka gltanaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. The previous blockers are mostly addressed: the PR is now mergeable, the Step 11/12 review-loop posts are gated on success, and Step 10/11 e2e_fix comment sites were added with tests. CI is green, and I also reran the trusted-step-comment subset locally on the updated head:

PDD_PATH=$PWD pytest tests/test_agentic_common.py::TestPostStepCommentOnce tests/test_agentic_change_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_test_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_checkup_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_e2e_fix_orchestrator.py::TestTrustedStepCommentPosting -q

Result: 23 passed.

One remaining required change before I would merge:

pdd/agentic_e2e_fix_orchestrator.py posts trusted Step 11 cleanup and Step 10 CI comments outside the main inner-loop save path, but those post sites only mutate the in-memory step_comments_set. They do not persist the updated state["step_comments"] / workflow state before planned failure returns.

Concrete failure modes:

  • Step 11 cleanup posts key current_cycle * 10000 + 11, then run_ci_validation_loop returns ci_success=False and the function returns False at the CI failure path. The existing workflow state still only has the earlier inner-loop keys, so a resume/retry can repost Step 11.
  • Step 10 CI posts key current_cycle * 10000 + 10, then _run_final_checkup_on_pr(...) returns False and the function returns before clearing state. The saved state again lacks the Step 10 key, so a resume/retry can repost Step 10 (and likely Step 11 too).

This violates the PR/issue requirement that visible step comments be idempotent across resume/retry. Please persist the updated step_comments after successful Step 10/11 trusted posts on any path where workflow state may remain for a later retry, and add regression coverage for the CI-failure / final-checkup-failure cases so those keys survive in saved state.

…d posts

Round-3 review surfaced a remaining idempotency hole on the post-loop
trusted-comment sites.

Failure modes addressed:

1. Step 11 cleanup post followed by ``run_ci_validation_loop``
   returning ``ci_success=False``. The orchestrator returns at the
   CI-failure path without touching workflow state, so saved state
   still only has the inner-loop keys. Resume rehydrates
   ``step_comments_set`` from disk, misses the Step 11 key
   ``current_cycle * 10000 + 11``, and re-posts the comment.

2. Step 10 CI post followed by ``_run_final_checkup_on_pr`` returning
   ``checkup_success=False``. Same shape: the failure return bypasses
   state save, the Step 10 key ``current_cycle * 10000 + 10`` is not
   on disk, and resume re-posts (and likely re-runs Step 11 too).

Both violate the PR/issue contract that visible step comments be
idempotent across resume/retry.

Fix:
* After each post-loop Step 10/11 trusted ``post_step_comment_once``
  call, mutate ``state_data["step_comments"] = sorted(step_comments_set)``
  (also refresh ``total_cost``, ``changed_files``, ``last_saved_at``)
  and call ``save_workflow_state(...)`` BEFORE the next code path that
  may fail-and-return. Both saves live inside the same ``try/except``
  block that guards the post, so a transient persistence failure
  logs-and-continues rather than crashing the workflow.

Tests added (regression coverage for the two failure modes above):
* ``test_step11_post_persists_step_comments_before_ci_failure_return``
  - mocks ``run_ci_validation_loop`` to return ``ci_success=False``,
    asserts ``success=False`` propagates, AND asserts ``mock_save``
    was called with ``state["step_comments"]`` containing key
    ``10011``.
* ``test_step10_post_persists_step_comments_before_checkup_failure_return``
  - mocks the final checkup to return ``checkup_success=False``,
    asserts ``success=False`` propagates AND asserts ``mock_save``
    was called with ``state["step_comments"]`` containing key
    ``10010``.

Both tests use a ``post_step_comment_once`` side-effect that
faithfully mirrors the real helper's ``posted_steps.add(step_num)``
behavior — otherwise a bare ``return_value=True`` mock would leave
the in-memory set empty and silently pass the assertion.

Prompt:
* ``agentic_e2e_fix_orchestrator_python.prompt`` gains a REQUIRED
  post-loop persistence clause that spells out exactly which save
  state fields to refresh, why the save must be inline (CI/
  final-checkup failure paths don't touch state), and that the save
  must live inside the same try/except as the post.

951 tests pass across the affected files
(test_agentic_common.py, test_agentic_change_orchestrator.py,
test_agentic_test_orchestrator.py, test_agentic_checkup_orchestrator.py,
test_agentic_e2e_fix_orchestrator.py).
@Serhan-Asad
Copy link
Copy Markdown
Collaborator Author

Addressed in edb818235.

The fix:

After each post-loop Step 10 / Step 11 trusted post_step_comment_once call, the orchestrator now refreshes state_data["step_comments"] = sorted(step_comments_set) (alongside the running total_cost, changed_files, last_saved_at) and calls save_workflow_state(...) inline — BEFORE the next code path that may return False, ... without touching state. Both new save calls live inside the same try/except block that guards the post, so a transient persistence failure logs-and-continues rather than aborting the workflow.

That closes both of the failure modes you described:

  • run_ci_validation_loop returning ci_success=False after the Step 11 cleanup post: the Step 11 key (current_cycle * 10000 + 11) is now on disk before the CI-failure return.
  • _run_final_checkup_on_pr returning checkup_success=False after the Step 10 CI post: the Step 10 key (current_cycle * 10000 + 10) is on disk before the final-checkup-failure return.

Regression tests added (live in tests/test_agentic_e2e_fix_orchestrator.py::TestTrustedStepCommentPosting):

  • test_step11_post_persists_step_comments_before_ci_failure_return — mocks run_ci_validation_loop to return ci_success=False, asserts success=False propagates AND that save_workflow_state was called at least once with state["step_comments"] containing key 10011.
  • test_step10_post_persists_step_comments_before_checkup_failure_return — mocks the final checkup to return checkup_success=False, asserts success=False propagates AND that save_workflow_state was called at least once with state["step_comments"] containing key 10010.

Both tests use a post_step_comment_once side_effect that mirrors the real helper's posted_steps.add(step_num) behavior — a bare return_value=True mock would have left the in-memory set empty and silently passed.

Prompt updated too (agentic_e2e_fix_orchestrator_python.prompt) so pdd sync-regenerated code keeps the inline save: a "REQUIRED post-loop persistence" clause names the failure paths, the state fields to refresh, and the requirement that the save live inside the same try/except as the post.

Local test results from PR head edb818235:

PDD_PATH=$PWD pytest tests/test_agentic_common.py \
    tests/test_agentic_change_orchestrator.py \
    tests/test_agentic_test_orchestrator.py \
    tests/test_agentic_checkup_orchestrator.py \
    tests/test_agentic_e2e_fix_orchestrator.py -q

951 passed
PDD_PATH=$PWD pytest \
    tests/test_agentic_common.py::TestPostStepCommentOnce \
    tests/test_agentic_change_orchestrator.py::TestTrustedStepCommentPosting \
    tests/test_agentic_test_orchestrator.py::TestTrustedStepCommentPosting \
    tests/test_agentic_checkup_orchestrator.py::TestTrustedStepCommentPosting \
    tests/test_agentic_e2e_fix_orchestrator.py::TestTrustedStepCommentPosting -q

25 passed  (was 23 — +2 new persistence regression tests)

Catch-up merge after main advanced 6 commits since the previous merge
(``a93cecf519``). No content conflicts expected — this preserves the
linear feature history while keeping the PR fast-forwardable.
Copy link
Copy Markdown
Contributor

@gltanaka gltanaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prior round-3 fresh-run failure cases are now covered, and the updated trusted-step-comment subset passes locally (25 passed). CI is also green. I still would not merge yet because the new persistence code has a resume-path hole.

Remaining required change:

pdd/agentic_e2e_fix_orchestrator.py now persists post-loop Step 10/11 keys by mutating state_data["step_comments"] after the trusted post. That works only when state_data was created by the inner for step_num in range(1, 10) loop. On resumed success paths, the inner loop can be skipped entirely:

  • cached Step 9 SUCCESS_FALL_THROUGH sets success = True before the outer loop (_resume_deferred_action == "SUCCESS_FALL_THROUGH"), then while current_cycle <= max_cycles and not success does not run;
  • resume_terminal_success also sets success = True before the loop.

In those cases state_data has not been initialized in this invocation (it is first assigned inside the inner loop). The post-loop Step 11 / Step 10 comments can still be posted, but the subsequent state_data["step_comments"] = ... raises UnboundLocalError inside the broad try/except, gets logged as a post_step_comment_once failed warning, and the key is not persisted. A later retry can repost the same visible comment, which reopens the idempotency problem for exactly the resume/retry paths this state is meant to protect.

Please build or refresh a post-loop state object independently of the inner-loop local before these Step 10/11 saves, and add regression coverage for a resumed cached-success path (for example: loaded state at last_completed_step=9 with Step 9 success, independent re-verification succeeds, then CI or final checkup fails) asserting keys 10011 / 10010 are persisted.

…tence saves

Round-4 review surfaced a resume-path hole in the round-3
persistence fix. The inner ``for step_num in range(1, 10)`` loop is
where the orchestrator's ``state_data`` dict was first assigned, but
the SUCCESS_FALL_THROUGH and resume-terminal-success resume paths
both set ``success = True`` BEFORE the outer ``while`` loop runs.
On those paths the inner loop is skipped entirely, ``state_data`` is
never assigned in this invocation, and the round-3 mutation lines
(``state_data["step_comments"] = ...``) would raise
``UnboundLocalError`` inside the broad try/except around the
trusted post. The exception got swallowed as a
``post_step_comment_once failed`` warning and the composite key
``current_cycle * 10000 + 10`` (or ``+ 11``) was never written to
disk — a later retry rehydrated ``step_comments_set`` empty and
re-posted the same visible step comment.

Fix:
* Introduce ``_build_post_loop_state(...)`` at module scope. It
  takes only function-scope locals that are unconditionally
  initialized at orchestrator entry (``step_outputs``,
  ``step_comments_set``, ``changed_files``, ``total_cost``,
  ``model_used``, ``dev_unit_states``, ``skipped_steps``,
  ``github_comment_id``, ``initial_file_hashes``, ``initial_sha``)
  and returns the same shape as the inner-loop save.
* Replace both post-loop ``state_data["step_comments"] = ...``
  mutations with a call to ``_build_post_loop_state`` whose result
  is handed directly to ``save_workflow_state``. The save works the
  same on the linear-execution path AND on the
  SUCCESS_FALL_THROUGH / resume-terminal-success resume paths.

Tests added (regression coverage for the resume hole):
* ``test_step11_persists_on_success_fall_through_resume_path`` —
  ``load_workflow_state`` returns a state with
  ``last_completed_step=9`` and a cached ALL_TESTS_PASS Step 9
  output, forcing the SUCCESS_FALL_THROUGH path. Mocks
  ``run_ci_validation_loop`` to ``ci_success=False``, asserts the
  Step 11 composite key ``10011`` is in at least one
  ``save_workflow_state`` call's ``state["step_comments"]``.
* ``test_step10_persists_on_success_fall_through_resume_path`` —
  same resume setup, mocks the final checkup to fail. Asserts key
  ``10010`` is persisted before the failure return.

Both tests would fail on pre-fix code because the round-3 code
mutated the never-assigned inner-loop ``state_data`` local.

Prompt updated (``agentic_e2e_fix_orchestrator_python.prompt``) so
``pdd sync``-regenerated code stays resume-safe: a new "Resume-safe
state construction (REQUIRED)" clause names the failure mode,
explicitly forbids mutating the inner-loop ``state_data`` local,
and suggests the ``_build_post_loop_state`` helper shape.

953 tests pass across the affected files
(test_agentic_common.py, test_agentic_change_orchestrator.py,
test_agentic_test_orchestrator.py, test_agentic_checkup_orchestrator.py,
test_agentic_e2e_fix_orchestrator.py).
@Serhan-Asad
Copy link
Copy Markdown
Collaborator Author

Addressed in 6db04d5fe. You're right — the round-3 fix mutated the inner-loop state_data local, which is undefined on the SUCCESS_FALL_THROUGH / resume-terminal-success paths because the inner for step_num in range(1, 10) loop never runs.

The fix:

Introduced a module-level helper _build_post_loop_state(...) that constructs the workflow-state dict from function-scope locals only — step_outputs, step_comments_set, changed_files, total_cost, model_used, dev_unit_states, skipped_steps, github_comment_id, initial_file_hashes, initial_sha — all of which ARE unconditionally initialized at orchestrator entry. Both post-loop Step 10 / Step 11 saves now go through this helper and hand the result directly to save_workflow_state(...), so the save shape is identical across the linear-execution path, the SUCCESS_FALL_THROUGH resume path, and the resume-terminal-success path. The broad try/except still wraps the post, but it no longer needs to silently absorb an UnboundLocalError.

Regression tests added (live in tests/test_agentic_e2e_fix_orchestrator.py::TestTrustedStepCommentPosting):

  • test_step11_persists_on_success_fall_through_resume_pathload_workflow_state returns state with last_completed_step=9 and cached ALL_TESTS_PASS Step 9 output (the canonical SUCCESS_FALL_THROUGH trigger). Independent re-verification passes via _verify_tests_independently mock. Then run_ci_validation_loop returns ci_success=False. Asserts success=False propagates AND that at least one save_workflow_state call carries key 10011 in its state["step_comments"]. Would fail on pre-fix code because the round-3 mutation lines hit UnboundLocalError, the broad except swallowed it, and the key never made it to disk.
  • test_step10_persists_on_success_fall_through_resume_path — same resume setup, mocks the final checkup to return checkup_success=False. Asserts key 10010 is persisted before the failure return.

Both tests use the same posted_steps.add(step_num) side_effect as the round-3 tests so the in-memory set state matches the real helper.

Prompt updated too (agentic_e2e_fix_orchestrator_python.prompt) so pdd sync regen stays resume-safe: a new "Resume-safe state construction (REQUIRED)" clause names the failure mode, explicitly forbids mutating the inner-loop state_data local from post-loop sites, and suggests the _build_post_loop_state(...) helper shape (matching the signature in this PR).

Local test results from PR head 6db04d5fe:

PDD_PATH=$PWD pytest tests/test_agentic_common.py \
    tests/test_agentic_change_orchestrator.py \
    tests/test_agentic_test_orchestrator.py \
    tests/test_agentic_checkup_orchestrator.py \
    tests/test_agentic_e2e_fix_orchestrator.py -q

953 passed
PDD_PATH=$PWD pytest \
    tests/test_agentic_common.py::TestPostStepCommentOnce \
    tests/test_agentic_change_orchestrator.py::TestTrustedStepCommentPosting \
    tests/test_agentic_test_orchestrator.py::TestTrustedStepCommentPosting \
    tests/test_agentic_checkup_orchestrator.py::TestTrustedStepCommentPosting \
    tests/test_agentic_e2e_fix_orchestrator.py::TestTrustedStepCommentPosting -q

27 passed  (was 25 — +2 new resume-path regression tests)

Copy link
Copy Markdown
Contributor

@gltanaka gltanaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest update addresses the remaining resume-path idempotency blocker.

I verified that the post-loop Step 10/11 saves now use _build_post_loop_state(...) instead of mutating the inner-loop state_data local, so the SUCCESS_FALL_THROUGH resume path can persist keys 10010 / 10011 even when the inner loop is skipped. The prompt contract was updated to preserve that behavior on regeneration, and the new regression tests cover the resume cases.

I reran the trusted-step-comment subset locally on the latest PR head:

PDD_PATH=$PWD pytest tests/test_agentic_common.py::TestPostStepCommentOnce tests/test_agentic_change_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_test_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_checkup_orchestrator.py::TestTrustedStepCommentPosting tests/test_agentic_e2e_fix_orchestrator.py::TestTrustedStepCommentPosting -q

Result: 27 passed.

GitHub checks are green and the PR is mergeable. Approved.

@gltanaka gltanaka merged commit 9cfce9e into promptdriven:main May 21, 2026
5 checks passed
imradawoodani pushed a commit to imradawoodani/pdd that referenced this pull request Jun 1, 2026
1. Gate trusted Step 11/12 posts on success flag (agentic_change_orchestrator).

   Previously the change orchestrator called post_step_comment_once
   inside the review loop unconditionally after run_agentic_task, even
   when s11_success / s12_success was False. A failed task would post a
   "Step N completed; no <step_report> block" fallback comment AND mark
   the composite key (review_iteration * 100 + step_num) as posted. A
   later resume/retry of the same iteration would then dedup against
   that failed key and silently suppress the real successful report.

   Gate both posts behind their success flag. Add two regression tests
   (test_step11_failure_does_not_post_or_mark_iter_key,
   test_step12_failure_does_not_post_or_mark_iter_key) that prove a
   failed Step 11/12 does not post anything and does not burn keys 111
   / 112.

2. Extend trusted step-comment coverage to Steps 10 + 11
   (agentic_e2e_fix_orchestrator) so the PR's "## Step N/11:" claim
   matches reality.

   The original PR only posted from the linear `for step_num in
   range(1, 10)` loop (steps 1-9). Step 10 (CI validation) and Step 11
   (cleanup) ran after the loop and never invoked post_step_comment_once.

   Add a trusted post for Step 10 immediately after run_ci_validation_loop
   returns ci_success=True, keyed by current_cycle * 10000 + 10.

   Add a trusted post for Step 11 immediately after _run_step11_code_cleanup
   returns, keyed by current_cycle * 10000 + 11, with a body that reports
   the actual cleanup file delta (or a "no additional files modified"
   summary when the cleanup pass found nothing to do or reverted itself).

   Both posts use the orchestrator-format "## Step N/11:" header so
   downstream tooling can recognize them as orchestrator-owned.

   Add two regression tests
   (test_step10_ci_path_posts_trusted_comment,
   test_step11_cleanup_path_posts_trusted_comment) that prove the
   composite keys 10010 / 10011 appear in mock_post_once.call_args_list
   with the right header on the happy path.

949 tests pass across the affected files (test_agentic_common.py,
test_agentic_change_orchestrator.py, test_agentic_test_orchestrator.py,
test_agentic_checkup_orchestrator.py, test_agentic_e2e_fix_orchestrator.py).
imradawoodani pushed a commit to imradawoodani/pdd that referenced this pull request Jun 1, 2026
…r-step-comments-upstream

feat: post success-path step comments via trusted credentials for change/test/checkup/e2e_fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Post visible GitHub step comments for all PDD commands in GitHub App runs

2 participants