Managed Gemini agent: AfterAgent completion hook unreliable — jobs stuck in running after Gemini-side API failure or long-thinking state on macOS (CCB v6.0.7) #181

@bookandlover

CCB version: v6.0.7
OS: macOS 26.3.1 (Darwin 25.3.0)
Gemini CLI: model gemini-3.1-pro-preview (as shown in pane footer)
Agent config: reviewer:gemini, started with gemini --yolo --resume latest

Symptom

After ccb ask reviewer "<short prompt>", ccbd records the intermediate events job_accepted, job_started, completion_item, and completion_state_updated within ~20 seconds, but the job stays at status=running and no JSON artifact for that job ever appears in .ccb/agents/reviewer/provider-runtime/gemini/completion/events/. ccb pend <job_id> keeps returning an empty reply.

Additionally, an earlier job on the same session terminalized as status=completed but with reply: "[no response text]" and completion_reason: hook_after_agent — the hook fired on an empty-content turn (auth info banner), not on the actual answer turn.

The only remedy is ccb kill -f, which terminalizes the zombie job as completion_reason=project_shutdown.

Root cause (corrected after pane inspection)

This is not a hook-registration bug. ~/.gemini/settings.json correctly registers:

"hooks": { "AfterAgent": [ { "hooks": [ { "command": "... ccb-provider-finish-hook --provider gemini --completion-dir .../reviewer/.../completion ..." } ] } ] }

The real failure chain, per tmux capture-pane -t %<reviewer>:

  1. ccbd sent the request as: CCB_REQ_ID: job_ea75ae6a0e77 Execute the full request from @.ccb-requests/<job>.md and reply directly.
  2. Gemini hit a transient API failure:
    ✕ [API Error: request to
      https://cloudcode-pa.googleapis.com/v1internal:streamGenerateContent?alt=sse failed,
      reason: Client network socket disconnected before secure TLS connection was established]
    ℹ This request failed. Press F12 for diagnostics, or run /settings and change "Error Verbosity" to full for full details.
    
  3. Gemini then ran an auth/re-login info banner (ℹ Authentication succeeded), which internally counted as a "turn" and triggered AfterAgent against a prior job id (job_bb7755c58f31 — a different earlier request). That job got terminalized as completed with reply: "[no response text]".
  4. Gemini then re-submitted / continued and is now stuck in:
    ⠇ Thinking... (esc to cancel, 6m 22s)
    
    — a very long thinking state. No AfterAgent fires until Gemini either finishes the turn or is interrupted, so no artifact is ever written for job_ea75ae6a0e77.

So:

  • job_bb7755c58f31 — hook fired, but on the wrong turn (auth banner, not the actual answer). Reply was lost.
  • job_ea75ae6a0e77 — hook never fires because the turn is either (a) stuck "Thinking" for minutes, or (b) aborted by API error before AfterAgent is emitted.

The common factor: ccbd's Gemini path treats "AfterAgent hook fires" as the only authoritative completion signal. When network flakes, auth re-flow, or long-thinking stalls decouple the hook from the correct job, the system has no fallback.

Evidence

Registered hook (good)

$ grep -A 3 AfterAgent ~/.gemini/settings.json
"hooks": {
  "AfterAgent": [ { "hooks": [ { "command": "/opt/homebrew/opt/python@3.13/bin/python3.13 /Users/Peng/.local/share/codex-dual/bin/ccb-provider-finish-hook --provider gemini --completion-dir /Users/Peng/LLM/claude_code_bridge/.ccb/agents/reviewer/provider-runtime/gemini/completion ...

completion/events/ content (only one artifact, wrong job)

$ ls .ccb/agents/reviewer/provider-runtime/gemini/completion/events/
job_bb7755c58f31.json            # ← only this file, despite 2+ jobs submitted
$ cat .../events/job_bb7755c58f31.json
{
  "schema_version": 1,
  "record_type": "provider_completion_hook",
  "provider": "gemini",
  "agent_name": "reviewer",
  "req_id": "job_bb7755c58f31",
  "status": "completed",
  "reply": "[no response text]",           ← empty reply
  "session_id": "74e9219e-c6eb-42e0-b521-165c5916d7f6",
  "hook_event_name": "AfterAgent",
  "timestamp": "2026-04-22T13:12:07.271171+00:00"
}

ccb pend for the two jobs

$ ccb pend job_bb7755c58f31
status: completed
reply: [no response text]
completion_reason: hook_after_agent

$ ccb pend job_ea75ae6a0e77
status: running          ← stuck after 30+ min
reply:

Pane transcript (key excerpt)

 > CCB_REQ_ID: job_bb7755c58f31 Execute the full request from @...md and reply directly.
ℹ Authentication succeeded                                   ← the turn that fired AfterAgent for job_bb
...
 > /authCCB_REQ_ID: job_ea75ae6a0e77 Execute the full request from @...md and reply directly.
✕ [API Error: request to cloudcode-pa.googleapis.com ... Client network socket disconnected ...]
ℹ This request failed. Press F12 for diagnostics ...
...
⠇ Thinking... (esc to cancel, 6m 22s)                        ← job_ea is still thinking, no hook yet

Suggested fixes

Short-term (mitigation)

  1. Req-id anchoring: ccb-provider-finish-hook currently extracts req_id via extract_req_id() and latest_req_id_from_transcript(). Make it refuse to emit a completed artifact when the detected req_id does not match the most recent CCB_REQ_ID: in the current turn's user message. This avoids mis-attributing an unrelated turn's hook fire (the root cause of job_bb returning [no response text]).
  2. Empty-reply guard: when the hook detects reply: "[no response text]" AND the associated pane turn has no assistant-visible text block, treat it as incomplete (not completed) so it doesn't prematurely burn a job id.
  3. Wall-clock watchdog in ccbd: for each in-flight Gemini job, after completion_state_updated OR N minutes without any hook-delivered artifact, mark the job incomplete with completion_reason=hook_timeout and optionally fall back to pane-capture. This unblocks users without ccb kill -f.
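
The three mitigations above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not CCB's actual code: the names decide_hook_status, latest_req_id, and watchdog_expired, the job-id regex, and the 10-minute default are all hypothetical. Only the CCB_REQ_ID: marker, the "[no response text]" placeholder, and the incomplete / hook_timeout naming come from this report. The empty-reply guard is also simplified: it looks only at the reply string, whereas the proposal above additionally checks the pane turn for assistant-visible text.

```python
import re
from typing import Optional

# Marker format as seen in the pane transcript: "CCB_REQ_ID: job_<hex>".
# The exact job-id alphabet is an assumption based on the two ids in this report.
REQ_ID_RE = re.compile(r"CCB_REQ_ID:\s*(job_[0-9a-f]+)")
EMPTY_REPLY = "[no response text]"


def latest_req_id(user_message: str) -> Optional[str]:
    """Return the last CCB_REQ_ID found in the current turn's user message."""
    matches = REQ_ID_RE.findall(user_message)
    return matches[-1] if matches else None


def decide_hook_status(
    detected_req_id: Optional[str], user_message: str, reply: str
) -> Optional[str]:
    """Hypothetical hook-side decision.

    Returns None (write no artifact), "incomplete", or "completed".
    """
    anchor = latest_req_id(user_message)
    # Req-id anchoring: refuse to attribute this turn to a different job.
    if anchor is None or detected_req_id != anchor:
        return None
    # Empty-reply guard: a blank or placeholder reply is not a real answer.
    if reply.strip() in ("", EMPTY_REPLY):
        return "incomplete"
    return "completed"


def watchdog_expired(
    started_at: float, now: float, has_artifact: bool, timeout_s: float = 600.0
) -> bool:
    """Hypothetical ccbd-side check (timeout value is an assumption).

    True means the job should be marked incomplete with
    completion_reason=hook_timeout.
    """
    return (not has_artifact) and (now - started_at) >= timeout_s
```

With these guards, the auth-banner turn in this report would have produced no artifact for job_bb7755c58f31 (its detected id does not match the current turn's marker), and job_ea75ae6a0e77 would have been released by the watchdog instead of requiring ccb kill -f.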

Medium-term

  1. Stream-disconnect retry surface: Gemini's "Client network socket disconnected before secure TLS connection was established" is a common transient failure; ccbd's bridge could detect this line in pane output and either (a) auto-retry the request, or (b) terminalize the job as failed rather than leaving it suspended while Gemini's own error flow runs.
  2. Long-thinking visibility: expose Thinking... (Xm Ys) duration in ccb ps/ccb ping <agent> so operators know when a job is alive-but-slow vs. actually stuck.
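
Both medium-term ideas depend on recognizing two kinds of lines in captured pane text. The sketch below shows one possible way to do that. The function names and the "thinking" / "api_error" / "unknown" labels are hypothetical, and the status-line pattern is inferred from the single "Thinking... (esc to cancel, 6m 22s)" sample in this report. Treating the minutes part as optional is an assumption about how Gemini CLI renders durations under one minute.

```python
import re
from typing import Optional

# Substring of the transient error quoted in this report.
TLS_ERROR = "Client network socket disconnected"
# Matches e.g. "Thinking... (esc to cancel, 6m 22s)"; minutes assumed optional.
THINKING_RE = re.compile(r"Thinking\.\.\. \(esc to cancel, (?:(\d+)m )?(\d+)s\)")


def thinking_seconds(pane_text: str) -> Optional[int]:
    """Return the duration shown on the last 'Thinking...' line, in seconds."""
    matches = THINKING_RE.findall(pane_text)
    if not matches:
        return None
    minutes, seconds = matches[-1]
    return int(minutes or 0) * 60 + int(seconds)


def classify_pane(pane_text: str) -> str:
    """Classify the pane by whichever signal appears last in the text."""
    err_pos = pane_text.rfind(TLS_ERROR)
    thinking = list(THINKING_RE.finditer(pane_text))
    think_pos = thinking[-1].start() if thinking else -1
    if think_pos > err_pos:
        return "thinking"   # alive but slow; duration via thinking_seconds()
    if err_pos >= 0:
        return "api_error"  # candidate for auto-retry or a failed status
    return "unknown"
```

For the pane excerpt in this report, where the Thinking line follows the API error, classify_pane would return "thinking" and thinking_seconds would report 382 seconds, which is the kind of value ccb ps could surface.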

What's not the cause (ruled out)

  • Hook registration: confirmed present in ~/.gemini/settings.json.
  • Completion directory missing: Gemini's runtime does create .ccb/agents/reviewer/provider-runtime/gemini/completion/events/ at startup, unlike the sibling Codex bug (issue #180, "Managed Codex agent never writes completion artifact on macOS; ccb ask jobs stuck in running until kill -f"). So this is a separate failure mode.
  • Prompt length / tool-intent: both job_bb and job_ea are prompts of the same shape ("Execute the full request from @<file>.md"). The earlier hypothesis that short prompts take a "simplified path" was wrong.

Workaround in use

Fallback "pane-capture read-through" (same as issue #180): when events/<job_id>.json is missing after ~2 heartbeat rounds, read tmux -S .ccb/ccbd/tmux.sock capture-pane -p -t %<reviewer_pane> and treat the trailing assistant output as the ground-truth reply. Clear the zombie with ccb kill -f.
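
A minimal sketch of this read-through, assuming Python and the tmux socket path quoted above. The function names are hypothetical, and "everything after the last line carrying the job's CCB_REQ_ID marker" is only a rough stand-in for "the trailing assistant output". A real implementation would also need to strip spinner lines, error banners, and other UI chrome.

```python
import subprocess
from pathlib import Path
from typing import Optional


def capture_pane(socket_path: str, pane_id: str) -> str:
    """Run `tmux -S <socket> capture-pane -p -t <pane>` and return its stdout."""
    result = subprocess.run(
        ["tmux", "-S", socket_path, "capture-pane", "-p", "-t", pane_id],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def trailing_reply(pane_text: str, job_id: str) -> Optional[str]:
    """Return the text after the last line that carries this job's marker."""
    lines = pane_text.splitlines()
    last_idx = None
    for i, line in enumerate(lines):
        if f"CCB_REQ_ID: {job_id}" in line:
            last_idx = i
    if last_idx is None:
        return None
    tail = "\n".join(lines[last_idx + 1:]).strip()
    return tail or None


def read_through(
    events_dir: Path, job_id: str, socket_path: str, pane_id: str
) -> Optional[str]:
    """Fall back to the pane only when events/<job_id>.json is missing."""
    if (events_dir / f"{job_id}.json").exists():
        return None  # the hook artifact exists; prefer it over pane text
    return trailing_reply(capture_pane(socket_path, pane_id), job_id)
```

The caller would invoke read_through only after the ~2 heartbeat rounds mentioned above have passed, passing .ccb/ccbd/tmux.sock as socket_path and the reviewer's pane id as pane_id.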
