You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Managed Gemini agent: AfterAgent completion hook unreliable — jobs stuck in running after Gemini-side API failure or long-thinking state on macOS (CCB v6.0.7) #181
CCB version: v6.0.7 OS: macOS 26.3.1 (Darwin 25.3.0) Gemini CLI: model gemini-3.1-pro-preview (as shown in pane footer) Agent config:reviewer:gemini, started with gemini --yolo --resume latest
Symptom
After ccb ask reviewer "<short prompt>", ccbd records intermediate events (job_accepted → job_started → completion_item → completion_state_updated) within ~20 seconds, but the job stays at status=running and no JSON artifact ever appears in .ccb/agents/reviewer/provider-runtime/gemini/completion/events/. ccb pend <job_id> keeps returning an empty reply.
Additionally, an earlier job on the same session terminalized as status=completed but with reply: "[no response text]" and completion_reason: hook_after_agent — the hook fired on an empty-content turn (auth info banner), not on the actual answer turn.
Only remedy is ccb kill -f, which terminalizes the zombie as completion_reason=project_shutdown.
Root cause (corrected after pane inspection)
This is not a hook-registration bug. ~/.gemini/settings.json correctly registers:
The real failure chain, per tmux capture-pane -t %<reviewer>:
ccbd sent the request as: CCB_REQ_ID: job_ea75ae6a0e77 Execute the full request from @.ccb-requests/<job>.md and reply directly.
Gemini hit a transient API failure:
✕ [API Error: request to
https://cloudcode-pa.googleapis.com/v1internal:streamGenerateContent?alt=sse failed,
reason: Client network socket disconnected before secure TLS connection was established]
ℹ This request failed. Press F12 for diagnostics, or run /settings and change "Error Verbosity" to full for full details.
Gemini then ran an auth/re-login info banner (ℹ Authentication succeeded), which internally counted as a "turn" and triggered AfterAgent against a prior job id (job_bb7755c58f31 — a different earlier request). That job got terminalized as completed with reply: "[no response text]".
Gemini then re-submitted / continued and is now stuck in:
⠇ Thinking... (esc to cancel, 6m 22s)
— a very long thinking state. No AfterAgent fires until Gemini either finishes the turn or is interrupted, so no artifact is ever written for job_ea75ae6a0e77.
So:
job_bb7755c58f31 — hook fired, but on the wrong turn (auth banner, not the actual answer). Reply was lost.
job_ea75ae6a0e77 — hook never fires because the turn is either (a) stuck "Thinking" for minutes, or (b) aborted by API error before AfterAgent is emitted.
The common factor: ccbd's Gemini path treats "AfterAgent hook fires" as the only authoritative completion signal. When network flakes, auth re-flow, or long-thinking stalls decouple the hook from the correct job, the system has no fallback.
> CCB_REQ_ID: job_bb7755c58f31 Execute the full request from @...md and reply directly.
ℹ Authentication succeeded ← the turn that fired AfterAgent for job_bb
...
> /authCCB_REQ_ID: job_ea75ae6a0e77 Execute the full request from @...md and reply directly.
✕ [API Error: request to cloudcode-pa.googleapis.com ... Client network socket disconnected ...]
ℹ This request failed. Press F12 for diagnostics ...
...
⠇ Thinking... (esc to cancel, 6m 22s) ← job_ea is still thinking, no hook yet
Suggested fixes
Short-term (mitigation)
Req-id anchoring: ccb-provider-finish-hook currently extracts req_id via extract_req_id() and latest_req_id_from_transcript(). Make it refuse to emit a completed artifact when the detected req_id does not match the most recent CCB_REQ_ID: in the current turn's user message. This avoids mis-attributing an unrelated turn's hook fire (the root cause of job_bb returning [no response text]).
Empty-reply guard: when the hook detects reply: "[no response text]" AND the associated pane turn has no assistant-visible text block, treat it as incomplete (not completed) so it doesn't prematurely burn a job id.
Wall-clock watchdog in ccbd: for each in-flight Gemini job, after completion_state_updated OR N minutes without any hook-delivered artifact, mark the job incomplete with completion_reason=hook_timeout and optionally fall back to pane-capture. This unblocks users without ccb kill -f.
Medium-term
Stream-disconnect retry surface: Gemini's "Client network socket disconnected before secure TLS connection was established" is a common transient failure; ccbd's bridge could detect this line in pane output and either (a) auto-retry the request, or (b) terminalize the job as failed rather than leaving it suspended while Gemini's own error flow runs.
Long-thinking visibility: expose Thinking... (Xm Ys) duration in ccb ps/ccb ping <agent> so operators know when a job is alive-but-slow vs. actually stuck.
What's not the cause (ruled out)
Hook registration: confirmed present in ~/.gemini/settings.json.
Prompt length / tool-intent: both job_bb and job_ea are identical-shape Execute the full request from @<file>.md style prompts. Earlier hypothesis that short prompts take a "simplified path" was wrong.
Workaround in use
Fallback "pane-capture read-through" (same as issue #180): when events/<job_id>.json is missing after ~2 heartbeat rounds, read tmux -S .ccb/ccbd/tmux.sock capture-pane -p -t %<reviewer_pane> and treat the trailing assistant output as the ground-truth reply. Clear the zombie with ccb kill -f.
Shared theme: both managed-agent paths rely on fragile provider-lifecycle hooks for authoritative completion. ccbd needs a timeout/fallback layer that does not depend solely on the provider emitting a correct hook at the right time.
CCB version: v6.0.7
OS: macOS 26.3.1 (Darwin 25.3.0)
Gemini CLI: model
gemini-3.1-pro-preview(as shown in pane footer)Agent config:
reviewer:gemini, started withgemini --yolo --resume latestSymptom
After
ccb ask reviewer "<short prompt>", ccbd records intermediate events (job_accepted→job_started→completion_item→completion_state_updated) within ~20 seconds, but the job stays atstatus=runningand no JSON artifact ever appears in.ccb/agents/reviewer/provider-runtime/gemini/completion/events/.ccb pend <job_id>keeps returning an emptyreply.Additionally, an earlier job on the same session terminalized as
status=completedbut withreply: "[no response text]"andcompletion_reason: hook_after_agent— the hook fired on an empty-content turn (auth info banner), not on the actual answer turn.Only remedy is
ccb kill -f, which terminalizes the zombie ascompletion_reason=project_shutdown.Root cause (corrected after pane inspection)
This is not a hook-registration bug.
~/.gemini/settings.jsoncorrectly registers:The real failure chain, per
tmux capture-pane -t %<reviewer>:CCB_REQ_ID: job_ea75ae6a0e77 Execute the full request from @.ccb-requests/<job>.md and reply directly.ℹ Authentication succeeded), which internally counted as a "turn" and triggeredAfterAgentagainst a prior job id (job_bb7755c58f31— a different earlier request). That job got terminalized ascompletedwithreply: "[no response text]".AfterAgentfires until Gemini either finishes the turn or is interrupted, so no artifact is ever written forjob_ea75ae6a0e77.So:
job_bb7755c58f31— hook fired, but on the wrong turn (auth banner, not the actual answer). Reply was lost.job_ea75ae6a0e77— hook never fires because the turn is either (a) stuck "Thinking" for minutes, or (b) aborted by API error beforeAfterAgentis emitted.The common factor:
ccbd's Gemini path treats "AfterAgenthook fires" as the only authoritative completion signal. When network flakes, auth re-flow, or long-thinking stalls decouple the hook from the correct job, the system has no fallback.Evidence
Registered hook (good)
completion/events/ content (only one artifact, wrong job)
ccb pendfor the two jobsPane transcript (key excerpt)
Suggested fixes
Short-term (mitigation)
ccb-provider-finish-hookcurrently extractsreq_idviaextract_req_id()andlatest_req_id_from_transcript(). Make it refuse to emit acompletedartifact when the detected req_id does not match the most recentCCB_REQ_ID:in the current turn's user message. This avoids mis-attributing an unrelated turn's hook fire (the root cause ofjob_bbreturning[no response text]).reply: "[no response text]"AND the associated pane turn has no assistant-visible text block, treat it asincomplete(notcompleted) so it doesn't prematurely burn a job id.completion_state_updatedOR N minutes without any hook-delivered artifact, mark the jobincompletewithcompletion_reason=hook_timeoutand optionally fall back to pane-capture. This unblocks users withoutccb kill -f.Medium-term
failedrather than leaving it suspended while Gemini's own error flow runs.Thinking... (Xm Ys)duration inccb ps/ccb ping <agent>so operators know when a job is alive-but-slow vs. actually stuck.What's not the cause (ruled out)
~/.gemini/settings.json..ccb/agents/reviewer/provider-runtime/gemini/completion/events/at startup (unlike the sibling Codex bug, issue Managed Codex agent never writes completion artifact on macOS; ccb ask jobs stuck in running until kill -f #180). So this is a separate failure mode.job_bbandjob_eaare identical-shapeExecute the full request from @<file>.mdstyle prompts. Earlier hypothesis that short prompts take a "simplified path" was wrong.Workaround in use
Fallback "pane-capture read-through" (same as issue #180): when
events/<job_id>.jsonis missing after ~2 heartbeat rounds, readtmux -S .ccb/ccbd/tmux.sock capture-pane -p -t %<reviewer_pane>and treat the trailing assistant output as the ground-truth reply. Clear the zombie withccb kill -f.Related
completion/directory at all (different root cause, same visible symptom).