feat(codex-bridge): Init Gate prevents first-prompt loss on cold start#194

Open
SevenX77 wants to merge 3 commits into bfly123:main from SevenX77:feat/q3-stage1-init-gate

@SevenX77

Summary

Adds a provider-agnostic Init Gate state machine to the Codex bridge so the first prompt sent to a freshly-spawned Codex agent is never delivered while the TUI is still rendering its welcome banner / authenticating / loading. Today, the bridge enters its read FIFO → paste-into-pane → Enter loop the moment the Python process spawns; if a ccb ask arrives before the Codex TUI has shown its idle prompt, the keystrokes are eaten by the splash/auth modal and the message is silently lost.

Why

We hit this reliably on cold start of a1:codex agents — the first prompt has a non-trivial loss rate, and the loss is invisible because:

  • bridge's process_request swallows any CcbDeliveryError into a history entry
  • the existing _paste_via_buffer_reception_driven retry uses pane_shows_agent_activity() to detect 'delivered', but Codex's own startup spinner (Loading..., Braille frames) false-positives that check, so the retry returns successfully even though the message never reached the prompt buffer

The send-side retry was designed to handle Enter-swallow on a pane that's already idle, not to handle 'pane has never been idle yet'. We need an explicit pre-send ready gate, which is what this PR adds.

What's in this PR

| File | Purpose |
| --- | --- |
| lib/provider_core/init_gate.py (new) | Provider-agnostic InitGate class: state machine (LAUNCHED → INITIALIZING → READY \| INIT_FAIL), segmented polling (200ms→500ms@5s), steady-state debounce (2× consecutive True), deadline (default 30s), bypass flag, env var loader, failure diagnostics JSON with ring buffer of last 3 captures |
| lib/provider_backends/codex/bridge_runtime/init_probe.py (new) | CodexInitProbe implementing InitGateProbe: visible-screen capture only (no scrollback) + banner blacklist (OpenAI Codex, Sign in with ChatGPT, Trust this workspace, etc.) + 'last non-empty line starts with `› `' check (tolerates idle hint) |
| lib/provider_backends/codex/bridge_runtime/service.py (modified) | DualBridge.__init__ opens FIFO holder fd (O_RDONLY \| O_NONBLOCK) so upstream writers don't block during Init Gate; run() calls init_gate.wait_until_ready() before main loop, exits with code 3 on INIT_FAIL |
| lib/provider_backends/codex/bridge_runtime/runtime_state.py (modified) | BridgeRuntimeState now mutable, holds init_gate + fifo_holder_fd; build_bridge_runtime_state constructs probe + gate with env-driven config |
| test/test_init_gate.py (new) | 11 unit tests: happy path, deadline, steady debounce, segmented polling, failure JSON, ring buffer, env var loading |
| test/test_codex_init_probe.py (new) | 7 unit tests: banner detection (positive + negative), prompt-on-last-line, all 5 banner variants, capture flag verification, exception safety |
| test/test_codex_communicator_init.py (new) | 3 unit tests: FIFO holder open with correct flags, INIT_FAIL exit code 3 path, success path enters main loop |

Total: 21 unit tests, all green (python -m pytest test/test_init_gate.py test/test_codex_init_probe.py test/test_codex_communicator_init.py).
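
For reviewers who want the gate's control flow at a glance, here is a minimal sketch of the state machine described above: segmented polling, steady-state debounce, deadline, and bypass. It is not the code in init_gate.py. The function names, signatures, and the injectable `clock`/`sleep` parameters are assumptions made for illustration; only the state names and default values come from this PR.

```python
import time
from enum import Enum
from typing import Callable


class InitGateState(Enum):
    LAUNCHED = "launched"
    INITIALIZING = "initializing"
    READY = "ready"
    INIT_FAIL = "init_fail"


def poll_interval_s(elapsed_s: float, fast_ms: int = 200,
                    slow_ms: int = 500, switch_s: float = 5.0) -> float:
    """Segmented polling: fast period first, slow period once switch_s has elapsed."""
    return (fast_ms if elapsed_s < switch_s else slow_ms) / 1000.0


def wait_until_ready(detect: Callable[[], bool], *,
                     deadline_s: float = 30.0,
                     steady_count: int = 2,
                     bypass: bool = False,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> InitGateState:
    """Poll detect() until it is True steady_count times in a row, or the deadline passes."""
    if bypass:
        # Emergency hatch: skip the gate entirely.
        return InitGateState.READY
    start = clock()
    streak = 0
    while True:
        elapsed = clock() - start
        if elapsed >= deadline_s:
            return InitGateState.INIT_FAIL
        try:
            ok = bool(detect())
        except Exception:
            ok = False  # conservative: a failing probe counts as "not ready"
        # Debounce: a single False resets the streak.
        streak = streak + 1 if ok else 0
        if streak >= steady_count:
            return InitGateState.READY
        sleep(poll_interval_s(elapsed))
```

Passing `clock` and `sleep` as parameters is only a convenience for testing the timing logic without real delays; the actual class may structure this differently.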

Design contract

  • No first prompt without ready signal: bridge MUST NOT enter its read loop until probe.detect() returns True for steady_count (default 2) consecutive polls.
  • Time-bounded: 30s deadline (configurable via CCB_CODEX_INIT_DEADLINE_S or generic CCB_INIT_GATE_DEADLINE_S); on timeout, write init_gate_failure.json with last 3 captures + probe history, exit 3.
  • Visible-only capture: capture-pane -p -t <pane> (no -S flag) — historical scrollback would otherwise pin S1 (banner-gone) to false forever.
  • FIFO holder pattern: bridge opens the FIFO with O_RDONLY | O_NONBLOCK once at `init` and keeps the fd open through Init Gate, so upstream writers' open(O_WRONLY) doesn't block during the 0–30s gate window. Payloads queue in the kernel pipe buffer (64KB by default on Linux) and the main loop drains them after Init Gate passes.
  • Bypass for emergencies: CCB_INIT_GATE_BYPASS=1 skips the gate entirely (logs WARN); useful for debugging or if Codex CLI banner format changes break the probe before the constant can be patched.
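
The FIFO holder bullet above can be checked in isolation with a small POSIX-only sketch. It assumes nothing about DualBridge itself: `open_fifo_holder` and `demo` are hypothetical names, and the writer uses O_NONBLOCK purely so the demo would fail with ENXIO instead of hanging if no reader were present.

```python
import os
import tempfile


def open_fifo_holder(fifo_path: str) -> int:
    """Open the read end of a FIFO without blocking, even if no writer exists yet."""
    return os.open(fifo_path, os.O_RDONLY | os.O_NONBLOCK)


def demo() -> bytes:
    with tempfile.TemporaryDirectory() as tmp:
        fifo_path = os.path.join(tmp, "input.fifo")
        os.mkfifo(fifo_path)

        # Bridge side: grab the holder fd before the gate starts waiting.
        holder_fd = open_fifo_holder(fifo_path)
        try:
            # Writer side: because a reader already holds the FIFO open,
            # opening the write end returns immediately.
            writer_fd = os.open(fifo_path, os.O_WRONLY | os.O_NONBLOCK)
            try:
                os.write(writer_fd, b"first prompt\n")
            finally:
                os.close(writer_fd)

            # After the gate passes, the queued payload is still in the
            # kernel pipe buffer and can be drained from the holder fd.
            return os.read(holder_fd, 4096)
        finally:
            os.close(holder_fd)


if __name__ == "__main__":
    print(demo())
```

The payload survives the writer closing its end because the pipe buffer lives as long as the holder fd keeps the read end open.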

Env vars (all optional, sensible defaults)

| Env | Default | Effect |
| --- | --- | --- |
| CCB_INIT_GATE_DEADLINE_S | 30 | Total timeout |
| CCB_CODEX_INIT_DEADLINE_S | (inherits) | Codex-specific override |
| CCB_INIT_GATE_POLL_FAST_MS | 200 | First-segment polling period |
| CCB_INIT_GATE_POLL_SLOW_MS | 500 | Second-segment polling period |
| CCB_INIT_GATE_POLL_SWITCH_S | 5 | When to switch fast→slow |
| CCB_INIT_GATE_STEADY_COUNT | 2 | Consecutive True threshold |
| CCB_INIT_GATE_BYPASS | 0 | Disable gate entirely |
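
To make the precedence in the table concrete (provider-specific deadline overrides the generic one, which overrides the default), here is a rough sketch of an env loader. The variable names and defaults are taken from the table. The `InitGateConfig` dataclass, the `load_init_gate_env` signature, the `CCB_<PROVIDER>_INIT_DEADLINE_S` pattern generalized from the Codex variable, and the fall-back-to-default behaviour on malformed values are all assumptions, not a description of the real loader.

```python
import os
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class InitGateConfig:
    deadline_s: float = 30.0
    poll_fast_ms: int = 200
    poll_slow_ms: int = 500
    poll_switch_s: float = 5.0
    steady_count: int = 2
    bypass: bool = False


def _get(env: Mapping[str, str], name: str, cast: Callable[[str], T], default: T) -> T:
    """Read one variable; unset, blank, or unparsable values keep the default (assumed policy)."""
    raw = env.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        return cast(raw.strip())
    except ValueError:
        return default


def load_init_gate_env(provider: Optional[str] = None,
                       env: Optional[Mapping[str, str]] = None) -> InitGateConfig:
    env = os.environ if env is None else env
    d = InitGateConfig()
    # Generic deadline first, then the provider-specific override if present.
    deadline = _get(env, "CCB_INIT_GATE_DEADLINE_S", float, d.deadline_s)
    if provider:
        deadline = _get(env, f"CCB_{provider.upper()}_INIT_DEADLINE_S", float, deadline)
    return InitGateConfig(
        deadline_s=deadline,
        poll_fast_ms=_get(env, "CCB_INIT_GATE_POLL_FAST_MS", int, d.poll_fast_ms),
        poll_slow_ms=_get(env, "CCB_INIT_GATE_POLL_SLOW_MS", int, d.poll_slow_ms),
        poll_switch_s=_get(env, "CCB_INIT_GATE_POLL_SWITCH_S", float, d.poll_switch_s),
        steady_count=_get(env, "CCB_INIT_GATE_STEADY_COUNT", int, d.steady_count),
        bypass=env.get("CCB_INIT_GATE_BYPASS", "0").strip() == "1",
    )
```

For example, `load_init_gate_env("codex", {"CCB_INIT_GATE_DEADLINE_S": "45", "CCB_CODEX_INIT_DEADLINE_S": "60"})` yields a 60-second deadline under these assumptions.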

Review history (private)

  • Plan reviewed by Gemini (Rubric A): overall 9.0/10, all dims ≥ 8.5
  • Code reviewed per-commit by Gemini (Rubric B):
    • [Q3-S1a.1] InitGate skeleton: 9.2/10
    • [Q3-S1a.2] CodexInitProbe: 8.8/10
    • [Q3-S1a.3] DualBridge wiring: 9.0/10
  • Zero must_fix items across all reviews.

Design doc + reviews available locally; happy to share if useful.

Test plan

  • Unit tests: 21 / 21 passing locally on Python 3.12.9
  • Cold-start integration test (manual N=10): planned as a follow-up — will measure first-prompt success rate with gate enabled vs CCB_INIT_GATE_BYPASS=1 to confirm bug repro
  • Gemini provider Init Gate (separate PR — Gemini uses BeforeAgent hook, different probe shape)
  • Claude Code provider Init Gate (separate PR)

Risks

  • Codex banner format drift: if OpenAI changes the welcome screen text, the banner blacklist may go stale. Mitigation: CODEX_INIT_BANNERS lives in one constant (4 lines); CCB_INIT_GATE_BYPASS=1 is the emergency hatch; the S2 prompt-on-last-line check is the stronger signal and works regardless of banner text.
  • 30s deadline on slow networks: if Codex's WebSocket auth handshake takes longer, INIT_FAIL fires. Mitigation: CCB_CODEX_INIT_DEADLINE_S=60 env override.

🤖 Generated with Claude Code

SevenX77 and others added 3 commits April 24, 2026 23:00
…dy detection

Implements the generic Init Gate skeleton per Q3 Stage 1a design:
- InitGateState enum: LAUNCHED, INITIALIZING, READY, INIT_FAIL
- InitGateProbe protocol for provider-specific detection
- InitGate class with segmented polling, steady-state debounce,
  deadline-based failure detection, and diagnostic capture
- load_init_gate_env() for env var configuration (generic + per-provider)

Features:
- Fast→slow segmented polling (poll_fast_ms→poll_slow_ms@poll_switch_s)
- Steady-state debounce (steady_count consecutive True before READY)
- Bypass mode for emergency override
- Comprehensive failure diagnostics JSON with recent pane captures ring

All 11 unit tests pass:
- Happy path, deadline exceeded, steady-state debounce
- Bypass, segmented polling, failure JSON structure
- Ring buffer, env var loading (generic + per-provider)
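
As an illustration of the "recent pane captures ring" mentioned above, a bounded ring buffer plus JSON dump could look like the sketch below. The `FailureRecorder` class and every JSON field name are hypothetical; the actual schema of init_gate_failure.json is not shown in this PR description.

```python
import json
from collections import deque
from typing import Deque, List


class FailureRecorder:
    """Keeps the full probe history but only the last few pane captures."""

    def __init__(self, max_captures: int = 3) -> None:
        # deque(maxlen=N) silently drops the oldest entry once full.
        self.captures: Deque[str] = deque(maxlen=max_captures)
        self.probe_history: List[bool] = []

    def record(self, detected: bool, capture_text: str) -> None:
        self.probe_history.append(bool(detected))
        self.captures.append(capture_text)

    def to_json(self, deadline_s: float) -> str:
        # Field names here are illustrative only.
        return json.dumps(
            {
                "state": "INIT_FAIL",
                "deadline_s": deadline_s,
                "probe_history": self.probe_history,
                "recent_captures": list(self.captures),
            },
            ensure_ascii=False,
            indent=2,
        )
```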

Provider-specific init probe for Codex CLI satisfying InitGateProbe protocol:
- S1: welcome banner strings absent in visible-screen capture
- S2: last non-empty line starts with '› ' (idle input prompt)
- Uses capture-pane -p -t <pane> (visible only, no -S scrollback)
- Conservative failure mode: any tmux/parse error returns False

7 unit tests covering banner detection, prompt-on-last-line logic,
visible-only capture invariant, and exception safety.
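
The S1/S2 checks and the conservative failure mode can be sketched roughly as follows. This is not the contents of init_probe.py: the function names are hypothetical, and `BANNERS` lists only the three strings named in the PR description, whereas the real CODEX_INIT_BANNERS constant contains more.

```python
import subprocess
from typing import Callable

# Only the banner strings quoted in the PR description; the real list is longer.
BANNERS = ("OpenAI Codex", "Sign in with ChatGPT", "Trust this workspace")
PROMPT_PREFIX = "› "


def screen_is_ready(screen_text: str) -> bool:
    """S1: no banner string visible. S2: last non-empty line starts with the prompt."""
    if any(banner in screen_text for banner in BANNERS):
        return False
    lines = [line for line in screen_text.splitlines() if line.strip()]
    return bool(lines) and lines[-1].startswith(PROMPT_PREFIX)


def capture_visible(pane: str) -> str:
    """Capture only the visible screen: no -S flag, so no scrollback."""
    result = subprocess.run(
        ["tmux", "capture-pane", "-p", "-t", pane],
        capture_output=True, text=True, check=True, timeout=5,
    )
    return result.stdout


def detect(pane: str, capture: Callable[[str], str] = capture_visible) -> bool:
    """Conservative failure mode: any tmux or parse error means 'not ready'."""
    try:
        return screen_is_ready(capture(pane))
    except Exception:
        return False
```

Because S2 only requires the line to start with the prompt prefix, a prompt line that also shows an idle hint still counts as ready.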

Module written by Codex (gpt-5.4 xhigh) in Q3-S1a.2 task.
Commit created by Claude (master) due to CCB reply truncation on
Codex side preventing it from completing the commit step itself.

Refs: .kiro/specs/q3-stage1-init-gate/DESIGN.md §4.3.1 (v2)

Co-Authored-By: Codex <noreply@openai.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Integrates Init Gate into Codex bridge lifecycle (DESIGN.md §4.6 + §4.7):

- runtime_state.py: BridgeRuntimeState mutable, init_gate + fifo_holder_fd
  fields; build_bridge_runtime_state constructs CodexInitProbe + InitGate
  with proper env loading; tmux_run_str adapter converts CompletedProcess
  to str for the probe (extracts stdout, decodes bytes, returns "" on
  any failure)
- service.py: DualBridge.__init__ creates TmuxBackend (no-arg ctor; pane
  binding happens via tmux_run_fn closure), opens FIFO holder fd with
  O_RDONLY|O_NONBLOCK before Init Gate so upstream writers don't block
  on open(O_WRONLY); run() waits on init_gate.wait_until_ready() before
  the main loop and returns 3 on INIT_FAIL; new _teardown() (idempotent)
  closes holder fd + stops binding tracker; called from _handle_signal
  and main loop finally
- init_probe.py: capture_visible_for_diagnostics() exposed for InitGate
  capture_fn
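
The tmux_run_str adapter behaviour listed above (extract stdout, decode bytes, return "" on any failure) might look roughly like this. The signature is an assumption for illustration; the real adapter in runtime_state.py is bound to the TmuxBackend via a closure and may differ.

```python
import subprocess
from typing import Callable, Sequence


def tmux_run_str(run_fn: Callable[[Sequence[str]], "subprocess.CompletedProcess"],
                 args: Sequence[str]) -> str:
    """Adapt a CompletedProcess-returning runner to the str the probe expects."""
    try:
        out = run_fn(args).stdout
        if isinstance(out, bytes):
            # Decode defensively so odd terminal bytes cannot raise.
            return out.decode("utf-8", errors="replace")
        return out if isinstance(out, str) else ""
    except Exception:
        # Any failure (tmux missing, pane gone, bad result object) becomes "".
        return ""
```

Returning an empty string on failure keeps the probe's conservative behaviour: an empty capture has no prompt line, so the gate treats it as "not ready".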

3 unit tests for DualBridge integration (FIFO holder open with correct
flags, INIT_FAIL exit code 3 path, success path enters main loop).

Implementation written by Codex (gpt-5.4 xhigh) in Q3-S1a.3 task; commit
finalized by Claude (master) under user-authorized circuit-breaker mode
after Codex CLI hit two consecutive WebSocket conversation interrupts:
- service.py:27 fix TmuxBackend signature (one-line)
- runtime_state.py: tmux_run_str adapter (CompletedProcess -> str)
- test bug fixes (O_RDONLY bitmask is 0 on POSIX; _running is instance
  attr not class attr) so the 3 unit tests pass

Refs: DESIGN.md §4.6 §4.7 + Q3-S1a.3 task spec + ccb-collaboration.md
      circuit-breaker authorization (user 2026-04-24)

Co-Authored-By: Codex <noreply@openai.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>