Skip to content

fix(shell,term): surface PTY errors + guarantee keystroke ordering#938

Open
AgentY-asaf wants to merge 5 commits into
mainfrom
agenty/dead-terminal-error-propagation
Open

fix(shell,term): surface PTY errors + guarantee keystroke ordering#938
AgentY-asaf wants to merge 5 commits into
mainfrom
agenty/dead-terminal-error-propagation

Conversation

@AgentY-asaf
Copy link
Copy Markdown
Contributor

@AgentY-asaf AgentY-asaf commented May 20, 2026

Summary

Three fixes for terminal reliability under load and long sessions:

1. PTY/spawn error propagation (original fix)

  • When PTY open or shell spawn fails (e.g. under memory pressure), writes an ANSI-colored error message directly to the `blockfile` stream so xterm.js renders it immediately
  • Previously: silent black pane with blinking cursor; user had no indication of failure

2. Portable benchmark discovery

  • Removes `if is_dev {` gate on `authkey.dev` write in `agentmux-cef/src/main.rs`
  • `bench-term-echo.mjs` can now target portable builds for long-session regression testing without manual `--ws-url`/`--auth-key` flags
  • Spec: `docs/specs/SPEC_BENCHMARK_PORTABLE_DISCOVERY_2026_05_20.md`

3. Keystroke ordering guarantee (new)

  • Root cause: `engine.handle_message()` calls `tokio::spawn()` per RPC message; Tokio work-stealing scheduler executes tasks out of order, so characters typed fast arrive at PTY stdin transposed
  • Fix: `CommandBlockInputData.seq: Option` + per-block `BTreeMap` reorder buffer in `ShellControllerInner`; frontend attaches a monotonically increasing counter per `TermViewModel` on every `sendDataToController()` call
  • Backward compatible: `seq = None` (legacy clients) forwards unconditionally
  • Buffer capped at `SHELL_INPUT_CH_SIZE` (32) entries; duplicates discarded with a warn log
  • Spec: `docs/specs/SPEC_KEYSTROKE_ORDER_GUARANTEE_2026_05_20.md`

Remaining work

  • Frontend: retry `setRTInfo` with backoff when it returns `{}` (RT info race)
  • Memory pressure warning badge at ≥85%/≥95% load
  • Benchmark `--seq-test` mode for CI regression guard

🤖 Generated with Claude Code

AgentY-asaf and others added 2 commits May 19, 2026 19:18
Server sends {eventtype:"rpc", data:<RpcMessage>}. Responses use
rpcMsg.resid (not reqid) for correlation; events have rpcMsg.command
=== "eventrecv". Previous script parsed wrong fields so pane.open
never resolved. Also adds 10s RPC timeout to surface hangs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When PTY open or shell spawn fails (e.g. under memory pressure), write
an ANSI-colored error message directly to the blockfile stream so
xterm.js renders it immediately. Previously the failure was logged at
ERROR level but the user saw only a silent black pane with a cursor.

Also adds SPEC_DEAD_TERMINAL_PANE_2026_05_20.md documenting both the
PTY-spawn and RT-info-race failure modes with proposed fixes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@a5af
Copy link
Copy Markdown
Member

a5af commented May 20, 2026

@codex review

reagentx-workflow[bot]
reagentx-workflow Bot previously approved these changes May 20, 2026
Copy link
Copy Markdown

@reagentx-workflow reagentx-workflow Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude

ReAgent Diagnostics
Field Value
ReAgent Version 5.12.53
Trigger PR opened
Project Context CLAUDE.md loaded
Model claude-opus-4-7
Effort high
Ref Repos Enabled (dev-tools, shared-infrastructure)
Merge Analysis 1 regression(s) detected
Review Time 50.5s
Timestamp 2026-05-20T10:34:15Z
Repository agentmuxai/agentmux
PR #938

LGTM

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Already looking forward to the next diff.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

AgentY-asaf and others added 2 commits May 20, 2026 09:52
Removes the is_dev gate so portable and installed builds also write
authkey.dev to their data dir (~/.agentmux/versions/<v>/data/).
bench-term-echo.mjs already searches ~/.agentmux/versions/ so it
gains portable discovery with no further changes.

Also updates the benchmark error message and adds a mode=dev|portable
label to the "Using authkey.dev" output line, and adds
SPEC_BENCHMARK_PORTABLE_DISCOVERY_2026_05_20.md documenting the
long-session regression testing workflow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@reagentx-workflow reagentx-workflow Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude

ReAgent Diagnostics
Field Value
ReAgent Version 5.12.53
Trigger Re-review (new commits)
Project Context CLAUDE.md loaded
Model claude-opus-4-7
Effort high
Ref Repos Enabled (dev-tools, shared-infrastructure)
Merge Analysis 2 regression(s) detected
Review Time 41.7s
Timestamp 2026-05-20T17:55:41Z
Repository agentmuxai/agentmux
PR #938

Issues:

  • [P2] tools/tests/bench-term-echo.mjs:110 - mode label conflates installed builds with portable; any path not under .agentmux/dev is labeled "portable" even though installed builds also live under .agentmux/versions/ per the spec's portable/installed distinction

Add seq: Option<u64> to CommandBlockInputData. The frontend attaches a
monotonically increasing per-TermViewModel counter on every
sendDataToController() call. The backend ShellController buffers
early-arriving inputs in a BTreeMap reorder buffer and drains in-order,
eliminating transposed characters caused by Tokio work-stealing scheduling
independent tasks spawned in engine.handle_message().

Spec: docs/specs/SPEC_KEYSTROKE_ORDER_GUARANTEE_2026_05_20.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@AgentY-asaf AgentY-asaf changed the title fix(shell): surface PTY/spawn errors in terminal pane fix(shell,term): surface PTY errors + guarantee keystroke ordering May 20, 2026
Copy link
Copy Markdown

@reagentx-workflow reagentx-workflow Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude

ReAgent Diagnostics
Field Value
ReAgent Version 5.12.53
Trigger Re-review (new commits)
Project Context CLAUDE.md loaded
Model claude-opus-4-7
Effort high
Ref Repos Enabled (dev-tools, shared-infrastructure)
Merge Analysis 3 regression(s) detected
Review Time 56.4s
Timestamp 2026-05-20T18:43:47Z
Repository agentmuxai/agentmux
PR #938

Issues:

  • [P1] agentmux-srv/src/backend/blockcontroller/shell.rs:1003 - if try_send fails on an in-order input, input_seq_next is not incremented and the input is lost; all subsequent inputs will have s > next and accumulate in the reorder buffer until it fills, permanently breaking PTY input for that block
  • [P1] agentmux-srv/src/backend/blockcontroller/shell.rs:1011 - drain loop removes a buffered entry via remove(&expected) then propagates try_send errors with ?; on failure the buffered input is consumed/dropped and input_seq_next stays at the dropped seq, leaving a permanent hole that wedges sequence ordering
  • [P2] tools/tests/bench-term-echo.mjs:110 - mode label conflates installed builds with portable; any path not under .agentmux/dev is labeled "portable" even though installed builds also live under .agentmux/versions/ per SPEC_BENCHMARK_PORTABLE_DISCOVERY_2026_05_20.md (unaddressed from prior review)

Copy link
Copy Markdown

@reagentx-workflow reagentx-workflow Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude

ReAgent Diagnostics
Field Value
ReAgent Version 5.12.53
Trigger PR edited
Project Context CLAUDE.md loaded
Model claude-opus-4-7
Effort high
Ref Repos Enabled (dev-tools, shared-infrastructure)
Merge Analysis 3 regression(s) detected
Review Time 91.7s
Timestamp 2026-05-20T18:45:02Z
Repository agentmuxai/agentmux
PR #938

Issues:

  • [P1] agentmux-srv/src/backend/blockcontroller/shell.rs:177 - input_seq_next and input_seq_buf are only initialized in the constructor and never reset in start(); if the frontend's TermViewModel.inputSeq resets (page reload, WS reconnect, or new pane mounted to existing block) while the backend controller persists with input_seq_next > 0, frontend-sent seq values will hit the s < next branch and be silently discarded as "duplicates", making the terminal unresponsive
  • [P2] tools/tests/bench-term-echo.mjs:110 - mode label still conflates installed builds with portable (any path not under .agentmux/dev is labeled "portable" even though installed builds also live under .agentmux/versions/); previously flagged and not addressed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants