Commit c7ae632

garrytan and claude authored
v1.58.1.0 feat: hermetic local E2E + Conductor prose AskUserQuestion (garrytan#2004)
* feat: add shared call-time isConductor() helper

  Single source of truth for Conductor host detection in TS consumers (CONDUCTOR_WORKSPACE_PATH / CONDUCTOR_PORT). Reads the passed env at call time, not a module-load snapshot, so unit tests can pin the env inline without Bun --preload (esm-hoist-breaks-env-pin-bootstrap).

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test: harden question-preference-hook harness against ambient Conductor env

  runHook copied all of process.env into the hook subprocess, so running the suite inside Conductor (CONDUCTOR_WORKSPACE_PATH/PORT set) would leak those markers. Strip them so the existing cases deterministically characterize NON-Conductor behavior before the Conductor branch lands. Baseline: 15 pass.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: PreToolUse hook denies AskUserQuestion in Conductor, redirects to prose

  Conductor disables native AskUserQuestion and routes through a flaky MCP variant that returns '[Tool result missing due to internal error]'. The hook now denies any AUQ call in a Conductor session and instructs the model to render a prose decision brief instead (transport avoidance, not preference enforcement) — firing for one-way doors too, with a typed-confirmation requirement for destructive paths.

  Precedence: never-ask auto-decide still wins (user already settled those); Conductor prose is the fallback for everything else; non-Conductor behavior is byte-for-byte unchanged.

  Restructured the per-question loop to compute eligibility without early-returning so the Conductor branch can run as the fallback while preserving memoryContext on every exit.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: Conductor renders AskUserQuestion decisions as prose by default

  In Conductor, native AskUserQuestion is disabled and the MCP variant is flaky, so skills now render every decision as a plain-text prose brief the user answers by typing a letter — proactively, not as a failure reaction.

  - Preamble emits CONDUCTOR_SESSION, gated on != headless so eval/CI inside Conductor still BLOCKs instead of rendering prose to nobody.
  - AskUserQuestion Format gains a Conductor-default-prose rule (auto-decide preferences still apply first; prose decisions log via gstack-question-log since PostToolUse never fires), a one-way/destructive typed-confirmation rule, and a typed-reply continuation protocol for split chains.
  - Regenerated all SKILL.md + ship golden fixtures; bumped affected carve skeleton caps to absorb the always-loaded additions.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: deploy the Conductor AskUserQuestion hook (setup + upgrade migration)

  The PreToolUse hook only delivers its Conductor-prose guarantee if it's installed, but setup skips hook registration in non-interactive (conductor/CI) setups. Two fixes so layer 3 actually deploys:

  - setup: treat a Conductor workspace as an implicit opt-in for the PreToolUse hook on the silent fall-through (never overriding an explicit opt-out).
  - migration v1.58.0.0: re-register the hook for existing Conductor installs on /gstack-upgrade, idempotent and respecting plan_tune_hooks=no.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test: E2E for Conductor prose + fix auto-decide-preserved GSTACK_HOME bug

  - New skill-e2e-conductor-prose (periodic): Conductor env + plan-eng-review surfaces a prose decision brief, not a silent skip. Header documents this is end-to-end behavior coverage; the deterministic Conductor guard is the question-preference-hook unit test (the PTY harness can't register the MCP variant — Codex garrytan#10).
  - Fix the pre-existing bug in auto-decide-preserved: it seeded the never-ask preference under GSTACK_HOME=tmpHome but never passed GSTACK_HOME into the PTY run, so the spawned claude read the real ~/.gstack and the preference was inert (Codex garrytan#9). Now passes GSTACK_HOME + CONDUCTOR_WORKSPACE_PATH to prove auto-decide still wins over the Conductor prose redirect.
  - Register both in touchfiles (periodic tier).

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* v1.58.0.0 feat: Conductor renders AskUserQuestion decisions as prose

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test: strip ambient Conductor env in memory-cache-injection hook harness

  Same dev-in-Conductor leak fixed for question-preference-hook: this suite's runHook copies process.env, so running it inside Conductor flipped the defer-path memoryContext assertions into the [conductor] prose deny. Strip CONDUCTOR_* so the cases characterize non-Conductor behavior. (CI is headless, so this only bit local Conductor runs.)

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: gstack-detach — run agent eval/bench jobs in their own session

  Long agent-run jobs (30-60 min evals, benchmarks) die when the harness sends SIGTERM to a background task's process group on turn boundaries / monitor stops / interruptions (observed: 'script test:gate terminated by signal SIGTERM'). gstack-detach runs the command in a fresh session (python3 os.setsid, or setsid on Linux, nohup fallback) so a group SIGTERM can't reach it, and wraps it in caffeinate -i on macOS so idle-sleep can't kill it either. Returns immediately; caller polls the logfile. Secrets stay in env, never argv. The guard test pins the contract: the command runs in a different process group than the caller and outlives the launching shell.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: eval:bg* scripts — detached eval runs for agents

  Agent-facing convenience scripts that launch the eval suites through gstack-detach so a harness SIGTERM can't kill a long run. eval:bg (diff-based), eval:bg:all, eval:bg:gate, eval:bg:periodic — each returns immediately and streams to /tmp/gstack-evals.log for polling. The plain test:evals / test:e2e scripts stay foreground for humans.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs: CLAUDE.md — agents must run long evals via gstack-detach

  Codifies the detached-execution default: agent-launched eval/benchmark runs go through bin/gstack-detach (or the eval:bg* scripts) so a harness SIGTERM or macOS idle-sleep can't kill a 30-60 min run, then poll the log with a death-aware watcher. Humans keep foreground scripts.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: harden gstack-detach against all four eval-infra killers

  The basic bash detach fixed SIGTERM but a real run on a shared dev box hit three more killers: cross-worktree API saturation (15-way concurrency x a sibling worktree mass-timed-out the suite), a silent hang (periodic bun died with no exit marker), and shared-/tmp log contamination (a concurrent worktree's agent output bled into the log). Rewrite as a portable python3 tool that bakes in all four fixes:

  - fork + setsid: SIGTERM-proof (own session, survives harness polite-quit)
  - caffeinate -i on macOS: no idle-sleep death
  - --lock NAME (fcntl, machine-wide): concurrent worktrees SERIALIZE instead of saturating the shared model API
  - run-scoped default log (~/.gstack-dev/eval-runs/<label>-<slug>-<branch>-<ts>-<pid>): no cross-worktree collision/contamination
  - --timeout watchdog + a guaranteed '### gstack-detach EXIT=<code> ###' sentinel on every terminal path: no silent hang, finished-vs-died always detectable

  Guard test pins all four: detached pgid differs + outlives launcher, run-scoped log path, watchdog EXIT=timeout, and lock serialization (second run WAITS).

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: eval:bg* use run-scoped logs + machine lock + watchdog

  Drop the shared /tmp/gstack-evals.log path (the cross-worktree collision that contaminated a live run) for gstack-detach's run-scoped default, and add the machine-wide gstack-evals lock (concurrent worktrees serialize, no API saturation) plus per-tier watchdog timeouts (60/90/120 min). Each eval:bg* prints its run-scoped log path to poll.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs: wire detached-eval guidance into /ship + correct CLAUDE.md flags

  - /ship eval step (sections/tests.md): long eval suites launch via gstack-detach (own session, machine lock, EXIT sentinel) so a turn boundary can't kill a 30+ min run mid-ship — the exact failure observed during this branch's ship.
  - CLAUDE.md: correct the now-stale /tmp reference; document the --lock (serialize worktrees, no API saturation), --timeout watchdog, run-scoped log, and the guaranteed EXIT sentinel the poller breaks on.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* refactor: extract pure promotedEnv() from conductor-env-shim

  Single source of truth for GSTACK_* key promotion semantics. The ambient promoteConductorEnv() becomes a wrapper; behavior-preserving. Needed by the hermetic env builder which must not mutate process.env.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: hermetic child-env builder for E2E runners

  Allowlist scrub (basics/network/named-auth kept; CONDUCTOR_*, CLAUDE_*, GSTACK_*, MCP_*, GBRAIN_*, operator credentials dropped), per-runner extraAllow, overrides merge last, EVALS_HERMETIC=0 byte-identical escape hatch read at call time (ESM-hoist safe). Sync memoized singleton temp dirs (<runRoot>/.claude keeps the extractPlanFilePath contract), seeded .claude.json for non-interactive first run, pid-aware GC of crashed runs. 19 free unit tests.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: session-runner spawns hermetic children + isolation canaries

  claude -p children now get the allowlist-scrubbed env and a gated --strict-mcp-config (EVALS_HERMETIC=0 restores operator env AND args). Two gate-tier canaries make the clean room falsifiable: hermetic-canary asserts env redirect + scrub + zero MCP servers + nonzero API-key cost from the Bash tool_result (never model prose); hermetic-sentinel plants a poisoned operator config (user CLAUDE.md + MCP server) and proves the child cannot see it. Empirically verified on claude 2.1.175: print mode needs no seed config (the seed serves the PTY path); the child CLI sets CLAUDECODE for its own tools, so that scrub is pinned in unit tests, not E2E. hermetic-env.ts joins GLOBAL_TOUCHFILES.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: PTY runner spawns hermetic claude sessions

  launchClaudePty children get the allowlist-scrubbed env, a gated --strict-mcp-config, and the session exposes hermeticConfigDir for forensics (hermetic plan files live under <dir>/plans/ and still match extractPlanFilePath via the /.claude dir-name contract). Seeded trust state covers repo-cwd sessions; the 15s trust-watcher stays as fallback. Verified foreground via the plan-mode-no-op gate test.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: codex/gemini runners spawn hermetic children

  Same allowlist scrub as the claude runners, with each provider's auth surface re-admitted via extraAllow (codex: OPENAI_API_KEY/CODEX_* plus its tempHome .codex copy; gemini: GEMINI_*/GOOGLE_* with real HOME for ~/.gemini auth). The gemini spawn previously inherited the full operator env with no env property at all.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: agent-sdk-runner spawns hermetic children via complete Options.env

  The historical 'env: breaks SDK auth' failure was partial-env replacement: Options.env replaces the child's entire environment, so objects lacking ANTHROPIC_API_KEY killed auth. Passing the complete hermetic env (key + PATH + redirected CLAUDE_CONFIG_DIR/GSTACK_HOME) works — validated live via query() with a Bash tool call (success, real cost, Conductor vars scrubbed). Per-test opts.env merges last; ambient key mutation still works because the builder reads process.env at call time.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test: static tripwire pins hermetic wiring in all five runners

  Free-tier invariants: every runner builds child env via hermeticChildEnv, no raw ...process.env spread at any spawn site, --strict-mcp-config gated on isHermeticEnabled in both claude runners, and no test callsite passes the operator env into a runner's override parameter (scoped to runner calls — unit tests spawning gstack bin scripts directly are exempt). Mirrors the terminal-agent-pid-identity / server-embedder-terminal-port tripwire idiom.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test: refresh codex/factory ship goldens with detached-eval block

  a38089a added the gstack-detach guidance to the ship template and updated the claude golden; the codex and factory goldens missed the same 16-line block. Regenerated via bun run gen:skill-docs.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs: hermetic local E2E is the default; retire stale SDK env warning

  CLAUDE.md now documents the hermetic clean room (allowlist scrub, fresh seeded CLAUDE_CONFIG_DIR, temp GSTACK_HOME, --strict-mcp-config), EVALS_HERMETIC=0 as the debug escape hatch, and replaces the 'never pass env: to runAgentSdkTest' rule with the verified mechanism (partial-env replacement was the failure; complete env is safe).

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix: operational-learning fixture copies lib/jsonl-store.ts with the bin

  gstack-learnings-log imports $SCRIPT_DIR/../lib/jsonl-store.ts (hasInjection, v1.57.5.0) — copying only the bin scripts into the temp fixture broke the script with exit 1 since then. Latent because diff-based selection rarely runs this test; surfaced when hermetic-env.ts joined GLOBAL_TOUCHFILES and selected everything. Reproduced outside the hermetic env to confirm blame.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix: ios-qa daemon scenarios use unique pidfiles under --concurrent

  All scenarios shared join(workDir, 'daemon.pid') through a module-scope workDir binding that beforeEach reassigns mid-flight under bun --concurrent. First daemon claims; siblings get already_running against the test process's own always-alive pid and fail in milliseconds — the failure mode seen at 15-way gate concurrency. Per-claim unique pidfiles keep the single-instance semantics under test.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix: workflow judge re-appends body-carved sections after the marker slice

  runWorkflowJudge appended sections/*.md before slicing startMarker..endMarker. That handles skills that moved their MARKERS into sections (plan-eng, plan-design) but not document-release, which keeps its markers in the skeleton and carved the workflow BODY (Steps 2-9 -> sections/release-body.md) AFTER the endMarker — so the slice dropped it and the judge scored completeness 2 ('Steps 2-9 are in an external file'). Now any carved section the marker window excluded is re-appended, so the judge sees the full workflow the agent executes. document-release: completeness 2->5, clarity 3->4. ship/plan-ceo/plan-eng/plan-design judges unchanged (their section content is already inside the slice, so the head-dedup skips re-append). Pre-existing since the v1.57.0.0 carve (garrytan#1907); surfaced now because hermetic-env.ts is a global touchfile that selects every llm-judge test.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* harden: hermetic temp-dir GC grace window + half-seed cleanup

  Codex adversarial review (ship) flagged two temp-dir lifecycle edges:

  - GC deleted any dead-pid dir; PID reuse could delete a freshly-created dir whose original pid exited and was recycled to a live process. Now requires BOTH a dead pid AND mtime older than a 1h floor.
  - A seed-write failure after mkdir left an unseeded dir named with our live pid that this process's GC skips, leaking until exit. Now the partial dir is torn down before the (still loud) rethrow.

  Two findings left as-is by design: HOME stays allowlisted (CLAUDE_CONFIG_DIR wins for claude; codex/gemini need ~/.codex|~/.gemini auth; FS sandbox is TODOS.md:454 scope; the hermetic-sentinel canary proves config isolation), and PTY extraArgs --mcp-config is a deliberate caller opt-in like env overrides.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: document hermetic-by-default E2E + eval:bg detached runs in CONTRIBUTING

  The Testing & evals section now tells contributors that local E2E runners spawn children through a sealed clean room (allowlist-scrubbed env, seeded CLAUDE_CONFIG_DIR, temp GSTACK_HOME, --strict-mcp-config) so local signal matches CI, with EVALS_HERMETIC=0 as the escape hatch. The eval-tools list gains the eval:bg* detached-run scripts (gstack-detach: SIGTERM-proof, caffeinate-wrapped, machine-locked, run-scoped logs, EXIT= sentinel).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: sync package.json to 1.58.1.0

  The merge took main's package.json (1.58.0.0); gstack-version-bump repair fixed the working tree but the change was left uncommitted. Without this the committed tree disagrees with VERSION and CI's version-match test fails.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: regenerate diagram SKILL.md with Conductor prose preamble

  The diagram skill (new from main) was missing the Conductor-session prose AskUserQuestion blocks that gen-skill-docs propagates to every SKILL.md. Pure generated output; reproduced by bun run gen:skill-docs.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
1 parent 14fc086 commit c7ae632
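
The temp-dir GC hardening in the commit message (a dir is collected only when its pid is dead AND its mtime is past a 1h floor) can be sketched as a pure predicate. This is an illustration only, not the code in `test/helpers/hermetic-env.ts`; `TempDirInfo`, `isGcEligible`, and `GC_MTIME_FLOOR_MS` are hypothetical names.

```typescript
// Hypothetical sketch of the "dead pid AND old mtime" GC rule.
// The 1h floor comes from the commit message; every name here is assumed.
const GC_MTIME_FLOOR_MS = 60 * 60 * 1000;

interface TempDirInfo {
  ownerPidAlive: boolean; // does the pid encoded in the dir name map to a live process?
  mtimeMs: number;        // directory mtime in epoch milliseconds
}

// Requiring both conditions protects a freshly created dir whose pid looks
// dead for a moment, and a dir whose original pid was recycled.
function isGcEligible(dir: TempDirInfo, nowMs: number): boolean {
  const oldEnough = nowMs - dir.mtimeMs > GC_MTIME_FLOOR_MS;
  return !dir.ownerPidAlive && oldEnough;
}
```

Keeping the rule as a pure function of `(dir, nowMs)` lets a unit test pin both edges without touching the filesystem or the process table.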

89 files changed

Lines changed: 2747 additions & 221 deletions

CHANGELOG.md

Lines changed: 102 additions & 0 deletions
@@ -1,5 +1,107 @@
 # Changelog
 
+## [1.58.1.0] - 2026-06-14
+
+## **Local evals stop lying. Spawned `claude` test children run in a sealed clean room,**
+## **and in Conductor every decision is a plain-text brief you answer with a letter.**
+
+Two things shipped here. First, the local E2E harness is now hermetic by default:
+every spawned agent (claude -p, the real-PTY plan-mode runner, the Agent SDK
+runner, plus the codex and gemini runners) gets an allowlist-scrubbed environment,
+a fresh seeded `CLAUDE_CONFIG_DIR`, a temp `GSTACK_HOME`, and `--strict-mcp-config`.
+Before this, a dev machine leaked the operator's `~/.claude` config, MCP servers
+(gbrain, Conductor), skills, `~/.gstack` decision logs, and `CONDUCTOR_*`/`CLAUDECODE`
+env into every child, so local eval results disagreed with CI for reasons that had
+nothing to do with the code under test. Now local signal matches CI. Set
+`EVALS_HERMETIC=0` to debug against real operator state.
+
+Second, in a Conductor session gstack no longer fights Conductor's flaky
+AskUserQuestion tool. It detects the session and renders every decision as a prose
+brief, a labeled question with a recommendation, per-option completeness scores, and
+"reply with a letter," enforced by a PreToolUse hook that denies the tool and
+redirects to prose. Destructive confirmations demand an explicit typed answer.
+
+Agents that launch long eval runs get `gstack-detach`: a SIGTERM-proof, idle-sleep-proof
+wrapper (fresh session + `caffeinate`) with a machine-wide lock so concurrent
+worktrees serialize instead of saturating the model API, run-scoped logs, and a
+guaranteed `EXIT=` sentinel so a poller never mistakes silence for success.
+
+### The numbers that matter
+
+Measured against the gate eval suite on a contaminated dev box (gbrain MCP up, live
+Conductor session, sibling worktrees). Reproduce: `bun test` (free unit + wiring
+tripwire) and `EVALS=1 EVALS_TIER=gate bun test test/skill-e2e-hermetic-canary.test.ts`.
+
+| Metric | Before | After | Δ |
+|--------|--------|-------|---|
+| Spawned-child env | full operator `process.env` | allowlist-scrubbed | sealed |
+| Runners hermeticized | 0 of 5 | 5 of 5 | +5 |
+| Operator MCP servers visible to child | all (gbrain, Conductor) | 0 (`--strict-mcp-config`) | isolated |
+| Config isolation proof | none | poisoned-operator sentinel canary | falsifiable |
+| Long eval runs surviving a turn-boundary SIGTERM | no | yes (`gstack-detach`) | survives |
+
+The clean room is falsifiable, not asserted: a `hermetic-sentinel` gate canary
+plants a poisoned operator config (a user `CLAUDE.md` + an MCP server) and fails if
+the child can see any of it, and a free static tripwire fails CI if any runner
+reverts to a raw `process.env` spread.
+
+### What this means for contributors
+
+Run evals locally and trust the result. You no longer have to push to CI to find
+out whether a failure was real or just your machine bleeding context into the agent.
+Three latent bugs the old harness hid surfaced the moment the suite ran clean and
+are fixed: a coverage-judge that scored carved skills against half a document, an
+ios-qa daemon test that collided on a shared pidfile under concurrency, and an
+operational-learning fixture missing a lib it imports. Start a run with
+`bun run eval:bg:gate`; flip `EVALS_HERMETIC=0` only when you deliberately want your
+real `~/.claude` in the loop.
+
+### Itemized changes
+
+#### Added
+- **Hermetic E2E environment** (`test/helpers/hermetic-env.ts`): allowlist env
+builder (process basics, network/proxy vars, named `ANTHROPIC_*` auth, per-runner
+`extraAllow`), pure `promotedEnv()` shared with `lib/conductor-env-shim.ts`, a
+sync-memoized singleton temp dir (`<runRoot>/.claude` keeps the plan-file path
+contract), a seeded `.claude.json` for non-interactive first run, and pid-aware GC
+of crashed runs. Default-on; `EVALS_HERMETIC=0` restores the legacy env AND drops
+`--strict-mcp-config`.
+- **Two gate-tier isolation canaries** (`test/skill-e2e-hermetic-canary.test.ts`):
+`hermetic-canary` asserts env redirect + scrub + zero MCP servers + nonzero
+API-key cost from the Bash tool_result (not model prose); `hermetic-sentinel`
+proves the child cannot see a planted poisoned operator config.
+- **Static wiring tripwire** (`test/hermetic-wiring.test.ts`): free-tier invariants
+that fail CI if any of the five runners drops `hermeticChildEnv()`, the gated
+`--strict-mcp-config`, or leaks `process.env` through a callsite override.
+- **`gstack-detach`** + `eval:bg` / `eval:bg:all` / `eval:bg:gate` / `eval:bg:periodic`
+scripts: detached, SIGTERM-proof, `caffeinate`-wrapped eval runs with a machine-wide
+lock, per-run logs under `~/.gstack-dev/eval-runs/`, a watchdog, and an `EXIT=`
+sentinel.
+- **Conductor prose AskUserQuestion**: when a Conductor session is detected, every
+decision renders as a prose brief (labeled question, recommendation, per-option
+completeness, reply-with-a-letter), enforced by a PreToolUse hook that denies the
+tool and redirects. Auto-decide preferences still apply first; destructive
+confirmations require an explicit typed answer. Installed for Conductor even in
+non-interactive setup, with an upgrade migration for existing installs.
+
+#### Changed
+- All five E2E runners (`session-runner`, `claude-pty-runner`, `agent-sdk-runner`,
+`codex-session-runner`, `gemini-session-runner`) spawn children through
+`hermeticChildEnv()`. The Agent SDK runner now receives a COMPLETE hermetic env
+via `Options.env` (the old "never pass env: to the SDK" rule was partial-env
+replacement; a complete env is safe).
+- `hermetic-env.ts` is a global touchfile, so any change to it selects every E2E +
+judge test.
+- CLAUDE.md documents hermetic-by-default local evals and retires the stale SDK env
+warning.
+
+#### Fixed
+- The workflow LLM-judge now re-appends body-carved `sections/*.md` after the marker
+slice, so carved skills (document-release) are judged on the full workflow the
+agent executes instead of a half-document.
+- ios-qa daemon scenarios use unique pidfiles, fixing `already_running` collisions
+under `bun test --concurrent`.
+
 ## [1.58.0.0] - 2026-06-12
 
 ## **Your documents grow diagrams. Mermaid and excalidraw fences render as real pictures,**
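
The allowlist scrub that the changelog entry describes can be sketched as below. This is a minimal illustration under stated assumptions, not the contents of `test/helpers/hermetic-env.ts`: `buildChildEnv`, `BASE_ALLOW`, the exact allowlisted names, and the trailing-`*` pattern syntax are all hypothetical. Only the overall shape (allowlist, per-runner `extraAllow`, overrides merged last, `EVALS_HERMETIC=0` read at call time) comes from the commit.

```typescript
type Env = Record<string, string | undefined>;

// Assumed allowlist for illustration; the real list may differ.
const BASE_ALLOW = [
  "PATH", "HOME", "LANG", "TERM", "TMPDIR",
  "HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY",
  "ANTHROPIC_API_KEY",
];

interface BuildOpts {
  extraAllow?: string[];              // per-runner additions, e.g. "CODEX_*"
  overrides?: Record<string, string>; // per-test values, merged last
}

// A pattern is an exact name or a prefix ending in "*" (assumed syntax).
function matches(key: string, pattern: string): boolean {
  return pattern.endsWith("*")
    ? key.startsWith(pattern.slice(0, -1))
    : key === pattern;
}

// Reads the escape hatch from the env passed in at call time, never from a
// module-load snapshot, so a test can pin it inline.
function buildChildEnv(parent: Env, opts: BuildOpts = {}): Record<string, string> {
  const hermetic = parent.EVALS_HERMETIC !== "0";
  const allow = [...BASE_ALLOW, ...(opts.extraAllow ?? [])];
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(parent)) {
    if (value === undefined) continue;
    // Non-hermetic mode keeps everything; hermetic mode keeps allowlisted keys only.
    if (!hermetic || allow.some((p) => matches(key, p))) out[key] = value;
  }
  return { ...out, ...(opts.overrides ?? {}) };
}
```

Because overrides merge after the scrub, a test that deliberately sets `CONDUCTOR_WORKSPACE_PATH` or its own `GSTACK_HOME` still gets them, while ambient operator values are dropped.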

CLAUDE.md

Lines changed: 48 additions & 5 deletions
@@ -31,11 +31,26 @@ use Codex's own auth from `~/.codex/` config — no `OPENAI_API_KEY` env var nee
 `lib/conductor-env-shim.ts`) promotes `GSTACK_ANTHROPIC_API_KEY` /
 `GSTACK_OPENAI_API_KEY` to their canonical names inside gstack's TS binaries.
 Tests run through gstack entrypoints inherit this promotion automatically.
-Don't echo the key value to stdout, logs, or shell history. When passing to a
-test's Agent SDK, do NOT pass `env: {...}` to `runAgentSdkTest` — the SDK's
-auth pipeline doesn't pick up the key the same way when env is supplied as an
-object (confirmed failure mode). Mutate `process.env.ANTHROPIC_API_KEY`
-ambiently before the call and restore in `finally`.
+Don't echo the key value to stdout, logs, or shell history. The historical
+"never pass `env:` to `runAgentSdkTest`" rule is retired: the failure was
+partial-env replacement (the SDK's `Options.env` REPLACES the child's entire
+environment, so an object without the key broke auth). The runner now always
+passes a COMPLETE hermetic env with per-test `env:` merged last, so per-test
+overrides are safe; ambient `process.env.ANTHROPIC_API_KEY` mutation also
+still works (the env builder reads process.env at call time).
+
+**Hermetic local E2E (default).** Every E2E runner (claude -p, PTY, Agent
+SDK, codex, gemini) spawns children through `test/helpers/hermetic-env.ts`:
+allowlist-scrubbed env (operator `CONDUCTOR_*`, `CLAUDE_*`, `GSTACK_*`,
+`MCP_*`, `GBRAIN_*`, and credentials like `GH_TOKEN` never reach children),
+a fresh seeded `CLAUDE_CONFIG_DIR` (no operator `~/.claude` CLAUDE.md /
+MCP servers / skills), a temp `GSTACK_HOME`, and `--strict-mcp-config`.
+Local eval signal matches CI. Debug against real operator state with
+`EVALS_HERMETIC=0` (restores the legacy env AND drops the strict-MCP flag).
+Per-test `env:` overrides merge last, so deliberate contamination
+(`CONDUCTOR_WORKSPACE_PATH`, per-test `GSTACK_HOME`) keeps working. Wiring
+is pinned by `test/hermetic-wiring.test.ts` (static tripwire) and two
+gate-tier canaries in `test/skill-e2e-hermetic-canary.test.ts`.
 
 E2E tests stream progress in real-time (tool-by-tool via `--output-format stream-json
 --verbose`). Results are persisted to `~/.gstack-dev/evals/` with auto-comparison
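
The retired rule in this hunk comes down to replace-versus-merge semantics. The sketch below models that with plain objects: it does not call the Agent SDK, and `childEnvUnderReplace` and `completeEnv` are hypothetical helpers that only mirror the behavior the hunk describes (`Options.env` replaces the child's entire environment).

```typescript
type FullEnv = Record<string, string>;

// Model of "replace" semantics: the child sees exactly the object it was
// given and inherits nothing from the parent.
function childEnvUnderReplace(optionsEnv: FullEnv): FullEnv {
  return { ...optionsEnv };
}

// The safe pattern: start from a complete base env, then merge the per-test
// overrides last so they win without dropping the auth key.
function completeEnv(base: FullEnv, perTest: FullEnv = {}): FullEnv {
  return { ...base, ...perTest };
}

const base: FullEnv = { PATH: "/usr/bin", ANTHROPIC_API_KEY: "placeholder" };

// Partial env: the key is simply absent in the child, which is the old failure.
const partial = childEnvUnderReplace({ GSTACK_HOME: "/tmp/gstack" });

// Complete env: the key survives and the override is present.
const complete = childEnvUnderReplace(completeEnv(base, { GSTACK_HOME: "/tmp/gstack" }));
```

Under this model the failure is not specific to authentication: any variable missing from the passed object, including `PATH`, is missing in the child.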
@@ -828,6 +843,34 @@ them. Report progress at each check (which tests passed, which are running, any
 failures so far). The user wants to see the run complete, not a promise that
 you'll check later.
 
+## Running evals as an agent: always detach (SIGTERM-proof)
+
+When **you (an agent/harness)** launch a long eval/benchmark run, run it through
+`bin/gstack-detach` — NEVER as a plain backgrounded Bash task. A plain background
+task lives in the harness's process group, so a SIGTERM ("polite quit") on a turn
+boundary, a stopped Monitor, or an interruption kills the run mid-flight (observed:
+`script "test:gate" was terminated by signal SIGTERM` ~40 min into a run). On macOS
+the run can also die to idle-sleep. `gstack-detach` fixes both: a fresh session
+(escapes the group SIGTERM) wrapped in `caffeinate -i` (blocks idle-sleep).
+
+- Use the `eval:bg*` scripts (`eval:bg`, `eval:bg:all`, `eval:bg:gate`,
+`eval:bg:periodic`) — they wrap the eval command in `gstack-detach` with the
+machine-wide `gstack-evals` lock (concurrent worktrees serialize instead of
+saturating the shared model API), a per-tier watchdog, and a **run-scoped** log
+under `~/.gstack-dev/eval-runs/` (no shared-`/tmp` collision). Each prints its
+log path. Or call `gstack-detach [--lock NAME] [--timeout SECS] [--label LBL] --
+<cmd>` directly for any long agent job. Export `ANTHROPIC_API_KEY` first (never
+pass keys in argv).
+- Then **poll the printed logfile** with a death-aware watcher: break on the
+guaranteed `### gstack-detach EXIT=<code> ###` sentinel (success AND failure are
+both marked, so silence is never mistaken for success). The detached run survives
+even if your watcher gets reaped, so re-checking the log always works.
+- Why the lock: a shared dev box with several Conductor worktrees will rate-limit
+the model API if two eval suites run at once (15-way concurrency each), which
+mass-times-out E2E tests. The lock makes the second run WAIT, not collide.
+- Humans running `bun run test:evals` foreground in their own terminal don't need
+this — Ctrl-C is intended there. Detachment is for agent-launched runs only.
+
 ## E2E test fixtures: extract, don't copy
 
 **NEVER copy a full SKILL.md file into an E2E test fixture.** SKILL.md files are
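
The "death-aware watcher" in the detach section above reduces to one check on the log text: has the `### gstack-detach EXIT=<code> ###` sentinel appeared yet? A minimal sketch follows. The sentinel format and the `EXIT=timeout` value come from the commit; `readRunStatus`, the `RunStatus` type, and the assumption that `EXIT=0` means success are illustrative.

```typescript
type RunStatus =
  | { state: "running" }
  | { state: "finished"; code: string; ok: boolean };

// Matches the sentinel on a line of its own. The code is captured as a string
// because non-numeric values such as "timeout" are possible.
const SENTINEL = /^### gstack-detach EXIT=(\S+) ###\s*$/m;

function readRunStatus(logText: string): RunStatus {
  const m = SENTINEL.exec(logText);
  if (!m) return { state: "running" }; // no sentinel yet: keep polling
  const code = m[1] ?? "";
  // Assumption: a conventional exit code of 0 is the only success value.
  return { state: "finished", code, ok: code === "0" };
}
```

A poller would re-read the log file on an interval and stop at the first `finished` result. Because the sentinel is written on every terminal path, a missing sentinel means "not done yet" and is never treated as success.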

CONTRIBUTING.md

Lines changed: 31 additions & 0 deletions
@@ -176,6 +176,18 @@ EVALS=1 bun test test/skill-e2e-*.test.ts
 - Saves full NDJSON transcripts and failure JSON for debugging
 - Tests live in `test/skill-e2e-*.test.ts` (split by category), runner logic in `test/helpers/session-runner.ts`
 
+**Hermetic by default.** Every E2E runner (claude -p, the real-PTY plan-mode
+runner, the Agent SDK runner, plus the codex and gemini runners) spawns its child
+through `test/helpers/hermetic-env.ts`: an allowlist-scrubbed environment, a fresh
+seeded `CLAUDE_CONFIG_DIR`, a temp `GSTACK_HOME`, and `--strict-mcp-config`. Your
+operator `~/.claude` config, MCP servers (gbrain, Conductor), skills, `~/.gstack`
+decision logs, and `CONDUCTOR_*` env never leak into the child, so local eval
+signal matches CI instead of disagreeing for reasons unrelated to the code under
+test. Set `EVALS_HERMETIC=0` to debug against your real operator state (this also
+drops `--strict-mcp-config`). The wiring is pinned by `test/hermetic-wiring.test.ts`
+(a free static tripwire) and two gate-tier isolation canaries in
+`test/skill-e2e-hermetic-canary.test.ts`.
+
 ### E2E observability
 
 When E2E tests run, they produce machine-readable artifacts in `~/.gstack-dev/`:
@@ -198,6 +210,25 @@ bun run eval:compare # compare two runs — shows per-test deltas + Take
 bun run eval:summary # aggregate stats + per-test efficiency averages across runs
 ```
 
+**Detached runs for agents and long suites.** When an agent (or you, for a run
+you don't want to babysit) launches a long eval, use the `eval:bg*` scripts. They
+wrap the eval command in `bin/gstack-detach`: a fresh session that escapes a
+turn-boundary SIGTERM, a `caffeinate` wrapper that blocks idle-sleep, a machine-wide
+`gstack-evals` lock so concurrent worktrees serialize instead of saturating the
+model API, a run-scoped log under `~/.gstack-dev/eval-runs/`, a per-tier watchdog,
+and a guaranteed `### gstack-detach EXIT=<code> ###` sentinel so a poller never
+mistakes silence for success.
+
+```bash
+bun run eval:bg # detached test:evals (diff-based)
+bun run eval:bg:all # detached test:evals:all
+bun run eval:bg:gate # detached gate-tier suite
+bun run eval:bg:periodic # detached periodic-tier suite
+```
+
+Each prints its log path. Humans running `bun run test:evals` foreground in their
+own terminal don't need this — Ctrl-C is intended there.
+
 **Eval comparison commentary:** `eval:compare` generates natural-language Takeaway sections interpreting what changed between runs — flagging regressions, noting improvements, calling out efficiency gains (fewer turns, faster, cheaper), and producing an overall summary. This is driven by `generateCommentary()` in `eval-store.ts`.
 
 Artifacts are never cleaned up — they accumulate in `~/.gstack-dev/` for post-mortem debugging and trend analysis.

SKILL.md

Lines changed: 7 additions & 0 deletions
@@ -48,6 +48,13 @@ echo "REPO_MODE: $REPO_MODE"
 _SESSION_KIND=$(~/.claude/skills/gstack/bin/gstack-session-kind 2>/dev/null || echo "interactive")
 case "$_SESSION_KIND" in spawned|headless|interactive) ;; *) _SESSION_KIND="interactive" ;; esac
 echo "SESSION_KIND: $_SESSION_KIND"
+# Conductor host: AskUserQuestion is unreliable here (native disabled, MCP
+# variant flaky), so skills render decisions as prose instead of calling the
+# tool. Gated on !headless so an eval/CI run INSIDE Conductor (GSTACK_HEADLESS)
+# still BLOCKs rather than rendering prose to nobody.
+if [ "$_SESSION_KIND" != "headless" ] && { [ -n "${CONDUCTOR_WORKSPACE_PATH:-}" ] || [ -n "${CONDUCTOR_PORT:-}" ]; }; then
+echo "CONDUCTOR_SESSION: true"
+fi
 _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no")
 echo "LAKE_INTRO: $_LAKE_SEEN"
 _TEL=$(~/.claude/skills/gstack/bin/gstack-config get telemetry 2>/dev/null || true)
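
The shell gate in this hunk has a TypeScript counterpart in the commit's call-time `isConductor()` helper. The real helper's signature is not shown on this page, so the sketch below is an assumption that mirrors the shell logic: either Conductor marker set and non-empty, and the session not headless. `normalizeSessionKind` and `rendersProse` are hypothetical names.

```typescript
type ProcEnv = Record<string, string | undefined>;
type SessionKind = "spawned" | "headless" | "interactive";

// Reads the env passed in at call time, so a unit test can pin it inline.
// Empty strings count as unset, matching the shell's `-n` test.
function isConductor(env: ProcEnv): boolean {
  return Boolean(env.CONDUCTOR_WORKSPACE_PATH) || Boolean(env.CONDUCTOR_PORT);
}

// Mirrors the `case` fallback: anything unrecognised becomes "interactive".
function normalizeSessionKind(raw: string): SessionKind {
  return raw === "spawned" || raw === "headless" || raw === "interactive"
    ? raw
    : "interactive";
}

// Prose is rendered only when a human can read it: a headless eval/CI run
// inside Conductor keeps its normal blocking behavior instead.
function rendersProse(kind: SessionKind, env: ProcEnv): boolean {
  return kind !== "headless" && isConductor(env);
}
```

Passing `env` explicitly rather than closing over `process.env` at import time is what lets the harness tests strip or set `CONDUCTOR_*` per case.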

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-1.58.0.0
+1.58.1.0
