Commit 14c2c48

harshitsinghbhandari, claude, and github-actions[bot] authored
fix(desktop): recover login-shell env so the daemon finds zellij/credentials (#389)
* fix(desktop): recover login-shell env so the daemon finds zellij/credentials

A Finder/Dock launch starts the app via launchd, not a login shell, so ~/.zprofile and ~/.zshrc are never sourced. The daemon then inherits launchd's minimal env (no /opt/homebrew/bin on PATH, no exported ANTHROPIC_API_KEY, etc.), cannot exec zellij/git, and its agents cannot authenticate.

Launching from a terminal masked this because the shell had already populated the env, so it only reproduced on a real Finder/Dock launch.

Resolve the login-shell environment once at startup ($SHELL -ilc "printf sentinel; env -0"), adopt it as the base for daemonEnv(), and force PATH from the shell with a static floor when the probe fails (timeout/non-zero exit). The probe never blocks startup (3s SIGKILL timeout, stdin closed) and degrades to the floor rather than erroring.

Windows keeps its existing behavior: its env comes from the registry/session and is inherited by GUI-launched apps, so this bug does not exist there.

Pure parse/merge logic lives in shared/shell-env.ts (no node:* imports, per the daemon-attach.ts convention) with unit tests; main.ts owns the real spawn.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore: format with prettier [skip ci]

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
1 parent 82d69ce commit 14c2c48

4 files changed

Lines changed: 500 additions & 2 deletions

docs/daemon-environment.md

Lines changed: 197 additions & 0 deletions
@@ -0,0 +1,197 @@
# Daemon environment: the GUI-launch PATH/credentials problem

Status: proposed
Scope: desktop (Electron) launch of the AO daemon on macOS (and any GUI-launched
desktop platform)

## Summary

When the desktop app is launched from Finder/Dock/Spotlight, the daemon it spawns
inherits a stunted environment (minimal `PATH`, no shell-exported credentials).
The daemon then cannot find `zellij`/`git`/the agent CLIs, and the agents it
launches cannot see API keys. The same app launched from a terminal works,
because a terminal-started process inherits the shell's fully-populated
environment. The fix is to resolve the user's login-shell environment once at
startup and use it as the base for the daemon's environment.

## Problem statement

The Electron supervisor spawns the Go daemon with the environment it forwards in
`daemonEnv()` (`frontend/src/main.ts`), which is essentially `...process.env`
plus AO's telemetry defaults. The daemon, in turn, is the parent of every agent
session (it execs `zellij`, which runs `claude`/`codex`, etc.), and the agent's
`PATH` is derived from the daemon's own `PATH`
(`runtimeEnv` -> `HookPATH(m.executable, os.Getenv, ...)` in
`backend/internal/session_manager/manager.go`).

So whatever environment the daemon receives propagates to the entire stack:

```
launchd (or terminal) -> Electron main -> daemon -> zellij -> agent (claude/codex)
```

When that environment is impoverished, everything downstream breaks.

### Observed symptoms

All of these were traced to the same root cause:

- Terminal pane stuck on "Terminal disconnected - reattaching...".
- Terminal pane showing "Terminal ended ... but the session is not marked
  terminated yet."
- Sessions stuck `idle` + `is_terminated = 0` in the store, never reaped, and
  therefore not restorable (`Restore` requires `IsTerminated`, otherwise
  `ErrNotRestorable`).
- `zellij list-sessions` showing sessions as alive-but-unreachable or dead,
  depending on which socket universe was inspected.

The unifying cause: the running, GUI-launched daemon cannot execute
`/opt/homebrew/bin/zellij` (and friends), so its liveness probes error
(`ProbeFailed`, never `ProbeDead`, so the reaper never terminates the row) and
its terminal attaches cannot spawn `zellij attach`.

## Root cause: GUI apps do not inherit the shell environment

On macOS, a process's environment is inherited solely from its parent. The
parent differs by launch method:

- **Terminal launch.** The terminal starts a login/interactive shell
  (`zsh -l`). That shell sources `/etc/zprofile`, `~/.zprofile`, `~/.zshrc`,
  etc. Those files are the only thing that sets the rich environment:
  `eval "$(/opt/homebrew/bin/brew shellenv)"` adds `/opt/homebrew/bin` to
  `PATH`; `export ANTHROPIC_API_KEY=...` exports credentials. Every process
  started from that terminal inherits the result. The app works.

- **Finder/Dock/Spotlight launch.** The app is started by **launchd**, not by a
  shell. launchd hands the process a fixed, minimal environment
  (`PATH=/usr/bin:/bin:/usr/sbin:/sbin`, `HOME`, `USER`, `TMPDIR`, little else).
  No shell runs anywhere in the chain, so no rc/profile file is ever sourced.
  The homebrew `PATH` and the exported credentials simply do not exist for the
  app, and `daemonEnv()` faithfully forwards that minimal env down to the daemon.

This is deliberate on Apple's part: GUI apps are decoupled from interactive shell
configuration on purpose (it can be slow, interactive, or machine-specific). The
old `~/.MacOSX/environment.plist` escape hatch was removed years ago. This is the
single most common macOS-Electron footgun; it is why packages like `fix-path` and
`shell-env` exist.

### Why "just forward env" is correct in principle

Forwarding the environment is not the bug. The daemon and agents genuinely need:

- `PATH` to resolve `zellij`, `git`, `node`, and the agent CLIs;
- `HOME` for config/credentials (`~/.gitconfig`, `~/.claude`, `~/.codex`, ssh
  keys);
- shell-exported credentials (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GH_TOKEN`,
  ...);
- locale/proxy (`LANG`, `LC_*`, `HTTPS_PROXY`);
- AO's own vars (telemetry, `AO_DATA_DIR`, `AO_RUN_FILE`, session ids).

The bug is the _source_ of what we forward: under a GUI launch, `process.env` is
launchd's minimal env, not the shell's. The fix is to forward a _good_ base env,
not to stop forwarding.

## Proposed solution: resolve the login-shell environment

Do not reconstruct the shell environment by hand. Run the user's login shell
once, ask it to print its environment, and adopt that as the base for
`daemonEnv()`.

### The mechanism

```
zsh -ilc 'env -0'
```

- `-l` (login): source `/etc/zprofile` and `~/.zprofile` (where the homebrew
  `PATH` line typically lives).
- `-i` (interactive): source `~/.zshrc` (where most `export` lines live).
- `-c 'env -0'`: run one command and exit. `env` dumps the environment the shell
  built after sourcing all config; `-0` separates entries with NUL bytes instead
  of newlines, so values containing newlines parse unambiguously.

The output is a faithful snapshot of "what a terminal would see." Parse it back
into key/value pairs and merge it under the existing forwarded env so explicit
overrides still win. The one exception is `PATH`: under a GUI launch,
`process.env.PATH` is launchd's minimal value and would overwrite the resolved
one, so `PATH` is forced from the shell:

```
finalEnv = { ...shellEnv, ...process.env, PATH: shellEnv.PATH, AO_*: defaults }
```
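
The parse step can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `shared/shell-env.ts`: the function name `parseEnvDump` is hypothetical, and the sentinel string is the one suggested under "Implementation details" below. It drops everything before the sentinel, tolerates the newline that `echo` appends after it, and splits the rest on NUL bytes.

```typescript
// Hypothetical sentinel; the marker the app actually prints may differ.
const SENTINEL = "__AO_ENV_START__";

// Parse `<banner noise><sentinel>KEY=value\0KEY=value\0...` into a plain map.
// Returns null when the sentinel is missing so the caller can fall back.
function parseEnvDump(
  stdout: string,
  sentinel: string = SENTINEL,
): Record<string, string> | null {
  const start = stdout.indexOf(sentinel);
  if (start === -1) return null;

  // `echo` appends a newline after the sentinel, `printf` does not; accept both.
  const payload = stdout.slice(start + sentinel.length).replace(/^\r?\n/, "");

  const env: Record<string, string> = {};
  for (const entry of payload.split("\0")) {
    const eq = entry.indexOf("=");
    // Skip the empty trailing entry and anything without a variable name.
    if (eq <= 0) continue;
    // Split on the first "=" only: values may themselves contain "=".
    env[entry.slice(0, eq)] = entry.slice(eq + 1);
  }
  return env;
}
```

Returning `null` instead of throwing keeps the "degrade, never error" contract: a missing sentinel is treated the same as a failed probe.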
### Worked example

GUI-launched daemon env (before):

```
PATH=/usr/bin:/bin:/usr/sbin:/sbin
HOME=/Users/<user>
```

After `zsh -ilc 'env -0'` resolution:

```
PATH=/opt/homebrew/bin:/opt/homebrew/sbin:/usr/bin:/bin:/usr/sbin:/sbin
HOME=/Users/<user>
ANTHROPIC_API_KEY=sk-ant-...
GH_TOKEN=ghp_...
LANG=en_US.UTF-8
```

The daemon can now resolve `/opt/homebrew/bin/zellij`, and agents inherit the
credentials.
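
To make the before/after concrete, the sketch below resolves a binary name against a `PATH` string roughly the way an exec lookup would. `resolveOnPath` and the injected `exists` predicate are hypothetical helpers for illustration only, not part of the AO codebase, and the set of installed files is made up.

```typescript
// Walk PATH left to right and return the first candidate that exists.
// Simplification: empty PATH segments are skipped rather than treated as the
// current directory.
function resolveOnPath(
  name: string,
  pathVar: string,
  exists: (candidate: string) => boolean,
): string | null {
  for (const dir of pathVar.split(":")) {
    if (dir === "") continue;
    const candidate = `${dir}/${name}`;
    if (exists(candidate)) return candidate;
  }
  return null;
}

// Made-up filesystem: zellij lives under Homebrew, git ships with the system.
const installed = new Set(["/opt/homebrew/bin/zellij", "/usr/bin/git"]);
const existsOnDisk = (candidate: string): boolean => installed.has(candidate);

const launchdPath = "/usr/bin:/bin:/usr/sbin:/sbin";
const shellPath =
  "/opt/homebrew/bin:/opt/homebrew/sbin:/usr/bin:/bin:/usr/sbin:/sbin";

// Under launchd's PATH the lookup misses zellij; under the shell's it succeeds.
const zellijBefore = resolveOnPath("zellij", launchdPath, existsOnDisk);
const zellijAfter = resolveOnPath("zellij", shellPath, existsOnDisk);
```

With the launchd `PATH`, `zellijBefore` is `null`, which is the condition behind the `ProbeFailed` symptoms described earlier; with the shell `PATH`, the same lookup finds the Homebrew binary.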
### Implementation details

Place the resolution in Electron's `daemonEnv()` (`frontend/src/main.ts`), the
parent that hands env to the daemon.

- **Resolve once, cache.** Sourcing rc files can take 100ms to >1s
  (nvm/pyenv/...). Do it a single time at startup; never per-session.
- **Pick the shell robustly.** Prefer `process.env.SHELL`; under launchd it may
  be absent, so fall back to the user record
  (`dscl . -read /Users/$USER UserShell`), then `/bin/zsh`. Do not hardcode zsh;
  honor bash/fish.
- **Isolate the payload.** Interactive shells can print banners/motd/prompts to
  stdout. Bracket the real output with a sentinel and read only after it:
  `zsh -ilc 'echo __AO_ENV_START__; env -0'`.
- **No stdin, with a timeout.** Run with `</dev/null` and a ~2-3s timeout so a
  misconfigured rc that waits for input cannot hang startup.
- **Fallback on any failure.** If the probe fails, times out, or exits nonzero,
  fall back to a static base: prepend
  `/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin` and pull
  through known credential vars. A weird shell config then degrades to "zellij
  and git resolve" rather than "broken."
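
The merge and fallback rules above can be sketched as a pure function. This is an illustration under assumptions, not the `buildDaemonEnv` that ships in `shared/shell-env.ts`: the names `mergeEnvSketch`, `withPathFloor`, and `PATH_FLOOR` are hypothetical, and forcing `PATH` from the shell follows the commit message rather than any code shown here.

```typescript
type Env = Record<string, string | undefined>;

// Static floor taken from the fallback bullet above.
const PATH_FLOOR = [
  "/opt/homebrew/bin",
  "/usr/local/bin",
  "/usr/bin",
  "/bin",
  "/usr/sbin",
  "/sbin",
];

// Prepend the floor to an existing PATH, dropping duplicates and empty segments.
function withPathFloor(existing: string | undefined): string {
  const seen = new Set<string>();
  const dirs: string[] = [];
  for (const dir of [...PATH_FLOOR, ...(existing ?? "").split(":")]) {
    if (dir === "" || seen.has(dir)) continue;
    seen.add(dir);
    dirs.push(dir);
  }
  return dirs.join(":");
}

// shellEnv is null when the probe failed, timed out, or exited nonzero.
function mergeEnvSketch(
  processEnv: Env,
  shellEnv: Env | null,
  overrides: Record<string, string>,
): Env {
  if (shellEnv === null) {
    // Degraded mode: keep what was inherited (including any credentials that
    // happen to be set) and guarantee a usable PATH.
    return { ...processEnv, PATH: withPathFloor(processEnv.PATH), ...overrides };
  }
  // Normal mode: the shell env is the base and inherited values win, except
  // PATH, which is taken from the shell so launchd's minimal PATH cannot
  // clobber it.
  return {
    ...shellEnv,
    ...processEnv,
    PATH: shellEnv.PATH ?? withPathFloor(processEnv.PATH),
    ...overrides,
  };
}
```

Keeping this logic free of `node:*` imports means it can be unit-tested without spawning a shell, which is the split the commit message describes between the shared module and `main.ts`.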
### Platform scope

- macOS: required (this is where the GUI/launchd split bites).
- Linux: the same class of problem exists for `.desktop`-launched apps; the same
  resolution applies.
- Windows: not applicable. GUI-launched apps inherit their environment from the
  registry/session, so the existing `process.env` forwarding is kept unchanged
  (no shell probe, no unix `PATH` floor).

This matches what `shell-env`/`fix-path` do; the logic above is the entirety of
it. We shell out once to the user's own shell and adopt its result.

## Testing

- Parser unit test: feed NUL-separated output, including a value containing a
  newline and leading banner noise before the sentinel; assert the resulting map
  is correct and the noise is dropped.
- Fallback test: simulate probe failure/timeout; assert the static PATH floor and
  credential pass-through are applied.
- Manual: launch the packaged app from Finder (not a terminal) and confirm a new
  session spawns, the terminal attaches, and `zellij`/`git`/agent binaries
  resolve.

## Relevant code

- `frontend/src/main.ts` - `daemonEnv()` (env forwarded to the daemon), daemon
  spawn.
- `backend/internal/session_manager/manager.go` - `runtimeEnv` / `HookPATH`
  (agent `PATH` derived from the daemon's `PATH`); `spawnEnv`.
- `backend/internal/adapters/runtime/zellij/zellij.go` - `defaultBinary()`
  (`exec.LookPath("zellij")` against the daemon's `PATH`).
- `backend/internal/observe/reaper/reaper.go`,
  `backend/internal/lifecycle/runtime.go` - liveness -> termination
  (`ProbeFailed` never terminates, so a daemon that cannot run `zellij` strands
  sessions).

frontend/src/main.ts

Lines changed: 82 additions & 2 deletions
@@ -28,6 +28,7 @@ import {
   resolveDaemonFromPort,
   resolveDaemonFromRunFile,
 } from "./shared/daemon-attach";
+import { buildDaemonEnv, resolveShellEnv, type ShellRunner } from "./shared/shell-env";
 import { DEFAULT_POSTHOG_HOST, DEFAULT_POSTHOG_PROJECT_KEY } from "./shared/posthog-config";
 import { buildTelemetryBootstrap } from "./shared/telemetry";
 import { createBrowserViewHost, type BrowserViewHost } from "./main/browser-view-host";
@@ -220,16 +221,90 @@ function runFilePath(): string | null {
   return defaultRunFilePath(process.platform, process.env, os.homedir());
 }
 
-function daemonEnv(): NodeJS.ProcessEnv {
+// How long to wait for the login shell to print its env before giving up. A
+// misconfigured rc that blocks (or a slow nvm/pyenv chain) must not hang startup;
+// the daemon then falls back to the static PATH floor.
+const SHELL_ENV_TIMEOUT_MS = 3_000;
+
+// The login-shell env resolved once at startup (see docs/daemon-environment.md),
+// or null when the probe failed/timed out. Read synchronously by daemonEnv().
+let cachedShellEnv: Record<string, string> | null = null;
+// Memoize the in-flight resolution so concurrent/repeat awaits are cheap.
+let shellEnvPromise: Promise<void> | null = null;
+
+// Telemetry defaults stamped on the daemon env on every platform; explicit env
+// always wins.
+function telemetryOverrides(): Record<string, string> {
   return {
-    ...process.env,
     AO_TELEMETRY_EVENTS: process.env.AO_TELEMETRY_EVENTS ?? "on",
     AO_TELEMETRY_REMOTE: process.env.AO_TELEMETRY_REMOTE ?? "posthog",
     AO_TELEMETRY_POSTHOG_KEY: process.env.AO_TELEMETRY_POSTHOG_KEY ?? DEFAULT_POSTHOG_PROJECT_KEY,
     AO_TELEMETRY_POSTHOG_HOST: process.env.AO_TELEMETRY_POSTHOG_HOST ?? DEFAULT_POSTHOG_HOST,
   };
 }
 
+// Run the user's login shell to dump its env. stdin is ignored so an rc that
+// reads input hits EOF instead of hanging; stderr is ignored to drop banner
+// noise. Never rejects: resolves null on spawn error, non-zero exit, or timeout
+// (SIGKILLed), so the caller degrades to the static PATH floor.
+const runLoginShell: ShellRunner = (shellPath, args) =>
+  new Promise((resolve) => {
+    let settled = false;
+    const finish = (value: string | null) => {
+      if (settled) return;
+      settled = true;
+      resolve(value);
+    };
+    let child: ReturnType<typeof spawn>;
+    try {
+      child = spawn(shellPath, args, { stdio: ["ignore", "pipe", "ignore"] });
+    } catch {
+      finish(null);
+      return;
+    }
+    const timer = setTimeout(() => {
+      child.kill("SIGKILL");
+      finish(null);
+    }, SHELL_ENV_TIMEOUT_MS);
+    let stdout = "";
+    // stdout may be typed Readable | null under this stdio config; guard it.
+    child.stdout?.on("data", (chunk: Buffer) => {
+      stdout += chunk.toString("utf8");
+    });
+    child.once("error", () => {
+      clearTimeout(timer);
+      finish(null);
+    });
+    child.once("exit", (code) => {
+      clearTimeout(timer);
+      finish(code === 0 ? stdout : null);
+    });
+  });
+
+// Resolve the login-shell env once and cache it. No-op on Windows (the launchd
+// shell split does not apply; a static PATH floor suffices). Awaited at the
+// daemon-spawn chokepoint so the cache is populated before the first spawn.
+function ensureShellEnv(): Promise<void> {
+  if (process.platform === "win32") return Promise.resolve();
+  if (!shellEnvPromise) {
+    shellEnvPromise = resolveShellEnv(process.env, runLoginShell).then((resolved) => {
+      cachedShellEnv = resolved;
+      if (!resolved) {
+        console.error("AO: could not read the login-shell environment; falling back to a static PATH floor.");
+      }
+    });
+  }
+  return shellEnvPromise;
+}
+
+function daemonEnv(): NodeJS.ProcessEnv {
+  // Windows keeps the old behavior exactly: no shell probe, no unix PATH floor.
+  if (process.platform === "win32") {
+    return { ...process.env, ...telemetryOverrides() };
+  }
+  return buildDaemonEnv(process.env, cachedShellEnv, telemetryOverrides());
+}
+
 function pathKey(value: string): string {
   const resolved = path.resolve(value);
   return process.platform === "win32" ? resolved.toLowerCase() : resolved;
@@ -358,6 +433,11 @@ async function startDaemonInner(startEpoch: number): Promise<DaemonStatus> {
     return daemonStatus;
   }
 
+  // Single chokepoint: make sure the login-shell env is resolved before the
+  // daemon is spawned, so a Finder/Dock launch hands the daemon a real PATH and
+  // shell-exported credentials rather than launchd's minimal env.
+  await ensureShellEnv();
+
   const launch = resolveDaemonLaunch(
     process.env,
     app.isPackaged,
