feat: pause conductor heartbeats after inactivity#839
Conversation
asheshgoplani
left a comment
There was a problem hiding this comment.
Hi @yaroshevych — thank you for the thorough work, the heartbeat-pause feature is genuinely useful and the implementation is mostly clean (variadic backwards-compat, proper hook-ledger usage instead of raw tmux activity, additive JSON, default-off). Tests pass, race-condition is benign, no regressions on existing conductors.
One concern from the dual-review (Claude + Codex agreed on this) that I'd like resolved before merge:
Silent-failure mode on zero lastActivity. shouldSkipConductorHeartbeatSend at cmd/agent-deck/session_cmd.go:1859 treats lastActivity.IsZero() as "skip heartbeat." In the happy path this means "no managed sessions yet, no need to ping." But it also fires if:
- The hook ledger files are missing (filesystem cleanup, fresh installs of the runtime)
- The ledger is corrupt (write race, disk full mid-write, partial flush after crash)
readHookStatusFilereturns zero for any reason while sessions still exist
In those cases, heartbeats stop and stay stopped until the user manually messages the conductor — which they likely won't notice for hours or days. That's the same class of silent failure we've been fixing across the codebase this week (notifier dedup, SQLite leak, busy-retry termination), so I'd rather not ship another silent-failure mode.
The test docstring on TestGetConductorLastActivity_NoManagedSessions even says "must not suppress heartbeats in this case" but the implementation does suppress — the assertion would mask a future regression that someone "fixes" toward the docstring.
Suggested fix: in shouldSkipConductorHeartbeatSend, treat lastActivity.IsZero() as "send heartbeat" (the conservative default) rather than "skip." If activity is genuinely never recorded, the user keeps getting their normal cadence — which they can disable explicitly via --no-heartbeat. Suppression should require positive evidence of recent activity.
If you'd rather take a different approach (e.g. a separate "activity-required" flag), happy to discuss. Either way the test docstring should be aligned with the implementation.
Everything else is good to go — once that's resolved, this lands in v1.7.79.
Committed by Ashesh Goplani
0231055 to
1aa5cdb
Compare
|
Good catch. Updated the behaviour. |
Summary
If agent-deck is left running for a long period of time - e.g. overnight, it keeps pinging conductor, burning tokens. New feature adds
--heartbeat-idle-minutesflag, which pauses heartbeats after period of inactivity.Tests
Tested manually:
agent-deck conductor setup test --heartbeat-idle-minutes 3Unit tests: