Commit 0c429e5

fix(gastown): restore agent working status on heartbeat after dispatch timeout race (#1359)
* fix(gastown): restore agent working status on heartbeat after dispatch timeout race (#1358)

  Three compounding fixes for the 5-minute bead reset cycle caused by a timing race between startAgentInContainer's 60s timeout and slow cold starts:

  1. touchAgent restores idle→working on heartbeat — a heartbeat is proof the agent is alive in the container regardless of its recorded status.
  2. reconcileBeads Rule 3 checks last_activity_at freshness — defense in depth so an agent with a recent heartbeat is never treated as lost, even if its status field is wrong.
  3. dispatchAgent's !started path no longer sets the agent to idle — it leaves the agent working so the reconciler doesn't reset the bead. reconcileAgents catches truly dead agents after 90s of missing heartbeats.

  Closes #1358

* fix(gastown): add cold start grace period for container_status not_found

  The container status pre-phase polls /agents/:id/status on every alarm tick. During a cold start (git clone + worktree), the agent hasn't registered in the process manager yet, so the container returns 404. This was immediately setting the agent to idle, undoing the dispatch timeout fix.

  Add a 3-minute grace period for not_found status: if the agent was dispatched recently (last_activity_at < 3 min ago), ignore the 404. Truly dead agents are still caught by reconcileAgents after 90s of missing heartbeats.

* fix(gastown): fix SQLite datetime comparison bug that prevented stuck bead recovery

  reconcileBeads Rule 3 compared ISO 8601 timestamps (2026-03-21T05:55:50Z) against SQLite datetime() output (2026-03-21 05:55:50). Since 'T' (ASCII 84) > ' ' (ASCII 32), the comparison last_activity_at > datetime('now', '-90 seconds') was ALWAYS TRUE — the heartbeat check never expired. Rule 3 thought every hooked agent had a fresh heartbeat and never recovered stuck in_progress beads.

  Fix: use strftime('%Y-%m-%dT%H:%M:%fZ', ...) to produce ISO 8601 output matching the stored timestamps.

  Also: move invariant violation logging from console.error (spamming Workers logs every 5s per town) to analytics events for observability dashboards.

  Closes #1361

* fix(gastown): set refinery to idle after gt_done so the next review can start immediately

  The refinery's gt_done path unhooks the agent but doesn't set it to idle. The refinery stays 'working' with no hook until agentCompleted fires (when the container process exits, which can take 10-30s after gt_done). During that time processReviewQueue sees the refinery as non-idle and won't pop the next MR bead.

  Set the refinery to idle immediately after unhooking in agentDone. The container process continues running, but the DO knows the refinery is available for new reviews.

* fix(gastown): set working agents with no hook to idle in reconcileAgents

  Working agents with fresh heartbeats but no hook are running in the container doing nothing — gt_done already ran and unhooked them, or the hook was cleared by another path. Without this, the refinery stays 'working' indefinitely (heartbeats keep it alive), blocking processReviewQueue from dispatching it for the next review.

  Also skip the mayor in the working-agent check (mayors are always working with no hook — that's normal). This eliminates the invariant 7 false positive from #1364.

* style: run oxfmt formatter

* fix(gastown): guard agentCompleted idle transition against re-dispatched agents

  agentCompleted unconditionally set the agent to idle, which could clobber a live dispatch if the agent was re-hooked and dispatched for new work between gt_done and the container's completion callback.

  Add a guard: don't set the agent to idle if it is working AND has a hook (re-dispatched). Only set it to idle if it is working with no hook (gt_done completed, waiting for process exit) or already idle.

* style: format review-queue.ts
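The datetime bug is easy to reproduce as a plain string comparison, since SQLite compares TEXT timestamps lexicographically. A minimal TypeScript sketch (illustrative values, not repo code):

// Stored heartbeat (ISO 8601, as written by the app) vs. SQLite datetime() output.
const stored = '2026-03-21T05:55:50Z';
const datetimeCutoff = '2026-03-21 05:55:50'; // shape of datetime('now', '-90 seconds')

// 'T' (0x54) sorts after ' ' (0x20), so any same-day ISO timestamp compares
// greater than the cutoff no matter how stale the heartbeat actually is:
console.log(stored > datetimeCutoff); // true, even for a heartbeat hours old

// strftime with an ISO format string yields a cutoff in the same notation,
// so the comparison reflects real freshness:
const isoCutoff = '2026-03-21T05:54:20.500Z'; // shape of strftime('%Y-%m-%dT%H:%M:%fZ', ...)
console.log(stored > isoCutoff); // true only when the heartbeat is actually recent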
1 parent 9790a2e commit 0c429e5

6 files changed: 201 additions & 41 deletions

cloudflare-gastown/src/dos/Town.do.ts

Lines changed: 56 additions & 18 deletions

@@ -240,7 +240,11 @@ export class TownDO extends DurableObject<Env> {
   }

   private emitEvent(data: Omit<GastownEventData, 'userId' | 'delivery'>): void {
-    writeEvent(this.env, { ...data, delivery: 'internal', userId: this._ownerUserId });
+    writeEvent(this.env, {
+      ...data,
+      delivery: 'internal',
+      userId: this._ownerUserId,
+    });
   }

   /** Build the context object used by the scheduling sub-module. */
@@ -290,7 +294,9 @@ export class TownDO extends DurableObject<Env> {
         });
       }

-      return scheduling.dispatchAgent(schedulingCtx, agent, bead, { systemPromptOverride });
+      return scheduling.dispatchAgent(schedulingCtx, agent, bead, {
+        systemPromptOverride,
+      });
     },
     stopAgent: async agentId => {
       await dispatch.stopAgentInContainer(this.env, this.townId, agentId);
@@ -677,7 +683,9 @@ export class TownDO extends DurableObject<Env> {
     const townConfig = await this.getTownConfig();
     if (!townConfig.kilocode_token || townConfig.kilocode_token !== rigConfig.kilocodeToken) {
       console.log(`${TOWN_LOG} configureRig: propagating kilocodeToken to town config`);
-      await this.updateTownConfig({ kilocode_token: rigConfig.kilocodeToken });
+      await this.updateTownConfig({
+        kilocode_token: rigConfig.kilocodeToken,
+      });
     }
   }

@@ -1214,10 +1222,14 @@ export class TownDO extends DurableObject<Env> {
    * Return undelivered, non-expired nudges for an agent.
    * Urgent nudges are returned first, then FIFO within same priority.
    */
-  async getPendingNudges(
-    agentId: string
-  ): Promise<
-    { nudge_id: string; message: string; mode: string; priority: string; source: string }[]
+  async getPendingNudges(agentId: string): Promise<
+    {
+      nudge_id: string;
+      message: string;
+      mode: string;
+      priority: string;
+      source: string;
+    }[]
   > {
     const rows = [
       ...query(
@@ -1768,7 +1780,12 @@ export class TownDO extends DurableObject<Env> {

   /** Build the rig list for mayor agent startup (browse worktree setup on fresh containers). */
   private async rigListForMayor(): Promise<
-    Array<{ rigId: string; gitUrl: string; defaultBranch: string; platformIntegrationId?: string }>
+    Array<{
+      rigId: string;
+      gitUrl: string;
+      defaultBranch: string;
+      platformIntegrationId?: string;
+    }>
   > {
     const rigRecords = rigs.listRigs(this.sql);
     return Promise.all(
@@ -1792,7 +1809,10 @@ export class TownDO extends DurableObject<Env> {
     message: string,
     _model?: string,
     uiContext?: string
-  ): Promise<{ agentId: string; sessionStatus: 'idle' | 'active' | 'starting' }> {
+  ): Promise<{
+    agentId: string;
+    sessionStatus: 'idle' | 'active' | 'starting';
+  }> {
     const townId = this.townId;

     let mayor = agents.listAgents(this.sql, { role: 'mayor' })[0] ?? null;
@@ -1876,7 +1896,10 @@ export class TownDO extends DurableObject<Env> {
    * Called eagerly on page load so the terminal is available immediately
    * without requiring the user to send a message first.
    */
-  async ensureMayor(): Promise<{ agentId: string; sessionStatus: 'idle' | 'active' | 'starting' }> {
+  async ensureMayor(): Promise<{
+    agentId: string;
+    sessionStatus: 'idle' | 'active' | 'starting';
+  }> {
     const townId = this.townId;

     let mayor = agents.listAgents(this.sql, { role: 'mayor' })[0] ?? null;
@@ -2250,7 +2273,10 @@ export class TownDO extends DurableObject<Env> {
     tasks: Array<{ title: string; body?: string; depends_on?: number[] }>;
     merge_mode?: 'review-then-land' | 'review-and-merge';
     staged?: boolean;
-  }): Promise<{ convoy: ConvoyEntry; beads: Array<{ bead: Bead; agent: Agent | null }> }> {
+  }): Promise<{
+    convoy: ConvoyEntry;
+    beads: Array<{ bead: Bead; agent: Agent | null }>;
+  }> {
     // Resolve staged: explicit request wins, otherwise fall back to town config default.
     const townConfig = await this.getTownConfig();
     const isStaged = input.staged ?? townConfig.staged_convoys_default;
@@ -2449,9 +2475,10 @@ export class TownDO extends DurableObject<Env> {
   /**
    * Transition a staged convoy to active: hook agents and begin dispatch.
    */
-  async startConvoy(
-    convoyId: string
-  ): Promise<{ convoy: ConvoyEntry; beads: Array<{ bead: Bead; agent: Agent | null }> }> {
+  async startConvoy(convoyId: string): Promise<{
+    convoy: ConvoyEntry;
+    beads: Array<{ bead: Bead; agent: Agent | null }>;
+  }> {
     const convoy = this.getConvoy(convoyId);
     if (!convoy) throw new Error(`Convoy not found: ${convoyId}`);
     if (!convoy.staged) throw new Error(`Convoy is not staged: ${convoyId}`);
@@ -2994,9 +3021,14 @@ export class TownDO extends DurableObject<Env> {
       const violations = reconciler.checkInvariants(this.sql);
       metrics.invariantViolations = violations.length;
       if (violations.length > 0) {
-        console.error(
-          `${TOWN_LOG} [reconciler:invariants] town=${townId} ${violations.length} violation(s): ${JSON.stringify(violations)}`
-        );
+        // Emit as an analytics event for observability dashboards instead
+        // of console.error (which spams Workers logs every 5s per town).
+        this.emitEvent({
+          event: 'reconciler.invariant_violations',
+          townId,
+          label: violations.map(v => `[${v.invariant}] ${v.message}`).join('; '),
+          value: violations.length,
+        });
       }
     } catch (err) {
       console.warn(`${TOWN_LOG} [reconciler:invariants] town=${townId} check failed`, err);
@@ -3575,7 +3607,13 @@ export class TownDO extends DurableObject<Env> {
         []
       ),
     ];
-    const beadCounts = { open: 0, inProgress: 0, inReview: 0, failed: 0, triageRequests: 0 };
+    const beadCounts = {
+      open: 0,
+      inProgress: 0,
+      inReview: 0,
+      failed: 0,
+      triageRequests: 0,
+    };
     for (const row of beadRows) {
       const s = `${row.status as string}`;
       const c = Number(row.cnt);
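For reference, the payload of the new invariant-violations event reduces to this shape, reconstructed from the hunks above (emitEvent adds delivery: 'internal' and userId; the field types here are assumptions, not the repo's declared types):

type InvariantViolationEvent = {
  event: 'reconciler.invariant_violations';
  townId: string;
  label: string; // '; '-joined "[invariant] message" entries
  value: number; // violation count, chartable on a dashboard
};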

cloudflare-gastown/src/dos/town/agents.ts

Lines changed: 9 additions & 0 deletions

@@ -546,11 +546,20 @@ export function touchAgent(
     activeTools?: string[];
   }
 ): void {
+  // A heartbeat is proof the agent is alive in the container.
+  // If the agent's status is 'idle' (e.g. due to a dispatch timeout
+  // race — see #1358), restore it to 'working'. This prevents the
+  // reconciler from treating the agent as lost while it's actively
+  // sending heartbeats.
   query(
     sql,
     /* sql */ `
       UPDATE ${agent_metadata}
       SET ${agent_metadata.columns.last_activity_at} = ?,
+          ${agent_metadata.columns.status} = CASE
+            WHEN ${agent_metadata.columns.status} = 'idle' THEN 'working'
+            ELSE ${agent_metadata.columns.status}
+          END,
           ${agent_metadata.columns.last_event_type} = COALESCE(?, ${agent_metadata.columns.last_event_type}),
           ${agent_metadata.columns.last_event_at} = COALESCE(?, ${agent_metadata.columns.last_event_at}),
           ${agent_metadata.columns.active_tools} = COALESCE(?, ${agent_metadata.columns.active_tools})
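The CASE expression reduces to a one-line rule; a toy model (hypothetical names, not the repo's API): a heartbeat only ever promotes 'idle' back to 'working', and every other status passes through untouched.

type AgentStatus = 'idle' | 'working' | 'stalled';

// Mirrors the SQL CASE above: a heartbeat proves liveness, so undo a wrongly
// recorded 'idle' (the dispatch-timeout race), and never touch anything else.
function statusAfterHeartbeat(current: AgentStatus): AgentStatus {
  return current === 'idle' ? 'working' : current;
}

console.log(statusAfterHeartbeat('idle'));    // 'working' (race undone)
console.log(statusAfterHeartbeat('stalled')); // 'stalled' (left to the reconciler)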

cloudflare-gastown/src/dos/town/reconciler.ts

Lines changed: 42 additions & 6 deletions

@@ -249,7 +249,18 @@ export function applyEvent(sql: SqlStorage, event: TownEventRecord): void {
   const agent = agents.getAgent(sql, event.agent_id);
   if (!agent) return;

-  // Only act on working/stalled agents whose container has stopped
+  // Only act on working/stalled agents whose container has stopped.
+  // For 'not_found': skip if the agent was dispatched recently (#1358).
+  // During a cold start the container may 404 on /agents/:id/status
+  // because the agent hasn't registered in the process manager yet.
+  // The 3-minute grace period covers the 60s HTTP timeout plus
+  // typical cold start time (git clone + worktree). Truly dead
+  // agents are caught by reconcileAgents after 90s of no heartbeats.
+  if (containerStatus === 'not_found' && agent.last_activity_at) {
+    const ageSec = (Date.now() - new Date(agent.last_activity_at).getTime()) / 1000;
+    if (ageSec < 180) return; // 3-minute grace for cold starts
+  }
+
   if (
     (agent.status === 'working' || agent.status === 'stalled') &&
     (containerStatus === 'exited' || containerStatus === 'not_found')
@@ -340,6 +351,9 @@ export function reconcileAgents(sql: SqlStorage): Action[] {
   ]);

   for (const agent of workingAgents) {
+    // Mayors are always working with no hook — skip them
+    if (agent.role === 'mayor') continue;
+
     if (!agent.last_activity_at) {
       // No heartbeat ever received — container may have failed to start
       actions.push({
@@ -357,6 +371,18 @@ export function reconcileAgents(sql: SqlStorage): Action[] {
         to: 'idle',
         reason: 'heartbeat lost (3 missed cycles)',
       });
+    } else if (!agent.current_hook_bead_id) {
+      // Agent is working with fresh heartbeat but no hook — it's running
+      // in the container but has no bead to work on (gt_done already ran,
+      // or the hook was cleared by another code path). Set to idle so
+      // processReviewQueue / schedulePendingWork can use it.
+      actions.push({
+        type: 'transition_agent',
+        agent_id: agent.bead_id,
+        from: 'working',
+        to: 'idle',
+        reason: 'working agent has no hook (gt_done already completed)',
+      });
     }
   }

@@ -599,18 +625,24 @@ export function reconcileBeads(sql: SqlStorage): Action[] {
   for (const bead of staleInProgress) {
     if (!staleMs(bead.updated_at, STALE_IN_PROGRESS_TIMEOUT_MS)) continue;

-    // Check if any agent is hooked AND working/stalled
+    // Check if any agent is hooked AND (working/stalled OR has a recent
+    // heartbeat). The heartbeat check is defense-in-depth for #1358: if
+    // the agent's status is wrong (e.g. stuck on 'idle' due to a dispatch
+    // timeout race), a fresh heartbeat proves the agent is alive.
     const hookedAgent = z
-      .object({ status: z.string() })
+      .object({ status: z.string(), last_activity_at: z.string().nullable() })
       .array()
       .parse([
         ...query(
           sql,
           /* sql */ `
-            SELECT ${agent_metadata.status}
+            SELECT ${agent_metadata.status}, ${agent_metadata.last_activity_at}
             FROM ${agent_metadata}
             WHERE ${agent_metadata.current_hook_bead_id} = ?
-              AND ${agent_metadata.status} IN ('working', 'stalled')
+              AND (
+                ${agent_metadata.status} IN ('working', 'stalled')
+                OR ${agent_metadata.last_activity_at} > strftime('%Y-%m-%dT%H:%M:%fZ', 'now', '-90 seconds')
+              )
           `,
           [bead.bead_id]
         ),
@@ -991,7 +1023,11 @@ export function reconcileReviewQueue(sql: SqlStorage): Action[] {
   }

   const mrRows = z
-    .object({ status: z.string(), type: z.string(), rig_id: z.string().nullable() })
+    .object({
+      status: z.string(),
+      type: z.string(),
+      rig_id: z.string().nullable(),
+    })
     .array()
     .parse([
       ...query(
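The grace-period check in applyEvent pulls out to a standalone sketch (hypothetical helper, not repo code): last_activity_at is refreshed at dispatch, so a 404 from the container is ignored while the agent is under 3 minutes old, covering the 60s start timeout plus a typical clone/worktree cold start.

function shouldIgnoreNotFound(lastActivityAt: string | null, nowMs: number = Date.now()): boolean {
  if (!lastActivityAt) return false; // never dispatched or heartbeated: no grace
  const ageSec = (nowMs - new Date(lastActivityAt).getTime()) / 1000;
  return ageSec < 180; // 3-minute grace window for cold starts
}

// Dispatched 2 minutes ago, container 404s: still inside the grace window.
console.log(shouldIgnoreNotFound(new Date(Date.now() - 120_000).toISOString())); // true
// Last activity 5 minutes ago: the 404 is treated as a real stop.
console.log(shouldIgnoreNotFound(new Date(Date.now() - 300_000).toISOString())); // false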

cloudflare-gastown/src/dos/town/review-queue.ts

Lines changed: 15 additions & 2 deletions

@@ -24,7 +24,7 @@ import {
   getConvoyFeatureBranch,
   getConvoyMergeMode,
 } from './beads';
-import { getAgent, unhookBead } from './agents';
+import { getAgent, unhookBead, updateAgentStatus } from './agents';
 import { getRig } from './rigs';
 import type { ReviewQueueInput, ReviewQueueEntry, AgentDoneInput, Molecule } from '../../types';

@@ -621,6 +621,11 @@ export function agentDone(sql: SqlStorage, agentId: string, input: AgentDoneInput
     }

     unhookBead(sql, agentId);
+    // Set refinery to idle immediately — the review is done and the
+    // refinery is available for new work. Without this, processReviewQueue
+    // sees the refinery as 'working' and won't pop the next MR bead until
+    // agentCompleted fires (when the container process eventually exits).
+    updateAgentStatus(sql, agentId, 'idle');
     return;
   }

@@ -722,7 +727,11 @@ export function agentCompleted(
     }
   }

-  // Mark agent idle.
+  // Mark agent idle — but ONLY if it hasn't been re-dispatched (status
+  // still 'working' on new work) since gt_done ran. agentCompleted can
+  // arrive after the agent has been re-hooked and dispatched for a new
+  // bead. Without this guard, the stale completion event would clobber
+  // the live dispatch.
   // For refineries, preserve dispatch_attempts so Rule 6's circuit-breaker
   // can track cumulative re-dispatch attempts across idle→dispatch cycles.
   // Resetting to 0 here was enabling infinite loops (#1342). Non-refineries
@@ -737,6 +746,10 @@
           ELSE 0
         END
       WHERE ${agent_metadata.bead_id} = ?
+        AND NOT (
+          ${agent_metadata.columns.status} = 'working'
+          AND ${agent_metadata.columns.current_hook_bead_id} IS NOT NULL
+        )
     `,
     [agentId]
   );
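The guard on the final UPDATE reduces to a small predicate; a toy version (hypothetical names, not repo code):

// The stale completion callback may set 'idle' only when the agent is not
// visibly mid-dispatch, i.e. not (working AND hooked).
function completionMaySetIdle(status: string, hookBeadId: string | null): boolean {
  return !(status === 'working' && hookBeadId !== null);
}

console.log(completionMaySetIdle('working', 'bead-42')); // false: re-dispatched, leave it alone
console.log(completionMaySetIdle('working', null));      // true: gt_done ran, awaiting process exit
console.log(completionMaySetIdle('idle', null));         // true: already idle, a no-op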

cloudflare-gastown/src/dos/town/scheduling.ts

Lines changed: 8 additions & 15 deletions

@@ -159,20 +159,11 @@ export async function dispatchAgent(
       });
     } else {
       // Container start returned false — but the container may have
-      // actually started the agent (timeout race). DON'T roll back
-      // the bead to open. Leave it in_progress with the agent idle+hooked.
-      // If the agent truly failed: rehookOrphanedBeads recovers after 2 min.
-      // If the agent actually started: it works and calls gt_done normally.
-      query(
-        ctx.sql,
-        /* sql */ `
-          UPDATE ${agent_metadata}
-          SET ${agent_metadata.columns.status} = 'idle',
-              ${agent_metadata.columns.last_activity_at} = ?
-          WHERE ${agent_metadata.bead_id} = ?
-        `,
-        [now(), agent.id]
-      );
+      // actually started the agent (timeout race). Leave the agent
+      // as 'working' so the reconciler doesn't treat it as lost.
+      // If the agent truly didn't start: reconcileAgents catches it
+      // after 90s of missing heartbeats and transitions to 'idle'.
+      // If the agent actually started: heartbeats keep it alive. (#1358)
       ctx.emitEvent({
         event: 'agent.dispatch_failed',
         townId: ctx.townId,
@@ -185,7 +176,9 @@ export async function dispatchAgent(
     return started;
   } catch (err) {
     console.error(`${LOG} dispatchAgent: failed for agent=${agent.id}:`, err);
-    Sentry.captureException(err, { extra: { agentId: agent.id, beadId: bead.bead_id } });
+    Sentry.captureException(err, {
+      extra: { agentId: agent.id, beadId: bead.bead_id },
+    });
     try {
       query(
         ctx.sql,
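Why leaving the agent 'working' is safe in both outcomes of the start-timeout race, as a small model (assumed constant and hypothetical names, not repo code):

const HEARTBEAT_LOST_MS = 90_000; // reconcileAgents: three missed heartbeat cycles

// Slow-but-successful start: heartbeats arrive, the age stays small, the agent survives.
// Genuinely failed start: no heartbeats ever arrive, so the agent is reset after 90s.
function reconcilerVerdict(msSinceLastHeartbeat: number | null): 'keep working' | 'reset to idle' {
  if (msSinceLastHeartbeat === null) return 'reset to idle'; // no heartbeat ever received
  return msSinceLastHeartbeat < HEARTBEAT_LOST_MS ? 'keep working' : 'reset to idle';
}

console.log(reconcilerVerdict(20_000)); // 'keep working': cold start that came up late
console.log(reconcilerVerdict(null));   // 'reset to idle': start truly failed, bead gets re-dispatched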
