Commit 14c6afb

feat(kiloclaw): add openclaw doctor controller API route (#2707)
* feat(kiloclaw): add openclaw doctor controller API route

Adds POST /_kilo/doctor/run to the machine-side controller as a replacement for the 60s-capped Fly Machines exec API. Wired end-to-end through a new DO RPC, a worker platform route (/api/platform/doctor-controller), the internal client, a new admin tRPC procedure, and a new 'Run Doctor (Controller)' admin button.

Behavior:
- Synchronous buffered response with a 120s hard cap; SIGTERM then SIGKILL after a 5s grace on timeout or client disconnect.
- Single run at a time, serialized via a runStartExclusive queue; 409 openclaw_doctor_already_active on concurrent start.
- Env parity with the live gateway: spawn inherits process.env, so doctor sees the same decrypted secrets as 'openclaw gateway'.
- --fix defaults to true at every layer (controller, worker route, admin tRPC) matching the Fly-exec flow and the admin checkbox.

Compatibility:
- Server: DO uses isErrorUnknownRoute to detect older controllers and returns null, which the worker surfaces as 404 controller_route_unavailable.
- Client: admin UI uses calverAtLeast(version, '2026.4.22') to disable the new button with an 'Unavailable until redeploy' tooltip on stale controllers (matches supportsConfigRestore pattern).

Auth/ownership:
- Controller route uses timingSafeTokenEqual bearer middleware.
- Admin tRPC calls assertInstanceBelongsToUser so mismatched {userId, instanceId} pairs 404 rather than acting on the wrong instance.

Client-disconnect plumbing: handleHttpRequest now creates an AbortController wired to the node req's 'close' event (gated on !res.writableEnded) and passes it into the Request signal; the doctor route observes c.req.raw.signal to SIGTERM the child if the caller drops. Inert for existing routes since none read the signal.

The existing Fly-exec /api/platform/doctor + 'Run Doctor' button are intentionally kept alive so both paths run side-by-side during the migration. Deprecation of the Fly-exec path is a follow-up PR.

Dockerfile has no structural change; CI must pass --build-arg CONTROLLER_CACHE_BUST=$(date +%s) for the new controller code to land in the image.

* fix(kiloclaw): listen on res not req for client-disconnect abort

ServerResponse's 'close' event is the correct signal for client disconnects. IncomingMessage's 'close' event fires as soon as the request body stream is fully consumed, which, when we pass 'req' as init.body for POST requests, happens mid-handler inside Hono's c.req.json() call. That falsely tripped the AbortController before the response was sent, and the doctor route interpreted the signal as a real client disconnect and SIGTERMed the child.

Locally reproduced: POST /_kilo/doctor/run returned 'status: cancelled' with '[cancelled by client disconnect]' in the output, even though the caller was still waiting.

Switching to res.on('close') + !res.writableEnded gives us the legitimate 'connection terminated before response completed' case. Verified via smoke container: openclaw doctor now runs to completion and returns 'status: completed, exitCode: 0' with full output.

* test(kiloclaw): fix doctor controller expectation
* refactor(kiloclaw): make controller doctor runs async
* fix(kiloclaw): refresh doctor status after start
* fix(kiloclaw): avoid duplicate doctor log reads
* fix(kiloclaw): gate controller doctor on async route version
* fix(kiloclaw): show active doctor run on conflict
* fix(kiloclaw): harden controller doctor availability
* fix(kiloclaw): prevent overlapping doctor runs
1 parent 5861891 commit 14c6afb
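
The commit message describes the timeout and disconnect handling as SIGTERM followed by SIGKILL after a 5s grace period. A minimal sketch of that escalation is shown here for orientation only. The `KillableChild` interface, the `terminateWithGrace` name, and the injectable `schedule` parameter are all assumptions made for illustration and are not taken from the controller source.

```typescript
// Hypothetical sketch; none of these names come from the controller source.
type TerminationSignal = 'SIGTERM' | 'SIGKILL';

interface KillableChild {
  // True once the process has exited (assumed to be maintained by the caller).
  exited: boolean;
  kill(signal: TerminationSignal): void;
}

// Injectable scheduler so the escalation can be exercised without real timers.
type Schedule = (fn: () => void, ms: number) => unknown;

const GRACE_MS = 5_000; // 5s grace, as stated in the commit message

function terminateWithGrace(
  child: KillableChild,
  graceMs: number = GRACE_MS,
  schedule: Schedule = (fn, ms) => setTimeout(fn, ms)
): void {
  if (child.exited) return;
  // Ask politely first so the child can clean up.
  child.kill('SIGTERM');
  // Force-kill only if the child is still alive after the grace period.
  schedule(() => {
    if (!child.exited) child.kill('SIGKILL');
  }, graceMs);
}
```

A real implementation would likely also clear the pending timer once the child exits, so the process is not kept alive for the full grace period.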

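The second commit in the message explains why the abort must hang off the response's 'close' event, gated on `!res.writableEnded`, rather than the request's. The sketch below illustrates that gating under stated assumptions: `ClosableResponse` is a hypothetical structural stand-in for the parts of Node's `ServerResponse` that matter here, and `abortSignalForResponse` is an illustrative name, not the function in `handleHttpRequest`.

```typescript
// Hypothetical sketch; `ClosableResponse` and `abortSignalForResponse` are illustrative names.
interface ClosableResponse {
  // Mirrors ServerResponse.writableEnded: true once end() has been called.
  readonly writableEnded: boolean;
  once(event: 'close', listener: () => void): unknown;
}

function abortSignalForResponse(res: ClosableResponse): AbortSignal {
  const controller = new AbortController();
  res.once('close', () => {
    // 'close' also fires after a normal response; only treat it as a
    // client disconnect when the response was never finished.
    if (!res.writableEnded) controller.abort();
  });
  return controller.signal;
}
```

A handler that receives this signal can then stop its child process when `signal.aborted` becomes true, which is the case the commit message calls "connection terminated before response completed".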
17 files changed

Lines changed: 1971 additions & 12 deletions

apps/web/src/app/admin/components/KiloclawInstances/KiloclawInstanceDetail.tsx

Lines changed: 281 additions & 0 deletions
@@ -35,6 +35,9 @@ import {
 } from '@/components/ui/select';
 import { Input } from '@/components/ui/input';
 import { Textarea } from '@/components/ui/textarea';
+import { Checkbox } from '@/components/ui/checkbox';
+import { Label } from '@/components/ui/label';
+import type { DoctorControllerStatus, DoctorControllerStatusResponse } from '@/lib/kiloclaw/types';
 import {
   User,
   Calendar,
@@ -1301,6 +1304,8 @@ export function KiloclawInstanceDetail({ instanceId }: { instanceId: string }) {
   const queryClient = useQueryClient();
   const [destroyDialogOpen, setDestroyDialogOpen] = useState(false);
   const [doctorDialogOpen, setDoctorDialogOpen] = useState(false);
+  const [doctorControllerDialogOpen, setDoctorControllerDialogOpen] = useState(false);
+  const [doctorControllerFix, setDoctorControllerFix] = useState(true);
   const [restoreConfigDialogOpen, setRestoreConfigDialogOpen] = useState(false);
   const [destroyMachineDialogOpen, setDestroyMachineDialogOpen] = useState(false);
   const [resizeMachineDialogOpen, setResizeMachineDialogOpen] = useState(false);
@@ -1482,6 +1487,15 @@ export function KiloclawInstanceDetail({ instanceId }: { instanceId: string }) {
     cleanVersion(controllerVersion?.version),
     '2026.2.26'
   );
+  // /_kilo/doctor/start|status|cancel is expected to land after 14:00 CDT on
+  // 2026-05-08 (19:00 UTC). Older same-day controllers may report only 2026.5.8,
+  // which compares as 2026.5.8.0 and must remain unsupported. Such
+  // controllers fall through to the catch-all proxy and return 404, so
+  // disable the button with a tooltip until they redeploy.
+  const supportsDoctorController = calverAtLeast(
+    cleanVersion(controllerVersion?.version),
+    '2026.5.8.1900'
+  );
 
   // After a restart/upgrade, poll the machine status until it returns to "running",
   // then invalidate controllerVersion so supportsConfigRestore reflects the new build.
@@ -2011,6 +2025,54 @@ export function KiloclawInstanceDetail({ instanceId }: { instanceId: string }) {
     })
   );
 
+  const startDoctorControllerMutation = useMutation(
+    trpc.admin.kiloclawInstances.startDoctorViaController.mutationOptions({
+      onSuccess: async (_result, variables) => {
+        await queryClient.invalidateQueries({
+          queryKey: trpc.admin.kiloclawInstances.doctorViaControllerStatus.queryKey({
+            userId: variables.userId,
+            instanceId: variables.instanceId,
+          }),
+        });
+        setDoctorControllerDialogOpen(true);
+      },
+      onError: (err, variables) => {
+        if (
+          err instanceof TRPCClientError &&
+          err.data?.code === 'CONFLICT' &&
+          err.message.includes('already in progress')
+        ) {
+          setDoctorControllerDialogOpen(true);
+          void queryClient.invalidateQueries({
+            queryKey: trpc.admin.kiloclawInstances.doctorViaControllerStatus.queryKey({
+              userId: variables.userId,
+              instanceId: variables.instanceId,
+            }),
+          });
+          return;
+        }
+        toast.error(`Failed to start doctor (controller): ${err.message}`);
+      },
+    })
+  );
+
+  const cancelDoctorControllerMutation = useMutation(
+    trpc.admin.kiloclawInstances.cancelDoctorViaController.mutationOptions({
+      onError: err => {
+        toast.error(`Failed to cancel doctor (controller): ${err.message}`);
+      },
+    })
+  );
+
+  const { data: doctorControllerStatus, isError: doctorControllerStatusError } = useQuery({
+    ...trpc.admin.kiloclawInstances.doctorViaControllerStatus.queryOptions({
+      userId: data?.user_id ?? '',
+      instanceId: data?.id,
+    }),
+    enabled: doctorControllerDialogOpen && supportsDoctorController && !!data?.user_id,
+    refetchInterval: query => (query.state.data?.status === 'running' ? 1000 : false),
+  });
+
   const restoreConfigMutation = useMutation(
     trpc.admin.kiloclawInstances.restoreConfig.mutationOptions({
       onSuccess: data => {
@@ -2145,6 +2207,8 @@ export function KiloclawInstanceDetail({ instanceId }: { instanceId: string }) {
     isGatewayStopping ||
     isGatewayRestarting ||
     runDoctorMutation.isPending ||
+    startDoctorControllerMutation.isPending ||
+    cancelDoctorControllerMutation.isPending ||
     restoreConfigMutation.isPending;
 
   return (
@@ -4173,6 +4237,30 @@ export function KiloclawInstanceDetail({ instanceId }: { instanceId: string }) {
                 <Stethoscope className="mr-1 h-4 w-4" />
                 Run Doctor
               </Button>
+              <Tooltip>
+                <TooltipTrigger asChild>
+                  <span>
+                    <Button
+                      size="sm"
+                      variant="outline"
+                      disabled={!supportsDoctorController || gatewayActionPending}
+                      onClick={() => {
+                        startDoctorControllerMutation.mutate({
+                          userId: data.user_id,
+                          instanceId: data.id,
+                          fix: doctorControllerFix,
+                        });
+                      }}
+                    >
+                      <Stethoscope className="mr-1 h-4 w-4" />
+                      Run Doctor (Controller)
+                    </Button>
+                  </span>
+                </TooltipTrigger>
+                {!supportsDoctorController && (
+                  <TooltipContent>Unavailable until redeploy</TooltipContent>
+                )}
+              </Tooltip>
               <Tooltip>
                 <TooltipTrigger asChild>
                   <span>
@@ -4590,6 +4678,31 @@ export function KiloclawInstanceDetail({ instanceId }: { instanceId: string }) {
         mutation={runDoctorMutation}
       />
 
+      {/* Run Doctor (Controller) Dialog */}
+      <RunDoctorControllerDialog
+        open={doctorControllerDialogOpen && supportsDoctorController}
+        onOpenChange={setDoctorControllerDialogOpen}
+        fix={doctorControllerFix}
+        onFixChange={setDoctorControllerFix}
+        status={doctorControllerStatus}
+        statusError={doctorControllerStatusError}
+        starting={startDoctorControllerMutation.isPending}
+        cancelling={cancelDoctorControllerMutation.isPending}
+        onCancel={() => {
+          cancelDoctorControllerMutation.mutate({
+            userId: data.user_id,
+            instanceId: data.id,
+          });
+        }}
+        onRerun={() => {
+          startDoctorControllerMutation.mutate({
+            userId: data.user_id,
+            instanceId: data.id,
+            fix: doctorControllerFix,
+          });
+        }}
+      />
+
       {/* Restore Default Config Confirmation Dialog */}
       <Dialog
         open={restoreConfigDialogOpen && supportsConfigRestore}
@@ -4737,3 +4850,171 @@ function RunDoctorDialog({
     </Dialog>
   );
 }
+
+function formatRunDuration(startedAt: string | null, completedAt: string | null): string {
+  if (!startedAt || !completedAt) return '–';
+  const start = new Date(startedAt).getTime();
+  const end = new Date(completedAt).getTime();
+  if (!Number.isFinite(start) || !Number.isFinite(end) || end < start) return '–';
+  const ms = end - start;
+  if (ms < 1000) return `${ms}ms`;
+  return `${(ms / 1000).toFixed(1)}s`;
+}
+
+function doctorStatusLabel(status: DoctorControllerStatus | null): string {
+  switch (status) {
+    case 'running':
+      return 'Running';
+    case 'completed':
+      return 'Completed successfully';
+    case 'failed':
+      return 'Completed with issues';
+    case 'cancelled':
+      return 'Cancelled';
+    case 'timed_out':
+      return 'Timed out after 120s';
+    case null:
+      return 'No run yet';
+  }
+}
+
+function RunDoctorControllerDialog({
+  open,
+  onOpenChange,
+  fix,
+  onFixChange,
+  status,
+  statusError,
+  starting,
+  cancelling,
+  onCancel,
+  onRerun,
+}: {
+  open: boolean;
+  onOpenChange: (open: boolean) => void;
+  fix: boolean;
+  onFixChange: (next: boolean) => void;
+  status: DoctorControllerStatusResponse | undefined;
+  statusError: boolean;
+  starting: boolean;
+  cancelling: boolean;
+  onCancel: () => void;
+  onRerun: () => void;
+}) {
+  const isRunning = status?.status === 'running' || starting;
+  const handleOpenChange = (nextOpen: boolean) => {
+    onOpenChange(nextOpen);
+  };
+
+  const result = status?.hasRun ? { ...status, output: stripAnsi(status.output ?? '') } : null;
+
+  return (
+    <Dialog open={open} onOpenChange={handleOpenChange}>
+      <DialogContent className="sm:max-w-[750px]">
+        <DialogHeader>
+          <DialogTitle>OpenClaw Doctor (via Controller)</DialogTitle>
+          <DialogDescription>
+            Runs <code>openclaw doctor</code> inside the machine via the controller HTTP API. Output
+            is persisted on the instance and can be retrieved while the run continues.
+          </DialogDescription>
+        </DialogHeader>
+
+        <div className="flex items-center gap-2">
+          <Checkbox
+            id="doctor-controller-fix"
+            checked={fix}
+            onCheckedChange={onFixChange}
+            disabled={isRunning}
+          />
+          <Label htmlFor="doctor-controller-fix" className="text-sm">
+            Pass <code>--fix</code>
+          </Label>
+        </div>
+
+        {starting && !result && (
+          <div className="flex flex-col items-center justify-center gap-3 py-12">
+            <Loader2 className="text-muted-foreground h-8 w-8 animate-spin" />
+            <p className="text-muted-foreground text-sm">
+              Starting <code>openclaw doctor{fix ? ' --fix' : ''}</code>
+            </p>
+          </div>
+        )}
+
+        {statusError && !result && !starting && (
+          <div className="flex flex-col items-center justify-center gap-3 py-12">
+            <XCircle className="h-8 w-8 text-red-400" />
+            <p className="text-sm text-red-400">Failed to fetch doctor status (controller)</p>
+          </div>
+        )}
+
+        {!result && !starting && !statusError && (
+          <div className="text-muted-foreground flex flex-col items-center justify-center gap-3 py-12 text-sm">
+            No controller doctor run has been recorded yet.
+          </div>
+        )}
+
+        {result && (
+          <div className="space-y-3">
+            <div className="flex flex-wrap items-center gap-2">
+              {result.status === 'running' ? (
+                <Loader2 className="text-muted-foreground h-4 w-4 animate-spin" />
+              ) : result.status === 'completed' ? (
+                <CheckCircle2 className="h-4 w-4 text-emerald-400" />
+              ) : (
+                <XCircle className="h-4 w-4 text-red-400" />
+              )}
+              <span className="text-sm font-medium">{doctorStatusLabel(result.status)}</span>
+              <Badge variant="outline" className="text-xs">
+                exit {result.exitCode ?? 'n/a'}
+              </Badge>
+              <Badge variant="outline" className="text-xs">
+                {formatRunDuration(result.startedAt, result.completedAt)}
+              </Badge>
+              <Badge variant="outline" className="text-xs">
+                {result.fix ? '--fix' : 'no --fix'}
+              </Badge>
+              {result.outputTruncated && (
+                <Badge variant="outline" className="border-yellow-500/30 text-xs text-yellow-400">
+                  output truncated
+                </Badge>
+              )}
+              {result.timedOut && (
+                <Badge variant="outline" className="border-yellow-500/30 text-xs text-yellow-400">
+                  timed out
+                </Badge>
+              )}
+            </div>
+            <div className="border-border bg-background max-h-[400px] overflow-auto rounded-md border">
+              {/* prettier-ignore */}
+              <pre
+                className="p-3 text-xs leading-relaxed whitespace-pre"
+                style={{ fontFamily: "'Courier New', Courier, monospace", tabSize: 8 }}
+              >{result.output}</pre>
+            </div>
+          </div>
+        )}
+
+        <DialogFooter className="gap-2 sm:gap-0">
+          <Button variant="outline" onClick={() => handleOpenChange(false)}>
+            Close
+          </Button>
+          {result?.status === 'running' && (
+            <Button variant="destructive" onClick={onCancel} disabled={cancelling}>
+              {cancelling ? <Loader2 className="mr-1 h-4 w-4 animate-spin" /> : null}
+              Cancel
+            </Button>
+          )}
+          <Button
+            variant="default"
+            onClick={onRerun}
+            disabled={isRunning || cancelling}
+            title="Re-run with the current --fix setting"
+          >
+            <Stethoscope className="mr-1 h-4 w-4" />
+            Re-run
+          </Button>
+        </DialogFooter>
+      </DialogContent>
+    </Dialog>
+  );
+}

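The version gate in this file relies on a missing calver segment comparing as zero, so a controller reporting `2026.5.8` stays below the `2026.5.8.1900` threshold. The project's own `calverAtLeast` and `cleanVersion` helpers are not part of this diff, so the following is only a hypothetical sketch of that comparison rule. The name `calverAtLeastSketch`, the dot-separated numeric format, and the choice to treat missing or unparseable versions as unsupported are all assumptions.

```typescript
// Hypothetical sketch; the real calverAtLeast helper is not shown in this diff.
function parseCalver(version: string): number[] | null {
  const parts = version.trim().split('.');
  const nums = parts.map(p => (/^\d+$/.test(p) ? Number(p) : NaN));
  // Assumption: any non-numeric segment makes the whole version unparseable.
  return nums.some(n => Number.isNaN(n)) ? null : nums;
}

function calverAtLeastSketch(version: string | null | undefined, minimum: string): boolean {
  if (!version) return false; // unknown version: treat as unsupported
  const have = parseCalver(version);
  const want = parseCalver(minimum);
  if (!have || !want) return false;
  const length = Math.max(have.length, want.length);
  for (let i = 0; i < length; i++) {
    // Missing segments compare as 0, so 2026.5.8 is read as 2026.5.8.0.
    const a = have[i] ?? 0;
    const b = want[i] ?? 0;
    if (a !== b) return a > b;
  }
  return true; // all segments equal
}
```

Under these assumptions, `calverAtLeastSketch('2026.5.8', '2026.5.8.1900')` is false while `calverAtLeastSketch('2026.5.8.1900', '2026.5.8.1900')` is true, which matches the intent stated in the code comment.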
apps/web/src/lib/kiloclaw/kiloclaw-internal-client.ts

Lines changed: 45 additions & 0 deletions
@@ -28,6 +28,9 @@ import type {
   DevicePairingApproveResponse,
   VolumeSnapshotsResponse,
   DoctorResponse,
+  DoctorControllerStartResponse,
+  DoctorControllerStatusResponse,
+  DoctorControllerCancelResponse,
   OpenclawWorkspaceImportResponse,
   KiloCliRunStartResponse,
   KiloCliRunStatusResponse,
@@ -625,6 +628,48 @@ export class KiloClawInternalClient {
     );
   }
 
+  async startDoctorViaController(
+    userId: string,
+    fix: boolean,
+    instanceId?: string
+  ): Promise<DoctorControllerStartResponse> {
+    const params = instanceId ? `?instanceId=${encodeURIComponent(instanceId)}` : '';
+    return this.request(
+      `/api/platform/doctor-controller/start${params}`,
+      {
+        method: 'POST',
+        body: JSON.stringify({ userId, fix }),
+      },
+      { userId }
+    );
+  }
+
+  async getDoctorViaControllerStatus(
+    userId: string,
+    instanceId?: string
+  ): Promise<DoctorControllerStatusResponse> {
+    const params = new URLSearchParams({ userId });
+    if (instanceId) params.set('instanceId', instanceId);
+    return this.request(`/api/platform/doctor-controller/status?${params.toString()}`, undefined, {
+      userId,
+    });
+  }
+
+  async cancelDoctorViaController(
+    userId: string,
+    instanceId?: string
+  ): Promise<DoctorControllerCancelResponse> {
+    const params = instanceId ? `?instanceId=${encodeURIComponent(instanceId)}` : '';
+    return this.request(
+      `/api/platform/doctor-controller/cancel${params}`,
+      {
+        method: 'POST',
+        body: JSON.stringify({ userId }),
+      },
+      { userId }
+    );
+  }
+
   async startKiloCliRun(
     userId: string,
     prompt: string,

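The three client methods above build their request paths in two ways: `start` and `cancel` send `userId` in the JSON body and append only an optional `instanceId` via `encodeURIComponent`, while `status` is a GET that carries both values in a `URLSearchParams` query. The sketch below reproduces just that path construction so the difference is easy to see. `doctorControllerPath` and `DoctorControllerAction` are hypothetical names for illustration and do not exist in the client.

```typescript
// Hypothetical helper; it mirrors only the path-building logic shown in the diff.
type DoctorControllerAction = 'start' | 'status' | 'cancel';

function doctorControllerPath(
  action: DoctorControllerAction,
  userId: string,
  instanceId?: string
): string {
  const base = `/api/platform/doctor-controller/${action}`;
  if (action === 'status') {
    // GET: userId and the optional instanceId both travel in the query string.
    const params = new URLSearchParams({ userId });
    if (instanceId) params.set('instanceId', instanceId);
    return `${base}?${params.toString()}`;
  }
  // POST: userId goes in the JSON body, so only instanceId appears in the URL.
  const params = instanceId ? `?instanceId=${encodeURIComponent(instanceId)}` : '';
  return `${base}${params}`;
}
```

One detail worth keeping in mind: `encodeURIComponent` encodes a space as `%20`, whereas `URLSearchParams` serializes it as `+`. Standard query-string parsers decode both to a space, so the two styles can coexist.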