feat(kiloclaw): add openclaw doctor controller API route#2707
feat(kiloclaw): add openclaw doctor controller API route#2707pandemicsyn wants to merge 10 commits intomainfrom
Conversation
Adds POST /_kilo/doctor/run to the machine-side controller as a
replacement for the 60s-capped Fly Machines exec API. Wired end-to-end
through a new DO RPC, a worker platform route
(/api/platform/doctor-controller), the internal client, a new admin
tRPC procedure, and a new 'Run Doctor (Controller)' admin button.
Behavior:
- Synchronous buffered response with a 120s hard cap; SIGTERM then
SIGKILL after a 5s grace on timeout or client disconnect.
- Single run at a time, serialized via a runStartExclusive queue;
409 openclaw_doctor_already_active on concurrent start.
- Env parity with the live gateway: spawn inherits process.env, so
doctor sees the same decrypted secrets as 'openclaw gateway'.
- --fix defaults to true at every layer (controller, worker route,
admin tRPC) matching the Fly-exec flow and the admin checkbox.
Compatibility:
- Server: DO uses isErrorUnknownRoute to detect older controllers and
returns null, which the worker surfaces as 404
controller_route_unavailable.
- Client: admin UI uses calverAtLeast(version, '2026.4.22') to disable
the new button with an 'Unavailable until redeploy' tooltip on
stale controllers (matches supportsConfigRestore pattern).
Auth/ownership:
- Controller route uses timingSafeTokenEqual bearer middleware.
- Admin tRPC calls assertInstanceBelongsToUser so mismatched
{userId, instanceId} pairs 404 rather than acting on the wrong
instance.
Client-disconnect plumbing: handleHttpRequest now creates an
AbortController wired to the node req's 'close' event (gated on
!res.writableEnded) and passes it into the Request signal; the doctor
route observes c.req.raw.signal to SIGTERM the child if the caller
drops. Inert for existing routes since none read the signal.
The existing Fly-exec /api/platform/doctor + 'Run Doctor' button are
intentionally kept alive so both paths run side-by-side during the
migration. Deprecation of the Fly-exec path is a follow-up PR.
Dockerfile has no structural change; CI must pass
--build-arg CONTROLLER_CACHE_BUST=$(date +%s) for the new controller
code to land in the image.
ServerResponse's 'close' event is the correct signal for client
disconnects. IncomingMessage's 'close' event fires as soon as the
request body stream is fully consumed — which, when we pass 'req' as
init.body for POST requests, happens mid-handler inside Hono's
c.req.json() call. That falsely tripped the AbortController before
the response was sent, and the doctor route interpreted the signal
as a real client disconnect and SIGTERMed the child.
Locally reproduced: POST /_kilo/doctor/run returned
'status: cancelled' with '[cancelled by client disconnect]' in the
output, even though the caller was still waiting.
Switching to res.on('close') + !res.writableEnded gives us the
legitimate 'connection terminated before response completed' case.
Verified via smoke container: openclaw doctor now runs to completion
and returns 'status: completed, exitCode: 0' with full output.
…ler-doctor # Conflicts: # apps/web/src/lib/kiloclaw/kiloclaw-internal-client.ts # apps/web/src/lib/kiloclaw/types.ts
Code Review SummaryStatus: 2 Issues Found | Recommendation: Address before merge Fix these issues in Kilo Cloud Overview
Issue Details (click to expand)WARNING
Other Observations (not in diff)Issues found in unchanged code that cannot receive inline comments: None. Resolved since earlier reviews:
Files Reviewed (1 file)
Reviewed by claude-4.6-sonnet-20260217 · 1,193,445 tokens |
jeanduplessis
left a comment
There was a problem hiding this comment.
Approved. I found 3 warnings worth interrogating if you agree with them before/after merge: (1) controller doctor exclusivity is released before cancelled/timed-out children exit, (2) invalid JSON can default into --fix, and (3) the UI feature gate may be too broad for same-day controller builds. I left line notes with trace and impact.
| clearTimeout(run.metadataFlushTimer); | ||
| run.metadataFlushTimer = null; | ||
| } | ||
| activeRun = null; |
There was a problem hiding this comment.
WARNING: Cancelled or timed-out runs can overlap with a new doctor process
finalizeRun clears activeRun immediately for timed_out/cancelled, but those paths only sent SIGTERM and scheduled SIGKILL; the child can still be alive until the grace timer or close. Since the next /start gates only on activeRun, an immediate retry after timeout/cancel can spawn another openclaw doctor while the previous one is still running.
Trace:
timeout/cancelcallscheduleSigkill(run)and thenfinalizeRun(...).finalizeRunsetsactiveRun = nullbefore childclose.- A subsequent start can pass the
if (activeRun)gate, truncatecurrent.log, and spawn a new doctor while the old process may still mutate config or append output.
Impact: Two openclaw doctor --fix processes can overlap during SIGTERM grace, risking concurrent state/config mutation and mixed persisted output. Keep the run exclusive until the child terminates, or gate starts on a still-terminating run.
| try { | ||
| body = await c.req.json(); | ||
| } catch { | ||
| body = {}; |
There was a problem hiding this comment.
WARNING: Malformed JSON defaults to a mutating doctor run
Any c.req.json() failure is treated as {}, then fix defaults to true, so malformed or truncated authenticated JSON starts openclaw doctor --fix instead of returning 400.
Trace:
- The catch path assigns
body = {}for all JSON parse failures. {}passesDoctorRunBodySchema.parsed.data.fix ?? trueselects the mutating--fixpath.
Impact: A bad internal request body can accidentally trigger a repair run. Please distinguish an actually empty body from invalid JSON, or require valid JSON whenever a body is present.
| // disable the button with a tooltip until they redeploy. | ||
| const supportsDoctorController = calverAtLeast( | ||
| cleanVersion(controllerVersion?.version), | ||
| '2026.5.8' |
There was a problem hiding this comment.
WARNING: Same-day stale controllers are incorrectly treated as supporting async doctor routes
The feature gate uses calverAtLeast(..., '2026.5.8'), but this PR also changes controller builds to YYYY.M.D.HHmm specifically to distinguish same-day controller changes. Because the parser treats a missing fourth segment as 0, any 2026.5.8 controller satisfies this gate even if it was built before /_kilo/doctor/start|status|cancel existed.
Trace:
- The UI gate is date-only:
2026.5.8. parseCalvernow treats missing time as0.- Same-day controller builds before this route can still report
2026.5.8and pass the gate.
Impact: Admins can see and click “Run Doctor (Controller)” on stale same-day images, then hit route-unavailable failures instead of the button staying disabled until redeploy. Use the exact build-time minimum that introduced the async routes, or gate on an explicit capability when available.
Summary
openclaw doctor [--fix] --non-interactiveinside the machine via the controller process, avoiding the Fly Machines exec API timeout while keeping the legacy Fly-exec doctor path available.KiloClawInstancestart/status/cancel RPC methods,/api/platform/doctor-controller/{start,status,cancel}, the internal client, admin tRPC, and the admin instance detail "Run Doctor (Controller)" dialog./root/.openclaw/.kiloclaw/doctor-runs/, so admins can close/reopen the dialog without cancelling or losing output.main, preserving newer Kilo Chat/Morning Briefing controller routes, and changes controller build CalVer to UTC date+time granularity with optional fourth-segment comparison support.Verification
N/A (no manual verification performed after the rebase).
Visual Changes
Admin instance detail now shows a second doctor action, "Run Doctor (Controller)", next to the existing Fly-exec "Run Doctor" button. The controller path opens a dialog with a
--fixtoggle, status/exit/duration badges, persisted output display, explicit cancel while running, and a re-run action. Screenshots not captured.Reviewer Notes
/api/platform/doctorendpoint and original "Run Doctor" button are intentionally unchanged so both paths can run side-by-side during migration.startreturns immediately, the admin UI pollsstatus, and closing the dialog does not cancel the run. Cancellation is explicit via/cancel.YYYY.M.D.HHmmso same-day controller changes can be distinguished. Explicit controller capability hints are intentionally deferred to a follow-up.controllerCapabilitiesVersionfor this feature: that DO field tracks the worker→controller config contract for named routing/origin allowlist compatibility, not controller binary route availability.