Skip to content

feat(kiloclaw): add openclaw doctor controller API route#2707

Open
pandemicsyn wants to merge 10 commits intomainfrom
florian/chore/controller-doctor
Open

feat(kiloclaw): add openclaw doctor controller API route#2707
pandemicsyn wants to merge 10 commits intomainfrom
florian/chore/controller-doctor

Conversation

@pandemicsyn
Copy link
Copy Markdown
Contributor

@pandemicsyn pandemicsyn commented Apr 22, 2026

Summary

  • Adds a controller-local doctor runner that starts openclaw doctor [--fix] --non-interactive inside the machine via the controller process, avoiding the Fly Machines exec API timeout while keeping the legacy Fly-exec doctor path available.
  • Wires the async runner end-to-end through KiloClawInstance start/status/cancel RPC methods, /api/platform/doctor-controller/{start,status,cancel}, the internal client, admin tRPC, and the admin instance detail "Run Doctor (Controller)" dialog.
  • Stores the current/last doctor run metadata and log on the instance volume under /root/.openclaw/.kiloclaw/doctor-runs/, so admins can close/reopen the dialog without cancelling or losing output.
  • Refreshes the branch onto current main, preserving newer Kilo Chat/Morning Briefing controller routes, and changes controller build CalVer to UTC date+time granularity with optional fourth-segment comparison support.

Verification

N/A (no manual verification performed after the rebase).

Visual Changes

Admin instance detail now shows a second doctor action, "Run Doctor (Controller)", next to the existing Fly-exec "Run Doctor" button. The controller path opens a dialog with a --fix toggle, status/exit/duration badges, persisted output display, explicit cancel while running, and a re-run action. Screenshots not captured.

Reviewer Notes

  • The existing /api/platform/doctor endpoint and original "Run Doctor" button are intentionally unchanged so both paths can run side-by-side during migration.
  • Controller doctor is async: start returns immediately, the admin UI polls status, and closing the dialog does not cancel the run. Cancellation is explicit via /cancel.
  • The controller keeps only current/last run state, not multi-run history. Output is capped around 1 MiB and stored tail-preserving with a truncation marker.
  • UI gating still uses controller CalVer for this PR, but controller builds now emit YYYY.M.D.HHmm so same-day controller changes can be distinguished. Explicit controller capability hints are intentionally deferred to a follow-up.
  • Do not use controllerCapabilitiesVersion for this feature: that DO field tracks the worker→controller config contract for named routing/origin allowlist compatibility, not controller binary route availability.

Adds POST /_kilo/doctor/run to the machine-side controller as a
replacement for the 60s-capped Fly Machines exec API. Wired end-to-end
through a new DO RPC, a worker platform route
(/api/platform/doctor-controller), the internal client, a new admin
tRPC procedure, and a new 'Run Doctor (Controller)' admin button.

Behavior:
- Synchronous buffered response with a 120s hard cap; SIGTERM then
  SIGKILL after a 5s grace on timeout or client disconnect.
- Single run at a time, serialized via a runStartExclusive queue;
  409 openclaw_doctor_already_active on concurrent start.
- Env parity with the live gateway: spawn inherits process.env, so
  doctor sees the same decrypted secrets as 'openclaw gateway'.
- --fix defaults to true at every layer (controller, worker route,
  admin tRPC) matching the Fly-exec flow and the admin checkbox.

Compatibility:
- Server: DO uses isErrorUnknownRoute to detect older controllers and
  returns null, which the worker surfaces as 404
  controller_route_unavailable.
- Client: admin UI uses calverAtLeast(version, '2026.4.22') to disable
  the new button with an 'Unavailable until redeploy' tooltip on
  stale controllers (matches supportsConfigRestore pattern).

Auth/ownership:
- Controller route uses timingSafeTokenEqual bearer middleware.
- Admin tRPC calls assertInstanceBelongsToUser so mismatched
  {userId, instanceId} pairs 404 rather than acting on the wrong
  instance.

Client-disconnect plumbing: handleHttpRequest now creates an
AbortController wired to the node req's 'close' event (gated on
!res.writableEnded) and passes it into the Request signal; the doctor
route observes c.req.raw.signal to SIGTERM the child if the caller
drops. Inert for existing routes since none read the signal.

The existing Fly-exec /api/platform/doctor + 'Run Doctor' button are
intentionally kept alive so both paths run side-by-side during the
migration. Deprecation of the Fly-exec path is a follow-up PR.

Dockerfile has no structural change; CI must pass
--build-arg CONTROLLER_CACHE_BUST=$(date +%s) for the new controller
code to land in the image.
ServerResponse's 'close' event is the correct signal for client
disconnects. IncomingMessage's 'close' event fires as soon as the
request body stream is fully consumed — which, when we pass 'req' as
init.body for POST requests, happens mid-handler inside Hono's
c.req.json() call. That falsely tripped the AbortController before
the response was sent, and the doctor route interpreted the signal
as a real client disconnect and SIGTERMed the child.

Locally reproduced: POST /_kilo/doctor/run returned
'status: cancelled' with '[cancelled by client disconnect]' in the
output, even though the caller was still waiting.

Switching to res.on('close') + !res.writableEnded gives us the
legitimate 'connection terminated before response completed' case.
Verified via smoke container: openclaw doctor now runs to completion
and returns 'status: completed, exitCode: 0' with full output.
@pandemicsyn pandemicsyn marked this pull request as ready for review April 22, 2026 21:30
…ler-doctor

# Conflicts:
#	apps/web/src/lib/kiloclaw/kiloclaw-internal-client.ts
#	apps/web/src/lib/kiloclaw/types.ts
Copy link
Copy Markdown
Contributor

@RSO RSO left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to figure out why you don't have a code review on this PR here

Comment thread services/kiloclaw/controller/src/routes/doctor.ts Outdated
Comment thread services/kiloclaw/controller/src/routes/doctor.ts Outdated
@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot Bot commented Apr 29, 2026

Code Review Summary

Status: 2 Issues Found | Recommendation: Address before merge

Fix these issues in Kilo Cloud

Overview

Severity Count
CRITICAL 0
WARNING 2
SUGGESTION 0
Issue Details (click to expand)

WARNING

File Line Issue
services/kiloclaw/controller/src/routes/doctor.ts 365 Invalid or truncated JSON still falls through to fix: true, running the mutating doctor repair path instead of returning 400.
services/kiloclaw/controller/src/routes/doctor.ts 458 Timed-out runs are still marked non-running before the child exits, allowing a second doctor process during SIGKILL grace.
Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

None.

Resolved since earlier reviews:

File Line Issue
apps/web/src/app/admin/components/KiloclawInstances/KiloclawInstanceDetail.tsx 1493 Controller doctor UI now gates on the async route version (2026.5.8).
apps/web/src/app/admin/components/KiloclawInstances/KiloclawInstanceDetail.tsx 2027 Start success now invalidates the status query; 409 conflict also opens the dialog and refreshes status instead of showing an error toast.
services/kiloclaw/controller/src/routes/doctor.ts 172 Completed doctor runs no longer retain the child process/output in module state; output is persisted and activeRun is cleared on normal completion.
Files Reviewed (1 file)
  • apps/web/src/app/admin/components/KiloclawInstances/KiloclawInstanceDetail.tsx - 0 new issues

Reviewed by claude-4.6-sonnet-20260217 · 1,193,445 tokens

Comment thread services/kiloclaw/controller/src/routes/doctor.ts
Copy link
Copy Markdown
Contributor

@jeanduplessis jeanduplessis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. I found 3 warnings worth interrogating if you agree with them before/after merge: (1) controller doctor exclusivity is released before cancelled/timed-out children exit, (2) invalid JSON can default into --fix, and (3) the UI feature gate may be too broad for same-day controller builds. I left line notes with trace and impact.

clearTimeout(run.metadataFlushTimer);
run.metadataFlushTimer = null;
}
activeRun = null;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: Cancelled or timed-out runs can overlap with a new doctor process

finalizeRun clears activeRun immediately for timed_out/cancelled, but those paths only sent SIGTERM and scheduled SIGKILL; the child can still be alive until the grace timer or close. Since the next /start gates only on activeRun, an immediate retry after timeout/cancel can spawn another openclaw doctor while the previous one is still running.

Trace:

  1. timeout/cancel call scheduleSigkill(run) and then finalizeRun(...).
  2. finalizeRun sets activeRun = null before child close.
  3. A subsequent start can pass the if (activeRun) gate, truncate current.log, and spawn a new doctor while the old process may still mutate config or append output.

Impact: Two openclaw doctor --fix processes can overlap during SIGTERM grace, risking concurrent state/config mutation and mixed persisted output. Keep the run exclusive until the child terminates, or gate starts on a still-terminating run.

try {
body = await c.req.json();
} catch {
body = {};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: Malformed JSON defaults to a mutating doctor run

Any c.req.json() failure is treated as {}, then fix defaults to true, so malformed or truncated authenticated JSON starts openclaw doctor --fix instead of returning 400.

Trace:

  1. The catch path assigns body = {} for all JSON parse failures.
  2. {} passes DoctorRunBodySchema.
  3. parsed.data.fix ?? true selects the mutating --fix path.

Impact: A bad internal request body can accidentally trigger a repair run. Please distinguish an actually empty body from invalid JSON, or require valid JSON whenever a body is present.

// disable the button with a tooltip until they redeploy.
const supportsDoctorController = calverAtLeast(
cleanVersion(controllerVersion?.version),
'2026.5.8'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARNING: Same-day stale controllers are incorrectly treated as supporting async doctor routes

The feature gate uses calverAtLeast(..., '2026.5.8'), but this PR also changes controller builds to YYYY.M.D.HHmm specifically to distinguish same-day controller changes. Because the parser treats a missing fourth segment as 0, any 2026.5.8 controller satisfies this gate even if it was built before /_kilo/doctor/start|status|cancel existed.

Trace:

  1. The UI gate is date-only: 2026.5.8.
  2. parseCalver now treats missing time as 0.
  3. Same-day controller builds before this route can still report 2026.5.8 and pass the gate.

Impact: Admins can see and click “Run Doctor (Controller)” on stale same-day images, then hit route-unavailable failures instead of the button staying disabled until redeploy. Use the exact build-time minimum that introduced the async routes, or gate on an explicit capability when available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants