feat(dashboard): SQLite workflowstore for ML jobs and OpenClaw; healt…#1656

Merged
rootfs merged 6 commits into vllm-project:main from mkoushni:feature/dashboard-workflow-state-durable on Apr 16, 2026

Conversation

@mkoushni
Contributor

Summary

Make dashboard workflow state restart-safe and server-owned for ML pipeline jobs and OpenClaw collaboration entities, instead of relying on in-memory maps, workspace-local JSON files, or browser-only state.

What changed

  • New workflowstore package (dashboard/backend/workflowstore/): a shared SQLite database (./data/workflow.sqlite by default, override via DASHBOARD_WORKFLOW_DB_PATH or --workflow-db flag) that owns durable state for ML pipeline jobs, typed progress events, and OpenClaw containers, teams, rooms, and room messages.

  • ML pipeline jobs (dashboard/backend/mlpipeline/runner.go):

    • Job metadata, status transitions, and progress are now persisted to ml_pipeline_jobs and ml_pipeline_progress_events tables instead of an in-memory map[string]*Job.
    • On startup, RecoverInterruptedMLJobs marks any running jobs as failed with a clear message (same pattern as the evaluation subsystem).
    • New API: GET /api/ml-pipeline/jobs/{id}/events returns typed durable progress history (not log-derived).
  • OpenClaw entities (dashboard/backend/handlers/openclaw*.go):

    • Container registry, teams, rooms, and room messages moved from ad hoc JSON files (containers.json, teams.json, rooms.json, room-messages/*.json) to SQLite tables.
    • Appending a room message is now a single O(1) INSERT per message instead of a rewrite of the entire JSON file.
    • Legacy JSON files are imported once on first startup when the database tables are empty; existing deployments migrate transparently.
    • Live SSE and WebSocket client registries remain in-memory (ephemeral by design).
  • Workflow health API: GET /api/workflows/health returns a typed JSON snapshot with store connectivity status, ML job counts, and OpenClaw entity counts — no log scraping.

  • Frontend annotation: useConversationStorage.ts now documents that browser localStorage chat history is demo/playground-only; OpenClaw room APIs are the supported server-owned collaboration history path.

  • State inventory update: docs/agent/state-taxonomy-and-inventory.md rows for ML pipeline and OpenClaw updated to reflect the new persistence contract.

  • Test fix: TestMergeDeployPayload_RoundTripsMaintainedAMDConfig pointed at non-existent deploy/amd/config.yaml; corrected to use deploy/recipes/balance.yaml (the actual AMD reference recipe documented in deploy/amd/README.md).
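
The database path override described in the first bullet could be resolved roughly as follows. This is a minimal sketch, not the code in config.go: the function name resolveWorkflowDBPath and the precedence (flag first, then environment variable, then default) are assumptions, since the description only says that either override is available.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// defaultWorkflowDBPath is the default location named in the PR description.
const defaultWorkflowDBPath = "./data/workflow.sqlite"

// resolveWorkflowDBPath picks the workflow database path.
// Assumed precedence: --workflow-db flag, then DASHBOARD_WORKFLOW_DB_PATH,
// then the default. The real config code may order these differently.
func resolveWorkflowDBPath(args []string, getenv func(string) string) (string, error) {
	fs := flag.NewFlagSet("dashboard", flag.ContinueOnError)
	flagPath := fs.String("workflow-db", "", "path to the workflow SQLite database")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	if *flagPath != "" {
		return *flagPath, nil
	}
	if envPath := getenv("DASHBOARD_WORKFLOW_DB_PATH"); envPath != "" {
		return envPath, nil
	}
	return defaultWorkflowDBPath, nil
}

func main() {
	// Fixed arguments keep the example deterministic.
	path, err := resolveWorkflowDBPath([]string{"--workflow-db", "/tmp/workflow.sqlite"}, os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("workflow database:", path)
}
```

Passing getenv as a parameter keeps the function testable without mutating the process environment.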

Files changed

| Area | Files |
| --- | --- |
| New package | dashboard/backend/workflowstore/{store,mlpipeline,openclaw,legacy_import,store_test}.go |
| New handler | dashboard/backend/handlers/workflow_health.go |
| New test helper | dashboard/backend/handlers/openclaw_test_helpers_test.go |
| ML pipeline | dashboard/backend/mlpipeline/runner{,_subprocess,_http,_config}.go |
| OpenClaw handlers | dashboard/backend/handlers/openclaw{,_rooms}.go |
| Router wiring | dashboard/backend/router/{router,core_routes,openclaw_routes}.go |
| Config | dashboard/backend/config/config.go |
| Test updates | openclaw_test.go, openclaw_image_test.go, openclaw_mcp_test.go, openclaw_room_readonly_test.go, openclaw_room_context_test.go, openclaw_worker_chat_test.go, mcp_routes_test.go, deploy_test.go |
| Frontend | dashboard/frontend/src/hooks/useConversationStorage.ts |
| Docs | docs/agent/state-taxonomy-and-inventory.md |

Design decisions

  1. Single SQLite file — same proven approach as auth.db and evaluations.db already in the dashboard. Keeps local-dev simple; the Store interface is a natural seam for a future Postgres adapter when HA is needed.
  2. Snapshot-replace for registries — containers/teams/rooms use DELETE + INSERT ALL in a transaction (same semantics as the JSON-file writes they replace). Room messages use append-only INSERT.
  3. Legacy import gate — runs once when openclaw_container table is empty and LegacyOpenClawDir is set. No migration tooling needed; existing JSON data is preserved on disk.
  4. SSE/WS stay in memory — live connection maps are ephemeral by nature; moving them to a database would add latency with no durability benefit.
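
The two write patterns in decision 2 can be sketched with a fake statement recorder. This is an illustration under stated assumptions, not the workflowstore code: the execer interface is a hypothetical stand-in for a database transaction (a real *sql.Tx would need a thin adapter, since its Exec also returns a result), and the column names and the openclaw_room_message table name are invented. Only the openclaw_container table name comes from the description above.

```go
package main

import "fmt"

// execer is a hypothetical narrow seam standing in for a SQL transaction.
type execer interface {
	Exec(query string, args ...any) error
}

// replaceContainers rewrites the whole registry: one DELETE followed by one
// INSERT per entry. The caller is expected to run it inside one transaction
// so readers never observe a half-written registry.
func replaceContainers(tx execer, names []string) error {
	if err := tx.Exec("DELETE FROM openclaw_container"); err != nil {
		return fmt.Errorf("clear registry: %w", err)
	}
	for _, name := range names {
		// Column name "name" is an assumption for illustration.
		if err := tx.Exec("INSERT INTO openclaw_container (name) VALUES (?)", name); err != nil {
			return fmt.Errorf("insert %q: %w", name, err)
		}
	}
	return nil
}

// appendRoomMessage issues exactly one INSERT, however long the room history is.
// Table and column names here are hypothetical.
func appendRoomMessage(db execer, roomID, body string) error {
	return db.Exec("INSERT INTO openclaw_room_message (room_id, body) VALUES (?, ?)", roomID, body)
}

// recorder is a fake execer that records the statements it receives.
type recorder struct{ queries []string }

func (r *recorder) Exec(query string, args ...any) error {
	r.queries = append(r.queries, query)
	return nil
}

func main() {
	rec := &recorder{}
	if err := replaceContainers(rec, []string{"worker-a", "worker-b"}); err != nil {
		fmt.Println("error:", err)
		return
	}
	if err := appendRoomMessage(rec, "room-1", "hello"); err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, q := range rec.queries {
		fmt.Println(q)
	}
}
```

Snapshot-replace costs one statement per registry entry on every write, which is acceptable for small registries; the append path stays constant per message.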

Related

Test plan

  • go test ./dashboard/backend/workflowstore/ — restart survival, recovery, incremental messages
  • go test ./dashboard/backend/handlers/ — all OpenClaw handler tests use newTestOpenClawHandler with temp SQLite
  • go test ./dashboard/backend/router/ — MCP integration test with WorkflowDBPath set
  • TestMergeDeployPayload_RoundTripsMaintainedAMDConfig passes with corrected recipe path
  • GET /api/workflows/health returns entity counts and "store":"ok"
  • Start dashboard, create ML job, restart dashboard, verify job shows as failed with recovery message and progress events are queryable via /api/ml-pipeline/jobs/{id}/events
  • Start dashboard with existing containers.json/teams.json/rooms.json, verify legacy import populates SQLite; subsequent restarts do not re-import
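
The restart check in the test plan relies on the startup recovery pass. A minimal in-memory sketch of that pass is shown below; the job struct, the recoverInterrupted function, and the recovery message text are hypothetical, and the real RecoverInterruptedMLJobs operates on the ml_pipeline_jobs table rather than a slice.

```go
package main

import "fmt"

// job is a hypothetical, trimmed-down view of a persisted ML pipeline job.
type job struct {
	ID     string
	Status string
	Error  string
}

// recoveryMessage is an illustrative message, not the one used by the PR.
const recoveryMessage = "job interrupted by dashboard restart"

// recoverInterrupted marks every job still recorded as "running" as "failed".
// A job cannot survive a process restart, so a "running" row found at startup
// must be stale. It returns the number of jobs it changed.
func recoverInterrupted(jobs []job) int {
	recovered := 0
	for i := range jobs {
		if jobs[i].Status == "running" {
			jobs[i].Status = "failed"
			jobs[i].Error = recoveryMessage
			recovered++
		}
	}
	return recovered
}

func main() {
	jobs := []job{
		{ID: "job-1", Status: "completed"},
		{ID: "job-2", Status: "running"},
	}
	n := recoverInterrupted(jobs)
	fmt.Println("recovered:", n)
	for _, j := range jobs {
		fmt.Println(j.ID, j.Status, j.Error)
	}
}
```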

Resolve #1609

@netlify

netlify Bot commented Mar 25, 2026

Deploy Preview for vllm-semantic-router ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | ce103c8 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/vllm-semantic-router/deploys/69e0ad7bc784100008e111ac |
| 😎 Deploy Preview | https://deploy-preview-1656--vllm-semantic-router.netlify.app |

@github-actions
Contributor

github-actions Bot commented Mar 25, 2026

✅ Supply Chain Security Report — All Clear

Scanner Status Findings
AST Codebase Scan (Py, Go, JS/TS, Rust) 27 finding(s) — MEDIUM: 21 · LOW: 6
AST PR Diff Scan No issues detected
Regex Fallback Scan No issues detected

Scanned at 2026-04-16T09:36:22.487Z · View full workflow logs

@rootfs rootfs requested a review from Copilot March 25, 2026 15:50
Contributor

Copilot AI left a comment


Pull request overview

This PR migrates dashboard workflow/control-plane state (ML pipeline jobs + OpenClaw collaboration entities) from in-memory / JSON-file persistence to a shared, durable SQLite-backed workflowstore, and wires new APIs/health reporting around it.

Changes:

  • Introduces dashboard/backend/workflowstore SQLite store with schema + legacy OpenClaw JSON import.
  • Updates ML pipeline runner/handlers to persist jobs + typed progress events, plus adds /api/ml-pipeline/jobs/{id}/events.
  • Updates OpenClaw handlers/tests to store containers/teams/rooms/messages in SQLite and adds /api/workflows/health.

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| docs/agent/state-taxonomy-and-inventory.md | Updates state inventory to reflect SQLite-backed durability for ML pipeline + OpenClaw. |
| dashboard/frontend/src/hooks/useConversationStorage.ts | Documents that localStorage chat is demo-only; points to OpenClaw rooms for durable history. |
| dashboard/backend/workflowstore/store.go | Adds SQLite store open/init + schema for ML pipeline + OpenClaw entities. |
| dashboard/backend/workflowstore/mlpipeline.go | Adds persisted ML job + progress event CRUD and recovery logic. |
| dashboard/backend/workflowstore/openclaw.go | Adds OpenClaw entity/message CRUD against SQLite. |
| dashboard/backend/workflowstore/legacy_import.go | Adds one-time import path from legacy OpenClaw JSON files into SQLite. |
| dashboard/backend/workflowstore/store_test.go | Adds restart/reopen and incremental message append tests for the store. |
| dashboard/backend/router/router.go | Opens workflow store, registers workflow health endpoint, injects store into OpenClaw/ML pipeline. |
| dashboard/backend/router/core_routes.go | Updates ML pipeline route wiring to create runner with store + recover running jobs. |
| dashboard/backend/router/openclaw_routes.go | Updates OpenClaw handler construction to require workflow store. |
| dashboard/backend/router/mcp_routes_test.go | Sets WorkflowDBPath for router integration tests. |
| dashboard/backend/mlpipeline/runner*.go | Replaces in-memory job map with persisted jobs + typed progress events in store. |
| dashboard/backend/handlers/workflow_health.go | Adds /api/workflows/health snapshot endpoint backed by store counts. |
| dashboard/backend/handlers/openclaw*.go + tests | Switches OpenClaw registry/rooms/messages persistence to store and updates tests to use temp SQLite. |
| dashboard/backend/handlers/mlpipeline.go | Adds /events sub-route for durable typed ML progress history. |
| dashboard/backend/handlers/deploy_test.go | Fixes AMD config fixture path to use the documented recipe. |
| dashboard/backend/config/config.go | Adds --workflow-db / env default path config for workflow SQLite. |

Comment thread dashboard/backend/workflowstore/store.go Outdated
Comment thread dashboard/backend/workflowstore/store.go
Comment thread dashboard/backend/workflowstore/openclaw.go Outdated
Comment thread dashboard/backend/mlpipeline/runner.go Outdated
Comment thread dashboard/backend/workflowstore/mlpipeline.go
Comment thread dashboard/backend/workflowstore/mlpipeline.go
Comment thread dashboard/backend/handlers/openclaw.go
Comment thread dashboard/backend/workflowstore/mlpipeline.go
Comment thread dashboard/backend/workflowstore/openclaw.go
@Xunzhuo
Member

Xunzhuo commented Apr 2, 2026

Can we reuse the Postgres we set up now (#1683) and migrate all the SQLite storage into a unified one?

@mkoushni mkoushni force-pushed the feature/dashboard-workflow-state-durable branch 5 times, most recently from dd15c8f to 129edb2 Compare April 16, 2026 09:21
…h/events APIs; AMD deploy test fix

Signed-off-by: mkoushni <mkoushni@redhat.com>
…og persist errors

- Add _foreign_keys=1 to SQLite DSN so ON DELETE CASCADE is enforced
- Use INSERT OR IGNORE instead of INSERT OR REPLACE in AppendOpenClawRoomMessage
  to preserve seq ordering on duplicate messages
- Replace silent error discards (_ = ...) with log.Printf in runner.go
  so persistence failures are visible in logs

Made-with: Cursor
Signed-off-by: mkoushni <mkoushni@redhat.com>
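
The first fix in the commit message above, enabling foreign keys through the DSN, might be expressed as the small helper below. The parameter name _foreign_keys=1 is taken from the commit message; the sqliteDSN helper and the "file:" URI form are assumptions about how the store builds its connection string, and the accepted parameters depend on the SQLite driver in use.

```go
package main

import (
	"fmt"
	"net/url"
)

// sqliteDSN builds a DSN that asks the driver to enforce foreign keys, so that
// ON DELETE CASCADE constraints take effect. SQLite leaves foreign-key
// enforcement off by default unless it is enabled per connection.
func sqliteDSN(path string) string {
	q := url.Values{}
	q.Set("_foreign_keys", "1")
	return "file:" + path + "?" + q.Encode()
}

func main() {
	fmt.Println(sqliteDSN("./data/workflow.sqlite"))
}
```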
Signed-off-by: mkoushni <mkoushni@redhat.com>
Signed-off-by: mkoushni <mkoushni@redhat.com>
Made-with: Cursor
Signed-off-by: mkoushni <mkoushni@redhat.com>
…errors

Signed-off-by: mkoushni <mkoushni@redhat.com>
Signed-off-by: mkoushni <mkoushni@redhat.com>
@mkoushni mkoushni force-pushed the feature/dashboard-workflow-state-durable branch from 129edb2 to ce103c8 Compare April 16, 2026 09:35
@mkoushni mkoushni marked this pull request as ready for review April 16, 2026 09:53
@mkoushni mkoushni requested review from Xunzhuo and rootfs as code owners April 16, 2026 09:53
@github-actions
Contributor

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 dashboard

Owners: @Xunzhuo, @JaredforReal, @haowu1234, @szedan-rh, @yehuditkerido, @henschwartz
Files changed:

  • dashboard/backend/config/config.go
  • dashboard/backend/handlers/deploy_test.go
  • dashboard/backend/handlers/mlpipeline.go
  • dashboard/backend/handlers/openclaw.go
  • dashboard/backend/handlers/openclaw_automation_test.go
  • dashboard/backend/handlers/openclaw_gateway_test.go
  • dashboard/backend/handlers/openclaw_image_test.go
  • dashboard/backend/handlers/openclaw_mcp_test.go
  • dashboard/backend/handlers/openclaw_room_context_test.go
  • dashboard/backend/handlers/openclaw_room_helpers.go
  • dashboard/backend/handlers/openclaw_room_readonly_test.go
  • dashboard/backend/handlers/openclaw_rooms.go
  • dashboard/backend/handlers/openclaw_rooms_test.go
  • dashboard/backend/handlers/openclaw_test.go
  • dashboard/backend/handlers/openclaw_test_helpers_test.go
  • dashboard/backend/handlers/openclaw_worker_chat_test.go
  • dashboard/backend/handlers/workflow_health.go
  • dashboard/backend/mlpipeline/runner.go
  • dashboard/backend/mlpipeline/runner_config.go
  • dashboard/backend/mlpipeline/runner_http.go
  • dashboard/backend/mlpipeline/runner_subprocess.go
  • dashboard/backend/router/core_routes.go
  • dashboard/backend/router/mcp_routes_test.go
  • dashboard/backend/router/openclaw_routes.go
  • dashboard/backend/router/router.go
  • dashboard/backend/workflowstore/legacy_import.go
  • dashboard/backend/workflowstore/mlpipeline.go
  • dashboard/backend/workflowstore/openclaw.go
  • dashboard/backend/workflowstore/store.go
  • dashboard/backend/workflowstore/store_test.go
  • dashboard/frontend/src/hooks/useConversationStorage.ts

📁 docs

Owners: @Xunzhuo
Files changed:

  • docs/agent/state-taxonomy-and-inventory.md


🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.

@rootfs rootfs merged commit 70f0738 into vllm-project:main Apr 16, 2026
33 checks passed

Development

Successfully merging this pull request may close these issues.

feature: make dashboard workflows and OpenClaw state restart-safe and server-owned

8 participants