Commit 3825ee1

Partha-dev01 and claude committed
Add camera-based motor action verification to Step 7 + fix worker URL bug
Step 7 motor turns now activate the camera and use YOLO pose detection to verify the child performed the requested action (wave, touch nose, clap, raise arms, touch head, touch ears). Rule-based ActionDetector analyzes COCO-17 keypoints with body-scale-normalized thresholds. ActionTracker requires 5 consecutive positive frames to confirm detection.

Also fixes Stage 10 ONNX worker URL parse error — model paths changed from relative (/models/...) to absolute (self.location.origin prefix) so they resolve correctly in Web Worker scope.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent e1f3f4d commit 3825ee1
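
The commit message says ActionTracker confirms an action only after 5 consecutive positive frames. The sketch below illustrates that idea in isolation. The class name, the `progress` getter, and the reset-on-miss policy are assumptions made for illustration; the actual `ActionTracker` in `actionDetector.ts` is not shown in this diff and may differ.

```typescript
// Hypothetical sketch of "N consecutive positive frames" confirmation.
// Not the commit's ActionTracker; names and policy are assumed.
class ConsecutiveFrameTracker {
  private streak = 0;
  private readonly requiredFrames: number;

  constructor(requiredFrames: number = 5) {
    this.requiredFrames = requiredFrames;
  }

  /** Feed one per-frame detection; returns true once the streak is long enough. */
  update(detected: boolean): boolean {
    // A single negative frame resets the streak (assumed policy).
    this.streak = detected ? this.streak + 1 : 0;
    return this.streak >= this.requiredFrames;
  }

  /** Fraction of the required streak reached so far, in [0, 1]. */
  get progress(): number {
    return Math.min(this.streak / this.requiredFrames, 1);
  }

  reset(): void {
    this.streak = 0;
  }
}
```

Requiring an unbroken streak trades a little latency for robustness: a single spurious frame where a wrist keypoint lands near the nose cannot confirm the action on its own.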

7 files changed

Lines changed: 922 additions & 46 deletions

DOCS.md

Lines changed: 25 additions & 0 deletions
@@ -407,6 +407,8 @@ npx playwright test # Run all 30 tests
 | R10 | **No consent before cloud sync** | Results auto-synced to cloud without user consent. Fixed: consent checkbox added at Stage 10 completion. Summary page (Stage 11) respects the preference — skips sync if user opts out. |
 | R11 | **Step 7 static instructions — no adaptive assessment** | Step 7 used 5 hardcoded instructions with parent-reported "Did it!" buttons. Replaced with a dynamic AI voice agent: Amazon Nova Lite (Bedrock) generates age-appropriate conversation, Amazon Polly speaks to the child, Web Speech API listens for responses. Collects richer biomarkers (response latency, engagement rate, comprehension). Falls back to pre-defined conversation when Bedrock unavailable. |
 | R12 | **CI Playwright failures — AWS SDK "Region is missing"** | `next.config.ts` inlines env vars at build time with `?? ""` defaults. On CI (no `.env.local`), `BEDROCK_REGION`/`POLLY_REGION` resolved to `""` (empty string). Nullish coalescing (`??`) doesn't catch empty strings, so `process.env.BEDROCK_REGION ?? "us-east-1"` → `""`. AWS SDK threw `Error: Region is missing` outside try/catch → uncaught 500. Fix: changed `??` to `||` in all 4 API routes (summary, clinical, tts, conversation) and moved client creation inside try/catch blocks. TTS error status changed from 500 → 503. |
+| R13 | **Step 7 auto-advances without verifying motor actions** | Voice agent spoke motor instructions ("touch your nose", "wave") but immediately moved on without checking if the child performed the action. Also, agent text wasn't displayed prominently. Fix: added camera-based motor action verification using existing YOLO pose detection pipeline. Motor turns activate camera → YOLO extracts 17 keypoints → rule-based ActionDetector checks keypoint geometry → ActionTracker requires 5 consecutive positive frames → confirmed. Agent text now displayed in large centered speech bubble with domain emoji headers. |
+| R14 | **Stage 10 worker URL parse error** | `Failed to execute 'fetch' on 'WorkerGlobalScope': Failed to parse URL from /models/yolo26n-pose-int8.onnx`. ONNX model paths were relative URLs (`/models/...`) which fail inside Web Workers because relative paths resolve against the worker script URL (blob: or /_next/static/), not the page origin. Fix: prefixed all 4 model paths with `${self.location.origin}` in PipelineOrchestrator.ts and MultimodalOrchestrator.ts. |

 ---

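Row R12 above turns on the difference between `??` and `||` when an environment variable is set to an empty string. The following is a minimal sketch of that behaviour; the two helper names are hypothetical and exist only to put the operators side by side.

```typescript
// Hypothetical helpers contrasting the two fallback operators.
// `??` falls back only on null/undefined; `||` falls back on any falsy value,
// which includes the empty string produced by a `?? ""` build-time default.
function withNullish(value: string | undefined, fallback: string): string {
  return value ?? fallback;
}

function withLogicalOr(value: string | undefined, fallback: string): string {
  return value || fallback;
}

// An unset variable behaves the same under both operators...
const unsetA = withNullish(undefined, "us-east-1");
const unsetB = withLogicalOr(undefined, "us-east-1");

// ...but an empty string survives `??` and is replaced by `||`.
const emptyA = withNullish("", "us-east-1");
const emptyB = withLogicalOr("", "us-east-1");

console.log({ unsetA, unsetB, emptyA, emptyB });
```

With `||`, an empty region can never reach the SDK client. The trade-off is that `||` also discards other falsy values, which is harmless for region strings but worth remembering for numeric or boolean settings.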
@@ -513,3 +515,26 @@ npx playwright test # Run all 30 tests
 - Fixed: `app/api/report/clinical/route.ts` (`??` → `||`, client inside try/catch)
 - Fixed: `app/api/tts/route.ts` (`??` → `||`, client inside try/catch, 500 → 503)
 - Fixed: `app/api/chat/conversation/route.ts` (`??` → `||`, client inside try/catch)
+
+### v1.4.0 — 2026-03-04 (Camera Action Verification + Worker URL Fix)
+
+**Major Change:**
+- **Step 7 motor action verification via YOLO camera**: Motor instruction turns now activate the camera and use the existing YOLO26n-pose model to detect whether the child actually performed the requested action (wave, touch nose, clap, raise arms, touch head, touch ears). Rule-based ActionDetector analyzes 17 COCO keypoints with body-scale-normalized distance thresholds. ActionTracker requires 5 consecutive positive frames to confirm detection, preventing false positives.
+
+**New:**
+- `app/lib/actions/actionDetector.ts` — Pure rule-based action detection from YOLO keypoints: 6 actions with geometry rules, `ActionTracker` class for sustained detection, `ACTION_META` map for UI labels/emoji
+- `app/hooks/useActionCamera.ts` — Camera + YOLO inference + action detection hook: manages getUserMedia, inference worker (body-only mode), requestAnimationFrame loop, skeleton overlay drawing, ActionTracker integration
+- New `"verifying"` phase in Step 7 state machine: camera feed shown with COCO-17 skeleton overlay, detection progress bar, 15-second timeout with skip option
+- Domain emoji headers in agent text display (social, cognitive, language, motor, general)
+- `action` field added to conversation API TurnMetadata — LLM includes action ID for motor turns
+
+**Fixed:**
+- **Stage 10 ONNX worker URL parse error**: Model paths in `PipelineOrchestrator.ts` and `MultimodalOrchestrator.ts` changed from relative (`/models/...`) to absolute (`${self.location.origin}/models/...`) — resolves correctly in Web Worker scope
+
+**Files:**
+- Created: `app/lib/actions/actionDetector.ts`
+- Created: `app/hooks/useActionCamera.ts`
+- Rewritten: `app/intake/preparation/page.tsx` (camera verification integration)
+- Updated: `app/api/chat/conversation/route.ts` (action field in metadata)
+- Fixed: `app/lib/inference/PipelineOrchestrator.ts` (absolute model URLs)
+- Fixed: `app/lib/inference/MultimodalOrchestrator.ts` (absolute model URLs)

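The worker URL fix recorded above prefixes model paths with the page origin. A rough sketch of the same idea using the standard `URL` constructor follows. `toAbsoluteModelUrl` and `parsesWithoutBase` are hypothetical helpers, and the origin is passed in as a parameter so the snippet runs outside a worker; the commit itself uses a `${self.location.origin}` template prefix.

```typescript
// Hypothetical helper: resolve a root-relative model path against an explicit origin.
// Inside a Web Worker the origin argument would be self.location.origin.
function toAbsoluteModelUrl(path: string, origin: string): string {
  return new URL(path, origin).href;
}

// Hypothetical helper: report whether a string parses as a URL with no base.
// Root-relative paths such as "/models/..." do not, which is why they need a base.
function parsesWithoutBase(url: string): boolean {
  try {
    new URL(url);
    return true;
  } catch {
    return false;
  }
}

const modelUrl = toAbsoluteModelUrl("/models/yolo26n-pose-int8.onnx", "https://example.com");
console.log(modelUrl, parsesWithoutBase("/models/yolo26n-pose-int8.onnx"), parsesWithoutBase(modelUrl));
```

Resolving through `new URL(path, base)` rather than string concatenation also normalises stray slashes, though either approach yields an absolute URL that no longer depends on the worker script's location.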
app/api/chat/conversation/route.ts

Lines changed: 8 additions & 4 deletions
@@ -41,6 +41,7 @@ interface TurnMetadata {
   responseRelevance: number;
   shouldEnd: boolean;
   domain: "social" | "cognitive" | "language" | "motor" | "general";
+  action?: string; // For motor instructions: "wave", "touch_nose", "clap", "raise_arms", "touch_head", "touch_ears"
 }

 interface ConversationResponse {
@@ -87,10 +88,11 @@ RULES:
 10. For motor instructions, phrase them as fun games — "Let's play a game! Can you..."

 You MUST respond with ONLY valid JSON (no markdown, no code blocks) in this exact format:
-{"text":"Your spoken response here","turnType":"greeting|question|instruction|follow_up|farewell","expectsResponse":true,"responseRelevance":0.5,"shouldEnd":false,"domain":"social|cognitive|language|motor|general"}
+{"text":"Your spoken response here","turnType":"greeting|question|instruction|follow_up|farewell","expectsResponse":true,"responseRelevance":0.5,"shouldEnd":false,"domain":"social|cognitive|language|motor|general","action":null}

 For responseRelevance: rate how relevant the child's LAST response was to your LAST question (0.0 = no response or completely irrelevant, 0.5 = somewhat relevant, 1.0 = perfect response). Use 0.5 for the first turn.
-For shouldEnd: set to true ONLY on your farewell turn (after 5-8 assistant turns).`;
+For shouldEnd: set to true ONLY on your farewell turn (after 5-8 assistant turns).
+For action: when domain is "motor" and turnType is "instruction", include one of: "wave", "touch_nose", "clap", "raise_arms", "touch_head", "touch_ears". For non-motor turns, set to null.`;
 }

 /* ------------------------------------------------------------------ */
@@ -108,7 +110,7 @@ function buildFallbackTurn(
     },
     {
       text: `Awesome! Let's start with something fun. Can you wave hello to me?`,
-      metadata: { turnType: "instruction", expectsResponse: true, responseRelevance: 0.5, shouldEnd: false, domain: "motor" },
+      metadata: { turnType: "instruction", expectsResponse: true, responseRelevance: 0.5, shouldEnd: false, domain: "motor", action: "wave" },
     },
     {
       text: `Great job! Now tell me, what color is the sky?`,
@@ -120,7 +122,7 @@ function buildFallbackTurn(
     },
     {
       text: `That's wonderful! Now let's try something silly. Can you touch your nose?`,
-      metadata: { turnType: "instruction", expectsResponse: true, responseRelevance: 0.5, shouldEnd: false, domain: "motor" },
+      metadata: { turnType: "instruction", expectsResponse: true, responseRelevance: 0.5, shouldEnd: false, domain: "motor", action: "touch_nose" },
     },
     {
       text: `You're a superstar! What's your favorite animal?`,
@@ -153,6 +155,7 @@ function parseAgentResponse(raw: string): Omit<ConversationResponse, "fallback">
         responseRelevance: typeof parsed.responseRelevance === "number" ? parsed.responseRelevance : 0.5,
         shouldEnd: parsed.shouldEnd === true,
         domain: parsed.domain ?? "general",
+        ...(parsed.action ? { action: parsed.action } : {}),
       },
     };
   }
@@ -175,6 +178,7 @@ function parseAgentResponse(raw: string): Omit<ConversationResponse, "fallback">
         responseRelevance: typeof parsed.responseRelevance === "number" ? parsed.responseRelevance : 0.5,
         shouldEnd: parsed.shouldEnd === true,
         domain: parsed.domain ?? "general",
+        ...(parsed.action ? { action: parsed.action } : {}),
       },
     };
   }

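The diff above copies `parsed.action` into the metadata whenever it is truthy, so any string the LLM emits would pass through. One possible hardening, shown here only as a sketch, is to check the value against the six action IDs listed in the `TurnMetadata` comment before spreading it. `ACTION_IDS`, `normalizeAction`, and `withAction` are hypothetical names and are not part of this commit.

```typescript
// Hypothetical allow-list built from the six IDs named in the TurnMetadata comment.
const ACTION_IDS = ["wave", "touch_nose", "clap", "raise_arms", "touch_head", "touch_ears"] as const;
type KnownActionId = (typeof ACTION_IDS)[number];

/** Returns the action only if it is one of the known IDs; otherwise undefined. */
function normalizeAction(value: unknown): KnownActionId | undefined {
  if (typeof value !== "string") return undefined;
  return (ACTION_IDS as readonly string[]).indexOf(value) !== -1
    ? (value as KnownActionId)
    : undefined;
}

/** Adds an `action` key only when the raw value is a known ID (same conditional-spread idiom as the diff). */
function withAction<T extends object>(base: T, rawAction: unknown): T & { action?: KnownActionId } {
  const action = normalizeAction(rawAction);
  return { ...base, ...(action ? { action } : {}) };
}

console.log(withAction({ domain: "motor" }, "touch_nose"));
console.log(withAction({ domain: "motor" }, "do_a_backflip"));
```

Dropping unknown IDs at the API boundary means the client never starts the camera for an action the detector has no rule for.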
app/hooks/useActionCamera.ts

Lines changed: 267 additions & 0 deletions
@@ -0,0 +1,267 @@
+/**
+ * useActionCamera — manages camera + YOLO inference + action detection
+ * for Step 7 motor instruction verification.
+ *
+ * Reuses the existing inference.worker.ts in body-only mode (YOLO + TCN)
+ * but only extracts keypoints for rule-based action detection.
+ */
+
+"use client";
+import { useState, useEffect, useRef, useCallback } from "react";
+import type { PipelineResult, WorkerOutMessage } from "../types/inference";
+import { ActionTracker, type ActionId, type ActionResult } from "../lib/actions/actionDetector";
+
+// COCO-17 skeleton connections (same as DetectorVideoCanvas)
+const SKELETON: [number, number][] = [
+  [0, 1], [0, 2], [1, 3], [2, 4],
+  [5, 7], [7, 9], [6, 8], [8, 10],
+  [5, 6], [5, 11], [6, 12], [11, 12],
+  [11, 13], [13, 15], [12, 14], [14, 16],
+];
+
+export interface UseActionCameraReturn {
+  videoRef: React.RefObject<HTMLVideoElement | null>;
+  overlayRef: React.RefObject<HTMLCanvasElement | null>;
+  isModelLoaded: boolean;
+  isActive: boolean;
+  cameraError: string | null;
+  startCamera: () => Promise<void>;
+  stopCamera: () => void;
+  startDetecting: (action: ActionId) => void;
+  stopDetecting: () => void;
+  actionResult: ActionResult | null;
+  actionDetected: boolean;
+  keypoints: Float32Array | null;
+  confidence: Float32Array | null;
+}
+
+export function useActionCamera(): UseActionCameraReturn {
+  const videoRef = useRef<HTMLVideoElement | null>(null);
+  const overlayRef = useRef<HTMLCanvasElement | null>(null);
+  const captureCanvasRef = useRef<HTMLCanvasElement | null>(null);
+
+  const [isModelLoaded, setIsModelLoaded] = useState(false);
+  const [isActive, setIsActive] = useState(false);
+  const [cameraError, setCameraError] = useState<string | null>(null);
+  const [actionResult, setActionResult] = useState<ActionResult | null>(null);
+  const [actionDetected, setActionDetected] = useState(false);
+  const [keypoints, setKeypoints] = useState<Float32Array | null>(null);
+  const [confidence, setConfidence] = useState<Float32Array | null>(null);
+
+  const workerRef = useRef<Worker | null>(null);
+  const busyRef = useRef(false);
+  const rafRef = useRef(0);
+  const streamRef = useRef<MediaStream | null>(null);
+  const trackerRef = useRef(new ActionTracker());
+  const targetActionRef = useRef<ActionId | null>(null);
+  const detectingRef = useRef(false);
+
+  // Create & initialise worker on mount
+  useEffect(() => {
+    let worker: Worker;
+    try {
+      worker = new Worker(
+        new URL("../../workers/inference.worker.ts", import.meta.url),
+        { type: "module" },
+      );
+    } catch (err) {
+      setCameraError(`Failed to create inference worker: ${err instanceof Error ? err.message : String(err)}`);
+      return;
+    }
+    workerRef.current = worker;
+
+    worker.onmessage = (e: MessageEvent<WorkerOutMessage>) => {
+      const msg = e.data;
+      switch (msg.type) {
+        case "initialized":
+          setIsModelLoaded(true);
+          // Set body-only mode
+          worker.postMessage({ type: "setModality", modality: "body" });
+          break;
+        case "result":
+          handleResult(msg.data);
+          busyRef.current = false;
+          break;
+        case "error":
+          busyRef.current = false;
+          break;
+      }
+    };
+
+    worker.postMessage({ type: "init" });
+
+    return () => {
+      worker.terminate();
+      workerRef.current = null;
+    };
+    // eslint-disable-next-line react-hooks/exhaustive-deps
+  }, []);
+
+  // Handle inference result
+  const handleResult = useCallback((result: PipelineResult) => {
+    const kps = result.keypoints;
+    const conf = result.confidence;
+    if (kps && conf) {
+      setKeypoints(kps);
+      setConfidence(conf);
+      drawSkeleton(kps, conf);
+
+      if (detectingRef.current && targetActionRef.current) {
+        const tracked = trackerRef.current.update(kps, conf, targetActionRef.current);
+        setActionResult(tracked);
+        if (tracked.confirmed) {
+          setActionDetected(true);
+          detectingRef.current = false;
+        }
+      }
+    }
+    // eslint-disable-next-line react-hooks/exhaustive-deps
+  }, []);
+
+  // Draw skeleton overlay
+  const drawSkeleton = useCallback((kps: Float32Array, conf: Float32Array) => {
+    const canvas = overlayRef.current;
+    if (!canvas) return;
+    const ctx = canvas.getContext("2d");
+    if (!ctx) return;
+
+    const w = canvas.width;
+    const h = canvas.height;
+    ctx.clearRect(0, 0, w, h);
+
+    if (kps.length < 34) return;
+
+    // Scale keypoints from 320×240 to canvas size
+    const scaleX = w / 320;
+    const scaleY = h / 240;
+
+    // Draw bones
+    ctx.strokeStyle = "rgba(104, 159, 56, 0.8)";
+    ctx.lineWidth = 2.5;
+    for (const [a, b] of SKELETON) {
+      if (conf[a] < 0.3 || conf[b] < 0.3) continue;
+      ctx.beginPath();
+      ctx.moveTo(kps[a * 2] * scaleX, kps[a * 2 + 1] * scaleY);
+      ctx.lineTo(kps[b * 2] * scaleX, kps[b * 2 + 1] * scaleY);
+      ctx.stroke();
+    }
+
+    // Draw keypoints
+    for (let i = 0; i < 17; i++) {
+      if (conf[i] < 0.3) continue;
+      ctx.fillStyle = "rgba(104, 159, 56, 0.9)";
+      ctx.beginPath();
+      ctx.arc(kps[i * 2] * scaleX, kps[i * 2 + 1] * scaleY, 4, 0, Math.PI * 2);
+      ctx.fill();
+    }
+  }, []);
+
+  // Frame capture loop
+  const sendFrame = useCallback(() => {
+    const worker = workerRef.current;
+    const video = videoRef.current;
+
+    if (!worker || !video || !isActive || !isModelLoaded || busyRef.current || video.paused) {
+      if (isActive) rafRef.current = requestAnimationFrame(sendFrame);
+      return;
+    }
+
+    try {
+      if (!captureCanvasRef.current) {
+        captureCanvasRef.current = document.createElement("canvas");
+        captureCanvasRef.current.width = 320;
+        captureCanvasRef.current.height = 240;
+      }
+      const ctx = captureCanvasRef.current.getContext("2d", { willReadFrequently: true });
+      if (!ctx) { rafRef.current = requestAnimationFrame(sendFrame); return; }
+
+      ctx.drawImage(video, 0, 0, 320, 240);
+      const imageData = ctx.getImageData(0, 0, 320, 240);
+
+      busyRef.current = true;
+      worker.postMessage({ type: "processFrame", imageData }, [imageData.data.buffer]);
+    } catch {
+      // Frame capture error — skip
+    }
+
+    rafRef.current = requestAnimationFrame(sendFrame);
+  }, [isActive, isModelLoaded]);
+
+  // Start/stop frame loop when active changes
+  useEffect(() => {
+    if (isActive && isModelLoaded) {
+      rafRef.current = requestAnimationFrame(sendFrame);
+    }
+    return () => cancelAnimationFrame(rafRef.current);
+  }, [isActive, isModelLoaded, sendFrame]);
+
+  const startCamera = useCallback(async () => {
+    try {
+      const stream = await navigator.mediaDevices.getUserMedia({
+        video: { width: 320, height: 240, facingMode: "user" },
+      });
+      streamRef.current = stream;
+      if (videoRef.current) {
+        videoRef.current.srcObject = stream;
+        await videoRef.current.play().catch(() => {});
+      }
+      setIsActive(true);
+      setCameraError(null);
+    } catch (err) {
+      setCameraError(
+        err instanceof Error ? err.message : "Camera access denied",
+      );
+    }
+  }, []);
+
+  const stopCamera = useCallback(() => {
+    setIsActive(false);
+    cancelAnimationFrame(rafRef.current);
+    if (streamRef.current) {
+      streamRef.current.getTracks().forEach((t) => t.stop());
+      streamRef.current = null;
+    }
+    if (videoRef.current) {
+      videoRef.current.srcObject = null;
+    }
+  }, []);
+
+  const startDetecting = useCallback((action: ActionId) => {
+    targetActionRef.current = action;
+    detectingRef.current = true;
+    trackerRef.current.reset();
+    setActionDetected(false);
+    setActionResult(null);
+  }, []);
+
+  const stopDetecting = useCallback(() => {
+    targetActionRef.current = null;
+    detectingRef.current = false;
+  }, []);
+
+  // Cleanup on unmount
+  useEffect(() => {
+    return () => {
+      cancelAnimationFrame(rafRef.current);
+      if (streamRef.current) {
+        streamRef.current.getTracks().forEach((t) => t.stop());
+      }
+    };
+  }, []);
+
+  return {
+    videoRef,
+    overlayRef,
+    isModelLoaded,
+    isActive,
+    cameraError,
+    startCamera,
+    stopCamera,
+    startDetecting,
+    stopDetecting,
+    actionResult,
+    actionDetected,
+    keypoints,
+    confidence,
+  };
+}

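The hook above passes a flat keypoint array (`kps[i * 2]`, `kps[i * 2 + 1]`) and a per-keypoint confidence array to `ActionTracker.update`, whose rules live in `actionDetector.ts` and are not part of this diff. As an illustration of what a body-scale-normalized rule could look like, the sketch below tests "touch nose" by comparing wrist-to-nose distance against a fraction of shoulder width. The function names, the 0.5 ratio, and the choice of shoulder width as the body scale are all assumptions; only the COCO-17 indices, the array layout, and the 0.3 confidence cutoff come from the code above.

```typescript
// COCO-17 keypoint indices used by this sketch.
const NOSE = 0;
const LEFT_SHOULDER = 5;
const RIGHT_SHOULDER = 6;
const LEFT_WRIST = 9;
const RIGHT_WRIST = 10;

/** Euclidean distance between keypoints a and b in a flat [x0, y0, x1, y1, ...] array. */
function keypointDistance(kps: Float32Array, a: number, b: number): number {
  return Math.hypot(kps[a * 2] - kps[b * 2], kps[a * 2 + 1] - kps[b * 2 + 1]);
}

/**
 * Hypothetical "touch nose" rule: true when either wrist lies within
 * ratio * shoulderWidth of the nose. The ratio (0.5) is an assumed value.
 */
function isTouchingNose(
  kps: Float32Array,
  conf: Float32Array,
  ratio: number = 0.5,
  minConf: number = 0.3,
): boolean {
  if (kps.length < 34 || conf.length < 17) return false;
  if (conf[NOSE] < minConf || conf[LEFT_SHOULDER] < minConf || conf[RIGHT_SHOULDER] < minConf) {
    return false;
  }
  // Shoulder width stands in for body scale, so the rule is independent of
  // how far the child is from the camera.
  const bodyScale = keypointDistance(kps, LEFT_SHOULDER, RIGHT_SHOULDER);
  if (bodyScale <= 0) return false;
  const threshold = ratio * bodyScale;
  return [LEFT_WRIST, RIGHT_WRIST].some(
    (wrist) => conf[wrist] >= minConf && keypointDistance(kps, wrist, NOSE) <= threshold,
  );
}
```

Because the threshold scales with the detected body, the same pose yields the same answer whether the child fills the 320×240 frame or stands further back.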