Commit 42a2a36

fix: Adds links for data. Fixes table formatting. Fixes minor sentence issues.
1 parent 69ddb37 commit 42a2a36

1 file changed

Lines changed: 65 additions & 43 deletions

Testing/Evaluation/evaluation_report_030426.md
@@ -4,19 +4,20 @@
44
**Date:** 03 April 2026
55
**Session window:** 11:24 – 11:58
66
**Platforms tested:** Franka Panda · UR10e
7-
**Source logs:** `cobot.log`, `cobot_backend_Franka.log`, `cobot_backend_UR.log`
7+
**Source logs:** [cobot_frontend.log](./Logs/cobot_frontend.log), [cobot_backend_Franka.log](./Logs/cobot_backend_Franka.log), [cobot_backend_UR.log](./Logs/cobot_backend_UR.log)
8+
**Source data:** [Evaluation/Data/](./Data/) — 129 audio recordings + 125 parse-result JSONs (timestamp-named pairs)
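
Because the recordings and parse results are timestamp-named pairs, the gap between the 129 audio files and the 125 JSONs can be traced by comparing filename stems. The following is a minimal sketch, not part of the repository: the `.wav` extension, the sample filenames, and the helper name `unpaired_recordings` are assumptions made for illustration.

```python
from pathlib import PurePath

AUDIO_EXT = ".wav"    # assumption: the report does not state the audio format
RESULT_EXT = ".json"  # parse-result files, per the report


def unpaired_recordings(filenames):
    """Return sorted stems of audio files that have no JSON with the same stem."""
    audio = {PurePath(f).stem for f in filenames
             if PurePath(f).suffix.lower() == AUDIO_EXT}
    results = {PurePath(f).stem for f in filenames
               if PurePath(f).suffix.lower() == RESULT_EXT}
    return sorted(audio - results)


# Hypothetical filenames, for illustration only.
sample = ["20260403_112431.wav", "20260403_112431.json", "20260403_112502.wav"]
print(unpaired_recordings(sample))
```

Passing the names from a directory listing of `Evaluation/Data/` would list the recordings that never produced a parse result.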
89

910
---
1011

1112
## Summary
1213

1314
All eight tests passed. The framework achieved a 100% end-to-end success rate across both robot platforms (60/60 benchmark trials). The invalid-input rejection test produced five correct rejections. The backend swap was completed in approximately 2 minutes with no changes to code or any upstream module.
1415

15-
| Research question | Tests | Outcome |
16-
|---|---|---|
17-
| RQ1 — End-to-end success | 1.1, 1.2 | **PASS** |
16+
| Research question | Tests | Outcome |
17+
|---------------------------------------|-----------|----------|
18+
| RQ1 — End-to-end success | 1.1, 1.2 | **PASS** |
1819
| RQ2 — Input variability and stability | 2.1 – 2.5 | **PASS** |
19-
| RQ3 — Modularity and switch cost | 3.1 | **PASS** |
20+
| RQ3 — Modularity and switch cost | 3.1 | **PASS** |
2021

2122
---
2223

@@ -28,20 +29,20 @@ All eight tests passed. The framework achieved a 100% end-to-end success rate ac
2829

2930
All six benchmark tasks completed on all five trials. No failed or rejected trials. Every command reached the backend and returned a success response. The Franka MoveIt stack initialised once at 11:25:26 and remained active throughout all 30 trials without requiring a restart.
3031

31-
| Task | Command | Trials | Outcome |
32-
|---|---|---|---|
33-
| 1 | Go to P1 | 5/5 | Success |
34-
| 2 | Go linear to P2 | 5/5 | Success |
35-
| 3 | Open gripper | 5/5 | Success |
36-
| 4 | Teach current position as P3 | 5/5 | Success |
37-
| 5 | Pick at P1 and place at P2 | 5/5 | Success |
38-
| 6 | Pick at P1 and place at offset x=80, y=50 | 5/5 | Success |
32+
| Task | Command | Trials | Outcome |
33+
|------|-------------------------------------------|--------|---------|
34+
| 1 | Go to P1 | 5/5 | Success |
35+
| 2 | Go linear to P2 | 5/5 | Success |
36+
| 3 | Open gripper | 5/5 | Success |
37+
| 4 | Teach current position as P3 | 5/5 | Success |
38+
| 5 | Pick at P1 and place at P2 | 5/5 | Success |
39+
| 6 | Pick at P1 and place at offset x=80, y=50 | 5/5 | Success |
3940

4041
### Test 1.2 — UR10e (criterion: ≥ 24/30)
4142

4243
**Result: 30/30 · PASS**
4344

44-
The identical task set produced identical outcomes after backend swap. The UR backend registered only the UR and mock controllers, confirming correct vendor isolation. First-connection activation took 22 seconds (hardware cold-start); all subsequent responses completed within three seconds.
45+
The identical task set produced identical outcomes after backend swap. The UR backend registered only the UR and mock controllers, confirming correct vendor isolation. First-connection activation took 22 seconds (hardware cold-start, measured from robot activation signal to ready); all subsequent responses completed within three seconds.
4546

4647
---
4748

@@ -65,32 +66,32 @@ All five trials executed successfully. ASR confidence ranged from 0.92 to 0.95 a
6566

6667
Five paraphrases of the linear-motion command and five paraphrases of the teach-pose command were issued by voice. Nine of ten produced the correct IR.
6768

68-
| Task | Paraphrase | ASR output | IR correct |
69-
|---|---|---|---|
70-
| A | Move to P2 in a straight line | "Move to P2 in a straight line." | Yes |
71-
| A | Go to P2 using linear motion | "Go to P2 using linear motion." | Yes |
72-
| A | Linear move to P2 | "Linear Move 2P2" | **No** — LLM appended erroneous gripper close |
73-
| A | Drive linearly to P2 | "Drive linearly to P2." | Yes |
74-
| A | Reach P2 via a straight path | "Reach P2 via a straight path." | Yes |
75-
| B | Save this position as P3 | "Save this position as P3." | Yes |
76-
| B | Store current pose as P3 | "Store current pose as P3." | Yes |
77-
| B | Remember this position as P3 | "Remember this position as P3." | Yes |
78-
| B | Set P3 to current position | "Set P3 to current position." | Yes |
79-
| B | Name this position P3 | "Name this position P3." | Yes |
80-
81-
**Failure analysis — trial A3:** The ASR garbled "Linear move to P2" into `"Linear Move 2P2"`. The LLM correctly extracted a `moveL to p2` command but also generated an erroneous `gripper close` step, producing a two-command sequence rather than a one-command sequence. The robot moved to P2 but executed an unrequested gripper close. No motion safety issue occurred. This represents a parser-level failure triggered by an abnormal ASR output.
69+
| Task | Paraphrase | ASR output | IR correct |
70+
|------|-------------------------------|----------------------------------|-----------------------------------------------|
71+
| A | Move to P2 in a straight line | "Move to P2 in a straight line." | Yes |
72+
| A | Go to P2 using linear motion | "Go to P2 using linear motion." | Yes |
73+
| A | Linear move to P2 | "Linear Move 2P2" | **No** — LLM appended incorrect gripper close |
74+
| A | Drive linearly to P2 | "Drive linearly to P2." | Yes |
75+
| A | Reach P2 via a straight path | "Reach P2 via a straight path." | Yes |
76+
| B | Save this position as P3 | "Save this position as P3." | Yes |
77+
| B | Store current pose as P3 | "Store current pose as P3." | Yes |
78+
| B | Remember this position as P3 | "Remember this position as P3." | Yes |
79+
| B | Set P3 to current position | "Set P3 to current position." | Yes |
80+
| B | Name this position P3 | "Name this position P3." | Yes |
81+
82+
**Failure analysis — trial A3:** The ASR garbled "Linear move to P2" into `"Linear Move 2P2"`. The LLM correctly extracted a `moveL to p2` command but also generated an incorrect `gripper close` step, producing a two-command sequence rather than a one-command sequence. The robot moved to P2 but executed an unrequested gripper close. No motion safety issue occurred. This represents a parser-level failure triggered by an abnormal ASR output.
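
The A3 failure can be expressed as a comparison between the expected and the produced command sequence. The sketch below is illustrative only: the dictionary layout of an IR step and the helper name `extra_steps` are assumptions, since the report does not show the IR schema.

```python
def extra_steps(expected, actual):
    """Return the steps in `actual` that are not accounted for by `expected`."""
    remaining = list(expected)
    extras = []
    for step in actual:
        if step in remaining:
            remaining.remove(step)  # each expected step matches at most once
        else:
            extras.append(step)
    return extras


# Hypothetical IR layout for trial A3.
expected_ir = [{"cmd": "moveL", "target": "p2"}]
actual_ir = [
    {"cmd": "moveL", "target": "p2"},
    {"cmd": "gripper", "action": "close"},  # the unrequested step
]
print(extra_steps(expected_ir, actual_ir))
```

A non-empty result marks the trial as "IR correct: No" even when the requested motion itself was parsed correctly.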
8283

8384
### Test 2.3 — Invalid Input Rejection (criterion: 5/5 rejected before motion)
8485

8586
**Result: 5/5 · PASS**
8687

87-
| Input | Rejection site | Reason logged |
88-
|---|---|---|
89-
| "Move to P99" | Backend validator | `Unknown pose: 'p99'` |
90-
| "Pick at somewhere" | Parser | `Vague or non-resolvable words such as 'somewhere' are NOT valid targets.` |
91-
| "Hello robot" | Parser | `No valid command detected. Please provide a specific robot command.` |
92-
| "Go to" | Pre-parser (dropped) | No LLM call dispatched; no backend request sent |
93-
| "Move P1 and P2 simultaneously" | Parser | `Impossible command: cannot move to two poses at the same time.` |
88+
| Input | Rejection site | Reason logged |
89+
|---------------------------------|----------------------|----------------------------------------------------------------------------|
90+
| "Move to P99" | Backend validator | `Unknown pose: 'p99'` |
91+
| "Pick at somewhere" | Parser | `Vague or non-resolvable words such as 'somewhere' are NOT valid targets.` |
92+
| "Hello robot" | Parser | `No valid command detected. Please provide a specific robot command.` |
93+
| "Go to" | Pre-parser (dropped) | No LLM call dispatched; no backend request sent |
94+
| "Move P1 and P2 simultaneously" | Parser | `Impossible command: cannot move to two poses at the same time.` |
9495

9596
No robot motion occurred for any of the five inputs. One observation: the incomplete command `"Go to"` was discarded silently before the parsing stage. No rejection log entry was written for this input. The pass criterion is met, but a logged rejection message would provide clearer evidence of intentional handling.
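
One way to obtain the logged rejection suggested above is a small guard in front of the parser that records why an input was dropped. This is a sketch under stated assumptions, not the framework's actual pre-parser: the logger name, the dangling-word rule, and the function `pre_parse_guard` are all hypothetical.

```python
import logging
from typing import Optional

logger = logging.getLogger("cobot.preparser")  # hypothetical logger name

# Hypothetical rule: a command ending in one of these words has no target.
DANGLING_ENDINGS = {"to", "at", "as", "from"}


def pre_parse_guard(transcript: str) -> Optional[str]:
    """Return a rejection reason, or None if the transcript may go to the parser."""
    words = transcript.strip().rstrip(".!?").lower().split()
    if not words:
        reason = "Empty transcript."
    elif words[-1] in DANGLING_ENDINGS:
        reason = f"Incomplete command: missing target after '{words[-1]}'."
    else:
        return None
    # Write the rejection to the log so the drop is visible as intentional handling.
    logger.warning("Rejected before parsing: %r (%s)", transcript, reason)
    return reason
```

With a guard of this kind, `"Go to"` would still be dropped before any LLM call, but the log would contain an explicit rejection entry alongside the parser and backend rejections.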
9697

@@ -120,18 +121,39 @@ All three trials executed successfully. Backend logs confirm the wait was handle
120121

121122
**Result: documented · PASS** *(descriptive — no threshold)*
122123

123-
| Metric | Value |
124-
|---|---|
125-
| Physical swap time | < 2 minutes |
126-
| Total inter-session gap (log-derived) | 2 min 11 s (11:42:54 → 11:45:05) |
127-
| Modified files | 0 (backend swap only — no code edited) |
128-
| Lines of code changed | 0 |
129-
| Configuration steps | 5 (unplug Franka → move laptop → plug UR → start backend → switch robot in GUI) |
124+
| Metric | Value |
125+
|---------------------------------------|---------------------------------------------------------------------------------|
126+
| Physical swap time | < 2 minutes |
127+
| Total inter-session gap (log-derived) | 2 min 11 s (11:42:54 → 11:45:05) |
128+
| Modified files | 0 (backend swap only — no code edited) |
129+
| Lines of code changed | 0 |
130+
| Configuration steps | 5 (unplug Franka → move laptop → plug UR → start backend → switch robot in GUI) |
130131

131132
The upstream modules — `pipeline.py`, `ASR_module.py`, `parsing_module.py` — are identical across both vendor sessions, confirmed by the continuous frontend log. The IR format is identical across both backends. The Franka backend registered three controllers (mock, franka, ur) at startup; the UR backend registered two (mock, ur), confirming correct vendor isolation without code changes.
132133

133134
---
134-
135+
## End-to-End Timing
136+
137+
Timing was measured from the moment the audio recording stopped to the moment the backend confirmed execution complete. Two hardware-startup events are excluded from these figures: the Franka MoveIt stack initialisation on the first trial (18 s, one-time) and the UR cold-start activation on the first UR command (37 s, one-time). All 123 remaining trials are included. Timestamps are taken from the frontend log at one-second resolution.
138+
139+
Averaged over all command types, the pipeline portion took 1.0 s for ASR transcription and 3.7 s for LLM parsing, approximately 4.6 s in total. Robot execution time is the main source of variation and depends on the number of motion steps and the platform.
140+
141+
| Command type | N | ASR avg | LLM avg | Exec avg | Total avg | Total range |
142+
|--------------------|----|---------|---------|----------|-----------|-------------|
143+
| Move (joint) | 37 | 1.0 s | 3.2 s | 1.5 s | 5.7 s | 4–6 s |
144+
| Move (linear) | 17 | 0.8 s | 3.4 s | 1.9 s | 6.1 s | 5–8 s |
145+
| Gripper | 21 | 0.9 s | 3.1 s | 2.1 s | 6.0 s | 4–8 s |
146+
| Teach pose | 16 | 0.9 s | 3.3 s | 0.1 s | 4.3 s | 4–5 s |
147+
| Two-step sequence | 5 | 0.8 s | 4.0 s | 3.2 s | 8.0 s | 8 s |
148+
| Pick & place | 11 | 1.0 s | 5.1 s | 7.8 s | 13.9 s | 12–17 s |
149+
| Pick + offset | 11 | 1.2 s | 5.2 s | 7.0 s | 13.4 s | 11–16 s |
150+
| Multi-step (5-cmd) | 5 | 1.4 s | 5.0 s | 7.4 s | 13.8 s | 13–14 s |
151+
152+
Pick-and-place tasks split by platform: Franka averaged 12.1 s (range 11–13 s); UR10e averaged 15.5 s (range 15–17 s). The difference reflects robot motion speed rather than any pipeline difference — the pipeline contribution is identical across platforms.
153+
154+
The LLM parsing time scales with command complexity: 3.1–3.4 s for single-step commands, rising to 5.0–5.2 s for four- and five-step sequences. ASR time remains stable across all command types at approximately 1 s.
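
The overall pipeline averages quoted in this section (1.0 s ASR, 3.7 s LLM, about 4.6 s combined) can be cross-checked against the per-command table by weighting each row by its trial count N. The sketch below uses only the rounded values from the table, so it is a consistency check rather than a recomputation from the logs; the variable names are illustrative.

```python
# (command type, N, ASR avg in s, LLM avg in s), copied from the timing table.
rows = [
    ("Move (joint)",       37, 1.0, 3.2),
    ("Move (linear)",      17, 0.8, 3.4),
    ("Gripper",            21, 0.9, 3.1),
    ("Teach pose",         16, 0.9, 3.3),
    ("Two-step sequence",   5, 0.8, 4.0),
    ("Pick & place",       11, 1.0, 5.1),
    ("Pick + offset",      11, 1.2, 5.2),
    ("Multi-step (5-cmd)",  5, 1.4, 5.0),
]

total_n = sum(n for _, n, _, _ in rows)
asr_avg = sum(n * asr for _, n, asr, _ in rows) / total_n
llm_avg = sum(n * llm for _, n, _, llm in rows) / total_n

print(total_n)                        # number of trials covered by the table
print(round(asr_avg, 1))              # weighted ASR average
print(round(llm_avg, 1))              # weighted LLM average
print(round(asr_avg + llm_avg, 1))    # weighted pipeline total
```

The trial counts sum to 123, and the weighted averages round to 1.0 s, 3.7 s, and 4.6 s, matching the figures reported above. The combined value is 4.6 s rather than 4.7 s because the rounding is applied after the unrounded averages are summed.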
155+
156+
---
135157
## Cross-Session Observations
136158

137159
### ASR performance
