# Benchmark Metrics

This document defines the current and planned metric vocabulary for IX-HapticSight benchmarks.

The purpose of this document is to make benchmark numbers interpretable.
A benchmark metric is only useful if a reviewer can answer:

- what the metric measures
- how it is counted
- what layer produced it
- what the metric does **not** prove

At the current repository stage, the benchmark layer is still mostly software-path and evidence-structure oriented.
That means the current metrics are strongest for:

- decision-path correctness
- execution-path acceptance or denial
- structured event emission
- timing of repository-side handling

The repo is **not** yet at a stage where physical contact quality, force-control quality, or real hardware timing claims should be made from benchmark numbers alone.

---
## 1. Purpose

The benchmark metric system exists to support:

- deterministic comparisons across repo changes
- clearer PASS/FAIL reasoning
- structured evidence summaries
- later CI-style regression checks
- future expansion into replay and HIL metrics

The benchmark system should prefer a small number of explicit, stable metrics over a large number of vague ones.

---
## 2. Metric Philosophy

IX-HapticSight metrics should follow these rules:

1. **Explicit definition**
   - every metric should say exactly what is counted

2. **Stable meaning**
   - metric names should not silently change meaning across versions

3. **Repository honesty**
   - a metric should not imply physical evidence that the current repo does not actually have

4. **Layer clarity**
   - it should be clear whether a metric comes from:
     - decision logic
     - execution adapter behavior
     - logging/replay
     - benchmark harness
     - future HIL data

5. **No “safety score” theater**
   - broad vanity scores are weaker than explicit measurements

---
## 3. Current Implemented Metrics

At the current repository stage, the benchmark runner emits these built-in metrics:

### `event_count`

**Unit:** `count`

**Definition:**
The number of structured events buffered by the event recorder during one benchmark scenario.

**Produced by:**
- `src/ohip_bench/runner.py`
- `src/ohip_logging/recorder.py`

**What it is useful for:**
- checking that scenarios are producing a structured event trail
- detecting drift in event emission patterns
- providing a simple signal that logging did or did not occur

**What it does not prove:**
- log completeness in a formal sense
- causal correctness of every event
- hardware truth
- physical safety

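As an illustration of how this metric can be derived, here is a minimal sketch assuming an in-memory recorder that buffers events in a list. The `EventRecorder` class and `event_count_metric` helper are hypothetical names for this example, not the actual interfaces in `src/ohip_logging/recorder.py` or `src/ohip_bench/runner.py`.

```python
# Illustrative sketch only. The recorder and metric record shapes below are
# assumptions for this document, not the real APIs in src/ohip_logging/recorder.py.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class EventRecorder:
    """Hypothetical in-memory recorder that buffers structured events."""
    events: list[dict[str, Any]] = field(default_factory=list)

    def record(self, event: dict[str, Any]) -> None:
        self.events.append(event)


def event_count_metric(recorder: EventRecorder) -> dict[str, Any]:
    """Report event_count as a named, unit-tagged metric record."""
    return {"name": "event_count", "unit": "count", "value": len(recorder.events)}
```
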
---
### `decision_duration_ms`

**Unit:** `ms`

**Definition:**
Wall-clock elapsed time spent by the runtime service while handling one benchmark request inside the benchmark runner.

**Produced by:**
- `src/ohip_bench/runner.py`

**What it is useful for:**
- comparing repository-side processing changes
- identifying obvious regressions in benchmark-path runtime handling
- measuring coarse software-path timing changes

**What it does not prove:**
- real-time guarantees
- middleware latency
- actuator latency
- physical stop time
- human-safe timing bounds

This is a repository-side timing metric, not a deployment safety timing metric.

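The following sketch shows the kind of wall-clock measurement this metric implies, assuming a synchronous `handle_request` callable on the runtime service. The function name and metric record shape are illustrative, not the actual runner code.

```python
# Illustrative timing sketch; `handle_request` and the metric record shape are
# assumptions for this example, not the actual benchmark runner implementation.
import time
from typing import Any, Callable


def timed_decision(handle_request: Callable[[dict], Any], request: dict) -> tuple[Any, dict]:
    """Handle one benchmark request and report repository-side wall-clock time in ms."""
    start = time.perf_counter()
    response = handle_request(request)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return response, {"name": "decision_duration_ms", "unit": "ms", "value": elapsed_ms}
```

Because the measurement wraps only the software path inside the runner process, it cannot support real-time, middleware, or actuator timing claims.
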
---
## 4. Current Observed Fields That Behave Like Metrics

Some structured observation fields are not emitted as standalone numeric metrics yet, but they already function like benchmark evidence fields.

These include:

### `observed_status`

Examples:
- `APPROVED`
- `DENIED`
- `REQUIRES_VERIFICATION`
- `ERROR`

This is a categorical outcome field, not a numeric metric, but it is still central to benchmark evaluation.

---

### `observed_executable`

Examples:
- `True`
- `False`

This distinguishes requests that were approved and executable from requests that were approved but not executable, and from requests that were denied.

Again, this is not numeric, but it is extremely important.

---

### `observed_fault_reason`

Examples:
- `consent_missing_or_invalid`
- `session_safety_red`

This is a categorical evidence field rather than a numeric metric.
It helps detect whether the repo denied or faulted for the **right reason** rather than merely denying in general.

---

### `observed_execution_status`

Examples:
- `ACCEPTED`
- `REJECTED`
- `ABORTED`
- `SAFE_HOLD`

This is also categorical evidence rather than numeric measurement. All four observed fields are illustrated together in the sketch below.

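For concreteness, here is a minimal sketch of a per-scenario observation record carrying all four fields. The `ScenarioObservation` name and typing are assumptions for illustration, not the actual benchmark harness data structure.

```python
# Illustrative only: the field names mirror the observed_* vocabulary above, but
# this class is a hypothetical container, not the real harness data structure.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScenarioObservation:
    observed_status: str                      # e.g. "APPROVED", "DENIED", "REQUIRES_VERIFICATION", "ERROR"
    observed_executable: bool                 # approved-and-executable vs. not
    observed_fault_reason: Optional[str]      # e.g. "consent_missing_or_invalid", or None
    observed_execution_status: Optional[str]  # e.g. "ACCEPTED", "REJECTED", "ABORTED", "SAFE_HOLD"
```
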
---
## 5. Why Current Metrics Are Intentionally Narrow

At this repo stage, narrow metrics are better than inflated metrics.

Why:
- the repo is still mostly benchmarking logic paths and structured evidence paths
- there is no HIL measurement layer yet
- there is no real actuator timing or real contact measurement integrated into benchmark results yet
- pretending otherwise would be presenting fiction as data

So the current metric layer is intentionally modest.

That is a strength, not a weakness.

---
## 6. Recommended Near-Term Metrics

The next wave of metrics should grow carefully from the current benchmark and runtime layers.

### 6.1 Decision-path metrics

These are still software-side metrics, but valuable.

#### `decision_status_match`
**Type:** boolean or categorical
**Meaning:** whether observed decision status matched expectation

#### `execution_status_match`
**Type:** boolean or categorical
**Meaning:** whether observed execution status matched expectation

#### `fault_reason_match`
**Type:** boolean or categorical
**Meaning:** whether the observed fault reason matched expectation

These could remain implicit through PASS/FAIL logic, but exposing them directly would strengthen reporting.

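A minimal sketch of how these match metrics could be computed from an expectation/observation pair; the key names are assumptions aligned with this document, not the runner's actual schema.

```python
# Illustrative sketch of match metrics; the expectation/observation key names
# are assumptions aligned with this document, not the runner's actual schema.
from typing import Any


def match_metrics(expected: dict[str, Any], observed: dict[str, Any]) -> dict[str, bool]:
    """Compare expected vs. observed outcome fields for one scenario."""
    def matches(key: str) -> bool:
        return observed.get(key) == expected.get(key)

    return {
        "decision_status_match": matches("status"),
        "execution_status_match": matches("execution_status"),
        "fault_reason_match": matches("fault_reason"),
    }
```
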
---
### 6.2 Logging-path metrics

These would tighten evidence quality.

#### `state_transition_event_count`
How many transition events were emitted

#### `fault_event_count`
How many structured fault events were emitted

#### `execution_status_event_count`
How many execution-status events were emitted

#### `event_order_valid`
Whether the event sequence satisfies expected ordering constraints

These would be especially useful once replay-integrity benchmarks are added.

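A sketch of how these counts and the ordering check could be computed, assuming each event is a dict with a `type` field; the type names used here are illustrative, not the actual event schema.

```python
# Illustrative sketch; event dicts with a "type" key and these type names are
# assumptions for the example, not the actual event schema.
from collections import Counter
from typing import Any


def logging_path_metrics(events: list[dict[str, Any]], expected_order: list[str]) -> dict[str, Any]:
    """Count event categories and check that expected types appear in order."""
    type_counts = Counter(e.get("type") for e in events)

    # Ordering check: expected_order must appear as a subsequence of emitted types.
    emitted = iter(e.get("type") for e in events)
    order_valid = all(t in emitted for t in expected_order)

    return {
        "state_transition_event_count": type_counts.get("state_transition", 0),
        "fault_event_count": type_counts.get("fault", 0),
        "execution_status_event_count": type_counts.get("execution_status", 0),
        "event_order_valid": order_valid,
    }
```
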
---
### 6.3 Replay-path metrics

Once replay-integrity benchmarks exist, useful metrics include:

#### `replay_event_count_match`
Whether the replayed event count matched the source event count

#### `replay_first_event_match`
Whether the first replayed event matched the first source event

#### `replay_last_event_match`
Whether the last replayed event matched the last source event

#### `replay_order_integrity`
Whether replay preserved event ordering

These would strengthen the evidence story around reproducibility.

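A minimal sketch of replay comparison under the assumption that events compare by simple equality; the function name and return shape are illustrative, not existing replay tooling.

```python
# Illustrative sketch; assumes events compare by equality and that order matters.
# The function name and return shape are not the actual replay tooling.
from typing import Any


def replay_integrity_metrics(source: list[dict[str, Any]], replayed: list[dict[str, Any]]) -> dict[str, bool]:
    """Compare a replayed event stream against its source stream."""
    both_nonempty = bool(source) and bool(replayed)
    return {
        "replay_event_count_match": len(source) == len(replayed),
        "replay_first_event_match": both_nonempty and source[0] == replayed[0],
        "replay_last_event_match": both_nonempty and source[-1] == replayed[-1],
        "replay_order_integrity": source == replayed,
    }
```
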
---
### 6.4 Execution-path metrics

Once the simulated execution adapter is benchmarked more explicitly, useful metrics include:

#### `execution_acceptance_rate`
Fraction of scenarios whose execution requests were accepted

#### `abort_path_success_rate`
Fraction of abort scenarios that reached the expected execution state

#### `safe_hold_path_success_rate`
Fraction of safe-hold scenarios that reached the expected execution state

#### `execution_progress_terminal_consistency`
Whether terminal execution states behave consistently across scenarios

These are still software-path metrics unless backed by real runtime measurements.

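A sketch of one such rate, assuming per-scenario result dicts that carry `observed_execution_status`; the field and function names are illustrative, not the actual benchmark output schema.

```python
# Illustrative rate computation; the per-scenario result fields are assumptions
# consistent with this document, not the actual benchmark output schema.
from typing import Any, Optional


def execution_acceptance_rate(results: list[dict[str, Any]]) -> Optional[float]:
    """Fraction of scenarios whose execution requests were ACCEPTED."""
    eligible = [r for r in results if r.get("observed_execution_status") is not None]
    if not eligible:
        return None  # avoid reporting a rate when no execution requests were made
    accepted = sum(1 for r in eligible if r["observed_execution_status"] == "ACCEPTED")
    return accepted / len(eligible)
```
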
---
## 7. Future HIL Metrics

This is where the metric system becomes much more serious.

Once HIL scaffolding is connected to actual measurements, the benchmark/evidence layer should eventually support metrics like:

### 7.1 Contact metrics
- peak measured force
- dwell duration
- contact onset latency
- contact release latency
- contact-zone localization error

### 7.2 Retreat metrics
- retreat start latency
- retreat completion time
- retreat failure rate
- safe-hold fallback rate

### 7.3 Fault metrics
- overforce detection latency
- thermal threshold trigger latency
- watchdog-trigger latency
- fault-to-hold transition latency

### 7.4 Logging/evidence metrics
- evidence bundle completeness
- missing-event rate
- traceability coverage ratio

These would be strong metrics **only if backed by actual instrumentation**, not simulation theater.

---
## 8. Metric Naming Rules

Metric names should aim to be:

- concise
- literal
- stable
- not marketing language

Good:
- `event_count`
- `decision_duration_ms`
- `fault_event_count`

Bad:
- `interaction_quality_score`
- `trust_index`
- `safety_rating`

Those broad names hide too much and suggest more evidence than the repo has.

---
## 9. Metric Units

Units should always be explicit where applicable.

Common units for this repo include:

- `count`
- `ms`
- `s`
- `N`
- `Nm`
- `kPa`
- `mm`
- `°C`

If a metric has no natural physical unit, it should either:
- be categorical, or
- be clearly unitless

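One way to keep units explicit is to refuse metric records whose unit is not in the documented vocabulary. The helper below is a sketch of that idea, with an assumed allowed-unit set; it is not an existing repo API.

```python
# Illustrative sketch of a unit-tagged metric record; the allowed-unit set and
# helper name are assumptions, not an existing repo API.
from typing import Union

ALLOWED_UNITS = {"count", "ms", "s", "N", "Nm", "kPa", "mm", "°C", "unitless"}


def make_metric(name: str, value: Union[int, float], unit: str) -> dict:
    """Build a metric record, refusing units outside the documented vocabulary."""
    if unit not in ALLOWED_UNITS:
        raise ValueError(f"unknown unit {unit!r} for metric {name!r}")
    return {"name": name, "unit": unit, "value": value}
```
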
---
## 10. Relationship to PASS/FAIL

PASS/FAIL is not a metric.
It is an outcome.

Metrics support the reasoning behind PASS/FAIL.

Example:
- PASS because:
  - expected status matched
  - expected execution status matched
  - event count was present
  - no unexpected fault reason occurred

The repo should not collapse all evidence into one pass/fail badge and call it a day.

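A minimal sketch of reporting PASS/FAIL alongside the individual checks that justify it, rather than as a bare badge; the check names mirror the example above and are otherwise illustrative.

```python
# Illustrative sketch; the check names mirror the example above, and the
# reporting shape is an assumption rather than the actual harness output.
from typing import Any


def derive_pass_fail(checks: dict[str, bool]) -> dict[str, Any]:
    """Report PASS/FAIL together with the individual checks that justify it."""
    return {
        "result": "PASS" if all(checks.values()) else "FAIL",
        "checks": checks,  # keep the per-check evidence visible, not just the badge
    }


# Example usage with the checks listed above:
outcome = derive_pass_fail({
    "decision_status_match": True,
    "execution_status_match": True,
    "event_count_present": True,
    "no_unexpected_fault_reason": True,
})
```
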
---
## 11. Current Metric Gaps

Important current gaps include:

- no explicit event-order metrics
- no explicit replay-integrity metrics
- no dedicated safe-hold or abort benchmark metrics
- no force/thermal/proximity/tactile physical metrics in benchmark outputs
- no HIL metrics yet
- no benchmark artifact manifest completeness metric yet

These gaps should remain visible.

---
## 12. Review Questions

When adding a new metric, reviewers should ask:

1. What exactly does this metric measure?
2. What layer produced it?
3. Does the metric imply more evidence than the repo actually has?
4. Is the metric stable enough to compare across runs?
5. Is the metric useful for a real reviewer, or just decorative?

If those answers are weak, the metric is weak.

---
## 13. Final Rule

A benchmark metric should reduce ambiguity, not create it.

If a number sounds impressive but cannot be tied to a precise definition and an actual evidence source, it should not be in this repo.
