# Benchmark Overview

This document defines the benchmark philosophy and current benchmark architecture for IX-HapticSight.

The benchmark layer exists to turn repository claims into repeatable scenario checks.
It is not a certification artifact.
It is not real-world deployment proof.
It is a structured way to ask:

- what scenario was tested
- what outcome was expected
- what outcome was observed
- what metrics were recorded
- whether the result matched the expectation

That is the minimum discipline required for a serious interaction-governance repo.

---

## 1. Purpose

The benchmark system exists to support:

- deterministic regression checks
- explicit scenario-based evaluation
- structured result comparison across repo changes
- replay-friendly evidence generation
- clearer separation between claims and measured repository behavior

The benchmark layer is intended to answer:
“Did the current repo behave the way the repo says it should?”

That is narrower than asking whether a deployed robot is safe in the real world.

---

## 2. Benchmark Philosophy

IX-HapticSight benchmarks should follow these rules:

1. **Scenario first**
   - every run starts from an explicit scenario definition
   - no hidden assumptions
   - no mystery runtime conditions

2. **Expectation first**
   - each scenario declares what should happen
   - approval, denial, fault behavior, and execution behavior should be explicit

3. **Structured observation**
   - results should be collected as machine-readable observation records
   - event counts and execution outcomes should not depend on casual console reading

4. **Determinism over theater**
   - benchmark value comes from repeatability, not dramatic demos

5. **Repository truth, not hype**
   - benchmark results are evidence about the repo’s current behavior
   - they are not blanket safety guarantees

---

## 3. Current Benchmark Components

The current benchmark layer includes:

### Core models
- `src/ohip_bench/models.py`
- scenario, expectation, observation, metric, and result structures

### Runner
- `src/ohip_bench/runner.py`
- deterministic scenario execution against a fresh runtime service

### Built-in scenarios
- `src/ohip_bench/scenarios.py`
- current catalog of consent and safety scenarios

### Reporting
- `src/ohip_bench/reporting.py`
- result summarization and export helpers

### Related runtime dependencies
- `src/ohip_runtime/runtime_service.py`
- `src/ohip_logging/`
- `src/ohip_interfaces/`
- `src/ohip/`

The benchmark layer does not stand alone.
It evaluates the integrated behavior of those layers.

---

## 4. Current Benchmark Domains

The benchmark model supports these domains:

- `CONSENT`
- `SAFETY`
- `PLANNING`
- `EXECUTION`
- `LOGGING`
- `REPLAY`
- `INTEGRATION`
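
For orientation, these labels map naturally onto a small string-valued enumeration. The sketch below is an assumption drawn from the list above, not the actual definition in `src/ohip_bench/models.py`.

```python
from enum import Enum


class BenchmarkDomain(str, Enum):
    """Illustrative sketch only; the real definition lives in src/ohip_bench/models.py."""

    CONSENT = "CONSENT"
    SAFETY = "SAFETY"
    PLANNING = "PLANNING"
    EXECUTION = "EXECUTION"
    LOGGING = "LOGGING"
    REPLAY = "REPLAY"
    INTEGRATION = "INTEGRATION"
```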

At the current repository stage, the strongest implemented coverage is in:

- consent-path evaluation
- safety-path denial behavior
- runtime integration behavior through the service layer

Future versions should increase coverage for:

- execution fault behavior
- replay integrity
- logging completeness
- HIL evidence ingestion
- state-transition invariants

---

## 5. Current Scenario Flow

A typical benchmark run currently works like this:

1. build a fresh runtime service
2. create a fresh interaction session from scenario inputs
3. optionally apply explicit consent based on scenario inputs
4. construct a runtime request and optional nudge
5. execute the runtime request
6. collect:
   - decision outcome
   - execution response if present
   - active fault reason if present
   - structured event count
   - timing metrics
7. compare observed output against the scenario expectation
8. emit a structured benchmark result
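
A minimal sketch of that flow is shown below, assuming hypothetical stand-ins for the runtime service and the expectation comparison. `build_service`, `matches`, and the `service.*` and `response.*` names are placeholders, not the actual `ohip_bench` or `ohip_runtime` API; the real runner in `src/ohip_bench/runner.py` is the source of truth.

```python
import time
from typing import Any, Callable


def run_scenario(scenario: dict,
                 build_service: Callable[[], Any],
                 matches: Callable[[dict, dict], bool]) -> dict:
    """Hypothetical single-scenario flow mirroring steps 1-8 above."""
    started = time.perf_counter()
    try:
        service = build_service()                              # 1. fresh runtime service
        session = service.create_session(scenario["inputs"])   # 2. fresh session from scenario inputs
        if scenario["inputs"].get("explicit_consent"):         # 3. optional explicit consent
            service.apply_consent(session)
        request = service.build_request(scenario["inputs"])    # 4. runtime request (+ optional nudge)
        response = service.execute(session, request)           # 5. execute the runtime request
        observation = {                                        # 6. collect a structured observation
            "decision_status": response.decision_status,
            "executable": response.executable,
            "fault_reason": response.fault_reason,
            "execution_status": response.execution_status,
            "event_count": len(response.events),
        }
        outcome = "PASS" if matches(scenario["expectation"], observation) else "FAIL"  # 7. compare
    except Exception:
        observation, outcome = {}, "ERROR"                     # the benchmark itself failed to run
    return {                                                   # 8. emit a structured result
        "scenario_id": scenario["scenario_id"],
        "outcome": outcome,
        "observation": observation,
        "metrics": {"duration_s": time.perf_counter() - started},
    }
```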

This is intentionally boring.
That is a strength.

---

## 6. What a Benchmark Scenario Contains

A benchmark scenario currently contains:

- `scenario_id`
- `title`
- `domain`
- `description`
- `inputs`
- `expectation`
- `tags`

The expectation may include:

- expected decision status
- expected executable flag
- expected fault reason
- expected execution status
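
Read as a sketch only, those field lists suggest structures roughly like the following. Names and types here are assumptions; the authoritative definitions are in `src/ohip_bench/models.py`.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class BenchmarkExpectation:
    """Illustrative only; mirrors the expectation fields listed above."""

    decision_status: Optional[str] = None
    executable: Optional[bool] = None
    fault_reason: Optional[str] = None
    execution_status: Optional[str] = None


@dataclass
class BenchmarkScenario:
    """Illustrative only; mirrors the scenario fields listed above."""

    scenario_id: str
    title: str
    domain: str  # one of the domains from section 4
    description: str
    inputs: dict[str, Any] = field(default_factory=dict)
    expectation: BenchmarkExpectation = field(default_factory=BenchmarkExpectation)
    tags: tuple[str, ...] = ()
```

Keeping every expectation field optional lets a scenario assert only the dimensions it actually cares about.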

This means the benchmark system can distinguish:

- “approved but not executable”
- “denied with the wrong reason”
- “approved but with the wrong execution backend response”

That is already more useful than vague pass/fail prose.

---

## 7. What a Benchmark Result Contains

A benchmark result currently contains:

- scenario ID
- domain
- outcome (`PASS`, `FAIL`, `ERROR`, `SKIPPED`)
- structured observation
- structured metrics
- reason code
- start and finish times
- derived duration

A structured observation may include:

- observed decision status
- observed executable flag
- observed fault reason
- observed execution status
- event count
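
Again as a non-authoritative sketch, those fields suggest result and observation records shaped roughly like this; the actual structures in `src/ohip_bench/models.py` may differ.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class BenchmarkObservation:
    """Illustrative only; mirrors the observation fields listed above."""

    decision_status: Optional[str] = None
    executable: Optional[bool] = None
    fault_reason: Optional[str] = None
    execution_status: Optional[str] = None
    event_count: int = 0


@dataclass
class BenchmarkResult:
    """Illustrative only; mirrors the result fields listed above."""

    scenario_id: str
    domain: str
    outcome: str  # "PASS", "FAIL", "ERROR", or "SKIPPED"
    observation: BenchmarkObservation
    metrics: dict = field(default_factory=dict)
    reason_code: Optional[str] = None
    started_at: Optional[datetime] = None
    finished_at: Optional[datetime] = None

    @property
    def duration_seconds(self) -> Optional[float]:
        # Derived duration, as described above.
        if self.started_at is None or self.finished_at is None:
            return None
        return (self.finished_at - self.started_at).total_seconds()
```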

This gives the repository a baseline evidence spine.

---

## 8. Current Limitations

The benchmark layer is useful, but it is still early-stage.

Current limitations include:

- scenarios are still relatively small in number
- there is no hardware-in-the-loop path yet
- there is no real robot backend under test
- most metrics are currently high-level, not physical
- benchmark scenarios still emphasize logic-path correctness over physical execution truth
- there is not yet a persistent benchmark artifact manifest system

That is acceptable as long as the repo stays honest about it.

---

## 9. What Current Benchmarks Do Prove

Current benchmarks can help prove things such as:

- a consent path allows or blocks contact as expected
- a safety-red session blocks execution as expected
- the runtime service emits a structured event trail
- the execution adapter accepts, rejects, or faults as expected
- scenario expectations are compared in a repeatable way

That is meaningful repository evidence.

---

## 10. What Current Benchmarks Do Not Prove

Current benchmarks do **not** prove:

- real-world physical safety
- human comfort or acceptance
- force quality under real hardware contact
- certified collaborative behavior
- hardware watchdog latency
- thermal dissipation safety in physical deployment
- regulatory compliance
- medical or therapeutic suitability

Those require stronger evidence classes later.

---

## 11. Relationship to Replay

The benchmark system is designed to align with the structured logging and replay layer.

This matters because a serious benchmark should be:

- re-runnable
- reviewable
- inspectable after the fact

The replay layer supports that by preserving structured event trails that can later be:

- compared
- reloaded
- grouped
- inspected by session, request, and event kind
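
As one small, hypothetical illustration of that last point, grouping a preserved event trail needs nothing beyond standard-library tooling; the event keys below are assumptions, not the actual schema used by `src/ohip_logging/`.

```python
from collections import defaultdict


def group_events(events: list[dict]) -> dict[tuple[str, str], list[dict]]:
    """Group a replayed event trail by (session_id, event kind).

    The key names here are placeholders for whatever the structured
    log records actually carry.
    """
    grouped: defaultdict[tuple[str, str], list[dict]] = defaultdict(list)
    for event in events:
        grouped[(event["session_id"], event["kind"])].append(event)
    return dict(grouped)
```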

Benchmarking without replay is weaker.
Replay without scenario expectations is also weaker.
They are stronger together.

---

## 12. Relationship to HIL

The current benchmark layer is software-first.

The next major maturity step is to connect it to HIL-style evidence, where scenarios may eventually include:

- calibrated load-cell data
- overforce timing checks
- retreat timing measurements
- backend fault injection records
- thermal trip behavior
- stop/hold timing results

That is not implemented yet, but the current benchmark structure is intentionally shaped so that future evidence can be added without rewriting the whole system.

---

## 13. Benchmark Outcome Semantics

### PASS
Observed behavior matched the explicit expectation.

### FAIL
Scenario executed, but observed behavior did not match the expectation.

### ERROR
The benchmark itself could not run correctly because of malformed input or a runner/runtime issue.

### SKIPPED
Scenario was intentionally not executed.
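
A minimal sketch of how a runner can separate those outcomes, assuming a dict-shaped scenario, a `run` callable that executes it, and a `matches` comparison helper; this is illustrative, not the actual logic in `src/ohip_bench/runner.py`.

```python
from typing import Callable


def decide_outcome(scenario: dict,
                   run: Callable[[dict], dict],
                   matches: Callable[[dict, dict], bool]) -> str:
    """Illustrative outcome classification matching the semantics above."""
    if scenario.get("skip"):
        return "SKIPPED"              # intentionally not executed
    try:
        observation = run(scenario)   # scenario executed against the runtime
    except Exception:
        return "ERROR"                # the benchmark itself failed to run
    if matches(scenario["expectation"], observation):
        return "PASS"                 # observed behavior matched expectation
    return "FAIL"                     # executed, but diverged from expectation
```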

These distinctions matter.
A FAIL says the repo behavior diverged from expectation.
An ERROR says the benchmark setup or execution path itself was broken.

---

## 14. Reporting Direction

The reporting layer currently supports:

- aggregate counts
- per-domain grouping
- per-outcome grouping
- pass-rate summaries
- export-friendly dictionaries
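
A rough sketch of that kind of summarization over result records, assuming each result exposes `outcome` and `domain` attributes as in section 7; the real helpers in `src/ohip_bench/reporting.py` may differ.

```python
from collections import Counter


def summarize(results: list) -> dict:
    """Illustrative aggregate summary: counts, per-domain/outcome grouping, pass rate."""
    by_outcome = Counter(r.outcome for r in results)
    by_domain = Counter(r.domain for r in results)
    # One possible pass-rate definition: passes over executed scenarios,
    # ignoring ERROR and SKIPPED results.
    executed = by_outcome["PASS"] + by_outcome["FAIL"]
    pass_rate = by_outcome["PASS"] / executed if executed else 0.0
    return {
        "total": len(results),
        "by_outcome": dict(by_outcome),
        "by_domain": dict(by_domain),
        "pass_rate": pass_rate,
    }
```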

That is enough for local inspection and future CI-style checks.

Later reporting could add:

- baseline-vs-head comparisons
- event-count drift alerts
- trend snapshots
- benchmark artifact manifests

---

## 15. Review Questions

When adding a new benchmark, reviewers should ask:

1. Is the scenario explicit?
2. Is the expectation explicit?
3. Does the scenario measure something real about the repository?
4. Is the result structured and reproducible?
5. Is the benchmark claiming more than it actually tests?
6. Can the output be replayed or reviewed later?

If those answers are weak, the benchmark is probably weak too.

---

## 16. Near-Term Priorities

The highest-value next benchmark improvements are:

1. expand the built-in scenario catalog
2. add replay-integrity benchmarks
3. add event-log completeness benchmarks
4. add execution-fault and safe-hold benchmark cases
5. add state-transition expectation benchmarks
6. prepare HIL-compatible evidence bundle conventions

---

## 17. Final Rule

A benchmark is only valuable if it narrows uncertainty.

If it cannot tell a reviewer what happened, why it mattered, and whether it matched the stated expectation, it is just decoration.