# Benchmark Overview

This document defines the benchmark philosophy and current benchmark architecture for IX-HapticSight.

The benchmark layer exists to turn repository claims into repeatable scenario checks.
It is not a certification artifact.
It is not real-world deployment proof.
It is a structured way to ask:

- what scenario was tested
- what outcome was expected
- what outcome was observed
- what metrics were recorded
- whether the result matched the expectation

That is the minimum discipline required for a serious interaction-governance repo.

---

## 1. Purpose

The benchmark system exists to support:

- deterministic regression checks
- explicit scenario-based evaluation
- structured result comparison across repo changes
- replay-friendly evidence generation
- clearer separation between claims and measured repository behavior

The benchmark layer is intended to answer:
“Did the current repo behave the way the repo says it should?”

That is narrower than asking whether a deployed robot is safe in the real world.

---

## 2. Benchmark Philosophy

IX-HapticSight benchmarks should follow these rules:

1. **Scenario first**
   - every run starts from an explicit scenario definition
   - no hidden assumptions
   - no mystery runtime conditions

2. **Expectation first**
   - each scenario declares what should happen
   - approval, denial, fault behavior, and execution behavior should be explicit

3. **Structured observation**
   - results should be collected as machine-readable observation records
   - event counts and execution outcomes should not depend on casual console reading

4. **Determinism over theater**
   - benchmark value comes from repeatability, not dramatic demos

5. **Repository truth, not hype**
   - benchmark results are evidence about the repo’s current behavior
   - they are not blanket safety guarantees

---

## 3. Current Benchmark Components

The current benchmark layer includes:

### Core models
- `src/ohip_bench/models.py`
  - scenario, expectation, observation, metric, and result structures

### Runner
- `src/ohip_bench/runner.py`
  - deterministic scenario execution against a fresh runtime service

### Built-in scenarios
- `src/ohip_bench/scenarios.py`
  - current catalog of consent and safety scenarios

### Reporting
- `src/ohip_bench/reporting.py`
  - result summarization and export helpers

### Related runtime dependencies
- `src/ohip_runtime/runtime_service.py`
- `src/ohip_logging/`
- `src/ohip_interfaces/`
- `src/ohip/`

The benchmark layer does not stand alone.
It evaluates the integrated behavior of those layers.
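
To make the wiring concrete, here is a minimal sketch of how those pieces could fit together. Every class and function name below is an assumption for illustration; only the module paths and their documented responsibilities come from the list above.

```python
# Illustrative sketch only: builtin_scenarios, ScenarioRunner, summarize,
# and the RuntimeService constructor are assumed names, not the repo's real API.
from ohip_bench.scenarios import builtin_scenarios
from ohip_bench.runner import ScenarioRunner
from ohip_bench.reporting import summarize
from ohip_runtime.runtime_service import RuntimeService

def run_all():
    results = []
    for scenario in builtin_scenarios():
        # A fresh runtime service per scenario keeps runs deterministic.
        runner = ScenarioRunner(service=RuntimeService())
        results.append(runner.run(scenario))
    return summarize(results)
```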

---

## 4. Current Benchmark Domains

The benchmark model supports these domains:

- `CONSENT`
- `SAFETY`
- `PLANNING`
- `EXECUTION`
- `LOGGING`
- `REPLAY`
- `INTEGRATION`
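
As a rough sketch, the domain set could be expressed as a simple enum; the class name and its placement are assumptions, while the values are exactly the domains listed above.

```python
from enum import Enum

class BenchmarkDomain(Enum):
    # Hypothetical enum name; the values mirror the documented domain list.
    CONSENT = "CONSENT"
    SAFETY = "SAFETY"
    PLANNING = "PLANNING"
    EXECUTION = "EXECUTION"
    LOGGING = "LOGGING"
    REPLAY = "REPLAY"
    INTEGRATION = "INTEGRATION"
```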

At the current repository stage, the strongest implemented coverage is in:

- consent-path evaluation
- safety-path denial behavior
- runtime integration behavior through the service layer

Future versions should increase coverage for:
- execution fault behavior
- replay integrity
- logging completeness
- HIL evidence ingestion
- state-transition invariants

---

## 5. Current Scenario Flow

A typical benchmark run currently works like this:

1. build a fresh runtime service
2. create a fresh interaction session from scenario inputs
3. optionally apply explicit consent based on scenario inputs
4. construct a runtime request and optional nudge
5. execute the runtime request
6. collect:
   - decision outcome
   - execution response if present
   - active fault reason if present
   - structured event count
   - timing metrics
7. compare observed output against the scenario expectation
8. emit a structured benchmark result

This is intentionally boring.
That is a strength.
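
A minimal sketch of that flow, assuming hypothetical method and attribute names on the scenario and service objects (only the eight-step order is taken from the list above, and the expectation is treated as a plain dict here):

```python
import time

def run_scenario(scenario, build_runtime_service):
    # All method and attribute names here are assumptions for illustration;
    # only the eight-step order comes from the flow described above.
    started = time.monotonic()
    service = build_runtime_service()                          # step 1
    session = service.create_session(scenario.inputs)          # step 2
    if scenario.inputs.get("explicit_consent"):                # step 3
        service.apply_consent(session)
    request = service.build_request(session, scenario.inputs)  # step 4 (nudge optional)
    decision = service.execute(request)                        # step 5
    observation = {                                            # step 6
        "decision_status": decision.status,
        "execution_status": getattr(decision, "execution_status", None),
        "fault_reason": getattr(decision, "fault_reason", None),
        "event_count": len(service.events()),
    }
    metrics = {"duration_s": time.monotonic() - started}
    outcome = "PASS" if matches(observation, scenario.expectation) else "FAIL"  # step 7
    return {"scenario_id": scenario.scenario_id,               # step 8
            "outcome": outcome, "observation": observation, "metrics": metrics}

def matches(observation: dict, expectation: dict) -> bool:
    # Only declared (non-None) expectation fields are compared.
    return all(observation.get(k) == v
               for k, v in expectation.items() if v is not None)
```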

---

## 6. What a Benchmark Scenario Contains

A benchmark scenario currently contains:

- `scenario_id`
- `title`
- `domain`
- `description`
- `inputs`
- `expectation`
- `tags`

The expectation may include:
- expected decision status
- expected executable flag
- expected fault reason
- expected execution status
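
One plausible shape for these structures, sketched as frozen dataclasses; the type choices are guesses, while the field names follow the lists above.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass(frozen=True)
class Expectation:
    # None means "not asserted"; only declared fields get checked.
    decision_status: Optional[str] = None
    executable: Optional[bool] = None
    fault_reason: Optional[str] = None
    execution_status: Optional[str] = None

@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    title: str
    domain: str
    description: str
    inputs: dict[str, Any]
    expectation: Expectation
    tags: tuple[str, ...] = ()
```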

This means the benchmark system can distinguish:
- “approved but not executable”
- “denied with the wrong reason”
- “approved but wrong execution backend response”

That is already more useful than vague pass/fail prose.

---

## 7. What a Benchmark Result Contains

A benchmark result currently contains:

- scenario ID
- domain
- outcome (`PASS`, `FAIL`, `ERROR`, `SKIPPED`)
- structured observation
- structured metrics
- reason code
- start and finish times
- derived duration

A structured observation may include:
- observed decision status
- observed executable flag
- observed fault reason
- observed execution status
- event count

This gives the repository a baseline evidence spine.
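
Sketched in the same assumed dataclass style (field names mirror the lists above; types and the property name are guesses):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)
class Observation:
    decision_status: Optional[str] = None
    executable: Optional[bool] = None
    fault_reason: Optional[str] = None
    execution_status: Optional[str] = None
    event_count: int = 0

@dataclass(frozen=True)
class BenchmarkResult:
    scenario_id: str
    domain: str
    outcome: str  # "PASS" | "FAIL" | "ERROR" | "SKIPPED"
    observation: Observation
    metrics: dict[str, float]
    reason_code: Optional[str]
    started_at: datetime
    finished_at: datetime

    @property
    def duration_s(self) -> float:
        # Derived duration: never stored, always computed from the two timestamps.
        return (self.finished_at - self.started_at).total_seconds()
```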

---

## 8. Current Limitations

The benchmark layer is useful, but it is still early-stage.

Current limitations include:

- scenarios are still relatively small in number
- there is no hardware-in-the-loop path yet
- there is no real robot backend under test
- most metrics are currently high-level, not physical
- benchmark scenarios still emphasize logic-path correctness over physical execution truth
- there is not yet a persistent benchmark artifact manifest system

That is acceptable as long as the repo stays honest about it.

---

## 9. What Current Benchmarks Do Prove

Current benchmarks can help prove that:

- a consent path allows or blocks contact as expected
- a safety-red session blocks execution as expected
- the runtime service emits a structured event trail
- the execution adapter accepts, rejects, or faults as expected
- scenario expectations are compared in a repeatable way

That is meaningful repository evidence.

---

## 10. What Current Benchmarks Do Not Prove

Current benchmarks do **not** prove:

- real-world physical safety
- human comfort or acceptance
- force quality under real hardware contact
- certified collaborative behavior
- hardware watchdog latency
- thermal dissipation safety in physical deployment
- regulatory compliance
- medical or therapeutic suitability

Those require stronger evidence classes later.

---

## 11. Relationship to Replay

The benchmark system is designed to align with the structured logging and replay layer.

This matters because a serious benchmark should be:

- re-runnable
- reviewable
- inspectable after the fact

The replay layer supports that by preserving structured event trails that can later be:
- compared
- reloaded
- grouped
- inspected by session, request, and event kind

Benchmarking without replay is weaker.
Replay without scenario expectations is also weaker.
They are stronger together.
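
For instance, grouping a preserved event trail by session, request, and event kind could look like this minimal sketch (the event key names are assumptions, not the repo's logging schema):

```python
from collections import defaultdict

def group_events(events):
    # Assumes each event is a dict carrying 'session_id', 'request_id',
    # and 'kind'; the key names are illustrative, not the repo's schema.
    grouped = defaultdict(list)
    for event in events:
        key = (event["session_id"], event["request_id"], event["kind"])
        grouped[key].append(event)
    return grouped
```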

---

## 12. Relationship to HIL

The current benchmark layer is software-first.

The next major maturity step is to connect it to HIL-style evidence, where scenarios may eventually include:

- calibrated load-cell data
- overforce timing checks
- retreat timing measurements
- backend fault injection records
- thermal trip behavior
- stop/hold timing results

That is not implemented yet, but the current benchmark structure is intentionally shaped so that future evidence can be added without rewriting the whole system.

---

## 13. Benchmark Outcome Semantics

### PASS
Observed behavior matched the explicit expectation.

### FAIL
Scenario executed, but observed behavior did not match the expectation.

### ERROR
The benchmark itself could not run correctly because of malformed input or a runner/runtime issue.

### SKIPPED
Scenario was intentionally not executed.

These distinctions matter.
A FAIL says the repo behavior diverged from expectation.
An ERROR says the benchmark setup or execution path itself was broken.
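
A minimal sketch of how those four outcomes stay distinct; all names are hypothetical, and only the four outcome meanings come from the definitions above.

```python
def outcome_for(scenario, observe, matches) -> str:
    # `observe` runs the scenario and returns its structured observation;
    # `matches` compares an observation against a declared expectation.
    if "skip" in scenario.tags:
        return "SKIPPED"   # intentionally not executed
    try:
        observation = observe(scenario)
    except Exception:
        return "ERROR"     # the benchmark setup or execution path itself broke
    if matches(observation, scenario.expectation):
        return "PASS"      # observed behavior matched the explicit expectation
    return "FAIL"          # the scenario ran, but behavior diverged from expectation
```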

---

## 14. Reporting Direction

The reporting layer currently supports:

- aggregate counts
- per-domain grouping
- per-outcome grouping
- pass-rate summaries
- export-friendly dictionaries

That is enough for local inspection and future CI-style checks.
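
A sketch of what such a summary could produce; the dictionary layout and the convention of excluding skips from the pass rate are assumptions.

```python
from collections import Counter

def summarize(results) -> dict:
    # Assumes each result is a dict with 'outcome' and 'domain' keys.
    by_outcome = Counter(r["outcome"] for r in results)
    by_domain = Counter(r["domain"] for r in results)
    executed = sum(n for o, n in by_outcome.items() if o != "SKIPPED")
    return {
        "total": len(results),
        "by_outcome": dict(by_outcome),  # per-outcome grouping
        "by_domain": dict(by_domain),    # per-domain grouping
        "pass_rate": (by_outcome["PASS"] / executed) if executed else 0.0,
    }
```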

Later reporting could add:
- baseline-vs-head comparisons
- event-count drift alerts
- trend snapshots
- benchmark artifact manifests

---

## 15. Review Questions

When adding a new benchmark, reviewers should ask:

1. Is the scenario explicit?
2. Is the expectation explicit?
3. Does the scenario measure something real about the repository?
4. Is the result structured and reproducible?
5. Is the benchmark claiming more than it actually tests?
6. Can the output be replayed or reviewed later?

If those answers are weak, the benchmark is probably weak too.

---

## 16. Near-Term Priorities

The highest-value next benchmark improvements are:

1. expand the built-in scenario catalog
2. add replay-integrity benchmarks
3. add event-log completeness benchmarks
4. add execution-fault and safe-hold benchmark cases
5. add state-transition expectation benchmarks
6. prepare HIL-compatible evidence bundle conventions

---

## 17. Final Rule

A benchmark is only valuable if it narrows uncertainty.

If it cannot tell a reviewer what happened, why it mattered, and whether it matched the stated expectation, it is just decoration.