A debugging framework without evaluation is hard to trust.
The project should prove that runbooks and evidence normalization improve investigation quality, not just produce nicer reports.
The first benchmark can stay small.
Recommended setup (a minimal case definition is sketched after this list):
- 3 to 5 replayable incident cases
- each case has a known root cause
- each case includes mock trace, db, and redis data
- each case defines expected findings and acceptable final conclusions
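As an illustration, each case could be captured as a small record that the harness replays; the field names and fixture paths below are assumptions, not an existing schema.

```python
# Minimal sketch of a replayable incident case (all field names are illustrative).
from dataclasses import dataclass, field

@dataclass
class IncidentCase:
    case_id: str
    known_root_cause: str                      # ground-truth label used for scoring
    trace_fixture: str                         # path to mock trace data
    db_fixture: str                            # path to mock database snapshot
    redis_fixture: str                         # path to mock redis dump
    expected_findings: list[str] = field(default_factory=list)
    acceptable_conclusions: list[str] = field(default_factory=list)

# Hypothetical example; paths and labels are placeholders.
case_001 = IncidentCase(
    case_id="case-001-checkout-latency",
    known_root_cause="exhausted db connection pool",
    trace_fixture="fixtures/case-001/traces.json",
    db_fixture="fixtures/case-001/db_snapshot.sql",
    redis_fixture="fixtures/case-001/redis_dump.rdb",
    expected_findings=["p99 latency spike on /checkout", "db pool saturation"],
    acceptable_conclusions=["connection pool too small for peak traffic"],
)
```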
Compare three modes on the same cases (a mode configuration sketch follows this list):
- raw-tools baseline: the model receives tool access but no runbook and no normalized evidence layer
- runbook mode: the model receives the same tools plus runbook constraints and the evidence schema
- deterministic baseline: a scripted checklist or operator guide is followed without model reasoning
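One way to keep the three arms comparable is to drive them from a single config object; the flags below are assumptions about how the harness might be wired, not a fixed interface.

```python
# Sketch of the three evaluation arms as explicit configurations.
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalMode:
    name: str
    use_model: bool            # whether the LLM reasons over the case at all
    use_runbook: bool          # whether runbook constraints are injected
    use_evidence_schema: bool  # whether tool output is normalized before reasoning

MODES = [
    EvalMode("raw-tools", use_model=True, use_runbook=False, use_evidence_schema=False),
    EvalMode("runbook", use_model=True, use_runbook=True, use_evidence_schema=True),
    EvalMode("deterministic", use_model=False, use_runbook=True, use_evidence_schema=True),
]
```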
Track these metrics per case and per mode (a scoring sketch follows the list):
- root-cause accuracy
- evidence completeness
- unsupported-claim rate
- average tool calls per case
- time to first plausible diagnosis
- final report usefulness score
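A rough per-case scoring pass could look like the sketch below; the `result` and `case` keys are assumptions about what the harness records, and the report usefulness score is left to human rating.

```python
# Sketch of per-case scoring and cross-case aggregation (key names are illustrative).
from statistics import mean

def score_case(result: dict, case: dict) -> dict:
    claims = result["claims"]  # list of (claim_text, evidence_ids) pairs
    unsupported = [claim for claim, evidence in claims if not evidence]
    found = set(result["findings"]) & set(case["expected_findings"])
    return {
        "root_cause_correct": result["conclusion"] in case["acceptable_conclusions"],
        "evidence_completeness": len(found) / max(len(case["expected_findings"]), 1),
        "unsupported_claim_rate": len(unsupported) / max(len(claims), 1),
        "tool_calls": result["tool_call_count"],
        "seconds_to_first_diagnosis": result["seconds_to_first_diagnosis"],
    }

def aggregate(scores: list[dict]) -> dict:
    return {
        "root_cause_accuracy": mean(s["root_cause_correct"] for s in scores),
        "evidence_completeness": mean(s["evidence_completeness"] for s in scores),
        "unsupported_claim_rate": mean(s["unsupported_claim_rate"] for s in scores),
        "avg_tool_calls": mean(s["tool_calls"] for s in scores),
        "avg_seconds_to_first_diagnosis": mean(s["seconds_to_first_diagnosis"] for s in scores),
    }
```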
Evaluation should also classify failure modes (a rule-of-thumb tagging sketch follows this list):
- wrong runbook selection
- insufficient evidence collected
- evidence conflict ignored
- tool misuse
- overconfident conclusion
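Failure-mode tagging can start as coarse rules over the same per-case record and be refined by manual review; the thresholds and field names below are assumptions, not calibrated values.

```python
# Sketch of rule-of-thumb failure-mode tagging (thresholds are placeholders).
from enum import Enum, auto

class FailureMode(Enum):
    WRONG_RUNBOOK = auto()
    INSUFFICIENT_EVIDENCE = auto()
    EVIDENCE_CONFLICT_IGNORED = auto()
    TOOL_MISUSE = auto()
    OVERCONFIDENT_CONCLUSION = auto()

def classify_failures(score: dict, result: dict) -> list[FailureMode]:
    tags = []
    if result.get("selected_runbook") != result.get("expected_runbook"):
        tags.append(FailureMode.WRONG_RUNBOOK)
    if score["evidence_completeness"] < 0.5:
        tags.append(FailureMode.INSUFFICIENT_EVIDENCE)
    if result.get("unresolved_conflicts"):
        tags.append(FailureMode.EVIDENCE_CONFLICT_IGNORED)
    if result.get("tool_errors", 0) > 0:
        tags.append(FailureMode.TOOL_MISUSE)
    if not score["root_cause_correct"] and score["unsupported_claim_rate"] > 0.2:
        tags.append(FailureMode.OVERCONFIDENT_CONCLUSION)
    return tags
```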
The MVP does not need perfect automation. A reasonable acceptance bar, with a partial check sketched after this list, is:
- lower unsupported-claim rate than raw-tools mode
- equal or better root-cause accuracy on simple cases
- clearly better report structure and evidence traceability
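The first two bars can be checked mechanically against the aggregated metrics; the third (report structure and evidence traceability) stays a human judgment. The function below assumes the aggregate dicts from the scoring sketch, computed over the simple-case subset for the accuracy comparison.

```python
# Sketch of the mechanical part of the acceptance bar.
def meets_acceptance_bar(runbook_mode: dict, raw_tools_mode: dict) -> bool:
    return (
        runbook_mode["unsupported_claim_rate"] < raw_tools_mode["unsupported_claim_rate"]
        and runbook_mode["root_cause_accuracy"] >= raw_tools_mode["root_cause_accuracy"]
    )
```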