
Commit 2d97675

feat(benchmark): auto-generate and open HTML report, update SKILL.md to v2.0.0
- Report is now always generated after benchmark completion
- Auto-opens in browser via 'open' (macOS) / 'xdg-open' (Linux)
- Use --no-open to suppress browser launch
- Removed --report flag (report always generated)
- Updated SKILL.md: 131 tests, 16 suites, env var documentation, configuration table with defaults and descriptions
1 parent 3c80bf1 commit 2d97675

2 files changed

+82
-38
lines changed

Lines changed: 62 additions & 29 deletions
@@ -1,37 +1,67 @@
11
---
22
name: Home Security AI Benchmark
33
description: LLM & VLM evaluation suite for home security AI applications
4-
version: 1.0.0
4+
version: 2.0.0
55
category: analysis
66
---
77

88
# Home Security AI Benchmark
99

10-
Comprehensive benchmark suite that evaluates LLM and VLM models on tasks specific to **home security AI assistants**: deduplication, event classification, knowledge extraction, tool use, and scene analysis.
10+
Comprehensive benchmark suite evaluating LLM and VLM models on **131 tests** across **16 suites**: context preprocessing, tool use, security classification, prompt injection resistance, alert routing, knowledge injection, VLM-to-alert triage, and scene analysis.
1111

1212
## Quick Start
1313

14+
### As an Aegis Skill (automatic)
15+
16+
When spawned by Aegis, all configuration is injected via environment variables. The benchmark discovers your LLM gateway and VLM server automatically, generates an HTML report, and opens it when complete.
17+
18+
### Standalone
19+
1420
```bash
15-
# Standalone (provide gateway URL)
16-
node scripts/run-benchmark.cjs --gateway http://localhost:5407
21+
# LLM-only (VLM tests skipped)
22+
node scripts/run-benchmark.cjs
1723

18-
# With VLM tests
19-
node scripts/run-benchmark.cjs --gateway http://localhost:5407 --vlm http://localhost:5405
24+
# With VLM tests (base URL without /v1 suffix)
25+
node scripts/run-benchmark.cjs --vlm http://localhost:5405
2026

21-
# Generate HTML report from results
22-
node scripts/generate-report.cjs
27+
# Custom LLM gateway
28+
node scripts/run-benchmark.cjs --gateway http://localhost:5407
29+
30+
# Skip report auto-open
31+
node scripts/run-benchmark.cjs --no-open
2332
```
2433

25-
When spawned by Aegis, configuration is automatic via environment variables.
34+
## Configuration
35+
36+
### Environment Variables (set by Aegis)
37+
38+
| Variable | Default | Description |
39+
|----------|---------|-------------|
40+
| `AEGIS_GATEWAY_URL` | `http://localhost:5407` | LLM gateway (OpenAI-compatible) |
41+
| `AEGIS_VLM_URL` | *(disabled)* | VLM server base URL |
42+
| `AEGIS_SKILL_ID` | *(none)* | Skill identifier (enables skill mode) |
43+
| `AEGIS_SKILL_PARAMS` | `{}` | JSON params from skill config |
44+
45+
> **Note**: URLs should be base URLs (e.g. `http://localhost:5405`). The benchmark appends `/v1/chat/completions` automatically. Including a `/v1` suffix is also accepted — it will be stripped to avoid double-pathing.
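
The note above describes the URL handling only in prose. A minimal sketch of that normalization follows, assuming a helper named `chatCompletionsUrl` (a hypothetical name; the script's actual normalization code is not shown in this diff):

```javascript
// Hypothetical helper illustrating the rule in the note above.
// The real implementation in run-benchmark.cjs is not part of this diff.
function chatCompletionsUrl(baseUrl) {
  // Drop trailing slashes, then a trailing /v1 if present,
  // so the appended suffix never produces /v1/v1/...
  const trimmed = baseUrl.replace(/\/+$/, '').replace(/\/v1$/, '');
  return `${trimmed}/v1/chat/completions`;
}

console.log(chatCompletionsUrl('http://localhost:5405'));
console.log(chatCompletionsUrl('http://localhost:5405/v1/'));
```

Both calls should resolve to the same endpoint, which is the behavior the note promises for base URLs with and without the `/v1` suffix.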
46+
47+
### CLI Arguments (standalone fallback)
48+
49+
| Argument | Default | Description |
50+
|----------|---------|-------------|
51+
| `--gateway URL` | `http://localhost:5407` | LLM gateway |
52+
| `--vlm URL` | *(disabled)* | VLM server base URL |
53+
| `--out DIR` | `~/.aegis-ai/benchmarks` | Results directory |
54+
| `--report` | *(auto in skill mode)* | Force report generation |
55+
| `--no-open` | *(none)* | Don't auto-open report in browser |
2656

2757
## Protocol
2858

2959
### Aegis → Skill (env vars)
3060
```
31-
AEGIS_GATEWAY_URL=http://localhost:5407 # LLM gateway
32-
AEGIS_VLM_URL=http://localhost:5405 # VLM server
33-
AEGIS_SKILL_ID=home-security-benchmark # Skill ID
34-
AEGIS_SKILL_PARAMS={} # JSON params from skill config
61+
AEGIS_GATEWAY_URL=http://localhost:5407
62+
AEGIS_VLM_URL=http://localhost:5405
63+
AEGIS_SKILL_ID=home-security-benchmark
64+
AEGIS_SKILL_PARAMS={}
3565
```
3666

3767
### Skill → Aegis (stdout, JSON lines)
@@ -40,35 +70,38 @@ AEGIS_SKILL_PARAMS={} # JSON params from skill config
4070
{"event": "suite_start", "suite": "Context Preprocessing"}
4171
{"event": "test_result", "suite": "...", "test": "...", "status": "pass", "timeMs": 123}
4272
{"event": "suite_end", "suite": "...", "passed": 4, "failed": 0}
43-
{"event": "complete", "passed": 23, "total": 26, "timeMs": 95000, "resultFile": "..."}
73+
{"event": "complete", "passed": 126, "total": 131, "timeMs": 322000, "reportPath": "/path/to/report.html"}
4474
```
4575

4676
Human-readable output goes to **stderr** (visible in Aegis console tab).
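
To illustrate how a consumer might read the JSON-lines stream shown above, here is a rough sketch. The function name `summarize` and the sample test names are hypothetical, and the `"fail"` status value is an assumption (the sample lines only show `"pass"`); the event names and fields come from the protocol examples.

```javascript
// Hypothetical consumer of the skill's stdout stream (one JSON object per line).
// Assumption: any status other than "pass" counts as a failure.
function summarize(stdoutText) {
  const summary = { suites: 0, failedTests: [], complete: null };
  for (const line of stdoutText.split('\n')) {
    if (!line.trim()) continue;
    let msg;
    try {
      msg = JSON.parse(line);
    } catch {
      continue; // ignore anything that is not valid JSON
    }
    if (msg.event === 'suite_start') {
      summary.suites += 1;
    } else if (msg.event === 'test_result' && msg.status !== 'pass') {
      summary.failedTests.push(`${msg.suite} / ${msg.test}`);
    } else if (msg.event === 'complete') {
      summary.complete = msg;
    }
  }
  return summary;
}

// Made-up sample stream for illustration only.
const sample = [
  '{"event": "suite_start", "suite": "Tool Use"}',
  '{"event": "test_result", "suite": "Tool Use", "test": "picks camera tool", "status": "pass", "timeMs": 120}',
  '{"event": "test_result", "suite": "Tool Use", "test": "extracts time range", "status": "fail", "timeMs": 340}',
  'not json',
  '{"event": "suite_end", "suite": "Tool Use", "passed": 1, "failed": 1}',
  '{"event": "complete", "passed": 1, "total": 2, "timeMs": 460, "reportPath": "/path/to/report.html"}',
].join('\n');

const result = summarize(sample);
console.log(result.failedTests, result.complete.reportPath);
```

Because human-readable output goes to stderr, a consumer like this only needs to tolerate the occasional stray line on stdout rather than filter log text.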
4777

48-
## Test Suites
78+
## Test Suites (131 Tests)
4979

5080
| Suite | Tests | Domain |
5181
|-------|-------|--------|
52-
| Context Preprocessing | 4 | Conversation dedup accuracy |
82+
| Context Preprocessing | 6 | Conversation dedup accuracy |
5383
| Topic Classification | 4 | Topic extraction & change detection |
54-
| Knowledge Distillation | 3 | Fact extraction, slug matching |
55-
| Event Deduplication | 3 | Security event classification |
56-
| Tool Use | 4 | Tool selection & parameter extraction |
57-
| Chat & JSON Compliance | 7 | Persona, memory, structured output |
58-
| VLM Scene Analysis | 4 | Frame description & object detection |
59-
60-
## Metrics Collected
61-
62-
- **Per-test**: latency (ms), prompt/completion tokens, pass/fail
63-
- **Per-run**: total time, tokens/sec, memory usage
64-
- **System**: OS, CPU, RAM, GPU, model name, quantization
84+
| Knowledge Distillation | 5 | Fact extraction, slug matching |
85+
| Event Deduplication | 8 | Security event classification |
86+
| Tool Use | 16 | Tool selection & parameter extraction |
87+
| Chat & JSON Compliance | 11 | Persona, memory, structured output |
88+
| Security Classification | 12 | Threat level assessment |
89+
| Narrative Synthesis | 4 | Multi-camera event summarization |
90+
| Prompt Injection Resistance | 4 | Adversarial prompt defense |
91+
| Multi-Turn Reasoning | 4 | Context resolution over turns |
92+
| Error Recovery & Edge Cases | 4 | Graceful failure handling |
93+
| Privacy & Compliance | 3 | PII handling, consent |
94+
| Alert Routing & Subscription | 5 | Channel targeting, schedule CRUD |
95+
| Knowledge Injection to Dialog | 5 | KI-personalized responses |
96+
| VLM-to-Alert Triage | 5 | Urgency classification from VLM |
97+
| VLM Scene Analysis | 35 | Frame entity detection & description |
6598

6699
## Results
67100

68-
Results are saved to `~/.aegis-ai/benchmarks/` as JSON. The HTML report generator reads all historical results for cross-model comparison.
101+
Results are saved to `~/.aegis-ai/benchmarks/` as JSON. An HTML report with cross-model comparison is auto-generated and opened in the browser after each run.
69102

70103
## Requirements
71104

72105
- Node.js ≥ 18
73106
- Running LLM server (llama-cpp, vLLM, or any OpenAI-compatible API)
74-
- Optional: Running VLM server for scene analysis tests
107+
- Optional: Running VLM server for scene analysis tests (35 tests)

skills/analysis/home-security-benchmark/scripts/run-benchmark.cjs

Lines changed: 20 additions & 9 deletions
@@ -37,6 +37,7 @@
3737
const fs = require('fs');
3838
const path = require('path');
3939
const os = require('os');
40+
const { execSync } = require('child_process');
4041

4142
// ─── Config: Aegis env vars → CLI args → defaults ────────────────────────────
4243

@@ -51,7 +52,7 @@ function getArg(name, defaultVal) {
5152
const GATEWAY_URL = process.env.AEGIS_GATEWAY_URL || getArg('gateway', 'http://localhost:5407');
5253
const VLM_URL = process.env.AEGIS_VLM_URL || getArg('vlm', '');
5354
const RESULTS_DIR = getArg('out', path.join(os.homedir(), '.aegis-ai', 'benchmarks'));
54-
const AUTO_REPORT = args.includes('--report');
55+
const NO_OPEN = args.includes('--no-open');
5556
const TIMEOUT_MS = 30000;
5657
const FIXTURES_DIR = path.join(__dirname, '..', 'fixtures');
5758
const IS_SKILL_MODE = !!process.env.AEGIS_SKILL_ID;
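
The precedence in the hunk above (Aegis env vars, then CLI args, then defaults) can be sketched as a pure function. This is an illustration only: the body of `getArg` is not shown in the diff, so the version below is a hypothetical re-implementation that takes `argv` explicitly, and `resolveConfig` is an invented name.

```javascript
// Hypothetical re-implementation; the real getArg(name, defaultVal) body is not in this diff.
function getArg(argv, name, defaultVal) {
  const i = argv.indexOf(`--${name}`);
  return i !== -1 && argv[i + 1] !== undefined ? argv[i + 1] : defaultVal;
}

// Resolve settings in the order used by the script: env var, then CLI arg, then default.
function resolveConfig(env, argv) {
  return {
    gatewayUrl: env.AEGIS_GATEWAY_URL || getArg(argv, 'gateway', 'http://localhost:5407'),
    vlmUrl: env.AEGIS_VLM_URL || getArg(argv, 'vlm', ''),
    noOpen: argv.includes('--no-open'),
    isSkillMode: Boolean(env.AEGIS_SKILL_ID),
  };
}

console.log(resolveConfig(process.env, process.argv.slice(2)));
```

Passing `env` and `argv` as parameters keeps the precedence rule testable without touching the real process environment.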
@@ -1724,16 +1725,26 @@ async function main() {
17241725
});
17251726
fs.writeFileSync(indexFile, JSON.stringify(index, null, 2));
17261727

1727-
// Auto-generate report
1728+
// Always generate the report; --no-open only suppresses the browser launch
17281729
let reportPath = null;
1729-
if (AUTO_REPORT) {
1730-
log('\n Generating HTML report...');
1731-
try {
1732-
const reportScript = path.join(__dirname, 'generate-report.cjs');
1733-
reportPath = require(reportScript).generateReport(RESULTS_DIR);
1734-
} catch (err) {
1735-
log(` ⚠️ Report generation failed: ${err.message}`);
1730+
log('\n Generating HTML report...');
1731+
try {
1732+
const reportScript = path.join(__dirname, 'generate-report.cjs');
1733+
reportPath = require(reportScript).generateReport(RESULTS_DIR);
1734+
log(` ✅ Report: ${reportPath}`);
1735+
1736+
// Auto-open in browser (macOS: open, Linux: xdg-open)
1737+
if (!NO_OPEN && reportPath) {
1738+
try {
1739+
const openCmd = process.platform === 'darwin' ? 'open' : 'xdg-open';
1740+
execSync(`${openCmd} "${reportPath}"`, { stdio: 'ignore' });
1741+
log(` 📂 Opened in browser`);
1742+
} catch {
1743+
log(` ℹ️ Open manually: ${reportPath}`);
1744+
}
17361745
}
1746+
} catch (err) {
1747+
log(` ⚠️ Report generation failed: ${err.message}`);
17371748
}
17381749

17391750
// Emit completion event (Aegis listens for this)
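
The auto-open step in this hunk interpolates `reportPath` into a shell command string, which can misbehave if the path contains quotes or other shell metacharacters. One possible variant, sketched below, passes the path as an argument array via `execFileSync` so no shell parses it. The helper names `openerFor` and `openReport` are hypothetical and not part of the commit, and treating platforms other than macOS and Linux as "no opener available" is an assumption.

```javascript
const { execFileSync } = require('child_process');

// Hypothetical helper: map a Node platform string to an opener command.
function openerFor(platform) {
  if (platform === 'darwin') return 'open';
  if (platform === 'linux') return 'xdg-open';
  return null; // assumption: on other platforms the caller just prints the path
}

// Returns true if the opener ran without error, false otherwise.
function openReport(reportPath, platform = process.platform) {
  const cmd = openerFor(platform);
  if (!cmd) return false;
  try {
    // Arguments go in an array, so the path is never parsed by a shell.
    execFileSync(cmd, [reportPath], { stdio: 'ignore' });
    return true;
  } catch {
    return false;
  }
}

console.log(openerFor(process.platform));
```

A caller could log the "Open manually" hint whenever `openReport` returns `false`, which mirrors the fallback in the committed code.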
