Commit 7d3e7a3

feat(benchmark): add Indoor Safety Hazards VLM suite (12 tests)

Expand HomeSec-Bench VLM Scene Analysis from 35 to 47 tests with a new Indoor Safety Hazards category covering fire, electrical, trip/fall, child safety, and blocked-exit scenarios, using AI-generated indoor security camera frames.

New test scenarios:

- Stove smoke, candle near curtain, space heater near drapes, iron left on
- Overloaded power strip, frayed electrical cord
- Toys on stairs, wet floor
- Person fallen, items on high shelf
- Open cabinet with chemicals
- Cluttered/blocked exit

Total benchmark: 131 → 143 tests
VLM suite: 35 → 47 tests
Version: 2.0.0 → 2.1.0
1 parent: 80d1efc

15 files changed: 70 additions & 8 deletions

README.md (2 additions, 2 deletions)

```diff
@@ -42,7 +42,7 @@ Each skill is a self-contained module with its own model, parameters, and [commu
 | **Detection** | [`yolo-detection-2026`](skills/detection/yolo-detection-2026/) | Real-time 80+ class detection — auto-accelerated via TensorRT / CoreML / OpenVINO / ONNX ||
 | | [`dinov3-grounding`](skills/detection/dinov3-grounding/) | Open-vocabulary detection — describe what to find | 📐 |
 | | [`person-recognition`](skills/detection/person-recognition/) | Re-identify individuals across cameras | 📐 |
-| **Analysis** | [`home-security-benchmark`](skills/analysis/home-security-benchmark/) | [131-test evaluation suite](#-homesec-bench--how-secure-is-your-local-ai) for LLM & VLM security performance ||
+| **Analysis** | [`home-security-benchmark`](skills/analysis/home-security-benchmark/) | [143-test evaluation suite](#-homesec-bench--how-secure-is-your-local-ai) for LLM & VLM security performance ||
 | | [`vlm-scene-analysis`](skills/analysis/vlm-scene-analysis/) | Describe what happened in recorded clips | 📐 |
 | | [`sam2-segmentation`](skills/analysis/sam2-segmentation/) | Click-to-segment with pixel-perfect masks | 📐 |
 | **Transformation** | [`depth-estimation`](skills/transformation/depth-estimation/) | Monocular depth maps with Depth Anything v2 | 📐 |
@@ -140,7 +140,7 @@ Camera → Frame Governor → detect.py (JSONL) → Aegis IPC → Live Overlay

 ## 📊 HomeSec-Bench — How Secure Is Your Local AI?

-**HomeSec-Bench** is a 131-test security benchmark that measures how well your local AI performs as a security guard. It tests what matters: Can it detect a person in fog? Classify a break-in vs. a delivery? Resist prompt injection? Route alerts correctly at 3 AM?
+**HomeSec-Bench** is a 143-test security benchmark that measures how well your local AI performs as a security guard. It tests what matters: Can it detect a person in fog? Classify a break-in vs. a delivery? Resist prompt injection? Route alerts correctly at 3 AM?

 Run it on your own hardware to know exactly where your setup stands.
```

skills/analysis/home-security-benchmark/SKILL.md (6 additions, 6 deletions)
```diff
@@ -1,7 +1,7 @@
 ---
 name: Home Security AI Benchmark
 description: LLM & VLM evaluation suite for home security AI applications
-version: 2.0.0
+version: 2.1.0
 category: analysis
 runtime: node
 entry: scripts/run-benchmark.cjs
@@ -15,7 +15,7 @@ requirements:

 # Home Security AI Benchmark

-Comprehensive benchmark suite evaluating LLM and VLM models on **131 tests** across **16 suites** — context preprocessing, tool use, security classification, prompt injection resistance, alert routing, knowledge injection, VLM-to-alert triage, and scene analysis.
+Comprehensive benchmark suite evaluating LLM and VLM models on **143 tests** across **16 suites** — context preprocessing, tool use, security classification, prompt injection resistance, alert routing, knowledge injection, VLM-to-alert triage, and scene analysis.

 ## Setup

@@ -76,7 +76,7 @@ This skill includes a [`config.yaml`](config.yaml) that defines user-configurabl

 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `mode` | select | `llm` | Which suites to run: `llm` (96 tests), `vlm` (35 tests), or `full` (131 tests) |
+| `mode` | select | `llm` | Which suites to run: `llm` (96 tests), `vlm` (47 tests), or `full` (143 tests) |
 | `noOpen` | boolean | `false` | Skip auto-opening the HTML report in browser |

 Platform parameters like `AEGIS_GATEWAY_URL` and `AEGIS_VLM_URL` are auto-injected by Aegis — they are **not** in `config.yaml`. See [Aegis Skill Platform Parameters](../../../docs/skill-params.md) for the full platform contract.
```
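For orientation, the parameter table above implies a `config.yaml` shaped roughly like this. This is a sketch: only `mode` and `noOpen` (with their types and defaults) appear in the diff; the surrounding key layout is an assumption.

```yaml
# Hypothetical config.yaml shape; field layout assumed, not taken from the diff
parameters:
  mode:
    type: select
    default: llm
    options: [llm, vlm, full]  # 96 / 47 / 143 tests as of v2.1.0
  noOpen:
    type: boolean
    default: false             # true = skip auto-opening the HTML report
```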
```diff
@@ -112,7 +112,7 @@ AEGIS_SKILL_PARAMS={}

 Human-readable output goes to **stderr** (visible in Aegis console tab).

-## Test Suites (131 Tests)
+## Test Suites (143 Tests)

 | Suite | Tests | Domain |
 |-------|-------|--------|
@@ -131,7 +131,7 @@ Human-readable output goes to **stderr** (visible in Aegis console tab).
 | Alert Routing & Subscription | 5 | Channel targeting, schedule CRUD |
 | Knowledge Injection to Dialog | 5 | KI-personalized responses |
 | VLM-to-Alert Triage | 5 | Urgency classification from VLM |
-| VLM Scene Analysis | 35 | Frame entity detection & description |
+| VLM Scene Analysis | 47 | Frame entity detection & description (outdoor + indoor safety) |

 ## Results

@@ -142,4 +142,4 @@ Results are saved to `~/.aegis-ai/benchmarks/` as JSON. An HTML report with cros
 - Node.js ≥ 18
 - `npm install` (for `openai` SDK dependency)
 - Running LLM server (llama-server, OpenAI API, or any OpenAI-compatible endpoint)
-- Optional: Running VLM server for scene analysis tests (35 tests)
+- Optional: Running VLM server for scene analysis tests (47 tests)
```
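The stderr note in the diff above describes a clean output split: human-readable progress on stderr, machine-readable results on stdout. A minimal Node sketch of that contract (illustrative only: the real `scripts/run-benchmark.cjs` is not shown in this diff, and the summary field names are assumptions):

```javascript
// Illustrative sketch of the output contract: human-readable progress on
// stderr (shown in the Aegis console tab), JSON summary on stdout.
// Field names are assumed; totals reflect this commit (16 suites, 143 tests).
const summary = { suites: 16, tests: 143, mode: "full" };

console.error(`Running HomeSec-Bench (${summary.tests} tests)...`); // stderr
process.stdout.write(JSON.stringify(summary) + "\n");               // stdout
```

Keeping stdout pure JSON means a caller can pipe it straight into a parser while still seeing progress in the console.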
Plus 8 binary image files (new AI-generated benchmark frames, 553 KB – 712 KB each), not rendered in this view.
