
Commit 0ebf84e

Merge pull request #172 from SharpAI/develop
Develop
2 parents d51176f + 3e1440e commit 0ebf84e

File tree

4 files changed: +112 −14 lines


README.md

Lines changed: 3 additions & 14 deletions
```diff
@@ -36,20 +36,9 @@
 
 </div>
 
-<table>
-<tr>
-<td width="50%">
-<p align="center"><b>Run Local VLMs from HuggingFace — Even on Mac Mini 8GB</b></p>
-<img src="screenshots/aegis-vlm-browser.png" alt="SharpAI Aegis — Browse and run local VLM models for AI camera video analysis" width="100%">
-<p align="center"><em>Download and run SmolVLM2, Qwen-VL, LLaVA, MiniCPM-V locally. Your AI security camera agent sees through these eyes.</em></p>
-</td>
-<td width="50%">
-<p align="center"><b>Chat with Your AI Camera Agent</b></p>
-<img src="screenshots/aegis-chat-agent.png" alt="SharpAI Aegis — LLM-powered agentic security camera chat" width="100%">
-<p align="center"><em>"Who was at the door?" — Your agent searches footage, reasons about what happened, and answers with timestamps and clips.</em></p>
-</td>
-</tr>
-</table>
+<p align="center">
+<a href="https://youtu.be/BtHpenIO5WU"><img src="screenshots/aegis-benchmark-demo.gif" alt="Aegis AI Benchmark Demo — Local LLM home security on Apple Silicon (click for full video)" width="60%"></a>
+</p>
 
 ---
```

Two binary files changed (8.33 MB and 1.85 MB) — contents not shown.
Lines changed: 109 additions & 0 deletions
# HomeSec-Bench — Local AI Benchmark for Home Security

> **Qwen3.5-9B scores 93.8%** on 96 real security AI tests — within 4 points of GPT-5.4 — running entirely on a **MacBook Pro M5** at 25 tok/s, 765ms TTFT, using only 13.8 GB of unified memory. Zero API costs. Full data privacy. All local.

## What is HomeSec-Bench?

A benchmark suite that evaluates LLMs on **real home security assistant workflows** — not generic chat, but the actual reasoning, triage, and tool use an AI home security system needs.

All 35 fixture images are AI-generated (no real user footage). Tests run against any OpenAI-compatible endpoint.
## Results: Full Leaderboard

| Rank | Model | Type | Passed | Failed | Pass Rate | Total Time |
|-----:|:------|:-----|-------:|-------:|----------:|-----------:|
| 🥇 1 | **GPT-5.4** | ☁️ Cloud | **94** | 2 | **97.9%** | 2m 22s |
| 🥈 2 | **GPT-5.4-mini** | ☁️ Cloud | **92** | 4 | **95.8%** | 1m 17s |
| 🥉 3 | **Qwen3.5-9B** (Q4_K_M) | 🏠 Local | **90** | 6 | **93.8%** | 5m 23s |
| 3 | **Qwen3.5-27B** (Q4_K_M) | 🏠 Local | **90** | 6 | **93.8%** | 15m 8s |
| 5 | **Qwen3.5-122B-MoE** (IQ1_M) | 🏠 Local | 89 | 7 | 92.7% | 8m 26s |
| 5 | **GPT-5.4-nano** | ☁️ Cloud | 89 | 7 | 92.7% | 1m 34s |
| 7 | **Qwen3.5-35B-MoE** (Q4_K_L) | 🏠 Local | 88 | 8 | 91.7% | 3m 30s |
| 8 | **GPT-5-mini** (2025) | ☁️ Cloud | 60 | 36 | 62.5%* | 7m 38s |

> *GPT-5-mini had many failures because the API rejects non-default `temperature` values, so suites using temp=0.7 or temp=0.1 scored 0/N. This is an API limitation, not a lack of model capability — it's not a fair comparison and is listed for completeness only.

**Key takeaway:** **Qwen3.5-9B** running locally on a single MacBook Pro scores **93.8%** — only **4.1 points behind GPT-5.4** and within **2 points of GPT-5.4-mini**. It even **beats GPT-5.4-nano** by 1 point. All with zero API costs and complete data privacy.
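The pass rates above are simply passed/96, and tied models share a rank (hence two 3s and two 5s, with no 4 or 6). A quick sketch of that arithmetic, using a hypothetical helper that is not part of the benchmark code:

```javascript
// Hypothetical helper, not from the benchmark repo: reproduces the
// table's pass rates (passed / 96) and its competition-style ranking,
// where tied models share a rank and the next rank is skipped.
const TOTAL_TESTS = 96;

const results = [
  { model: "GPT-5.4", passed: 94 },
  { model: "GPT-5.4-mini", passed: 92 },
  { model: "Qwen3.5-9B", passed: 90 },
  { model: "Qwen3.5-27B", passed: 90 },
  { model: "Qwen3.5-122B-MoE", passed: 89 },
  { model: "GPT-5.4-nano", passed: 89 },
  { model: "Qwen3.5-35B-MoE", passed: 88 },
];

function leaderboard(rows) {
  const sorted = [...rows].sort((a, b) => b.passed - a.passed);
  let rank = 0;
  return sorted.map((r, i) => {
    // A new rank only when the score drops; ties inherit the previous rank.
    if (i === 0 || r.passed < sorted[i - 1].passed) rank = i + 1;
    return {
      ...r,
      rank,
      passRate: ((r.passed / TOTAL_TESTS) * 100).toFixed(1) + "%",
    };
  });
}

for (const row of leaderboard(results)) {
  console.log(row.rank, row.model, row.passRate);
}
```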
## Performance: Local vs Cloud

| Model | Type | TTFT (avg) | TTFT (p95) | Decode (tok/s) | GPU Mem |
|:------|:-----|-----------:|-----------:|---------------:|--------:|
| **Qwen3.5-35B-MoE** | 🏠 Local | **435ms** | 673ms | 41.9 | 27.2 GB |
| **GPT-5.4-nano** | ☁️ Cloud | 508ms | 990ms | 136.4 | — |
| **GPT-5.4-mini** | ☁️ Cloud | 553ms | 805ms | 234.5 | — |
| **GPT-5.4** | ☁️ Cloud | 601ms | 1052ms | 73.4 | — |
| **Qwen3.5-9B** | 🏠 Local | 765ms | 1437ms | 25.0 | 13.8 GB |
| **Qwen3.5-122B-MoE** | 🏠 Local | 1627ms | 2331ms | 18.0 | 40.8 GB |
| **Qwen3.5-27B** | 🏠 Local | 2156ms | 3642ms | 10.0 | 24.9 GB |

> The **Qwen3.5-35B-MoE** posts a lower TTFT than **all OpenAI cloud models** — 435ms vs. 508ms for GPT-5.4-nano. With only 3B active parameters, the MoE is remarkably fast for local inference.
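TTFT and decode speed trade off differently between local and cloud models. As a back-of-envelope, the time to stream a full reply is roughly TTFT plus output tokens divided by decode speed; the snippet below is a hypothetical illustration using figures from the table, where the 150-token reply length is an assumed value:

```javascript
// Rough perceived-latency model: time to stream a full reply is
// TTFT plus (output tokens / decode speed). TTFT and tok/s figures
// are from the table above; the 150-token reply is an assumption.
function replyLatencyMs(ttftMs, tokPerSec, outputTokens) {
  return ttftMs + (outputTokens / tokPerSec) * 1000;
}

// A typical 150-token security alert summary:
const qwen9b = replyLatencyMs(765, 25.0, 150);  // local Qwen3.5-9B
const gpt54 = replyLatencyMs(601, 73.4, 150);   // cloud GPT-5.4

console.log(Math.round(qwen9b), Math.round(gpt54)); // 6765 2645
```

So for short alert-sized replies, a local 9B stays within a few seconds of the cloud despite its slower decode rate.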
## Test Hardware

- **Machine:** MacBook Pro M5 (M5 Pro chip, 18 cores, 64 GB unified memory)
- **Local inference:** llama-server (llama.cpp)
- **Cloud models:** OpenAI API
- **OS:** macOS 15.3 (arm64)
## Test Suites (96 LLM Tests)

| # | Suite | Tests | What It Evaluates |
|--:|:------|------:|:------------------|
| 1 | 📋 Context Preprocessing | 6 | Deduplicating conversations, preserving system msgs |
| 2 | 🏷️ Topic Classification | 4 | Routing queries to the right domain |
| 3 | 🧠 Knowledge Distillation | 5 | Extracting durable facts from conversations |
| 4 | 🔔 Event Deduplication | 8 | "Same person or new visitor?" across cameras |
| 5 | 🔧 Tool Use | 16 | Selecting correct tools with correct parameters |
| 6 | 💬 Chat & JSON Compliance | 11 | Persona, JSON output, multilingual |
| 7 | 🚨 Security Classification | 12 | Normal → Monitor → Suspicious → Critical triage |
| 8 | 📖 Narrative Synthesis | 4 | Summarizing event logs into daily reports |
| 9 | 🛡️ Prompt Injection Resistance | 4 | Role confusion, prompt extraction, escalation |
| 10 | 🔄 Multi-Turn Reasoning | 4 | Reference resolution, temporal carry-over |
| 11 | ⚠️ Error Recovery | 4 | Handling impossible queries, API errors |
| 12 | 🔒 Privacy & Compliance | 3 | PII redaction, illegal surveillance rejection |
| 13 | 📡 Alert Routing | 5 | Channel routing, quiet hours parsing |
| 14 | 💉 Knowledge Injection | 5 | Using injected KIs to personalize responses |
| 15 | 🚨 VLM-to-Alert Triage | 5 | End-to-end: VLM output → urgency → alert dispatch |
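Suite 7's triage ladder (Normal → Monitor → Suspicious → Critical) is an ordered scale, so a graded answer can be checked by position. A toy sketch of that ordering check, not the benchmark's actual scoring code:

```javascript
// Toy sketch of the severity ladder from the Security Classification
// suite. This is an illustration, not the benchmark's scoring logic.
const SEVERITY = ["Normal", "Monitor", "Suspicious", "Critical"];

// True if `level` is at or above `threshold` on the ladder.
function atLeast(level, threshold) {
  return SEVERITY.indexOf(level) >= SEVERITY.indexOf(threshold);
}

// e.g. "masked person at night" should triage at Suspicious or above:
console.log(atLeast("Critical", "Suspicious")); // true
console.log(atLeast("Monitor", "Suspicious")); // false
```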
## Running the Benchmark

### As an Aegis Skill (automatic)

When spawned by [Aegis-AI](https://github.com/SharpAI/DeepCamera), all configuration is injected via environment variables. The benchmark discovers your LLM gateway and VLM server automatically, generates an HTML report, and opens it when complete.

### Standalone

```bash
# Install dependencies
npm install

# LLM-only (VLM tests skipped)
node scripts/run-benchmark.cjs

# With VLM tests
node scripts/run-benchmark.cjs --vlm http://localhost:5405

# Custom LLM gateway
node scripts/run-benchmark.cjs --gateway http://localhost:5407
```

See [SKILL.md](SKILL.md) for full configuration options and the protocol spec.
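Since every endpoint is OpenAI-compatible, the same request shape works for llama-server and the cloud alike; only the base URL changes. A minimal sketch of such a chat-completion body (the model name and system prompt are illustrative assumptions, not the benchmark's real payload):

```javascript
// Build an OpenAI-compatible /v1/chat/completions request body.
// Model name and system prompt are assumptions for illustration.
function chatRequest(model, userText) {
  return {
    model,
    // Deterministic scoring; note some APIs reject non-default values
    // (see the GPT-5-mini footnote in the leaderboard).
    temperature: 0,
    messages: [
      { role: "system", content: "You are a home security assistant." },
      { role: "user", content: userText },
    ],
  };
}

const body = chatRequest("qwen3.5-9b", "Who was at the door today?");
console.log(JSON.stringify(body, null, 2));
```

The same body could then be POSTed to `http://localhost:8080/v1/chat/completions` for llama-server or to the OpenAI API, with only the URL and auth header differing.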
## Why This Matters

Most LLM benchmarks test generic capabilities. But when you're building a **real product** — especially one running **entirely on consumer hardware** — you need domain-specific evaluation:

1. ✅ Can it pick the right tool with correct parameters?
2. ✅ Can it classify "masked person at night" as Critical vs. Suspicious?
3. ✅ Can it resist prompt injection disguised as camera event descriptions?
4. ✅ Can it deduplicate the same delivery person seen across 3 cameras?
5. ✅ Can it maintain context across multi-turn security conversations?
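Point 4, recognizing one visitor across several cameras, amounts to clustering sightings by description within a time window. A toy sketch under that assumption, with invented field names (the product's actual logic is LLM-driven, not this heuristic):

```javascript
// Toy dedup: sightings with the same description, seen on any camera
// within a short window, count as one visitor. Field names invented.
function dedupe(sightings, windowMs = 5 * 60 * 1000) {
  const visitors = [];
  for (const s of [...sightings].sort((a, b) => a.ts - b.ts)) {
    const match = visitors.find(
      (v) => v.description === s.description && s.ts - v.lastSeen <= windowMs
    );
    if (match) {
      match.cameras.add(s.camera);
      match.lastSeen = s.ts;
    } else {
      visitors.push({
        description: s.description,
        cameras: new Set([s.camera]),
        lastSeen: s.ts,
      });
    }
  }
  return visitors;
}

const events = [
  { ts: 0, camera: "driveway", description: "delivery person" },
  { ts: 60_000, camera: "porch", description: "delivery person" },
  { ts: 90_000, camera: "door", description: "delivery person" },
];
console.log(dedupe(events).length); // 1
```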
A **9B Qwen model on a MacBook Pro** scoring within 4 points of GPT-5.4 on these domain tasks — while running fully offline with complete privacy — is the value proposition of local AI.
---

**System:** [Aegis-AI](https://aegis.sharpai.org) — Local-first AI home security on consumer hardware.
**Benchmark:** HomeSec-Bench — 96 LLM + 35 VLM tests across 16 suites.
**Skill Platform:** [DeepCamera](https://github.com/SharpAI/DeepCamera) — Decentralized AI skill ecosystem.
