Commit a275e40

Merge pull request #175 from SharpAI/develop
Develop
2 parents c7e9ddd + 65f3ca5 commit a275e40

28 files changed: +3128 -82 lines changed

README.md

Lines changed: 50 additions & 0 deletions
@@ -58,6 +58,8 @@ Each skill is a self-contained module with its own model, parameters, and [commu
 | Category | Skill | What It Does | Status |
 |----------|-------|--------------|:------:|
 | **Detection** | [`yolo-detection-2026`](skills/detection/yolo-detection-2026/) | Real-time 80+ class detection — auto-accelerated via TensorRT / CoreML / OpenVINO / ONNX | ✅ |
+| | [`yolo-detection-2026-coral-tpu`](skills/detection/yolo-detection-2026-coral-tpu/) | Google Coral Edge TPU — ~4ms inference via USB accelerator ([Docker-based](#detection--segmentation-skills)) | 🧪 |
+| | [`yolo-detection-2026-openvino`](skills/detection/yolo-detection-2026-openvino/) | Intel NCS2 USB / Intel GPU / CPU — multi-device via OpenVINO ([Docker-based](#detection--segmentation-skills)) | 🧪 |
 | **Analysis** | [`home-security-benchmark`](skills/analysis/home-security-benchmark/) | [143-test evaluation suite](#-homesec-bench--how-secure-is-your-local-ai) for LLM & VLM security performance | ✅ |
 | **Privacy** | [`depth-estimation`](skills/transformation/depth-estimation/) | [Real-time depth-map privacy transform](#-privacy--depth-map-anonymization) — anonymize camera feeds while preserving activity | ✅ |
 | **Segmentation** | [`sam2-segmentation`](skills/segmentation/sam2-segmentation/) | Interactive click-to-segment with Segment Anything 2 — pixel-perfect masks, point/box prompts, video tracking | ✅ |
@@ -70,6 +72,54 @@ Each skill is a self-contained module with its own model, parameters, and [commu
 
 > **Registry:** All skills are indexed in [`skills.json`](skills.json) for programmatic discovery.
 
+### Detection & Segmentation Skills
+
+Detection and segmentation skills process visual data from camera feeds — detecting objects, segmenting regions, or analyzing scenes. All skills use the same **JSONL stdin/stdout protocol**: Aegis writes a frame to a shared volume, sends a `frame` event on stdin, and reads `detections` from stdout. This means every detection skill — whether running natively or inside Docker — is interchangeable from Aegis's perspective.
+
+```mermaid
+graph TB
+    CAM["📷 Camera Feed"] --> GOV["Frame Governor (5 FPS)"]
+    GOV --> |"frame.jpg → shared volume"| PROTO["JSONL stdin/stdout Protocol"]
+
+    PROTO --> NATIVE["🖥️ Native: yolo-detection-2026"]
+    PROTO --> DOCKER["🐳 Docker: Coral TPU / OpenVINO"]
+
+    subgraph Native["Native Skill (runs on host)"]
+        NATIVE --> ENV["env_config.py auto-detect"]
+        ENV --> TRT["NVIDIA → TensorRT"]
+        ENV --> CML["Apple Silicon → CoreML"]
+        ENV --> OV["Intel → OpenVINO IR"]
+        ENV --> ONNX["AMD / CPU → ONNX"]
+    end
+
+    subgraph Container["Docker Container"]
+        DOCKER --> CORAL["Coral TPU → pycoral"]
+        DOCKER --> OVIR["OpenVINO → Ultralytics OV"]
+        DOCKER --> CPU["CPU fallback"]
+        CORAL -.-> USB["USB/IP passthrough"]
+        OVIR -.-> DRI["/dev/dri · /dev/bus/usb"]
+    end
+
+    NATIVE --> |"stdout: detections"| AEGIS["Aegis IPC → Live Overlay + Alerts"]
+    DOCKER --> |"stdout: detections"| AEGIS
+```
+
+- **Native skills** run directly on the host — [`env_config.py`](skills/lib/env_config.py) auto-detects the GPU and converts models to the fastest format (TensorRT, CoreML, OpenVINO IR, ONNX)
+- **Docker skills** wrap hardware-specific runtimes in a container — cross-platform USB/device access without native driver installation
+- **Same output** — Aegis sees identical JSONL from all skills, so detection overlays, alerts, and forensic analysis work with any backend
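The JSONL protocol above is easy to sketch from the skill's side. This is a minimal illustration, not the exact Aegis schema: the event and field names (`type`, `path`, the detection keys) are assumptions, and `detect_objects` is a stand-in for the real backend.

```python
import json
import sys

def detect_objects(frame_path):
    """Stand-in for the real backend (TensorRT, CoreML, Coral TPU, OpenVINO, ...)."""
    # A fixed fake detection; a real skill would run the model on frame_path.
    return [{"label": "person", "confidence": 0.92, "bbox": [10, 20, 110, 220]}]

def handle_event(line):
    """Turn one JSONL stdin event into one JSONL stdout reply (or None)."""
    event = json.loads(line)
    if event.get("type") != "frame":
        return None  # ignore events this skill doesn't understand
    return json.dumps({
        "type": "detections",
        "frame": event.get("path"),
        "detections": detect_objects(event.get("path")),
    })

if __name__ == "__main__":
    # Aegis writes frame.jpg to the shared volume, then sends a `frame` event
    # on stdin; the skill answers with `detections` on stdout, one per line.
    for line in sys.stdin:
        reply = handle_event(line)
        if reply:
            print(reply, flush=True)
```

Because every skill speaks this same loop, swapping a Docker Coral TPU backend for a native TensorRT one changes only the body of `detect_objects`.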
+
+#### LLM-Assisted Skill Installation
+
+Skills are installed by an **autonomous LLM deployment agent** — not by brittle shell scripts. When you click "Install" in Aegis, a focused mini-agent session reads the skill's `SKILL.md` manifest and figures out what to do:
+
+1. **Probe** — reads `SKILL.md`, `requirements.txt`, and `package.json` to understand what the skill needs
+2. **Detect hardware** — checks for NVIDIA (CUDA), AMD (ROCm), Apple Silicon (MPS), Intel (OpenVINO), or CPU-only
+3. **Install** — runs the right commands (`pip install`, `npm install`, `docker build`) with the correct backend-specific dependencies
+4. **Verify** — runs a smoke test to confirm the skill loads before marking it complete
+5. **Determine launch command** — figures out the exact `run_command` to start the skill and saves it to the registry
+
+This means community-contributed skills don't need a bespoke installer — the LLM reads the manifest and adapts to whatever hardware you have. If something fails, it reads the error output and tries to fix it autonomously.
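The `run_command` saved in step 5 ends up in the skills registry. The exact `skills.json` schema is not shown in this diff, so every field below is a hypothetical illustration of what such an entry might look like:

```json
{
  "name": "yolo-detection-2026-coral-tpu",
  "path": "skills/detection/yolo-detection-2026-coral-tpu/",
  "status": "experimental",
  "run_command": "docker run --rm -i --device=/dev/bus/usb yolo-coral-tpu:latest"
}
```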
+
 
 ## 🚀 Getting Started with [SharpAI Aegis](https://www.sharpai.org)
 
docs/paper/home-security-benchmark.tex

Lines changed: 93 additions & 60 deletions
@@ -75,20 +75,22 @@
 preprocessing, tool use, security classification, prompt injection resistance,
 knowledge injection, and event deduplication, plus an optional multimodal
 VLM scene analysis suite (35~additional tests). We present results across
-\textbf{seven model configurations}: four local Qwen3.5 variants
-(9B~Q4\_K\_M, 27B~Q4\_K\_M, 35B-MoE~Q4\_K\_L, 122B-MoE~IQ1\_M) and three
-OpenAI cloud models (GPT-5.4, GPT-5.4-mini, GPT-5.4-nano), all evaluated
-on a single Apple M5~Pro consumer laptop (64~GB unified memory). Our
-findings reveal that (1)~the best local model (Qwen3.5-9B) achieves
-93.8\% accuracy vs.\ 97.9\% for GPT-5.4---a gap of only 4.1~percentage
-points---with complete data privacy and zero API cost; (2)~the
-Qwen3.5-35B-MoE variant produces lower first-token latency (435~ms)
-than any OpenAI cloud endpoint tested (508~ms for GPT-5.4-nano);
-(3)~security threat classification is universally robust across all
-eight model sizes; and (4)~event deduplication across camera views
-remains the hardest task, with only GPT-5.4 achieving a perfect 8/8
-score. HomeSec-Bench is released as an open-source DeepCamera skill,
-enabling reproducible evaluation of any OpenAI-compatible endpoint.
+\textbf{sixteen model configurations} spanning five model families: Qwen3.5
+(six variants from 9B to 122B-MoE), Mistral Small~4 (119B, two quants),
+NVIDIA Nemotron-3-Nano (4B and 30B), Liquid LFM2 (1.2B and 24B), and
+three OpenAI cloud models (GPT-5.4, GPT-5.4-mini, GPT-5.4-nano), all
+evaluated on a single Apple M5~Pro consumer laptop (64~GB unified memory).
+Our findings reveal that (1)~the best local model (Qwen3.5-27B~Q8) achieves
+95.8\% accuracy vs.\ 97.9\% for GPT-5.4---a gap of only 2.1~percentage
+points---with complete data privacy and zero API cost; (2)~Mistral
+Small~4 (119B) at Q2\_K\_XL quantization scores 89.6\%, establishing
+that 119B-class thinking models can run on consumer hardware with
+proper thinking-mode suppression; (3)~security threat classification
+is universally robust across all model sizes; and (4)~event deduplication
+across camera views remains the hardest task, with only GPT-5.4
+achieving a perfect 8/8 score. HomeSec-Bench is released as an
+open-source DeepCamera skill, enabling reproducible evaluation of any
+OpenAI-compatible endpoint.
 \end{abstract}
 
 \begin{IEEEkeywords}
@@ -731,39 +733,56 @@ \section{Experimental Setup}
 
 \subsection{Models Under Test}
 
-We evaluate seven model configurations spanning local and cloud
-deployments. Local models run via \texttt{llama-server} with Metal
-Performance Shaders (MPS/CoreML) acceleration. Cloud models route
-through the OpenAI API.
+We evaluate sixteen model configurations spanning five model families
+across local and cloud deployments. Local models run via
+\texttt{llama-server} (llama.cpp build b8416) with Metal Performance
+Shaders acceleration on Apple M5~Pro. Cloud models route through the
+OpenAI API.
 
 \begin{table}[h]
 \centering
-\caption{Model Configurations Under Test}
+\caption{Model Configurations Under Test (16 Models)}
 \label{tab:models}
 \small
-\begin{tabular}{p{2.8cm}p{1.3cm}p{1.7cm}}
+\begin{tabular}{p{3.4cm}p{1.0cm}p{2.0cm}}
 \toprule
 \textbf{Model} & \textbf{Type} & \textbf{Quant / Size} \\
 \midrule
+\multicolumn{3}{l}{\textit{Qwen3.5 Family}} \\
 Qwen3.5-9B & Local & Q4\_K\_M, 13.8~GB \\
+Qwen3.5-9B & Local & BF16, 18.5~GB \\
 Qwen3.5-27B & Local & Q4\_K\_M, 24.9~GB \\
+Qwen3.5-27B & Local & Q8\_K\_XL, 30.2~GB \\
 Qwen3.5-35B-MoE & Local & Q4\_K\_L, 27.2~GB \\
 Qwen3.5-122B-MoE & Local & IQ1\_M, 40.8~GB \\
+\multicolumn{3}{l}{\textit{Mistral Family}} \\
+Mistral-Small-4-119B & Local & IQ1\_M, 29.0~GB \\
+Mistral-Small-4-119B & Local & Q2\_K\_XL, 42.9~GB \\
+\multicolumn{3}{l}{\textit{NVIDIA Nemotron}} \\
+Nemotron-3-Nano-4B & Local & Q4\_K\_M, 2.5~GB \\
+Nemotron-3-Nano-30B & Local & Q8\_0, 31.5~GB \\
+\multicolumn{3}{l}{\textit{Liquid LFM}} \\
+LFM2.5-1.2B & Local & BF16, 2.4~GB \\
+LFM2-24B-MoE & Local & Q8\_0, 25.6~GB \\
+\multicolumn{3}{l}{\textit{OpenAI Cloud}} \\
 GPT-5.4 & Cloud & API \\
 GPT-5.4-mini & Cloud & API \\
 GPT-5.4-nano & Cloud & API \\
+GPT-5-mini (2025) & Cloud & API \\
 \bottomrule
 \end{tabular}
 \end{table}
 
-All local models are GGUF variants served by \texttt{llama-server}
-(llama.cpp). The MoE variants (35B and 122B) activate only a fraction
-of parameters per token---approximately 3B active for the 35B
-variant---enabling surprisingly low latency relative to parameter count.
-GPT-5.4-mini exhibited API-level restrictions on non-default temperature
-values; affected suites (using \texttt{temperature}$\neq$1.0) returned
-blanket failures, so GPT-5.4-mini results should be interpreted as a
-lower bound of true capability.
+All local models are GGUF variants served by \texttt{llama-server}.
+The MoE variants (Qwen3.5-35B, 122B; LFM2-24B) activate only a
+fraction of parameters per token---approximately 3B active for the
+35B variant---enabling surprisingly low latency relative to parameter
+count. Mistral Small~4 is a thinking model; we suppress reasoning
+tokens via \texttt{--chat-template-kwargs \{"reasoning\_effort":"none"\}}
+and \texttt{--parallel 1} to prevent KV cache memory exhaustion on
+64~GB hardware. GPT-5-mini (2025) rejected non-default temperature
+values; affected suites returned blanket 400 errors, so its results
+represent a lower bound.
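The thinking-mode suppression described above corresponds to a `llama-server` invocation along these lines. This is a command sketch, not a tested configuration: the model filename and port are hypothetical placeholders, while the two flags are the ones named in the text.

```shell
# Sketch only: model path and port are placeholders, not from the paper.
# --chat-template-kwargs suppresses reasoning tokens for the thinking model;
# --parallel 1 caps concurrent slots to avoid KV cache exhaustion on 64 GB.
llama-server \
  --model ./models/Mistral-Small-4-119B-Q2_K_XL.gguf \
  --port 8080 \
  --parallel 1 \
  --chat-template-kwargs '{"reasoning_effort":"none"}'
```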
 
 \subsection{Hardware}
 
@@ -795,33 +814,45 @@ \subsection{Overall Scorecard (LLM-Only, 96 Tests)}
 
 \begin{table}[h]
 \centering
-\caption{Overall LLM Benchmark Results — 96 Tests}
+\caption{Overall LLM Benchmark Results — 96 Tests, 16 Models}
 \label{tab:overall}
 \small
-\begin{tabular}{p{2.5cm}cccc}
+\begin{tabular}{p{3.2cm}cccc}
 \toprule
 \textbf{Model} & \textbf{Pass} & \textbf{Fail} & \textbf{Rate} & \textbf{Time} \\
 \midrule
 GPT-5.4 & \textbf{94} & 2 & \textbf{97.9\%} & 2m 22s \\
 GPT-5.4-mini & 92 & 4 & 95.8\% & 1m 17s \\
-Qwen3.5-9B & 90 & 6 & 93.8\% & 5m 23s \\
-Qwen3.5-27B & 90 & 6 & 93.8\% & 15m 8s \\
+Qwen3.5-27B Q8\_K\_XL & 92 & 4 & 95.8\% & --- \\
+Qwen3.5-9B BF16 & 91 & 5 & 94.8\% & --- \\
+Qwen3.5-27B Q4\_K\_M & 90 & 6 & 93.8\% & 15m 8s \\
+Mistral-119B Q2\_K\_XL & 86 & 10 & 89.6\% & --- \\
 Qwen3.5-122B-MoE & 89 & 7 & 92.7\% & 8m 26s \\
 GPT-5.4-nano & 89 & 7 & 92.7\% & 1m 34s \\
+Qwen3.5-9B Q4\_K\_M & 88 & 8 & 91.7\% & 5m 23s \\
 Qwen3.5-35B-MoE & 88 & 8 & 91.7\% & 3m 30s \\
+Nemotron-4B$^\ddagger$ & 84 & 12 & 87.5\% & --- \\
+Mistral-119B IQ1\_M & 79 & 17 & 82.3\% & --- \\
+Nemotron-30B$^\ddagger$ & 78 & 18 & 81.3\% & --- \\
+LFM2-24B-MoE$^\ddagger$ & 72 & 24 & 75.0\% & --- \\
+LFM2.5-1.2B & 62 & 34 & 64.6\% & --- \\
 GPT-5-mini (2025)$^\dagger$ & 60 & 36 & 62.5\% & 7m 38s \\
 \midrule
-\multicolumn{5}{l}{\footnotesize $^\dagger$API rejected non-default temperature; see §\ref{sec:limitations}.}
+\multicolumn{5}{l}{\footnotesize $^\dagger$API rejected non-default temperature; see §\ref{sec:limitations}.} \\
+\multicolumn{5}{l}{\footnotesize $^\ddagger$Temperature restriction failures inflate fail count; see §\ref{sec:limitations}.}
 \end{tabular}
 \end{table}
 
-The \textbf{Qwen3.5-9B} running entirely on a consumer laptop scores
-\textbf{93.8\%}---only 4.1~percentage points below GPT-5.4, and within
-2~points of GPT-5.4-mini. Strikingly, the Qwen3.5-35B-MoE model
-(88/96) ranks last among valid local models despite having 4$\times$
-more parameters than the 9B variant; this is primarily attributed to
-quantization-induced precision loss at IQ-level quants and higher
-memory bandwidth contention on long reasoning chains.
+The expanded 16-model evaluation reveals several new findings.
+\textbf{Qwen3.5-27B at Q8\_K\_XL} quantization achieves \textbf{95.8\%}---tying
+GPT-5.4-mini and closing to within 2.1~points of GPT-5.4. Higher-precision
+quantization (Q8 vs.\ Q4) provides a 2-point lift for the 27B model.
+\textbf{Mistral Small~4} (119B) at Q2\_K\_XL scores \textbf{89.6\%},
+demonstrating that 119B-class thinking models can produce competitive
+results on consumer hardware when thinking-mode is properly suppressed.
+Nemotron and LFM2 models are penalized by temperature-restriction errors
+(\texttt{temperature=0.1} unsupported); their true capability is higher
+than reported scores suggest.
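The Rate column in the scorecard is simply Pass/(Pass+Fail) over the 96 tests. A quick sanity check of a few rows, using the pass/fail counts from the table:

```python
def pass_rate(passed, failed):
    """Percentage of tests passed, rounded to one decimal place."""
    return round(100 * passed / (passed + failed), 1)

# Spot-check rows from the overall scorecard (96 tests each).
assert pass_rate(94, 2) == 97.9    # GPT-5.4
assert pass_rate(92, 4) == 95.8    # GPT-5.4-mini / Qwen3.5-27B Q8_K_XL
assert pass_rate(86, 10) == 89.6   # Mistral-119B Q2_K_XL
assert pass_rate(62, 34) == 64.6   # LFM2.5-1.2B
```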
 
 \subsection{Inference Performance}
 
@@ -860,15 +891,13 @@ \subsection{Inference Performance}
 choice for threat triage, preserving privacy for the most
 sensitivity-relevant task.
 
-\textbf{Key finding 3: 9B local model closes the cloud gap.}
-Qwen3.5-9B ties with Qwen3.5-27B at 93.8\%---a larger model provides
-no accuracy benefit at 3.7$\times$ the inference time (5m23s vs.
-15m8s for a full 96-test run). The 9B variant represents the
-Pareto-optimal local configuration:
-{
-\small
-$$\text{Qwen3.5-9B}: \frac{93.8\%}{5\text{m23s}} = 17.4\%/\text{min} \quad\text{vs}\quad \text{27B}: \frac{93.8\%}{15\text{m8s}} = 6.2\%/\text{min}$$
-}
+\textbf{Key finding 3: Quantization precision matters more than parameter count.}
+Qwen3.5-27B at Q8\_K\_XL (95.8\%) outperforms the same model at Q4\_K\_M
+(93.8\%)---a 2-point lift from higher-precision quantization alone.
+Similarly, Mistral-119B at Q2\_K\_XL (89.6\%) outperforms its IQ1\_M
+variant (82.3\%) by 7.3~points. For accuracy-critical deployments,
+allocating more memory to higher-precision quants yields better results
+than increasing parameter count at aggressive quantization.
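The two quantization lifts quoted above follow directly from the scorecard rates, as straightforward differences in percentage points:

```latex
\[
\underbrace{95.8\% - 93.8\%}_{\text{Qwen3.5-27B: Q8\_K\_XL vs.\ Q4\_K\_M}} = 2.0~\text{pp},
\qquad
\underbrace{89.6\% - 82.3\%}_{\text{Mistral-119B: Q2\_K\_XL vs.\ IQ1\_M}} = 7.3~\text{pp}
\]
```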
 
 \textbf{Key finding 4: Context preprocessing remains universally challenging.}
 All models---local and cloud---fail at least one context deduplication
@@ -978,7 +1007,7 @@ \section{Discussion}
 
 \subsection{Deployment Decision Matrix}
 
-Based on our seven-model evaluation, we propose the following guidance:
+Based on our sixteen-model evaluation, we propose the following guidance:
 
 \begin{table}[h]
 \centering
@@ -1085,16 +1114,20 @@ \section{Conclusion}
 multi-turn contextual reasoning---providing a standardized, reproducible
 framework for comparing model suitability in video surveillance deployments.
 
-Evaluating seven model configurations on a single Apple~M5~Pro laptop
-reveals a fundamentally different landscape than the established
-consensus that cloud models are required for production AI accuracy.
-The \textbf{Qwen3.5-9B} achieves \textbf{93.8\%}---within 4.1 points
-of GPT-5.4 (97.9\%)---while running entirely locally with 13.8~GB of
-unified memory, zero API cost, and complete data privacy. The
-Qwen3.5-35B-MoE variant produces \textbf{lower first-token latency}
-(435~ms) than any cloud endpoint we tested (508~ms for GPT-5.4-nano),
-demonstrating that sparse MoE activation is a compelling architectural
-choice for latency-sensitive security alerting on consumer hardware.
+Evaluating sixteen model configurations across five model families on a
+single Apple~M5~Pro laptop reveals a fundamentally different landscape
+than the established consensus that cloud models are required for
+production AI accuracy. The \textbf{Qwen3.5-27B at Q8} achieves
+\textbf{95.8\%}---within 2.1~points of GPT-5.4 (97.9\%)---while running
+entirely locally with 30.2~GB of unified memory, zero API cost, and
+complete data privacy. \textbf{Mistral Small~4} (119B) at Q2\_K\_XL
+scores \textbf{89.6\%}, establishing that 119B-class thinking models
+can serve as effective security assistants on consumer hardware when
+reasoning tokens are suppressed. The Qwen3.5-35B-MoE variant produces
+\textbf{lower first-token latency} (435~ms) than any cloud endpoint
+tested (508~ms for GPT-5.4-nano), demonstrating that sparse MoE
+activation is a compelling architectural choice for latency-sensitive
+security alerting.
 
 Security classification is universally robust (100\% across all models),
 validating local inference for the most consequence-heavy task.
