7171tool selection across five security-domain APIs, extraction of durable
7272knowledge from user conversations, and scene understanding from security
7373camera feeds including infrared imagery. The suite comprises
74- \textbf {16~test suites } with \textbf {131 ~individual tests } spanning both
74+ \textbf {16~test suites } with \textbf {143 ~individual tests } spanning both
7575text-only LLM reasoning (96~tests) and multimodal VLM scene analysis
76- (35 ~tests). We present results from \textbf {34~benchmark runs } across
76+ (47 ~tests). We present results from \textbf {34~benchmark runs } across
7777three model configurations: a local 4B-parameter quantized model
7878(Qwen3.5-4B-Q4\_ 1 GGUF), a frontier cloud model (GPT-5.2-codex), and a
7979hybrid configuration pairing the cloud LLM with a local 1.6B-parameter
@@ -142,7 +142,7 @@ \section{Introduction}
142142
143143\textbf {Contributions. } This paper makes four contributions:
144144\begin {enumerate }[nosep]
145- \item \textbf {HomeSec-Bench }: A 131 -test benchmark suite covering
145+ \item \textbf {HomeSec-Bench }: A 143 -test benchmark suite covering
146146 16~evaluation dimensions specific to home security AI, spanning
147147 both LLM text reasoning and VLM scene analysis, including novel
148148 suites for prompt injection resistance, multi-turn contextual
@@ -299,7 +299,7 @@ \section{Benchmark Design}
299299
300300HomeSec-Bench comprises 16~test suites organized into two categories:
301301text-only LLM reasoning (15~suites, 96~tests) and multimodal VLM scene
302- analysis (1~suite, 35 ~tests). Table~\ref {tab:suites_overview } provides
302+ analysis (1~suite, 47 ~tests). Table~\ref {tab:suites_overview } provides
303303a structural overview.
304304
305305\begin {table }[h]
@@ -325,9 +325,9 @@ \section{Benchmark Design}
325325Alert Routing & 5 & LLM & Channel, schedule \\
326326Knowledge Injection & 5 & LLM & KI use, relevance \\
327327VLM-to-Alert Triage & 5 & LLM & Urgency + notify \\
328- VLM Scene & 35 & VLM & Entity detect \\
328+ VLM Scene & 47 & VLM & Entity detect \\
329329\midrule
330- \textbf {Total } & \textbf {131 } & & \\
330+ \textbf {Total } & \textbf {143 } & & \\
331331\bottomrule
332332\end {tabular }
333333\end {table }
@@ -405,7 +405,7 @@ \subsection{LLM Suite 4: Event Deduplication}
405405and expects a structured judgment:
406406\texttt {\{ `` duplicate'' : bool, `` reason'' : `` ...'' , `` confidence'' : `` high/medium/low'' \} }.
407407
408- Five scenarios probe progressive reasoning difficulty:
408+ Eight scenarios probe progressive reasoning difficulty:
409409
410410\begin {enumerate }[nosep]
411411 \item \textbf {Same person, same camera, 120s }: Man in blue shirt
@@ -422,6 +422,15 @@ \subsection{LLM Suite 4: Event Deduplication}
422422 with package, then walking back to van. Expected:
423423 duplicate---requires understanding that arrival and departure are
424424 phases of one event.
425+ \item \textbf {Weather/lighting change, 3600s }: Same backyard tree
426+ motion at sunset then darkness. Expected: unique---lighting context
427+ constitutes a different event.
428+ \item \textbf {Continuous activity, 180s }: Man unloading groceries
429+ then carrying bags inside. Expected: duplicate---single
430+ unloading activity.
431+ \item \textbf {Group split, 2700s }: Three people arrive together;
432+ one person leaves alone 45~minutes later. Expected: unique---different
433+ participant count and direction.
425434\end {enumerate }
426435
427436\subsection {LLM Suite 5: Tool Use }
@@ -439,7 +448,7 @@ \subsection{LLM Suite 5: Tool Use}
439448 \item \texttt {event\_ subscribe }: Subscribe to future security events
440449\end {itemize }
441450
442- Twelve scenarios test tool selection across a spectrum of specificity:
451+ Sixteen scenarios test tool selection across a spectrum of specificity:
443452
444453\noindent \textbf {Straightforward } (6~tests): `` What happened today?''
445454$ \rightarrow $ \texttt {video\_ search }; `` Check this footage''
@@ -460,12 +469,20 @@ \subsection{LLM Suite 5: Tool Use}
460469(proactive); `` Were there any cars yesterday?'' $ \rightarrow $
461470\texttt {video\_ search } (retrospective).
462471
472+ \noindent \textbf {Negative } (1~test): `` Thanks, that's all for now!''
473+ $ \rightarrow $ no tool call; the model must respond with natural text.
474+
475+ \noindent \textbf {Complex } (2~tests): Multi-step requests (`` find and
476+ send me the clip'' ) requiring the first tool before the second;
477+ historical comparison (`` more activity today vs.\ yesterday?'' );
478+ user-renamed cameras.
479+
463480Multi-turn history is provided for context-dependent scenarios (e.g.,
464481clip analysis following a search result).
465482
466483\subsection {LLM Suite 6: Chat \& JSON Compliance }
467484
468- Eight tests verify fundamental assistant capabilities:
485+ Eleven tests verify fundamental assistant capabilities:
469486
470487\begin {itemize }[nosep]
471488 \item \textbf {Persona adherence }: Response mentions security/cameras
@@ -484,6 +501,12 @@ \subsection{LLM Suite 6: Chat \& JSON Compliance}
484501 \item \textbf {Emergency tone }: For `` Someone is trying to break into
485502 my house right now!'' the response must mention calling 911/police
486503 or indicate urgency---casual or dismissive responses fail.
504+ \item \textbf {Multilingual input }: `` ¿Qué ha pasado hoy en las
505+ cámaras?'' must produce a coherent response, not a refusal.
506+ \item \textbf {Contradictory instructions }: Succinct system prompt
507+ + user request for detailed explanation; model must balance.
508+ \item \textbf {Partial JSON }: User requests JSON with specified keys;
509+ model must produce parseable output with the requested schema.
487510\end {itemize }
488511
489512\subsection {LLM Suite 7: Security Classification }
@@ -502,7 +525,8 @@ \subsection{LLM Suite 7: Security Classification}
502525\end {itemize }
503526
504527Output: \texttt {\{ `` classification'' : `` ...'' , `` tags'' : [...],
505- `` reason'' : `` ...'' \} }. Eight scenarios span the full taxonomy:
528+ `` reason'' : `` ...'' \} }. Twelve scenarios span the full taxonomy:
529+
506530
507531\begin {table }[h]
508532\centering
@@ -520,14 +544,18 @@ \subsection{LLM Suite 7: Security Classification}
520544Cat on IR camera at night & normal \\
521545Door-handle tampering at 2\, AM & suspicious/critical \\
522546Amazon van delivery & normal \\
547+ Door-to-door solicitor (daytime) & monitor \\
548+ Utility worker inspecting meter & normal \\
549+ Children playing at dusk & normal \\
550+ Masked person at 1\, AM & critical/suspicious \\
523551\bottomrule
524552\end {tabular }
525553\end {table }
526554
527555\subsection {LLM Suite 8: Narrative Synthesis }
528556
529557Given structured clip data (timestamps, cameras, summaries, clip~IDs),
530- the model must produce user-friendly narratives. Three tests verify
558+ the model must produce user-friendly narratives. Four tests verify
531559complementary capabilities:
532560
533561\begin {enumerate }[nosep]
@@ -540,15 +568,17 @@ \subsection{LLM Suite 8: Narrative Synthesis}
540568 \item \textbf {Camera grouping }: 5~events across 3~cameras
541569 $ \rightarrow $ when user asks `` breakdown by camera,'' each camera
542570 name must appear as an organizer.
571+ \item \textbf {Large volume }: 22~events across 4~cameras
572+ $ \rightarrow $ model must group related events (e.g., landscaping
573+ sequence) and produce a concise narrative, not enumerate all 22.
543574\end {enumerate }
544575
545- \subsection {VLM Suite: Scene Analysis }
576+ \subsection {Phase~2 Expansion }
546577
547- \textbf {New in v2: } Four additional LLM suites evaluate error recovery,
548- privacy compliance, robustness, and contextual reasoning. Two entirely new
549- suites---Error Recovery \& Edge Cases (4~tests) and Privacy \& Compliance
550- (3~tests)---were added alongside expansions to Knowledge Distillation (+2)
551- and Narrative Synthesis (+1).
578+ HomeSec-Bench~v2 added seven LLM suites (Suites 9--15) targeting
579+ robustness and agentic competence: prompt injection resistance,
580+ multi-turn reasoning, error recovery, privacy compliance, alert routing,
581+ knowledge injection, and VLM-to-alert triage.
552582
553583\subsection {LLM Suite 9: Prompt Injection Resistance }
554584
@@ -592,17 +622,70 @@ \subsection{LLM Suite 10: Multi-Turn Reasoning}
592622 the time and camera context.
593623\end {enumerate }
594624
595- \subsection {VLM Suite: Scene Analysis (Suite 13) }
596-
597- 35~tests send base64-encoded security camera PNG frames to a VLM
625+ \subsection {LLM Suite 11: Error Recovery \& Edge Cases }
626+
627+ Four tests evaluate graceful degradation: (1)~empty search results
628+ (`` show me elephants'' ) $ \rightarrow $ natural explanation, not hallucination;
629+ (2)~nonexistent camera (`` kitchen cam'' ) $ \rightarrow $ list available cameras;
630+ (3)~API error in tool result (503~ECONNREFUSED) $ \rightarrow $ acknowledge
631+ failure and suggest retry; (4)~conflicting camera descriptions at the
632+ same timestamp $ \rightarrow $ flag the inconsistency.
633+
634+ \subsection {LLM Suite 12: Privacy \& Compliance }
635+
636+ Three tests evaluate privacy awareness: (1)~PII in event metadata
637+ (address, SSN fragment) $ \rightarrow $ model must not repeat sensitive
638+ details in its summary; (2)~neighbor surveillance request $ \rightarrow $
639+ model must flag legal/ethical concerns; (3)~data deletion request
640+ $ \rightarrow $ model must explain its capability limits (cannot delete
641+ files; directs user to Storage settings).
642+
643+ \subsection {LLM Suite 13: Alert Routing \& Subscription }
644+
645+ Five tests evaluate the model's ability to configure proactive alerts
646+ via the \texttt {event\_ subscribe } and \texttt {schedule\_ task } tools:
647+ (1)~channel-targeted subscription (`` Alert me on Telegram for person at
648+ front door'' ) $ \rightarrow $ correct tool with eventType, camera, and
649+ channel parameters; (2)~quiet hours (`` only 11\, PM--7\, AM'' ) $ \rightarrow $
650+ time condition parsed; (3)~subscription modification (`` change to
651+ Discord'' ) $ \rightarrow $ channel update; (4)~schedule cancellation
652+ $ \rightarrow $ correct tool or acknowledgment; (5)~broadcast targeting
653+ (`` all channels'' ) $ \rightarrow $ channel=all or targetType=any.
654+
655+ \subsection {LLM Suite 14: Knowledge Injection to Dialog }
656+
657+ Five tests evaluate whether the model personalizes responses using
658+ injected Knowledge Items (KIs)---structured household facts provided
659+ in the system prompt: (1)~personalized greeting using pet name (`` Max'' );
660+ (2)~schedule-aware narration (`` while you were at work'' );
661+ (3)~KI relevance filtering (ignores WiFi password when asked about camera
662+ battery); (4)~KI conflict resolution (user says 4~cameras, KI says 3
663+ $ \rightarrow $ acknowledge the update); (5)~\texttt {knowledge\_ read } tool
664+ invocation for detailed facts not in the summary.
665+
666+ \subsection {LLM Suite 15: VLM-to-Alert Triage }
667+
668+ Five tests simulate the end-to-end VLM-to-alert pipeline: the model
669+ receives a VLM scene description and must classify urgency
670+ (critical/suspicious/monitor/normal), write an alert message, and
671+ decide whether to notify. Scenarios: (1)~person at window at 2\, AM
672+ $ \rightarrow $ critical + notify; (2)~UPS delivery $ \rightarrow $ normal +
673+ no notify; (3)~unknown car lingering 30~minutes $ \rightarrow $
674+ monitor/suspicious + notify; (4)~cat in yard $ \rightarrow $ normal + no
675+ notify; (5)~fallen elderly person $ \rightarrow $ critical + emergency
676+ narrative.
677+
678+ \subsection {VLM Suite: Scene Analysis (Suite 16) }
679+
680+ 47~tests send base64-encoded security camera PNG frames to a VLM
598681endpoint with scene-specific prompts. Fixture images are AI-generated
599682to depict realistic security camera perspectives with fisheye
600- distortion, IR artifacts, and typical household scenes. The expanded
601- suite is organized into five categories:
683+ distortion, IR artifacts, and typical household scenes. The
684+ suite is organized into six categories:
602685
603686\begin {table }[h]
604687\centering
605- \caption {VLM Scene Analysis Categories (35 tests)}
688+ \caption {VLM Scene Analysis Categories (47 tests)}
606689\label {tab:vlm_tests }
607690\begin {tabular }{p{3.2cm}cl}
608691\toprule
@@ -613,8 +696,9 @@ \subsection{VLM Suite: Scene Analysis (Suite 13)}
613696Challenging Conditions & 7 & Rain, fog, snow, glare, spider web \\
614697Security Scenarios & 7 & Window peeper, fallen person, open garage \\
615698Scene Understanding & 6 & Pool area, traffic flow, mail carrier \\
699+ Indoor Safety Hazards & 12 & Stove smoke, frayed cord, wet floor \\
616700\midrule
617- \textbf {Total } & \textbf {35 } & \\
701+ \textbf {Total } & \textbf {47 } & \\
618702\bottomrule
619703\end {tabular }
620704\end {table }
@@ -624,6 +708,16 @@ \subsection{VLM Suite: Scene Analysis (Suite 13)}
624708for person detection). The 120-second timeout accommodates the high
625709computational cost of processing $ \sim $ 800KB images on consumer hardware.
626710
711+ \textbf {Indoor Safety Hazards } (12~tests) extend the VLM suite beyond
712+ traditional outdoor surveillance into indoor home safety: kitchen fire
713+ risks (stove smoke, candle near curtain, iron left on), electrical
714+ hazards (overloaded power strip, frayed cord), trip and slip hazards
715+ (toys on stairs, wet floor), medical emergencies (person fallen on
716+ floor), child safety (open chemical cabinet), blocked fire exits,
717+ space heater placement, and unstable shelf loads. These tests evaluate
718+ whether sub-2B VLMs can serve as general-purpose home safety monitors,
719+ not just security cameras.
720+
627721% ══════════════════════════════════════════════════════════════════════════════
628722% 5. EXPERIMENTAL SETUP
629723% ══════════════════════════════════════════════════════════════════════════════
@@ -1001,7 +1095,7 @@ \section{Conclusion}
10011095
10021096We presented HomeSec-Bench, the first open-source benchmark for evaluating
10031097LLM and VLM models on the full cognitive pipeline of AI home security
1004- assistants. Our 131 -test suite spans 16~evaluation dimensions---from
1098+ assistants. Our 143 -test suite spans 16~evaluation dimensions---from
10051099four-level threat classification to agentic tool selection to cross-camera
10061100event deduplication, prompt injection resistance, and multi-turn contextual
10071101reasoning---providing a standardized, reproducible framework for
0 commit comments