75 | 75 | preprocessing, tool use, security classification, prompt injection resistance, |
76 | 76 | knowledge injection, and event deduplication, plus an optional multimodal |
77 | 77 | VLM scene analysis suite (35~additional tests). We present results across |
78 | | -\textbf{seven model configurations}: four local Qwen3.5 variants |
79 | | -(9B~Q4\_K\_M, 27B~Q4\_K\_M, 35B-MoE~Q4\_K\_L, 122B-MoE~IQ1\_M) and three |
80 | | -OpenAI cloud models (GPT-5.4, GPT-5.4-mini, GPT-5.4-nano), all evaluated |
81 | | -on a single Apple M5~Pro consumer laptop (64~GB unified memory). Our |
82 | | -findings reveal that (1)~the best local model (Qwen3.5-9B) achieves |
83 | | -93.8\% accuracy vs.\ 97.9\% for GPT-5.4---a gap of only 4.1~percentage |
84 | | -points---with complete data privacy and zero API cost; (2)~the |
85 | | -Qwen3.5-35B-MoE variant produces lower first-token latency (435~ms) |
86 | | -than any OpenAI cloud endpoint tested (508~ms for GPT-5.4-nano); |
87 | | -(3)~security threat classification is universally robust across all |
88 | | -eight model sizes; and (4)~event deduplication across camera views |
89 | | -remains the hardest task, with only GPT-5.4 achieving a perfect 8/8 |
90 | | -score. HomeSec-Bench is released as an open-source DeepCamera skill, |
91 | | -enabling reproducible evaluation of any OpenAI-compatible endpoint. |
| 78 | +\textbf{sixteen model configurations} spanning five model families: Qwen3.5 |
| 79 | +(six variants from 9B to 122B-MoE), Mistral Small~4 (119B, two quants), |
| 80 | +NVIDIA Nemotron-3-Nano (4B and 30B), Liquid LFM2 (1.2B and 24B), and |
| 81 | +four OpenAI cloud models (GPT-5.4, GPT-5.4-mini, GPT-5.4-nano, plus the legacy GPT-5-mini), all
| 82 | +evaluated on a single Apple M5~Pro consumer laptop (64~GB unified memory). |
| 83 | +Our findings reveal that (1)~the best local model (Qwen3.5-27B~Q8) achieves |
| 84 | +95.8\% accuracy vs.\ 97.9\% for GPT-5.4---a gap of only 2.1~percentage |
| 85 | +points---with complete data privacy and zero API cost; (2)~Mistral |
| 86 | +Small~4 (119B) at Q2\_K\_XL quantization scores 89.6\%, establishing |
| 87 | +that 119B-class thinking models can run on consumer hardware with |
| 88 | +proper thinking-mode suppression; (3)~security threat classification |
| 89 | +is universally robust across all model sizes; and (4)~event deduplication |
| 90 | +across camera views remains the hardest task, with only GPT-5.4 |
| 91 | +achieving a perfect 8/8 score. HomeSec-Bench is released as an |
| 92 | +open-source DeepCamera skill, enabling reproducible evaluation of any |
| 93 | +OpenAI-compatible endpoint. |
92 | 94 | \end{abstract} |
93 | 95 |
|
94 | 96 | \begin{IEEEkeywords} |
@@ -731,39 +733,56 @@ \section{Experimental Setup} |
731 | 733 |
|
732 | 734 | \subsection{Models Under Test} |
733 | 735 |
|
734 | | -We evaluate seven model configurations spanning local and cloud |
735 | | -deployments. Local models run via \texttt{llama-server} with Metal |
736 | | -Performance Shaders (MPS/CoreML) acceleration. Cloud models route |
737 | | -through the OpenAI API. |
| 736 | +We evaluate sixteen model configurations spanning five model families |
| 737 | +across local and cloud deployments. Local models run via |
| 738 | +\texttt{llama-server} (llama.cpp build b8416) with Metal Performance |
| 739 | +Shaders acceleration on Apple M5~Pro. Cloud models route through the |
| 740 | +OpenAI API. |
738 | 741 |
|
739 | 742 | \begin{table}[h] |
740 | 743 | \centering |
741 | | -\caption{Model Configurations Under Test} |
| 744 | +\caption{Model Configurations Under Test (16 Models)} |
742 | 745 | \label{tab:models} |
743 | 746 | \small |
744 | | -\begin{tabular}{p{2.8cm}p{1.3cm}p{1.7cm}} |
| 747 | +\begin{tabular}{p{3.4cm}p{1.0cm}p{2.0cm}} |
745 | 748 | \toprule |
746 | 749 | \textbf{Model} & \textbf{Type} & \textbf{Quant / Size} \\ |
747 | 750 | \midrule |
| 751 | +\multicolumn{3}{l}{\textit{Qwen3.5 Family}} \\ |
748 | 752 | Qwen3.5-9B & Local & Q4\_K\_M, 13.8~GB \\ |
| 753 | +Qwen3.5-9B & Local & BF16, 18.5~GB \\ |
749 | 754 | Qwen3.5-27B & Local & Q4\_K\_M, 24.9~GB \\ |
| 755 | +Qwen3.5-27B & Local & Q8\_K\_XL, 30.2~GB \\ |
750 | 756 | Qwen3.5-35B-MoE & Local & Q4\_K\_L, 27.2~GB \\ |
751 | 757 | Qwen3.5-122B-MoE & Local & IQ1\_M, 40.8~GB \\ |
| 758 | +\multicolumn{3}{l}{\textit{Mistral Family}} \\ |
| 759 | +Mistral-Small-4-119B & Local & IQ1\_M, 29.0~GB \\ |
| 760 | +Mistral-Small-4-119B & Local & Q2\_K\_XL, 42.9~GB \\ |
| 761 | +\multicolumn{3}{l}{\textit{NVIDIA Nemotron}} \\ |
| 762 | +Nemotron-3-Nano-4B & Local & Q4\_K\_M, 2.5~GB \\ |
| 763 | +Nemotron-3-Nano-30B & Local & Q8\_0, 31.5~GB \\ |
| 764 | +\multicolumn{3}{l}{\textit{Liquid LFM}} \\ |
| 765 | +LFM2.5-1.2B & Local & BF16, 2.4~GB \\ |
| 766 | +LFM2-24B-MoE & Local & Q8\_0, 25.6~GB \\ |
| 767 | +\multicolumn{3}{l}{\textit{OpenAI Cloud}} \\ |
752 | 768 | GPT-5.4 & Cloud & API \\ |
753 | 769 | GPT-5.4-mini & Cloud & API \\ |
754 | 770 | GPT-5.4-nano & Cloud & API \\ |
| 771 | +GPT-5-mini (2025) & Cloud & API \\ |
755 | 772 | \bottomrule |
756 | 773 | \end{tabular} |
757 | 774 | \end{table} |
758 | 775 |
|
759 | | -All local models are GGUF variants served by \texttt{llama-server} |
760 | | -(llama.cpp). The MoE variants (35B and 122B) activate only a fraction |
761 | | -of parameters per token---approximately 3B active for the 35B |
762 | | -variant---enabling surprisingly low latency relative to parameter count. |
763 | | -GPT-5.4-mini exhibited API-level restrictions on non-default temperature |
764 | | -values; affected suites (using \texttt{temperature}$\neq$1.0) returned |
765 | | -blanket failures, so GPT-5.4-mini results should be interpreted as a |
766 | | -lower bound of true capability. |
| 776 | +All local models are GGUF variants served by \texttt{llama-server}. |
| 777 | +The MoE variants (Qwen3.5-35B, 122B; LFM2-24B) activate only a |
| 778 | +fraction of parameters per token---approximately 3B active for the |
| 779 | +35B variant---enabling surprisingly low latency relative to parameter |
| 780 | +count. Mistral Small~4 is a thinking model; we suppress reasoning |
| 781 | +tokens via \texttt{--chat-template-kwargs \{"reasoning\_effort":"none"\}} |
| 782 | +and \texttt{--parallel 1} to prevent KV cache memory exhaustion on |
| 783 | +64~GB hardware. GPT-5-mini (2025) rejected non-default temperature |
| 784 | +values; affected suites returned blanket 400 errors, so its results |
| 785 | +represent a lower bound. |
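Because every model, local or cloud, is exercised through the same OpenAI-compatible chat-completions interface, a single client suffices for all sixteen configurations. The sketch below is illustrative rather than the benchmark's actual harness code: the endpoint URL, model name, and the retry-on-400 policy (mirroring the temperature-rejection behavior described above) are assumptions.

```python
import json
import urllib.error
import urllib.request

def build_payload(model: str, prompt: str, temperature: float = 1.0) -> dict:
    """Chat-completions payload accepted by both llama-server and the OpenAI API."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def complete(url: str, payload: dict) -> str:
    """POST a chat completion. If the endpoint rejects a non-default
    temperature with HTTP 400 (the failure mode GPT-5-mini exhibited),
    retry once at the provider default of 1.0 so a run can finish."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 400 and payload.get("temperature", 1.0) != 1.0:
            return complete(url, dict(payload, temperature=1.0))
        raise
    return body["choices"][0]["message"]["content"]
```

Pointing `url` at llama-server's local port (for the GGUF models) or at the OpenAI base URL (for the GPT-5.4 endpoints) would be the only per-model difference; any result obtained via the 1.0 fallback would still need flagging, since the scorecard counts such suites as failures.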
767 | 786 |
|
768 | 787 | \subsection{Hardware} |
769 | 788 |
|
@@ -795,33 +814,45 @@ \subsection{Overall Scorecard (LLM-Only, 96 Tests)} |
795 | 814 |
|
796 | 815 | \begin{table}[h] |
797 | 816 | \centering |
798 | | -\caption{Overall LLM Benchmark Results — 96 Tests} |
| 817 | +\caption{Overall LLM Benchmark Results — 96 Tests, 16 Models} |
799 | 818 | \label{tab:overall} |
800 | 819 | \small |
801 | | -\begin{tabular}{p{2.5cm}cccc} |
| 820 | +\begin{tabular}{p{3.2cm}cccc} |
802 | 821 | \toprule |
803 | 822 | \textbf{Model} & \textbf{Pass} & \textbf{Fail} & \textbf{Rate} & \textbf{Time} \\ |
804 | 823 | \midrule |
805 | 824 | GPT-5.4 & \textbf{94} & 2 & \textbf{97.9\%} & 2m 22s \\ |
806 | 825 | GPT-5.4-mini & 92 & 4 & 95.8\% & 1m 17s \\ |
807 | | -Qwen3.5-9B & 90 & 6 & 93.8\% & 5m 23s \\ |
808 | | -Qwen3.5-27B & 90 & 6 & 93.8\% & 15m 8s \\ |
| 826 | +Qwen3.5-27B Q8\_K\_XL & 92 & 4 & 95.8\% & --- \\ |
| 827 | +Qwen3.5-9B BF16 & 91 & 5 & 94.8\% & --- \\ |
| 828 | +Qwen3.5-27B Q4\_K\_M & 90 & 6 & 93.8\% & 15m 8s \\ |
| 829 | +Mistral-119B Q2\_K\_XL & 86 & 10 & 89.6\% & --- \\ |
809 | 830 | Qwen3.5-122B-MoE & 89 & 7 & 92.7\% & 8m 26s \\ |
810 | 831 | GPT-5.4-nano & 89 & 7 & 92.7\% & 1m 34s \\ |
| 832 | +Qwen3.5-9B Q4\_K\_M & 88 & 8 & 91.7\% & 5m 23s \\ |
811 | 833 | Qwen3.5-35B-MoE & 88 & 8 & 91.7\% & 3m 30s \\ |
| 834 | +Nemotron-4B$^\ddagger$ & 84 & 12 & 87.5\% & --- \\ |
| 835 | +Mistral-119B IQ1\_M & 79 & 17 & 82.3\% & --- \\ |
| 836 | +Nemotron-30B$^\ddagger$ & 78 & 18 & 81.3\% & --- \\ |
| 837 | +LFM2-24B-MoE$^\ddagger$ & 72 & 24 & 75.0\% & --- \\ |
| 838 | +LFM2.5-1.2B & 62 & 34 & 64.6\% & --- \\ |
812 | 839 | GPT-5-mini (2025)$^\dagger$ & 60 & 36 & 62.5\% & 7m 38s \\ |
813 | 840 | \midrule |
814 | | -\multicolumn{5}{l}{\footnotesize $^\dagger$API rejected non-default temperature; see §\ref{sec:limitations}.} |
| 841 | +\multicolumn{5}{l}{\footnotesize $^\dagger$API rejected non-default temperature; see §\ref{sec:limitations}.} \\ |
| 842 | +\multicolumn{5}{l}{\footnotesize $^\ddagger$Temperature restriction failures inflate fail count; see §\ref{sec:limitations}.} |
815 | 843 | \end{tabular} |
816 | 844 | \end{table} |
817 | 845 |
|
818 | | -The \textbf{Qwen3.5-9B} running entirely on a consumer laptop scores |
819 | | -\textbf{93.8\%}---only 4.1~percentage points below GPT-5.4, and within |
820 | | -2~points of GPT-5.4-mini. Strikingly, the Qwen3.5-35B-MoE model |
821 | | -(88/96) ranks last among valid local models despite having 4$\times$ |
822 | | -more parameters than the 9B variant; this is primarily attributed to |
823 | | -quantization-induced precision loss at IQ-level quants and higher |
824 | | -memory bandwidth contention on long reasoning chains. |
| 846 | +The expanded 16-model evaluation reveals several new findings. |
| 847 | +\textbf{Qwen3.5-27B at Q8\_K\_XL} quantization achieves \textbf{95.8\%}---tying |
| 848 | +GPT-5.4-mini and closing to within 2.1~points of GPT-5.4. Higher-precision |
| 849 | +quantization (Q8 vs.\ Q4) provides a 2-point lift for the 27B model. |
| 850 | +\textbf{Mistral Small~4} (119B) at Q2\_K\_XL scores \textbf{89.6\%}, |
| 851 | +demonstrating that 119B-class thinking models can produce competitive |
| 852 | +results on consumer hardware when thinking mode is properly suppressed.
| 853 | +Nemotron and LFM2 models are penalized by temperature-restriction errors |
| 854 | +(\texttt{temperature=0.1} unsupported); their true capability is higher |
| 855 | +than reported scores suggest. |
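As a sanity check on the scorecard, the pass rates are plain ratios over the 96-test suite; the model names and pass counts below are taken directly from the table above:

```python
TOTAL_TESTS = 96

def pass_rate(passed: int, total: int = TOTAL_TESTS) -> float:
    """Percentage pass rate, rounded to one decimal as in the table."""
    return round(100 * passed / total, 1)

# Selected rows from the scorecard.
assert pass_rate(94) == 97.9  # GPT-5.4
assert pass_rate(92) == 95.8  # GPT-5.4-mini and Qwen3.5-27B Q8_K_XL
assert pass_rate(86) == 89.6  # Mistral-119B Q2_K_XL
assert pass_rate(60) == 62.5  # GPT-5-mini (2025)

# Headline gap: best cloud (GPT-5.4) vs best local (Qwen3.5-27B Q8).
gap = round(pass_rate(94) - pass_rate(92), 1)
print(f"cloud-local gap: {gap} points")  # cloud-local gap: 2.1 points
```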
825 | 856 |
|
826 | 857 | \subsection{Inference Performance} |
827 | 858 |
|
@@ -860,15 +891,13 @@ \subsection{Inference Performance} |
860 | 891 | choice for threat triage, preserving privacy for the most |
861 | 892 | sensitivity-relevant task. |
862 | 893 |
|
863 | | -\textbf{Key finding 3: 9B local model closes the cloud gap.} |
864 | | -Qwen3.5-9B ties with Qwen3.5-27B at 93.8\%---a larger model provides |
865 | | -no accuracy benefit at 3.7$\times$ the inference time (5m23s vs. |
866 | | -15m8s for a full 96-test run). The 9B variant represents the |
867 | | -Pareto-optimal local configuration: |
868 | | -{ |
869 | | -\small |
870 | | -$$\text{Qwen3.5-9B}: \frac{93.8\%}{5\text{m23s}} = 17.4\%/\text{min} \quad\text{vs}\quad \text{27B}: \frac{93.8\%}{15\text{m8s}} = 6.2\%/\text{min}$$ |
871 | | -} |
| 894 | +\textbf{Key finding 3: Quantization precision matters more than parameter count.} |
| 895 | +Qwen3.5-27B at Q8\_K\_XL (95.8\%) outperforms the same model at Q4\_K\_M |
| 896 | +(93.8\%)---a 2-point lift from higher-precision quantization alone. |
| 897 | +Similarly, Mistral-119B at Q2\_K\_XL (89.6\%) outperforms its IQ1\_M |
| 898 | +variant (82.3\%) by 7.3~points. For accuracy-critical deployments, |
| 899 | +allocating more memory to higher-precision quants yields better results |
| 900 | +than increasing parameter count at aggressive quantization. |
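Viewed as a selection problem under a fixed unified-memory budget, the tradeoff is easy to operationalize. The sketch below uses (file size, pass rate) pairs from the tables above; the 16 GB headroom reserved for KV cache, the OS, and other workloads is our assumption, not a figure from the benchmark:

```python
# (model, weight file size in GB, pass rate %) from the benchmark tables.
CONFIGS = [
    ("Qwen3.5-9B Q4_K_M",      13.8, 91.7),
    ("Qwen3.5-27B Q4_K_M",     24.9, 93.8),
    ("Qwen3.5-27B Q8_K_XL",    30.2, 95.8),
    ("Mistral-119B IQ1_M",     29.0, 82.3),
    ("Mistral-119B Q2_K_XL",   42.9, 89.6),
    ("Qwen3.5-122B-MoE IQ1_M", 40.8, 92.7),
]

def best_fit(budget_gb: float, headroom_gb: float = 16.0):
    """Highest-accuracy config whose weights fit in budget minus headroom."""
    usable = budget_gb - headroom_gb
    fitting = [c for c in CONFIGS if c[1] <= usable]
    return max(fitting, key=lambda c: c[2]) if fitting else None

print(best_fit(64.0))  # → ('Qwen3.5-27B Q8_K_XL', 30.2, 95.8)
```

With the full 64 GB budget the Q8 27B wins outright; shrink the budget to 36 GB and the selection falls back to the 9B Q4 variant, mirroring the finding that precision, not parameter count, drives accuracy under memory pressure.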
872 | 901 |
|
873 | 902 | \textbf{Key finding 4: Context preprocessing remains universally challenging.} |
874 | 903 | All models---local and cloud---fail at least one context deduplication |
@@ -978,7 +1007,7 @@ \section{Discussion} |
978 | 1007 |
|
979 | 1008 | \subsection{Deployment Decision Matrix} |
980 | 1009 |
|
981 | | -Based on our seven-model evaluation, we propose the following guidance: |
| 1010 | +Based on our sixteen-model evaluation, we propose the following guidance: |
982 | 1011 |
|
983 | 1012 | \begin{table}[h] |
984 | 1013 | \centering |
@@ -1085,16 +1114,20 @@ \section{Conclusion} |
1085 | 1114 | multi-turn contextual reasoning---providing a standardized, reproducible |
1086 | 1115 | framework for comparing model suitability in video surveillance deployments. |
1087 | 1116 |
|
1088 | | -Evaluating seven model configurations on a single Apple~M5~Pro laptop |
1089 | | -reveals a fundamentally different landscape than the established |
1090 | | -consensus that cloud models are required for production AI accuracy. |
1091 | | -The \textbf{Qwen3.5-9B} achieves \textbf{93.8\%}---within 4.1 points |
1092 | | -of GPT-5.4 (97.9\%)---while running entirely locally with 13.8~GB of |
1093 | | -unified memory, zero API cost, and complete data privacy. The |
1094 | | -Qwen3.5-35B-MoE variant produces \textbf{lower first-token latency} |
1095 | | -(435~ms) than any cloud endpoint we tested (508~ms for GPT-5.4-nano), |
1096 | | -demonstrating that sparse MoE activation is a compelling architectural |
1097 | | -choice for latency-sensitive security alerting on consumer hardware. |
| 1117 | +Evaluating sixteen model configurations across five model families on a |
| 1118 | +single Apple~M5~Pro laptop reveals a fundamentally different landscape |
| 1119 | +than the established consensus that cloud models are required for |
| 1120 | +production AI accuracy. The \textbf{Qwen3.5-27B at Q8} achieves |
| 1121 | +\textbf{95.8\%}---within 2.1~points of GPT-5.4 (97.9\%)---while running |
| 1122 | +entirely locally with 30.2~GB of unified memory, zero API cost, and |
| 1123 | +complete data privacy. \textbf{Mistral Small~4} (119B) at Q2\_K\_XL |
| 1124 | +scores \textbf{89.6\%}, establishing that 119B-class thinking models |
| 1125 | +can serve as effective security assistants on consumer hardware when |
| 1126 | +reasoning tokens are suppressed. The Qwen3.5-35B-MoE variant produces |
| 1127 | +\textbf{lower first-token latency} (435~ms) than any cloud endpoint |
| 1128 | +tested (508~ms for GPT-5.4-nano), demonstrating that sparse MoE |
| 1129 | +activation is a compelling architectural choice for latency-sensitive |
| 1130 | +security alerting. |
1098 | 1131 |
|
1099 | 1132 | Security classification is universally robust (100\% across all models), |
1100 | 1133 | validating local inference for the most consequence-heavy task. |