
 ## Overview

-This framework implements **LLM-as-a-Judge** methodology for evaluating language model performance on security incident analysis tasks. Rather than relying on manual expert review or simple metrics, we use a powerful LLM (GPT-4o) acting as an experienced network security analyst to systematically assess and compare different models' outputs. The judge model evaluates each response against security-specific criteria, providing comparative rankings (1-4 positions) and quality scores (1-10 scale) with detailed justifications. This approach enables scalable, consistent, and expert-level evaluation of multiple models across dozens of real-world security incidents.
+This framework implements **LLM-as-a-Judge** methodology for evaluating language model performance on security incident analysis tasks. Rather than relying on manual expert review or simple metrics, we use a local LLM (`gpt-oss-120b`) acting as an experienced network security analyst to systematically assess and compare different models' outputs. The judge model evaluates each response against security-specific criteria, providing comparative rankings (1-4 positions) and quality scores (1-10 scale) with detailed justifications. This approach enables scalable, consistent, and expert-level evaluation across the full dataset of real-world security incidents.

 The framework supports two evaluation workflows using identical methodology:

@@ -12,10 +12,10 @@ The framework supports two evaluation workflows using identical methodology:
 Both workflows use comparative ranking where the judge ranks all models' outputs for each incident, avoiding the need for absolute score thresholds.

 **Key Features:**
-- 50-sample evaluations (stratified Normal/Malware distribution)
+- Full dataset evaluation (532 incidents, stratified Normal/Malware distribution)
+- Local judge model via OpenAI-compatible API (no cloud cost)
+- Incremental result saving (resumable if interrupted)
 - Interactive HTML dashboards with drill-down capabilities
-- Cost-effective (~$5 per 50-incident evaluation with GPT-4o judge)
-- Reproducible methodology with configurable parameters

 ---

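The "incremental result saving (resumable if interrupted)" feature listed above could be implemented along the following lines. This is only a sketch under assumptions: the field names `incident_id` and `evaluation`, the helper names, and the one-JSON-list file layout are hypothetical and are not taken from `evaluate_summaries.py` or `evaluate_risk.py`.

```python
import json
import os
import tempfile


def load_done(path):
    """Return previously saved results keyed by incident id (empty if no file yet)."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return {r["incident_id"]: r for r in json.load(f)}


def save_atomic(path, results):
    """Write to a temp file, then rename, so an interruption cannot truncate the output."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(list(results.values()), f, indent=2)
    os.replace(tmp, path)


def evaluate_all(incidents, judge_fn, out_path):
    """Judge every incident not already present in out_path, saving after each one."""
    done = load_done(out_path)
    for inc in incidents:
        key = inc["incident_id"]
        if key in done:
            continue  # already judged in an earlier (possibly interrupted) run
        done[key] = {"incident_id": key, "evaluation": judge_fn(inc)}
        save_atomic(out_path, done)
    return done
```

Saving after every incident costs one file rewrite per judge call, which is negligible next to the latency of a 120B-parameter judge, and it means a rerun with the same `--output` path would skip everything already scored.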
@@ -25,28 +25,60 @@ Both workflows use comparative ranking where the judge ranks all models' outputs

 ```bash
 pip install openai python-dotenv
-export OPENAI_API_KEY="sk-your-key-here"
 ```

-### Run Complete Evaluation
+No API key required for local endpoints.
+
+### Run Evaluation

 ```bash
+cd alert_summary/
+
 # Summarization workflow
-./run_evaluation_summary.sh
+python3 evaluate_summaries.py \
+  --input datasets/summarization_dataset_v3.json \
+  --output datasets/summarization_dataset_v3_results_oss.json \
+  --judge gpt-oss-120b \
+  --base-url http://YOUR_LOCAL_ENDPOINT/v1

 # Risk analysis workflow
-./run_evaluation_risk.sh
-
-# View results
-cat results/summary_report.md # or results/risk_summary.md
-firefox results/summary_dashboard.html # or results/risk_dashboard.html
+python3 evaluate_risk.py \
+  --input datasets/risk_dataset.json \
+  --output datasets/risk_dataset_results_oss.json \
+  --judge gpt-oss-120b \
+  --base-url http://YOUR_LOCAL_ENDPOINT/v1
 ```

-### Test Before Full Run
+### Analyze Results

 ```bash
-# Single incident test (~$0.10)
-python3 test_evaluation.py
+# Summarization
+python3 analyze_results.py \
+  --results datasets/summarization_dataset_v3_results_oss.json \
+  --summary results/summary_report_oss.md \
+  --csv results/summary_data_oss.csv \
+  --judge gpt-oss-120b
+
+python3 generate_dashboard.py \
+  --results datasets/summarization_dataset_v3_results_oss.json \
+  --sample datasets/summarization_dataset_v3.json \
+  --output results/summary_dashboard_oss.html
+
+# Risk analysis
+python3 analyze_results.py \
+  --results datasets/risk_dataset_results_oss.json \
+  --summary results/risk_report_oss.md \
+  --csv results/risk_data_oss.csv \
+  --judge gpt-oss-120b
+
+python3 generate_dashboard.py \
+  --results datasets/risk_dataset_results_oss.json \
+  --sample datasets/risk_dataset.json \
+  --output results/risk_dashboard_oss.html
+
+# View results
+cat results/summary_report_oss.md
+firefox results/summary_dashboard_oss.html
 ```

 ---
@@ -59,8 +91,8 @@ python3 test_evaluation.py

 **Methodology:**
 - Judge provides comparative **ranking** (1st to 4th place) and **quality scores** (1-10 scale)
-- Models evaluated: GPT-4o, GPT-4o-mini, Qwen2.5 15B, Qwen2.5 3B
-- Dataset: 50 incidents (10 Normal + 40 Malware)
+- Models evaluated: GPT-4o, GPT-4o-mini, Qwen2.5 1.5B, Qwen2.5 3B
+- Dataset: 532 incidents (18 Normal + 514 Malware)

 **Evaluation Criteria:**
 1. Accuracy of threat identification
@@ -71,33 +103,39 @@ python3 test_evaluation.py

 **Manual Execution:**
 ```bash
-# Step 1: Sample incidents
-python3 datasets/create_evaluation_sample.py [--size 50] [--seed 42]
-
-# Step 2: Judge evaluation
-python3 evaluate_summaries.py [--judge gpt-4o] [--input FILE] [--output FILE]
-
-# Step 3: Analyze results
-python3 analyze_results.py [--results FILE] [--summary FILE] [--csv FILE]
-
-# Step 4: Generate dashboard
-python3 generate_dashboard.py [--results FILE] [--sample FILE] [--output FILE]
+# Step 1: Judge evaluation
+python3 evaluate_summaries.py \
+  --input datasets/summarization_dataset_v3.json \
+  --output datasets/summarization_dataset_v3_results_oss.json \
+  --judge gpt-oss-120b \
+  --base-url http://YOUR_LOCAL_ENDPOINT/v1
+
+# Step 2: Analyze results
+python3 analyze_results.py \
+  --results datasets/summarization_dataset_v3_results_oss.json \
+  --summary results/summary_report_oss.md \
+  --csv results/summary_data_oss.csv \
+  --judge gpt-oss-120b
+
+# Step 3: Generate dashboard
+python3 generate_dashboard.py \
+  --results datasets/summarization_dataset_v3_results_oss.json \
+  --output results/summary_dashboard_oss.html
 ```

 **Output Files:**
-- `datasets/summary_sample.json` - Sampled incidents
-- `results/summary_results.json` - Judge rankings
-- `results/summary_report.md` - Statistical report
-- `results/summary_data.csv` - Spreadsheet export
-- `results/summary_dashboard.html` - Interactive visualization
+- `datasets/summarization_dataset_v3_results_oss.json` - Judge rankings
+- `results/summary_report_oss.md` - Statistical report
+- `results/summary_data_oss.csv` - Spreadsheet export
+- `results/summary_dashboard_oss.html` - Interactive visualization

-**Example Results:**
+**Results (judge: gpt-oss-120b, 532 incidents):**
 ```
 Rank  Model          Avg Pos  Avg Score  Win Rate
-1     GPT-4o         1.8      8.5        45.0%
-2     Qwen2.5 15B    2.3      7.2        28.0%
-3     GPT-4o-mini    2.7      6.8        18.0%
-4     Qwen2.5 3B     3.2      5.9        9.0%
+1     GPT-4o-mini    1.66     6.35/10    46.1%
+2     GPT-4o         2.02     5.65/10    36.7%
+3     Qwen2.5 3B     2.71     4.38/10    15.6%
+4     Qwen2.5 1.5B   3.61     2.81/10    1.7%
 ```

 - **Win Rate**: % of times ranked #1
@@ -112,8 +150,8 @@ Rank Model Avg Pos Avg Score Win Rate

 **Methodology:**
 - Judge provides comparative **ranking** (1st to 4th place) and **quality scores** (1-10 scale)
-- Models evaluated: GPT-4o, GPT-4o-mini, Qwen2.5, Qwen2.5 3B
-- Dataset: 50 incidents (18 Normal + 32 Malware)
+- Models evaluated: GPT-4o, GPT-4o-mini, Qwen2.5 1.5B, Qwen2.5 3B
+- Dataset: 532 incidents (18 Normal + 514 Malware)

 **Evaluation Criteria:**
 1. **Cause Identification Accuracy** - Correctly categorizes as Malicious/Legitimate/Misconfiguration with specific techniques
@@ -124,39 +162,104 @@ Rank Model Avg Pos Avg Score Win Rate

 **Manual Execution:**
 ```bash
-# Step 1: Sample incidents
-python3 datasets/create_risk_sample.py [--size 50] [--seed 42]
-
-# Step 2: Judge evaluation
-python3 evaluate_risk.py [--input FILE] [--judge gpt-4o] [--output FILE]
+# Step 1: Judge evaluation
+python3 evaluate_risk.py \
+  --input datasets/risk_dataset.json \
+  --output datasets/risk_dataset_results_oss.json \
+  --judge gpt-oss-120b \
+  --base-url http://YOUR_LOCAL_ENDPOINT/v1
+
+# Step 2: Analyze results
+python3 analyze_results.py \
+  --results datasets/risk_dataset_results_oss.json \
+  --summary results/risk_report_oss.md \
+  --csv results/risk_data_oss.csv \
+  --judge gpt-oss-120b
+
+# Step 3: Generate dashboard
+python3 generate_dashboard.py \
+  --results datasets/risk_dataset_results_oss.json \
+  --output results/risk_dashboard_oss.html
+```

-# Step 3: Analyze results
-python3 analyze_results.py --results results/risk_results.json --summary results/risk_summary.md
+**Output Files:**
+- `datasets/risk_dataset_results_oss.json` - Judge rankings and scores
+- `results/risk_report_oss.md` - Statistical report
+- `results/risk_data_oss.csv` - Spreadsheet export
+- `results/risk_dashboard_oss.html` - Interactive visualization

-# Step 4: Generate dashboard
-python3 generate_dashboard.py --results results/risk_results.json --output results/risk_dashboard.html
+**Results (judge: gpt-oss-120b, 532 incidents):**
+```
+Rank  Model          Avg Pos  Avg Score  Win Rate
+1     GPT-4o         1.46     7.98/10    65.6%
+2     GPT-4o-mini    1.92     7.34/10    26.3%
+3     Qwen2.5 3B     3.08     5.33/10    4.9%
+4     Qwen2.5        3.54     4.29/10    3.2%
 ```

-**Output Files:**
-- `datasets/risk_sample.json` - Sampled incidents
-- `results/risk_results.json` - Judge rankings and scores
-- `results/risk_summary.md` - Statistical report
-- `results/risk_data.csv` - Spreadsheet export
-- `results/risk_dashboard.html` - Interactive visualization
+**By Incident Category:**
+```
+Malware (514 incidents):
+  1. GPT-4o        avg pos 1.45  score 8.08  wins 342
+  2. GPT-4o-mini   avg pos 1.91  score 7.43  wins 134
+  3. Qwen2.5 3B    avg pos 3.08  score 5.39  wins 25
+  4. Qwen2.5       avg pos 3.56  score 4.30  wins 13
+
+Normal (18 incidents):
+  1. GPT-4o        avg pos 1.83  score 5.17  wins 7
+  2. GPT-4o-mini   avg pos 2.00  score 4.89  wins 6
+  3. Qwen2.5       avg pos 2.89  score 4.00  wins 4
+  4. Qwen2.5 3B    avg pos 3.28  score 3.56  wins 1
+```

-**Example Results:**
+**By Incident Complexity:**
 ```
-Rank  Model                  Avg Pos  Avg Score  Win Rate
-1     cause_risk_gpt4o       1.8      8.2        42.0%
-2     cause_risk_gpt4o_mini  2.4      7.5        26.0%
-3     cause_risk_qwen2_5     2.9      6.8        20.0%
-4     cause_risk_qwen2_5_3b  3.1      6.1        12.0%
+Simple (<500 events, 324 incidents):
+  1. GPT-4o        avg pos 1.57  score 7.76  wins 193
+  2. GPT-4o-mini   avg pos 1.91  score 7.29  wins 98
+
+Medium (500-2000 events, 62 incidents):
+  1. GPT-4o        avg pos 1.23  score 8.31  wins 49
+  2. GPT-4o-mini   avg pos 1.94  score 7.39  wins 11
+
+Complex (>2000 events, 146 incidents):
+  1. GPT-4o        avg pos 1.32  score 8.32  wins 107
+  2. GPT-4o-mini   avg pos 1.91  score 7.44  wins 31
 ```

 - **Win Rate**: % of times ranked #1
 - **Avg Position**: Lower is better (1-4 scale)
 - **Avg Score**: Higher is better (1-10 scale)

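The three metrics defined above (win rate, average position, average score) can be derived from per-incident judge output roughly as follows. This is a minimal sketch, not the logic of `analyze_results.py`: the input shape (one dict per incident mapping model name to `rank` and `score`) and the function name `aggregate` are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate(evaluations):
    """Summarize per-incident judge output.

    evaluations: list of dicts, one per incident, mapping
    model name -> {"rank": int (1 = best), "score": float (1-10)}.
    Returns rows sorted by average position (lower is better).
    """
    totals = defaultdict(lambda: {"rank": 0, "score": 0.0, "wins": 0, "n": 0})
    for ev in evaluations:
        for model, result in ev.items():
            t = totals[model]
            t["rank"] += result["rank"]
            t["score"] += result["score"]
            t["wins"] += 1 if result["rank"] == 1 else 0
            t["n"] += 1

    rows = [
        {
            "model": model,
            "avg_pos": t["rank"] / t["n"],
            "avg_score": t["score"] / t["n"],
            "win_rate": 100.0 * t["wins"] / t["n"],  # % of incidents ranked #1
        }
        for model, t in totals.items()
    ]
    rows.sort(key=lambda r: r["avg_pos"])
    return rows
```

Because the ranking is comparative, win rates across the four models should sum to roughly 100% when the judge names exactly one winner per incident, which is a cheap sanity check on a results file.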
+**Key observations:**
+- GPT-4o dominates risk analysis across all complexity levels, unlike summarization, where GPT-4o-mini wins
+- GPT-4o's win rate is higher on larger incidents: 59.6% (simple), 79.0% (medium), 73.3% (complex)
+- Qwen2.5 7B underperforms its own 3B variant on risk analysis (avg pos 3.54 vs 3.08)
+- On Normal incidents, the gap between GPT-4o and GPT-4o-mini narrows (avg pos 1.83 vs 2.00; 7 vs 6 wins out of 18)
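The per-complexity win rates can be recomputed from the win counts in the "By Incident Complexity" table. The short script below does only that arithmetic; the numbers are copied from the table, while the variable and function names are illustrative.

```python
# (incidents, GPT-4o wins, GPT-4o-mini wins), copied from the complexity table.
BUCKETS = {
    "Simple": (324, 193, 98),
    "Medium": (62, 49, 11),
    "Complex": (146, 107, 31),
}


def win_rate(wins, incidents):
    """Percentage of incidents in which the model was ranked #1."""
    return round(100.0 * wins / incidents, 1)


for name, (incidents, gpt4o_wins, mini_wins) in BUCKETS.items():
    print(
        f"{name:8s} GPT-4o {win_rate(gpt4o_wins, incidents):5.1f}%  "
        f"GPT-4o-mini {win_rate(mini_wins, incidents):5.1f}%"
    )
```

This gives GPT-4o 59.6%, 79.0%, and 73.3% for simple, medium, and complex incidents respectively, so the medium bucket, which is also the smallest at 62 incidents, has the highest win rate.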
+
+---
+
+## CLI Reference
+
+### evaluate_summaries.py / evaluate_risk.py
+
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--input`, `-i` | `datasets/summarization_dataset_v3.json` / `datasets/risk_dataset.json` | Input dataset |
+| `--output`, `-o` | `results/summary_results.json` / `results/risk_results.json` | Output results |
+| `--judge`, `-j` | `gpt-4o` | Judge model name |
+| `--base-url` | OpenAI default | Base URL for OpenAI-compatible API |
+| `--api-key` | `OPENAI_API_KEY` env var | API key (optional for local endpoints) |
+
+### analyze_results.py
+
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--results`, `-r` | `results/summary_results.json` | Input results file |
+| `--summary`, `-s` | `results/summary_report.md` | Output Markdown report |
+| `--csv`, `-c` | `results/summary_data.csv` | Output CSV file |
+| `--judge`, `-j` | `GPT-4o` | Judge name shown in report |
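For orientation, the evaluation flags in the first table could be wired up with `argparse` as sketched below. The flag names and defaults are taken from the table; everything else is an assumption, including the helper names `build_parser` and `client_kwargs` and the placeholder key used when none is supplied. The actual scripts may be structured differently.

```python
import argparse
import os


def build_parser():
    """Parser mirroring the documented flags of evaluate_summaries.py (sketch only)."""
    p = argparse.ArgumentParser(description="LLM-as-a-Judge evaluation (sketch)")
    p.add_argument("--input", "-i", default="datasets/summarization_dataset_v3.json")
    p.add_argument("--output", "-o", default="results/summary_results.json")
    p.add_argument("--judge", "-j", default="gpt-4o")
    p.add_argument("--base-url", default=None)
    p.add_argument("--api-key", default=None)
    return p


def client_kwargs(args):
    """Keyword arguments that could be passed to openai.OpenAI(**kwargs).

    Assumption: a local endpoint ignores the key, so a placeholder is used
    when neither --api-key nor OPENAI_API_KEY is set.
    """
    kwargs = {"api_key": args.api_key or os.environ.get("OPENAI_API_KEY") or "not-needed"}
    if args.base_url:
        kwargs["base_url"] = args.base_url  # otherwise the client's default URL applies
    return kwargs


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.judge, client_kwargs(args).get("base_url", "OpenAI default"))
```

The placeholder key exists because the OpenAI Python client refuses to start without some key value, even when the server behind `--base-url` never checks it.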
+
 ---

-For detailed dataset generation instructions, see [README_SUMMARY_WORKFLOW.md](README_SUMMARY_WORKFLOW.md) and [README_RISK_WORKFLOW.md](README_RISK_WORKFLOW.md).
+For detailed dataset generation instructions, see [README_dataset_summary_workflow.md](README_dataset_summary_workflow.md) and [README_dataset_risk_workflow.md](README_dataset_risk_workflow.md).