
Commit 46cce53

Merge remote-tracking branch 'origin/develop' into develop
2 parents f7c0da7 + 1e1b3de commit 46cce53

1 file changed

Lines changed: 165 additions & 62 deletions

docs/immune/LLM_EVALUATION_GUIDE.md

Original file line number | Diff line number | Diff line change
@@ -2,7 +2,7 @@
22

33
## Overview
44

5-
This framework implements **LLM-as-a-Judge** methodology for evaluating language model performance on security incident analysis tasks. Rather than relying on manual expert review or simple metrics, we use a powerful LLM (GPT-4o) acting as an experienced network security analyst to systematically assess and compare different models' outputs. The judge model evaluates each response against security-specific criteria, providing comparative rankings (1-4 positions) and quality scores (1-10 scale) with detailed justifications. This approach enables scalable, consistent, and expert-level evaluation of multiple models across dozens of real-world security incidents.
5+
This framework implements an **LLM-as-a-Judge** methodology for evaluating language model performance on security incident analysis tasks. Rather than relying on manual expert review or simple metrics, we use a local LLM (`gpt-oss-120b`), prompted to act as an experienced network security analyst, to systematically assess and compare different models' outputs. The judge model evaluates each response against security-specific criteria and returns a comparative ranking (positions 1-4) and a quality score (1-10 scale) for each model, with detailed justifications. This approach enables scalable, consistent, and expert-level evaluation across the full dataset of real-world security incidents.
66

77
The framework supports two evaluation workflows using identical methodology:
88

@@ -12,10 +12,10 @@ The framework supports two evaluation workflows using identical methodology:
1212
Both workflows use comparative ranking where the judge ranks all models' outputs for each incident, avoiding the need for absolute score thresholds.
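A comparative ranking is only usable if each verdict is internally consistent. The following is a minimal sketch of such a check, not the framework's own code. The model labels, the `ranking`/`scores` dictionary shapes, and the no-ties rule are all assumptions made for illustration.

```python
from typing import Dict

# Hypothetical model labels; the real result files may use different keys.
MODELS = ["gpt4o", "gpt4o_mini", "qwen2_5", "qwen2_5_3b"]


def validate_verdict(ranking: Dict[str, int], scores: Dict[str, int]) -> None:
    """Raise ValueError unless the verdict ranks every model once and scores lie in 1..10.

    Assumes the judge may not award ties, so positions form a permutation of 1..N.
    """
    n = len(MODELS)
    if set(ranking) != set(MODELS) or set(scores) != set(MODELS):
        raise ValueError("verdict must cover every model exactly once")
    if sorted(ranking.values()) != list(range(1, n + 1)):
        raise ValueError("ranking must use each position 1..N exactly once")
    for model, score in scores.items():
        if not 1 <= score <= 10:
            raise ValueError(f"score for {model} out of range: {score}")
```

A check like this could run on each judge response before it is saved, so that a malformed verdict is retried rather than silently skewing the averages.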
1313

1414
**Key Features:**
15-
- 50-sample evaluations (stratified Normal/Malware distribution)
15+
- Full dataset evaluation (532 incidents, stratified Normal/Malware distribution)
16+
- Local judge model via OpenAI-compatible API (no cloud cost)
17+
- Incremental result saving (resumable if interrupted)
1618
- Interactive HTML dashboards with drill-down capabilities
17-
- Cost-effective (~$5 per 50-incident evaluation with GPT-4o judge)
18-
- Reproducible methodology with configurable parameters
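The "incremental result saving" feature listed above can be pictured roughly as follows. This is a sketch under stated assumptions, not the actual logic of `evaluate_summaries.py`: the function name `evaluate_resumable`, the `id` field on each incident, and the results-keyed-by-incident-id file layout are hypothetical.

```python
import json
import os
from typing import Callable, Dict, List


def evaluate_resumable(
    incidents: List[dict],
    judge: Callable[[dict], dict],
    output_path: str,
) -> Dict[str, dict]:
    """Judge each incident once, writing results to disk after every incident."""
    results: Dict[str, dict] = {}
    if os.path.exists(output_path):
        # Resume: load verdicts saved by an earlier, interrupted run.
        with open(output_path, encoding="utf-8") as fh:
            results = json.load(fh)

    for incident in incidents:
        incident_id = str(incident["id"])  # hypothetical field name
        if incident_id in results:
            continue  # already judged, skip the model call
        results[incident_id] = judge(incident)

        # Write to a temporary file, then rename, so an interruption
        # mid-write cannot leave a truncated results file behind.
        tmp_path = output_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump(results, fh, indent=2)
        os.replace(tmp_path, output_path)

    return results
```

Rewriting the whole file after each incident is simple and adequate for a few hundred incidents; an append-only JSON Lines file would be an alternative for much larger runs.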
1919

2020
---
2121

@@ -25,28 +25,60 @@ Both workflows use comparative ranking where the judge ranks all models' outputs
2525

2626
```bash
2727
pip install openai python-dotenv
28-
export OPENAI_API_KEY="sk-your-key-here"
2928
```
3029

31-
### Run Complete Evaluation
30+
No API key required for local endpoints.
31+
32+
### Run Evaluation
3233

3334
```bash
35+
cd alert_summary/
36+
3437
# Summarization workflow
35-
./run_evaluation_summary.sh
38+
python3 evaluate_summaries.py \
39+
--input datasets/summarization_dataset_v3.json \
40+
--output datasets/summarization_dataset_v3_results_oss.json \
41+
--judge gpt-oss-120b \
42+
--base-url http://YOUR_LOCAL_ENDPOINT/v1
3643

3744
# Risk analysis workflow
38-
./run_evaluation_risk.sh
39-
40-
# View results
41-
cat results/summary_report.md # or results/risk_summary.md
42-
firefox results/summary_dashboard.html # or results/risk_dashboard.html
45+
python3 evaluate_risk.py \
46+
--input datasets/risk_dataset.json \
47+
--output datasets/risk_dataset_results_oss.json \
48+
--judge gpt-oss-120b \
49+
--base-url http://YOUR_LOCAL_ENDPOINT/v1
4350
```
4451

45-
### Test Before Full Run
52+
### Analyze Results
4653

4754
```bash
48-
# Single incident test (~$0.10)
49-
python3 test_evaluation.py
55+
# Summarization
56+
python3 analyze_results.py \
57+
--results datasets/summarization_dataset_v3_results_oss.json \
58+
--summary results/summary_report_oss.md \
59+
--csv results/summary_data_oss.csv \
60+
--judge gpt-oss-120b
61+
62+
python3 generate_dashboard.py \
63+
--results datasets/summarization_dataset_v3_results_oss.json \
64+
--sample datasets/summarization_dataset_v3.json \
65+
--output results/summary_dashboard_oss.html
66+
67+
# Risk analysis
68+
python3 analyze_results.py \
69+
--results datasets/risk_dataset_results_oss.json \
70+
--summary results/risk_report_oss.md \
71+
--csv results/risk_data_oss.csv \
72+
--judge gpt-oss-120b
73+
74+
python3 generate_dashboard.py \
75+
--results datasets/risk_dataset_results_oss.json \
76+
--sample datasets/risk_dataset.json \
77+
--output results/risk_dashboard_oss.html
78+
79+
# View results
80+
cat results/summary_report_oss.md
81+
firefox results/summary_dashboard_oss.html
5082
```
5183

5284
---
@@ -59,8 +91,8 @@ python3 test_evaluation.py
5991

6092
**Methodology:**
6193
- Judge provides comparative **ranking** (1st to 4th place) and **quality scores** (1-10 scale)
62-
- Models evaluated: GPT-4o, GPT-4o-mini, Qwen2.5 15B, Qwen2.5 3B
63-
- Dataset: 50 incidents (10 Normal + 40 Malware)
94+
- Models evaluated: GPT-4o, GPT-4o-mini, Qwen2.5 1.5B, Qwen2.5 3B
95+
- Dataset: 532 incidents (18 Normal + 514 Malware)
6496

6597
**Evaluation Criteria:**
6698
1. Accuracy of threat identification
@@ -71,33 +103,39 @@ python3 test_evaluation.py
71103

72104
**Manual Execution:**
73105
```bash
74-
# Step 1: Sample incidents
75-
python3 datasets/create_evaluation_sample.py [--size 50] [--seed 42]
76-
77-
# Step 2: Judge evaluation
78-
python3 evaluate_summaries.py [--judge gpt-4o] [--input FILE] [--output FILE]
79-
80-
# Step 3: Analyze results
81-
python3 analyze_results.py [--results FILE] [--summary FILE] [--csv FILE]
82-
83-
# Step 4: Generate dashboard
84-
python3 generate_dashboard.py [--results FILE] [--sample FILE] [--output FILE]
106+
# Step 1: Judge evaluation
107+
python3 evaluate_summaries.py \
108+
--input datasets/summarization_dataset_v3.json \
109+
--output datasets/summarization_dataset_v3_results_oss.json \
110+
--judge gpt-oss-120b \
111+
--base-url http://YOUR_LOCAL_ENDPOINT/v1
112+
113+
# Step 2: Analyze results
114+
python3 analyze_results.py \
115+
--results datasets/summarization_dataset_v3_results_oss.json \
116+
--summary results/summary_report_oss.md \
117+
--csv results/summary_data_oss.csv \
118+
--judge gpt-oss-120b
119+
120+
# Step 3: Generate dashboard
121+
python3 generate_dashboard.py \
122+
--results datasets/summarization_dataset_v3_results_oss.json \
123+
--output results/summary_dashboard_oss.html
85124
```
86125

87126
**Output Files:**
88-
- `datasets/summary_sample.json` - Sampled incidents
89-
- `results/summary_results.json` - Judge rankings
90-
- `results/summary_report.md` - Statistical report
91-
- `results/summary_data.csv` - Spreadsheet export
92-
- `results/summary_dashboard.html` - Interactive visualization
127+
- `datasets/summarization_dataset_v3_results_oss.json` - Judge rankings
128+
- `results/summary_report_oss.md` - Statistical report
129+
- `results/summary_data_oss.csv` - Spreadsheet export
130+
- `results/summary_dashboard_oss.html` - Interactive visualization
93131

94-
**Example Results:**
132+
**Results (judge: gpt-oss-120b, 532 incidents):**
95133
```
96134
Rank Model Avg Pos Avg Score Win Rate
97-
1 GPT-4o 1.8 8.5 45.0%
98-
2 Qwen2.5 15B 2.3 7.2 28.0%
99-
3 GPT-4o-mini 2.7 6.8 18.0%
100-
4 Qwen2.5 3B 3.2 5.9 9.0%
135+
1 GPT-4o-mini 1.66 6.35/10 46.1%
136+
2 GPT-4o 2.02 5.65/10 36.7%
137+
3 Qwen2.5 3B 2.71 4.38/10 15.6%
138+
4 Qwen2.5 1.5B 3.61 2.81/10 1.7%
101139
```
102140

103141
- **Win Rate**: % of times ranked #1
@@ -112,8 +150,8 @@ Rank Model Avg Pos Avg Score Win Rate
112150

113151
**Methodology:**
114152
- Judge provides comparative **ranking** (1st to 4th place) and **quality scores** (1-10 scale)
115-
- Models evaluated: GPT-4o, GPT-4o-mini, Qwen2.5, Qwen2.5 3B
116-
- Dataset: 50 incidents (18 Normal + 32 Malware)
153+
- Models evaluated: GPT-4o, GPT-4o-mini, Qwen2.5 1.5B, Qwen2.5 3B
154+
- Dataset: 532 incidents (18 Normal + 514 Malware)
117155

118156
**Evaluation Criteria:**
119157
1. **Cause Identification Accuracy** - Correctly categorizes as Malicious/Legitimate/Misconfiguration with specific techniques
@@ -124,39 +162,104 @@ Rank Model Avg Pos Avg Score Win Rate
124162

125163
**Manual Execution:**
126164
```bash
127-
# Step 1: Sample incidents
128-
python3 datasets/create_risk_sample.py [--size 50] [--seed 42]
129-
130-
# Step 2: Judge evaluation
131-
python3 evaluate_risk.py [--input FILE] [--judge gpt-4o] [--output FILE]
165+
# Step 1: Judge evaluation
166+
python3 evaluate_risk.py \
167+
--input datasets/risk_dataset.json \
168+
--output datasets/risk_dataset_results_oss.json \
169+
--judge gpt-oss-120b \
170+
--base-url http://YOUR_LOCAL_ENDPOINT/v1
171+
172+
# Step 2: Analyze results
173+
python3 analyze_results.py \
174+
--results datasets/risk_dataset_results_oss.json \
175+
--summary results/risk_report_oss.md \
176+
--csv results/risk_data_oss.csv \
177+
--judge gpt-oss-120b
178+
179+
# Step 3: Generate dashboard
180+
python3 generate_dashboard.py \
181+
--results datasets/risk_dataset_results_oss.json \
182+
--output results/risk_dashboard_oss.html
183+
```
132184

133-
# Step 3: Analyze results
134-
python3 analyze_results.py --results results/risk_results.json --summary results/risk_summary.md
185+
**Output Files:**
186+
- `datasets/risk_dataset_results_oss.json` - Judge rankings and scores
187+
- `results/risk_report_oss.md` - Statistical report
188+
- `results/risk_data_oss.csv` - Spreadsheet export
189+
- `results/risk_dashboard_oss.html` - Interactive visualization
135190

136-
# Step 4: Generate dashboard
137-
python3 generate_dashboard.py --results results/risk_results.json --output results/risk_dashboard.html
191+
**Results (judge: gpt-oss-120b, 532 incidents):**
192+
```
193+
Rank Model Avg Pos Avg Score Win Rate
194+
1 GPT-4o 1.46 7.98/10 65.6%
195+
2 GPT-4o-mini 1.92 7.34/10 26.3%
196+
3 Qwen2.5 3B 3.08 5.33/10 4.9%
197+
4 Qwen2.5 3.54 4.29/10 3.2%
138198
```
139199

140-
**Output Files:**
141-
- `datasets/risk_sample.json` - Sampled incidents
142-
- `results/risk_results.json` - Judge rankings and scores
143-
- `results/risk_summary.md` - Statistical report
144-
- `results/risk_data.csv` - Spreadsheet export
145-
- `results/risk_dashboard.html` - Interactive visualization
200+
**By Incident Category:**
201+
```
202+
Malware (514 incidents):
203+
1. GPT-4o avg pos 1.45 score 8.08 wins 342
204+
2. GPT-4o-mini avg pos 1.91 score 7.43 wins 134
205+
3. Qwen2.5 3B avg pos 3.08 score 5.39 wins 25
206+
4. Qwen2.5 avg pos 3.56 score 4.30 wins 13
207+
208+
Normal (18 incidents):
209+
1. GPT-4o avg pos 1.83 score 5.17 wins 7
210+
2. GPT-4o-mini avg pos 2.00 score 4.89 wins 6
211+
3. Qwen2.5 avg pos 2.89 score 4.00 wins 4
212+
4. Qwen2.5 3B avg pos 3.28 score 3.56 wins 1
213+
```
146214

147-
**Example Results:**
215+
**By Incident Complexity:**
148216
```
149-
Rank Model Avg Pos Avg Score Win Rate
150-
1 cause_risk_gpt4o 1.8 8.2 42.0%
151-
2 cause_risk_gpt4o_mini 2.4 7.5 26.0%
152-
3 cause_risk_qwen2_5 2.9 6.8 20.0%
153-
4 cause_risk_qwen2_5_3b 3.1 6.1 12.0%
217+
Simple (<500 events, 324 incidents):
218+
1. GPT-4o avg pos 1.57 score 7.76 wins 193
219+
2. GPT-4o-mini avg pos 1.91 score 7.29 wins 98
220+
221+
Medium (500-2000 events, 62 incidents):
222+
1. GPT-4o avg pos 1.23 score 8.31 wins 49
223+
2. GPT-4o-mini avg pos 1.94 score 7.39 wins 11
224+
225+
Complex (>2000 events, 146 incidents):
226+
1. GPT-4o avg pos 1.32 score 8.32 wins 107
227+
2. GPT-4o-mini avg pos 1.91 score 7.44 wins 31
154228
```
155229

156230
- **Win Rate**: % of times ranked #1
157231
- **Avg Position**: Lower is better (1-4 scale)
158232
- **Avg Score**: Higher is better (1-10 scale)
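The three metrics above can be derived from per-incident verdicts along these lines. This is an illustrative sketch only: the `ranking` and `scores` keys are an assumed result layout, and `aggregate` is a hypothetical helper rather than a function from `analyze_results.py`.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate(verdicts: List[dict]) -> List[Dict[str, object]]:
    """Compute average position, average score, and win rate per model.

    Each verdict is assumed to look like:
    {"ranking": {model: position}, "scores": {model: score}}
    """
    positions = defaultdict(list)
    scores = defaultdict(list)
    for verdict in verdicts:
        for model, position in verdict["ranking"].items():
            positions[model].append(position)
        for model, score in verdict["scores"].items():
            scores[model].append(score)

    table = []
    for model, pos_list in positions.items():
        table.append({
            "model": model,
            "avg_pos": sum(pos_list) / len(pos_list),
            "avg_score": sum(scores[model]) / len(scores[model]),
            # Win rate: share of incidents where the model was ranked #1.
            "win_rate": 100.0 * pos_list.count(1) / len(pos_list),
        })
    # Lower average position is better, so sort ascending.
    table.sort(key=lambda row: row["avg_pos"])
    return table
```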
159233

234+
**Key observations:**
235+
- GPT-4o dominates risk analysis across all complexity levels, unlike summarization, where GPT-4o-mini ranks first
236+
- GPT-4o's advantage is larger on bigger incidents: its win rate is 59.6% on simple incidents, 79.0% on medium, and 73.3% on complex
237+
- Qwen2.5 7B underperforms its own 3B variant on risk analysis (avg pos 3.54 vs 3.08)
238+
- On Normal incidents, the gap between GPT-4o and GPT-4o-mini narrows (avg pos 1.83 vs 2.00; 7 vs 6 wins out of 18)
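The win rates quoted for GPT-4o can be recomputed from the win counts in the complexity table. The sketch below does that arithmetic; the bucket thresholds come from the table headings, while the function and variable names are illustrative.

```python
# GPT-4o (wins, incidents) per complexity bucket, copied from the table above.
GPT4O_WINS = {
    "simple": (193, 324),
    "medium": (49, 62),
    "complex": (107, 146),
}


def complexity_bucket(event_count: int) -> str:
    """Map an incident's event count to its bucket (<500, 500-2000, >2000)."""
    if event_count < 500:
        return "simple"
    if event_count <= 2000:
        return "medium"
    return "complex"


def win_rate(wins: int, total: int) -> float:
    """Win rate as a percentage, rounded to one decimal place."""
    return round(100.0 * wins / total, 1)


rates = {bucket: win_rate(w, n) for bucket, (w, n) in GPT4O_WINS.items()}
total_wins = sum(w for w, _ in GPT4O_WINS.values())
total_incidents = sum(n for _, n in GPT4O_WINS.values())
overall = win_rate(total_wins, total_incidents)
print(rates, overall)
```

The bucket sizes sum to 532 incidents and the wins to 349, which reproduces the 65.6% overall win rate in the results table.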
239+
240+
---
241+
242+
## CLI Reference
243+
244+
### evaluate_summaries.py / evaluate_risk.py
245+
246+
| Argument | Default | Description |
247+
|----------|---------|-------------|
248+
| `--input`, `-i` | `datasets/summarization_dataset_v3.json` / `datasets/risk_dataset.json` | Input dataset |
249+
| `--output`, `-o` | `results/summary_results.json` / `results/risk_results.json` | Output results |
250+
| `--judge`, `-j` | `gpt-4o` | Judge model name |
251+
| `--base-url` | OpenAI default | Base URL for OpenAI-compatible API |
252+
| `--api-key` | `OPENAI_API_KEY` env var | API key (optional for local endpoints) |
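For orientation, the argument table above corresponds to an `argparse` setup roughly like the one below. This is a reconstruction from the table, not the scripts' actual source; it uses the summarization defaults, and `build_parser` is a hypothetical name.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (summarization defaults)."""
    parser = argparse.ArgumentParser(description="Judge evaluation (illustrative sketch)")
    parser.add_argument("--input", "-i",
                        default="datasets/summarization_dataset_v3.json",
                        help="Input dataset")
    parser.add_argument("--output", "-o",
                        default="results/summary_results.json",
                        help="Output results")
    parser.add_argument("--judge", "-j", default="gpt-4o",
                        help="Judge model name")
    # None means "use the client's default endpoint".
    parser.add_argument("--base-url", default=None,
                        help="Base URL for an OpenAI-compatible API")
    # Falls back to the environment; may stay unset for local endpoints.
    parser.add_argument("--api-key", default=os.environ.get("OPENAI_API_KEY"),
                        help="API key (optional for local endpoints)")
    return parser
```

The parsed `base_url` and `api_key` values would typically be handed to an OpenAI-compatible client; the `openai` Python package's `OpenAI` constructor accepts both as keyword arguments.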
253+
254+
### analyze_results.py
255+
256+
| Argument | Default | Description |
257+
|----------|---------|-------------|
258+
| `--results`, `-r` | `results/summary_results.json` | Input results file |
259+
| `--summary`, `-s` | `results/summary_report.md` | Output Markdown report |
260+
| `--csv`, `-c` | `results/summary_data.csv` | Output CSV file |
261+
| `--judge`, `-j` | `GPT-4o` | Judge name shown in report |
262+
160263
---
161264

162-
For detailed dataset generation instructions, see [README_SUMMARY_WORKFLOW.md](README_SUMMARY_WORKFLOW.md) and [README_RISK_WORKFLOW.md](README_RISK_WORKFLOW.md).
265+
For detailed dataset generation instructions, see [README_dataset_summary_workflow.md](README_dataset_summary_workflow.md) and [README_dataset_risk_workflow.md](README_dataset_risk_workflow.md).
