add new information about LLM evaluation

harpomaxx · harpomaxx · commit 1e1b3def73a4 · 2026-04-01T11:28:56.000Z
diff --git a/docs/immune/LLM_EVALUATION_GUIDE.md b/docs/immune/LLM_EVALUATION_GUIDE.md
@@ -2,7 +2,7 @@
 
 ## Overview
 
-This framework implements **LLM-as-a-Judge** methodology for evaluating language model performance on security incident analysis tasks. Rather than relying on manual expert review or simple metrics, we use a powerful LLM (GPT-4o) acting as an experienced network security analyst to systematically assess and compare different models' outputs. The judge model evaluates each response against security-specific criteria, providing comparative rankings (1-4 positions) and quality scores (1-10 scale) with detailed justifications. This approach enables scalable, consistent, and expert-level evaluation of multiple models across dozens of real-world security incidents.
+This framework implements **LLM-as-a-Judge** methodology for evaluating language model performance on security incident analysis tasks. Rather than relying on manual expert review or simple metrics, we use a local LLM (`gpt-oss-120b`) acting as an experienced network security analyst to systematically assess and compare different models' outputs. The judge model evaluates each response against security-specific criteria, providing comparative rankings (1-4 positions) and quality scores (1-10 scale) with detailed justifications. This approach enables scalable, consistent, and expert-level evaluation across the full dataset of real-world security incidents.
 
 The framework supports two evaluation workflows using identical methodology:
 
@@ -12,10 +12,10 @@ The framework supports two evaluation workflows using identical methodology:
 Both workflows use comparative ranking where the judge ranks all models' outputs for each incident, avoiding the need for absolute score thresholds.
 
 **Key Features:**
-- 50-sample evaluations (stratified Normal/Malware distribution)
+- Full dataset evaluation (532 incidents, stratified Normal/Malware distribution)
+- Local judge model via OpenAI-compatible API (no cloud cost)
+- Incremental result saving (resumable if interrupted)
 - Interactive HTML dashboards with drill-down capabilities
-- Cost-effective (~$5 per 50-incident evaluation with GPT-4o judge)
-- Reproducible methodology with configurable parameters
 
 ---
 
@@ -25,28 +25,60 @@ Both workflows use comparative ranking where the judge ranks all models' outputs
 
 ```bash
 pip install openai python-dotenv
-export OPENAI_API_KEY="sk-your-key-here"
 ```
 
-### Run Complete Evaluation
+No API key required for local endpoints.
+
+### Run Evaluation
 
 ```bash
+cd alert_summary/
+
 # Summarization workflow
-./run_evaluation_summary.sh
+python3 evaluate_summaries.py \
+  --input datasets/summarization_dataset_v3.json \
+  --output datasets/summarization_dataset_v3_results_oss.json \
+  --judge gpt-oss-120b \
+  --base-url http://YOUR_LOCAL_ENDPOINT/v1
 
 # Risk analysis workflow
-./run_evaluation_risk.sh
-
-# View results
-cat results/summary_report.md  # or results/risk_summary.md
-firefox results/summary_dashboard.html  # or results/risk_dashboard.html
+python3 evaluate_risk.py \
+  --input datasets/risk_dataset.json \
+  --output datasets/risk_dataset_results_oss.json \
+  --judge gpt-oss-120b \
+  --base-url http://YOUR_LOCAL_ENDPOINT/v1
 ```
 
-### Test Before Full Run
+### Analyze Results
 
 ```bash
-# Single incident test (~$0.10)
-python3 test_evaluation.py
+# Summarization
+python3 analyze_results.py \
+  --results datasets/summarization_dataset_v3_results_oss.json \
+  --summary results/summary_report_oss.md \
+  --csv results/summary_data_oss.csv \
+  --judge gpt-oss-120b
+
+python3 generate_dashboard.py \
+  --results datasets/summarization_dataset_v3_results_oss.json \
+  --sample datasets/summarization_dataset_v3.json \
+  --output results/summary_dashboard_oss.html
+
+# Risk analysis
+python3 analyze_results.py \
+  --results datasets/risk_dataset_results_oss.json \
+  --summary results/risk_report_oss.md \
+  --csv results/risk_data_oss.csv \
+  --judge gpt-oss-120b
+
+python3 generate_dashboard.py \
+  --results datasets/risk_dataset_results_oss.json \
+  --sample datasets/risk_dataset.json \
+  --output results/risk_dashboard_oss.html
+
+# View results
+cat results/summary_report_oss.md
+firefox results/summary_dashboard_oss.html
 ```
 
 ---
@@ -59,8 +91,8 @@ python3 test_evaluation.py
 
 **Methodology:**
 - Judge provides comparative **ranking** (1st to 4th place) and **quality scores** (1-10 scale)
-- Models evaluated: GPT-4o, GPT-4o-mini, Qwen2.5 15B, Qwen2.5 3B
-- Dataset: 50 incidents (10 Normal + 40 Malware)
+- Models evaluated: GPT-4o, GPT-4o-mini, Qwen2.5 1.5B, Qwen2.5 3B
+- Dataset: 532 incidents (18 Normal + 514 Malware)
 
 **Evaluation Criteria:**
 1. Accuracy of threat identification
@@ -71,33 +103,39 @@ python3 test_evaluation.py
 
 **Manual Execution:**
 ```bash
-# Step 1: Sample incidents
-python3 datasets/create_evaluation_sample.py [--size 50] [--seed 42]
-
-# Step 2: Judge evaluation
-python3 evaluate_summaries.py [--judge gpt-4o] [--input FILE] [--output FILE]
-
-# Step 3: Analyze results
-python3 analyze_results.py [--results FILE] [--summary FILE] [--csv FILE]
-
-# Step 4: Generate dashboard
-python3 generate_dashboard.py [--results FILE] [--sample FILE] [--output FILE]
+# Step 1: Judge evaluation
+python3 evaluate_summaries.py \
+  --input datasets/summarization_dataset_v3.json \
+  --output datasets/summarization_dataset_v3_results_oss.json \
+  --judge gpt-oss-120b \
+  --base-url http://YOUR_LOCAL_ENDPOINT/v1
+
+# Step 2: Analyze results
+python3 analyze_results.py \
+  --results datasets/summarization_dataset_v3_results_oss.json \
+  --summary results/summary_report_oss.md \
+  --csv results/summary_data_oss.csv \
+  --judge gpt-oss-120b
+
+# Step 3: Generate dashboard
+python3 generate_dashboard.py \
+  --results datasets/summarization_dataset_v3_results_oss.json \
+  --output results/summary_dashboard_oss.html
 ```
 
 **Output Files:**
-- `datasets/summary_sample.json` - Sampled incidents
-- `results/summary_results.json` - Judge rankings
-- `results/summary_report.md` - Statistical report
-- `results/summary_data.csv` - Spreadsheet export
-- `results/summary_dashboard.html` - Interactive visualization
+- `datasets/summarization_dataset_v3_results_oss.json` - Judge rankings
+- `results/summary_report_oss.md` - Statistical report
+- `results/summary_data_oss.csv` - Spreadsheet export
+- `results/summary_dashboard_oss.html` - Interactive visualization
 
-**Example Results:**
+**Results (judge: gpt-oss-120b, 532 incidents):**
 ```
 Rank   Model                Avg Pos    Avg Score    Win Rate
-1      GPT-4o               1.8        8.5          45.0%
-2      Qwen2.5 15B          2.3        7.2          28.0%
-3      GPT-4o-mini          2.7        6.8          18.0%
-4      Qwen2.5 3B           3.2        5.9          9.0%
+1      GPT-4o-mini          1.66       6.35/10      46.1%
+2      GPT-4o               2.02       5.65/10      36.7%
+3      Qwen2.5 3B           2.71       4.38/10      15.6%
+4      Qwen2.5 1.5B         3.61       2.81/10       1.7%
 ```
 
 - **Win Rate**: % of times ranked #1
@@ -112,8 +150,8 @@ Rank   Model                Avg Pos    Avg Score    Win Rate
 
 **Methodology:**
 - Judge provides comparative **ranking** (1st to 4th place) and **quality scores** (1-10 scale)
-- Models evaluated: GPT-4o, GPT-4o-mini, Qwen2.5, Qwen2.5 3B
-- Dataset: 50 incidents (18 Normal + 32 Malware)
+- Models evaluated: GPT-4o, GPT-4o-mini, Qwen2.5 1.5B, Qwen2.5 3B
+- Dataset: 532 incidents (18 Normal + 514 Malware)
 
 **Evaluation Criteria:**
 1. **Cause Identification Accuracy** - Correctly categorizes as Malicious/Legitimate/Misconfiguration with specific techniques
@@ -124,39 +162,104 @@ Rank   Model                Avg Pos    Avg Score    Win Rate
 
 **Manual Execution:**
 ```bash
-# Step 1: Sample incidents
-python3 datasets/create_risk_sample.py [--size 50] [--seed 42]
-
-# Step 2: Judge evaluation
-python3 evaluate_risk.py [--input FILE] [--judge gpt-4o] [--output FILE]
+# Step 1: Judge evaluation
+python3 evaluate_risk.py \
+  --input datasets/risk_dataset.json \
+  --output datasets/risk_dataset_results_oss.json \
+  --judge gpt-oss-120b \
+  --base-url http://YOUR_LOCAL_ENDPOINT/v1
+
+# Step 2: Analyze results
+python3 analyze_results.py \
+  --results datasets/risk_dataset_results_oss.json \
+  --summary results/risk_report_oss.md \
+  --csv results/risk_data_oss.csv \
+  --judge gpt-oss-120b
+
+# Step 3: Generate dashboard
+python3 generate_dashboard.py \
+  --results datasets/risk_dataset_results_oss.json \
+  --output results/risk_dashboard_oss.html
+```
 
-# Step 3: Analyze results
-python3 analyze_results.py --results results/risk_results.json --summary results/risk_summary.md
+**Output Files:**
+- `datasets/risk_dataset_results_oss.json` - Judge rankings and scores
+- `results/risk_report_oss.md` - Statistical report
+- `results/risk_data_oss.csv` - Spreadsheet export
+- `results/risk_dashboard_oss.html` - Interactive visualization
 
-# Step 4: Generate dashboard
-python3 generate_dashboard.py --results results/risk_results.json --output results/risk_dashboard.html
+**Results (judge: gpt-oss-120b, 532 incidents):**
+```
+Rank   Model          Avg Pos    Avg Score    Win Rate
+1      GPT-4o          1.46       7.98/10      65.6%
+2      GPT-4o-mini     1.92       7.34/10      26.3%
+3      Qwen2.5 3B      3.08       5.33/10       4.9%
+4      Qwen2.5         3.54       4.29/10       3.2%
 ```
 
-**Output Files:**
-- `datasets/risk_sample.json` - Sampled incidents
-- `results/risk_results.json` - Judge rankings and scores
-- `results/risk_summary.md` - Statistical report
-- `results/risk_data.csv` - Spreadsheet export
-- `results/risk_dashboard.html` - Interactive visualization
+**By Incident Category:**
+```
+Malware (514 incidents):
+  1. GPT-4o        avg pos 1.45   score 8.08   wins 342
+  2. GPT-4o-mini   avg pos 1.91   score 7.43   wins 134
+  3. Qwen2.5 3B    avg pos 3.08   score 5.39   wins  25
+  4. Qwen2.5       avg pos 3.56   score 4.30   wins  13
+
+Normal (18 incidents):
+  1. GPT-4o        avg pos 1.83   score 5.17   wins   7
+  2. GPT-4o-mini   avg pos 2.00   score 4.89   wins   6
+  3. Qwen2.5       avg pos 2.89   score 4.00   wins   4
+  4. Qwen2.5 3B    avg pos 3.28   score 3.56   wins   1
+```
 
-**Example Results:**
+**By Incident Complexity:**
 ```
-Rank   Model                     Avg Pos    Avg Score    Win Rate
-1      cause_risk_gpt4o          1.8        8.2          42.0%
-2      cause_risk_gpt4o_mini     2.4        7.5          26.0%
-3      cause_risk_qwen2_5        2.9        6.8          20.0%
-4      cause_risk_qwen2_5_3b     3.1        6.1          12.0%
+Simple (<500 events, 324 incidents):
+  1. GPT-4o        avg pos 1.57   score 7.76   wins 193
+  2. GPT-4o-mini   avg pos 1.91   score 7.29   wins  98
+
+Medium (500-2000 events, 62 incidents):
+  1. GPT-4o        avg pos 1.23   score 8.31   wins  49
+  2. GPT-4o-mini   avg pos 1.94   score 7.39   wins  11
+
+Complex (>2000 events, 146 incidents):
+  1. GPT-4o        avg pos 1.32   score 8.32   wins 107
+  2. GPT-4o-mini   avg pos 1.91   score 7.44   wins  31
 ```
 
 - **Win Rate**: % of times ranked #1
 - **Avg Position**: Lower is better (1-4 scale)
 - **Avg Score**: Higher is better (1-10 scale)
 
+**Key observations:**
+- GPT-4o dominates risk analysis across all complexity levels — unlike summarization where GPT-4o-mini wins
+- GPT-4o's advantage grows with complexity: win rate goes from 59.6% (simple) to 73.3% (complex)
+- Qwen2.5 7B underperforms its own 3B variant on risk analysis (avg pos 3.54 vs 3.08)
+- On Normal incidents, the gap between GPT-4o and GPT-4o-mini narrows significantly
+
+---
+
+## CLI Reference
+
+### evaluate_summaries.py / evaluate_risk.py
+
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--input`, `-i` | `datasets/summarization_dataset_v3.json` / `datasets/risk_dataset.json` | Input dataset |
+| `--output`, `-o` | `results/summary_results.json` / `results/risk_results.json` | Output results |
+| `--judge`, `-j` | `gpt-4o` | Judge model name |
+| `--base-url` | OpenAI default | Base URL for OpenAI-compatible API |
+| `--api-key` | `OPENAI_API_KEY` env var | API key (optional for local endpoints) |
+
+### analyze_results.py
+
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--results`, `-r` | `results/summary_results.json` | Input results file |
+| `--summary`, `-s` | `results/summary_report.md` | Output Markdown report |
+| `--csv`, `-c` | `results/summary_data.csv` | Output CSV file |
+| `--judge`, `-j` | `GPT-4o` | Judge name shown in report |
+
 ---
 
-For detailed dataset generation instructions, see [README_SUMMARY_WORKFLOW.md](README_SUMMARY_WORKFLOW.md) and [README_RISK_WORKFLOW.md](README_RISK_WORKFLOW.md).
+For detailed dataset generation instructions, see [README_dataset_summary_workflow.md](README_dataset_summary_workflow.md) and [README_dataset_risk_workflow.md](README_dataset_risk_workflow.md).