Commit a589df2

Document baseline comparison results
Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 3aa5aea commit a589df2

2 files changed

Lines changed: 36 additions & 1 deletion

README.md

Lines changed: 1 addition & 1 deletion
@@ -208,7 +208,7 @@ See [`docs/PROMPT_FORMAT.md`](docs/PROMPT_FORMAT.md) for the canonical system pr
 
 ## Evaluation and Baselines
 
-Baseline comparison infrastructure is available in [`docs/EVALS.md`](docs/EVALS.md). It supports reproducible comparisons against:
+Baseline comparison against base `Qwen/Qwen2.5-Coder-7B-Instruct` and Semgrep is documented in [`docs/EVALS.md`](docs/EVALS.md). It supports reproducible comparisons against:
 
 - the base `Qwen/Qwen2.5-Coder-7B-Instruct` model without the Nullsec adapter;
 - Semgrep with local benchmark rules and explicit coverage limitations.

docs/EVALS.md

Lines changed: 35 additions & 0 deletions
@@ -99,13 +99,48 @@ python benchmarks/compare_baselines.py \
 The generated comparison is a report artifact and should not be committed unless
 explicitly approved.
 
+## Baseline comparison
+
+Generated with `benchmarks/compare_baselines.py` from local reports. Raw
+generated reports remain ignored under `benchmarks/reports/`.
+
+| System / tool | Total cases | Outputs / analyzable | Precision | Recall | F1 | false_safe_rate | hallucination_rate | Notes / coverage limits |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| Nullsec-1 | 111 | 110 | 0.9423 | 0.9074 | 0.9245 | 0.0 | 0.0667 | RC2/v1.1 release or local run |
+| Qwen2.5-Coder-7B-Instruct (base, no Nullsec adapter) | 111 | 4 | 0.3333 | 0.0093 | 0.018 | 0.0 | 0.5 | base model, no Nullsec adapter |
+| Semgrep (local rules baseline) | 111 | 111 | 0.8627 | 0.4074 | 0.5535 | 0.5625 | 0.3333 | static rules; partial category coverage |
+
+### Output-count note
+
+The GitHub Release records `111/111` raw model outputs. The comparison table's
+`Outputs / analyzable` column uses `results.summary.total_outputs` from the
+report, which counts outputs that were alignable and scorable as structured
+verdicts by the benchmark pipeline. For the Nullsec-S1 report used here, one raw
+output was not alignable for scoring, so the comparison table shows `110`.
+
+### Interpretation
+
+- **Nullsec-S1** shows stronger structured security-verdict performance on this
+repo-authored benchmark.
+- **Base Qwen2.5-Coder-7B-Instruct** mostly failed to produce scorable
+Nullsec-style JSON security verdicts. This shows why the fine-tune and
+deterministic alignment layer matter for this output format and task.
+- **Semgrep** detects some static patterns with high precision, but has partial
+category coverage and lower recall on this benchmark. This is a local-rules
+Semgrep baseline on the Nullsec benchmark, not a general claim about Semgrep
+quality.
+
 ## Limitations
 
 - The benchmark is security-specific and repo-authored; it is not an independent
 third-party benchmark.
 - Baseline comparisons are meaningful only when all systems are run on the same
 dataset version.
+- Results should be reproduced from the scripts above; do not hand-enter metrics.
 - Semgrep is not expected to cover all categories and should be interpreted as a
 static-analysis baseline, not a security LLM.
+- Frontier/API model baselines such as Claude, GPT, or other hosted models are
+not included yet.
+- This comparison does not prove universal vulnerability detection performance.
 - Do not claim Nullsec-S1 beats another model or tool unless the comparison
 script output proves it.
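
The F1 column in the table added to `docs/EVALS.md` can be cross-checked against the reported precision and recall using the standard harmonic-mean formula. The sketch below is not part of the repository: `ROWS`, `f1`, and `check_rows` are hypothetical names introduced for illustration, and the numbers are copied from the table as committed, with row labels shortened.

```python
# Hypothetical cross-check; not taken from benchmarks/compare_baselines.py.
# Each entry is (precision, recall, reported F1), copied from the table.
ROWS = {
    "Nullsec-1": (0.9423, 0.9074, 0.9245),
    "Qwen2.5-Coder-7B-Instruct (base)": (0.3333, 0.0093, 0.018),
    "Semgrep (local rules baseline)": (0.8627, 0.4074, 0.5535),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def check_rows(rows, tol=5e-4):
    """Map each row name to whether its reported F1 matches within tol.

    The tolerance allows for the inputs being rounded to four decimals.
    """
    return {
        name: abs(f1(p, r) - reported) <= tol
        for name, (p, r, reported) in rows.items()
    }


if __name__ == "__main__":
    for name, ok in check_rows(ROWS).items():
        print(f"{name}: {'consistent' if ok else 'MISMATCH'}")
```

A check like this only confirms that the three columns agree with each other. It does not replace regenerating the metrics from the comparison script.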

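
The output-count note in the diff says the `Outputs / analyzable` column comes from `results.summary.total_outputs` in the report. A minimal sketch of that lookup follows. It assumes the report is JSON whose nesting mirrors that dotted path, which the commit does not confirm. The function names and the sample document are hypothetical.

```python
import json


def scorable_outputs(report):
    """Return the count of outputs the pipeline could align and score.

    Assumes the key path results -> summary -> total_outputs.
    """
    return report["results"]["summary"]["total_outputs"]


def unaligned_count(raw_outputs, report):
    """Raw outputs that did not make it into the scored total."""
    return raw_outputs - scorable_outputs(report)


if __name__ == "__main__":
    # Hypothetical report fragment using the figures quoted in the note.
    sample = json.loads('{"results": {"summary": {"total_outputs": 110}}}')
    print(unaligned_count(111, sample))  # prints 1
```

With the figures from the note (111 raw outputs, 110 scorable), the difference is the single output that could not be aligned for scoring.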