Skip to content

Commit 86634b3

Browse files
authored
feat(security): consensus-additive risk score + report transparency (Spec 076 US4) (#776)
T020: risk-score aggregation now treats independent signals on a tool as additive instead of collapsing agreement. The deterministic scanner emits one ScanFinding per tool whose Signals list every check that fired; CalculateRiskScore weights each (deduplicated) finding by its distinct-signal count, so a tool flagged by several checks scores higher than one flagged by a single check (FR-006, SC-007). Legacy/cross-scanner findings carry no signals and weigh 1, so existing scoring and the same-rule+location de-duplication are unchanged. T021: surface confidence + signals in the CLI report (printFindingsList renders "Confidence:" and "Signals:" lines) and confirm they serialize through the REST aggregated scan report. Docs note added under security-commands.md. Tests: consensus-raises-score + cross-scanner-dedup-retained scoring tests, CLI render + absent-field tests, report-level serialization test. Related #MCP-3578
1 parent 02227b8 commit 86634b3

7 files changed

Lines changed: 202 additions & 15 deletions

File tree

cmd/mcpproxy/security_cmd.go

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2072,6 +2072,25 @@ func printFindingsList(findings []interface{}) {
20722072
}
20732073
}
20742074

2075+
// Deterministic-scanner transparency (Spec 076 US4): the combined
2076+
// confidence and the independent checks that contributed to this
2077+
// finding, so an operator can see WHY a tool was flagged and that
2078+
// agreement among checks raised its score.
2079+
if conf, ok := finding["confidence"].(float64); ok && conf > 0 {
2080+
fmt.Printf(" Confidence: %.2f\n", conf)
2081+
}
2082+
if rawSignals, ok := finding["signals"].([]interface{}); ok && len(rawSignals) > 0 {
2083+
names := make([]string, 0, len(rawSignals))
2084+
for _, s := range rawSignals {
2085+
if name, ok := s.(string); ok && name != "" {
2086+
names = append(names, name)
2087+
}
2088+
}
2089+
if len(names) > 0 {
2090+
fmt.Println(" Signals: " + strings.Join(names, ", "))
2091+
}
2092+
}
2093+
20752094
// Package info
20762095
if pkg != "" {
20772096
pkgLine := " Package: " + pkg

cmd/mcpproxy/security_cmd_test.go

Lines changed: 57 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -241,6 +241,63 @@ func TestPrintFindingsListSkipsEmptyFields(t *testing.T) {
241241
}
242242
}
243243

244+
// TestPrintFindingsListRendersConfidenceAndSignals verifies Spec-076 US4
245+
// (FR-010 / SC-007): the CLI surfaces each finding's combined confidence and
246+
// the deterministic checks that contributed to it, so an operator can see WHY a
247+
// tool was flagged.
248+
func TestPrintFindingsListRendersConfidenceAndSignals(t *testing.T) {
249+
out := captureStdout(t, func() {
250+
printFindingsList([]interface{}{
251+
map[string]interface{}{
252+
"severity": "high",
253+
"rule_id": "detect.unicode.hidden",
254+
"title": "Hidden Unicode in tool description",
255+
"location": "srv:exfiltrate",
256+
"scanner": "tpa-descriptions",
257+
"threat_level": "dangerous",
258+
"threat_type": "tool_poisoning",
259+
"confidence": 0.92,
260+
"signals": []interface{}{"unicode.hidden", "directive.imperative"},
261+
},
262+
})
263+
})
264+
265+
wants := []string{
266+
"Confidence:",
267+
"0.92",
268+
"Signals:",
269+
"unicode.hidden",
270+
"directive.imperative",
271+
}
272+
for _, w := range wants {
273+
if !strings.Contains(out, w) {
274+
t.Errorf("output missing %q\n--- output ---\n%s", w, out)
275+
}
276+
}
277+
}
278+
279+
// TestPrintFindingsListSkipsConfidenceWhenAbsent ensures plain CVE-style
280+
// findings (no deterministic-scanner data) do not grow a noisy empty
281+
// "Confidence:"/"Signals:" row.
282+
func TestPrintFindingsListSkipsConfidenceWhenAbsent(t *testing.T) {
283+
out := captureStdout(t, func() {
284+
printFindingsList([]interface{}{
285+
map[string]interface{}{
286+
"severity": "high",
287+
"rule_id": "CVE-2024-0001",
288+
"title": "CVE-2024-0001",
289+
"location": "node_modules/foo/index.js",
290+
},
291+
})
292+
})
293+
if strings.Contains(out, "Confidence:") {
294+
t.Errorf("expected no 'Confidence:' row when confidence is absent; got:\n%s", out)
295+
}
296+
if strings.Contains(out, "Signals:") {
297+
t.Errorf("expected no 'Signals:' row when signals are absent; got:\n%s", out)
298+
}
299+
}
300+
244301
// TestPrintScanContextSection verifies that the scan-context block renders
245302
// the source method, container, and file/tool counts. This is the signal
246303
// users need to triage a finding (e.g. is "tools.json:85" from a real Docker

docs/cli/security-commands.md

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -498,7 +498,13 @@ The `Scanners: X run, Y failed (names) of Z` line surfaces per-scanner failures
498498

499499
- `risk_score` — composite 0-100 score
500500
- `summary` — severity counts (`critical`, `high`, `medium`, `low`, `info`, `dangerous`, `warnings`, `info_level`, `total`)
501-
- `findings` — normalized findings across all scanners
501+
- `findings` — normalized findings across all scanners. Findings from the
502+
deterministic tool-scanner (Spec 076) additionally carry `confidence` (0.0–1.0
503+
combined confidence) and `signals` (the independent check IDs that fired, e.g.
504+
`unicode.hidden`, `directive.imperative`). When several independent checks
505+
agree on one tool, that agreement **adds** to the composite `risk_score`
506+
rather than being collapsed — the table report renders these as `Confidence:`
507+
and `Signals:` lines under the finding.
502508
- `reports` — per-scanner raw results (also includes SARIF when `?include_sarif=true` is passed to the REST endpoint)
503509
- `scanner_statuses` — per-scanner execution records, each with `scanner_id`, `status`, `started_at`, `completed_at`, `duration_ms` (wall-clock execution time in milliseconds), `findings_count`, and `error`
504510
- `scan_context` — source method, source path, scanned file list

internal/security/scanner/sarif.go

Lines changed: 37 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -253,12 +253,20 @@ func categorizeFromRule(ruleID string, props map[string]any, rules map[string]SA
253253
//
254254
// This uses logarithmic diminishing returns so duplicate findings from multiple
255255
// scanners don't inflate the score, while still reflecting cumulative risk.
256-
// Findings are deduplicated by (rule_id + location) before scoring.
256+
// Identical findings reported by several scanners (same rule_id + location) are
257+
// deduplicated before scoring.
257258
//
258-
// Formula per category: category_score = weight * log2(1 + unique_count)
259-
// - Dangerous: weight 25 (1 finding=25, 2=40, 4=58, 8=72, cap 80)
260-
// - Warning: weight 6 (1 finding=6, 2=10, 4=15, 8=18, cap 25)
261-
// - Info: weight 2 (1 finding=2, 2=3, 4=5, 8=6, cap 10)
259+
// Spec 076 (FR-006, SC-007) — consensus is additive: the deterministic scanner
260+
// emits ONE finding per tool whose Signals list every independent check that
261+
// fired. A finding contributes its consensus weight (the count of distinct
262+
// contributing signals, min 1) to its category, so a tool flagged by several
263+
// checks raises the score instead of being collapsed to one. Findings from
264+
// scanners that emit no per-signal data weigh 1, so legacy scoring is unchanged.
265+
//
266+
// Formula per category: category_score = weight * log2(1 + weighted_count)
267+
// - Dangerous: weight 25 (1=25, 2=40, 4=58, 8=72, cap 80)
268+
// - Warning: weight 6 (1=6, 2=10, 4=15, 8=18, cap 25)
269+
// - Info: weight 2 (1=2, 2=3, 4=5, 8=6, cap 10)
262270
//
263271
// Note: This score is an experimental heuristic. There is no industry standard
264272
// for aggregating multi-scanner MCP security findings into a single number.
@@ -282,24 +290,28 @@ func CalculateRiskScore(findings []ScanFinding) int {
282290
seen[key] = true
283291
}
284292

293+
// Consensus weight: independent signals on one tool ADD to the score
294+
// (FR-006). A single-signal or signal-less finding weighs 1.
295+
weight := consensusWeight(f)
296+
285297
switch f.ThreatLevel {
286298
case ThreatLevelDangerous:
287-
dangerousCount++
299+
dangerousCount += weight
288300
case ThreatLevelWarning:
289-
warningCount++
301+
warningCount += weight
290302
case ThreatLevelInfo:
291-
infoCount++
303+
infoCount += weight
292304
default:
293305
// Unclassified: use severity as fallback
294306
switch f.Severity {
295307
case SeverityCritical:
296-
dangerousCount++
308+
dangerousCount += weight
297309
case SeverityHigh:
298-
warningCount++
310+
warningCount += weight
299311
case SeverityMedium:
300-
warningCount++
312+
warningCount += weight
301313
case SeverityLow:
302-
infoCount++
314+
infoCount += weight
303315
}
304316
}
305317
}
@@ -327,6 +339,19 @@ func CalculateRiskScore(findings []ScanFinding) int {
327339
return score
328340
}
329341

342+
// consensusWeight returns how much a single (deduplicated) finding contributes
343+
// to its risk category. The deterministic scanner (Spec 076) aggregates every
344+
// independent check that fired on a tool into one finding's Signals list, so a
345+
// finding flagged by N distinct checks weighs N — agreement raises the score
346+
// rather than being collapsed (FR-006). Findings with zero or one signal — and
347+
// every legacy scanner finding, which carries none — weigh 1.
348+
func consensusWeight(f ScanFinding) int {
349+
if n := len(f.Signals); n > 1 {
350+
return n
351+
}
352+
return 1
353+
}
354+
330355
// SummarizeFindings produces a ReportSummary from findings
331356
func SummarizeFindings(findings []ScanFinding) ReportSummary {
332357
summary := ReportSummary{Total: len(findings)}

internal/security/scanner/sarif_test.go

Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -286,6 +286,56 @@ func TestCalculateRiskScore(t *testing.T) {
286286
}
287287
}
288288

289+
// TestCalculateRiskScoreConsensusRaisesScore proves Spec-076 FR-006 / SC-007:
290+
// independent signals on one tool ADD to the risk score instead of being
291+
// collapsed, so a tool flagged by several checks scores higher than one flagged
292+
// by a single check (consensus is visible). The deterministic scanner emits ONE
293+
// ScanFinding per tool carrying every contributing check in Signals; the score
294+
// must weigh that finding by its signal count.
295+
func TestCalculateRiskScoreConsensusRaisesScore(t *testing.T) {
296+
single := []ScanFinding{{
297+
RuleID: "detect.unicode.hidden",
298+
Location: "srv:tool_a",
299+
ThreatLevel: ThreatLevelWarning,
300+
Signals: []string{"unicode.hidden"},
301+
}}
302+
consensus := []ScanFinding{{
303+
RuleID: "detect.unicode.hidden",
304+
Location: "srv:tool_a",
305+
ThreatLevel: ThreatLevelWarning,
306+
Signals: []string{"unicode.hidden", "directive.imperative", "capability.mismatch"},
307+
}}
308+
309+
singleScore := CalculateRiskScore(single)
310+
consensusScore := CalculateRiskScore(consensus)
311+
312+
// single: 1 signal → warning count 1 → 6*log2(2)=6
313+
if singleScore != 6 {
314+
t.Errorf("single-signal score = %d, want 6", singleScore)
315+
}
316+
// consensus: 3 signals → warning count 3 → 6*log2(4)=12
317+
if consensusScore != 12 {
318+
t.Errorf("consensus score = %d, want 12", consensusScore)
319+
}
320+
if consensusScore <= singleScore {
321+
t.Errorf("consensus score %d must exceed single-signal score %d", consensusScore, singleScore)
322+
}
323+
}
324+
325+
// TestCalculateRiskScoreCrossScannerDedupRetained guards that the consensus
326+
// change does NOT break the legitimate cross-scanner de-duplication: the same
327+
// finding (rule_id+location) reported by multiple scanners with no per-signal
328+
// data still counts once. Only independent signals WITHIN a finding add.
329+
func TestCalculateRiskScoreCrossScannerDedupRetained(t *testing.T) {
330+
findings := []ScanFinding{
331+
{RuleID: "MCP-TP-003", Location: "tool:add_numbers", ThreatLevel: ThreatLevelDangerous, Scanner: "scanner-a"},
332+
{RuleID: "MCP-TP-003", Location: "tool:add_numbers", ThreatLevel: ThreatLevelDangerous, Scanner: "scanner-b"},
333+
}
334+
if got := CalculateRiskScore(findings); got != 25 {
335+
t.Errorf("cross-scanner duplicate score = %d, want 25 (deduped to one dangerous)", got)
336+
}
337+
}
338+
289339
func TestSummarizeFindings(t *testing.T) {
290340
findings := []ScanFinding{
291341
{Severity: SeverityCritical},

internal/security/scanner/types_test.go

Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -40,6 +40,36 @@ func TestScanFindingConfidenceSignalsRoundTrip(t *testing.T) {
4040
}
4141
}
4242

43+
// TestAggregatedReportSerializesConfidenceSignals confirms the Spec-076 US4
44+
// transparency fields survive serialization at the REST scan-report level (the
45+
// payload returned by GET /api/v1/.../scan-report), not just on a bare
46+
// ScanFinding — operators consume confidence/signals through the report API.
47+
func TestAggregatedReportSerializesConfidenceSignals(t *testing.T) {
48+
report := AggregatedReport{
49+
JobID: "job-1",
50+
ServerName: "srv",
51+
RiskScore: 42,
52+
Findings: []ScanFinding{{
53+
RuleID: "detect.unicode.hidden",
54+
Severity: SeverityHigh,
55+
ThreatLevel: ThreatLevelDangerous,
56+
Location: "srv:exfiltrate",
57+
Confidence: 0.92,
58+
Signals: []string{"unicode.hidden", "directive.imperative"},
59+
}},
60+
}
61+
data, err := json.Marshal(report)
62+
if err != nil {
63+
t.Fatalf("marshal: %v", err)
64+
}
65+
if !strings.Contains(string(data), `"confidence":0.92`) {
66+
t.Errorf("report finding confidence not serialized: %s", data)
67+
}
68+
if !strings.Contains(string(data), `"signals":["unicode.hidden","directive.imperative"]`) {
69+
t.Errorf("report finding signals not serialized: %s", data)
70+
}
71+
}
72+
4373
// TestScanFindingOmitEmpty ensures the new fields are omitempty so existing
4474
// consumers and stored findings are byte-unaffected when they are unset.
4575
func TestScanFindingOmitEmpty(t *testing.T) {

specs/076-deterministic-tool-scanner/tasks.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -85,8 +85,8 @@ Single Go module. New package `internal/security/detect/` (engine + `checks/`);
8585

8686
**Independent test**: a multi-signal tool yields a finding listing each check, carrying confidence, with severity rising by signal count.
8787

88-
- [ ] T020 [US4] Update the risk-score aggregation in `internal/security/scanner/` (types.go / sarif.go scoring) so independent signals on a tool ADD to the score rather than dedup by `(rule_id+location)`; write a scoring test proving consensus raises the score (FR-006, SC-007).
89-
- [ ] T021 [P] [US4] Surface `confidence` + `signals` in the CLI report (`cmd/mcpproxy/security_cmd.go` printReportTable) and confirm they serialize in the REST scan report; add/update the report-rendering test.
88+
- [x] T020 [US4] Update the risk-score aggregation in `internal/security/scanner/` (types.go / sarif.go scoring) so independent signals on a tool ADD to the score rather than dedup by `(rule_id+location)`; write a scoring test proving consensus raises the score (FR-006, SC-007).
89+
- [x] T021 [P] [US4] Surface `confidence` + `signals` in the CLI report (`cmd/mcpproxy/security_cmd.go` printReportTable) and confirm they serialize in the REST scan report; add/update the report-rendering test.
9090

9191
**Checkpoint**: operator can see why a tool was flagged and how strongly.
9292

0 commit comments

Comments
 (0)