Detection is in-memory and stateless; the only persisted state is the existing tool-approval/quarantine store (unchanged). These are the core in-process types.
Enum: TierHard, TierSoft.
- TierHard → contributes to auto-quarantine; near-zero FP by construction.
- TierSoft → review-raise only; never auto-quarantines alone.
Reuses the existing report vocabulary: tool_poisoning, prompt_injection, rug_pull, exfiltration, malicious_code, uncategorized. (Maps onto ScanFinding.ThreatType.)
Read-only projection of one tool, supplied to checks:
- Server string: owning server name
- Name string: tool name
- Description string: raw description (un-normalized)
- InputSchema json.RawMessage
- OutputSchema json.RawMessage
- NormalizedText string: precomputed normalized description+schema-text (lazily, once per tool)
Validation: empty description/schema is valid input → yields zero signals (no error).
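A minimal sketch of the "lazily, once per tool" normalization, assuming sync.Once as the caching mechanism. The Normalized method, the private cache fields, and the lowercase-and-collapse-whitespace rule are placeholders invented for illustration; the text above describes NormalizedText as a plain field and does not specify the normalization itself.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"sync"
)

// ToolView mirrors the fields listed above. The once/normalized cache fields
// and the Normalized method are assumptions of this sketch.
type ToolView struct {
	Server       string
	Name         string
	Description  string
	InputSchema  json.RawMessage
	OutputSchema json.RawMessage

	once       sync.Once
	normalized string
}

// Normalized computes the normalized description+schema text on first use and
// caches it. The rule used here (lowercase, collapse whitespace) is a
// placeholder, not the project's real normalizer.
func (t *ToolView) Normalized() string {
	t.once.Do(func() {
		raw := t.Description + " " + string(t.InputSchema) + " " + string(t.OutputSchema)
		t.normalized = strings.ToLower(strings.Join(strings.Fields(raw), " "))
	})
	return t.normalized
}

func main() {
	tv := &ToolView{Server: "files", Name: "read", Description: "  Read   a FILE "}
	fmt.Printf("%q\n", tv.Normalized())

	// Empty description and schemas are valid input and normalize to "".
	empty := &ToolView{}
	fmt.Printf("%q\n", empty.Normalized())
}
```

A struct holding a sync.Once should not be copied after first use, so this variant is passed by pointer. An eagerly filled NormalizedText string field, as listed above, avoids that constraint when tools are passed by value.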
Read-only snapshot of all servers' current tools, enabling cross-server checks:
- Tools []ToolView
- ToolsByName map[string][]ToolView: for collision detection
- ToolNames map[string]struct{}: fast membership for "description references another tool"
- Built once per scan, passed to every check.
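The snapshot described above could be built in a single pass over the tools. The following is a sketch under stated assumptions: BuildRegistryView and HasCollision are hypothetical helper names, ToolView is trimmed to the fields the example needs, and a "collision" is assumed to mean the same tool name exposed by more than one server.

```go
package main

import "fmt"

// ToolView is reduced to the fields this example uses.
type ToolView struct {
	Server      string
	Name        string
	Description string
}

// RegistryView follows the field list above.
type RegistryView struct {
	Tools       []ToolView
	ToolsByName map[string][]ToolView
	ToolNames   map[string]struct{}
}

// BuildRegistryView (hypothetical name) indexes the tools once per scan.
func BuildRegistryView(tools []ToolView) RegistryView {
	reg := RegistryView{
		Tools:       tools,
		ToolsByName: make(map[string][]ToolView, len(tools)),
		ToolNames:   make(map[string]struct{}, len(tools)),
	}
	for _, t := range tools {
		reg.ToolsByName[t.Name] = append(reg.ToolsByName[t.Name], t)
		reg.ToolNames[t.Name] = struct{}{}
	}
	return reg
}

// HasCollision (hypothetical) reports whether a tool name is exposed by more
// than one distinct server.
func (r RegistryView) HasCollision(name string) bool {
	servers := make(map[string]struct{})
	for _, t := range r.ToolsByName[name] {
		servers[t.Server] = struct{}{}
	}
	return len(servers) > 1
}

func main() {
	reg := BuildRegistryView([]ToolView{
		{Server: "alpha", Name: "search"},
		{Server: "beta", Name: "search"},
		{Server: "alpha", Name: "read"},
	})
	fmt.Println("search collides:", reg.HasCollision("search"))
	fmt.Println("read collides:", reg.HasCollision("read"))
}
```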
Emitted by a Check.Inspect:
- CheckID string: stable identifier, e.g. "unicode.hidden", "shadowing.cross_server"
- Tier Tier
- ThreatType string
- Confidence float64: 0.0–1.0; for soft signals, the check emits a pre-discount value, which the position classifier then discounts
- Evidence string: render-safe (truncated, control-char/zero-width escaped); for payload.decoded this is the decoded content
- Detail string: short human explanation
Validation: Confidence clamped to [0,1]; Evidence length-capped; CheckID must be from the registered check set.
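The three validation rules can be sketched as one normalization step applied to each emitted signal. Everything beyond the Signal fields is an assumption made for illustration: the Normalize and renderSafe names, the 200-rune cap, the \u{...} escape format, and the two-entry registry.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

type Tier int

const (
	TierHard Tier = iota
	TierSoft
)

type Signal struct {
	CheckID    string
	Tier       Tier
	ThreatType string
	Confidence float64
	Evidence   string
	Detail     string
}

// maxEvidenceRunes is an assumed cap; the source does not state the limit.
const maxEvidenceRunes = 200

// registeredChecks stands in for the real registered check set.
var registeredChecks = map[string]struct{}{
	"unicode.hidden":         {},
	"shadowing.cross_server": {},
}

// renderSafe truncates to maxRunes and escapes control and format (Cf)
// characters; zero-width characters such as U+200B fall in category Cf.
func renderSafe(s string, maxRunes int) string {
	var b strings.Builder
	count := 0
	for _, r := range s {
		if count == maxRunes {
			b.WriteString("...")
			break
		}
		if unicode.IsControl(r) || unicode.Is(unicode.Cf, r) {
			fmt.Fprintf(&b, "\\u{%04X}", r)
		} else {
			b.WriteRune(r)
		}
		count++
	}
	return b.String()
}

// clamp01 maps NaN and negatives to 0 and values above 1 to 1.
func clamp01(c float64) float64 {
	if c != c || c < 0 {
		return 0
	}
	if c > 1 {
		return 1
	}
	return c
}

// Normalize (hypothetical) applies the validation rules to one signal.
func Normalize(s Signal) (Signal, error) {
	if _, ok := registeredChecks[s.CheckID]; !ok {
		return Signal{}, fmt.Errorf("unregistered check id %q", s.CheckID)
	}
	s.Confidence = clamp01(s.Confidence)
	s.Evidence = renderSafe(s.Evidence, maxEvidenceRunes)
	return s, nil
}

func main() {
	s, err := Normalize(Signal{CheckID: "unicode.hidden", Tier: TierHard, Confidence: 1.7, Evidence: "a\u200bb"})
	fmt.Println(s.Confidence, s.Evidence, err)
}
```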

    type Check interface {
        ID() string
        Inspect(tool ToolView, reg RegistryView) []Signal
    }

- Pure and total: no I/O, no panics escape (engine wraps each Inspect in recover()).
- Deterministic: identical inputs → identical output ordering.
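One way the engine might keep a panicking check from escaping is a deferred recover around each Inspect call. This is a sketch, not the engine's actual code: safeInspect, RunChecks, and the two demo checks are hypothetical, and the supporting types are trimmed to what the example needs.

```go
package main

import "fmt"

type ToolView struct{ Server, Name, Description string }
type RegistryView struct{ Tools []ToolView }
type Signal struct {
	CheckID    string
	Confidence float64
}

type Check interface {
	ID() string
	Inspect(tool ToolView, reg RegistryView) []Signal
}

// safeInspect runs one check and converts a panic into failed=true.
// The named results let the deferred function overwrite the return values.
func safeInspect(c Check, tool ToolView, reg RegistryView) (signals []Signal, failed bool) {
	defer func() {
		if r := recover(); r != nil {
			signals, failed = nil, true
		}
	}()
	return c.Inspect(tool, reg), false
}

// RunChecks visits checks in slice order, so identical inputs give
// identically ordered output.
func RunChecks(checks []Check, tool ToolView, reg RegistryView) (signals []Signal, failedIDs []string) {
	for _, c := range checks {
		out, failed := safeInspect(c, tool, reg)
		if failed {
			failedIDs = append(failedIDs, c.ID())
			continue
		}
		signals = append(signals, out...)
	}
	return signals, failedIDs
}

// nonEmptyCheck is a demo check: an empty description yields zero signals.
type nonEmptyCheck struct{}

func (nonEmptyCheck) ID() string { return "demo.nonempty" }
func (nonEmptyCheck) Inspect(tool ToolView, _ RegistryView) []Signal {
	if tool.Description == "" {
		return nil
	}
	return []Signal{{CheckID: "demo.nonempty", Confidence: 0.5}}
}

// panickyCheck is a demo check that always panics.
type panickyCheck struct{}

func (panickyCheck) ID() string { return "demo.panicky" }
func (panickyCheck) Inspect(ToolView, RegistryView) []Signal {
	panic("bug in check")
}

func main() {
	checks := []Check{nonEmptyCheck{}, panickyCheck{}}
	signals, failed := RunChecks(checks, ToolView{Name: "read", Description: "reads a file"}, RegistryView{})
	fmt.Println(len(signals), failed)
}
```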
Per-tool aggregation of signals:
- Any TierHard signal → finding ThreatLevel=dangerous, action = quarantine; severity from the hard signal (critical for escalated unicode/decoded, high otherwise).
- Else soft signals → severity by count of distinct soft CheckIDs: 1→low, 2→medium, 3+→high; ThreatLevel=warning, action = review.
- Confidence = combined (independent signals add, capped at 1.0); agreement raises it.
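The aggregation rules above can be sketched as a single function over one tool's signals. Several details are assumptions, since the text does not pin them down: the Finding struct is a stand-in for ScanFinding, the Escalated flag is a hypothetical way to mark an "escalated unicode/decoded" signal, and each distinct CheckID is assumed to contribute its highest confidence once to the sum.

```go
package main

import (
	"fmt"
	"sort"
)

type Tier int

const (
	TierHard Tier = iota
	TierSoft
)

type Signal struct {
	CheckID    string
	Tier       Tier
	Confidence float64
	Escalated  bool // hypothetical marker for escalated unicode/decoded signals
}

// Finding is a stand-in for the relevant ScanFinding fields.
type Finding struct {
	ThreatLevel string
	Action      string
	Severity    string
	Confidence  float64
	Signals     []string // contributing CheckIDs, sorted for determinism
}

// Aggregate returns ok=false when there are no signals (no finding).
func Aggregate(signals []Signal) (Finding, bool) {
	if len(signals) == 0 {
		return Finding{}, false
	}
	best := make(map[string]float64) // highest confidence per CheckID
	soft := make(map[string]struct{})
	hard, escalated := false, false
	for _, s := range signals {
		if c, seen := best[s.CheckID]; !seen || s.Confidence > c {
			best[s.CheckID] = s.Confidence
		}
		if s.Tier == TierHard {
			hard = true
			escalated = escalated || s.Escalated
		} else {
			soft[s.CheckID] = struct{}{}
		}
	}
	ids := make([]string, 0, len(best))
	for id := range best {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	f := Finding{Signals: ids}
	for _, id := range ids {
		f.Confidence += best[id] // independent signals add
	}
	if f.Confidence > 1 {
		f.Confidence = 1
	}
	switch {
	case hard:
		f.ThreatLevel, f.Action, f.Severity = "dangerous", "quarantine", "high"
		if escalated {
			f.Severity = "critical"
		}
	case len(soft) >= 3:
		f.ThreatLevel, f.Action, f.Severity = "warning", "review", "high"
	case len(soft) == 2:
		f.ThreatLevel, f.Action, f.Severity = "warning", "review", "medium"
	default:
		f.ThreatLevel, f.Action, f.Severity = "warning", "review", "low"
	}
	return f, true
}

func main() {
	f, _ := Aggregate([]Signal{
		{CheckID: "soft.a", Tier: TierSoft, Confidence: 0.5},
		{CheckID: "soft.b", Tier: TierSoft, Confidence: 0.25},
	})
	fmt.Println(f.Severity, f.Action, f.Confidence, f.Signals)
}
```

Two agreeing soft signals produce a review-only finding with a higher combined confidence than either alone, while a single hard signal is enough to select quarantine regardless of how many soft signals accompany it.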
New fields added to internal/security/scanner/types.go::ScanFinding:
- Confidence float64
- Signals []string: contributing CheckIDs
- Stop deduplicating by (rule_id+location) in a way that hides agreement. Independent signals on a tool add to the score (still bounded 0–100). Hard findings dominate; soft findings accumulate by distinct-signal count.
Per-scan:
- ChecksRun int, ChecksFailed int, FailedCheckIDs []string
- A failed check (recovered panic/error) increments ChecksFailed; the report surfaces "degraded confidence" exactly as today's scanners_failed path does. The scan never aborts.
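The per-scan counters could be maintained with a small accumulator, sketched below. The ScanSummary type name and the Record and Degraded methods are hypothetical, and the sketch assumes one Record call per check per scan; the text does not say whether ChecksRun counts checks or individual check-on-tool invocations.

```go
package main

import "fmt"

// ScanSummary (hypothetical name) holds the per-scan fields listed above.
type ScanSummary struct {
	ChecksRun      int
	ChecksFailed   int
	FailedCheckIDs []string
}

// Record notes the outcome of one check. A failure is counted and remembered,
// and the caller carries on with the next check; the scan is never aborted.
func (s *ScanSummary) Record(checkID string, failed bool) {
	s.ChecksRun++
	if failed {
		s.ChecksFailed++
		s.FailedCheckIDs = append(s.FailedCheckIDs, checkID)
	}
}

// Degraded reports whether the report should surface "degraded confidence".
func (s ScanSummary) Degraded() bool {
	return s.ChecksFailed > 0
}

func main() {
	var sum ScanSummary
	sum.Record("unicode.hidden", false)
	sum.Record("shadowing.cross_server", true)
	fmt.Println(sum.ChecksRun, sum.ChecksFailed, sum.FailedCheckIDs, sum.Degraded())
}
```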
In specs/065-evaluation-foundation/datasets/:
- id string, description string (and optionally name, input_schema, server for new checks)
- label ∈ {malicious, benign}
- category ∈ {tool_poisoning, prompt_injection, shadowing, rug_pull, unicode_smuggling, decoded_payload, capability_mismatch, benign, hard_negative}
- New entries add the unicode_smuggling, decoded_payload, capability_mismatch categories and more hard_negatives.
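A loader for one dataset entry might validate the label and category sets as follows. This sketch assumes the entries are stored as JSON objects with the field names listed above; the on-disk format is not stated in the text, and the Entry type, ParseEntry function, and the choice to require only id are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Entry follows the dataset fields listed above; JSON encoding is assumed.
type Entry struct {
	ID          string          `json:"id"`
	Description string          `json:"description"`
	Name        string          `json:"name,omitempty"`
	InputSchema json.RawMessage `json:"input_schema,omitempty"`
	Server      string          `json:"server,omitempty"`
	Label       string          `json:"label"`
	Category    string          `json:"category"`
}

var validLabels = map[string]bool{"malicious": true, "benign": true}

var validCategories = map[string]bool{
	"tool_poisoning": true, "prompt_injection": true, "shadowing": true,
	"rug_pull": true, "unicode_smuggling": true, "decoded_payload": true,
	"capability_mismatch": true, "benign": true, "hard_negative": true,
}

// ParseEntry (hypothetical) decodes one entry and checks label and category
// against the allowed sets.
func ParseEntry(data []byte) (Entry, error) {
	var e Entry
	if err := json.Unmarshal(data, &e); err != nil {
		return Entry{}, fmt.Errorf("decode entry: %w", err)
	}
	if e.ID == "" {
		return Entry{}, fmt.Errorf("entry is missing id")
	}
	if !validLabels[e.Label] {
		return Entry{}, fmt.Errorf("entry %s: unknown label %q", e.ID, e.Label)
	}
	if !validCategories[e.Category] {
		return Entry{}, fmt.Errorf("entry %s: unknown category %q", e.ID, e.Category)
	}
	return e, nil
}

func main() {
	e, err := ParseEntry([]byte(`{"id":"hn-1","description":"Lists files","label":"benign","category":"hard_negative"}`))
	fmt.Println(e.ID, e.Category, err)
}
```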
No new state machine. Hard findings feed the existing quarantine state machine (pending/approved/changed) via the existing quarantine integration — a hard finding marks the tool for quarantine through the current path; soft findings are report-only and do not transition approval state.