Commit e77490d

fix: resolve mypy failure and sync spec for task 112 (output_harmful_content)

- Fix `cost_tier: Literal[...]` type annotation in OutputHarmfulContent. The Detector protocol expects `str`, and all other detectors use `str`. This was the root cause of CI and Release-check failures on the task-112 push.
- Add B-018 behavior to docs/spec/behaviors.md for the new opt-in output.harmful_content detector (two-stage regex + LLM confirmation).
- Update README corpus counts to reflect actual row counts: single-shot 311 → 262, multi-turn 41 → 34.

1 parent c49578c commit e77490d

3 files changed

Lines changed: 20 additions & 5 deletions

README.md

Lines changed: 2 additions & 2 deletions
@@ -39,8 +39,8 @@ Numbers below are local preview measurements from 2026-05-05, generated by [`tes
 | Honeypot P95 latency budget | **≤ 16,000 ms** (empirical ~11,875–15,500 ms steady-state on the hardware envelope above) | [`tests/fitness/test_llm_p95_latency.py`](tests/fitness/test_llm_p95_latency.py); see [ADR-023](docs/architecture/decisions/023-llm-budget-soft-fail.md) for the budget rationale and measurement methodology |
 | Daemon cold-start budget | **≤ 5,000 ms** on the hardware envelope above | [`tests/fitness/test_cold_start_budget.py`](tests/fitness/test_cold_start_budget.py) |
 | Validator + honeypot model size | **~462 MB** GGUF (Q4_K_M) | [ADR-018](docs/architecture/decisions/018-validator-model-choice.md) |
-| Red-team corpus rows (single-shot) | **311** across 7 attack families (direct_injection, exfiltration, indirect_injection, jailbreak, obfuscation, tool_abuse, probe_attacks) | [`tests/eval/corpus/`](tests/eval/corpus/) |
-| Multi-turn scenario rows | **41** (chunked + scenarios) | [`tests/eval/corpus/`](tests/eval/corpus/) |
+| Red-team corpus rows (single-shot) | **262** across 7 attack families (direct_injection, exfiltration, indirect_injection, jailbreak, obfuscation, tool_abuse, probe_attacks) | [`tests/eval/corpus/`](tests/eval/corpus/) |
+| Multi-turn scenario rows | **34** (chunked + scenarios) | [`tests/eval/corpus/`](tests/eval/corpus/) |

 Re-run the full benchmark per the [Reproduce the model-selection benchmark](#reproduce-the-model-selection-benchmark) section. Fitness budgets are re-checked on every `make fitness` run.

docs/spec/behaviors.md

Lines changed: 17 additions & 1 deletion
@@ -1,7 +1,7 @@
 # Behaviors

 **Project:** armor
-**Last updated:** 2026-05-17
+**Last updated:** 2026-05-23

 What the system does, observably. Each behavior describes a triggering condition, the system's response, and any externally-visible side effects.

@@ -323,6 +323,22 @@ Behaviors are numbered `B-001`, `B-002`, … sequentially. Numbers are stable re
 - **Failure modes:** Detector raises → that detector's verdict is recorded as `error`, pipeline continues (fail-open per detector). Whole pipeline raises → daemon returns `block` (fail-closed at pipeline level) and logs an internal error.
 - **References:** ADR-033, ADR-041

+### B-018: Detect harmful commands in model output (opt-in)
+
+- **Trigger:** Output check runs detector `output.harmful_content` when `detector.output_harmful_content.enabled = true` in `armor.toml` (disabled by default). Only fires on payloads with `source == MODEL_OUTPUT`.
+- **Response:** Two-stage detection:
+  1. **Stage 1 — regex fast path:** Scans the output for runnable attack commands in four families:
+     - **Cloud credential exfil:** `aws s3 cp … credentials`, `gsutil cp … credentials`, `az storage blob upload`.
+     - **Credential file access:** `cat ~/.aws/credentials`, `cat /etc/shadow`, `cat ~/.netrc`, `cat ~/.ssh/id_rsa`, `find / -name *.aws`, `find / -name credentials`.
+     - **IMDS / metadata endpoints:** `169.254.169.254`, `metadata.google.internal`, `100.100.100.200`.
+     - **Privilege escalation chains:** `aws ssm get-parameter --with-decryption`, `aws iam pass-role`, `aws iam attach-role-policy`.
+     If no pattern fires, returns `pass` immediately.
+  2. **Stage 2 — LLM confirmation:** If a pattern fires and the LLM session is available, the output is sent to the validator LLM with a dedicated prompt (`src/armor/llm/prompts/output_harmful_content.txt`). Returns `block` with `signal_id = output.harmful_content:confirmed` if the LLM returns `risky` with `confidence ≥ block_threshold` (default 0.6). Returns `advisory` with `signal_id = output.harmful_content:pattern_match` if the LLM returns `risky` below threshold, or if the LLM is unavailable.
+- **Configuration:** Controlled by `detector.output_harmful_content.enabled` (bool, default `false`) and `detector.output_harmful_content.block_threshold` (float, default 0.6). See `docs/spec/configuration.md`.
+- **Side effects:** On `block`: forensic record written with `attack_category = "output_harmful_content"`, severity `critical`. On `advisory`: session risk score incremented.
+- **Failure modes:** Stage 1 pattern raises → `error` verdict returned, pipeline continues (fail-open per detector). LLM unavailable → soft-fail to `advisory` with `signal_id = output.harmful_content:pattern_match`. LLM exception → soft-fail to `advisory` with error details in `details["error"]`.
+- **References:** task 112, corpus at `tests/eval/corpus/scenarios_multi_turn.yaml` (family: "authority_pedagogy_framing"), configuration keys at `docs/spec/configuration.md`.
+
 ---

 ## Edge cases and error behaviors
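
The two-stage flow that B-018 describes can be sketched roughly as follows. This is an illustration, not the detector's actual code: `SketchVerdict`, `check_output`, and the `llm` callable are hypothetical names, the regex list is a small hand-written subset of the four families, and two behaviors the spec leaves open are filled in as labeled assumptions (the outcome when the LLM does not return `risky`, and the `signal_id` used on an LLM exception).

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple

# Illustrative subset of the Stage 1 families named in B-018.
# The real pattern list is not shown in this commit.
_PATTERNS = [
    re.compile(r"aws\s+s3\s+cp\b.*credentials"),
    re.compile(r"cat\s+(~/\.aws/credentials|/etc/shadow|~/\.netrc|~/\.ssh/id_rsa)"),
    re.compile(r"169\.254\.169\.254|metadata\.google\.internal|100\.100\.100\.200"),
    re.compile(r"aws\s+iam\s+(pass-role|attach-role-policy)"),
]

PATTERN_MATCH = "output.harmful_content:pattern_match"
CONFIRMED = "output.harmful_content:confirmed"


@dataclass
class SketchVerdict:
    """Hypothetical stand-in for armor.types.Verdict."""
    action: str
    signal_id: Optional[str] = None
    details: dict = field(default_factory=dict)


# Hypothetical LLM hook: takes the output text, returns (label, confidence).
LlmCheck = Callable[[str], Tuple[str, float]]


def check_output(text: str,
                 llm: Optional[LlmCheck] = None,
                 block_threshold: float = 0.6) -> SketchVerdict:
    # Stage 1: regex fast path. No match means an immediate pass.
    if not any(p.search(text) for p in _PATTERNS):
        return SketchVerdict("pass")

    # LLM unavailable: soft-fail to advisory.
    if llm is None:
        return SketchVerdict("advisory", PATTERN_MATCH)

    # Stage 2: LLM confirmation, soft-failing to advisory on any exception.
    try:
        label, confidence = llm(text)
    except Exception as exc:
        # Assumption: the spec names details["error"] but not the signal_id here.
        return SketchVerdict("advisory", PATTERN_MATCH, {"error": str(exc)})

    if label == "risky" and confidence >= block_threshold:
        return SketchVerdict("block", CONFIRMED)
    if label == "risky":
        return SketchVerdict("advisory", PATTERN_MATCH)
    # Assumption: B-018 does not say what a non-risky LLM answer yields.
    return SketchVerdict("pass")
```

Passing the LLM as an optional callable keeps the soft-fail paths easy to exercise: `check_output(text)` with no `llm` takes the "LLM unavailable" branch, and a callable that raises takes the exception branch.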

src/armor/detectors/output_harmful_content.py

Lines changed: 1 addition & 2 deletions
@@ -9,7 +9,6 @@

 import re
 from pathlib import Path
-from typing import Literal

 from armor.llm.validator import validate
 from armor.types import Payload, SessionContext, Source, Verdict
@@ -32,7 +31,7 @@ class OutputHarmfulContent:

     id: str = "output.harmful_content"
     category: str = "output"
-    cost_tier: Literal["static", "semantic", "llm"] = "llm"
+    cost_tier: str = "llm"

     # Compiled patterns — shared across all instances
     _patterns: list[re.Pattern[str]] | None = None
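
For context on why the narrower annotation fails type checking: mypy treats mutable protocol attributes as invariant, so a class that declares `cost_tier` as a `Literal[...]` does not structurally match a protocol declaring `cost_tier: str`, even though every literal value is a `str`. The sketch below shows the shape of the fix. The `Detector` protocol here is a hypothetical reconstruction (the commit only states that it expects `str`), and `OutputHarmfulContentSketch` and `register` are illustrative names.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Detector(Protocol):
    """Hypothetical shape of the protocol; the real one is not shown in this commit."""
    id: str
    category: str
    cost_tier: str


class OutputHarmfulContentSketch:
    id: str = "output.harmful_content"
    category: str = "output"
    # Declared as plain `str` to match the protocol exactly. Annotating this
    # as Literal["static", "semantic", "llm"] would make mypy reject the class
    # wherever a Detector is expected, because mutable protocol attributes
    # are invariant.
    cost_tier: str = "llm"


def register(detector: Detector) -> str:
    # Accepts anything that structurally satisfies Detector.
    return f"{detector.id} ({detector.cost_tier})"


print(register(OutputHarmfulContentSketch()))
```

The runtime `isinstance` check enabled by `runtime_checkable` only verifies that the attributes exist. The type mismatch fixed by this commit is visible to the static checker only, which is consistent with it surfacing in CI rather than at runtime.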
