fix: resolve mypy failure and sync spec for task 112 (output_harmful_content)

tkdtaylor · tkdtaylor · commit e77490d9e486 · 2026-05-23T11:38:29.000-04:00
- Change the `cost_tier` annotation in OutputHarmfulContent from
  `Literal[...]` to `str`: the Detector protocol expects `str`, and all other
  detectors already use `str`. The mismatch was the root cause of the CI and
  Release-check failures on the task-112 push.
- Add B-018 behavior to docs/spec/behaviors.md for the new opt-in
  output.harmful_content detector (two-stage regex + LLM confirmation).
- Update README corpus counts to reflect actual row counts:
  single-shot 311 → 262, multi-turn 41 → 34.
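
For context on the first item, here is a minimal sketch of why a `Literal[...]` annotation can fail a protocol check. It assumes the Detector protocol declares `cost_tier` as a plain mutable `str` attribute; the `Detector`, `LiteralTier`, `StrTier`, and `register` definitions below are illustrative stand-ins, not the project's code. mypy treats mutable protocol attributes as invariant, so a narrower declared type is rejected even though the code runs fine.

```python
from typing import Literal, Protocol


class Detector(Protocol):
    # Assumed shape: a plain, mutable `str` attribute.
    cost_tier: str


class LiteralTier:
    # Narrower declared type than the protocol's `str`.
    cost_tier: Literal["static", "semantic", "llm"] = "llm"


class StrTier:
    # Same declared type as the protocol.
    cost_tier: str = "llm"


def register(detector: Detector) -> str:
    """Return the cost tier of any object matching the protocol."""
    return detector.cost_tier


register(StrTier())      # accepted by mypy
register(LiteralTier())  # rejected by mypy (invariant attribute); fine at runtime
```

Declaring the attribute as `str` keeps the class structurally compatible with the protocol. If the narrower type were wanted, the protocol side would have to change instead, for example by exposing `cost_tier` as a read-only property.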
diff --git a/README.md b/README.md
@@ -39,8 +39,8 @@ Numbers below are local preview measurements from 2026-05-05, generated by [`tes
 | Honeypot P95 latency budget | **≤ 16,000 ms** (empirical ~11,875–15,500 ms steady-state on the hardware envelope above) | [`tests/fitness/test_llm_p95_latency.py`](tests/fitness/test_llm_p95_latency.py); see [ADR-023](docs/architecture/decisions/023-llm-budget-soft-fail.md) for the budget rationale and measurement methodology |
 | Daemon cold-start budget | **≤ 5,000 ms** on the hardware envelope above | [`tests/fitness/test_cold_start_budget.py`](tests/fitness/test_cold_start_budget.py) |
 | Validator + honeypot model size | **~462 MB** GGUF (Q4_K_M) | [ADR-018](docs/architecture/decisions/018-validator-model-choice.md) |
-| Red-team corpus rows (single-shot) | **311** across 7 attack families (direct_injection, exfiltration, indirect_injection, jailbreak, obfuscation, tool_abuse, probe_attacks) | [`tests/eval/corpus/`](tests/eval/corpus/) |
-| Multi-turn scenario rows | **41** (chunked + scenarios) | [`tests/eval/corpus/`](tests/eval/corpus/) |
+| Red-team corpus rows (single-shot) | **262** across 7 attack families (direct_injection, exfiltration, indirect_injection, jailbreak, obfuscation, tool_abuse, probe_attacks) | [`tests/eval/corpus/`](tests/eval/corpus/) |
+| Multi-turn scenario rows | **34** (chunked + scenarios) | [`tests/eval/corpus/`](tests/eval/corpus/) |
 
 Re-run the full benchmark per the [Reproduce the model-selection benchmark](#reproduce-the-model-selection-benchmark) section. Fitness budgets are re-checked on every `make fitness` run.
 
diff --git a/docs/spec/behaviors.md b/docs/spec/behaviors.md
@@ -1,7 +1,7 @@
 # Behaviors
 
 **Project:** armor
-**Last updated:** 2026-05-17
+**Last updated:** 2026-05-23
 
 What the system does, observably. Each behavior describes a triggering condition, the system's response, and any externally-visible side effects.
 
@@ -323,6 +323,22 @@ Behaviors are numbered `B-001`, `B-002`, … sequentially. Numbers are stable re
 - **Failure modes:** Detector raises → that detector's verdict is recorded as `error`, pipeline continues (fail-open per detector). Whole pipeline raises → daemon returns `block` (fail-closed at pipeline level) and logs an internal error.
 - **References:** ADR-033, ADR-041
 
+### B-018: Detect harmful commands in model output (opt-in)
+
+- **Trigger:** Output check runs detector `output.harmful_content` when `detector.output_harmful_content.enabled = true` in `armor.toml` (disabled by default). Only fires on payloads with `source == MODEL_OUTPUT`.
+- **Response:** Two-stage detection:
+  1. **Stage 1 — regex fast path:** Scans the output for runnable attack commands in four families:
+     - **Cloud credential exfil:** `aws s3 cp … credentials`, `gsutil cp … credentials`, `az storage blob upload`.
+     - **Credential file access:** `cat ~/.aws/credentials`, `cat /etc/shadow`, `cat ~/.netrc`, `cat ~/.ssh/id_rsa`, `find / -name *.aws`, `find / -name credentials`.
+     - **IMDS / metadata endpoints:** `169.254.169.254`, `metadata.google.internal`, `100.100.100.200`.
+     - **Privilege escalation chains:** `aws ssm get-parameter --with-decryption`, `aws iam pass-role`, `aws iam attach-role-policy`.
+     If no pattern fires, returns `pass` immediately.
+  2. **Stage 2 — LLM confirmation:** If a pattern fires and the LLM session is available, the output is sent to the validator LLM with a dedicated prompt (`src/armor/llm/prompts/output_harmful_content.txt`). Returns `block` with `signal_id = output.harmful_content:confirmed` if the LLM returns `risky` with `confidence ≥ block_threshold` (default 0.6). Returns `advisory` with `signal_id = output.harmful_content:pattern_match` if the LLM returns `risky` below threshold, or if the LLM is unavailable.
+- **Configuration:** Controlled by `detector.output_harmful_content.enabled` (bool, default `false`) and `detector.output_harmful_content.block_threshold` (float, default 0.6). See `docs/spec/configuration.md`.
+- **Side effects:** On `block`: forensic record written with `attack_category = "output_harmful_content"`, severity `critical`. On `advisory`: session risk score incremented.
+- **Failure modes:** Stage 1 pattern raises → `error` verdict returned, pipeline continues (fail-open per detector). LLM unavailable → soft-fail to `advisory` with `signal_id = output.harmful_content:pattern_match`. LLM exception → soft-fail to `advisory` with error details in `details["error"]`.
+- **References:** task 112, corpus at `tests/eval/corpus/scenarios_multi_turn.yaml` (family: "authority_pedagogy_framing"), configuration keys at `docs/spec/configuration.md`.
+
 ---
 
 ## Edge cases and error behaviors
diff --git a/src/armor/detectors/output_harmful_content.py b/src/armor/detectors/output_harmful_content.py
@@ -9,7 +9,6 @@
 
 import re
 from pathlib import Path
-from typing import Literal
 
 from armor.llm.validator import validate
 from armor.types import Payload, SessionContext, Source, Verdict
@@ -32,7 +31,7 @@ class OutputHarmfulContent:
 
     id: str = "output.harmful_content"
     category: str = "output"
-    cost_tier: Literal["static", "semantic", "llm"] = "llm"
+    cost_tier: str = "llm"
 
     # Compiled patterns — shared across all instances
     _patterns: list[re.Pattern[str]] | None = None
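
The two-stage flow described by B-018 in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the detector's implementation: `PATTERNS`, `Result`, `LlmJudge`, and `check_output` are hypothetical names, only two of the listed patterns are shown, and the LLM is modelled as a callable returning a `(label, confidence)` pair. B-018 does not say what happens when the LLM returns a non-risky label, so the sketch assumes an advisory in that case.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical subset of the Stage 1 patterns named in B-018.
PATTERNS = [
    re.compile(r"cat\s+~/\.aws/credentials"),
    re.compile(r"169\.254\.169\.254"),
]

# Hypothetical LLM interface: takes the output text, returns (label, confidence).
LlmJudge = Callable[[str], tuple[str, float]]


@dataclass
class Result:
    verdict: str
    signal_id: Optional[str] = None
    details: dict = field(default_factory=dict)


def check_output(
    text: str, llm: Optional[LlmJudge], block_threshold: float = 0.6
) -> Result:
    # Stage 1: regex fast path. No match means an immediate pass.
    if not any(p.search(text) for p in PATTERNS):
        return Result("pass")

    advisory = Result("advisory", "output.harmful_content:pattern_match")

    # LLM unavailable: soft-fail to advisory.
    if llm is None:
        return advisory

    # Stage 2: LLM confirmation. An exception also soft-fails to advisory.
    try:
        label, confidence = llm(text)
    except Exception as exc:
        advisory.details["error"] = str(exc)
        return advisory

    if label == "risky" and confidence >= block_threshold:
        return Result("block", "output.harmful_content:confirmed")

    # Risky below threshold per B-018; non-risky labels land here by assumption.
    return advisory
```

Calling `check_output("cat ~/.aws/credentials", None)` returns an advisory with the `pattern_match` signal, while the same text with a judge that reports `("risky", 0.9)` returns a block.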