feat: config-pluggable refusal classifiers and leak detectors by DevamShah · Pull Request #321 · msoedov/agentic_security

DevamShah · 2026-06-23T01:36:36Z

Summary

Adds a detector registry and an agentic_security.toml [detectors] section so users can enable, disable, or register custom refusal-classifier and leak-detector plugins through config instead of editing source.

Closes #82

Problem / motivation

PIIDetector and SandboxEscapeDetector were instantiated and wired directly in probe_actor/refusal.py, and refusal_classifier_manager was populated with a hardcoded plugin list. The RefusalClassifierPlugin ABC already existed, but there was no supported way to:

- toggle the bundled leak detectors per scan, or
- drop in an organization-specific detector (e.g. infrastructure-fingerprint leak signatures) without forking.

Anyone needing a custom signature had to patch the module, which does not survive upgrades and is invisible to the rest of the pipeline. This is the gap #82 (a good first issue / help wanted) asks to close.

Change

New agentic_security/refusal_classifier/registry.py:
- DetectorRegistry maps a plugin name to a zero-argument factory and assembles the enabled set from parsed config via build_from_config(...).
- load_plugin_class("pkg.module:ClassName") (also accepts dotted form) resolves custom detectors by import path, with explicit ValueError / ImportError / TypeError on bad paths.
- A module-level registry registers the built-in leak detectors pii and sandbox_escape (disabled by default, so refusal_heuristic behaviour is unchanged).
- Built-ins are validated to implement is_refusal(response) before use.
probe_actor/refusal.py:
- Registers default and ml_classifier on the shared registry (kept here so the trained model is not imported eagerly by the registry module).
- New build_refusal_manager(config=None) reads the [detectors] table via settings_var("detectors", ...) and populates a RefusalClassifierManager. Public symbols (refusal_classifier_manager, pii_detector, sandbox_escape_detector, the heuristics, the ABC) are preserved.
config.py: the generated default agentic_security.toml now documents a [detectors] section with the four built-ins and a commented custom-plugin example.
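
The import-path resolution described for load_plugin_class above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function body and the mapping of each failure to ValueError / ImportError / TypeError are assumptions based only on the description above.

```python
import importlib


def load_plugin_class(path: str) -> type:
    """Resolve 'pkg.module:ClassName' (or dotted 'pkg.module.ClassName') to a class.

    Assumed error mapping (not confirmed by the PR text):
      ValueError  - the path has no module part or no class part
      ImportError - the module or the attribute cannot be found
      TypeError   - the resolved object is not a class
    """
    if ":" in path:
        module_name, _, class_name = path.partition(":")
    else:
        # Dotted form: everything before the last dot is treated as the module.
        module_name, _, class_name = path.rpartition(".")
    if not module_name or not class_name:
        raise ValueError(f"expected 'pkg.module:ClassName', got {path!r}")

    module = importlib.import_module(module_name)  # ImportError if missing
    try:
        obj = getattr(module, class_name)
    except AttributeError as exc:
        raise ImportError(f"{module_name!r} has no attribute {class_name!r}") from exc
    if not isinstance(obj, type):
        raise TypeError(f"{path!r} resolved to {type(obj).__name__}, not a class")
    return obj


# Both spellings resolve to the same standard-library class.
colon_form = load_plugin_class("collections:OrderedDict")
dotted_form = load_plugin_class("collections.OrderedDict")
```

Failing early with a specific exception type means a typo in the config surfaces at manager build time rather than partway through a scan.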

A detector value may be a bool (toggle a registered plugin) or a table (class import path + optional options kwargs + optional enabled).
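
To make the bool-or-table rule concrete, here is a hedged sketch of how a registry might assemble the enabled set. It is not the merged implementation: RegexLeakDetector, the infra_fingerprint entry, the acme_detectors path, and the resolve parameter are hypothetical, and import-path resolution is stubbed with a dictionary lookup so the example runs without an installed plugin package. Only the names DetectorRegistry, build_from_config, is_refusal, and the class / options / enabled keys come from the description above.

```python
import re
from typing import Any, Callable, Dict, Mapping, Optional


class RegexLeakDetector:
    """Hypothetical detector; only the is_refusal(response) method name is from the PR."""

    def __init__(self, pattern: str = r"internal\.example\.com") -> None:
        self._pattern = re.compile(pattern)

    def is_refusal(self, response: str) -> bool:
        return bool(self._pattern.search(response))


class DetectorRegistry:
    """Maps a plugin name to a zero-argument factory (sketch)."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], Any]] = {}
        self._default_on: Dict[str, bool] = {}

    def register(self, name: str, factory: Callable[[], Any], enabled: bool = True) -> None:
        self._factories[name] = factory
        self._default_on[name] = enabled

    def build_from_config(
        self, config: Optional[Mapping[str, Any]], resolve: Callable[[str], type]
    ) -> Dict[str, Any]:
        # Start from the registered defaults, then apply config overrides.
        selected = {n: f for n, f in self._factories.items() if self._default_on[n]}
        for name, value in (config or {}).items():
            if isinstance(value, bool):  # bool: toggle a registered plugin
                if name not in self._factories:
                    raise KeyError(f"unknown detector {name!r}")
                if value:
                    selected[name] = self._factories[name]
                else:
                    selected.pop(name, None)
            elif isinstance(value, Mapping):  # table: class + options + enabled
                if not value.get("enabled", True):
                    selected.pop(name, None)
                    continue
                cls = resolve(value["class"])
                options = dict(value.get("options", {}))
                selected[name] = lambda cls=cls, options=options: cls(**options)
            else:
                raise TypeError(f"detector {name!r} must be a bool or a table")
        plugins = {name: factory() for name, factory in selected.items()}
        for name, plugin in plugins.items():
            if not callable(getattr(plugin, "is_refusal", None)):
                raise TypeError(f"detector {name!r} does not implement is_refusal")
        return plugins


registry = DetectorRegistry()
registry.register("default", lambda: RegexLeakDetector(r"I cannot help"))  # stand-in
registry.register("pii", lambda: RegexLeakDetector(r"\b\d{3}-\d{2}-\d{4}\b"), enabled=False)

# Roughly equivalent TOML (assumed layout):
#   [detectors]
#   pii = true
#   [detectors.infra_fingerprint]
#   class = "acme_detectors:RegexLeakDetector"
#   options = { pattern = 'corp-bastion-\d+' }
config = {
    "pii": True,
    "infra_fingerprint": {
        "class": "acme_detectors:RegexLeakDetector",
        "options": {"pattern": r"corp-bastion-\d+"},
    },
}
known = {"acme_detectors:RegexLeakDetector": RegexLeakDetector}  # stub resolver
plugins = registry.build_from_config(config, resolve=known.__getitem__)
```

With an empty config the sketch returns only the plugins registered as enabled by default, which mirrors the unchanged-defaults guarantee stated in this PR.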

Security rationale

Refusal and leak detection are the scanner's signal layer for OWASP LLM Top 10 findings -- LLM02 (Sensitive Information Disclosure) and LLM06 (Excessive Agency / sandbox break-out), with credential and key exposure mapping to CWE-200 and CWE-312. Detection coverage is environment-specific: the secrets, PII formats, and infrastructure-fingerprint strings that matter to one deployment differ from those that matter to another. Forcing teams to fork to add a signature is a coverage and supply-chain liability. Making detectors config-pluggable lets a security team extend coverage without forking, version it alongside the rest of their config, and keep custom signatures out of the upstream codebase. Defaults are unchanged, so the change is non-regressive: the leak detectors ship registered but disabled, and refusal_heuristic still runs only the marker and ML classifiers unless the leak detectors are explicitly enabled.

Testing / validation

Validated in a clean virtualenv against the cloned repo.

tests/unit/refusal_classifier/test_registry.py (18 tests): registry register/unregister/introspection, build_from_config defaults, bool enable/disable, custom plugin via class path + options propagation, custom-plugin disable, dotted/colon import-path resolution, and the KeyError / TypeError guards.
tests/unit/probe_actor/test_refusal_config.py (5 tests): build_refusal_manager({}) reproduces the legacy {default, ml_classifier} set; the module-level manager matches it; pii and sandbox_escape can be enabled via config and flag the intended positives (123-45-6789, /var/run/docker.sock) while a benign string ("how do I bake bread?") is not flagged; a custom detector loads by class path.
Existing test_refusal.py, test_pii_detector.py, test_sandbox_escape_detector.py still pass.

Result: 36 passed for the targeted suite (23 new + 13 existing). black --check and flake8 (max-line-length=160) clean on all touched files; the generated agentic_security.toml parses and exposes the [detectors] table.

Note: pre-existing test_fuzzer.py failures are an unrelated scikit-learn version-mismatch when unpickling the bundled model; they reproduce identically on an untouched checkout and are not affected by this change.

PIIDetector and SandboxEscapeDetector were wired directly in probe_actor/refusal.py and the refusal classifier manager was populated from a hardcoded list, so the only way to toggle a bundled detector or add an organization-specific signature was to patch the module.

Add a DetectorRegistry mapping plugin names to factories, assembled from an agentic_security.toml [detectors] section via build_from_config. Custom detectors load by import path ("pkg.module:ClassName"). refusal.py gains build_refusal_manager(config=None) reading the [detectors] table; all public symbols are preserved. Built-in leak detectors ship registered but disabled, so default refusal_heuristic behaviour is unchanged.

Closes msoedov#82

Signed-off-by: Devam Shah <devamshah91@gmail.com>

msoedov · 2026-06-23T07:11:33Z

@DevamShah thx a lot for the patch! LGTM

msoedov merged commit e6459a5 into msoedov:main Jun 23, 2026
1 of 2 checks passed

msoedov merged 1 commit into msoedov:main from DevamShah:config-pluggable-detectors
