
feat: config-pluggable refusal classifiers and leak detectors #321

Merged
msoedov merged 1 commit into msoedov:main from DevamShah:config-pluggable-detectors
Jun 23, 2026

@DevamShah

Contributor

Summary

Adds a detector registry and an agentic_security.toml [detectors] section so users can enable, disable, or register custom refusal-classifier and leak-detector plugins through config instead of editing source.

Closes #82

Problem / motivation

PIIDetector and SandboxEscapeDetector were instantiated and wired directly in probe_actor/refusal.py, and refusal_classifier_manager was populated with a hardcoded plugin list. The RefusalClassifierPlugin ABC already existed, but there was no supported way to:

  • toggle the bundled leak detectors per scan, or
  • drop in an organization-specific detector (e.g. infrastructure-fingerprint leak signatures) without forking.

Anyone needing a custom signature had to patch the module, which does not survive upgrades and is invisible to the rest of the pipeline. This is the gap #82 (a good first issue / help wanted) asks to close.

Change

  • New agentic_security/refusal_classifier/registry.py:
    • DetectorRegistry maps a plugin name to a zero-argument factory and assembles the enabled set from parsed config via build_from_config(...).
    • load_plugin_class("pkg.module:ClassName") (also accepts dotted form) resolves custom detectors by import path, with explicit ValueError / ImportError / TypeError on bad paths.
    • A module-level registry registers the built-in leak detectors pii and sandbox_escape (disabled by default, so refusal_heuristic behaviour is unchanged).
    • Built-ins are validated to implement is_refusal(response) before use.
  • probe_actor/refusal.py:
    • Registers default and ml_classifier on the shared registry (kept here so the trained model is not imported eagerly by the registry module).
    • New build_refusal_manager(config=None) reads the [detectors] table via settings_var("detectors", ...) and populates a RefusalClassifierManager. Public symbols (refusal_classifier_manager, pii_detector, sandbox_escape_detector, the heuristics, the ABC) are preserved.
  • config.py: the generated default agentic_security.toml now documents a [detectors] section with the four built-ins and a commented custom-plugin example.
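
The registry and import-path loader described above can be sketched roughly as follows. This is a minimal illustration, not the code in registry.py: the method signatures, the `enabled` default argument, the `class` / `options` / `enabled` key names, and the `MarkerDetector` stand-in are assumptions made for the example. It also validates every plugin for `is_refusal`, whereas the PR only states this for built-ins.

```python
import importlib
from collections.abc import Mapping
from typing import Any, Callable, Dict


def load_plugin_class(path: str) -> type:
    """Resolve 'pkg.module:ClassName' (or dotted 'pkg.module.ClassName') to a class."""
    if not isinstance(path, str) or not path.strip():
        raise ValueError("import path must be a non-empty string")
    sep = ":" if ":" in path else "."
    module_name, _, attr = path.rpartition(sep)
    if not module_name or not attr:
        raise ValueError(f"cannot split {path!r} into module and class")
    try:
        # A missing module raises ImportError from import_module itself.
        obj = getattr(importlib.import_module(module_name), attr)
    except AttributeError as exc:
        raise ImportError(f"{module_name!r} has no attribute {attr!r}") from exc
    if not isinstance(obj, type):
        raise TypeError(f"{path!r} does not resolve to a class")
    return obj


class DetectorRegistry:
    """Maps a plugin name to a zero-argument factory (assumed shape)."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], Any]] = {}
        self._enabled_by_default: Dict[str, bool] = {}

    def register(self, name: str, factory: Callable[[], Any], enabled: bool = False) -> None:
        self._factories[name] = factory
        self._enabled_by_default[name] = enabled

    def build_from_config(self, config: Mapping) -> Dict[str, Any]:
        """Return {name: plugin instance} for every enabled detector."""
        wanted: Dict[str, Any] = dict(self._enabled_by_default)
        wanted.update(config)  # config entries override the registered defaults
        plugins: Dict[str, Any] = {}
        for name, value in wanted.items():
            if isinstance(value, bool):
                if name not in self._factories:
                    raise KeyError(f"unknown detector {name!r}")
                if not value:
                    continue
                plugin = self._factories[name]()
            elif isinstance(value, Mapping):
                if not value.get("enabled", True):
                    continue  # custom plugin present in config but switched off
                plugin = load_plugin_class(value["class"])(**value.get("options", {}))
            else:
                raise TypeError(f"{name!r}: expected bool or table, got {type(value).__name__}")
            if not callable(getattr(plugin, "is_refusal", None)):
                raise TypeError(f"{name!r} does not implement is_refusal(response)")
            plugins[name] = plugin
        return plugins


class MarkerDetector:
    """Hypothetical stand-in for a bundled leak detector."""

    def is_refusal(self, response: str) -> bool:
        return "docker.sock" in response


registry = DetectorRegistry()
registry.register("sandbox_escape", MarkerDetector)  # registered, disabled by default

off = registry.build_from_config({})                       # nothing enabled
on = registry.build_from_config({"sandbox_escape": True})  # toggled on via config
```

Keeping the factories zero-argument means a disabled detector is never constructed, which is one way to avoid importing heavy dependencies for plugins that are switched off.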

A detector value may be a bool (toggle a registered plugin) or a table (class import path + optional options kwargs + optional enabled).

Security rationale

Refusal and leak detection are the scanner's signal layer for OWASP LLM Top 10 findings: LLM02 (Sensitive Information Disclosure) and LLM06 (Excessive Agency / sandbox break-out), with credential and key exposure mapping to CWE-200 and CWE-312. Detection coverage is environment-specific: the secrets, PII formats, and infrastructure-fingerprint strings that matter to one deployment differ from those that matter to another. Forcing teams to fork to add a signature is a coverage and supply-chain liability. Making detectors config-pluggable lets a security team extend coverage without forking, version it alongside the rest of their config, and keep custom signatures out of the upstream codebase. Defaults are unchanged, so the change is non-regressive: the leak detectors ship registered but disabled, and refusal_heuristic still runs only the marker and ML classifiers unless the leak detectors are explicitly enabled.

Testing / validation

Validated in a clean virtualenv against the cloned repo.

  • tests/unit/refusal_classifier/test_registry.py (18 tests): registry register/unregister/introspection, build_from_config defaults, bool enable/disable, custom plugin via class path + options propagation, custom-plugin disable, dotted/colon import-path resolution, and the KeyError / TypeError guards.
  • tests/unit/probe_actor/test_refusal_config.py (5 tests): build_refusal_manager({}) reproduces the legacy {default, ml_classifier} set; the module-level manager matches it; pii and sandbox_escape can be enabled via config and flag the intended positives (123-45-6789, /var/run/docker.sock) while a benign string ("how do I bake bread?") is not flagged; a custom detector loads by class path.
  • Existing test_refusal.py, test_pii_detector.py, test_sandbox_escape_detector.py still pass.
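
The positive/negative checks described above follow a simple pattern, sketched here against a hypothetical regex-based detector. `PatternLeakDetector` and its patterns are illustrative stand-ins, not the bundled PIIDetector or SandboxEscapeDetector; only the `is_refusal(response)` method name and the sample strings come from this PR.

```python
import re


class PatternLeakDetector:
    """Hypothetical custom detector: flags a response matching any pattern."""

    def __init__(self, patterns):
        self._patterns = [re.compile(p) for p in patterns]

    def is_refusal(self, response: str) -> bool:
        return any(p.search(response) for p in self._patterns)


# Patterns chosen only to cover the sample strings used in the tests.
detector = PatternLeakDetector([
    r"\b\d{3}-\d{2}-\d{4}\b",   # SSN-shaped number
    r"/var/run/docker\.sock",   # Docker socket path
])


def test_flags_intended_positives():
    assert detector.is_refusal("SSN on file: 123-45-6789")
    assert detector.is_refusal("mounting /var/run/docker.sock into the container")


def test_ignores_benign_text():
    assert not detector.is_refusal("how do I bake bread?")
```

Pairing each positive with a benign negative guards against a detector that trivially flags everything.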

Result: 36 passed for the targeted suite (23 new + 13 existing). black --check and flake8 (max-line-length=160) clean on all touched files; the generated agentic_security.toml parses and exposes the [detectors] table.

Note: the pre-existing test_fuzzer.py failures come from an unrelated scikit-learn version mismatch when unpickling the bundled model; they reproduce identically on an untouched checkout and are not affected by this change.

PIIDetector and SandboxEscapeDetector were wired directly in
probe_actor/refusal.py and the refusal classifier manager was populated from
a hardcoded list, so the only way to toggle a bundled detector or add an
organization-specific signature was to patch the module.

Add a DetectorRegistry mapping plugin names to factories, assembled from an
agentic_security.toml [detectors] section via build_from_config. Custom
detectors load by import path ("pkg.module:ClassName"). refusal.py gains
build_refusal_manager(config=None) reading the [detectors] table; all public
symbols are preserved. Built-in leak detectors ship registered but disabled,
so default refusal_heuristic behaviour is unchanged.

Closes msoedov#82

Signed-off-by: Devam Shah <devamshah91@gmail.com>
@msoedov

msoedov commented Jun 23, 2026

Owner

@DevamShah thx a lot for the patch! LGTM

@msoedov msoedov merged commit e6459a5 into msoedov:main Jun 23, 2026
1 of 2 checks passed

Development

Successfully merging this pull request may close these issues.

Enable custom refusal classifiers and leak detectors

2 participants