feat: config-pluggable refusal classifiers and leak detectors #321
Merged
PIIDetector and SandboxEscapeDetector were wired directly in
probe_actor/refusal.py and the refusal classifier manager was populated from
a hardcoded list, so the only way to toggle a bundled detector or add an
organization-specific signature was to patch the module.
Add a DetectorRegistry mapping plugin names to factories, assembled from an
agentic_security.toml [detectors] section via build_from_config. Custom
detectors load by import path ("pkg.module:ClassName"). refusal.py gains
build_refusal_manager(config=None) reading the [detectors] table; all public
symbols are preserved. Built-in leak detectors ship registered but disabled,
so default refusal_heuristic behaviour is unchanged.
Closes msoedov#82
Signed-off-by: Devam Shah <devamshah91@gmail.com>
Owner
@DevamShah thx a lot for the patch! LGTM
Summary
Adds a detector registry and an agentic_security.toml [detectors] section so users can enable, disable, or register custom refusal-classifier and leak-detector plugins through config instead of editing source.
Closes #82
Problem / motivation
PIIDetector and SandboxEscapeDetector were instantiated and wired directly in probe_actor/refusal.py, and refusal_classifier_manager was populated with a hardcoded plugin list. The RefusalClassifierPlugin ABC already existed, but there was no supported way to enable, disable, or register a custom detector through config.
Anyone needing a custom signature had to patch the module, which does not survive upgrades and is invisible to the rest of the pipeline. This is the gap #82 (labelled good first issue / help wanted) asks to close.
Change
agentic_security/refusal_classifier/registry.py:
- DetectorRegistry maps a plugin name to a zero-argument factory and assembles the enabled set from parsed config via build_from_config(...).
- load_plugin_class("pkg.module:ClassName") (also accepts dotted form) resolves custom detectors by import path, with explicit ValueError/ImportError/TypeError on bad paths.
- The shared registry (registry) registers the built-in leak detectors pii and sandbox_escape (disabled by default, so refusal_heuristic behaviour is unchanged).
- Plugins are checked for is_refusal(response) before use.
probe_actor/refusal.py:
- Registers default and ml_classifier on the shared registry (kept here so the trained model is not imported eagerly by the registry module).
- build_refusal_manager(config=None) reads the [detectors] table via settings_var("detectors", ...) and populates a RefusalClassifierManager. Public symbols (refusal_classifier_manager, pii_detector, sandbox_escape_detector, the heuristics, the ABC) are preserved.
config.py:
- The generated default agentic_security.toml now documents a [detectors] section with the four built-ins and a commented custom-plugin example.
A detector value may be a bool (toggle a registered plugin) or a table (class import path + optional options kwargs + optional enabled).
Security rationale
Refusal and leak detection are the scanner's signal layer for OWASP LLM Top 10 findings: LLM02 (Sensitive Information Disclosure) and LLM06 (Excessive Agency / sandbox break-out), with credential and key exposure mapping to CWE-200 and CWE-312. Detection coverage is environment-specific: the secrets, PII formats, and infrastructure-fingerprint strings that matter to one deployment are not the same as another's. Forcing teams to fork to add a signature is a coverage and supply-chain liability. Making detectors config-pluggable lets a security team extend coverage in-tree, version it alongside the rest of their config, and keep custom signatures out of the codebase. Defaults are unchanged, so the change is non-regressive: the leak detectors ship registered but disabled, and refusal_heuristic still runs only the marker and ML classifiers unless explicitly enabled.
Testing / validation
Validated in a clean virtualenv against the cloned repo.
- tests/unit/refusal_classifier/test_registry.py (18 tests): registry register/unregister/introspection, build_from_config defaults, bool enable/disable, custom plugin via class path + options propagation, custom-plugin disable, dotted/colon import-path resolution, and the KeyError/TypeError guards.
- tests/unit/probe_actor/test_refusal_config.py (5 tests): build_refusal_manager({}) reproduces the legacy {default, ml_classifier} set; the module-level manager matches it; pii and sandbox_escape can be enabled via config and flag the intended positives (123-45-6789, /var/run/docker.sock) while a benign string ("how do I bake bread?") is not flagged; a custom detector loads by class path.
- test_refusal.py, test_pii_detector.py, and test_sandbox_escape_detector.py still pass.
Result: 36 passed for the targeted suite (23 new + 13 existing).
black --check and flake8 (max-line-length=160) are clean on all touched files; the generated agentic_security.toml parses and exposes the [detectors] table.
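For readers who want a feel for the registry mechanics described under Change, here is a minimal, self-contained sketch. It is not the code merged in this PR: the register() signature, the enabled_by_default flag, the KeywordDetector stub, and the acme_detectors module are all assumptions made for illustration. Only the names DetectorRegistry, load_plugin_class, build_from_config, and is_refusal come from the description.

```python
import importlib
import sys
import types
from typing import Any, Callable, Dict, Mapping, Tuple


def load_plugin_class(path: str) -> type:
    """Resolve "pkg.module:ClassName" or "pkg.module.ClassName" to a class."""
    module_name, sep, attr = path.partition(":")
    if not sep:  # dotted form: split on the last dot instead
        module_name, _, attr = path.rpartition(".")
    if not module_name or not attr:
        raise ValueError(f"invalid plugin path: {path!r}")
    module = importlib.import_module(module_name)  # ImportError if missing
    try:
        obj = getattr(module, attr)
    except AttributeError as exc:
        raise ImportError(f"{module_name!r} has no attribute {attr!r}") from exc
    if not isinstance(obj, type):
        raise TypeError(f"{path!r} does not name a class")
    return obj


class DetectorRegistry:
    """Illustrative registry: plugin name -> zero-argument factory."""

    def __init__(self) -> None:
        self._factories: Dict[str, Tuple[Callable[[], Any], bool]] = {}

    def register(self, name: str, factory: Callable[[], Any],
                 enabled_by_default: bool = True) -> None:
        self._factories[name] = (factory, enabled_by_default)

    def build_from_config(self, config: Mapping[str, Any]) -> Dict[str, Any]:
        enabled: Dict[str, Any] = {}
        names = list(self._factories) + [n for n in config if n not in self._factories]
        for name in names:
            value = config.get(name)
            if value is None:  # not mentioned in config: use the default
                value = self._factories[name][1]
            if isinstance(value, bool):  # bool form toggles a registered plugin
                if name not in self._factories:
                    raise KeyError(f"unknown detector: {name!r}")
                if not value:
                    continue
                detector = self._factories[name][0]()
            else:  # table form: class path + optional options / enabled
                if not value.get("enabled", True):
                    continue
                cls = load_plugin_class(value["class"])
                detector = cls(**value.get("options", {}))
            if not callable(getattr(detector, "is_refusal", None)):
                raise TypeError(f"{name!r} lacks is_refusal(response)")
            enabled[name] = detector
        return enabled


class KeywordDetector:
    """Hypothetical stub detector used only for this demo."""

    def __init__(self, keywords=("I cannot",)) -> None:
        self.keywords = tuple(keywords)

    def is_refusal(self, response: str) -> bool:
        return any(k in response for k in self.keywords)


# Stand-in for an installed package so the class path below is importable.
demo_module = types.ModuleType("acme_detectors")
demo_module.KeywordDetector = KeywordDetector
sys.modules["acme_detectors"] = demo_module

registry = DetectorRegistry()
registry.register("default", lambda: KeywordDetector())
registry.register("pii", lambda: KeywordDetector(["123-45-6789"]),
                  enabled_by_default=False)

legacy = registry.build_from_config({})  # defaults only
custom = registry.build_from_config({
    "pii": True,
    "acme": {"class": "acme_detectors:KeywordDetector",
             "options": {"keywords": ["/var/run/docker.sock"]}},
})
print(sorted(legacy), sorted(custom))
```

An empty config yields only the detectors that are enabled by default, which mirrors the non-regressive behaviour the test suite checks with build_refusal_manager({}).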