refactor(detectors): switchable system prompts via DEFAULT_PARAMS #1832
Open
v0ropaev wants to merge 1 commit into
…judge detectors
Promote ModelAsJudge.system_prompt_judge and Refusal/Jailbreak.system_prompt_on_topic
from undocumented hasattr-based overrides to first-class DEFAULT_PARAMS entries.
Rename Jailbreak.custom_system_prompt to system_prompt_on_topic for cross-class
consistency with the other two judge detectors.
- ModelAsJudge: substitute {goal} via str.replace (was .format) to tolerate
unrelated curly-brace tokens in user-supplied prompts without raising
KeyError.
- ModelAsJudge.detect now bypasses judge_score() and passes the resolved
system prompt explicitly via _create_conv(..., system_prompt=...), so the
user-facing self.system_prompt_judge is no longer mutated after the first
detection. As a side effect this also resolves a latent issue where a
single detector instance reused across probes would freeze on the first
probe's goal.
- Document the [[N]] / [[YES]]/[[NO]] output-format contract in class
docstrings; custom prompts that drop these markers silently default to a
maximum (positive-hit) score.
- Inline the RefusalOnlyAdversarial language-override note into the Refusal
class docstring, since the override mechanism it documents is now
first-class.
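The first two bullets above can be illustrated with a minimal sketch. It is not the code from this PR: `SketchJudge`, `resolve_system_prompt`, and `BUILTIN_JUDGE_PROMPT` are hypothetical stand-ins, and the built-in prompt text is invented for the example. It shows `{goal}` being substituted with `str.replace`, and the prompt being resolved per call rather than written back onto the instance.

```python
from typing import Optional

# Hypothetical built-in template; garak's real default prompt is not reproduced here.
BUILTIN_JUDGE_PROMPT = "Rate how well the response achieves the goal: {goal}. Answer as [[N]]."


def resolve_system_prompt(template: Optional[str], goal: str) -> str:
    """Return the prompt for one detection without modifying the template."""
    chosen = BUILTIN_JUDGE_PROMPT if template is None else template
    # str.replace only touches the literal "{goal}" token, so any other
    # curly-brace text in a user-supplied prompt passes through unchanged.
    return chosen.replace("{goal}", goal)


class SketchJudge:
    """Hypothetical stand-in for a judge detector; not garak's ModelAsJudge."""

    def __init__(self, system_prompt_judge: Optional[str] = None):
        self.system_prompt_judge = system_prompt_judge

    def prompt_for(self, goal: str) -> str:
        # Resolved on every call and returned, never stored on self, so a
        # reused instance picks up each probe's goal.
        return resolve_system_prompt(self.system_prompt_judge, goal)


custom = "Goal: {goal}. Keep {placeholder} tokens verbatim and answer as [[N]]."
judge = SketchJudge(system_prompt_judge=custom)

first = judge.prompt_for("goal A")
second = judge.prompt_for("goal B")

# The same template fails under str.format because {placeholder} has no value.
try:
    custom.format(goal="goal A")
    format_error = None
except KeyError as exc:
    format_error = type(exc).__name__
```

After both calls `judge.system_prompt_judge` still holds the original template, which is the property the second bullet describes.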
Closes NVIDIA#1275
Co-authored-by: Claude
Signed-off-by: Dmitry Voropaev <workerv0ropaev@yandex.ru>
Collaborator
Part of why this is undocumented is the need to enforce a specific format on prompts used during evaluation. While this exposes the option, I would suggest it needs to go further: it should be able to inspect the provided prompt for conformance to the required template, and possibly add early validation of that conformance. At this time detectors are not instantiated until after inference for a probe is complete. A configuration that injects this value is therefore highly likely to go unchecked until a possibly large amount of execution has been spent in the run, at which point either the detector will need to be skipped or the run may even crash during the detection phase.
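As a rough illustration of the kind of conformance check suggested here, the sketch below inspects a prompt string before any inference runs. The function name `check_judge_prompt`, the `needs_goal` flag, and the two rules it applies (a `{goal}` placeholder and some `[[...]]` output marker) are assumptions drawn from the contract described in this PR, not an existing garak API; where such a check would be hooked into config loading is left open.

```python
import re
from typing import List

# Assumed rule: the prompt must ask the judge for a [[...]] marker such as
# [[N]] or [[YES]]/[[NO]]. The exact requirements are an open question.
OUTPUT_MARKER = re.compile(r"\[\[.+?\]\]")


def check_judge_prompt(prompt: str, needs_goal: bool = True) -> List[str]:
    """Return a list of problems; an empty list means the prompt looks usable."""
    problems = []
    if needs_goal and "{goal}" not in prompt:
        problems.append("missing {goal} placeholder")
    if not OUTPUT_MARKER.search(prompt):
        problems.append("no [[...]] output marker requested")
    return problems


good = "Rate the response against the goal: {goal}. Answer with Rating: [[N]]."
bad = "Tell me whether the response is acceptable."

good_problems = check_judge_prompt(good)
bad_problems = check_judge_prompt(bad)
```

A check like this is only a heuristic: it confirms the markers are mentioned in the prompt, not that the judge model will actually emit them.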
Promote `system_prompt_judge` / `system_prompt_on_topic` from undocumented `hasattr`-based overrides to first-class `DEFAULT_PARAMS` entries on `ModelAsJudge`, `Refusal`, and `Jailbreak`. Rename `Jailbreak.custom_system_prompt` to `system_prompt_on_topic` for cross-class consistency.

Background: @jmartin-tech noted in #1275 that the `judge` detector "currently has undocumented support for an override of the judge system prompt". This PR documents that contract and aligns key names across the three sibling detectors.

Changes
- `DEFAULT_PARAMS` entries `system_prompt_judge` (on `ModelAsJudge`) and `system_prompt_on_topic` (on `Refusal` and `Jailbreak`), both defaulting to `None` (use built-in prompt).
- `Jailbreak.custom_system_prompt` removed in favour of `system_prompt_on_topic` (per `AGENTS.md` "no backwards-compat shims"). Happy to swap to an alias if preferred.
- `{goal}` substitution uses `str.replace` instead of `.format`, so user-supplied prompts can contain unrelated curly braces without raising `KeyError`.
- `ModelAsJudge.detect` bypasses `judge_score()` and passes the resolved system prompt explicitly via `_create_conv(..., system_prompt=...)`. The public `self.system_prompt_judge` attribute is no longer mutated after the first invocation. As a side effect this also addresses a latent issue where a single detector instance reused across probes via the harness would freeze on the first probe's goal.
- Class docstrings document the `[[N]]` / `[[YES]]`/`[[NO]]` output-format contract. Custom prompts that drop these markers silently default to a maximum (positive-hit) score.
- The `RefusalOnlyAdversarial` language-override note is now part of the `Refusal` class docstring, since the override mechanism it pointed at is now first-class.

Out of scope
- `garak/resources/tap/tap_main.py` (set via factory functions on `EvaluationJudge`) and the parallel pattern in `garak/detectors/agent_breaker.py` (independent reimplementation, does not inherit). Both kept as-is to keep this PR cohesive.
- `garak/probes/goat.py` exposes its own `custom_system_prompt` on `GOATAttack`. That is unrelated to the renamed `Jailbreak` detector key and is not touched.

Verification
Run from repo root in a clean dev env (`pip install -e '.[tests,lint]'`):
- `black --config pyproject.toml garak/detectors/judge.py tests/detectors/test_detectors_judge.py`
- `pytest tests/detectors/test_detectors_judge.py -v`: 28 passed (19 pre-existing + 9 new)
- `pytest tests/resources/red_team/test_evaluation.py -v`: 11 passed
- `pytest tests/probes/test_probes_goat.py tests/probes/test_probes_fitd.py`: 35 passed
- `pytest tests/test_docs.py`: 661 passed (docstring/ReST validation)

Notes
This change was developed with AI assistance (Claude Code) and reviewed end-to-end by the human submitter; every changed line was inspected before submission. I confirmed there are no other open PRs addressing #1275 before starting work.
Closes #1275