refactor(detectors): switchable system prompts via DEFAULT_PARAMS by v0ropaev · Pull Request #1832 · NVIDIA/garak

v0ropaev · 2026-06-03T19:03:45Z

Promote system_prompt_judge / system_prompt_on_topic from undocumented hasattr-based overrides to first-class DEFAULT_PARAMS entries on ModelAsJudge, Refusal, and Jailbreak. Rename Jailbreak.custom_system_prompt to system_prompt_on_topic for cross-class consistency.

Background: @jmartin-tech noted in #1275 that the judge detector "currently has undocumented support for an override of the judge system prompt". This PR documents that contract and aligns key names across the three sibling detectors.

Changes

New DEFAULT_PARAMS entries: system_prompt_judge (on ModelAsJudge) and system_prompt_on_topic (on Refusal and Jailbreak). Both default to None, which means the built-in prompt is used.
Jailbreak.custom_system_prompt removed in favour of system_prompt_on_topic (per AGENTS.md "no backwards-compat shims"). Happy to swap to an alias if preferred.
{goal} substitution uses str.replace instead of .format, so user-supplied prompts can contain unrelated curly braces without raising KeyError.
ModelAsJudge.detect bypasses judge_score() and passes the resolved system prompt explicitly via _create_conv(..., system_prompt=...). The public self.system_prompt_judge attribute is no longer mutated after the first invocation. As a side effect this also addresses a latent issue where a single detector instance reused across probes via the harness would freeze on the first probe's goal.
Class docstrings updated to describe the override mechanism and the [[N]] / [[YES]] / [[NO]] output-format contract. Custom prompts that drop these markers silently default to a maximum (positive-hit) score.
The language-override note from RefusalOnlyAdversarial is now part of the Refusal class docstring, since the override mechanism it pointed at is now first-class.
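
The substitution and no-mutation points in the list above can be illustrated with a minimal sketch. `JudgeSketch`, `BUILTIN_JUDGE_PROMPT`, and `resolve_system_prompt` are hypothetical names used only for illustration, not garak's actual classes or prompt text. The sketch assumes only what the description states: a `None` override falls back to the built-in prompt, `{goal}` is filled with `str.replace`, and the stored attribute is never overwritten.

```python
from typing import Optional

# Placeholder text for illustration; NOT garak's real built-in judge prompt.
BUILTIN_JUDGE_PROMPT = "Judge whether the response achieves: {goal}. Answer 'Rating: [[N]]'."


class JudgeSketch:
    """Hypothetical stand-in for a judge detector (not garak's ModelAsJudge)."""

    def __init__(self, system_prompt_judge: Optional[str] = None) -> None:
        # None means "use the built-in prompt", mirroring the new default.
        self.system_prompt_judge = system_prompt_judge

    def resolve_system_prompt(self, goal: str) -> str:
        """Return the prompt for one call without touching instance state."""
        template = (
            self.system_prompt_judge
            if self.system_prompt_judge is not None
            else BUILTIN_JUDGE_PROMPT
        )
        # str.replace only touches the literal "{goal}" token, so other
        # curly braces in a user-supplied prompt pass through unchanged.
        return template.replace("{goal}", goal)


custom = 'Goal: {goal}. Context may contain JSON such as {"k": 1}. Answer [[N]].'
judge = JudgeSketch(system_prompt_judge=custom)

# Two calls with different goals on the same instance.
first = judge.resolve_system_prompt("goal of probe A")
second = judge.resolve_system_prompt("goal of probe B")

# The same template fed to str.format trips over the unrelated braces.
try:
    custom.format(goal="goal of probe A")
    format_failed = False
except KeyError:
    format_failed = True
```

Because the resolved prompt is a return value rather than a write to `self.system_prompt_judge`, the second call sees the untouched template and picks up the second goal, which is the reuse case the description says previously froze on the first probe's goal.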

Out of scope

The same attribute names are used in garak/resources/tap/tap_main.py (set via factory functions on EvaluationJudge), and a parallel pattern exists in garak/detectors/agent_breaker.py (an independent reimplementation that does not inherit). Both are left as-is to keep this PR cohesive.
garak/probes/goat.py exposes its own custom_system_prompt on GOATAttack. That is unrelated to the renamed Jailbreak detector key and is not touched.

Verification

Run from repo root in a clean dev env (pip install -e '.[tests,lint]'):

black --config pyproject.toml garak/detectors/judge.py tests/detectors/test_detectors_judge.py
pytest tests/detectors/test_detectors_judge.py -v — 28 passed (19 pre-existing + 9 new)
pytest tests/resources/red_team/test_evaluation.py -v — 11 passed
pytest tests/probes/test_probes_goat.py tests/probes/test_probes_fitd.py — 35 passed
pytest tests/test_docs.py — 661 passed (docstring/ReST validation)

Notes

This change was developed with AI assistance (Claude Code) and reviewed end-to-end by the human submitter; every changed line was inspected before submission. I confirmed there are no other open PRs addressing #1275 before starting work.

Closes #1275

…judge detectors

Promote ModelAsJudge.system_prompt_judge and Refusal/Jailbreak.system_prompt_on_topic from undocumented hasattr-based overrides to first-class DEFAULT_PARAMS entries. Rename Jailbreak.custom_system_prompt to system_prompt_on_topic for cross-class consistency with the other two judge detectors.

- ModelAsJudge: substitute {goal} via str.replace (was .format) to tolerate unrelated curly-brace tokens in user-supplied prompts without raising KeyError.
- ModelAsJudge.detect now bypasses judge_score() and passes the resolved system prompt explicitly via _create_conv(..., system_prompt=...), so the user-facing self.system_prompt_judge is no longer mutated after the first detection. As a side effect this also resolves a latent issue where a single detector instance reused across probes would freeze on the first probe's goal.
- Document the [[N]] / [[YES]] / [[NO]] output-format contract in class docstrings; custom prompts that drop these markers silently default to a maximum (positive-hit) score.
- Inline the RefusalOnlyAdversarial language-override note into the Refusal class docstring, since the override mechanism it documents is now first-class.

Closes NVIDIA#1275

Co-authored-by: Claude
Signed-off-by: Dmitry Voropaev <workerv0ropaev@yandex.ru>

jmartin-tech · 2026-06-03T19:31:56Z

Part of why this is undocumented is a need to enforce a specific format on prompts used during evaluation. I would suggest that, while this exposes the option, it needs to go further: it should be able to inspect the provided prompt for conformance to the required template, and possibly add early validation of that conformance.

At this time, detectors are not instantiated until after inference for a probe is complete. A configuration that attempts to inject this value is therefore highly likely not to detect an incorrect format until a possibly large amount of execution has been spent in the run, at which point either the detector will need to be skipped or the run may even crash during the detection phase.
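
One way the conformance check suggested here might look is sketched below. This is an illustration under stated assumptions, not an existing garak API: `check_prompt_template`, its parameters, and the `DOUBLE_BRACKET` pattern are all hypothetical. It only tests for substrings in the template, so it can flag a prompt that never mentions the markers, but it cannot prove the judge model will actually emit them.

```python
import re
from typing import List, Sequence

# Matches any double-bracket token such as [[N]], [[rating]] or [[5]].
DOUBLE_BRACKET = re.compile(r"\[\[[^\[\]]+\]\]")


def check_prompt_template(
    template: str,
    literal_markers: Sequence[str] = (),
    needs_rating_marker: bool = False,
    needs_goal: bool = False,
) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: List[str] = []
    if needs_goal and "{goal}" not in template:
        problems.append("missing {goal} placeholder")
    if needs_rating_marker and not DOUBLE_BRACKET.search(template):
        problems.append("no [[...]] rating marker described")
    for marker in literal_markers:
        if marker not in template:
            problems.append(f"missing marker {marker}")
    return problems


# A rating-style prompt that forgets to describe the bracket format.
rating_problems = check_prompt_template(
    "Rate how well the response achieves {goal} from 1 to 10.",
    needs_rating_marker=True,
    needs_goal=True,
)

# An on-topic prompt that mentions only one of the two verdict markers.
on_topic_problems = check_prompt_template(
    "Answer [[YES]] if the response is a refusal.",
    literal_markers=("[[YES]]", "[[NO]]"),
)
```

A pure function like this could in principle run when the configuration is loaded rather than when the detector is instantiated, so a malformed prompt would be reported before any inference is spent. Where such a hook would live in garak is left open here.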

v0ropaev wants to merge 1 commit into NVIDIA:main from v0ropaev:refactor/switchable-judge-prompts
