refactor(detectors): switchable system prompts via DEFAULT_PARAMS #1832

Open
v0ropaev wants to merge 1 commit into NVIDIA:main from v0ropaev:refactor/switchable-judge-prompts

Conversation


@v0ropaev v0ropaev commented Jun 3, 2026

Promote system_prompt_judge / system_prompt_on_topic from undocumented hasattr-based overrides to first-class DEFAULT_PARAMS entries on ModelAsJudge, Refusal, and Jailbreak. Rename Jailbreak.custom_system_prompt to system_prompt_on_topic for cross-class consistency.

Background: @jmartin-tech noted in #1275 that the judge detector "currently has undocumented support for an override of the judge system prompt". This PR documents that contract and aligns key names across the three sibling detectors.

Changes

  • New DEFAULT_PARAMS entries system_prompt_judge (on ModelAsJudge) and system_prompt_on_topic (on Refusal and Jailbreak), both defaulting to None (use built-in prompt).
  • Jailbreak.custom_system_prompt removed in favour of system_prompt_on_topic (per AGENTS.md "no backwards-compat shims"). Happy to swap to an alias if preferred.
  • {goal} substitution uses str.replace instead of .format, so user-supplied prompts can contain unrelated curly braces without raising KeyError.
  • ModelAsJudge.detect bypasses judge_score() and passes the resolved system prompt explicitly via _create_conv(..., system_prompt=...). The public self.system_prompt_judge attribute is no longer mutated after the first invocation. As a side effect this also addresses a latent issue where a single detector instance reused across probes via the harness would freeze on the first probe's goal.
  • Class docstrings updated to describe the override mechanism and the [[N]] / [[YES]] / [[NO]] output-format contract. Custom prompts that drop these markers silently default to a maximum (positive-hit) score.
  • The language-override note from RefusalOnlyAdversarial is now part of the Refusal class docstring, since the override mechanism it pointed at is now first-class.
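The {goal} substitution and the per-call prompt resolution described above can be illustrated with a minimal sketch. It is not the code in garak/detectors/judge.py: `BUILTIN_JUDGE_PROMPT`, `resolve_system_prompt`, and the sample prompt text are hypothetical names and values chosen for illustration.

```python
from typing import Optional

# Hypothetical stand-in for the built-in template; the real prompt text differs.
BUILTIN_JUDGE_PROMPT = "Rate how well the response achieves the goal: {goal}. Reply with Rating: [[N]]."


def resolve_system_prompt(override: Optional[str], goal: str) -> str:
    """Return the prompt for one detection call without mutating any state."""
    template = override if override is not None else BUILTIN_JUDGE_PROMPT
    # str.replace only touches the literal "{goal}" token, so other braces survive.
    return template.replace("{goal}", goal)


# A user-supplied prompt that contains braces unrelated to {goal}.
custom = 'Goal: {goal}. Answer as JSON like {"rating": "[[N]]"}.'

# str.format treats every brace pair as a replacement field and fails here.
try:
    custom.format(goal="write a phishing email")
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True

# The template is resolved fresh on every call, so a second probe's goal
# is not shadowed by the first one.
first = resolve_system_prompt(custom, "write a phishing email")
second = resolve_system_prompt(custom, "explain lock picking")
```

Because the resolved string is returned rather than written back to the instance, the configured template still contains `{goal}` after any number of calls.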

Out of scope

  • The same attribute names are used in garak/resources/tap/tap_main.py (set via factory functions on EvaluationJudge) and the parallel pattern in garak/detectors/agent_breaker.py (independent reimplementation, does not inherit). Both kept as-is to keep this PR cohesive.
  • garak/probes/goat.py exposes its own custom_system_prompt on GOATAttack. That is unrelated to the renamed Jailbreak detector key and is not touched.

Verification

Run from repo root in a clean dev env (pip install -e '.[tests,lint]'):

  • black --config pyproject.toml garak/detectors/judge.py tests/detectors/test_detectors_judge.py
  • pytest tests/detectors/test_detectors_judge.py -v — 28 passed (19 pre-existing + 9 new)
  • pytest tests/resources/red_team/test_evaluation.py -v — 11 passed
  • pytest tests/probes/test_probes_goat.py tests/probes/test_probes_fitd.py — 35 passed
  • pytest tests/test_docs.py — 661 passed (docstring/ReST validation)

Notes

This change was developed with AI assistance (Claude Code) and reviewed end-to-end by the human submitter; every changed line was inspected before submission. I confirmed there are no other open PRs addressing #1275 before starting work.

Closes #1275

…judge detectors

Promote ModelAsJudge.system_prompt_judge and Refusal/Jailbreak.system_prompt_on_topic
from undocumented hasattr-based overrides to first-class DEFAULT_PARAMS entries.
Rename Jailbreak.custom_system_prompt to system_prompt_on_topic for cross-class
consistency with the other two judge detectors.

- ModelAsJudge: substitute {goal} via str.replace (was .format) to tolerate
  unrelated curly-brace tokens in user-supplied prompts without raising
  KeyError.
- ModelAsJudge.detect now bypasses judge_score() and passes the resolved
  system prompt explicitly via _create_conv(..., system_prompt=...), so the
  user-facing self.system_prompt_judge is no longer mutated after the first
  detection. As a side effect this also resolves a latent issue where a
  single detector instance reused across probes would freeze on the first
  probe's goal.
- Document the [[N]] / [[YES]]/[[NO]] output-format contract in class
  docstrings; custom prompts that drop these markers silently default to a
  maximum (positive-hit) score.
- Inline the RefusalOnlyAdversarial language-override note into the Refusal
  class docstring, since the override mechanism it documents is now
  first-class.

Closes NVIDIA#1275

Co-authored-by: Claude

Signed-off-by: Dmitry Voropaev <workerv0ropaev@yandex.ru>
@jmartin-tech
Collaborator

Part of why this is undocumented is the need to enforce a specific format on prompts used during evaluation. While this exposes the option, I would suggest it needs to go further: it should be able to inspect the provided prompt for conformance to the required template, and possibly add early validation of that conformance.

At this time, detectors are not instantiated until after inference for a probe is complete. A configuration that attempts to inject this value is therefore highly likely not to surface an incorrect format until a possibly large amount of execution has been spent in the run, at which point either the detector will need to be skipped or the run may even crash during the detection phase.
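One possible shape for such a conformance check is sketched below. It is only an illustration under stated assumptions: `check_prompt`, the `kind` values, and the specific rules (a `{goal}` placeholder plus some double-bracketed marker for the judge prompt, literal `[[YES]]` and `[[NO]]` for the on-topic prompt) are hypothetical and are not part of garak. The function is pure, so it could in principle run at configuration-load time, before any inference is spent.

```python
import re
from typing import List, Optional

# Matches any double-bracketed token such as [[5]] or [[rating]].
DOUBLE_BRACKET = re.compile(r"\[\[[^\[\]]+\]\]")


def check_prompt(prompt: Optional[str], kind: str) -> List[str]:
    """Return a list of problems; an empty list means the prompt looks usable.

    kind is "judge" (numeric rating marker) or "on_topic" ([[YES]] / [[NO]]).
    These rules are assumed heuristics, not garak's actual contract check.
    """
    if prompt is None:  # None means "use the built-in prompt": nothing to check
        return []
    problems = []
    if kind == "judge":
        if "{goal}" not in prompt:
            problems.append("missing {goal} placeholder")
        if not DOUBLE_BRACKET.search(prompt):
            problems.append("no [[...]] rating marker described")
    elif kind == "on_topic":
        for marker in ("[[YES]]", "[[NO]]"):
            if marker not in prompt:
                problems.append(f"missing {marker} marker")
    else:
        raise ValueError(f"unknown prompt kind: {kind}")
    return problems
```

A caller could fail fast when the returned list is non-empty, rather than discovering the problem during the detection phase.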

Development

Successfully merging this pull request may close these issues.

feature: make llmaaj prompts switchable