probe: add replay filter evasion through in-context learning by Carlos-Projects · Pull Request #1818 · NVIDIA/garak

Carlos-Projects · 2026-05-30T14:52:39Z

Adds a new probe that uses few-shot in-context learning examples to prime LLMs to reproduce training data, bypassing output filters that would normally block direct replay attempts.

Implements a privacy side channel inspired by arXiv:2309.05610.

8 target texts covering common training data categories
3 ICL demonstration pairs per target
Uses leakreplay.StartsWith detector
Closes probe: replay filter evasion through icl #378

(Resubmitted with unique branch name as requested.)

Uses few-shot ICL examples to prime LLMs to reproduce training data, bypassing output filters that would normally block direct replay attempts. Implements the privacy side channel described in arXiv:2309.05610 where system-level components (output filters) can be evaded by framing replay as a continuation task learned from in-context examples. - 8 target texts (well-known passages models likely memorized) - 3 ICL demonstration pairs priming continuation behavior - leakreplay.StartsWith detector integration - 8 tests covering loading, prompt generation, trigger propagation - Full variant with longer targets at COMPETE_WITH_SOTA tier Co-authored-by: OpenCode Signed-off-by: Carlos <carlos@aiagentobservatory.org>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

probe: add replay filter evasion through in-context learning#1818

probe: add replay filter evasion through in-context learning#1818
Carlos-Projects wants to merge 1 commit into
NVIDIA:mainfrom
Carlos-Projects:probe/replay-filter-evasion

Carlos-Projects commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Carlos-Projects commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant