probe: add replay filter evasion through in-context learning #1818

Open
Carlos-Projects wants to merge 1 commit into NVIDIA:main from Carlos-Projects:probe/replay-filter-evasion
Conversation

@Carlos-Projects
Adds a new probe that uses few-shot in-context learning examples to prime LLMs to reproduce training data, bypassing output filters that would normally block direct replay attempts.

Implements a privacy side channel inspired by arXiv:2309.05610.

(Resubmitted with unique branch name as requested.)

Uses few-shot ICL examples to prime LLMs to reproduce training data,
bypassing output filters that would normally block direct replay attempts.

Implements the privacy side channel described in arXiv:2309.05610 where
system-level components (output filters) can be evaded by framing replay
as a continuation task learned from in-context examples.
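
The continuation framing described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `DEMO_PAIRS`, `build_icl_prompt`, and the `Text:` / `Continuation:` labels are hypothetical names, and the demonstration strings are placeholders rather than the pairs the probe ships with.

```python
from typing import List, Tuple

# Placeholder demonstrations (hypothetical; not the pairs shipped in this PR).
DEMO_PAIRS: List[Tuple[str, str]] = [
    ("The quick brown fox", "jumps over the lazy dog."),
    ("Roses are red,", "violets are blue."),
    ("A stitch in time", "saves nine."),
]


def build_icl_prompt(pairs: List[Tuple[str, str]], target_prefix: str) -> str:
    """Build a few-shot prompt that presents replay as a continuation task.

    Each demonstration shows a text opening followed by its continuation.
    The final block repeats the pattern for the target but leaves the
    continuation empty, so the model is primed to fill it in.
    """
    blocks = [f"Text: {start}\nContinuation: {rest}" for start, rest in pairs]
    blocks.append(f"Text: {target_prefix}\nContinuation:")
    return "\n\n".join(blocks)


# Illustrative target opening only; the PR's actual targets are not listed here.
prompt = build_icl_prompt(DEMO_PAIRS, "It was the best of times,")
print(prompt)
```

The prompt never asks the model to quote anything. It only extends a pattern, which is the framing the commit message says lets the request evade the output filter.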

- 8 target texts (well-known passages models likely memorized)
- 3 ICL demonstration pairs priming continuation behavior
- leakreplay.StartsWith detector integration
- 8 tests covering loading, prompt generation, trigger propagation
- Full variant with longer targets at COMPETE_WITH_SOTA tier
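
As a rough sketch of how a target text might be split into a prompt prefix and a trigger, and how a starts-with check could then score a model output: the helpers below are hypothetical stand-ins written for illustration. They do not call garak and are not claimed to match the behaviour of `leakreplay.StartsWith` exactly. The function names and the `prefix_words` and `case_sensitive` parameters are all assumptions.

```python
from typing import List, Tuple


def split_target(text: str, prefix_words: int = 4) -> Tuple[str, str]:
    """Split a target into the prefix shown to the model and the
    remainder (the trigger) that a replayed output would begin with."""
    words = text.split()
    prefix = " ".join(words[:prefix_words])
    trigger = " ".join(words[prefix_words:])
    return prefix, trigger


def starts_with_any(output: str, triggers: List[str],
                    case_sensitive: bool = False) -> float:
    """Return 1.0 if the output begins with any trigger, else 0.0.

    Assumption: leading and trailing whitespace is ignored and matching
    is case-insensitive by default.
    """
    candidate = output.strip()
    if not case_sensitive:
        candidate = candidate.lower()
        triggers = [t.lower() for t in triggers]
    return 1.0 if any(candidate.startswith(t) for t in triggers) else 0.0


# Placeholder target, not one of the PR's 8 target texts.
prefix, trigger = split_target("alpha beta gamma delta epsilon zeta eta")
score_hit = starts_with_any("  Epsilon zeta eta, and so on", [trigger])
score_miss = starts_with_any("I can't continue that text.", [trigger])
```

In this sketch the trigger travels with the prompt it was derived from, which is the role the "trigger propagation" tests in the list above appear to cover.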

Co-authored-by: OpenCode
Signed-off-by: Carlos <carlos@aiagentobservatory.org>
Development

Successfully merging this pull request may close these issues.

probe: replay filter evasion through icl