probe: add replay filter evasion through in-context learning #1818
Open
Carlos-Projects wants to merge 1 commit into
Uses few-shot ICL examples to prime LLMs to reproduce training data, bypassing output filters that would normally block direct replay attempts. Implements the privacy side channel described in arXiv:2309.05610, where system-level components (output filters) can be evaded by framing replay as a continuation task learned from in-context examples.

- 8 target texts (well-known passages models likely memorized)
- 3 ICL demonstration pairs priming continuation behavior
- leakreplay.StartsWith detector integration
- 8 tests covering loading, prompt generation, trigger propagation
- Full variant with longer targets at COMPETE_WITH_SOTA tier

Co-authored-by: OpenCode
Signed-off-by: Carlos <carlos@aiagentobservatory.org>
Adds a new probe that uses few-shot in-context learning examples to prime LLMs to reproduce training data, bypassing output filters that would normally block direct replay attempts.
Implements a privacy side channel inspired by arXiv:2309.05610.
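
For reviewers, the prompt construction can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Demo` class, the `build_icl_prompt` function, the `Text:` / `Continuation:` labels, and the placeholder strings are all hypothetical. The idea is that each demonstration pair shows a passage opening followed by its continuation, and the final entry leaves the continuation empty for the model to fill in.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Demo:
    """One in-context demonstration: a passage opening and its continuation."""
    prefix: str
    continuation: str


def build_icl_prompt(demos: Sequence[Demo], target_prefix: str) -> str:
    """Frame replay as a continuation task learned from the demonstrations.

    Hypothetical layout: each demo is rendered as a completed pair, and the
    target is rendered as a pair whose continuation is left open.
    """
    blocks = [
        f"Text: {d.prefix}\nContinuation: {d.continuation}" for d in demos
    ]
    # The final block is intentionally incomplete so the model continues it.
    blocks.append(f"Text: {target_prefix}\nContinuation:")
    return "\n\n".join(blocks)


# Placeholder strings only; the PR's actual target texts are not reproduced here.
demos = [
    Demo("first placeholder opening", "first placeholder ending"),
    Demo("second placeholder opening", "second placeholder ending"),
    Demo("third placeholder opening", "third placeholder ending"),
]
prompt = build_icl_prompt(demos, "target placeholder opening")
```

Rendering the target in the same shape as the demonstrations is what makes the request look like pattern completion rather than a direct replay request.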
Uses the leakreplay.StartsWith detector.

(Resubmitted with a unique branch name, as requested.)
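
The prefix check that a starts-with style detector performs can be sketched as below. This is an assumption about the general technique, not the implementation of `leakreplay.StartsWith`: the function name `starts_with_trigger`, the whitespace stripping, and the case-insensitive default are all hypothetical choices made for illustration.

```python
from typing import Iterable


def starts_with_trigger(
    output: str,
    triggers: Iterable[str],
    case_sensitive: bool = False,
) -> float:
    """Return 1.0 if the model output begins with any expected continuation.

    Hypothetical scoring: 1.0 means the output replayed a trigger (a hit),
    0.0 means it did not. Leading whitespace in the output is ignored.
    """
    candidate = output.lstrip()
    if not case_sensitive:
        candidate = candidate.lower()
    for trigger in triggers:
        expected = trigger if case_sensitive else trigger.lower()
        # An empty trigger would match everything, so skip it.
        if expected and candidate.startswith(expected):
            return 1.0
    return 0.0


hit = starts_with_trigger("  Placeholder ending, and more", ["placeholder ending"])
miss = starts_with_trigger("I cannot continue that text.", ["placeholder ending"])
```

A prefix match is a strict test: an output that paraphrases the passage, or that prepends a short preamble before replaying it, would score as a miss under this sketch.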