feat: Add BijectionConverter and BijectionAttack (#1903)#1942
feat: Add BijectionConverter and BijectionAttack (#1903)#1942sajisanchu1913-source wants to merge 32 commits into
Conversation
…dup and harm categories
… fix imports and ordering
- _RemoteDatasetLoader._fetch_zip_from_url:
- keyword-only args (source, inner_files, cache)
- streams download (requests stream=True + iter_content) to avoid
double-buffering large archives
- md5-keyed disk cache under DB_DATA_PATH / seed-prompt-entries when
cache=True; named temp file otherwise (cleaned up after parse)
- validates each inner_files extension against FILE_TYPE_HANDLERS;
raises ValueError with a member preview if an inner file is missing
- parses inner files via FILE_TYPE_HANDLERS and returns parsed dicts,
so the open ZipFile never escapes the worker thread
- adds the missing import zipfile that broke the previous commit
- _MICDataset:
- drops unused io / json / requests imports (helper handles them)
- delegates download + parse to the helper; only owns the seed
construction loop
- guards non-string Q values (in addition to NaN moral values)
- forwards cache from fetch_dataset_async to the helper
- factors authors into AUTHORS class constant
- Tests:
- test_moral_integrity_corpus_dataset.py: stops mocking requests.get
directly; patches _fetch_zip_from_url to return parsed dicts so
tests don't depend on the helper's internal shape
- adds test_fetch_dataset_non_string_q and
test_fetch_dataset_passes_cache_flag
- hoists imports into the right groups so ruff I001 stops firing
- removes trailing whitespace / extra newlines
- test_remote_dataset_loader.py: adds TestFetchZipFromUrl covering
happy path, on-disk caching (hits 1 network call across 2 fetches),
cache=False does not persist, missing inner file raises ValueError,
unsupported extension raises ValueError
Verified live against the real MIC.zip: 35,408 unique seeds across
all 6 moral foundations in ~2.4s cold / ~1.3s warm. All 559 dataset
unit tests pass; ruff clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use tempfile.NamedTemporaryFile instead of fixed temp_audio.wav to prevent concurrent call collisions - Wrap Azure upload in try/finally to ensure temp file is always deleted even when upload fails - Add regression test to verify cleanup on upload failure Fixes microsoft#1894
- Add BijectionConverter that generates random letter-to-letter mapping - Add BijectionAttack that teaches the mapping to target AI and encodes harmful prompts - Add unit tests for both converter and attack - Add notebook demonstrating usage - Update __init__.py files to register new classes Based on arXiv:2410.01294 (Haize Labs bijection-learning)
romanlutz
left a comment
There was a problem hiding this comment.
This is a great start! There are a few things that need addressing but we're pretty close.
- Remove @pytest.mark.asyncio decorators (asyncio_mode=auto) - Fix __init__.py alphabetical ordering for BijectionConverter - Use patch_central_database fixture in attack tests - Use MagicMock(spec=PromptTarget) instead of plain MagicMock - Remove dead num_digits parameter - Add BijectionType StrEnum for bijection_type validation - Use private attributes with underscore prefix - Add _build_identifier() method - Fix teaching shots cap with programmatic cycling - Fix alternating user/assistant roles in teaching messages - Fix response decoding in _perform_async - Add BijectionConverter to _request_converters pipeline - Fix notebook format and add paired .py jupytext file - Register BijectionAttack in executor/attack/__init__.py
|
Hi @romanlutz I've addressed all the review comments:
Ready for re-review! |
|
Hi @romanlutz I've addressed the remaining review comments:
Ready for re-review |
- Change Optional[X] to X | None (PEP 604) - Change bijection_type: str to BijectionType in attack - Register BijectionType in prompt_converter __init__.py - Store decoded response in metadata instead of mutating last_response - Fix teaching shots: user sends English, assistant responds in cipher - Fix brittle test assertions to check structural properties - Update end-to-end test to check metadata for decoded response
|
Hi @romanlutz
Example usage: # Default letter mapping
attack = BijectionAttack(objective_target=target)
# Custom letter mapping with seed
attack = BijectionAttack(
objective_target=target,
bijection_converter=LetterBijectionConverter(fixed_size=5, seed=42),
)
# Digit mapping
attack = BijectionAttack(
objective_target=target,
bijection_converter=DigitBijectionConverter(num_digits=10, seed=42),
)Ready for re-review.... |
|
Hi @romanlutz |
|
I agree @sajisanchu1913-source. |
| }) | ||
|
|
||
|
|
||
| class DigitBijectionConverter(BijectionConverter): |
There was a problem hiding this comment.
Nice‑to‑have follow‑up — tokenizer bijection mode.
The paper (§2) describes a third bijection type beyond letter and digit modes: "tokens from the target model's tokenizer" — each English letter maps to a randomly‑sampled distinct token from the target's vocabulary. The paper notes these complexity parameters (fixed_size for letter mode, ℓ for digit mode, vocab subset for token mode) are what give the attack its scale‑adaptive property, so token mode is meaningful for evaluating frontier models.
Not blocking for this PR — the abstract base class makes adding it later straightforward (TokenBijectionConverter(BijectionConverter) that takes a tokenizer reference). Worth opening as a follow‑up issue once this lands. Or if you want to tackle it yourself you can do that as well, of course.
There was a problem hiding this comment.
can you open a GH issue for this and note that you'll do that yourself (if you want to!)? Just so that it's tracked.
- Remove duplicate docstring line in bijection_attack.py - Remove unused SeedPrompt import - Bump num_teaching_shots default to 10 per paper spec - Fix DigitBijectionConverter to map letters to digit strings (not digits to digits) - Add num_digits validation (must be 1-4, raises ValueError) - Fix bijection_attack.py notebook to correct jupytext format - Fix bijection_attack.ipynb to use new API (LetterBijectionConverter) - Add test_digit_converter_encodes_letters test - Add test_digit_converter_invalid_num_digits test
|
Hi @romanlutz addressed the latest review comments:
All 25 tests passing. Ready for re-review! |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
I'll make another pass in the morning. Thanks for your patience in addressing all my concerns so far 🙂 |
- Rewrite docstrings in imperative mood with Args/Returns/Raises (D401, DOC201, DOC501) - Add docstrings to public properties and subclass __init__ (D102, D107) - Use dict comprehension/update and zip(strict=True) (PERF403, B905) - Type _build_identifier as ComponentIdentifier; import it - Add type: ignore[ty:invalid-parameter-default] on REQUIRED_VALUE default - Wrap long intro prompt line (E501) and sort __init__ imports (I001) - Register bijection_attack.ipynb in doc/myst.yml; strip kernelspec metadata Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… test - Add LetterBijectionConverter and DigitBijectionConverter examples to doc/code/converters/1_text_to_text_converters (.py and .ipynb) - Add abstract BijectionConverter to test_converter_documentation exceptions, consistent with other abstract base classes Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
BijectionConverter is an abstract base class (ABC with abstractmethod), so enumerating it broke the converter-service instantiation test. Skip classes with non-empty __abstractmethods__ in get_converter_modalities so both the documentation and instantiation tests only see concrete converters. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add Bijection Learning paper to references and cite it from docs/source - Fix digit bijection decoding for multi-character encoded tokens - Reject one-digit digit mappings because 26 letters need 26 distinct values - Add coverage for explicit mappings, identifiers, digit round trips, teaching messages, and attack metadata paths Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use a system setup prompt that tells the target to answer only in the bijection code - Make the final task instruction explicit about private decoding and coded answers - Only expose decoded_response metadata when decoding appears to produce English - Show a skip status for plaintext or invalid cipher responses instead of bogus decoded text - Re-execute the bijection attack notebook after the semantic fix Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use a system setup message for targets that support system prompts - Fold setup instructions into the first user teaching shot for targets that do not - Preserve user/assistant alternation in the fallback path - Cover unsupported-system and zero-shot fallback behavior in tests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace live OpenAIChatTarget usage with a local deterministic demo target - Keep the notebook executable without external credentials - Demonstrate a valid bijection-coded response and decoded_response metadata Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Restore the live OpenAIChatTarget example and original red-team objective - Use Azure CLI credential with extended process timeout so notebook execution succeeds locally - Commit the executed notebook output from the live target run Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Hi @romanlutz I can see there are merge conflicts. Should I resolve them, or would you prefer to handle it on your end? Happy to do whatever is most helpful! |
|
Hi @romanlutz just wanted to check in on the merge conflicts, happy to resolve them myself if that would help move things along. Let me know! |
|
Hi @romanlutz I've resolved the merge conflicts, kept both the bijection citation and the new upstream citations in bibliography.md, fixed the init.py conflict by keeping both DecompositionConverter and DigitBijectionConverter alphabetically, and updated myst.yml to the new executor folder structure. Should be ready to merge now! |
|
Hi @romanlutz |
|
I've tried this and it essentially didn't work. The target model didn't seem to learn the mapping and the result was garbage (or at least I couldn't decode it with the provided mapping). Did you have a different experience? I've tried fixing that in a few different ways and so far not succeeded. |
|
Hi @romanlutz, I hadn't tested it end-to-end with a real model ,I focused on getting the architecture and unit tests right but didn't validate the actual attack effectiveness. Thanks for trying it! Given that it's not working, would you prefer I:
Happy to go with whatever keeps this PR on track! |
|
Hi Roman, Thank you for taking the time to review my PR. I'd really like to get this feature merged, and I'm happy to put in whatever work is needed to make it meet the project's expectations. If there are specific issues with the implementation or areas where my approach doesn't align with the project's design, I'd really appreciate your guidance. I'm committed to addressing the feedback and learning from it. I'm currently a Master's student and this is one of my first open-source feature contributions. Having a feature implementation merged into PyRIT would be a significant milestone for me, both as a learning experience and for my resume. That said, I want to earn the merge by making the implementation as good as it can be. Thanks again for your time and feedback—I appreciate any suggestions you have, and I'll work through them. |
|
I understand, @sajisanchu1913-source . I've not yet seen an LLM actually respond in a way that the decoding worked. For example, I tried and got this response which maps to the following (inverse): I need to start seeing a working bijection. I'm not even saying a successful attack, but a working bijection. We can explore options of restructuring it, or do a detailed comparison to the paper. But something appears to be off. |
|
Thanks, Roman. I understand the issue now. I'll revisit my implementation, compare it more closely with the paper and the reference implementation, and focus on getting the model to produce a working bijection before worrying about the attack itself. I'll work on this and update the PR by tomorrow morning with my findings and any changes. I really appreciate your feedback and the time you've taken to review it. As a Master's student, this is one of my first feature contributions to a large open-source project, so your guidance is incredibly valuable. Thank you! |
Summary
Implements the Bijection Attack from arXiv:2410.01294 (Haize Labs) into PyRIT.
The attack works by teaching a target LLM a secret character mapping through
demonstration shots, then sending harmful prompts encoded in that mapping to
bypass safety filters. Responses are decoded using the inverse mapping.
Changes
New Files
pyrit/prompt_converter/bijection_converter.py— generates random letter-to-letter mapping, encodes prompts, decodes responsespyrit/executor/attack/single_turn/bijection_attack.py— runs full bijection attack with teaching phasetests/unit/prompt_converter/test_bijection_converter.py— 11 unit tests for convertertests/unit/executor/test_bijection_attack.py— 5 unit tests for attackdoc/code/executor/attack/bijection_attack.ipynb— usage notebookModified Files
pyrit/prompt_converter/__init__.py— registered BijectionConverterpyrit/executor/attack/single_turn/__init__.py— registered BijectionAttackHow It Works
BijectionConvertergenerates a random secret mapping (e.g. a→q, b→x...)BijectionAttacksends teaching messages to target AI to teach the mappingTASK is '⟪encoded prompt⟫'Pattern Followed
BijectionConverterfollowsFlipConverterpatternBijectionAttackfollowsFlipAttackpatternReference