Skip to content

feat: Add BijectionConverter and BijectionAttack (#1903)#1942

Open
sajisanchu1913-source wants to merge 32 commits into
microsoft:mainfrom
sajisanchu1913-source:feat/bijection-attack
Open

feat: Add BijectionConverter and BijectionAttack (#1903)#1942
sajisanchu1913-source wants to merge 32 commits into
microsoft:mainfrom
sajisanchu1913-source:feat/bijection-attack

Conversation

@sajisanchu1913-source

Copy link
Copy Markdown
Contributor

Summary

Implements the Bijection Attack from arXiv:2410.01294 (Haize Labs) into PyRIT.

The attack works by teaching a target LLM a secret character mapping through
demonstration shots, then sending harmful prompts encoded in that mapping to
bypass safety filters. Responses are decoded using the inverse mapping.

Changes

New Files

  • pyrit/prompt_converter/bijection_converter.py — generates random letter-to-letter mapping, encodes prompts, decodes responses
  • pyrit/executor/attack/single_turn/bijection_attack.py — runs full bijection attack with teaching phase
  • tests/unit/prompt_converter/test_bijection_converter.py — 11 unit tests for converter
  • tests/unit/executor/test_bijection_attack.py — 5 unit tests for attack
  • doc/code/executor/attack/bijection_attack.ipynb — usage notebook

Modified Files

  • pyrit/prompt_converter/__init__.py — registered BijectionConverter
  • pyrit/executor/attack/single_turn/__init__.py — registered BijectionAttack

How It Works

  1. BijectionConverter generates a random secret mapping (e.g. a→q, b→x...)
  2. BijectionAttack sends teaching messages to target AI to teach the mapping
  3. Harmful prompt is encoded and sent as TASK is '⟪encoded prompt⟫'
  4. Response is decoded using inverse mapping
  5. Decoded response is scored by the judge

Pattern Followed

  • BijectionConverter follows FlipConverter pattern
  • BijectionAttack follows FlipAttack pattern

Reference

sajisanchu1913-source and others added 12 commits May 28, 2026 17:14
- _RemoteDatasetLoader._fetch_zip_from_url:
  - keyword-only args (source, inner_files, cache)
  - streams download (requests stream=True + iter_content) to avoid
    double-buffering large archives
  - md5-keyed disk cache under DB_DATA_PATH / seed-prompt-entries when
    cache=True; named temp file otherwise (cleaned up after parse)
  - validates each inner_files extension against FILE_TYPE_HANDLERS;
    raises ValueError with a member preview if an inner file is missing
  - parses inner files via FILE_TYPE_HANDLERS and returns parsed dicts,
    so the open ZipFile never escapes the worker thread
  - adds the missing import zipfile that broke the previous commit
- _MICDataset:
  - drops unused io / json / requests imports (helper handles them)
  - delegates download + parse to the helper; only owns the seed
    construction loop
  - guards non-string Q values (in addition to NaN moral values)
  - forwards cache from fetch_dataset_async to the helper
  - factors authors into AUTHORS class constant
- Tests:
  - test_moral_integrity_corpus_dataset.py: stops mocking requests.get
    directly; patches _fetch_zip_from_url to return parsed dicts so
    tests don't depend on the helper's internal shape
  - adds test_fetch_dataset_non_string_q and
    test_fetch_dataset_passes_cache_flag
  - hoists imports into the right groups so ruff I001 stops firing
  - removes trailing whitespace / extra newlines
- test_remote_dataset_loader.py: adds TestFetchZipFromUrl covering
  happy path, on-disk caching (hits 1 network call across 2 fetches),
  cache=False does not persist, missing inner file raises ValueError,
  unsupported extension raises ValueError

Verified live against the real MIC.zip: 35,408 unique seeds across
all 6 moral foundations in ~2.4s cold / ~1.3s warm. All 559 dataset
unit tests pass; ruff clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use tempfile.NamedTemporaryFile instead of fixed temp_audio.wav
  to prevent concurrent call collisions
- Wrap Azure upload in try/finally to ensure temp file is always
  deleted even when upload fails
- Add regression test to verify cleanup on upload failure

Fixes microsoft#1894
- Add BijectionConverter that generates random letter-to-letter mapping
- Add BijectionAttack that teaches the mapping to target AI and encodes harmful prompts
- Add unit tests for both converter and attack
- Add notebook demonstrating usage
- Update __init__.py files to register new classes

Based on arXiv:2410.01294 (Haize Labs bijection-learning)

@romanlutz romanlutz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start! There are a few things that need addressing but we're pretty close.

Comment thread pyrit/executor/attack/single_turn/bijection_attack.py Outdated
Comment thread pyrit/prompt_converter/__init__.py Outdated
Comment thread tests/unit/executor/test_bijection_attack.py Outdated
Comment thread pyrit/prompt_converter/bijection_converter.py
Comment thread pyrit/prompt_converter/bijection_converter.py Outdated
Comment thread pyrit/models/data_type_serializer.py Outdated
Comment thread doc/code/executor/bijection_attack.ipynb
Comment thread pyrit/executor/attack/single_turn/bijection_attack.py Outdated
Comment thread pyrit/prompt_converter/bijection_converter.py
Comment thread pyrit/executor/attack/single_turn/bijection_attack.py Outdated
- Remove @pytest.mark.asyncio decorators (asyncio_mode=auto)
- Fix __init__.py alphabetical ordering for BijectionConverter
- Use patch_central_database fixture in attack tests
- Use MagicMock(spec=PromptTarget) instead of plain MagicMock
- Remove dead num_digits parameter
- Add BijectionType StrEnum for bijection_type validation
- Use private attributes with underscore prefix
- Add _build_identifier() method
- Fix teaching shots cap with programmatic cycling
- Fix alternating user/assistant roles in teaching messages
- Fix response decoding in _perform_async
- Add BijectionConverter to _request_converters pipeline
- Fix notebook format and add paired .py jupytext file
- Register BijectionAttack in executor/attack/__init__.py
@sajisanchu1913-source

Copy link
Copy Markdown
Contributor Author

Hi @romanlutz I've addressed all the review comments:

  • Removed @pytest.mark.asyncio decorators
  • Fixed init.py alphabetical ordering
  • Used patch_central_database fixture in attack tests
  • Used MagicMock(spec=PromptTarget) instead of plain MagicMock
  • Removed dead num_digits parameter
  • Added BijectionType StrEnum for validation
  • Used private attributes with underscore prefix
  • Added _build_identifier() method
  • Fixed teaching shots cap with programmatic cycling
  • Fixed alternating user/assistant roles in teaching messages
  • Fixed response decoding in _perform_async
  • Added BijectionConverter to _request_converters pipeline
  • Fixed notebook format and added paired .py jupytext file

Ready for re-review!

@sajisanchu1913-source

Copy link
Copy Markdown
Contributor Author

Hi @romanlutz I've addressed the remaining review comments:

  • Resolved merge conflicts with upstream/main (kept BidiConverter from main, added BijectionConverter alphabetically)
  • Added end-to-end test in TestBijectionAttackEndToEnd that uses MockPromptTarget, returns a cipher-text response, and asserts the result is decoded back to plain text
  • Fixed ComponentIdentifier import to use pyrit.models.identifiers

Ready for re-review

Comment thread pyrit/executor/attack/single_turn/bijection_attack.py Outdated
Comment thread pyrit/executor/attack/single_turn/bijection_attack.py Outdated
Comment thread tests/unit/executor/test_bijection_attack.py Outdated
Comment thread pyrit/executor/attack/single_turn/bijection_attack.py Outdated
Comment thread pyrit/prompt_converter/bijection_converter.py Outdated
- Change Optional[X] to X | None (PEP 604)
- Change bijection_type: str to BijectionType in attack
- Register BijectionType in prompt_converter __init__.py
- Store decoded response in metadata instead of mutating last_response
- Fix teaching shots: user sends English, assistant responds in cipher
- Fix brittle test assertions to check structural properties
- Update end-to-end test to check metadata for decoded response
@sajisanchu1913-source

Copy link
Copy Markdown
Contributor Author

Hi @romanlutz
I've completed the architectural restructure

  • BijectionConverter is now an abstract base class
  • Added LetterBijectionConverter with fixed_size and seed parameters
  • Added DigitBijectionConverter with num_digits and seed parameters
  • Added seed parameter for reproducibility across all converters
  • Added explicit mapping parameter for replay/deterministic experiments
  • BijectionAttack now accepts a bijection_converter instance instead of bijection_type/fixed_size
  • Updated all tests to use the concrete subclasses
  • All 23 tests passing

Example usage:

# Default letter mapping
attack = BijectionAttack(objective_target=target)

# Custom letter mapping with seed
attack = BijectionAttack(
    objective_target=target,
    bijection_converter=LetterBijectionConverter(fixed_size=5, seed=42),
)

# Digit mapping
attack = BijectionAttack(
    objective_target=target,
    bijection_converter=DigitBijectionConverter(num_digits=10, seed=42),
)

Ready for re-review....

@sajisanchu1913-source

Copy link
Copy Markdown
Contributor Author

Hi @romanlutz
For TokenBijectionConverte, since it requires a tokenizer dependency, I'll implement it in a separate follow-up PR to keep this one focused. Let me know if you'd prefer it here instead!

@romanlutz

Copy link
Copy Markdown
Contributor

I agree @sajisanchu1913-source.

})


class DigitBijectionConverter(BijectionConverter):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice‑to‑have follow‑up — tokenizer bijection mode.

The paper (§2) describes a third bijection type beyond letter and digit modes: "tokens from the target model's tokenizer" — each English letter maps to a randomly‑sampled distinct token from the target's vocabulary. The paper notes these complexity parameters (fixed_size for letter mode, ℓ for digit mode, vocab subset for token mode) are what give the attack its scale‑adaptive property, so token mode is meaningful for evaluating frontier models.

Not blocking for this PR — the abstract base class makes adding it later straightforward (TokenBijectionConverter(BijectionConverter) that takes a tokenizer reference). Worth opening as a follow‑up issue once this lands. Or if you want to tackle it yourself you can do that as well, of course.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you open a GH issue for this and note that you'll do that yourself (if you want to!)? Just so that it's tracked.

Comment thread pyrit/executor/attack/single_turn/bijection_attack.py
Comment thread pyrit/prompt_converter/bijection_converter.py Outdated
Comment thread doc/code/executor/attack/bijection_attack.py Outdated
Comment thread doc/code/executor/bijection_attack.ipynb
- Remove duplicate docstring line in bijection_attack.py
- Remove unused SeedPrompt import
- Bump num_teaching_shots default to 10 per paper spec
- Fix DigitBijectionConverter to map letters to digit strings (not digits to digits)
- Add num_digits validation (must be 1-4, raises ValueError)
- Fix bijection_attack.py notebook to correct jupytext format
- Fix bijection_attack.ipynb to use new API (LetterBijectionConverter)
- Add test_digit_converter_encodes_letters test
- Add test_digit_converter_invalid_num_digits test
@sajisanchu1913-source

Copy link
Copy Markdown
Contributor Author

Hi @romanlutz addressed the latest review comments:

  • Removed duplicate docstring line
  • Removed unused SeedPrompt import
  • Bumped num_teaching_shots default to 10 per paper spec
  • Fixed DigitBijectionConverter. now maps letters to digit
  • Added num_digits validation (raises ValueError for values outside 1-4)
  • Fixed bijection_attack.py notebook to correct jupytext percent format
  • Fixed bijection_attack.ipynb to use new API (LetterBijectionConverter)
  • Added test_digit_converter_encodes_letters and test_digit_converter_invalid_num_digits tests

All 25 tests passing. Ready for re-review!

@romanlutz

Copy link
Copy Markdown
Contributor

I'll make another pass in the morning. Thanks for your patience in addressing all my concerns so far 🙂

Copilot AI added 10 commits June 16, 2026 05:42
- Rewrite docstrings in imperative mood with Args/Returns/Raises (D401, DOC201, DOC501)
- Add docstrings to public properties and subclass __init__ (D102, D107)
- Use dict comprehension/update and zip(strict=True) (PERF403, B905)
- Type _build_identifier as ComponentIdentifier; import it
- Add type: ignore[ty:invalid-parameter-default] on REQUIRED_VALUE default
- Wrap long intro prompt line (E501) and sort __init__ imports (I001)
- Register bijection_attack.ipynb in doc/myst.yml; strip kernelspec metadata

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… test

- Add LetterBijectionConverter and DigitBijectionConverter examples to
  doc/code/converters/1_text_to_text_converters (.py and .ipynb)
- Add abstract BijectionConverter to test_converter_documentation exceptions,
  consistent with other abstract base classes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
BijectionConverter is an abstract base class (ABC with abstractmethod), so
enumerating it broke the converter-service instantiation test. Skip classes
with non-empty __abstractmethods__ in get_converter_modalities so both the
documentation and instantiation tests only see concrete converters.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add Bijection Learning paper to references and cite it from docs/source
- Fix digit bijection decoding for multi-character encoded tokens
- Reject one-digit digit mappings because 26 letters need 26 distinct values
- Add coverage for explicit mappings, identifiers, digit round trips, teaching messages, and attack metadata paths

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use a system setup prompt that tells the target to answer only in the bijection code
- Make the final task instruction explicit about private decoding and coded answers
- Only expose decoded_response metadata when decoding appears to produce English
- Show a skip status for plaintext or invalid cipher responses instead of bogus decoded text
- Re-execute the bijection attack notebook after the semantic fix

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use a system setup message for targets that support system prompts
- Fold setup instructions into the first user teaching shot for targets that do not
- Preserve user/assistant alternation in the fallback path
- Cover unsupported-system and zero-shot fallback behavior in tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace live OpenAIChatTarget usage with a local deterministic demo target
- Keep the notebook executable without external credentials
- Demonstrate a valid bijection-coded response and decoded_response metadata

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Restore the live OpenAIChatTarget example and original red-team objective
- Use Azure CLI credential with extended process timeout so notebook execution succeeds locally
- Commit the executed notebook output from the live target run

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@sajisanchu1913-source

Copy link
Copy Markdown
Contributor Author

Hi @romanlutz I can see there are merge conflicts. Should I resolve them, or would you prefer to handle it on your end? Happy to do whatever is most helpful!

@sajisanchu1913-source

Copy link
Copy Markdown
Contributor Author

Hi @romanlutz just wanted to check in on the merge conflicts, happy to resolve them myself if that would help move things along. Let me know!

@sajisanchu1913-source

Copy link
Copy Markdown
Contributor Author

Hi @romanlutz I've resolved the merge conflicts, kept both the bijection citation and the new upstream citations in bibliography.md, fixed the init.py conflict by keeping both DecompositionConverter and DigitBijectionConverter alphabetically, and updated myst.yml to the new executor folder structure. Should be ready to merge now!

@sajisanchu1913-source

Copy link
Copy Markdown
Contributor Author

Hi @romanlutz
I've resolved the merge conflicts. One thing to flag I forgot to create a separate branch before starting work on TokenBijectionConverter, so those changes ended up here. I've gone ahead and included it since the architecture was already in place, but happy to revert it out and open a separate PR if you'd prefer to keep this one focused on the original scope. Let me know what works best!

@romanlutz

Copy link
Copy Markdown
Contributor

I've tried this and it essentially didn't work. The target model didn't seem to learn the mapping and the result was garbage (or at least I couldn't decode it with the provided mapping). Did you have a different experience?

I've tried fixing that in a few different ways and so far not succeeded.

@sajisanchu1913-source

Copy link
Copy Markdown
Contributor Author

Hi @romanlutz, I hadn't tested it end-to-end with a real model ,I focused on getting the architecture and unit tests right but didn't validate the actual attack effectiveness. Thanks for trying it!

Given that it's not working, would you prefer I:

  1. Remove TokenBijectionConverter from this PR and open a separate issue to investigate the correct approach
  2. Keep it here but mark it as experimental with a note in the docstring

Happy to go with whatever keeps this PR on track!

@sajisanchu1913-source

Copy link
Copy Markdown
Contributor Author

Hi Roman,

Thank you for taking the time to review my PR. I'd really like to get this feature merged, and I'm happy to put in whatever work is needed to make it meet the project's expectations.

If there are specific issues with the implementation or areas where my approach doesn't align with the project's design, I'd really appreciate your guidance. I'm committed to addressing the feedback and learning from it.

I'm currently a Master's student and this is one of my first open-source feature contributions. Having a feature implementation merged into PyRIT would be a significant milestone for me, both as a learning experience and for my resume. That said, I want to earn the merge by making the implementation as good as it can be.

Thanks again for your time and feedback—I appreciate any suggestions you have, and I'll work through them.

@romanlutz

Copy link
Copy Markdown
Contributor

I understand, @sajisanchu1913-source .

I've not yet seen an LLM actually respond in a way that the decoding worked.

For example, I tried

a=q, b=m, c=j, d=z, e=t, f=g, g=f, h=k, i=p, j=w, k=l, l=s, m=b,
n=o, o=x, p=n, q=c, r=r, s=y, t=e, u=v, v=h, w=i, x=a, y=d, z=u

and got this response

zqqy bm ksp fxa wdykydy xf c bzxu pltkchv gtrktfs twttks

which maps to the following (inverse):

daas mb hli gox jyshsys og q mdoz ikehqvu ferhegl ejeehl

I need to start seeing a working bijection. I'm not even saying a successful attack, but a working bijection.

We can explore options of restructuring it, or do a detailed comparison to the paper. But something appears to be off.

@sajisanchu1913-source

Copy link
Copy Markdown
Contributor Author

Thanks, Roman. I understand the issue now.

I'll revisit my implementation, compare it more closely with the paper and the reference implementation, and focus on getting the model to produce a working bijection before worrying about the attack itself. I'll work on this and update the PR by tomorrow morning with my findings and any changes.

I really appreciate your feedback and the time you've taken to review it. As a Master's student, this is one of my first feature contributions to a large open-source project, so your guidance is incredibly valuable. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

FEAT Bijection

3 participants