FEAT plumb media output through adversarial feedback loop (#6a) #3

Open
fitzpr wants to merge 577 commits into main from feature/media-feedback-loop-v2

fitzpr commented Feb 19, 2026

Description

When the red teaming attack loop targets an image/video generator (e.g. DALL-E), the adversarial chat (e.g. GPT-4o) never sees the generated media - only the scorer's text feedback is sent back. The adversarial LLM refines its prompts blindly.

This PR makes media outputs flow through the feedback loop as multimodal messages (text + image/video), so the adversarial chat can see what the target actually produced and give better-informed follow-up prompts.

Before: Target → image → Scorer → "missing a hat" (text only) → Adversarial LLM
After: Target → image → Scorer → "missing a hat" + image → Adversarial LLM

Microsoft Security Hackathon 2026, project #6a. @romanlutz for visibility.

Changes

red_teaming.py - core change (3 methods):

  • _handle_adversarial_file_response now returns a (feedback_text, media_piece) tuple instead of just a string
  • _build_adversarial_prompt returns Union[str, tuple] - string for text responses, tuple for media
  • _generate_next_prompt_async builds a multimodal Message with text + media pieces when a tuple is returned
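
The three changes above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `Piece` is a hypothetical stand-in for PyRIT's `MessagePiece`, and the function names are simplified, synchronous versions of the methods listed above.

```python
from dataclasses import dataclass
from typing import Optional, Union
import uuid


@dataclass
class Piece:
    """Hypothetical stand-in for MessagePiece (assumption, not the PyRIT class)."""
    value: str
    data_type: str = "text"
    role: str = "user"
    conversation_id: str = ""


def handle_file_response(response: Piece, feedback: str) -> tuple[str, Piece]:
    # Return the scorer feedback together with the media piece it refers to.
    return feedback, response


def build_adversarial_prompt(
    response: Piece, feedback: str
) -> Union[str, tuple[str, Optional[Piece]]]:
    # Text responses stay plain strings; media responses become a tuple.
    if response.data_type == "text":
        return response.value
    return handle_file_response(response, feedback)


def generate_next_prompt_pieces(
    prompt_result: Union[str, tuple[str, Optional[Piece]]]
) -> list[Piece]:
    # All pieces of one message share a conversation id.
    conversation_id = str(uuid.uuid4())
    if isinstance(prompt_result, str):
        return [Piece(prompt_result, conversation_id=conversation_id)]
    feedback_text, media_piece = prompt_result
    pieces = [Piece(feedback_text, conversation_id=conversation_id)]
    if media_piece is not None:
        pieces.append(
            Piece(media_piece.value, media_piece.data_type,
                  conversation_id=conversation_id)
        )
    return pieces
```

With an image response, `generate_next_prompt_pieces` yields a text piece followed by an image piece; with a text response it yields a single text piece, which is the unchanged text-only path.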

openai_chat_target.py - 1-line bug fix:

  • Content-filtered responses have the data type "error", which crashed multimodal message building with ValueError: Multimodal data type error is not yet supported. "error" is now serialized as text, the same way as "text".
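
A simplified sketch of the idea behind this fix, assuming pieces are plain dicts rather than `MessagePiece` objects and that images are passed by URL (the real target converts image files itself; that part is omitted here). The function name is hypothetical.

```python
def build_content_parts(pieces: list[dict]) -> list[dict]:
    """Build an OpenAI-style content array from simplified piece dicts."""
    content = []
    for piece in pieces:
        data_type = piece["converted_value_data_type"]
        if data_type in ("text", "error"):
            # Content-filter errors carry a textual payload, so emit them as text.
            content.append({"type": "text", "text": piece["converted_value"]})
        elif data_type == "image_path":
            # Assumption: the value is already a URL the endpoint can read.
            content.append(
                {"type": "image_url", "image_url": {"url": piece["converted_value"]}}
            )
        else:
            raise ValueError(f"Multimodal data type {data_type} is not yet supported.")
    return content
```

Before the fix, an "error" piece would have fallen through to the `ValueError` branch.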

No new files, classes, or dependencies. Fully backward compatible - text-only workflows are unchanged.

Tests and Documentation

test_red_teaming.py - 5 new tests, 2 updated:

  • test_generate_next_prompt_sends_multimodal_message_for_image_response - verifies 2-piece multimodal message construction (text + image)
  • test_generate_next_prompt_sends_multimodal_message_for_video_response - same for video
  • test_generate_next_prompt_text_response_stays_text_only - regression test ensuring text path is unaffected
  • test_build_adversarial_prompt_returns_tuple_for_image_response - return type assertion
  • test_build_adversarial_prompt_returns_str_for_text_response - return type assertion
  • 2 existing tests updated to assert tuple returns from _handle_adversarial_file_response

All 161 tests pass (70 red_teaming + 91 chat_target). Jupytext was not run because this PR contains no notebook changes.

Copilot AI left a comment

Pull request overview

This PR updates the red-teaming multi-turn attack loop so that when an objective target returns generated media (e.g., images/videos), that media is included alongside scorer feedback when prompting the adversarial chat model, enabling multimodal refinement rather than text-only “blind” iteration.

Changes:

  • Plumbs (feedback_text, media_piece) through the adversarial feedback path and constructs a multimodal Message (text + media) when applicable.
  • Adjusts _handle_adversarial_file_response to return both feedback and the originating media MessagePiece.
  • Treats OpenAI chat “error” pieces as text content parts when building multimodal chat messages.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pyrit/executor/attack/multi_turn/red_teaming.py | Builds multimodal adversarial prompts for media responses and changes handler return types. |
| pyrit/prompt_target/openai/openai_chat_target.py | Allows converted_value_data_type == "error" to be serialized as a text content-part in multimodal mode. |
| tests/unit/executor/attack/multi_turn/test_red_teaming.py | Adds/updates unit tests to validate tuple return types and multimodal message construction. |

Comment on lines +420 to +423
     async def _build_adversarial_prompt(
         self,
         context: MultiTurnAttackContext[Any],
-    ) -> str:
+    ) -> Union[str, tuple[str, Optional[MessagePiece]]]:

Copilot AI Feb 19, 2026

This is an async method but its name doesn’t end with the required _async suffix. Please rename it (e.g., _build_adversarial_prompt_async) and update the call site(s) and tests accordingly to match the project’s async naming convention.
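
One way to catch this class of issue mechanically is a small check built on the standard-library `ast` module. The sketch below is an illustration only; the function name is hypothetical and it is not part of the repository's tooling.

```python
import ast


def find_misnamed_async_functions(source: str, suffix: str = "_async") -> list[str]:
    """Return names of `async def` functions that lack the given suffix."""
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, ast.AsyncFunctionDef) and not node.name.endswith(suffix)
    ]
```

Run against the source of red_teaming.py, a check like this would flag `_build_adversarial_prompt` while accepting `_generate_next_prompt_async`.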

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +590 to 592
if message_piece.converted_value_data_type in ("text", "error"):
    entry = {"type": "text", "text": message_piece.converted_value}
    content.append(entry)

Copilot AI Feb 19, 2026

Treating converted_value_data_type == "error" as a text content part avoids the previous exception, but note that any conversation containing an "error" piece will still route through the multimodal message format (content array) because _is_text_message_format only accepts "text". For OpenAI-compatible endpoints that don’t support the content-parts schema, this can still fail even though the payload is effectively text-only. Consider also treating "error" as text in the text-format detection (or normalizing error pieces to "text") so purely textual conversations continue using the plain string format.
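
The suggestion could look roughly like the sketch below. It assumes pieces are simple `(data_type, value)` pairs, that text-like pieces are joined with newlines in the plain format, and that any non-text piece is an image URL; the actual signature of `_is_text_message_format` is not shown in this PR, so the functions here are hypothetical.

```python
from typing import Union

# Data types whose payload is plain text (assumption based on the comment above).
TEXT_LIKE_TYPES = frozenset({"text", "error"})


def is_text_message_format(data_types: list[str]) -> bool:
    """True when every piece can be sent as a plain string."""
    return all(data_type in TEXT_LIKE_TYPES for data_type in data_types)


def serialize_content(pieces: list[tuple[str, str]]) -> Union[str, list[dict]]:
    # Purely textual conversations keep the plain string format.
    if is_text_message_format([data_type for data_type, _ in pieces]):
        return "\n".join(value for _, value in pieces)
    # Otherwise fall back to the content-parts array.
    content = []
    for data_type, value in pieces:
        if data_type in TEXT_LIKE_TYPES:
            content.append({"type": "text", "text": value})
        else:
            content.append({"type": "image_url", "image_url": {"url": value}})
    return content
```

With this detection, a conversation whose only piece is an "error" is sent as a string, so endpoints without content-parts support are unaffected.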

Comment on lines +367 to +385
if isinstance(prompt_result, tuple):
    feedback_text, media_piece = prompt_result
    # Use a shared conversation_id so Message validation passes
    shared_conversation_id = str(uuid.uuid4())
    pieces = [
        MessagePiece(
            original_value=feedback_text,
            role="user",
            conversation_id=shared_conversation_id,
        )
    ]
    if media_piece is not None:
        pieces.append(
            MessagePiece(
                original_value=media_piece.converted_value,
                role="user",
                original_value_data_type=media_piece.converted_value_data_type,
                conversation_id=shared_conversation_id,
            )

Copilot AI Feb 19, 2026

_generate_next_prompt_async now builds a multimodal Message containing the objective target’s media piece. This can break when the adversarial chat target doesn’t support that data type (notably, OpenAIChatTarget only validates text/image_path/audio_path and will reject video_path). Consider detecting/whitelisting supported prompt data types for the adversarial target (or catching the validation error) and falling back to a text-only prompt (e.g., include the feedback text + a textual reference to the media path) to preserve backward compatibility.
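
A minimal sketch of the whitelist-and-fallback approach, assuming the supported set named above (text, image_path, audio_path) and representing pieces as `(data_type, value)` pairs. The function name and the wording of the textual media reference are hypothetical.

```python
from typing import Optional

# Assumption: data types the adversarial chat target accepts, per the comment above.
DEFAULT_SUPPORTED_TYPES = frozenset({"text", "image_path", "audio_path"})


def plan_feedback_pieces(
    feedback_text: str,
    media_value: Optional[str],
    media_type: Optional[str],
    supported_types: frozenset = DEFAULT_SUPPORTED_TYPES,
) -> list[tuple[str, str]]:
    """Decide which pieces to send to the adversarial chat."""
    if media_value is None or media_type is None:
        return [("text", feedback_text)]
    if media_type in supported_types:
        # Supported media is attached next to the feedback text.
        return [("text", feedback_text), (media_type, media_value)]
    # Unsupported media degrades to a text-only prompt that references the path.
    return [("text", f"{feedback_text}\n[Generated {media_type} omitted: {media_value}]")]
```

The fallback keeps a video-generating target usable with a chat target that only accepts text, images, and audio.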

fitzpr pushed a commit that referenced this pull request Feb 19, 2026
Address Roman's feedback items #2 and #3:
- Change _build_adversarial_prompt to return Message instead of Union type
- Extract message construction logic into separate helper methods
- Add _build_text_message() for simple text prompts
- Add _build_multimodal_message() for media responses
- Simplify caller code by removing tuple handling logic
- Improve logging to work with Message objects

These architectural improvements prepare the code to integrate with
the modality support detection system from separate PR.
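
The refactor described in this commit might be shaped like the sketch below. `Piece` and `Message` are hypothetical stand-ins for PyRIT's `MessagePiece` and `Message`, and the helpers are free functions here rather than methods; only the helper names come from the commit message.

```python
from dataclasses import dataclass
import uuid


@dataclass
class Piece:
    """Hypothetical stand-in for MessagePiece."""
    value: str
    data_type: str = "text"
    role: str = "user"
    conversation_id: str = ""


@dataclass
class Message:
    """Hypothetical stand-in for Message."""
    pieces: list[Piece]


def build_text_message(text: str) -> Message:
    # Simple text prompt: a single text piece.
    return Message([Piece(text, conversation_id=str(uuid.uuid4()))])


def build_multimodal_message(feedback_text: str, media_piece: Piece) -> Message:
    # Feedback text plus the media it describes, under one conversation id.
    conversation_id = str(uuid.uuid4())
    return Message([
        Piece(feedback_text, conversation_id=conversation_id),
        Piece(media_piece.value, media_piece.data_type, conversation_id=conversation_id),
    ])


def build_adversarial_prompt(response: Piece, feedback_text: str) -> Message:
    # One return type for both paths, so the caller needs no tuple handling.
    if response.data_type == "text":
        return build_text_message(response.value)
    return build_multimodal_message(feedback_text, response)
```

Returning a single type moves the branching into the builder, which is why the caller no longer needs an `isinstance` check.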
rlundeen2 and others added 26 commits May 20, 2026 22:20
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…detection (microsoft#1704)

Co-authored-by: francose <13445813+francose@users.noreply.github.com>
Co-authored-by: Roman Lutz <romanlutz13@gmail.com>
Co-authored-by: Richard Lundeen <rlundeen@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Richard Lundeen <137218279+rlundeen2@users.noreply.github.com>
…crosoft#1756)

Co-authored-by: romanlutz <romanlutz@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: hannahwestra25 <hannahwestra@microsoft.com>
…ternal refs (closes microsoft#1741) (microsoft#1745)

Co-authored-by: romanlutz <romanlutz@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Roman Lutz <roman.lutz@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t#1750)

Co-authored-by: romanlutz <romanlutz@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ft#1755)

Co-authored-by: romanlutz <romanlutz@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: romanlutz <romanlutz@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
)

Co-authored-by: behnamousat <behnamousat@microsoft.com>
…attack and 0_output (microsoft#1777)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: romanlutz <romanlutz@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icTextConverter (microsoft#1714)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rosoft#1776)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: romanlutz <romanlutz@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…osoft#1773)

Co-authored-by: Behnam Ousat <behnamousat@microsoft.com>
…microsoft#1778)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…1736)

Co-authored-by: Roman Lutz <romanlutz@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…soft#1781)

Co-authored-by: Behnam Ousat <behnamousat@microsoft.com>
…icrosoft#1768)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…oft#1770)

Co-authored-by: romanlutz <romanlutz@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
microsoft#1793)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-2026-45409) (microsoft#1796)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dependabot Bot and others added 27 commits June 17, 2026 21:04
…es (microsoft#2037)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…y-minor-and-patch group across 1 directory (microsoft#2031)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Roman Lutz <romanlutz13@gmail.com>
…#1994)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ft#2029)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…#2030)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icrosoft#2041)

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
… tag (microsoft#2036)

Co-authored-by: Behnam Ousat <behnamousat@microsoft.com>
…PAIR (microsoft#2039)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…#2047)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… scenario E2E tests (microsoft#2048)

Co-authored-by: Behnam Ousat <behnamousat@microsoft.com>
- Reset doc/ to match origin/main (flat numbered notebook structure)
- Remove old attack/, workflow/, benchmark/, promptgen/ subdirectory notebooks
- Add doc/code/executor/8_modality_feedback.py/.ipynb: two-seed Crescendo
  modality-feedback example (roakey + sailboat, hybrid capability profile)
- Update 0_executor.md and myst.yml to include notebook #8 in navigation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…h 2 updates (microsoft#2052)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: hannahwestra25 <hannahwestra@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: hannahwestra25 <hannahwestra@users.noreply.github.com>
…icter scorer for multi-turn demo

- Section 6 now uses IPyImage(data=bytes) to embed all images directly in
  the notebook so they render without re-running (no more unresolvable paths).
- Replaced custom adversarial system_prompt with SeedPrompt loaded from the
  built-in crescendo/image_generation.yaml, which has proper multi-turn
  escalation (starts simple, builds up) and forces 2-4 turns instead of 1.
- Fixed image_generation.yaml JSON response keys: renamed generated_question
  -> next_message and rationale_behind_jailbreak -> rationale to match what
  CrescendoAttack expects.
- Tightened SelfAskTrueFalseScorer true_description to require ALL five
  visual elements simultaneously, making single-turn success unlikely.
- Added EXECUTOR_SEED_PROMPT_PATH and SeedPrompt imports.
- Removed unused MarkdownConversationMemoryPrinter and IPythonMarkdownSink.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…loop-v2' into feature/media-feedback-loop-v2
…t#2034)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…patch group across 1 directory (microsoft#2055)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Roman Lutz <romanlutz13@gmail.com>
…eedback-loop-v2' into feature/media-feedback-loop-v2"

This reverts commit 4135eee, reversing
changes made to 7d5721a.
…ft#1902)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…l naming

- Tighten modality notebook objective + scorer criteria to preserve the seeded raccoon identity.
- Regenerate 8_modality_feedback.ipynb outputs from the updated notebook source.
- Strengthen Crescendo image_generation guidance for seeded non-human anchors and aligned rationale key naming.
- Rename ModalityFeedbackRouter constructor keyword from adversarial_target to adversarial_chat.
- Rename property objective_requires_media_on_first_turn to objective_target_requires_media_on_first_turn.
- Update all affected multi-turn attack callsites and unit tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI added 2 commits June 22, 2026 13:19
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ema test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>