FEAT plumb media output through adversarial feedback loop (#6a) #3
fitzpr wants to merge 577 commits into
Pull request overview
This PR updates the red-teaming multi-turn attack loop so that when an objective target returns generated media (e.g., images/videos), that media is included alongside scorer feedback when prompting the adversarial chat model, enabling multimodal refinement rather than text-only “blind” iteration.
Changes:
- Plumbs `(feedback_text, media_piece)` through the adversarial feedback path and constructs a multimodal `Message` (text + media) when applicable.
- Adjusts `_handle_adversarial_file_response` to return both feedback and the originating media `MessagePiece`.
- Treats OpenAI chat "error" pieces as text content parts when building multimodal chat messages.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `pyrit/executor/attack/multi_turn/red_teaming.py` | Builds multimodal adversarial prompts for media responses and changes handler return types. |
| `pyrit/prompt_target/openai/openai_chat_target.py` | Allows `converted_value_data_type == "error"` to be serialized as a text content-part in multimodal mode. |
| `tests/unit/executor/attack/multi_turn/test_red_teaming.py` | Adds/updates unit tests to validate tuple return types and multimodal message construction. |
      async def _build_adversarial_prompt(
          self,
          context: MultiTurnAttackContext[Any],
    - ) -> str:
    + ) -> Union[str, tuple[str, Optional[MessagePiece]]]:
This is an async method but its name doesn’t end with the required `_async` suffix. Please rename it (e.g., `_build_adversarial_prompt_async`) and update the call site(s) and tests accordingly to match the project’s async naming convention.
    if message_piece.converted_value_data_type in ("text", "error"):
        entry = {"type": "text", "text": message_piece.converted_value}
        content.append(entry)
Treating `converted_value_data_type == "error"` as a text content part avoids the previous exception, but note that any conversation containing an `"error"` piece will still route through the multimodal message format (content array) because `_is_text_message_format` only accepts `"text"`. For OpenAI-compatible endpoints that don’t support the content-parts schema, this can still fail even though the payload is effectively text-only. Consider also treating `"error"` as text in the text-format detection (or normalizing error pieces to `"text"`) so purely textual conversations continue using the plain string format.
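The suggestion above could be sketched roughly as follows. This is a minimal illustration, not the PyRIT implementation: `Piece`, `TEXT_LIKE_TYPES`, `is_text_only`, and `build_content` are hypothetical stand-ins, and the image handling is simplified to pass the path through unchanged.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a message piece; not the PyRIT MessagePiece class.
@dataclass
class Piece:
    converted_value: str
    converted_value_data_type: str = "text"


# Assumption for this sketch: "error" pieces carry plain text and can be
# serialized exactly like "text" pieces.
TEXT_LIKE_TYPES = {"text", "error"}


def is_text_only(pieces: list[Piece]) -> bool:
    """Return True when every piece can be sent as a plain string."""
    return all(p.converted_value_data_type in TEXT_LIKE_TYPES for p in pieces)


def build_content(pieces: list[Piece]):
    """Use a plain string for text-only input, a content-parts list otherwise."""
    if is_text_only(pieces):
        return "\n".join(p.converted_value for p in pieces)

    content = []
    for p in pieces:
        if p.converted_value_data_type in TEXT_LIKE_TYPES:
            content.append({"type": "text", "text": p.converted_value})
        elif p.converted_value_data_type == "image_path":
            # A real target would encode the image; the path is kept as-is here.
            content.append({"type": "image_url", "image_url": {"url": p.converted_value}})
        else:
            raise ValueError(f"Unsupported data type: {p.converted_value_data_type}")
    return content
```

With this shape, a conversation made only of `text` and `error` pieces stays on the plain-string path, and the content array is used only when real media is present.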
    if isinstance(prompt_result, tuple):
        feedback_text, media_piece = prompt_result
        # Use a shared conversation_id so Message validation passes
        shared_conversation_id = str(uuid.uuid4())
        pieces = [
            MessagePiece(
                original_value=feedback_text,
                role="user",
                conversation_id=shared_conversation_id,
            )
        ]
        if media_piece is not None:
            pieces.append(
                MessagePiece(
                    original_value=media_piece.converted_value,
                    role="user",
                    original_value_data_type=media_piece.converted_value_data_type,
                    conversation_id=shared_conversation_id,
                )
`_generate_next_prompt_async` now builds a multimodal `Message` containing the objective target’s media piece. This can break when the adversarial chat target doesn’t support that data type (notably, `OpenAIChatTarget` only validates `text`/`image_path`/`audio_path` and will reject `video_path`). Consider detecting/whitelisting supported prompt data types for the adversarial target (or catching the validation error) and falling back to a text-only prompt (e.g., include the feedback text + a textual reference to the media path) to preserve backward compatibility.
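One way the whitelist-and-fallback idea might look is sketched below. The `Piece` class, `SUPPORTED_TYPES`, and `build_feedback_pieces` are hypothetical names for illustration only; the supported set simply mirrors the three data types named in the comment above and is not read from any real target.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in for a message piece; not the PyRIT MessagePiece class.
@dataclass
class Piece:
    value: str
    data_type: str = "text"


# Assumed capability set for the adversarial chat target in this sketch.
SUPPORTED_TYPES = frozenset({"text", "image_path", "audio_path"})


def build_feedback_pieces(
    feedback_text: str,
    media_piece: Optional[Piece],
    supported: frozenset = SUPPORTED_TYPES,
) -> list[Piece]:
    """Attach media when the target accepts it, else fall back to text only."""
    if media_piece is None:
        return [Piece(feedback_text)]

    if media_piece.data_type in supported:
        return [Piece(feedback_text), Piece(media_piece.value, media_piece.data_type)]

    # Fallback: keep the prompt text-only and reference the media by path.
    note = f"{feedback_text}\n[Target produced {media_piece.data_type}: {media_piece.value}]"
    return [Piece(note)]
```

Checking capabilities up front keeps the text-only path working for targets that would otherwise reject the message during validation.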
Address Roman's feedback items #2 and #3:
- Change `_build_adversarial_prompt` to return `Message` instead of Union type
- Extract message construction logic into separate helper methods
- Add `_build_text_message()` for simple text prompts
- Add `_build_multimodal_message()` for media responses
- Simplify caller code by removing tuple handling logic
- Improve logging to work with `Message` objects

These architectural improvements prepare the code to integrate with the modality support detection system from separate PR.
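The helper split described in this commit could look roughly like the sketch below. The commit names the helpers but not their signatures, so the parameters here are assumptions, and `Piece` and `Message` are simplified stand-ins rather than the PyRIT classes. The shared `conversation_id` follows the inline comment in the reviewed diff.

```python
import uuid
from dataclasses import dataclass


# Simplified stand-ins for illustration; not the PyRIT MessagePiece/Message classes.
@dataclass
class Piece:
    original_value: str
    role: str
    conversation_id: str
    original_value_data_type: str = "text"


@dataclass
class Message:
    pieces: list[Piece]

    def __post_init__(self) -> None:
        # Assumed validation rule: all pieces must share one conversation_id.
        if not self.pieces:
            raise ValueError("Message needs at least one piece")
        if len({p.conversation_id for p in self.pieces}) != 1:
            raise ValueError("All pieces must share a conversation_id")


def build_text_message(text: str) -> Message:
    """Wrap a plain text prompt in a single-piece message."""
    return Message([Piece(text, role="user", conversation_id=str(uuid.uuid4()))])


def build_multimodal_message(feedback_text: str, media_value: str, media_type: str) -> Message:
    """Combine scorer feedback and the target's media output in one message."""
    shared_id = str(uuid.uuid4())
    return Message(
        [
            Piece(feedback_text, role="user", conversation_id=shared_id),
            Piece(media_value, role="user", conversation_id=shared_id,
                  original_value_data_type=media_type),
        ]
    )
```

Returning a `Message` from both helpers means the caller no longer needs to branch on a tuple versus a string.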
- Reset doc/ to match origin/main (flat numbered notebook structure)
- Remove old attack/, workflow/, benchmark/, promptgen/ subdirectory notebooks
- Add doc/code/executor/8_modality_feedback.py/.ipynb: two-seed Crescendo modality-feedback example (roakey + sailboat, hybrid capability profile)
- Update 0_executor.md and myst.yml to include notebook #8 in navigation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icter scorer for multi-turn demo
- Section 6 now uses `IPyImage(data=bytes)` to embed all images directly in the notebook so they render without re-running (no more unresolvable paths).
- Replaced custom adversarial `system_prompt` with `SeedPrompt` loaded from the built-in crescendo/image_generation.yaml, which has proper multi-turn escalation (starts simple, builds up) and forces 2-4 turns instead of 1.
- Fixed image_generation.yaml JSON response keys: renamed `generated_question` -> `next_message` and `rationale_behind_jailbreak` -> `rationale` to match what `CrescendoAttack` expects.
- Tightened `SelfAskTrueFalseScorer` `true_description` to require ALL five visual elements simultaneously, making single-turn success unlikely.
- Added `EXECUTOR_SEED_PROMPT_PATH` and `SeedPrompt` imports.
- Removed unused `MarkdownConversationMemoryPrinter` and `IPythonMarkdownSink`.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…l naming
- Tighten modality notebook objective + scorer criteria to preserve the seeded raccoon identity.
- Regenerate 8_modality_feedback.ipynb outputs from the updated notebook source.
- Strengthen Crescendo image_generation guidance for seeded non-human anchors and aligned rationale key naming.
- Rename `ModalityFeedbackRouter` constructor keyword from `adversarial_target` to `adversarial_chat`.
- Rename property `objective_requires_media_on_first_turn` to `objective_target_requires_media_on_first_turn`.
- Update all affected multi-turn attack callsites and unit tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Description
When the red teaming attack loop targets an image/video generator (e.g. DALL-E), the adversarial chat (e.g. GPT-4o) never sees the generated media - only the scorer's text feedback is sent back. The adversarial LLM refines its prompts blindly.
This PR makes media outputs flow through the feedback loop as multimodal messages (text + image/video), so the adversarial chat can see what the target actually produced and give better-informed follow-up prompts.
Before: Target → image → Scorer → "missing a hat" (text only) → Adversarial LLM
After: Target → image → Scorer → "missing a hat" + image → Adversarial LLM
Microsoft Security Hackathon 2026, project #6a. @romanlutz for visibility.
Changes
`red_teaming.py` - core change (3 methods):
- `_handle_adversarial_file_response` now returns a `(feedback_text, media_piece)` tuple instead of just a string
- `_build_adversarial_prompt` returns `Union[str, tuple]` - string for text responses, tuple for media
- `_generate_next_prompt_async` builds a multimodal `Message` with text + media pieces when a tuple is returned

`openai_chat_target.py` - 1-line bug fix:
- Pieces whose `converted_value_data_type` is `error` crashed multimodal message building with `ValueError: Multimodal data type error is not yet supported`. They are now treated as text alongside `"text"`.

No new files, classes, or dependencies. Fully backward compatible - text-only workflows are unchanged.
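The return-type contract described above, a string for text responses and a tuple for media, can be illustrated with a small sketch. The function names, parameters, and the string-based media placeholder are assumptions made for this example; the real methods operate on PyRIT context and message objects.

```python
from typing import Optional, Union

# Assumed shape of the return value: plain text, or (feedback, media reference).
PromptResult = Union[str, tuple[str, Optional[str]]]


def build_adversarial_prompt(response_type: str, response_value: str, feedback: str) -> PromptResult:
    """Return a string for text responses and a tuple for media responses."""
    if response_type == "text":
        return response_value
    # Media response: pair the scorer feedback with the media it refers to.
    return feedback, response_value


def describe_pieces(result: PromptResult) -> list[str]:
    """Mimic the caller: one piece for text, text + media for a tuple."""
    if isinstance(result, tuple):
        feedback_text, media = result
        parts = [f"text:{feedback_text}"]
        if media is not None:
            parts.append(f"media:{media}")
        return parts
    return [f"text:{result}"]
```

The `isinstance(result, tuple)` check is what keeps text-only workflows on their original path while media responses gain a second piece.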
Tests and Documentation
`test_red_teaming.py` - 5 new tests, 2 updated:
- `test_generate_next_prompt_sends_multimodal_message_for_image_response` - verifies 2-piece multimodal message construction (text + image)
- `test_generate_next_prompt_sends_multimodal_message_for_video_response` - same for video
- `test_generate_next_prompt_text_response_stays_text_only` - regression test ensuring text path is unaffected
- `test_build_adversarial_prompt_returns_tuple_for_image_response` - return type assertion
- `test_build_adversarial_prompt_returns_str_for_text_response` - return type assertion
- Updated tests for `_handle_adversarial_file_response`

All 161 tests pass (70 red_teaming + 91 chat_target). JupyText not run - no notebook changes in this PR.