diff --git a/doc/code/framework.md b/doc/code/framework.md index 6425557ffa..e2d0ee9af6 100644 --- a/doc/code/framework.md +++ b/doc/code/framework.md @@ -10,14 +10,19 @@ Learn how to use PyRIT's components to build red teaming workflows. Load, create, and manage seed datasets for red teaming campaigns. :::: -::::{card} ⚔️ Attacks & Executors +::::{card} ⚔️ Attacks :link: ./executor/0_executor Run single-turn and multi-turn attacks — Crescendo, TAP, Skeleton Key, and more. :::: -::::{card} 🔌 Targets -:link: ./targets/0_prompt_targets -Connect to OpenAI, Azure, Anthropic, HuggingFace, HTTP endpoints, and custom targets. +::::{card} 🧩 Attack Techniques +:link: ./scenarios/0_attack_techniques +Package a configured attack — role-play, many-shot, crescendo, a jailbreak template — as a reusable, named recipe. +:::: + +::::{card} 📋 Scenarios +:link: ./scenarios/0_scenarios +Run standardized evaluation scenarios at scale across harm categories. :::: ::::{card} 🔄 Converters @@ -25,34 +30,39 @@ Connect to OpenAI, Azure, Anthropic, HuggingFace, HTTP endpoints, and custom tar Transform prompts with text, audio, image, and video converters. :::: +::::{card} 🔌 Targets +:link: ./targets/0_prompt_targets +Connect to OpenAI, Azure, Anthropic, HuggingFace, HTTP endpoints, and custom targets. +:::: + ::::{card} 📊 Scoring :link: ./scoring/0_scoring Evaluate AI responses with true/false, Likert, classification, and custom scorers. :::: +::::{card} 🗂️ Registry +:link: ./registry/0_registry +Register and discover targets, scorers, and converters via class and instance registries. +:::: + +::::{card} 🖨️ Output +:link: ./output/0_output +Render attack results, scenario results, conversations, and scores to terminal, files, or Jupyter. +:::: + ::::{card} 💾 Memory :link: ./memory/0_memory Track conversations, scores, and attack results with SQLite or Azure SQL. 
:::: -::::{card} ⚙️ Setup & Configuration +::::{card} ⚙️ Setup :link: ./setup/0_setup Initialize PyRIT, configure defaults, and manage resiliency settings. :::: -::::{card} 📋 Scenarios -:link: ./scenarios/0_scenarios -Run standardized evaluation scenarios at scale across harm categories. -:::: - -::::{card} 🗂️ Registry -:link: ./registry/0_registry -Register and discover targets, scorers, and converters via class and instance registries. -:::: - -::::{card} 🖨️ Output -:link: ./output/0_output -Render attack results, scenario results, conversations, and scores to terminal, files, or Jupyter. +::::{card} 📓 Framework Documentation +:link: ../contributing/7_notebooks +Keep the component notebooks concise and executable, showing how the framework is used. :::: ::::: @@ -63,69 +73,251 @@ The sections above link to detailed guides for each component. The architecture # Architecture -The main components of PyRIT are prompts, attacks, converters, targets, and scoring. The best way to contribute to PyRIT is by contributing to one of these components. +The main components of PyRIT are datasets, targets, converters, scoring, and attacks — together with the attack techniques and scenarios that combine them. The best way to contribute to PyRIT is by contributing to one of these components. ![alt text](../../assets/architecture_components.png) As much as possible, each component is a pluggable brick of functionality. Prompts from one attack can be used in another. An attack for one scenario can use multiple targets. And sometimes you completely skip components (e.g. almost every component can be a NoOp also, you can have a NoOp converter that doesn't convert, or a NoOp target that just prints the prompts). -If you are contributing to PyRIT, that work will most likely land in one of these buckets and be as self-contained as possible. It isn't always this clean, but when an attack scenario doesn't quite fit (and that's okay!) 
it's good to brainstorm with the maintainers about how we can modify our architecture. +Each section below states what a component **owns** and, just as importantly, what it **does not own** (with a pointer to the component that does). If you are contributing to PyRIT, that work will most likely land in one of the core component buckets and be as self-contained as possible. It isn't always this clean, but when an attack scenario doesn't quite fit (and that's okay!) it's good to brainstorm with the maintainers about how we can modify our architecture. Also, if any item under a component's **Framework Plans** would help your work, please open an issue! + +# Core Components + +## [Datasets](./datasets/0_dataset) + +**Source**: `pyrit/datasets/` (providers); seed/prompt types in `pyrit/models/seeds/`. + +**Responsibility**: Provide a single place to define and manage the inputs to an attack: prompts, jailbreak templates, source images, attack strategies, and similar seeds. + +- New datasets can be added in the dataset module. +- Dataset providers load seeds into memory; components then retrieve them from memory. Providers are not queried directly at attack time. +- Most components should work with seeds passed directly in (except scenarios, which may package them from memory). Never reach for dataset providers, file paths, etc. inside a component; either pass the seed as an argument or retrieve it from memory. + +**Does NOT own**: + +- Persisting or looking up seeds at run time: that is Memory. +- Deciding which seeds to run: that is a Scenario. + +**Framework Plans**: + +- There is some churn here. We haven't managed datasets much at scale, and we may have to redefine how they work. +- We want more investment in managing datasets and loading them more intelligently. +- We need to pass seeds or use memory more consistently. + +**Contributing (difficulty: easy)**: Are there more prompts and jailbreak templates you can add for scenarios you're testing for? It is easy to add new dataset providers. 
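The provider-to-memory flow described in the Datasets bullets can be sketched roughly as follows. This is an illustration only, not PyRIT's API: `ToyMemory`, `load_toy_dataset`, and `run_component` are hypothetical names invented for this example. The point is that the provider runs once at setup time, and the component only ever sees seeds handed to it.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMemory:
    """Hypothetical stand-in for memory: seeds grouped by dataset name."""
    seeds: dict = field(default_factory=dict)

    def add_seeds(self, dataset_name, values):
        self.seeds.setdefault(dataset_name, []).extend(values)

    def get_seeds(self, dataset_name):
        # Return a copy so callers cannot mutate the stored seeds.
        return list(self.seeds.get(dataset_name, []))


def load_toy_dataset(memory):
    """Hypothetical provider: runs once, up front, and writes into memory."""
    memory.add_seeds("toy_harms", ["objective one", "objective two"])


def run_component(seeds):
    """A component takes seeds as an argument; it never calls the provider."""
    return [f"processed: {seed}" for seed in seeds]


memory = ToyMemory()
load_toy_dataset(memory)                                 # setup time
outputs = run_component(memory.get_seeds("toy_harms"))   # attack time
print(outputs)
```

Because `run_component` has no reference to the provider or to any file path, it can be reused with seeds from any source.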
+ +## [Attacks](./executor/0_executor) + +**Source**: `pyrit/executor/attack/`. + +**Responsibility**: Own the *algorithm and control flow* of achieving a single objective — managing the conversation between objective and adversarial targets, and using datasets, converters, and scorers along the way. + +- Any branching decision (i.e. the next step depends on a previous result) belongs in an attack. +- An attack should branch based on a **scorer result**, never on a raw target response directly (e.g. "was this prompt blocked?" is a scorer's job, not an attack's). +- Attacks use scoring and target capabilities implicitly, and should support multi-modal. +- An attack may ship with sensible **defaults**, but it should always **accept** (never hard-code) the pieces a technique configures: scorers, datasets/seeds (fed to the objective target as `prepended_conversation` and `next_message`), targets (objective and adversarial), and converters. Exposing these as parameters is what lets the attack be packaged as an Attack Technique. +- Compound attacks are possible, combining different attacks in different ways. + +**Does NOT own**: + +- Interpreting a raw target response — that is Scoring. +- The specific configuration of prompts, converters, and strategy used — that is an Attack Technique. +- Choosing which attacks or techniques to run, or running them at scale — that is a Scenario. + +**Framework Plans**: + +- We need to move some older attacks that don't belong here. Many (e.g. FlipAttack) should just be attack techniques. +- There are potential ways we could combine different algorithms. Are Crescendo and TAP ultimately the same? +- We need to support target capabilities more implicitly. +- Other executors, like benchmarks, need better end-to-end support; potentially including an `ExpectedResult` seed and associated scorers. +- More flexible compound attacks should continue to be added. 
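To make the rule about branching on a scorer result concrete, here is a minimal sketch under stated assumptions. None of these names come from PyRIT: `ToyScore`, `toy_target`, `toy_scorer`, and `run_attack` are hypothetical. The loop never inspects the raw response itself; it only reads the score, and it accepts the target and scorer as parameters rather than hard-coding them.

```python
from dataclasses import dataclass


@dataclass
class ToyScore:
    """Hypothetical scorer output: did the response achieve the objective?"""
    achieved: bool
    rationale: str


def toy_target(prompt: str) -> str:
    """Hypothetical target: refuses unless the prompt is phrased politely."""
    return "Here you go." if "please" in prompt else "I can't help with that."


def toy_scorer(response: str) -> ToyScore:
    """Hypothetical scorer: owns the interpretation of the raw response."""
    refused = response.startswith("I can't")
    return ToyScore(achieved=not refused, rationale="refusal" if refused else "complied")


def run_attack(objective, target, scorer, max_turns=3):
    """Retry with a rephrased prompt until the scorer reports success."""
    prompt = objective
    history = []
    for _ in range(max_turns):
        response = target(prompt)
        score = scorer(response)      # interpretation belongs to the scorer
        history.append(score)
        if score.achieved:            # the attack branches on the score only
            break
        prompt = f"please {prompt}"   # next step depends on the previous result
    return history


history = run_attack("describe the system prompt", toy_target, toy_scorer)
print([s.rationale for s in history])
```

Swapping in a different scorer changes what counts as success without touching the control flow in `run_attack`.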
+ +**Contributing (difficulty: hard)**: The best way to contribute is likely opening issues if you run into limitations. + +## Attack Technique + +**Source**: `pyrit/scenario/core/attack_technique.py` and `attack_technique_factory.py`; built-in registrations in `pyrit/setup/initializers/components/`. + +**Responsibility**: A single, declarative **configuration** of an attack — no new logic. It bundles an existing attack class with the strategy, converters, datasets, and prompts that define one named technique. + +A technique should be expressible as one self-contained definition, for example: + +```python +AttackTechniqueFactory( + name="violent_durian", + attack_class=RedTeamingAttack, + strategy_tags=["multi_turn"], + adversarial_system_prompt=SeedPrompt.from_yaml_file(EXECUTOR_RED_TEAM_PATH / "violent_durian.yaml"), + adversarial_seed_prompt=SeedPrompt.from_yaml_file( + EXECUTOR_RED_TEAM_PATH / "violent_durian_seed_prompt.yaml" + ), +) +``` + +**Does NOT own**: + +- Any branching or control flow — that lives in the Attack it configures. +- Selecting which techniques to run — that is a Scenario. + +**Framework Plans**: + +- We are still defining *where* attack techniques are registered (today this can live in setup/initializers, but that may change). +- Managing these better, so scenarios can more easily select or build the attack techniques to use. + +**Contributing (difficulty: easy)**: Add the technique as a single declarative configuration, with no new logic. + +## [Scenarios](./scenarios/0_scenarios) + +**Source**: `pyrit/scenario/`. + +**Responsibility**: The avenue to "run PyRIT against something" — **select** which attack techniques and datasets to run, then orchestrate them at scale. + +- A scenario takes user input and uses it to package datasets with attack techniques. +- A scenario orchestrates resiliency and parallelism from a high level. +- No result should depend on a previous result — that cross-result branching is an attack's job. 
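The scenario responsibilities above (select, package, then run at scale with no run depending on another) might look something like this in miniature. This is a sketch with made-up names, not PyRIT's scenario API: `run_with_retries`, `run_scenario`, and the lambda "techniques" are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_with_retries(technique, objective, max_attempts=2):
    """Resiliency lives at the scenario level: retry a failed run as a whole."""
    for attempt in range(1, max_attempts + 1):
        try:
            return technique(objective)
        except RuntimeError:
            if attempt == max_attempts:
                raise


def run_scenario(techniques, objectives, max_workers=4):
    """Run every (technique, objective) pair independently and in parallel."""
    pairs = list(product(techniques.items(), objectives))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(run_with_retries, fn, objective)
            for (_name, fn), objective in pairs
        ]
        results = [f.result() for f in futures]
    # No run reads another run's result, so ordering between runs is irrelevant.
    return {(name, obj): res for ((name, _fn), obj), res in zip(pairs, results)}


# Hypothetical stand-ins for configured attack techniques.
techniques = {
    "upper": lambda objective: objective.upper(),
    "reverse": lambda objective: objective[::-1],
}
results = run_scenario(techniques, ["abc", "xyz"])
print(results[("reverse", "abc")])
```

Any logic of the form "if this run failed, try that technique next" would be cross-result branching, which per the section above belongs inside an attack rather than here.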
+ +**Does NOT own**: + +- Per-objective branching or conversation logic — that is an Attack. +- The internal configuration of a technique — that is an Attack Technique. + +**Framework Plans**: + +- Scenarios are new enough that we are still discovering patterns and limitations, so they will be refactored regularly. + +**Contributing (difficulty: medium)**: Is there a scanner that does something PyRIT doesn't? Add it as a scenario. Because we're still changing how this works, it is less well-defined than other areas. + +## [Converters](./converters/0_converters) + +**Source**: `pyrit/prompt_converter/`. + +**Responsibility**: Convert a prompt into something else. Converters can be stacked and combined, and can be as varied as translating a text prompt into a Word document, rephrasing a prompt, or adding a text overlay to an image. + +**Does NOT own**: + +- Deciding *when* to apply a conversion, or branching on the result — that is an Attack. + +**Framework Plans**: + +- We want to refactor our converter pipeline; some things that should be converters (e.g. partial converting) may be postponed. This is supported but could be much more dynamic. + +**Contributing (difficulty: easy)**: The existing pattern is well-defined. Are there ways prompts can be converted that would be useful for an attack? + +## [Target](./targets/0_prompt_targets.md) + +**Source**: `pyrit/prompt_target/`; message shaping in `pyrit/message_normalizer/`. + +**Responsibility**: "The thing we're sending the prompt to." Many other components use it, including scorers, attacks, and converters. + +- This is often an LLM, but it doesn't have to be. For Cross-Domain Prompt Injection Attacks, the prompt target might be a storage account that a later prompt target has a reference to. Message and conversation should be generic enough to carry this extra data. +- Target capabilities are used to check whether a target is compatible with what the other components want to do. 
+- Targets use `message_normalizer` together with target capabilities to transform `Messages` into the formats a given target supports. +- Because targets are so varied, it is reasonable to return multiple tool calls, or none at all. +- One attack can have many prompt targets (and converters and scorers can use prompt targets too, to convert or score). + +**Framework Plans**: + +- Better agent support may require extra pieces attached to a Message. +- Better surface support may require expanding the return types. + +**Contributing (difficulty: easy)**: + +- The pattern is well-defined. +- Are there models you want to use at any stage or for different attacks? And could your model simply be one of the existing targets? + +## [Scoring](./scoring/0_scoring.ipynb) + +**Source**: `pyrit/score/`. + +**Responsibility**: Give feedback to the attack on what happened with a prompt, from "was this prompt blocked?" to "was our objective achieved?". Scoring owns the *interpretation* of a response; every decision an attack makes is based on a scorer result. + +**Does NOT own**: + +- Acting on a score: branching, retrying, or stopping is the Attack's job. + +**Framework Plans**: + +- Scorers will be refactored to be more generic, so they can determine more general results (does a file exist? was a tool called?). + +**Contributing (difficulty: easy)**: + +- The pattern is well-defined. +- You can evaluate how accurate probabilistic scorers are and likely make them more accurate. +- Is there data you want to use to make decisions or analyze? + +# Core Library + +The modules below are the supporting library the core components are built on. + +## [Registry](./registry/0_registry) + +**Source**: `pyrit/registry/`. + +**Responsibility**: Build and store the core components: the **construction** side of the framework. -The remainder of this document talks about the different components, how they work, what their responsibilities are, and ways to contribute. 
+- If you are creating a component from user input (e.g. via config, REST, or automatically), it should go through the registry. +- If you are storing an instance of a component, it should use the registry. +**Does NOT own**: -## Datasets: Prompts, Jailbreak Templates, Source Images, Attack Strategies, etc. +- Defining the *shape* of a component or its identifier — that is Models. -The first piece of an attack is often a dataset piece, like a prompt. "Tell me how to create a Molotov cocktail" is an example of a prompt. PyRIT is a good place to have a library of things to check for. +## Models -Ways to contribute: Check out our documentation on [seed datasets](./datasets/0_dataset.md); are there more prompts and jailbreak templates you can add that include scenarios you're testing for? +**Source**: `pyrit/models/` (including `pyrit/models/identifiers/`). -## Attacks +**Responsibility**: A lightweight module where core types are defined — the **description** side of the framework. These types should be used wherever possible to prevent drift. -Attacks are responsible for putting all the other pieces together. They make use of all other components in PyRIT to execute an attack technique end-to-end. -PyRIT supports single-turn (e.g. Many Shot Jailbreaks [@anthropic2024manyshot], Role Play, Skeleton Key [@microsoft2024skeletonkey]) and multi-turn attack strategies (e.g. Tree of Attacks [@mehrotra2023tap], Crescendo [@russinovich2024crescendo]), and compound strategies (e.g. `SequentialAttack`) for chaining several techniques against a single objective. +- If you are creating a class that overlaps heavily with another, or using a dict to serialize across boundaries, consider whether you can use or move it into `pyrit.models`. +- Models includes `identifiers`, which describe the core components; together with the registry, an identifier can often recreate the component it describes. +- Models includes the types passed between components, and should be preferred in REST. 
+- Models should never depend on anything outside `pyrit.common` (which itself shouldn't depend on anything). -Ways to contribute: Check out our [attack docs](./executor/0_executor.md). There are hundreds of attacks outlined in research papers. A lot of these can be captured within PyRIT. If you find an attack that doesn't fit the attack model please notify the team. Are there scenarios you can write attack modules for? +## [Output](./output/0_output) -## Converters +**Source**: `pyrit/output/`. -Converters are a powerful component that converts prompts to something else. They can be stacked and combined. They can be as varied as translating a text prompt into a Word document, rephrasing a prompt in 100 different ways, or adding a text overlay to an image. +**Responsibility**: Render finished components — attack results, scenario results, conversations, and scores — to different surfaces (terminal, files, Jupyter). Output is invoked directly by the CLI and in notebooks; the components it renders do not call into it. -Ways to contribute: Check out our [converter docs](./converters/0_converters.ipynb). Are there ways prompts can be converted that would be useful for an attack? +**Does NOT own**: -## Target +- Live, in-run progress printing — that belongs to the scenario's own printer. -A Prompt Target can be thought of as "the thing we're sending the prompt to". +## Backend -This is often an LLM, but it doesn't have to be. For Cross-Domain Prompt Injection Attacks, the Prompt Target might be a Storage Account that a later Prompt Target has a reference to. +**Source**: `pyrit/backend/`. -One attack can have many Prompt Targets (and in fact, converters and Scoring Engine can also use Prompt Targets to convert/score the prompt). +**Responsibility**: Expose PyRIT through a REST API for the frontend and other clients. 
The backend owns presentation-specific logic and models — request/response shapes, mapping, and HTTP concerns — but should still use `pyrit.models` and the registry wherever it can. -Ways to contribute: Check out our [target docs](./targets/0_prompt_targets.md). Are there models you want to use at any stage or for different attacks? +- The backend may define its own presentation models, but where a `pyrit.models` type already exists it should reuse that type rather than redefine it. +- Components should be constructed through the registry, not built directly in the backend. +**Does NOT own**: -## Scoring Engine +- The shape of core types — that is Models. +- Constructing or storing components — that is the Registry. -The scoring engine is a component that gives feedback to the attack on what happened with the prompt. This could be as simple as "Was this prompt blocked?" or "Was our objective achieved?" +## [Memory](./memory/0_memory.md) -Ways to contribute: Check out our [scoring docs](./scoring/0_scoring.ipynb). Is there data you want to use to make decisions or analyze? +**Source**: `pyrit/memory/`. -## Memory +**Responsibility**: The canonical store that components read from and write to — seeds, conversations, scores, and attack results. When a component needs more than what is passed in, it goes through memory. -One important thing to remember about this architecture is its swappable nature. Prompts and targets and converters and attacks and scorers should all be swappable. But sometimes one of these components needs additional information. If the target is an LLM, we need a way to look up previous messages sent to that session so we can properly construct the new message. If the target is a blob store, we need to know the URL to use for a future attack. +One important thing to remember about this architecture is its swappable nature. Prompts, targets, converters, attacks, and scorers should all be swappable. 
But sometimes one of these components needs additional information — if the target is an LLM, we need a way to look up previous messages sent to that session so we can construct the new message; if the target is a blob store, we need the URL to use for a future attack. Memory is where that shared state lives. -For more details about memory configuration, please follow the guide in [memory](./memory/0_memory.md). +## [Setup](./setup/0_setup) -Memory modifications and contributions should usually be designed with the maintainers. +**Source**: `pyrit/setup/`. -## The Flow +**Responsibility**: Initialize PyRIT and configure framework-wide defaults — memory selection, default targets, and resiliency settings. -To some extent, the ordering in this diagram matters. In the simplest cases, you have a prompt, an attack takes the prompt, uses prompt normalizer to run it through converters and send to a target, and the result is scored. +- Setup wires up the environment a run depends on; it does not implement attack behavior. -But this simple view is complicated by the fact that an attack can have multiple targets, converters can be stacked, scorers can use targets to score, etc. +## [Framework Documentation](../contributing/7_notebooks.md) -Sometimes, if a scenario requires specific data, we may need to modify the architecture. This happened recently when we thought a single target may take multiple prompts separately in a single request. Any time we need to modify the architecture like this, that's something that needs to be designed with the maintainers so we can consolidate our other supported scenarios and future plans. +**Source**: `doc/` (component notebooks, e.g. `doc/code/`). -## Notebooks +**Responsibility**: Show how the framework is used, concisely. -For all their power, attacks should still be generic. A lot of our front-end code and operators use Notebooks to interact with PyRIT. This is fantastic, but most new logic should not be notebooks. 
Notebooks should mostly be used for attack setup and documentation. For example, configuring the components and putting them together is a good use of a notebook, but new logic for an attack should be moved to one or more components. +- Notebooks that contain code should be executable. +- Notebooks should execute quickly.