FEAT: DatasetConfiguration Refactor by rlundeen2 · Pull Request #2071 · microsoft/PyRIT

rlundeen2 · 2026-06-22T22:07:11Z

DatasetConfiguration worked well for the first scenarios but strained as we added more: garak.encoding needs seed prompts, jailbreak needs two datasets (the harms and the jailbreaks themselves), psychosocial ties datasets to techniques, and other Garak scenarios want more flexible types. This PR refactors DatasetConfiguration to fit these diverse scenarios while simplifying the overall structure.

How it works now:

On-demand fetching via the provider — if a configured dataset isn't in memory, it's fetched from the registered SeedDatasetProvider and stored, so you no longer need the load_default_datasets initializer on startup unless you want to preload.
Simpler structure — a single DatasetConfiguration base plus DatasetAttackConfiguration (the default most scenarios use); the earlier typed subclasses and parallel get_seeds getters were dropped.
Compound datasets — CompoundDatasetAttackConfiguration combines several configs and samples across them, so a scenario like encoding or jailbreak can draw from multiple datasets at once (with per-dataset or global max_dataset_size).
Composable validators — constraints run against the fully resolved dataset before max_dataset_size sampling, so they describe the dataset itself rather than a sampled subset. Non-emptiness is enforced by default; scenarios can add others (seed type, harm categories, allowed names, inline-vs-named).
Inline or named seeds — each resolved dataset carries a DatasetSourceKind (inline vs. memory), letting a scenario accept seeds directly (e.g. CLI --objectives) or require named datasets.
Loud failures — resolution and fetch errors raise DatasetConstraintError naming the dataset and the reason instead of silently warning or returning empty.

Scenarios resolve through two getters — get_seed_attack_groups_async and get_attack_groups_by_dataset_async. All scenario unit tests pass.
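The "validate before sampling" ordering from the list above can be sketched as follows. This is a minimal illustration, not PyRIT's implementation: `ResolvedDataset` here is a simplified stand-in, `resolve` is a hypothetical helper, and the validator bodies are assumptions. Only the names `DatasetConstraintError`, `require_nonempty`, `require_min_size`, and `max_dataset_size` come from this PR.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


class DatasetConstraintError(ValueError):
    """Stand-in for the error type named in this PR (not PyRIT's class)."""


@dataclass(frozen=True)
class ResolvedDataset:
    """Hypothetical, simplified view of a fully resolved dataset."""

    name: str
    seeds: tuple


Validator = Callable[[ResolvedDataset], None]


def require_nonempty() -> Validator:
    def check(dataset: ResolvedDataset) -> None:
        if not dataset.seeds:
            raise DatasetConstraintError(f"dataset '{dataset.name}': resolved to zero seeds")

    return check


def require_min_size(minimum: int) -> Validator:
    def check(dataset: ResolvedDataset) -> None:
        if len(dataset.seeds) < minimum:
            raise DatasetConstraintError(
                f"dataset '{dataset.name}': has {len(dataset.seeds)} seeds, needs at least {minimum}"
            )

    return check


def resolve(
    dataset: ResolvedDataset,
    validators: Sequence[Validator] = (),
    max_dataset_size: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> list:
    # Non-emptiness always runs, followed by any scenario-specific validators.
    # All of them see the full dataset, never the sampled subset.
    for validate in (require_nonempty(), *validators):
        validate(dataset)
    seeds = list(dataset.seeds)
    # Sampling happens only after every validator has passed.
    if max_dataset_size is not None and len(seeds) > max_dataset_size:
        seeds = (rng or random.Random()).sample(seeds, max_dataset_size)
    return seeds
```

With this ordering, a five-seed dataset passes `require_min_size(4)` even when `max_dataset_size=2`, because the constraint describes the dataset rather than the sample drawn from it.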

Restructure DatasetConfiguration into a generic base plus typed subclasses (DatasetObjectiveConfiguration, DatasetPromptConfiguration, DatasetAttackConfiguration). Constraints are expressed through composable validators run against the fully resolved dataset (pre-sampling), and the resolved set carries a DatasetSourceKind (inline vs memory) so validators can require or forbid inline seeds. Auto-fetch missing datasets from the provider on demand and raise loudly (with chained root cause) instead of silently warning. Non-emptiness is enforced as a default validator on every config. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Scenarios now fetch their datasets from the registered provider on demand the first time they run, so the load_default_datasets initializer is no longer required for everyday runs or in the recommended default config. Remove it from the default config examples and per-run scanner command examples, and add a note explaining it is now an optional preload step (useful for warming memory or populating a database for offline use). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
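A rough sketch of the on-demand fetch path described in these commits: look in memory first, fall back to the provider, store the result, and raise with the root cause chained. `InMemoryStore`, `StaticProvider`, `fetch_async`, and `get_or_fetch_async` are hypothetical names chosen for illustration and are not PyRIT's memory or `SeedDatasetProvider` API.

```python
import asyncio


class DatasetConstraintError(ValueError):
    """Stand-in for the error type named in this PR (not PyRIT's class)."""


class InMemoryStore:
    """Hypothetical memory layer keyed by dataset name."""

    def __init__(self):
        self._data = {}

    def get(self, name):
        return self._data.get(name)

    def put(self, name, seeds):
        self._data[name] = list(seeds)


class StaticProvider:
    """Hypothetical provider that serves datasets from a dict."""

    def __init__(self, datasets):
        self._datasets = datasets
        self.fetch_count = 0

    async def fetch_async(self, name):
        self.fetch_count += 1
        return self._datasets[name]  # raises KeyError for unknown names


async def get_or_fetch_async(name, memory, provider):
    cached = memory.get(name)
    if cached is not None:
        return cached  # already in memory, so no provider call is made
    try:
        seeds = await provider.fetch_async(name)
    except Exception as exc:
        # Fail loudly: name the dataset and chain the root cause.
        raise DatasetConstraintError(
            f"dataset '{name}' could not be fetched: {exc!r}"
        ) from exc
    memory.put(name, seeds)
    return memory.get(name)
```

Under these assumptions, a preload step only warms the store ahead of time; the first call for a dataset that was not preloaded costs one provider fetch, and later calls are served from memory.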

Add a dataset_names tuple to ResolvedDataset plus a restrict_dataset_names validator so scenarios that pair techniques with specific datasets (e.g. psychosocial, jailbreak) can constrain which datasets a configuration may draw from. Replace the EXPLICIT_SEED_GROUPS_KEY dict-overload with an explicit inline-vs-named split. Named resolution (_collect_named_seeds_async) now returns only real dataset names, and get_seeds_async resolves inline data through a dedicated branch, removing the reserved-key collision guards. Inline data keeps a single honest INLINE_DATASET_NAME label in the by-dataset views (used for atomic-attack naming), so user-facing labels read 'technique_inline' instead of leaking the old sentinel. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
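To make the name restriction and the inline-vs-named split concrete, here is a small sketch. The field layout of `ResolvedDataset`, the enum values, and the validator bodies are assumptions for illustration; only the names `dataset_names`, `DatasetSourceKind`, `restrict_dataset_names`, and `forbid_inline_seeds` appear in this PR.

```python
from dataclasses import dataclass
from enum import Enum


class DatasetConstraintError(ValueError):
    """Stand-in for the error type named in this PR (not PyRIT's class)."""


class DatasetSourceKind(Enum):
    INLINE = "inline"  # seeds passed directly, e.g. from the CLI
    MEMORY = "memory"  # seeds resolved from named datasets


@dataclass(frozen=True)
class ResolvedDataset:
    """Hypothetical, simplified shape of a resolved dataset."""

    dataset_names: tuple
    source_kind: DatasetSourceKind
    seeds: tuple


def restrict_dataset_names(allowed):
    allowed_set = frozenset(allowed)

    def check(resolved: ResolvedDataset) -> None:
        unexpected = sorted(set(resolved.dataset_names) - allowed_set)
        if unexpected:
            raise DatasetConstraintError(
                f"datasets {unexpected} are not allowed; expected a subset of {sorted(allowed_set)}"
            )

    return check


def forbid_inline_seeds():
    def check(resolved: ResolvedDataset) -> None:
        if resolved.source_kind is DatasetSourceKind.INLINE:
            raise DatasetConstraintError("inline seeds are not accepted; use a named dataset")

    return check
```

Because the source kind travels with the resolved data, a validator can reject inline seeds without relying on a reserved dataset-name sentinel.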

Remove DatasetObjectiveConfiguration, DatasetPromptConfiguration, and the generic DatasetConfiguration[SeedT] flat resolver (get_seeds_async). None had production callers; the objectives-only constraint that motivated the typed subclasses is enforced at runtime per-technique, not at the dataset level. DatasetConfiguration is now a plain (non-generic) base of resolution/fetch/validate/sample plumbing plus the deprecated legacy getters, and DatasetAttackConfiguration remains the one concrete resolver scenarios use. Tests that exercised the removed flat resolver now drive the same base resolution through DatasetAttackConfiguration. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

romanlutz · 2026-06-23T12:16:25Z

I'm a bit worried about datasets we pulled into DB in the past that need updates (eg in the context of standardizing harm categories, or fixing metadata fields). Similarly, what happens to live datasets like threat feeds (think Odin or others).

How do we make sure they get refreshed?

rlundeen2 · 2026-06-23T21:16:07Z

Good idea to include. I would keep it separate from this PR, but there are a couple of solutions I like:

1. We could add an init parameter to this class that always refreshes.
2. We could add an initializer that refreshes existing datasets that haven't been refreshed in X days (or refreshes all of them if X is 0).

I like #2 more; I created an item in proposed and linked to this.
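For illustration only, the selection rule behind the second option might look like the sketch below. Nothing here exists in PyRIT: `datasets_needing_refresh`, its parameters, and the idea of tracking a last-refreshed timestamp per dataset are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def datasets_needing_refresh(last_refreshed, max_age_days, now=None):
    """Return the names of datasets whose last refresh is older than max_age_days.

    last_refreshed maps dataset name -> timezone-aware datetime (hypothetical bookkeeping).
    max_age_days == 0 means "refresh everything".
    """
    if max_age_days < 0:
        raise ValueError("max_age_days must be >= 0")
    if max_age_days == 0:
        return sorted(last_refreshed)
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return sorted(name for name, stamp in last_refreshed.items() if stamp < cutoff)
```

An initializer built on this rule would then re-fetch each returned dataset from the provider, which would also cover live sources such as threat feeds.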

…ing to a global budget A single DatasetAttackConfiguration now applies max_dataset_size as one global budget for both resolvers (flat and by-dataset), instead of the by-dataset path silently sampling per dataset. Per-dataset budgets are now expressed explicitly by composing children with the new MultiDatasetAttackConfiguration, which takes a list of child configurations (each resolves, validates, and samples itself) and concatenates them, with an optional compound-level cap on top. MultiDatasetAttackConfiguration.per_dataset(dataset_names=[...], max_dataset_size=N) is the compact 'N per dataset' form. encoding now composes two single-dataset children (restoring its prior ~6-group coverage), and rapid_response/text_adaptive use per_dataset to keep their per-dataset budgets. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
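The difference between one global budget and per-dataset budgets is easiest to see with numbers. The two functions below are a hypothetical sketch of the sampling arithmetic only; they are not the `DatasetAttackConfiguration` or `MultiDatasetAttackConfiguration` classes, and their names and signatures are assumptions.

```python
import random


def sample_global(datasets, max_dataset_size, rng):
    """Pool every dataset, then draw one sample of at most max_dataset_size."""
    pooled = [(name, seed) for name, seeds in datasets.items() for seed in seeds]
    if len(pooled) <= max_dataset_size:
        return pooled
    return rng.sample(pooled, max_dataset_size)


def sample_per_dataset(datasets, max_dataset_size, rng, compound_cap=None):
    """Sample each dataset on its own, concatenate, then apply an optional overall cap."""
    combined = []
    for name, seeds in datasets.items():
        chosen = seeds if len(seeds) <= max_dataset_size else rng.sample(seeds, max_dataset_size)
        combined.extend((name, seed) for seed in chosen)
    if compound_cap is not None and len(combined) > compound_cap:
        combined = rng.sample(combined, compound_cap)
    return combined
```

With a 10-seed dataset and a 3-seed dataset, a budget of 4 yields 4 items under the global rule, 7 items (4 + 3) under the per-dataset rule, and 5 items when a compound-level cap of 5 is added on top.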

The four flat-resolver scenarios (encoding, scam, jailbreak, psychosocial) caught DatasetConstraintError, reset seed_groups to [], then re-raised a generic '_raise_dataset_exception' ValueError. Because require_nonempty() is always a default validator, the resolver never returns an empty list -- it raises a DatasetConstraintError naming the offending dataset and reason. The catch only hid that specific cause behind a misleading 'not available or failed to load' message. Remove the swallow so the precise error propagates, drop the now-dead empty guards and the orphaned _raise_dataset_exception helper, and assert the specific error in the affected tests. Psychosocial keeps a loud, specific error when its harm-category filter removes everything. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
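A simplified reconstruction of the pattern this commit removes, next to its replacement. The function names `resolve_seed_groups`, `load_before`, and `load_after` are hypothetical and the generic message is paraphrased from the commit text; the real scenario code is not shown on this page.

```python
class DatasetConstraintError(ValueError):
    """Stand-in for the error type named in this PR (not PyRIT's class)."""


def resolve_seed_groups(name, seeds):
    # Mirrors the always-on non-emptiness check: never returns an empty list.
    if not seeds:
        raise DatasetConstraintError(f"dataset '{name}' resolved to zero seeds")
    return list(seeds)


def load_before(name, seeds):
    # Old shape: swallow the specific error and raise a generic one instead.
    try:
        return resolve_seed_groups(name, seeds)
    except DatasetConstraintError:
        raise ValueError("Dataset is not available or failed to load.") from None


def load_after(name, seeds):
    # New shape: no catch, so the error naming the dataset and reason propagates.
    return resolve_seed_groups(name, seeds)
```

Since the resolver either returns a non-empty list or raises, any empty-list guard after the call is dead code, which is why the commit can drop those guards along with the catch.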

Six validator factories (require_min_size, require_harm_categories, require_seed_type, require_inline_seeds, forbid_inline_seeds, restrict_dataset_names) were exported from pyrit.scenario.core but used nowhere in production -- only in unit tests, which import them directly from the dataset_configuration module. Drop them from the package __init__ exports and __all__ to shrink the public surface. The definitions stay in dataset_configuration.py for scenarios to compose via _default_validators or the validators= parameter. require_nonempty (the always-applied default) remains exported. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…kConfiguration 'Compound' fits PyRIT naming conventions better than 'Multi'. Pure rename across the class definition, exports, scenarios that compose it, and tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…taset-configuration # Conflicts: # .pyrit_conf_example

…taset-configuration # Conflicts: # tests/unit/scenario/airt/test_cyber.py # tests/unit/scenario/airt/test_leakage.py # tests/unit/scenario/scenarios/adaptive/test_text_adaptive.py

…taset-configuration # Conflicts: # tests/unit/backend/test_scenario_run_service.py

rlundeen2 changed the title ~~MAINT: DatasetConfiguration Refactor~~ FEAT: DatasetConfiguration Refactor Jun 22, 2026

rlundeen2 mentioned this pull request Jun 22, 2026

[BREAKING] FEAT: Standardize Jailbreak scenario defaults #2045

Open

rlundeen2 and others added 3 commits June 22, 2026 15:44

romanlutz reviewed Jun 26, 2026

Comment thread pyrit/scenario/scenarios/airt/jailbreak.py Outdated

Comment thread pyrit/scenario/core/dataset_configuration.py

rlundeen2 and others added 5 commits June 29, 2026 18:13

Merge remote-tracking branch 'origin/main' into rlundeen2-redesign-da…

79e4f34

romanlutz approved these changes Jun 30, 2026

Merge remote-tracking branch 'origin/main' into rlundeen2-redesign-da…

c8cbfbf

rlundeen2 enabled auto-merge June 30, 2026 20:47

rlundeen2 added this pull request to the merge queue Jun 30, 2026

github-merge-queue Bot removed this pull request from the merge queue due to a conflict with the base branch Jun 30, 2026

Merge remote-tracking branch 'origin/main' into rlundeen2-redesign-da…

df125a7

rlundeen2 enabled auto-merge June 30, 2026 21:18

rlundeen2 added this pull request to the merge queue Jun 30, 2026

Merged via the queue into microsoft:main with commit fe4383b Jun 30, 2026
53 checks passed

rlundeen2 deleted the rlundeen2-redesign-dataset-configuration branch June 30, 2026 21:51

rlundeen2 merged 11 commits into microsoft:main from rlundeen2:rlundeen2-redesign-dataset-configuration

2 participants
