FEAT: DatasetConfiguration Refactor #2071

Merged
rlundeen2 merged 11 commits into microsoft:main from rlundeen2:rlundeen2-redesign-dataset-configuration
Jun 30, 2026

rlundeen2 commented Jun 22, 2026

Contributor

DatasetConfiguration worked well for the first scenarios but strained as we added more: garak.encoding needs seed prompts, jailbreak needs two datasets (the harms and the jailbreaks themselves), psychosocial ties datasets to techniques, and other Garak scenarios want more flexible types. This PR refactors DatasetConfiguration to fit these diverse scenarios while simplifying the overall structure.

How it works now:

  • On-demand fetching via the provider — if a configured dataset isn't in memory, it's fetched from the registered SeedDatasetProvider and stored, so you no longer need the load_default_datasets initializer on startup unless you want to preload.
  • Simpler structure — a single DatasetConfiguration base plus DatasetAttackConfiguration (the default most scenarios use); the earlier typed subclasses and parallel get_seeds getters were dropped.
  • Compound datasets: CompoundDatasetAttackConfiguration combines several configs and samples across them, so a scenario like encoding or jailbreak can draw from multiple datasets at once (with per-dataset or global max_dataset_size).
  • Composable validators — constraints run against the fully resolved dataset before max_dataset_size sampling, so they describe the dataset itself rather than a sampled subset. Non-emptiness is enforced by default; scenarios can add others (seed type, harm categories, allowed names, inline-vs-named).
  • Inline or named seeds — each resolved dataset carries a DatasetSourceKind (inline vs. memory), letting a scenario accept seeds directly (e.g. CLI --objectives) or require named datasets.
  • Loud failures — resolution and fetch errors raise DatasetConstraintError naming the dataset and the reason instead of silently warning or returning empty.

Scenarios resolve through two getters — get_seed_attack_groups_async and get_attack_groups_by_dataset_async. All scenario unit tests pass.
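
To illustrate the validator ordering described in the bullets above, here is a minimal, self-contained sketch. It does not use the PyRIT API: `Resolved`, `SourceKind`, `forbid_inline`, and `resolve` are hypothetical stand-ins, and the local `DatasetConstraintError` only mirrors the name used in this PR. The point it shows is that validators see the full resolved dataset and sampling happens afterwards.

```python
import random
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Sequence, Tuple


class DatasetConstraintError(ValueError):
    """Local stand-in for the error named in this PR; the real class may differ."""


class SourceKind(Enum):
    """Hypothetical mirror of DatasetSourceKind (inline vs. memory)."""

    INLINE = "inline"
    MEMORY = "memory"


@dataclass(frozen=True)
class Resolved:
    """Hypothetical, simplified view of a fully resolved dataset."""

    name: str
    seeds: Tuple[str, ...]
    source: SourceKind


Validator = Callable[[Resolved], None]


def require_nonempty() -> Validator:
    def check(ds: Resolved) -> None:
        if not ds.seeds:
            raise DatasetConstraintError(f"dataset '{ds.name}' resolved to no seeds")

    return check


def forbid_inline() -> Validator:
    def check(ds: Resolved) -> None:
        if ds.source is SourceKind.INLINE:
            raise DatasetConstraintError(f"dataset '{ds.name}': inline seeds are not allowed")

    return check


def resolve(
    ds: Resolved,
    validators: Sequence[Validator],
    max_size: Optional[int],
    rng: random.Random,
) -> List[str]:
    # Validate the full dataset first, so constraints describe the dataset itself.
    for validate in validators:
        validate(ds)
    # Only then sample down to the budget.
    if max_size is None or max_size >= len(ds.seeds):
        return list(ds.seeds)
    return rng.sample(list(ds.seeds), max_size)
```

Because validation runs before sampling, a constraint such as a minimum size would be checked against every seed in the dataset, not against whichever subset the sampler happened to pick.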

Restructure DatasetConfiguration into a generic base plus typed subclasses
(DatasetObjectiveConfiguration, DatasetPromptConfiguration,
DatasetAttackConfiguration). Constraints are expressed through composable
validators run against the fully resolved dataset (pre-sampling), and the
resolved set carries a DatasetSourceKind (inline vs memory) so validators can
require or forbid inline seeds. Auto-fetch missing datasets from the provider
on demand and raise loudly (with chained root cause) instead of silently
warning. Non-emptiness is enforced as a default validator on every config.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
rlundeen2 changed the title from "MAINT: DatasetConfiguration Refactor" to "FEAT: DatasetConfiguration Refactor" Jun 22, 2026
rlundeen2 and others added 3 commits June 22, 2026 15:44
Scenarios now fetch their datasets from the registered provider on demand
the first time they run, so the load_default_datasets initializer is no
longer required for everyday runs or in the recommended default config.
Remove it from the default config examples and per-run scanner command
examples, and add a note explaining it is now an optional preload step
(useful for warming memory or populating a database for offline use).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
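
The fetch-on-first-use behavior in the commit message above can be sketched roughly as follows. This is an illustration only, not the PyRIT implementation: `OnDemandStore`, its `fetch` callable, and `preload` are hypothetical names, and the local `DatasetConstraintError` is a stand-in for the class named in this PR.

```python
from typing import Callable, Dict, Iterable, List


class DatasetConstraintError(ValueError):
    """Local stand-in for the error named in this PR; the real class may differ."""


class OnDemandStore:
    """Hypothetical in-memory store that fetches a dataset the first time it is requested."""

    def __init__(self, fetch: Callable[[str], List[str]]) -> None:
        self._fetch = fetch  # stands in for the registered provider
        self._memory: Dict[str, List[str]] = {}
        self.fetch_count = 0

    def get(self, name: str) -> List[str]:
        if name in self._memory:
            return self._memory[name]
        self.fetch_count += 1
        try:
            seeds = self._fetch(name)
        except Exception as exc:
            # Fail loudly, naming the dataset and chaining the root cause.
            raise DatasetConstraintError(
                f"dataset '{name}' could not be fetched: {exc!r}"
            ) from exc
        self._memory[name] = seeds
        return seeds

    def preload(self, names: Iterable[str]) -> None:
        # Optional warm-up step, analogous to running an initializer up front.
        for name in names:
            self.get(name)
```

In this sketch a second `get` for the same name is served from memory, so preloading becomes an optimization rather than a requirement.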
Add a dataset_names tuple to ResolvedDataset plus a restrict_dataset_names
validator so scenarios that pair techniques with specific datasets (e.g.
psychosocial, jailbreak) can constrain which datasets a configuration may
draw from.

Replace the EXPLICIT_SEED_GROUPS_KEY dict-overload with an explicit
inline-vs-named split. Named resolution (_collect_named_seeds_async) now
returns only real dataset names, and get_seeds_async resolves inline data
through a dedicated branch, removing the reserved-key collision guards. Inline
data keeps a single honest INLINE_DATASET_NAME label in the by-dataset views
(used for atomic-attack naming), so user-facing labels read 'technique_inline'
instead of leaking the old sentinel.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
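
A rough sketch of the inline-vs-named split and the name restriction described above. Everything here is an assumption made for illustration: the value `"inline"` for `INLINE_DATASET_NAME`, the `seeds_by_dataset` and `attack_label` helpers, and the validator signature are not taken from the PyRIT source.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Optional, Sequence

INLINE_DATASET_NAME = "inline"  # assumed value; the commit message only names the constant


def seeds_by_dataset(
    *,
    named: Dict[str, List[str]],
    dataset_names: Sequence[str] = (),
    inline_seeds: Optional[List[str]] = None,
) -> Dict[str, List[str]]:
    """Hypothetical by-dataset view with a dedicated branch for inline data."""
    if inline_seeds is not None:
        # Inline data gets one honest label instead of a reserved dict key.
        return {INLINE_DATASET_NAME: list(inline_seeds)}
    # Named resolution returns only real dataset names.
    return {name: list(named[name]) for name in dataset_names}


def restrict_dataset_names(allowed: Iterable[str]) -> Callable[[Iterable[str]], None]:
    """Hypothetical validator factory: reject datasets outside an allow-list."""
    allowed_set: FrozenSet[str] = frozenset(allowed)

    def check(resolved_names: Iterable[str]) -> None:
        extra = sorted(set(resolved_names) - allowed_set)
        if extra:
            raise ValueError(f"datasets not allowed for this scenario: {extra}")

    return check


def attack_label(technique: str, dataset_name: str) -> str:
    # Assumed naming scheme for atomic attacks: "<technique>_<dataset>".
    return f"{technique}_{dataset_name}"
```

With a split like this, no dataset name needs to be reserved as a sentinel, which is why the collision guards can go away.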
Remove DatasetObjectiveConfiguration, DatasetPromptConfiguration, and the
generic DatasetConfiguration[SeedT] flat resolver (get_seeds_async). None had
production callers; the objectives-only constraint that motivated the typed
subclasses is enforced at runtime per-technique, not at the dataset level.

DatasetConfiguration is now a plain (non-generic) base of resolution/fetch/
validate/sample plumbing plus the deprecated legacy getters, and
DatasetAttackConfiguration remains the one concrete resolver scenarios use.
Tests that exercised the removed flat resolver now drive the same base
resolution through DatasetAttackConfiguration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@romanlutz

Contributor

I'm a bit worried about datasets we pulled into the DB in the past that need updates (e.g., to standardize harm categories or fix metadata fields). Similarly, what happens to live datasets like threat feeds (think Odin or others)?

How do we make sure they get refreshed?

rlundeen2 commented Jun 23, 2026

Contributor Author

> I'm a bit worried about datasets we pulled into the DB in the past that need updates (e.g., to standardize harm categories or fix metadata fields). Similarly, what happens to live datasets like threat feeds (think Odin or others)?
>
> How do we make sure they get refreshed?

Good idea to include. I would keep it separate from this PR, but there are a couple of solutions I like:

  1. We could add an init parameter to this class that always refreshes.
  2. We could add an initializer that refreshes existing datasets that haven't been refreshed in X days (or refreshes all of them if X is 0).

I prefer option 2; I created an item in Proposed and linked it to this PR.
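
For discussion, the second option could look something like the sketch below. It is not part of this PR; `datasets_to_refresh` and the shape of the `last_refreshed` mapping are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def datasets_to_refresh(
    last_refreshed: Dict[str, datetime],
    max_age_days: int,
    now: datetime,
) -> List[str]:
    """Return names of datasets whose last refresh is at least max_age_days old.

    With max_age_days == 0 the cutoff is `now`, so every dataset refreshed at
    or before `now` is selected, i.e. everything gets refreshed.
    """
    if max_age_days < 0:
        raise ValueError("max_age_days must be >= 0")
    cutoff = now - timedelta(days=max_age_days)
    return sorted(name for name, ts in last_refreshed.items() if ts <= cutoff)
```

An initializer built on this would then re-fetch only the returned names from the provider, which would cover both stale metadata and live feeds.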

Comment thread pyrit/scenario/scenarios/airt/jailbreak.py Outdated
Comment thread pyrit/scenario/core/dataset_configuration.py
rlundeen2 and others added 5 commits June 29, 2026 18:13
…ing to a global budget

A single DatasetAttackConfiguration now applies max_dataset_size as one global
budget for both resolvers (flat and by-dataset), instead of the by-dataset path
silently sampling per dataset. Per-dataset budgets are now expressed explicitly
by composing children with the new MultiDatasetAttackConfiguration, which takes a
list of child configurations (each resolves, validates, and samples itself) and
concatenates them, with an optional compound-level cap on top.

MultiDatasetAttackConfiguration.per_dataset(dataset_names=[...], max_dataset_size=N)
is the compact 'N per dataset' form. encoding now composes two single-dataset
children (restoring its prior ~6-group coverage), and rapid_response/text_adaptive
use per_dataset to keep their per-dataset budgets.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
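
The difference between one global budget and explicit per-dataset budgets can be sketched as below. This is a simplified model, not the PyRIT classes: `global_budget` and `per_dataset_budget` are hypothetical helpers standing in for a single configuration and for the compound configuration (named Multi in this commit and renamed to Compound later in the PR).

```python
import random
from typing import Dict, List, Optional


def _sample(seeds: List[str], limit: Optional[int], rng: random.Random) -> List[str]:
    # No limit, or a limit at least as large as the data, keeps everything.
    if limit is None or limit >= len(seeds):
        return list(seeds)
    return rng.sample(seeds, limit)


def global_budget(
    datasets: Dict[str, List[str]], max_dataset_size: Optional[int], rng: random.Random
) -> List[str]:
    """Single configuration: pool all datasets, then apply one budget."""
    pooled = [seed for seeds in datasets.values() for seed in seeds]
    return _sample(pooled, max_dataset_size, rng)


def per_dataset_budget(
    datasets: Dict[str, List[str]],
    max_dataset_size: Optional[int],
    rng: random.Random,
    compound_cap: Optional[int] = None,
) -> List[str]:
    """Compound form: each child samples itself, results are concatenated,
    and an optional cap is applied on top."""
    combined: List[str] = []
    for seeds in datasets.values():
        combined.extend(_sample(seeds, max_dataset_size, rng))
    return _sample(combined, compound_cap, rng)
```

With two datasets of five seeds each and a budget of 3, the global form yields 3 seeds in total while the per-dataset form yields 6, which is the kind of coverage difference the encoding scenario is described as restoring.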
The four flat-resolver scenarios (encoding, scam, jailbreak, psychosocial)
caught DatasetConstraintError, reset seed_groups to [], then re-raised a
generic '_raise_dataset_exception' ValueError. Because require_nonempty() is
always a default validator, the resolver never returns an empty list -- it
raises a DatasetConstraintError naming the offending dataset and reason. The
catch only hid that specific cause behind a misleading 'not available or
failed to load' message. Remove the swallow so the precise error propagates,
drop the now-dead empty guards and the orphaned _raise_dataset_exception
helper, and assert the specific error in the affected tests. Psychosocial
keeps a loud, specific error when its harm-category filter removes everything.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
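
A small before/after sketch of the error handling this commit changes. The function names and the wording of the generic message are illustrative assumptions; only the pattern (catch, replace with a generic ValueError) versus plain propagation comes from the commit message.

```python
from typing import List


class DatasetConstraintError(ValueError):
    """Local stand-in for the error named in this PR; the real class may differ."""


def resolve_seed_groups(name: str, seeds: List[str]) -> List[str]:
    # A non-empty check that always runs, so an empty list is never returned.
    if not seeds:
        raise DatasetConstraintError(f"dataset '{name}' resolved to no seeds")
    return list(seeds)


def old_scenario_setup(name: str, seeds: List[str]) -> List[str]:
    try:
        return resolve_seed_groups(name, seeds)
    except DatasetConstraintError:
        # Old pattern: the specific cause is hidden behind a generic message.
        raise ValueError("Dataset is not available or failed to load.")


def new_scenario_setup(name: str, seeds: List[str]) -> List[str]:
    # New pattern: no catch, so the precise error reaches the caller.
    return resolve_seed_groups(name, seeds)
```

In the new form the caller sees which dataset failed and why, and tests can assert on that specific error instead of on a generic message.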
Six validator factories (require_min_size, require_harm_categories,
require_seed_type, require_inline_seeds, forbid_inline_seeds,
restrict_dataset_names) were exported from pyrit.scenario.core but used
nowhere in production -- only in unit tests, which import them directly from
the dataset_configuration module. Drop them from the package __init__ exports
and __all__ to shrink the public surface. The definitions stay in
dataset_configuration.py for scenarios to compose via _default_validators or
the validators= parameter. require_nonempty (the always-applied default)
remains exported.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…kConfiguration

'Compound' fits PyRIT naming conventions better than 'Multi'. Pure rename
across the class definition, exports, scenarios that compose it, and tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…taset-configuration

# Conflicts:
#	.pyrit_conf_example
…taset-configuration

# Conflicts:
#	tests/unit/scenario/airt/test_cyber.py
#	tests/unit/scenario/airt/test_leakage.py
#	tests/unit/scenario/scenarios/adaptive/test_text_adaptive.py
@rlundeen2 rlundeen2 enabled auto-merge June 30, 2026 20:47
@rlundeen2 rlundeen2 added this pull request to the merge queue Jun 30, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to a conflict with the base branch Jun 30, 2026
…taset-configuration

# Conflicts:
#	tests/unit/backend/test_scenario_run_service.py
@rlundeen2 rlundeen2 enabled auto-merge June 30, 2026 21:18
@rlundeen2 rlundeen2 added this pull request to the merge queue Jun 30, 2026
Merged via the queue into microsoft:main with commit fe4383b Jun 30, 2026
53 checks passed
@rlundeen2 rlundeen2 deleted the rlundeen2-redesign-dataset-configuration branch June 30, 2026 21:51

2 participants