FEAT: DatasetConfiguration Refactor #2071
Merged
rlundeen2 merged 11 commits on Jun 30, 2026
Restructure DatasetConfiguration into a generic base plus typed subclasses (DatasetObjectiveConfiguration, DatasetPromptConfiguration, DatasetAttackConfiguration). Constraints are expressed through composable validators run against the fully resolved dataset (pre-sampling), and the resolved set carries a DatasetSourceKind (inline vs memory) so validators can require or forbid inline seeds. Auto-fetch missing datasets from the provider on demand and raise loudly (with chained root cause) instead of silently warning. Non-emptiness is enforced as a default validator on every config. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
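The validate-before-sample flow described in this commit can be sketched in a few lines. This is a simplified, self-contained illustration and not the PyRIT implementation: the names `DatasetSourceKind`, `DatasetConstraintError`, `require_nonempty`, and `forbid_inline_seeds` come from the commit message, but the `ResolvedDataset` fields, the validator signature, and the `validate_then_sample` helper are assumptions made for the example.

```python
import random
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Sequence, Tuple


class DatasetSourceKind(Enum):
    INLINE = "inline"   # seeds passed directly to the configuration
    MEMORY = "memory"   # seeds resolved from named datasets in memory


class DatasetConstraintError(ValueError):
    """Stand-in for the error raised when a validator rejects a dataset."""


@dataclass(frozen=True)
class ResolvedDataset:
    # Assumed shape: the real class carries richer seed objects.
    seeds: Tuple[str, ...]
    source_kind: DatasetSourceKind


Validator = Callable[[ResolvedDataset], None]


def require_nonempty() -> Validator:
    def check(resolved: ResolvedDataset) -> None:
        if not resolved.seeds:
            raise DatasetConstraintError("resolved dataset is empty")
    return check


def forbid_inline_seeds() -> Validator:
    def check(resolved: ResolvedDataset) -> None:
        if resolved.source_kind is DatasetSourceKind.INLINE:
            raise DatasetConstraintError("inline seeds are not allowed here")
    return check


def validate_then_sample(
    resolved: ResolvedDataset,
    validators: Sequence[Validator],
    max_dataset_size: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> List[str]:
    # Hypothetical helper: validators see the full dataset, sampling comes last.
    for validator in validators:
        validator(resolved)
    seeds = list(resolved.seeds)
    if max_dataset_size is not None and len(seeds) > max_dataset_size:
        seeds = (rng or random).sample(seeds, max_dataset_size)
    return seeds
```

Because the validators run on the unsampled set, a size or category constraint describes the dataset itself, not whichever subset a particular run happened to draw.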
Scenarios now fetch their datasets from the registered provider on demand the first time they run, so the load_default_datasets initializer is no longer required for everyday runs or in the recommended default config. Remove it from the default config examples and per-run scanner command examples, and add a note explaining it is now an optional preload step (useful for warming memory or populating a database for offline use). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a dataset_names tuple to ResolvedDataset plus a restrict_dataset_names validator so scenarios that pair techniques with specific datasets (e.g. psychosocial, jailbreak) can constrain which datasets a configuration may draw from. Replace the EXPLICIT_SEED_GROUPS_KEY dict-overload with an explicit inline-vs-named split. Named resolution (_collect_named_seeds_async) now returns only real dataset names, and get_seeds_async resolves inline data through a dedicated branch, removing the reserved-key collision guards. Inline data keeps a single honest INLINE_DATASET_NAME label in the by-dataset views (used for atomic-attack naming), so user-facing labels read 'technique_inline' instead of leaking the old sentinel. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
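A name-restricting validator of the kind this commit describes might look roughly like the following. It is a sketch under stated assumptions: `restrict_dataset_names`, `ResolvedDataset`, `dataset_names`, and `DatasetConstraintError` are named in the commit message, while the field layout, the factory signature, and the dataset names used in the usage lines are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


class DatasetConstraintError(ValueError):
    """Stand-in for the error raised when a validator rejects a dataset."""


@dataclass(frozen=True)
class ResolvedDataset:
    # Assumed shape, for illustration only.
    seeds: Tuple[str, ...]
    dataset_names: Tuple[str, ...]


def restrict_dataset_names(allowed: Iterable[str]) -> Callable[[ResolvedDataset], None]:
    allowed_set = frozenset(allowed)

    def check(resolved: ResolvedDataset) -> None:
        # Report every offending name so the error says exactly what to fix.
        unexpected = sorted(set(resolved.dataset_names) - allowed_set)
        if unexpected:
            raise DatasetConstraintError(
                f"datasets not allowed: {unexpected}; allowed: {sorted(allowed_set)}"
            )

    return check


# Hypothetical dataset names, not taken from PyRIT.
validator = restrict_dataset_names(["jailbreak_templates", "harm_objectives"])
validator(ResolvedDataset(seeds=("s1",), dataset_names=("harm_objectives",)))  # passes
```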
Remove DatasetObjectiveConfiguration, DatasetPromptConfiguration, and the generic DatasetConfiguration[SeedT] flat resolver (get_seeds_async). None had production callers; the objectives-only constraint that motivated the typed subclasses is enforced at runtime per-technique, not at the dataset level. DatasetConfiguration is now a plain (non-generic) base of resolution/fetch/validate/sample plumbing plus the deprecated legacy getters, and DatasetAttackConfiguration remains the one concrete resolver scenarios use. Tests that exercised the removed flat resolver now drive the same base resolution through DatasetAttackConfiguration. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
I'm a bit worried about datasets we pulled into the DB in the past that need updates (e.g. in the context of standardizing harm categories, or fixing metadata fields). Similarly, what happens to live datasets like threat feeds (think Odin or others)? How do we make sure they get refreshed?
Contributor
Author
Good idea to include. I would separate it from this PR, but there are a couple of solutions I like:
I like #2 more; I created an item in proposed and linked to this.
romanlutz
reviewed
Jun 26, 2026
…ing to a global budget A single DatasetAttackConfiguration now applies max_dataset_size as one global budget for both resolvers (flat and by-dataset), instead of the by-dataset path silently sampling per dataset. Per-dataset budgets are now expressed explicitly by composing children with the new MultiDatasetAttackConfiguration, which takes a list of child configurations (each resolves, validates, and samples itself) and concatenates them, with an optional compound-level cap on top. MultiDatasetAttackConfiguration.per_dataset(dataset_names=[...], max_dataset_size=N) is the compact 'N per dataset' form. encoding now composes two single-dataset children (restoring its prior ~6-group coverage), and rapid_response/text_adaptive use per_dataset to keep their per-dataset budgets. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
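The difference between one global budget and explicit per-dataset budgets can be shown with plain functions. This is only an illustration of the sampling arithmetic described above; the function names and the dict-of-lists input are assumptions, and the real configuration classes resolve and validate their own data before sampling.

```python
import random
from typing import Dict, List, Optional


def sample_global(
    datasets: Dict[str, List[str]], max_dataset_size: int, rng: random.Random
) -> List[str]:
    # One budget across everything: pool all seeds, then sample once.
    pooled = [seed for seeds in datasets.values() for seed in seeds]
    if len(pooled) <= max_dataset_size:
        return pooled
    return rng.sample(pooled, max_dataset_size)


def sample_per_dataset(
    datasets: Dict[str, List[str]],
    max_dataset_size: int,
    rng: random.Random,
    compound_cap: Optional[int] = None,
) -> List[str]:
    # Each child samples itself, results are concatenated,
    # and an optional compound-level cap is applied on top.
    combined: List[str] = []
    for seeds in datasets.values():
        if len(seeds) <= max_dataset_size:
            combined.extend(seeds)
        else:
            combined.extend(rng.sample(seeds, max_dataset_size))
    if compound_cap is not None and len(combined) > compound_cap:
        combined = rng.sample(combined, compound_cap)
    return combined
```

With two datasets of ten seeds each and `max_dataset_size=3`, the global form yields 3 seeds in total, while the per-dataset form yields 3 from each (6 in total) unless a compound cap lowers that.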
The four flat-resolver scenarios (encoding, scam, jailbreak, psychosocial) caught DatasetConstraintError, reset seed_groups to [], then re-raised a generic '_raise_dataset_exception' ValueError. Because require_nonempty() is always a default validator, the resolver never returns an empty list -- it raises a DatasetConstraintError naming the offending dataset and reason. The catch only hid that specific cause behind a misleading 'not available or failed to load' message. Remove the swallow so the precise error propagates, drop the now-dead empty guards and the orphaned _raise_dataset_exception helper, and assert the specific error in the affected tests. Psychosocial keeps a loud, specific error when its harm-category filter removes everything. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
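The before-and-after behavior can be reduced to the following sketch. It is a simplification, not the scenario code: `resolve`, `load_swallowing`, and `load_propagating` are hypothetical names, and only the generic message text and the `DatasetConstraintError` name are taken from the commit message.

```python
from typing import List


class DatasetConstraintError(ValueError):
    """Stand-in for the error raised by the always-on non-empty validator."""


def resolve(name: str, seeds: List[str]) -> List[str]:
    # Mimics the default validator: an empty result raises, it is never returned.
    if not seeds:
        raise DatasetConstraintError(f"dataset '{name}' resolved to no seeds")
    return seeds


def load_swallowing(name: str, seeds: List[str]) -> List[str]:
    # Old pattern: the specific cause is replaced by a generic message.
    try:
        return resolve(name, seeds)
    except DatasetConstraintError:
        raise ValueError("Dataset is not available or failed to load")


def load_propagating(name: str, seeds: List[str]) -> List[str]:
    # New pattern: no catch, so the caller sees which dataset failed and why.
    return resolve(name, seeds)
```

Since `resolve` can never return an empty list, any `if not seed_groups` guard after the call is unreachable, which is why the empty checks could be dropped along with the catch.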
Six validator factories (require_min_size, require_harm_categories, require_seed_type, require_inline_seeds, forbid_inline_seeds, restrict_dataset_names) were exported from pyrit.scenario.core but used nowhere in production -- only in unit tests, which import them directly from the dataset_configuration module. Drop them from the package __init__ exports and __all__ to shrink the public surface. The definitions stay in dataset_configuration.py for scenarios to compose via _default_validators or the validators= parameter. require_nonempty (the always-applied default) remains exported. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…kConfiguration 'Compound' fits PyRIT naming conventions better than 'Multi'. Pure rename across the class definition, exports, scenarios that compose it, and tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…taset-configuration # Conflicts: # .pyrit_conf_example
romanlutz
approved these changes
Jun 30, 2026
…taset-configuration # Conflicts: # tests/unit/scenario/airt/test_cyber.py # tests/unit/scenario/airt/test_leakage.py # tests/unit/scenario/scenarios/adaptive/test_text_adaptive.py
…taset-configuration # Conflicts: # tests/unit/backend/test_scenario_run_service.py
`DatasetConfiguration` worked well for the first scenarios but strained as we added more: garak.encoding needs seed prompts, jailbreak needs two datasets (the harms and the jailbreaks themselves), psychosocial ties datasets to techniques, and other Garak scenarios want more flexible types. This PR refactors `DatasetConfiguration` to fit these diverse scenarios while simplifying the overall structure.

How it works now:

- … `SeedDatasetProvider` and stored, so you no longer need the `load_default_datasets` initializer on startup unless you want to preload.
- … `DatasetConfiguration` base plus `DatasetAttackConfiguration` (the default most scenarios use); the earlier typed subclasses and parallel `get_seeds` getters were dropped.
- `CompoundDatasetAttackConfiguration` combines several configs and samples across them, so a scenario like encoding or jailbreak can draw from multiple datasets at once (with per-dataset or global `max_dataset_size`).
- … `max_dataset_size` sampling, so they describe the dataset itself rather than a sampled subset. Non-emptiness is enforced by default; scenarios can add others (seed type, harm categories, allowed names, inline-vs-named).
- … `DatasetSourceKind` (inline vs. memory), letting a scenario accept seeds directly (e.g. CLI `--objectives`) or require named datasets.
- … `DatasetConstraintError` naming the dataset and the reason instead of silently warning or returning empty.

Scenarios resolve through two getters: `get_seed_attack_groups_async` and `get_attack_groups_by_dataset_async`. All scenario unit tests pass.