feat(redteam): add built-in red teaming support#184
Conversation
8d7d3f5 to
c9f5845
Compare
poshinchen
left a comment
There was a problem hiding this comment.
Could you use built-in python | / list instead of typing's deprecated Union, List and so on?
Quick heads-up – fixed it in 438f9e0 |
|
Created sub-issues under #177 to track P0/P1 work:
This PR (#184) covers the infrastructure layer of P0. Checked items in #220 reflect what's already implemented here:
Remaining P0 work (tracked in #220):
|
|
@poshinchen Resolved your comments in d750fe0:
|
|
/strands review the PR |
|
Issue: This PR introduces a significant new public API surface ( The PR description documents the components well, but missing from an API review perspective:
Suggestion: Add from strands_evals.redteam import red_teamand document whether this is the intended public entry point. |
Review SummaryAssessment: Comment (Request Changes on specific items) Solid foundation for red teaming capabilities. The architecture cleanly separates concerns (presets vs. strategies vs. evaluators vs. runner) and the Review Categories
The separation of "what to attack" (presets) from "how to attack" (strategies) is a clean design that should scale well as more strategies land in the follow-up PRs. |
|
@poshinchen Addressed all automated review findings in 1284fa1 API surface comment – |
be8876e to
49b2bb2
Compare
Review Summary (Round 2)Assessment: Comment The restructuring addresses the key issues from the first review well — the module is now under Remaining Items
Nice improvement from the previous iteration. The |
Adds an experimental red-teaming module under src/strands_evals/experimental/redteam/ that extends Strands Evals base types (Case, Experiment, Evaluator, ActorSimulator) with adversarial counterparts. - AdversarialCaseGenerator: generates RedTeamCases per risk category, with optional auto-inference of categories from target tools/system_prompt - RedTeamExperiment: orchestrates multi-turn attacker/target conversations - AttackSuccessEvaluator: continuous 0.0-1.0 LLM-as-judge over conversation + tool execution traces - AdversarialActorSimulator: ActorSimulator subclass shared across strategies - AttackStrategy + PromptStrategy with gradual_escalation as the default
49b2bb2 to
e8e5be2
Compare
Description
Adds built-in red teaming capabilities to strands-evals, enabling automated adversarial testing of AI agents.
Core components:
jailbreak,prompt_extraction,harmful_content): Pre-built actor profiles, goals, seed inputs, and per-preset evaluation metricsgradual_escalation— an adaptive multi-turn strategy that analyzes target responses and pivots techniques dynamicallyRedTeamJudgeEvaluator: Composite safety evaluator with 3 metrics (guardrail_breach,harmfulness,prompt_leakage). Dynamically builds judge prompts based on only the metrics relevant to each attack patternrun_red_team()entry point: End-to-end orchestration — case generation, multi-turn attack simulation viaActorSimulator, and safety evaluation in a single calltarget_infoparameter for LLM-generated attack goals tailored to the specific target systemRelated Issues
Closes #220
Type of Change
New feature
Testing
hatch run prepareChecklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.