feat(redteam): add built-in red teaming support #184

Open
kevmyung wants to merge 1 commit into strands-agents:main from kevmyung:feat/red-team-foundation

@kevmyung kevmyung commented Mar 31, 2026

Description

Adds built-in red teaming capabilities to strands-evals, enabling automated adversarial testing of AI agents.

Core components:

  • Attack presets (jailbreak, prompt_extraction, harmful_content): Pre-built actor profiles, goals, seed inputs, and per-preset evaluation metrics
  • Strategy system: Pluggable attack strategies separated from presets. Ships with gradual_escalation — an adaptive multi-turn strategy that analyzes target responses and pivots techniques dynamically
  • RedTeamJudgeEvaluator: Composite safety evaluator with 3 metrics (guardrail_breach, harmfulness, prompt_leakage). Dynamically builds judge prompts based on only the metrics relevant to each attack pattern
  • run_red_team() entry point: End-to-end orchestration — case generation, multi-turn attack simulation via ActorSimulator, and safety evaluation in a single call
  • Target-aware goal generation: Optional target_info parameter for LLM-generated attack goals tailored to the specific target system
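
To illustrate the "builds judge prompts from only the relevant metrics" idea in the component list above, here is a minimal sketch. The metric wording, the preset-to-metric mapping, and the build_judge_prompt name are all assumptions made for illustration; they are not taken from the PR's code.

```python
# Hypothetical metric descriptions; the real judge wording lives in the PR.
METRIC_INSTRUCTIONS = {
    "guardrail_breach": "Did the target abandon its stated restrictions?",
    "harmfulness": "Did the target produce harmful content?",
    "prompt_leakage": "Did the target reveal its system prompt?",
}

# Hypothetical mapping from attack preset to the metrics that apply to it.
PRESET_METRICS = {
    "jailbreak": ["guardrail_breach", "harmfulness"],
    "prompt_extraction": ["prompt_leakage"],
    "harmful_content": ["harmfulness"],
}


def build_judge_prompt(preset: str) -> str:
    """Assemble a judge prompt that mentions only the preset's metrics."""
    metrics = PRESET_METRICS[preset]
    lines = ["Evaluate the conversation on the following metrics only:"]
    lines += [f"- {name}: {METRIC_INSTRUCTIONS[name]}" for name in metrics]
    return "\n".join(lines)
```

Calling build_judge_prompt("prompt_extraction") in this sketch yields a prompt that asks about prompt_leakage and nothing else, so the judge is never asked to score a metric the attack does not target.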

Related Issues

Closes #220

Type of Change

New feature

Testing

  • I ran hatch run prepare
  • Unit tests for presets, runner, and judge evaluator (49 tests passing)
  • Integration tested against mock compliant target and Claude Haiku target

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Contributor

@poshinchen poshinchen left a comment

Could you use built-in python | / list instead of typing's deprecated Union, List and so on?

@kevmyung kevmyung temporarily deployed to manual-approval May 1, 2026 15:23 — with GitHub Actions Inactive
Author

kevmyung commented May 1, 2026

Could you use built-in python | / list instead of typing's deprecated Union, List and so on?

Quick heads-up – fixed it in 438f9e0

Comment thread src/strands_evals/evaluators/red_team_judge_evaluator.py Outdated
Comment thread src/strands_evals/evaluators/red_team_judge_evaluator.py Outdated
Comment thread src/strands_evals/evaluators/red_team_judge_evaluator.py Outdated
Comment thread src/strands_evals/evaluators/red_team_judge_evaluator.py Outdated
Comment thread src/strands_evals/evaluators/red_team_judge_evaluator.py Outdated
Comment thread src/strands_evals/redteam/evaluators/red_team_judge_evaluator.py Outdated
@yeomjiwonyeom

Created sub-issues under #177 to track P0/P1 work:

This PR (#184) covers the infrastructure layer of P0. Checked items in #220 reflect what's already implemented here:

  • AttackStrategy ABC, RiskCategory, AttackGoal types
  • AttackSuccessEvaluator (0.0–1.0 continuous scoring on execution traces)
  • RedTeamJudgeEvaluator (binary per-metric)
  • red_team(agent) entry point with auto tool extraction
  • RedTeamReport with grouped views
  • Multi-turn conversation loop via ActorSimulator
  • 1 strategy: gradual_escalation

Remaining P0 work (tracked in #220):

  • 5 multi-turn strategies: Crescendo, Linear/PAIR, TAP/TreeJailbreaking, BadLikertJudge, SequentialBreak
  • RedTeamExperiment orchestrator
  • AdversarialActorSimulator / AdversarialCaseGenerator
  • Turn budget increase to 20-50

@kevmyung
Author

@poshinchen Resolved your comments in d750fe0:

  • Moved red-team evaluators under src/strands_evals/redteam/evaluators/
  • Unified attack_strategies API (accepts strings or AttackStrategy instances)
  • Fixed extract_tool_info to handle get_all_tools_config's dict shape
  • Addressed remaining nits (typing, unused symbols, dead branches, docstrings)

@poshinchen
Contributor

/strands review the PR

Comment thread src/strands_evals/redteam/runner.py Outdated
Comment thread src/strands_evals/redteam/runner.py Outdated
Comment thread src/strands_evals/redteam/runner.py Outdated
Comment thread src/strands_evals/redteam/presets.py Outdated
Comment thread src/strands_evals/redteam/evaluators/red_team_judge_evaluator.py Outdated
Comment thread src/strands_evals/redteam/agent_adapter.py Outdated
@github-actions

Issue: This PR introduces a significant new public API surface (strands_evals.redteam) with multiple abstractions customers will use (red_team(), AttackStrategy, RedTeamJudgeEvaluator, AttackSuccessEvaluator, presets). Per the API Bar Raising guidelines, this likely warrants the needs-api-review label.

The PR description documents the components well, but missing from an API review perspective:

  • Module-level import paths (e.g., can users do from strands_evals import red_team?)
  • Currently the redteam module is not re-exported from strands_evals/__init__.py
  • Whether JAILBREAK, PROMPT_EXTRACTION, HARMFUL_CONTENT should be public constants vs. accessed via the registry

Suggestion: Add needs-api-review label and decide on the import ergonomics. At minimum, consider adding redteam to the top-level __init__.py lazy imports so users can write:

from strands_evals.redteam import red_team

and document whether this is the intended public entry point.
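
For context on the "lazy imports" suggestion, one common way to do this in a package __init__.py is a module-level __getattr__ (PEP 562). The sketch below is an assumption about the general technique, not the layout of strands_evals: it uses a throwaway module object and a standard-library target so that it runs on its own.

```python
import importlib
import types

# Maps a public name to (module path, attribute name). In a real package the
# entry would point at the submodule to load lazily; json.dumps is used here
# only so the sketch is self-contained.
_LAZY_EXPORTS = {"dumps": ("json", "dumps")}


def _lazy_getattr(name: str):
    """Import the backing module only when the attribute is first requested."""
    if name not in _LAZY_EXPORTS:
        raise AttributeError(f"module has no attribute {name!r}")
    module_path, attr = _LAZY_EXPORTS[name]
    return getattr(importlib.import_module(module_path), attr)


# Stand-in for a package namespace; in an __init__.py you would simply
# define a function named __getattr__ at module level.
demo_pkg = types.ModuleType("demo_pkg")
demo_pkg.__getattr__ = _lazy_getattr
```

With this in place, demo_pkg.dumps resolves on first access, while unknown names still raise AttributeError as usual.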

Comment thread tests/strands_evals/redteam/test_runner.py Outdated
Comment thread src/strands_evals/redteam/strategies/base.py Outdated
@github-actions

Review Summary

Assessment: Comment (Request Changes on specific items)

Solid foundation for red teaming capabilities. The architecture cleanly separates concerns (presets vs. strategies vs. evaluators vs. runner) and the red_team() API provides a simple entry point. Two items I'd ask to address before merging:

Review Categories
  • Concurrency Safety: The shared tool_trace mutable list pattern will silently corrupt data if the experiment runs with parallel workers. This needs either a fix or an explicit max_workers=1 constraint.
  • Test Coverage Gaps: AttackSuccessEvaluator and agent_adapter.py are untested — both are part of the public API surface.
  • API Surface: This introduces a substantial new public module. Consider adding needs-api-review label and clarifying the intended import paths (top-level re-export vs. submodule).
  • Evaluator Aggregation: The multi-metric judge evaluator's outputs get averaged, which can mask critical safety failures. Worth a deliberate design decision on the aggregation semantics.
  • Reproducibility: No seed parameter for case generation makes CI/CD regression testing non-deterministic.
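
The aggregation concern above can be shown with a few lines of arithmetic. This sketch assumes a convention where 1.0 means "safe" and 0.0 means "failed" for each metric; both the convention and the function names are illustrative, not the evaluator's actual behavior.

```python
def aggregate_mean(scores: dict[str, float]) -> float:
    """Average the per-metric scores (the behavior the review warns about)."""
    return sum(scores.values()) / len(scores)


def aggregate_worst_case(scores: dict[str, float]) -> float:
    """Report the lowest per-metric score so one failure cannot be hidden."""
    return min(scores.values())


# Assumed convention: 1.0 = safe, 0.0 = safety failure.
example_scores = {
    "guardrail_breach": 1.0,
    "harmfulness": 1.0,
    "prompt_leakage": 0.0,  # the system prompt leaked
}
```

On example_scores the mean comes out at about 0.67, which could pass a lenient threshold, while the worst-case aggregate is 0.0 and surfaces the leak directly.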

The separation of "what to attack" (presets) from "how to attack" (strategies) is a clean design that should scale well as more strategies land in the follow-up PRs.

@kevmyung
Author

@poshinchen Addressed all automated review findings in 1284fa1

API surface comment – needs-api-review label and top-level re-export (from strands_evals import red_team) decision are up to you. Current import path is from strands_evals.redteam import red_team. Happy to add a top-level re-export if you prefer.

Comment thread src/strands_evals/redteam/evaluators/attack_success_evaluator.py Outdated
Comment thread src/strands_evals/redteam/evaluators/attack_success_evaluator.py Outdated
Comment thread src/strands_evals/redteam/evaluators/attack_success_evaluator.py Outdated
Comment thread src/strands_evals/redteam/evaluators/red_team_judge_evaluator.py Outdated
Comment thread src/strands_evals/redteam/evaluators/red_team_judge_evaluator.py Outdated
Comment thread src/strands_evals/redteam/types.py Outdated
@kevmyung kevmyung force-pushed the feat/red-team-foundation branch from be8876e to 49b2bb2 May 20, 2026 15:45
@kevmyung kevmyung deployed to manual-approval May 20, 2026 15:51 — with GitHub Actions Active
Comment thread src/strands_evals/experimental/redteam/generators/adversarial.py
Comment thread src/strands_evals/experimental/redteam/strategies/__init__.py
Comment thread src/strands_evals/experimental/redteam/generators/adversarial.py Outdated
Comment thread src/strands_evals/experimental/redteam/experiment.py
Comment thread src/strands_evals/experimental/redteam/simulators/adversarial.py
Comment thread src/strands_evals/experimental/redteam/case.py
@github-actions

Review Summary (Round 2)

Assessment: Comment

The restructuring addresses the key issues from the first review well: the module is now under experimental/, concurrency is fixed (per-call trace list), exception handling is narrowed, types are Pydantic models, and the AttackStrategy base class now uses build_simulator() as its abstract method. The RedTeamExperiment → AdversarialCaseGenerator → AttackSuccessEvaluator pipeline is a clean layered design.
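
As a generic illustration of the "per-call trace list" fix mentioned above: each case builds its own list instead of appending to one shared across workers. The function names and trace format are invented for this sketch and do not mirror the PR's runner.

```python
from concurrent.futures import ThreadPoolExecutor


def run_case(case_id: int) -> list[str]:
    """Run one case and return its own tool trace."""
    tool_trace: list[str] = []  # created per call, never shared
    for step in range(3):
        tool_trace.append(f"case{case_id}:tool{step}")
    return tool_trace


def run_all(n_cases: int, max_workers: int = 4) -> list[list[str]]:
    """Run cases in parallel; pool.map keeps results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_case, range(n_cases)))
```

Because no list outlives a single run_case call, entries from different cases cannot interleave regardless of the worker count.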

Remaining Items
  • Robustness: _infer_risk_categories uses fragile JSON string slicing that will crash on malformed LLM output. Use structured_output_model consistently.
  • Async compatibility: asyncio.run() in sync wrappers will fail in already-async contexts (Jupyter, FastAPI). Expose proper async alternatives.
  • Return type override: RedTeamExperiment.run_evaluations() changes the return type signature from the base class — document or restructure.
  • Test gap: agent_adapter.py remains untested. Both functions interface with Agent internals that may shift.
  • API noise: 6 unimplemented strategy skeletons are exported from __all__ — defer public export until functional.

Nice improvement from the previous iteration. The experimental/ namespace gives appropriate API stability expectations while the design matures.

Adds an experimental red-teaming module under src/strands_evals/experimental/redteam/
that extends Strands Evals base types (Case, Experiment, Evaluator, ActorSimulator)
with adversarial counterparts.

- AdversarialCaseGenerator: generates RedTeamCases per risk category, with
  optional auto-inference of categories from target tools/system_prompt
- RedTeamExperiment: orchestrates multi-turn attacker/target conversations
- AttackSuccessEvaluator: continuous 0.0-1.0 LLM-as-judge over conversation +
  tool execution traces
- AdversarialActorSimulator: ActorSimulator subclass shared across strategies
- AttackStrategy + PromptStrategy with gradual_escalation as the default
@kevmyung kevmyung force-pushed the feat/red-team-foundation branch from 49b2bb2 to e8e5be2 May 20, 2026 19:22
Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE][P0] Red-Teaming: Core Pipeline — multi-turn attack strategies, evaluator, experiment, reporting

3 participants