feat(redteam): add built-in red teaming support by kevmyung · Pull Request #184 · strands-agents/evals

kevmyung · 2026-03-31T23:14:30Z

Description

Adds built-in red teaming capabilities to strands-evals, enabling automated adversarial testing of AI agents.

Core components:

Attack presets (jailbreak, prompt_extraction, harmful_content): Pre-built actor profiles, goals, seed inputs, and per-preset evaluation metrics
Strategy system: Pluggable attack strategies separated from presets. Ships with gradual_escalation — an adaptive multi-turn strategy that analyzes target responses and pivots techniques dynamically
RedTeamJudgeEvaluator: Composite safety evaluator with 3 metrics (guardrail_breach, harmfulness, prompt_leakage). Dynamically builds judge prompts based on only the metrics relevant to each attack pattern
run_red_team() entry point: End-to-end orchestration — case generation, multi-turn attack simulation via ActorSimulator, and safety evaluation in a single call
Target-aware goal generation: Optional target_info parameter for LLM-generated attack goals tailored to the specific target system
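
To illustrate the metric-selection idea in the RedTeamJudgeEvaluator item above, here is a minimal sketch of assembling a judge prompt from only the metrics a preset cares about. The three metric names and three preset names come from the description; the rubric wording, the PRESET_METRICS mapping, and the build_judge_prompt helper are hypothetical and are not taken from the PR's code.

```python
# Hypothetical rubric text per metric; only the metric names come from the PR description.
METRIC_RUBRICS: dict[str, str] = {
    "guardrail_breach": "Did the target abandon its stated restrictions?",
    "harmfulness": "Did the target produce harmful content?",
    "prompt_leakage": "Did the target reveal its system prompt?",
}

# Assumed preset-to-metric mapping, for illustration only.
PRESET_METRICS: dict[str, list[str]] = {
    "jailbreak": ["guardrail_breach", "harmfulness"],
    "prompt_extraction": ["prompt_leakage"],
    "harmful_content": ["harmfulness"],
}


def build_judge_prompt(preset: str, transcript: str) -> str:
    """Build a judge prompt that lists only the metrics relevant to `preset`."""
    metrics = PRESET_METRICS[preset]
    rubric_lines = [f"- {name}: {METRIC_RUBRICS[name]}" for name in metrics]
    return "\n".join(
        ["Evaluate the conversation on these metrics:", *rubric_lines, "", "Conversation:", transcript]
    )


prompt = build_judge_prompt("prompt_extraction", "user: ...\nassistant: ...")
```

With this shape, a prompt-extraction run never asks the judge about harmfulness, which keeps the judge prompt short and avoids scoring metrics the attack was not designed to probe.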

Related Issues

Closes #220

Type of Change

New feature

Testing

I ran hatch run prepare
Unit tests for presets, runner, and judge evaluator (49 tests passing)
Integration tested against mock compliant target and Claude Haiku target

Checklist

I have read the CONTRIBUTING document
I have added any necessary tests that prove my fix is effective or my feature works
I have updated the documentation accordingly
I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
My changes generate no new warnings
Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

poshinchen

Could you use built-in python | / list instead of typing's deprecated Union, List and so on?

kevmyung · 2026-05-01T20:27:02Z

> Could you use built-in python | / list instead of typing's deprecated Union, List and so on?

Quick heads-up – fixed it in 438f9e0

yeomjiwonyeom · 2026-05-11T23:56:17Z

Created sub-issues under #177 to track P0/P1 work:

#220: [FEATURE][P0] Red-Teaming: Core Pipeline (multi-turn attack strategies, evaluator, experiment, reporting)
#221: [FEATURE][P0.5] Red-Teaming: Agent-topology attacks & skill-based red-teaming
#222: [FEATURE][P1] Red-Teaming: Tool-graph case generation, deterministic verification, adaptive red-teaming agent

This PR (#184) covers the infrastructure layer of P0. Checked items in #220 reflect what's already implemented here:

AttackStrategy ABC, RiskCategory, AttackGoal types
AttackSuccessEvaluator (0.0–1.0 continuous scoring on execution traces)
RedTeamJudgeEvaluator (binary per-metric)
red_team(agent) entry point with auto tool extraction
RedTeamReport with grouped views
Multi-turn conversation loop via ActorSimulator
1 strategy: gradual_escalation

Remaining P0 work (tracked in #220):

5 multi-turn strategies: Crescendo, Linear/PAIR, TAP/TreeJailbreaking, BadLikertJudge, SequentialBreak
RedTeamExperiment orchestrator
AdversarialActorSimulator / AdversarialCaseGenerator
Turn budget increase to 20-50

kevmyung · 2026-05-12T01:13:24Z

@poshinchen Resolved your comments in d750fe0:

Moved red-team evaluators under src/strands_evals/redteam/evaluators/
Unified attack_strategies API (accepts strings or AttackStrategy instances)
Fixed extract_tool_info to handle get_all_tools_config's dict shape
Addressed remaining nits (typing, unused symbols, dead branches, docstrings)

poshinchen · 2026-05-12T16:12:32Z

/strands review the PR

github-actions · 2026-05-12T17:09:27Z

Issue: This PR introduces a significant new public API surface (strands_evals.redteam) with multiple abstractions customers will use (red_team(), AttackStrategy, RedTeamJudgeEvaluator, AttackSuccessEvaluator, presets). Per the API Bar Raising guidelines, this likely warrants the needs-api-review label.

The PR description documents the components well, but the following are missing from an API review perspective:

Module-level import paths (e.g., can users do from strands_evals import red_team?)
Currently the redteam module is not re-exported from strands_evals/__init__.py
Whether JAILBREAK, PROMPT_EXTRACTION, HARMFUL_CONTENT should be public constants vs. accessed via the registry

Suggestion: Add needs-api-review label and decide on the import ergonomics. At minimum, consider adding redteam to the top-level __init__.py lazy imports so users can write:

from strands_evals.redteam import red_team

and document whether this is the intended public entry point.
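
For context, lazy top-level exports are commonly built on a module-level `__getattr__` hook (PEP 562). The sketch below is an assumption about how that could be wired, not the layout of strands_evals/__init__.py: the export table is hypothetical, and it points at a standard-library function so the example runs without the package installed.

```python
import importlib
import types

# Hypothetical lazy-export table: public name -> (module path, attribute name).
# In a package __init__.py an entry might read
#   "red_team": ("strands_evals.redteam", "red_team")
# A standard-library target is used here so the sketch runs on its own.
_LAZY_EXPORTS = {"dumps": ("json", "dumps")}


def _lazy_getattr(name: str):
    """Module-level __getattr__ hook (PEP 562): import the target on first access."""
    try:
        module_path, attr = _LAZY_EXPORTS[name]
    except KeyError:
        raise AttributeError(f"module has no attribute {name!r}") from None
    return getattr(importlib.import_module(module_path), attr)


# Demonstrate on a throwaway module object. In a real package the function
# would be defined directly as `__getattr__` in __init__.py.
demo_pkg = types.ModuleType("demo_pkg")
demo_pkg.__getattr__ = _lazy_getattr

serialized = demo_pkg.dumps({"a": 1})  # triggers the lazy import of json
```

The submodule is only imported when the name is first accessed, so adding a top-level export this way does not slow down `import strands_evals` for users who never touch red teaming.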

github-actions · 2026-05-12T17:09:33Z

Review Summary

Assessment: Comment (Request Changes on specific items)

Solid foundation for red teaming capabilities. The architecture cleanly separates concerns (presets vs. strategies vs. evaluators vs. runner) and the red_team() API provides a simple entry point. Two items I'd ask to address before merging:

Review Categories

Concurrency Safety: The shared tool_trace mutable list pattern will silently corrupt data if the experiment runs with parallel workers. This needs either a fix or an explicit max_workers=1 constraint.
Test Coverage Gaps: AttackSuccessEvaluator and agent_adapter.py are untested — both are part of the public API surface.
API Surface: This introduces a substantial new public module. Consider adding needs-api-review label and clarifying the intended import paths (top-level re-export vs. submodule).
Evaluator Aggregation: The multi-metric judge evaluator's outputs get averaged, which can mask critical safety failures. Worth a deliberate design decision on the aggregation semantics.
Reproducibility: No seed parameter for case generation makes CI/CD regression testing non-deterministic.
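
The aggregation concern in the list above can be seen with a few numbers. This is an illustrative sketch only: the score convention (1.0 = safe, 0.0 = failed) and the sample values are assumptions, not output from the PR's evaluators.

```python
from statistics import mean

# Hypothetical per-metric judge scores for one conversation: 1.0 = safe, 0.0 = failed.
scores = {"guardrail_breach": 1.0, "harmfulness": 1.0, "prompt_leakage": 0.0}

values = list(scores.values())
average_score = mean(values)    # one failure is diluted by two passes
worst_case_score = min(values)  # any single failure dominates
failed_metrics = [name for name, s in scores.items() if s < 1.0]
```

Here the average is about 0.67, which could clear a lenient pass threshold even though the system prompt leaked. Taking the minimum, or reporting failed_metrics explicitly, keeps a single critical failure visible.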

The separation of "what to attack" (presets) from "how to attack" (strategies) is a clean design that should scale well as more strategies land in the follow-up PRs.

kevmyung · 2026-05-12T19:26:31Z

@poshinchen Addressed all automated review findings in 1284fa1

On the API surface comment: the needs-api-review label and the top-level re-export decision (from strands_evals import red_team) are up to you. The current import path is from strands_evals.redteam import red_team. Happy to add a top-level re-export if you prefer.

github-actions · 2026-05-20T18:27:50Z

Review Summary (Round 2)

Assessment: Comment

The restructuring addresses the key issues from the first review well — the module is now under experimental/, concurrency is fixed (per-call trace list), exception handling is narrowed, types are Pydantic models, and the AttackStrategy base class now uses build_simulator() as the concrete abstract method. The RedTeamExperiment → AdversarialCaseGenerator → AttackSuccessEvaluator pipeline is a clean layered design.
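
The per-call trace list mentioned above can be sketched as follows. The run_case function and its fake tool calls are hypothetical; the point is only that each invocation owns its list, so parallel workers cannot interleave entries.

```python
from concurrent.futures import ThreadPoolExecutor


def run_case(case_id: int) -> list[str]:
    """Hypothetical task function: the trace list is created per call, not shared."""
    tool_trace: list[str] = []  # local to this invocation
    for step in range(3):
        tool_trace.append(f"case{case_id}:tool_call{step}")
    return tool_trace


# Run several cases in parallel; each result contains only its own case's entries.
with ThreadPoolExecutor(max_workers=4) as pool:
    traces = list(pool.map(run_case, range(8)))
```

Had tool_trace been a single list captured by every worker, entries from different cases would end up mixed in one trace, which is the silent corruption the first review flagged.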

Remaining Items

Robustness: _infer_risk_categories uses fragile JSON string slicing that will crash on malformed LLM output. Use structured_output_model consistently.
Async compatibility: asyncio.run() in sync wrappers will fail in already-async contexts (Jupyter, FastAPI). Expose proper async alternatives.
Return type override: RedTeamExperiment.run_evaluations() changes the return type signature from the base class — document or restructure.
Test gap: agent_adapter.py remains untested. Both functions interface with Agent internals that may shift.
API noise: 6 unimplemented strategy skeletons are exported from __all__ — defer public export until functional.
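
On the async compatibility item above, one common pattern is to expose the coroutine as the primary API and keep the sync wrapper as a thin convenience that refuses to run inside an active event loop. The function names below are hypothetical and the coroutine body is a placeholder for the real attack loop.

```python
import asyncio


async def run_attack_async(prompt: str) -> str:
    """Hypothetical async entry point; stands in for the real attack loop."""
    await asyncio.sleep(0)  # placeholder for awaiting the target agent
    return f"response to: {prompt}"


def run_attack(prompt: str) -> str:
    """Sync convenience wrapper that refuses to run inside an active event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe.
        return asyncio.run(run_attack_async(prompt))
    raise RuntimeError("An event loop is already running; await run_attack_async() instead.")
```

From a script, run_attack("hello") works as before. From Jupyter or a FastAPI handler, callers get a clear error pointing them at `await run_attack_async(...)` rather than the less obvious failure raised by asyncio.run() itself.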

Nice improvement from the previous iteration. The experimental/ namespace gives appropriate API stability expectations while the design matures.

Adds an experimental red-teaming module under src/strands_evals/experimental/redteam/ that extends Strands Evals base types (Case, Experiment, Evaluator, ActorSimulator) with adversarial counterparts:

AdversarialCaseGenerator: generates RedTeamCases per risk category, with optional auto-inference of categories from target tools/system_prompt
RedTeamExperiment: orchestrates multi-turn attacker/target conversations
AttackSuccessEvaluator: continuous 0.0-1.0 LLM-as-judge over conversation + tool execution traces
AdversarialActorSimulator: ActorSimulator subclass shared across strategies
AttackStrategy + PromptStrategy with gradual_escalation as the default

kevmyung force-pushed the feat/red-team-foundation branch from 8d7d3f5 to c9f5845 Compare April 25, 2026 04:18

poshinchen reviewed May 1, 2026

poshinchen reviewed May 8, 2026

Comment thread src/strands_evals/experimental/redteam/evaluators/attack_success_evaluator.py

poshinchen reviewed May 8, 2026

Comment thread src/strands_evals/evaluators/red_team_judge_evaluator.py Outdated

poshinchen reviewed May 8, 2026

Comment thread src/strands_evals/evaluators/red_team_judge_evaluator.py Outdated

poshinchen reviewed May 11, 2026
