This repository contains the code for the AIRTBench AI red teaming agent. The AIRTBench agent was used to evaluate the capabilities of large language models (LLMs) in solving AI/ML Capture the Flag (CTF) challenges, specifically those that are LLM-based. The agent is designed to autonomously exploit LLMs by solving challenges on the Dreadnode Strikes platform.
The paper is available on arXiv and in the ACL Anthology.
You can set up the virtual environment with uv:

```bash
uv sync
```

To run the code, you will need access to the Dreadnode Strikes platform; see the docs or submit for the Strikes waitlist here.
This rigging-based agent solves a variety of AI/ML CTF challenges from the Dreadnode Crucible platform, given access to execute Python code in a network-local container built from a custom Dockerfile. This example agent is also a complement to our research paper AIRTBench: Can Language Models Autonomously Exploit Language Models?.
```bash
uv run -m airtbench --help
```

```bash
uv run -m airtbench --model $MODEL --project $PROJECT --platform-api-key $DREADNODE_API_KEY --token $DREADNODE_API_TOKEN --server https://platform.dreadnode.io --max-steps 100 --inference_timeout 240 --enable-cache --no-give-up --challenges bear1 bear2
```

To run the agent against challenges that match the `is_llm: true` criteria (the LLM-based challenges), you can use the following command:
```bash
uv run -m airtbench --model <model> --llm-challenges-only
```

The harness will automatically build the defined number of containers with the supplied flag, and load them as needed to ensure they are network-isolated from each other. The process is generally:
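As an illustration of the isolation scheme (this is a hypothetical sketch, not the repository's actual build configuration; the service names, paths, and env vars are invented), each challenge container can be placed on its own bridge network so challenges cannot reach each other:

```yaml
# Hypothetical Compose sketch: one service per challenge, each on a
# dedicated network, with its flag injected via an environment variable.
services:
  bear1:
    build: ./containers/bear1      # assumed path to the challenge Dockerfile
    environment:
      - FLAG=${BEAR1_FLAG}         # flag supplied by the harness
    networks: [bear1_net]
  bear2:
    build: ./containers/bear2
    environment:
      - FLAG=${BEAR2_FLAG}
    networks: [bear2_net]

networks:
  bear1_net:                       # separate bridge networks keep the
  bear2_net:                       # challenge containers isolated
```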
- For each challenge, prepare the agent with the Jupyter notebook given in the challenge
- Task the agent with solving the CTF challenge based on notebook contents
- Bring up the associated environment
- Test the agent's ability to execute Python code, which runs inside a Jupyter kernel whose output is fed back to the model
- If the CTF challenge is solved and the flag is observed, the agent must submit the flag
- Otherwise, run until an error occurs, the agent gives up, or max-steps is reached
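The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the harness's actual implementation: `model(history)` (returns a code snippet, or `None` to give up), `execute_code` (runs the snippet in the kernel), and `submit_flag` (validates a candidate) are all hypothetical callables, and the `gAAAA…`-style flag pattern is assumed:

```python
import re

# Assumed flag format (Crucible-style tokens); adjust for the real platform.
FLAG_RE = re.compile(r"gAAAA[\w-]+")

def run_agent(model, execute_code, submit_flag, max_steps=100):
    """Sketch of the solve loop: generate code, run it in the kernel,
    feed the output back, and submit the flag if one appears."""
    history = []
    for _ in range(max_steps):
        code = model(history)
        if code is None:                    # the model gives up
            return None
        output = execute_code(code)         # run inside the Jupyter kernel
        history.append((code, output))      # feed the result back to the model
        match = FLAG_RE.search(output)
        if match and submit_flag(match.group()):
            return match.group()            # challenge solved
    return None                             # max-steps reached without a flag
```

A mock model that "finds" a flag on its first step would drive this loop to completion in a single iteration, which is also a convenient way to unit-test the harness wiring.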
Check out the challenge manifest to see current challenges in scope.
If you know of a model that may be interesting to analyze, but do not have the resources to run it yourself, feel free to open a feature request via a GitHub issue.
