|
1 | 1 | # Jailbreaking Frontier Models |
2 | 2 |
|
3 | | -This repo accompanies the the blogpost, ["Automatically Jailbreaking Frontier Language Models with Investigator Agents"](https://transluce.org/jailbreaking-frontier-models). |
| 3 | +This repo accompanies the blog post, ["Automatically Jailbreaking Frontier Language Models with Investigator Agents"](https://transluce.org/jailbreaking-frontier-models). |
4 | 4 |
|
5 | | -We provide a reference implementation of the dataset and reward function used in the blog post, but note that it is not optimized for efficiency or scalability. Unfortunately, we do not include the RL training loop, as it is tightly coupled with our internal research tooling. However, this codebase should serve as a useful starting point for those who wish to train jailbreaking agents and reproduce our experiments. |
| 5 | +We provide a reference implementation of the dataset and reward function from our blog post. Please keep in mind that this implementation prioritizes clarity over optimization, so you may want to enhance it for efficiency or scalability depending on your needs. |
| 6 | + |
| 7 | +We did not include the RL training loop in this release, as it's closely integrated with our internal research infrastructure. That said, we hope this codebase provides a helpful start for anyone interested in training jailbreaking agents or building upon our experiments. |
6 | 8 |
|
7 | 9 | ## Dataset |
8 | 10 |
|
@@ -33,13 +35,13 @@ Set the `OPENAI_API_KEY` environment variable to your OpenAI API key. Executing |
33 | 35 |
|
34 | 36 | ### Run a test script demonstrating the PRBO reward function |
35 | 37 |
|
36 | | -First, host [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) with an OpenAI-compatible endpoint (e.g. vLLM or SGLang) running at `http://HOSTNAME:PORT/v1`. Then, run the following command to compute the PRBO reward for a test prompt. |
| 38 | +First, host [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) with an OpenAI-compatible endpoint (e.g. vLLM or SGLang) running at an accessible URL, which we will refer to as `http://HOSTNAME:PORT/v1`. Then, run the following command to compute the PRBO reward for a test prompt: |
37 | 39 |
|
38 | 40 | ```bash |
39 | 41 | uv run python examples/reward_fn_computation.py gpt_oss_base_url=http://HOSTNAME:PORT/v1 |
40 | 42 | ``` |
41 | 43 |
|
42 | | -**Warning:** In the paper, we tested many training runs with bonus black-box rewards for attacking various API models (GPT-4.1, GPT-5, Claude Sonnet 4). We do not implement this here, but it is a simple additive bonus to the reward used here (in our training runs, this was a 0-20 bonus per model exploited, scaling linearly depending on the response score). We caution that this can get very expensive, especially when sampling responses from flagship reasoning models. Furthermore, it is possible that sending many attempted jailbreaks to a production API service may trigger monitors for suspicious activity, so this should be done with caution, following all applicable policies. |
| 44 | +**Warning:** In the paper, we tested many training runs with bonus black-box rewards for attacking various API models (GPT-4.1, GPT-5, Claude Sonnet 4). We do not implement this here, but it is a simple additive bonus to the reward function in this repo (in our training runs, this was a bonus of up to 20 points per model exploited, scaling linearly depending on the response score). We caution that this can get very expensive, especially when sampling responses from flagship reasoning models. Furthermore, sending many attempted jailbreaks to a production API service may trigger monitors for suspicious activity, so do this with caution and in compliance with all applicable policies. |
43 | 45 |
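The additive bonus described in the warning can be sketched roughly as follows. This is a minimal illustration, not code from this repo: the function names `black_box_bonus` and `total_reward`, the dictionary of per-model response scores, and the assumption that scores are normalized to the range 0 to 1 are all hypothetical. Only the 20-point cap per model and the linear scaling come from the text above.

```python
# Hypothetical sketch of an additive black-box bonus; not part of this repo.
# Assumption: each target model yields a response score normalized to [0, 1].

def black_box_bonus(response_scores, max_bonus=20.0):
    """Sum a linearly scaled bonus (0 to max_bonus) over all target models."""
    total = 0.0
    for _model_name, score in response_scores.items():
        # Clamp to [0, 1] so an out-of-range score cannot exceed the cap.
        fraction = min(1.0, max(0.0, score))
        total += max_bonus * fraction
    return total


def total_reward(prbo_reward, response_scores):
    """Add the black-box bonus on top of a precomputed PRBO reward."""
    return prbo_reward + black_box_bonus(response_scores)


if __name__ == "__main__":
    # Placeholder model names and scores, for illustration only.
    scores = {"model_a": 0.5, "model_b": 1.0}
    print(total_reward(-3.0, scores))
```

Because the bonus is summed per model, the number of API calls grows with every target model added, which is where the cost noted in the warning comes from.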
|
44 | 46 | # Citation |
45 | 47 |
|
|