Commit 54da46d

fix typos
1 parent 774e0b9 commit 54da46d

1 file changed

Lines changed: 6 additions & 4 deletions

README.md

@@ -1,8 +1,10 @@
 # Jailbreaking Frontier Models
 
-This repo accompanies the the blogpost, ["Automatically Jailbreaking Frontier Language Models with Investigator Agents"](https://transluce.org/jailbreaking-frontier-models).
+This repo accompanies the blog post, ["Automatically Jailbreaking Frontier Language Models with Investigator Agents"](https://transluce.org/jailbreaking-frontier-models).
 
-We provide a reference implementation of the dataset and reward function used in the blog post, but note that it is not optimized for efficiency or scalability. Unfortunately, we do not include the RL training loop, as it is tightly coupled with our internal research tooling. However, this codebase should serve as a useful starting point for those who wish to train jailbreaking agents and reproduce our experiments.
+We provide a reference implementation of the dataset and reward function from our blog post. Please keep in mind that this implementation prioritizes clarity over optimization, so you may want to enhance it for efficiency or scalability depending on your needs.
+
+We did not include the RL training loop in this release, as it's closely integrated with our internal research infrastructure. That said, we hope this codebase provides a helpful start for anyone interested in training jailbreaking agents or building upon our experiments.
 
 ## Dataset
 
@@ -33,13 +35,13 @@ Set the `OPENAI_API_KEY` environment variable to your OpenAI API key. Executing
 
 ### Run a test script demonstrating the PRBO reward function
 
-First, host [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) with an OpenAI-compatible endpoint (e.g. vLLM or SGLang) running at `http://HOSTNAME:PORT/v1`. Then, run the following command to compute the PRBO reward for a test prompt.
+First, host [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) with an OpenAI-compatible endpoint (e.g. vLLM or SGLang) running at an accessible URL, which we will refer to as `http://HOSTNAME:PORT/v1`. Then, run the following command to compute the PRBO reward for a test prompt:
 
 ```bash
 uv run python examples/reward_fn_computation.py gpt_oss_base_url=http://HOSTNAME:PORT/v1
 ```
 
-**Warning:** In the paper, we tested many training runs with bonus black-box rewards for attacking various API models (GPT-4.1, GPT-5, Claude Sonnet 4). We do not implement this here, but it is a simple additive bonus to the reward used here (in our training runs, this was a 0-20 bonus per model exploited, scaling linearly depending on the response score). We caution that this can get very expensive, especially when sampling responses from flagship reasoning models. Furthermore, it is possible that sending many attempted jailbreaks to a production API service may trigger monitors for suspicious activity, so this should be done with caution, following all applicable policies.
+**Warning:** In the paper, we tested many training runs with bonus black-box rewards for attacking various API models (GPT-4.1, GPT-5, Claude Sonnet 4). We do not implement this here, but it is a simple additive bonus to the reward function in this repo (in our training runs, this was a bonus of up to 20 points per model exploited, scaling linearly depending on the response score). We caution that this can get very expensive, especially when sampling responses from flagship reasoning models. Furthermore, since sending many attempted jailbreaks to a production API service may trigger monitors for suspicious activity, it should be done with caution, respecting all applicable policies.
 
 # Citation
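
Note on the warning text in the second hunk: the "additive bonus ... scaling linearly" arithmetic it describes can be sketched as below. This is only an illustration, not the authors' implementation (the README states the bonus is not implemented in this repo). The function names `black_box_bonus` and `total_reward`, and the assumption that the response score is normalized to the range 0 to 1, are hypothetical; the README only gives the 20-point maximum per model and the linear scaling.

```python
from typing import Mapping


def black_box_bonus(response_score: float, max_bonus: float = 20.0) -> float:
    """Linear bonus for one target model.

    Assumption (not from the README): response_score is normalized to [0, 1],
    so a score of 1.0 earns the full max_bonus. Out-of-range scores are clamped.
    """
    clamped = min(max(response_score, 0.0), 1.0)
    return max_bonus * clamped


def total_reward(base_reward: float, response_scores: Mapping[str, float]) -> float:
    """Add one linear bonus per target model to the base reward.

    base_reward stands in for the reward computed by this repo's reward
    function; response_scores maps a (hypothetical) model label to its score.
    """
    return base_reward + sum(black_box_bonus(s) for s in response_scores.values())


if __name__ == "__main__":
    # Hypothetical scores for two target models.
    print(total_reward(1.5, {"model_a": 0.5, "model_b": 1.0}))
```

Under these assumptions, each additional target model raises the maximum achievable reward by 20 points, which is also why sampling from several API models per training step multiplies cost, as the warning notes.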
