TransluceAI
diff --git a/.github/workflows/typecheck.yml b/.github/workflows/typecheck.yml
Lines changed: 42 additions & 0 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 162 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,42 @@
+name: Type Check
+
+on:
+  push:
+    branches: [ main, master ]
+  pull_request:
+    branches: [ main, master ]
+  workflow_dispatch:
+
+jobs:
+  typecheck:
+    runs-on: ubuntu-latest
+    
+    steps:
+    - name: Checkout code
+      uses: actions/checkout@v4
+    
+    - name: Install uv
+      uses: astral-sh/setup-uv@v3
+      with:
+        enable-cache: true
+        cache-dependency-glob: "uv.lock"
+    
+    - name: Set up Python
+      run: uv python install 3.12
+    
+    - name: Install dependencies
+      run: |
+        uv sync --all-extras --dev
+    
+    - name: Run pyright
+      run: |
+        uv run pyright
+    
+    - name: Upload type checking results
+      if: failure()
+      uses: actions/upload-artifact@v4
+      with:
+        name: typecheck-results
+        path: |
+          **/pyrightconfig.json
+          **/.pyright/
@@ -0,0 +1,162 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# pdm
+#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+#   in version control.
+#   https://pdm.fming.dev/#use-with-ide
+.pdm.toml
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+#  be added to the global gitignore or merged into this project gitignore.  For a PyCharm
+#  project, it is common to ignore these directories.
+.idea/
+
+# uv
+.python-version
@@ -0,0 +1,59 @@
+# Jailbreaking Frontier Models
+
+This repo accompanies the blog post, ["Automatically Jailbreaking Frontier Language Models with Investigator Agents"](https://transluce.org/jailbreaking-frontier-models). 
+
+We provide a reference implementation of the dataset and reward function from our blog post. Please keep in mind that this implementation prioritizes clarity over optimization, so you may want to enhance it for efficiency or scalability depending on your needs. 
+
+We did not include the RL training loop in this release, as it's closely integrated with our internal research infrastructure. That said, we hope this codebase provides a helpful start for anyone interested in training jailbreaking agents or building upon our experiments.
+
+## Dataset
+
+The dataset is located at `data/jailbreak_dataset.jsonl`. The lines have the following format:
+
+```json
+{
+    "behavior_id": "<unique id for the behavior>",
+    "behavior": "<user prompt for the behavior to elicit>",
+    "category": "<one of drug/chemical/biological/radiological/nuclear/explosive>",
+    "optimizer_target": "<start of an affirmative response, like \"Sure...\" -- this is not used by our investigator agent, but might be useful for other jailbreaking techniques>"
+}
+```
+
+## Running the reference reward function
+
+### Install dependencies
+
+First install [uv](https://docs.astral.sh/uv/), then run the following to install the dependencies in a virtual environment.
+
+```bash
+uv sync
+```
+
+### Set up environment variables
+
+Set the `OPENAI_API_KEY` environment variable to your OpenAI API key. Executing the test script below will send a few queries to `gpt-4.1-mini`, which we use as our default judge model.
+
+### Run a test script demonstrating the PRBO reward function
+
+First, host [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) with an OpenAI-compatible endpoint (e.g. vLLM or SGLang) running at an accessible URL, which we will refer to as `http://HOSTNAME:PORT/v1`. Then, run the following command to compute the PRBO reward for a test prompt:
+
+```bash
+uv run python examples/reward_fn_computation.py gpt_oss_base_url=http://HOSTNAME:PORT/v1
+```
+
+**Warning:** In the blog post, we experimented with many training runs that included bonus black-box rewards for attacking various API models (GPT-4.1, GPT-5, Claude Sonnet 4). We do not implement this bonus here, but it is a simple additive term on top of the reward function in this repo: in our training runs, it was worth up to 20 points per model exploited, scaling linearly with the response score. We caution that this can get very expensive, especially when sampling responses from flagship reasoning models. **Additionally, since sending many attempted jailbreaks to a production API service may trigger monitors for suspicious activity, it should be done with caution, respecting all applicable policies.**
+
+## Citation
+
+If you reference this work in a publication, please cite:
+
+```bibtex
+@misc{chowdhury2025jailbreaking,
+  author       = {Chowdhury, Neil and Schwettmann, Sarah and Steinhardt, Jacob},
+  title        = {Automatically Jailbreaking Frontier Language Models with Investigator Agents},
+  year         = {2025},
+  month        = {September},
+  day          = {3},
+  howpublished = {\url{https://transluce.org/jailbreaking-frontier-models}}
+}
+```