
Commit 08b57ed

Build OpenEnv code review benchmark

0 parents, 21 files changed: +4924 additions, -0 deletions

.github/workflows/ci.yml

Lines changed: 47 additions & 0 deletions
name: CI

on:
  push:
    branches: ["main"]
  pull_request:

jobs:
  validate:
    runs-on: ubuntu-latest
    timeout-minutes: 25

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Install OpenEnv CLI
        run: python -m pip install git+https://github.com/meta-pytorch/OpenEnv.git

      - name: Sync environment
        run: uv sync --frozen --extra dev

      - name: Run tests
        run: uv run pytest -q

      - name: Validate local environment structure
        run: openenv validate --verbose

      - name: Start server
        run: |
          uv run server --port 8000 &
          echo $! > server.pid
          sleep 8

      - name: Validate running environment
        run: openenv validate http://127.0.0.1:8000

      - name: Stop server
        if: always()
        run: |
          if [ -f server.pid ]; then kill $(cat server.pid) || true; fi

.gitignore

Lines changed: 12 additions & 0 deletions
.venv/
__pycache__/
.pytest_cache/
.coverage
htmlcov/
dist/
build/
*.pyc
*.pyo
*.pyd
*.egg-info/

LICENSE

Lines changed: 22 additions & 0 deletions
MIT License

Copyright (c) 2026 Rohan5commit

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 162 additions & 0 deletions
# Code Review Arena for OpenEnv

`code_review_env` is a production-style OpenEnv benchmark for pull-request review.
Instead of toy gameplay, the agent reviews realistic code changes for security,
correctness, reliability, and quality regressions, then submits a structured review.

This repository is designed to score well on the four hackathon judging axes:

1. Runtime correctness
2. OpenEnv interface compliance
3. Task design quality
4. Grading logic sophistication

## Why this environment is competitive

- Real-world task: PR review is a daily engineering workflow with direct product value.
- Deterministic benchmark: no external APIs, no flaky third-party services, no hidden randomness.
- Rich interaction loop: the agent can list changed files, inspect diffs or full files, search code, and submit a final review.
- Sophisticated grading: optimal finding-to-rubric matching, severity weighting, line tolerance, semantic keyword checks, duplicate detection, and false-positive penalties.
- Judge-friendly packaging: standalone OpenEnv environment with Docker, tests, client, and CI validation.

## Benchmark design

The built-in corpus contains realistic PR tasks across:

- Broken access control
- SQL injection
- Path traversal
- SSRF
- JWT validation mistakes
- Concurrency and race conditions
- Client-side XSS
- False-positive control via a clean refactor task

Each task includes:

- PR title and description
- Changed file summaries
- Unified diff snippets
- Full changed-file contents for inspection
- CI summary
- Hidden reference findings used by the grader

## Action space

The environment accepts one typed `CodeReviewAction` with these `action_type` values:

- `list_files`
- `inspect_file`
- `search_code`
- `submit_review`

Final review submissions are a list of structured findings:

- `file_path`
- `line_start`
- `line_end`
- `severity`
- `category`
- `title`
- `explanation`
- `confidence`
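A single submitted finding is just a small structured record. As a hedged illustration (field names from the list above; the values are example data, not taken from a real task):

```python
# Illustrative finding payload; the field names come from the list above,
# while all values are hypothetical example data.
finding = {
    "file_path": "analytics/reporting.py",
    "line_start": 24,
    "line_end": 31,
    "severity": "critical",
    "category": "sql_injection",
    "title": "Unsafe string interpolation in SQL query",
    "explanation": "User input is concatenated into the SQL text instead of bound as a parameter.",
    "confidence": 0.95,
}
```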

## Scoring model

The grader uses optimal one-to-one matching between submitted findings and reference findings.
Each candidate match blends:

- file/path agreement
- line alignment with tolerance
- category normalization and alias matching
- severity agreement
- title/explanation semantic coverage

The final score combines:

- coverage of true issues
- precision of submitted issues
- efficiency from staying within the review budget
- penalties for false positives
- penalties for duplicates
- penalties for missing high-severity findings

This makes the reward function significantly harder to game than simple exact-string matching.
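The optimal one-to-one matching can be pictured with a short sketch. This is not the grader's actual code; it is a minimal illustration, assuming dict-shaped findings and purely illustrative weights, of how blended pair scores feed a best-assignment search over small finding sets:

```python
from itertools import combinations, permutations


def match_score(found: dict, ref: dict, line_tol: int = 5) -> float:
    """Blend of the signals described above; the 0.6/0.4 weights are illustrative."""
    if found["file_path"] != ref["file_path"]:
        return 0.0
    line_gap = abs(found["line_start"] - ref["line_start"])
    line_part = max(0.0, 1.0 - line_gap / line_tol)  # line alignment with tolerance
    cat_part = 1.0 if found["category"] == ref["category"] else 0.0
    return 0.6 * line_part + 0.4 * cat_part


def best_assignment(findings: list[dict], refs: list[dict]) -> float:
    """Brute-force optimal one-to-one matching (fine for small review sets)."""
    k = min(len(findings), len(refs))
    best = 0.0
    for f_sel in combinations(range(len(findings)), k):
        for r_sel in permutations(range(len(refs)), k):
            total = sum(match_score(findings[i], refs[j]) for i, j in zip(f_sel, r_sel))
            best = max(best, total)
    return best
```

A production grader would also fold in severity agreement, keyword coverage, and category aliases before applying the coverage, precision, and penalty terms.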

## Local development

```bash
uv sync --extra dev
uv run pytest
uv run server --port 8000
```

Validate structure with OpenEnv:

```bash
openenv validate --verbose
openenv validate http://127.0.0.1:8000
```

## Example usage

```python
import asyncio

from code_review_env import CodeReviewAction, CodeReviewEnv, ReviewFinding


async def main() -> None:
    async with CodeReviewEnv(base_url="http://127.0.0.1:8000") as env:
        result = await env.reset(task_id="sql_injection_report_filters")
        print(result.observation.pr_title)

        await env.step(
            CodeReviewAction(
                action_type="inspect_file",
                file_path="analytics/reporting.py",
                view_mode="full",
                start_line=1,
                end_line=120,
            )
        )

        graded = await env.step(
            CodeReviewAction(
                action_type="submit_review",
                findings=[
                    ReviewFinding(
                        file_path="analytics/reporting.py",
                        line_start=24,
                        line_end=31,
                        severity="critical",
                        category="sql_injection",
                        title="Unsafe string interpolation in SQL query",
                        explanation=(
                            "customer_id and period are inserted directly into SQL, "
                            "so an attacker can change the query instead of using "
                            "parameter binding."
                        ),
                        confidence=0.95,
                    )
                ],
            )
        )
        print(graded.observation.scorecard.overall_score)


asyncio.run(main())
```

## Hugging Face / Space deployment

The environment is ready for:

```bash
openenv push --repo-id <your-hf-space>
```

No API keys are required for the benchmark itself. If you later want a private rubric
bundle for leaderboard use, you can point `CODE_REVIEW_TASK_BUNDLE_PATH` at a private
JSON file without changing the environment interface.

__init__.py

Lines changed: 38 additions & 0 deletions
"""Code review benchmark environment for OpenEnv."""

try:
    from .client import CodeReviewEnv
    from .models import (
        ChangedFileSummary,
        CodeReviewAction,
        CodeReviewObservation,
        CodeReviewState,
        FindingAssessment,
        ReviewFinding,
        ReviewScorecard,
        SearchHit,
    )
except ImportError:  # pragma: no cover
    from client import CodeReviewEnv
    from models import (
        ChangedFileSummary,
        CodeReviewAction,
        CodeReviewObservation,
        CodeReviewState,
        FindingAssessment,
        ReviewFinding,
        ReviewScorecard,
        SearchHit,
    )

__all__ = [
    "ChangedFileSummary",
    "CodeReviewAction",
    "CodeReviewEnv",
    "CodeReviewObservation",
    "CodeReviewState",
    "FindingAssessment",
    "ReviewFinding",
    "ReviewScorecard",
    "SearchHit",
]

client.py

Lines changed: 72 additions & 0 deletions
"""Typed OpenEnv client for the code review benchmark."""

from __future__ import annotations

from typing import Any

from openenv.core.env_client import EnvClient
from openenv.core.client_types import StepResult

try:
    from .models import (
        ChangedFileSummary,
        CodeReviewAction,
        CodeReviewObservation,
        CodeReviewState,
        ReviewScorecard,
        SearchHit,
    )
except ImportError:  # pragma: no cover
    from models import (
        ChangedFileSummary,
        CodeReviewAction,
        CodeReviewObservation,
        CodeReviewState,
        ReviewScorecard,
        SearchHit,
    )


class CodeReviewEnv(EnvClient[CodeReviewAction, CodeReviewObservation, CodeReviewState]):
    """Persistent WebSocket client for the code review environment."""

    def _step_payload(self, action: CodeReviewAction) -> dict[str, Any]:
        return action.model_dump(exclude_none=True)

    def _parse_result(self, payload: dict[str, Any]) -> StepResult[CodeReviewObservation]:
        obs_data = payload.get("observation", {})
        scorecard_data = obs_data.get("scorecard")
        observation = CodeReviewObservation(
            task_id=obs_data.get("task_id", ""),
            task_title=obs_data.get("task_title", ""),
            difficulty=obs_data.get("difficulty", ""),
            phase=obs_data.get("phase", "overview"),
            instructions=obs_data.get("instructions", ""),
            repo_name=obs_data.get("repo_name", ""),
            pr_title=obs_data.get("pr_title", ""),
            pr_description=obs_data.get("pr_description", ""),
            ci_summary=obs_data.get("ci_summary", ""),
            action_result=obs_data.get("action_result", ""),
            displayed_content=obs_data.get("displayed_content", ""),
            changed_files=[
                ChangedFileSummary.model_validate(item)
                for item in obs_data.get("changed_files", [])
            ],
            search_results=[
                SearchHit.model_validate(item)
                for item in obs_data.get("search_results", [])
            ],
            attempts_remaining=obs_data.get("attempts_remaining", 0),
            scorecard=(
                ReviewScorecard.model_validate(scorecard_data) if scorecard_data else None
            ),
            done=payload.get("done", False),
            reward=payload.get("reward"),
        )
        return StepResult(
            observation=observation,
            reward=payload.get("reward"),
            done=payload.get("done", False),
        )

    def _parse_state(self, payload: dict[str, Any]) -> CodeReviewState:
        return CodeReviewState.model_validate(payload)
