# Code Review Arena for OpenEnv

`code_review_env` is a production-style OpenEnv benchmark for pull-request review.
Instead of toy gameplay, the agent reviews realistic code changes for security,
correctness, reliability, and quality regressions, then submits a structured review.

This repository is designed to score well on the four hackathon judging axes:

1. Runtime correctness
2. OpenEnv interface compliance
3. Task design quality
4. Grading logic sophistication

## Why this environment is competitive

- Real-world task: PR review is a daily engineering workflow with direct product value.
- Deterministic benchmark: no external APIs, no flaky third-party services, no hidden randomness.
- Rich interaction loop: the agent can list changed files, inspect diffs or full files, search code, and submit a final review.
- Sophisticated grading: optimal finding-to-rubric matching, severity weighting, line tolerance, semantic keyword checks, duplicate detection, and false-positive penalties.
- Judge-friendly packaging: standalone OpenEnv environment with Docker, tests, client, and CI validation.

## Benchmark design

The built-in corpus contains realistic PR tasks across:

- Broken access control
- SQL injection
- Path traversal
- SSRF
- JWT validation mistakes
- Concurrency and race conditions
- Client-side XSS
- False-positive control via a clean refactor task

Each task includes:

- PR title and description
- Changed file summaries
- Unified diff snippets
- Full changed-file contents for inspection
- CI summary
- Hidden reference findings used by the grader
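
As a rough sketch, a single task record carrying those fields might look like the Python dict below. The field names mirror the list above, but the exact schema, key names, and values here are illustrative assumptions, not the repository's actual bundle format:

```python
# Hypothetical task record; the real bundle schema may differ.
example_task = {
    "task_id": "sql_injection_report_filters",
    "pr_title": "Add customer report filters",
    "pr_description": "Adds period and customer filters to reporting queries.",
    "changed_files": ["analytics/reporting.py"],
    "diff_snippets": {"analytics/reporting.py": "@@ -20,6 +20,12 @@ (unified diff)"},
    "file_contents": {"analytics/reporting.py": "(full file text)"},
    "ci_summary": "all checks passed",
    # Hidden from the agent; consumed only by the grader.
    "reference_findings": [
        {
            "file_path": "analytics/reporting.py",
            "line_start": 24,
            "line_end": 31,
            "severity": "critical",
            "category": "sql_injection",
        }
    ],
}
```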

## Action space

The environment accepts one typed `CodeReviewAction` with these `action_type` values:

- `list_files`
- `inspect_file`
- `search_code`
- `submit_review`

Final review submissions are a list of structured findings:

- `file_path`
- `line_start`
- `line_end`
- `severity`
- `category`
- `title`
- `explanation`
- `confidence`

## Scoring model

The grader uses optimal one-to-one matching between submitted findings and reference findings.
Each candidate match blends:

- file/path agreement
- line alignment with tolerance
- category normalization and alias matching
- severity agreement
- title/explanation semantic coverage

The final score combines:

- coverage of true issues
- precision of submitted issues
- efficiency from staying within the review budget
- penalties for false positives
- penalties for duplicates
- penalties for missing high-severity findings

This makes the reward function significantly harder to game than simple exact-string matching.
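
The matching idea can be sketched with a small stdlib-only example: score every (submitted, reference) pair, then pick the one-to-one assignment that maximizes total score. The pair-scoring weights below are illustrative toy values, not the grader's actual coefficients:

```python
from itertools import permutations


def pair_score(sub: dict, ref: dict, line_tol: int = 5) -> float:
    """Blend file, line, category, and severity agreement (toy weights)."""
    if sub["file_path"] != ref["file_path"]:
        return 0.0
    score = 0.4  # file agreement
    if abs(sub["line_start"] - ref["line_start"]) <= line_tol:
        score += 0.3  # line alignment within tolerance
    if sub["category"] == ref["category"]:
        score += 0.2  # category agreement (real grader also normalizes aliases)
    if sub["severity"] == ref["severity"]:
        score += 0.1  # severity agreement
    return score


def best_assignment(subs: list[dict], refs: list[dict]) -> float:
    """Optimal one-to-one matching by exhaustive search.

    Exhaustive permutation search is fine at review scale (a handful of
    findings per PR); a production grader might use the Hungarian algorithm.
    """
    if not subs or not refs:
        return 0.0
    if len(subs) > len(refs):
        subs, refs = refs, subs  # pair_score is symmetric in the fields it reads
    best = 0.0
    for perm in permutations(range(len(refs)), len(subs)):
        total = sum(pair_score(subs[i], refs[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best
```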

## Local development

```bash
uv sync --extra dev
uv run pytest
uv run server --port 8000
```

Validate structure with OpenEnv:

```bash
openenv validate --verbose
openenv validate http://127.0.0.1:8000
```

## Example usage

```python
import asyncio

from code_review_env import CodeReviewAction, CodeReviewEnv, ReviewFinding


async def main() -> None:
    async with CodeReviewEnv(base_url="http://127.0.0.1:8000") as env:
        result = await env.reset(task_id="sql_injection_report_filters")
        print(result.observation.pr_title)

        await env.step(
            CodeReviewAction(
                action_type="inspect_file",
                file_path="analytics/reporting.py",
                view_mode="full",
                start_line=1,
                end_line=120,
            )
        )

        graded = await env.step(
            CodeReviewAction(
                action_type="submit_review",
                findings=[
                    ReviewFinding(
                        file_path="analytics/reporting.py",
                        line_start=24,
                        line_end=31,
                        severity="critical",
                        category="sql_injection",
                        title="Unsafe string interpolation in SQL query",
                        explanation=(
                            "customer_id and period are inserted directly into SQL, "
                            "so an attacker can change the query instead of using "
                            "parameter binding."
                        ),
                        confidence=0.95,
                    )
                ],
            )
        )
        print(graded.observation.scorecard.overall_score)


asyncio.run(main())
```

## Hugging Face / Space deployment

The environment is ready for:

```bash
openenv push --repo-id <your-hf-space>
```

No API keys are required for the benchmark itself. If you later want a private rubric
bundle for leaderboard use, you can point `CODE_REVIEW_TASK_BUNDLE_PATH` at a private
JSON file without changing the environment interface.
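
For example, a private bundle could be supplied at launch like this (the file path is a placeholder; only the variable name comes from above):

```shell
CODE_REVIEW_TASK_BUNDLE_PATH=/secrets/private_tasks.json uv run server --port 8000
```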