Based on a systematic review of **204 papers and online resources**, this survey

- `(2025-04)` **Multi-SWE-bench**: Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving [arXiv](https://arxiv.org/abs/2504.02605v1) [OpenReview](https://openreview.net/forum?id=MhBZzkz4h9)
- `(2025-04)` **SWE-PolyBench**: SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents [arXiv](https://arxiv.org/abs/2504.08703)
- `(2025-04)` **SWE-bench Multilingual**: SWE-smith: Scaling Data for Software Engineering Agents [arXiv](https://arxiv.org/abs/2504.21798v2) [OpenReview](https://openreview.net/forum?id=63iVrXc8cC)
- `(2025-03)` **FEA-Bench**: FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation [arXiv](https://arxiv.org/abs/2503.06680v2)
- `(2025-02)` **SWE-Lancer**: SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? [arXiv](https://arxiv.org/abs/2502.12115v4)
- `(2024-12)` **Visual SWE-bench**: CodeV: Issue Resolving with Visual Data [arXiv](https://arxiv.org/abs/2412.17315v1) [DOI](http://dx.doi.org/10.18653/v1/2025.findings-acl.384)
- `(2024-10)` **SWE-bench Multimodal**: SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? [arXiv](https://arxiv.org/abs/2410.03859v1) [OpenReview](https://openreview.net/forum?id=riTiq3i21b)
- `(2024-08)` **SWE-bench-java**: SWE-bench-java: A GitHub Issue Resolving Benchmark for Java [arXiv](https://arxiv.org/abs/2408.14354)
- `(2025-01)` **SWE-Fixer**: SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [arXiv](https://arxiv.org/abs/2501.05040v3)
- `(2023-10)` **SWE-bench-extra**: SWE-bench: Can Language Models Resolve Real-world Github Issues? [arXiv](https://arxiv.org/abs/2310.06770v3)
### 📥 Data Collection Methods

*Techniques for collecting training data*

- `(2026-03)` **OpenSWE**: daVinci-Env: Open SWE Environment Synthesis at Scale [arXiv](https://arxiv.org/abs/2603.13023) [GitHub](https://github.com/GAIR-NLP/OpenSWE) [Hugging Face](https://huggingface.co/datasets/GAIR/OpenSWE)
- `(2026-02)` **DockSmith**: DockSmith: Scaling Reliable Coding Environments via an Agentic Docker Builder [arXiv](https://arxiv.org/abs/2602.00592) [Hugging Face](https://huggingface.co/collections/8sj7df9k8m5x8/docksmith)
- `(2026-02)` **SWE-rebench V2**: SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale [arXiv](https://arxiv.org/abs/2602.23866)
- `(2026-02)` **Scale-SWE**: Immersion in the GitHub Universe: Scaling Coding Agents to Mastery [arXiv](https://arxiv.org/abs/2602.09892) [GitHub](https://github.com/AweAI-Team/ScaleSWE) [Hugging Face](https://huggingface.co/collections/AweAI-Team/scale-swe)
- `(2026-01)` **MEnvAgent**: MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering [arXiv](https://arxiv.org/abs/2601.22859) [GitHub](https://github.com/ernie-research/MEnvAgent)
- `(2025-12)` **Multi-Docker-Eval**: Multi-Docker-Eval: A 'Shovel of the Gold Rush' Benchmark on Automatic Environment Building for Software Engineering [arXiv](https://arxiv.org/abs/2512.06915)
- `(2025-08)` **RepoForge**: RepoForge: Training a SOTA Fast-thinking SWE Agent with an End-to-End Data Curation Pipeline Synergizing SFT and RL at Scale [arXiv](https://arxiv.org/abs/2508.01550)
- `(2025-07)` **SWE-MERA**: SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks [arXiv](https://arxiv.org/abs/2507.11059)
- `(2025-06)` **SWE-Factory**: SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks [arXiv](https://arxiv.org/abs/2506.10954)
- `(2025-05)` **SWE-rebench**: SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents [arXiv](https://arxiv.org/abs/2505.20411v2) [OpenReview](https://openreview.net/forum?id=nMpJoVmRy1)

### 🔬 Data Synthesis Methods

*Approaches for synthetic data generation*

- `(2026-02)` **SWE-World**: SWE-World: Building Software Engineering Agents in Docker-Free Environments [arXiv](https://arxiv.org/abs/2602.03419) [GitHub](https://github.com/RUCAIBox/SWE-World)
- `(2026-02)` **SWE-Hub**: SWE-Hub: A Unified Production System for Scalable, Executable Software Engineering Tasks [arXiv](https://arxiv.org/abs/2603.00575)
- `(2025-09)` **SWE-Mirror**: SWE-Mirror: Scaling Issue-Resolving Datasets by Mirroring Issues Across Repositories [arXiv](https://arxiv.org/abs/2509.08724)
- `(2025-06)` **SWE-Flow**: SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner [arXiv](https://arxiv.org/abs/2506.09003v2) [OpenReview](https://openreview.net/forum?id=P9DQ2IExgS)
- `(2025-04)` **R2E-Gym**: R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents [arXiv](https://arxiv.org/abs/2504.07164) [OpenReview](https://openreview.net/forum?id=7evvwwdo3z)
- `(2025-04)` **SWE-Synth**: SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs [arXiv](https://arxiv.org/abs/2504.14757)
- `(2025-04)` **SWE-smith**: SWE-smith: Scaling Data for Software Engineering Agents [arXiv](https://arxiv.org/abs/2504.21798v2) [OpenReview](https://openreview.net/forum?id=63iVrXc8cC)
- `(2025-01)` **Learn-by-interact**: Learn-by-interact: A Data-Centric Framework For Self-Adaptive Agents in Realistic Environments [arXiv](https://arxiv.org/abs/2501.10893v1) [OpenReview](https://openreview.net/forum?id=3UKOzGWCVY)
184
### 🤖 Single-Agent Systems
159
185
160
186
*Individual autonomous agents for issue resolution*
- `(2025-01)` **CodeMonkeys**: CodeMonkeys: Scaling Test-Time Compute for Software Engineering [arXiv](https://arxiv.org/abs/2501.14723)
- `(2024-10)` **SWE-Search**: SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement [arXiv](https://arxiv.org/abs/2410.20285v6) [OpenReview](https://openreview.net/forum?id=G7sIFXugTX)
### 📈 Data Analysis
*Analysis of datasets and benchmarks*

- `(2025-12)` **Data contamination**: Does SWE-Bench-Verified Test Agent Ability or Model Memory? [arXiv](https://arxiv.org/abs/2512.10218)
- `(2025-11)` **Test Overfitting on SWE-bench**: Investigating Test Overfitting on SWE-bench [arXiv](https://arxiv.org/abs/2511.16858)
- `(2025-07)` **Rigorous agentic benchmarks**: Establishing Best Practices for Building Rigorous Agentic Benchmarks [arXiv](https://arxiv.org/abs/2507.02825)
- `(2025-07)` **SPICE**: SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation [arXiv](https://arxiv.org/abs/2507.09108v5)
- `(2025-06)` **UTBoost**: UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench [arXiv](https://arxiv.org/abs/2506.09289)
- `(2025-06)` **Trustworthiness**: Is Your Automated Software Engineer Trustworthy? [arXiv](https://arxiv.org/abs/2506.17812)
- `(2025-06)` **The SWE-Bench Illusion**: The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason [arXiv](https://arxiv.org/abs/2506.12286)
Open **http://localhost:5000/admin** to manage papers, datasets, and methods.

We welcome contributions! To add new papers or tables:

1. Fork this repository
2. Add entries via the admin interface (`python app.py` → `localhost:5000/admin`), or manually edit the YAML/CSV files in `data/`
3. Run `python app.py --init` if you edited files directly
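For manual edits, the exact schema of the files in `data/` is not shown in this excerpt. As a purely illustrative sketch, with hypothetical field names that may not match the repo's actual schema, a hand-added benchmark entry might look like:

```yaml
# Hypothetical sketch only — check the existing files under data/ for the
# real schema. All field names below are invented for illustration.
- name: SWE-PolyBench
  date: "2025-04"
  title: "SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents"
  links:
    arxiv: https://arxiv.org/abs/2504.08703
```

After a manual edit like this, re-run the init step (`python app.py --init`) so the site picks up the change.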