From 1f42f6f35f0e64d24ab13741021c6c2a5090bf8d Mon Sep 17 00:00:00 2001
From: Xavier Puspus <36430014+xmpuspus@users.noreply.github.com>
Date: Sun, 24 May 2026 09:22:21 +0800
Subject: [PATCH] Add AWB to 8.2 Benchmarks > Integrated Benchmarks

AWB (AI Workflow Benchmark) evaluates AI coding workflows on 100 tasks
across 8 categories using real OSS repositories pinned at commit SHAs.
Scored across 7 capability dimensions; ships 9 adapters, including
Claude Code, Cursor, Aider, Gemini CLI, Codex CLI, Windsurf, Copilot,
and Pi.
---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index 65b44d9..5db3688 100644
--- a/README.md
+++ b/README.md
@@ -4895,6 +4895,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 
 - **OmniCode**: "OmniCode: A Benchmark for Evaluating Software Engineering Agents" [2026-02] [[paper](https://arxiv.org/abs/2602.02262)]
 
+- **AWB**: "AI Workflow Benchmark: Evaluating End-to-End AI Coding Workflows on Real Open-Source Tasks" [2026-04] [[repo](https://github.com/xmpuspus/ai-workflow-benchmark)] [[methodology](https://github.com/xmpuspus/ai-workflow-benchmark/blob/main/METHODOLOGY.md)]
+
 #### Evaluation Metrics
 
 - "CodeBLEU: a Method for Automatic Evaluation of Code Synthesis" [2020-09] [[paper](https://arxiv.org/abs/2009.10297)]