knowledge_base/AI/Benchmarks
- ## LiveCodeBench
+ ## LiveCodeBench : AI Competitive programming benchmark
  [paper](https://arxiv.org/pdf/2403.07974)
  [blog](https://huggingface.co/blog/leaderboard-livecodebench)
  - solve competitive problems
  - test cases are generated by Gpt-4-turbo based on problem description
    - verified by running on known solution

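The test-case note for LiveCodeBench can be read as: a model proposes inputs, and a known-correct solution is run on them to confirm they are usable and to record the expected outputs. The sketch below illustrates that reading only; `build_test_cases`, `sum_two`, and the string-in/string-out interface are hypothetical and are not taken from the LiveCodeBench code.

```python
from typing import Callable, List, Tuple

# (input, expected_output) pair, both as raw text. Hypothetical representation.
TestCase = Tuple[str, str]


def build_test_cases(
    reference_solution: Callable[[str], str],
    generated_inputs: List[str],
) -> List[TestCase]:
    """Run a known-correct solution on model-generated inputs.

    Inputs on which the reference solution raises are treated as invalid
    and dropped (an assumption of this sketch); the rest become test cases.
    """
    cases: List[TestCase] = []
    for test_input in generated_inputs:
        try:
            expected = reference_solution(test_input)
        except Exception:
            continue  # malformed input: the known solution cannot handle it
        cases.append((test_input, expected))
    return cases


def sum_two(line: str) -> str:
    """Toy 'known solution': add two integers given on one line."""
    a, b = map(int, line.split())
    return str(a + b)


# "x y" is not a valid input for this toy problem, so it is filtered out.
cases = build_test_cases(sum_two, ["1 2", "x y", "10 -4"])
```

A submitted program would then be judged by comparing its output against the stored expected output for each surviving input.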
- ## SWE Bench : AI Agent benchmarking
+ ## SWE Bench : AI Software engineer benchmark
  [site](https://www.swebench.com/), [paper](https://arxiv.org/pdf/2310.06770v2)
  - resolve github issues
    - Input : Issue, Code base snapshot
    - Filter by increase in test fail-to-pass ratio
    - After filtering, out of 90000 problems, 2294 were selected
  - [openai](https://openai.com/index/introducing-swe-bench-verified) partnered to verify the benchmark
-   - makes sure the tests captures that the Issue is fixed and are not dependent on the implementation details (follows BDD)
+   - makes sure the tests captures that the Issue is fixed and are not dependent on the implementation details (follows BDD)
+
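The fail-to-pass filter in the SWE Bench notes can be sketched as follows: a candidate issue is kept only if at least one test that failed on the code base snapshot passes after the fix is applied. This is a minimal illustration under assumptions; the function names and the dict-of-booleans result format are hypothetical, not SWE-bench's actual tooling, and a test that exists only after the fix is counted here as failing before.

```python
from typing import Dict, List


def fail_to_pass_tests(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Names of tests that fail before the fix and pass after it.

    `before` and `after` map test name -> passed?. A test missing from
    `before` (e.g. added together with the fix) is treated as failing
    before; that is an assumption of this sketch.
    """
    return sorted(
        name
        for name, passed in after.items()
        if passed and not before.get(name, False)
    )


def keep_candidate(before: Dict[str, bool], after: Dict[str, bool]) -> bool:
    """Keep a candidate only if the fix flips at least one test to passing."""
    return len(fail_to_pass_tests(before, after)) > 0


before_fix = {"test_parse": True, "test_issue_123": False}
after_fix = {"test_parse": True, "test_issue_123": True, "test_new_case": True}

flipped = fail_to_pass_tests(before_fix, after_fix)
kept = keep_candidate(before_fix, after_fix)
```

With the example data above, `test_issue_123` and `test_new_case` count as fail-to-pass, so the candidate would be kept; a fix that changes no test outcome would be filtered out.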
+ ## SWE Lancer : AI Freelancer benchmark
+ [paper](https://arxiv.org/pdf/2502.12115)
+ - a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in realworld payouts.
+ - Input/Output same as SWE bench
+ - Evaluation : End to End browser automation tests from original freelancer of the task.