Summary
Our current scoring is a weighted sum (tests + convergence + diff outlier penalty). This works for simple cases but is rudimentary for the complex question of "which solution is best." There's established research on multi-criteria decision making and social choice theory that could improve recommendations.
Research directions
Copeland scoring (tournament method)
Each agent is compared pairwise against every other agent. An agent gets +1 for each pairwise "win" and -1 for each "loss." The agent with the highest Copeland score wins. A "win" could be defined as: passes more tests, has higher convergence, or is preferred by an LLM judge.
Pros: Tolerates non-transitive (cyclic) pairwise preferences; always selects the Condorcet winner when one exists.
Cons: Needs a clear pairwise comparison function.
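As a rough illustration of the scoring rule described above, here is a minimal Python sketch. The function name `copeland_scores`, the `compare` callback, and the sample `tests_passed` data are all hypothetical and not part of the current codebase; ties are assumed to score 0 for both agents.

```python
from itertools import combinations
from typing import Callable, Dict, Hashable, Iterable


def copeland_scores(
    agents: Iterable[Hashable],
    compare: Callable[[Hashable, Hashable], float],
) -> Dict[Hashable, int]:
    """Return Copeland scores: +1 per pairwise win, -1 per loss, 0 for a tie.

    compare(a, b) must return a positive number if a beats b, a negative
    number if b beats a, and 0 for a tie.
    """
    agents = list(agents)
    scores = {agent: 0 for agent in agents}
    for a, b in combinations(agents, 2):
        outcome = compare(a, b)
        if outcome > 0:
            scores[a] += 1
            scores[b] -= 1
        elif outcome < 0:
            scores[a] -= 1
            scores[b] += 1
    return scores


# Hypothetical data: number of tests passed per agent.
tests_passed = {"agent_a": 10, "agent_b": 8, "agent_c": 8}


def more_tests(a: str, b: str) -> int:
    # Pairwise "win" defined as passing more tests.
    return tests_passed[a] - tests_passed[b]


scores = copeland_scores(tests_passed, more_tests)
winner = max(scores, key=scores.get)
```

The comparison function is deliberately pluggable, so the same loop would work whether a "win" comes from test counts, convergence, or an LLM judge.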
Borda count
Rank agents on each criterion separately, then sum the ranks. For example, an agent ranked 1st on tests, 2nd on convergence, and 1st on correctness scores 1 + 2 + 1 = 4 (lower is better).
Pros: Simple, interpretable, handles heterogeneous criteria.
Cons: Violates independence of irrelevant alternatives: adding or removing one agent can change the relative order of the others.
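The rank-sum variant described above could look roughly like this in Python. The function name, the criterion names, and the sample values are assumptions made for illustration; tied agents are assumed to share the better rank.

```python
from typing import Dict


def borda_rank_sums(
    criteria: Dict[str, Dict[str, float]],
    higher_is_better: Dict[str, bool],
) -> Dict[str, int]:
    """Sum each agent's per-criterion rank (1 = best); lower totals are better."""
    agents = list(next(iter(criteria.values())))
    totals = {agent: 0 for agent in agents}
    for name, values in criteria.items():
        # Flip the sign for "lower is better" criteria so one comparison works.
        sign = 1 if higher_is_better[name] else -1
        for agent in agents:
            strictly_better = sum(
                1 for other in agents
                if sign * values[other] > sign * values[agent]
            )
            totals[agent] += 1 + strictly_better
    return totals


# Hypothetical per-criterion measurements for three agents.
criteria = {
    "tests": {"agent_a": 10, "agent_b": 8, "agent_c": 9},
    "convergence": {"agent_a": 0.7, "agent_b": 0.9, "agent_c": 0.5},
    "correctness": {"agent_a": 0.95, "agent_b": 0.80, "agent_c": 0.90},
}
higher_is_better = {"tests": True, "convergence": True, "correctness": True}

totals = borda_rank_sums(criteria, higher_is_better)
best = min(totals, key=totals.get)
```

With this sample data, agent_a is ranked 1st, 2nd, and 1st, reproducing the total of 4 from the example above.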
LLM-as-judge (pairwise)
Use an LLM to compare pairs of diffs: "Which of these two solutions is better and why?" Aggregate pairwise preferences using Copeland or Elo rating.
Pros: Can capture semantic quality that metrics miss (better naming, cleaner error handling, more idiomatic code).
Cons: Expensive (N*(N-1)/2 API calls for N agents); LLM judges have known biases, such as favoring the answer shown first or the longer answer.
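A minimal sketch of the Elo aggregation path, assuming the judge is wrapped as a callable that returns True when it prefers the first diff. The `stub_judge` below is only a stand-in heuristic so the example runs without an API; the function names, K-factor, and starting rating are illustrative assumptions.

```python
from itertools import combinations
from typing import Callable, Dict


def elo_from_pairwise(
    diffs: Dict[str, str],
    judge: Callable[[str, str], bool],
    k: float = 32.0,
    initial: float = 1000.0,
) -> Dict[str, float]:
    """Run one judged comparison per pair and fold the results into Elo ratings."""
    ratings = {name: initial for name in diffs}
    for a, b in combinations(diffs, 2):  # N*(N-1)/2 judge calls
        a_wins = judge(diffs[a], diffs[b])
        # Standard Elo expected score for a against b.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        delta = k * ((1.0 if a_wins else 0.0) - expected_a)
        ratings[a] += delta
        ratings[b] -= delta
    return ratings


calls = []


def stub_judge(first: str, second: str) -> bool:
    # Placeholder for a real LLM call: prefers the diff with more except clauses.
    calls.append((first, second))
    return first.count("except") >= second.count("except")


diffs = {
    "agent_a": "try: parse(x)\nexcept ValueError: log(x)",
    "agent_b": "parse(x)",
    "agent_c": "try: parse(x)\nexcept ValueError: log(x)\nexcept KeyError: skip(x)",
}
ratings = elo_from_pairwise(diffs, stub_judge)
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

Elo ratings depend on the order in which comparisons are processed, so with a single pass over few agents the Copeland aggregation may be the more stable choice.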
Multi-criteria Decision Analysis (MCDA)
Formal methods such as TOPSIS, PROMETHEE, or AHP handle multiple conflicting criteria with explicit weights.
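To give a feel for one of these methods, here is a small pure-Python sketch of TOPSIS: vector-normalize each criterion, apply weights, then score each agent by its relative closeness to the ideal point. The criteria, weights, and sample values are made up for illustration, and the zero-division guards are a simplifying assumption.

```python
import math
from typing import Dict, List


def topsis(
    matrix: Dict[str, List[float]],
    weights: List[float],
    benefit: List[bool],
) -> Dict[str, float]:
    """Return closeness to the ideal solution in [0, 1]; higher is better."""
    names = list(matrix)
    cols = range(len(weights))
    # Vector normalization per criterion (guard against an all-zero column).
    norms = [math.sqrt(sum(matrix[n][j] ** 2 for n in names)) or 1.0 for j in cols]
    weighted = {
        n: [weights[j] * matrix[n][j] / norms[j] for j in cols] for n in names
    }
    # Ideal best/worst depend on whether the criterion is a benefit or a cost.
    best = [(max if benefit[j] else min)(weighted[n][j] for n in names) for j in cols]
    worst = [(min if benefit[j] else max)(weighted[n][j] for n in names) for j in cols]
    closeness = {}
    for n in names:
        d_best = math.dist(weighted[n], best)
        d_worst = math.dist(weighted[n], worst)
        total = d_best + d_worst
        # If all agents are identical, both distances are zero.
        closeness[n] = d_worst / total if total else 0.0
    return closeness


# Hypothetical columns: tests passed, convergence, diff size in lines (a cost).
matrix = {
    "agent_a": [10, 0.7, 120],
    "agent_b": [8, 0.9, 40],
    "agent_c": [9, 0.5, 300],
}
closeness = topsis(matrix, weights=[0.5, 0.3, 0.2], benefit=[True, True, False])
ranked = sorted(closeness, key=closeness.get, reverse=True)
```

Marking diff size as a cost criterion is one way to express "smaller is preferred" without letting it dominate the other criteria.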
Ensemble of judges
Use multiple evaluation methods (current scoring + LLM judge + code complexity metrics) and aggregate their results via voting, i.e. a meta-ensemble.
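One simple way to aggregate several judges is plurality voting over each judge's top pick, sketched below. The judge names and picks are hypothetical, and returning every tied agent (rather than breaking ties) is an assumption of this sketch.

```python
from collections import Counter
from typing import Dict, List


def plurality_vote(top_picks: Dict[str, str]) -> List[str]:
    """Map of judge name -> preferred agent; returns all agents tied for most votes."""
    counts = Counter(top_picks.values())
    most = max(counts.values())
    return sorted(agent for agent, votes in counts.items() if votes == most)


# Hypothetical top picks from three evaluation methods.
top_picks = {
    "weighted_sum": "agent_a",
    "llm_judge": "agent_b",
    "complexity": "agent_a",
}
winners = plurality_vote(top_picks)
```

If the judges produce full rankings rather than a single pick, the same rankings could instead be fed into Borda or Copeland aggregation.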
Relevant academic work
- Kambhampati's LLM-Modulo framework (external verifiers for LLM outputs)
- AlphaCode's clustering + filtering approach
- Social choice theory (Arrow, Condorcet, Borda)
- MCDA textbooks (Belton & Stewart)
Proposed first step
Add Copeland pairwise scoring as an alternative to the weighted sum. It is simple to implement and well understood, and it handles the "smaller isn't always better" problem naturally because each criterion is compared independently.
Acceptance criteria
- --scoring copeland option
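If the CLI is built on argparse, the option could be wired roughly as follows. This is only a sketch: the program name `tool`, the `weighted` choice name, and the default are assumptions, since the issue only specifies `--scoring copeland`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # "tool" and the "weighted" choice are placeholder names for this sketch.
    parser = argparse.ArgumentParser(prog="tool")
    parser.add_argument(
        "--scoring",
        choices=["weighted", "copeland"],
        default="weighted",
        help="scoring method used to rank agent solutions",
    )
    return parser


args = build_parser().parse_args(["--scoring", "copeland"])
default_args = build_parser().parse_args([])
```

Keeping the weighted sum as the default would make Copeland scoring opt-in while it is being evaluated.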