Research: advanced recommendation methods (Copeland, Borda, pairwise comparison) #103

@that-github-user

Summary

Our current scoring is a weighted sum (tests + convergence + diff outlier penalty). This works for simple cases but is rudimentary for the complex question of "which solution is best." There's established research on multi-criteria decision making and social choice theory that could improve recommendations.

Research directions

Copeland scoring (tournament method)

Each agent is compared pairwise against every other agent. An agent gets +1 for each pairwise "win" and -1 for each "loss." The agent with the highest Copeland score wins. A "win" could be defined as: passes more tests, has higher convergence, or is preferred by an LLM judge.

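Pros: Condorcet-consistent (if one agent beats every other agent head-to-head, it is always selected); still produces a ranking when pairwise preferences are cyclic (non-transitive).
Cons: Needs a clear pairwise comparison function; ties between top agents are common and need a tie-break rule.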
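
A minimal sketch of the Copeland tally described above. The metric names (`tests_passed`, `convergence`) and the `beats` rule (more tests passed wins, convergence breaks ties) are assumptions for illustration, not the project's actual data model.

```python
from itertools import combinations

# Hypothetical per-agent metrics; field names are illustrative only.
agents = {
    "agent-a": {"tests_passed": 10, "convergence": 0.80},
    "agent-b": {"tests_passed": 10, "convergence": 0.60},
    "agent-c": {"tests_passed": 8, "convergence": 0.90},
}

def beats(x, y):
    """Assumed pairwise rule: more tests passed wins; convergence breaks ties."""
    return (x["tests_passed"], x["convergence"]) > (y["tests_passed"], y["convergence"])

def copeland_scores(agents, beats):
    """+1 for each pairwise win, -1 for each loss, 0 for a tie."""
    scores = {name: 0 for name in agents}
    for a, b in combinations(agents, 2):
        if beats(agents[a], agents[b]):
            scores[a] += 1
            scores[b] -= 1
        elif beats(agents[b], agents[a]):
            scores[b] += 1
            scores[a] -= 1
    return scores

scores = copeland_scores(agents, beats)
winner = max(scores, key=scores.get)
print(scores, winner)
```

Any function with the same signature can replace `beats`, including one backed by an LLM judge.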

Borda count

Rank agents on each criterion separately, then sum the ranks. For example, an agent ranked 1st on tests, 2nd on convergence, and 1st on correctness scores 1 + 2 + 1 = 4 (lower is better).

Pros: Simple, interpretable, handles heterogeneous criteria.
Cons: Sensitive to irrelevant alternatives (adding or removing a non-winning agent can change which agent wins).
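
A minimal sketch of the rank-sum variant described above, reproducing the 1 + 2 + 1 = 4 example. The criterion values are made up for illustration, and ties share the best available rank, which is one of several possible conventions.

```python
# Hypothetical raw values per criterion (higher is better for all three here).
criteria = {
    "tests": {"agent-a": 10, "agent-b": 9, "agent-c": 8},
    "convergence": {"agent-a": 0.7, "agent-b": 0.9, "agent-c": 0.5},
    "correctness": {"agent-a": 0.95, "agent-b": 0.90, "agent-c": 0.85},
}

def ranks(values):
    """1-based rank per agent: 1 + number of agents with a strictly higher value."""
    return {
        name: 1 + sum(other > v for other in values.values())
        for name, v in values.items()
    }

def rank_sum(criteria):
    """Sum each agent's rank across all criteria (lower total is better)."""
    totals = {}
    for per_agent in criteria.values():
        for name, r in ranks(per_agent).items():
            totals[name] = totals.get(name, 0) + r
    return totals

totals = rank_sum(criteria)
best = min(totals, key=totals.get)
print(totals, best)
```

Here `agent-a` is 1st, 2nd, and 1st, giving a total of 4.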

LLM-as-judge (pairwise)

Use an LLM to compare pairs of diffs: "Which of these two solutions is better and why?" Aggregate pairwise preferences using Copeland or Elo rating.

Pros: Can capture semantic quality that metrics miss (better naming, cleaner error handling, more idiomatic code).
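Cons: Expensive (N*(N-1)/2 API calls for N agents, doubled if each pair is judged in both orders), and LLM judges can be biased (for example, toward the first-listed or the longer answer).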
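
A rough sketch of aggregating pairwise judgments with Elo. The `judge` callable is hypothetical: in practice it would wrap an LLM API call, but here it is replaced by a deterministic stub so the example runs offline. The K-factor of 32 and the starting rating of 1000 are conventional defaults, not values from this project.

```python
from itertools import combinations

def elo_from_pairwise(diffs, judge, k=32, start=1000.0):
    """Run one round-robin of pairwise judgments and return (ratings, call_count)."""
    ratings = {name: start for name in diffs}
    calls = 0
    for a, b in combinations(diffs, 2):
        verdict = judge(diffs[a], diffs[b])  # assumed to return "first" or "second"
        calls += 1
        expected_a = 1 / (1 + 10 ** ((ratings[b] - ratings[a]) / 400))
        score_a = 1.0 if verdict == "first" else 0.0
        delta = k * (score_a - expected_a)
        ratings[a] += delta
        ratings[b] -= delta
    return ratings, calls

# Stub standing in for an LLM judge: prefers the diff with the higher hidden quality.
_quality = {"diff A": 3, "diff B": 2, "diff C": 1}

def stub_judge(diff_a, diff_b):
    return "first" if _quality[diff_a] > _quality[diff_b] else "second"

diffs = {"agent-a": "diff A", "agent-b": "diff B", "agent-c": "diff C"}
ratings, calls = elo_from_pairwise(diffs, stub_judge)
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking, calls)
```

With three agents the round-robin makes 3 = 3*(3-1)/2 judge calls. Elo ratings depend on the order in which pairs are processed, so a Copeland tally over the same verdicts may be the more reproducible choice.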

Multi-criteria Decision Analysis (MCDA)

Formal methods like TOPSIS, PROMETHEE, or AHP that handle multiple conflicting criteria with weights.
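
As one concrete instance, TOPSIS ranks alternatives by their closeness to an ideal point. The sketch below uses made-up metric values and weights, and it assumes the diff outlier penalty is a cost criterion (lower is better). None of these numbers come from the project.

```python
import math

# Hypothetical decision matrix: columns are tests, convergence, diff outlier penalty.
matrix = {
    "agent-a": [10, 0.9, 0.1],
    "agent-b": [8, 0.7, 0.3],
    "agent-c": [6, 0.5, 0.5],
}
weights = [0.5, 0.3, 0.2]      # assumed weights
benefit = [True, True, False]  # False marks a cost criterion

def topsis(matrix, weights, benefit):
    names = list(matrix)
    cols = range(len(weights))
    # 1. Vector-normalize each column, then apply the weights.
    norms = [math.sqrt(sum(matrix[n][j] ** 2 for n in names)) or 1.0 for j in cols]
    v = {n: [weights[j] * matrix[n][j] / norms[j] for j in cols] for n in names}
    # 2. Ideal best and worst value per criterion.
    hi = [max(v[n][j] for n in names) for j in cols]
    lo = [min(v[n][j] for n in names) for j in cols]
    ideal = [hi[j] if benefit[j] else lo[j] for j in cols]
    anti = [lo[j] if benefit[j] else hi[j] for j in cols]
    # 3. Closeness: distance to the worst point relative to total distance.
    closeness = {}
    for n in names:
        d_best = math.dist(v[n], ideal)
        d_worst = math.dist(v[n], anti)
        total = d_best + d_worst
        closeness[n] = d_worst / total if total else 0.0
    return closeness

closeness = topsis(matrix, weights, benefit)
best = max(closeness, key=closeness.get)
print(closeness, best)
```

Closeness lies in [0, 1]. An agent that is best on every criterion scores 1, and one that is worst on every criterion scores 0.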

Ensemble of judges

Use multiple evaluation methods (current scoring + LLM judge + code complexity metrics) and aggregate their outputs via voting, i.e. a meta-ensemble.
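
One possible way to do the voting step, sketched under the assumption that each judge outputs a full ranking (best first): treat each judge as a voter, decide each pair by majority, then tally Copeland scores. The judge names and rankings below are invented for illustration.

```python
from itertools import combinations

# Hypothetical rankings from three evaluation methods, best agent first.
rankings = {
    "weighted_sum": ["agent-a", "agent-b", "agent-c"],
    "llm_judge": ["agent-b", "agent-a", "agent-c"],
    "complexity": ["agent-a", "agent-c", "agent-b"],
}

def majority_margin(a, b, rankings):
    """Number of judges ranking a above b, minus the number ranking b above a."""
    margin = 0
    for order in rankings.values():
        margin += 1 if order.index(a) < order.index(b) else -1
    return margin

def ensemble_copeland(rankings):
    """Copeland tally where each pairwise contest is decided by judge majority."""
    agents = next(iter(rankings.values()))
    scores = {name: 0 for name in agents}
    for a, b in combinations(agents, 2):
        margin = majority_margin(a, b, rankings)
        if margin > 0:
            scores[a] += 1
            scores[b] -= 1
        elif margin < 0:
            scores[b] += 1
            scores[a] -= 1
    return scores

ensemble_scores = ensemble_copeland(rankings)
ensemble_winner = max(ensemble_scores, key=ensemble_scores.get)
print(ensemble_scores, ensemble_winner)
```

With an odd number of judges and strict rankings, every pairwise contest has a majority, although the overall Copeland scores can still tie.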

Relevant academic work

  • Kambhampati's LLM-Modulo framework (external verifiers for LLM outputs)
  • AlphaCode's clustering + filtering approach
  • Social choice theory (Arrow, Condorcet, Borda)
  • MCDA textbooks (Belton & Stewart)

Proposed first step

Add Copeland pairwise scoring as an alternative to weighted sum. Simple to implement, well-understood, and handles the "smaller isn't always better" problem naturally because each criterion is compared independently.
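
To make the contrast concrete, here is a sketch in which the two methods disagree. It assumes one reading of "each criterion is compared independently": an agent wins a pairwise contest when it is better on more criteria than its opponent. The weights and metric values are hypothetical and are not the project's actual weighted-sum configuration.

```python
from itertools import combinations

# Hypothetical normalized metrics per agent.
agents = {
    "agent-a": {"tests": 1.0, "convergence": 0.5, "diff_penalty": 0.4},
    "agent-b": {"tests": 0.9, "convergence": 0.6, "diff_penalty": 0.2},
    "agent-c": {"tests": 0.8, "convergence": 0.7, "diff_penalty": 0.3},
}
WEIGHTS = {"tests": 0.6, "convergence": 0.3, "diff_penalty": 0.1}  # assumed
HIGHER_IS_BETTER = {"tests": True, "convergence": True, "diff_penalty": False}

def weighted_sum(m):
    """Weighted sum with the penalty subtracted."""
    return (WEIGHTS["tests"] * m["tests"]
            + WEIGHTS["convergence"] * m["convergence"]
            - WEIGHTS["diff_penalty"] * m["diff_penalty"])

def criteria_won(x, y):
    """Count the criteria on which x is strictly better than y."""
    return sum(
        (x[c] > y[c]) if higher else (x[c] < y[c])
        for c, higher in HIGHER_IS_BETTER.items()
    )

def copeland(agents):
    scores = dict.fromkeys(agents, 0)
    for a, b in combinations(agents, 2):
        wins_a = criteria_won(agents[a], agents[b])
        wins_b = criteria_won(agents[b], agents[a])
        if wins_a != wins_b:
            win, lose = (a, b) if wins_a > wins_b else (b, a)
            scores[win] += 1
            scores[lose] -= 1
    return scores

weighted_winner = max(agents, key=lambda n: weighted_sum(agents[n]))
copeland_scores = copeland(agents)
copeland_winner = max(copeland_scores, key=copeland_scores.get)
print(weighted_winner, copeland_winner)
```

In this made-up data the weighted sum picks `agent-a` on the strength of its test score, while Copeland picks `agent-b`, which wins on two of three criteria against each rival. Which recommendation is preferable is exactly the question the comparison in the acceptance criteria is meant to answer.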

Acceptance criteria

  • Research and document at least 3 approaches
  • Implement Copeland scoring as --scoring copeland option
  • Compare results: does Copeland recommend different agents than weighted sum?
  • Update docs/architecture.md

Labels: enhancement
