Research: advanced recommendation methods (Copeland, Borda, pairwise comparison) #103

## Summary
Our current scoring is a weighted sum (tests + convergence + diff outlier penalty). This works for simple cases but is too crude to answer "which solution is best" when criteria conflict. Established research on multi-criteria decision making and social choice theory could improve recommendations.

## Research directions

### Copeland scoring (tournament method)
Each agent is compared pairwise against every other agent. An agent gets +1 for each pairwise "win" and -1 for each "loss." The agent with the highest Copeland score wins. A "win" could be defined as: passes more tests, has higher convergence, or is preferred by an LLM judge.

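**Pros:** Tolerates non-transitive (cyclic) pairwise preferences. Condorcet-consistent: if one agent beats every other agent head-to-head, it gets the highest Copeland score.
**Cons:** Needs a clear pairwise comparison function. Ties for the top score are possible, and a Condorcet winner does not always exist.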
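
A minimal sketch of the +1/-1 tally described above. The agent names, metric values, and the `compare` rule (whoever is better on more criteria wins the pair) are assumptions for illustration, not the project's actual data model.

```python
from itertools import combinations

def copeland_scores(agents, compare):
    """Return {agent: wins - losses} over all unordered pairs.

    compare(a, b) must return 1 if a wins, -1 if b wins, 0 for a tie.
    """
    scores = {a: 0 for a in agents}
    for a, b in combinations(agents, 2):
        outcome = compare(a, b)
        scores[a] += outcome
        scores[b] -= outcome
    return scores

# Hypothetical per-agent metrics; higher is better for both criteria.
metrics = {
    "agent_a": {"tests_passed": 10, "convergence": 0.9},
    "agent_b": {"tests_passed": 8, "convergence": 0.7},
    "agent_c": {"tests_passed": 9, "convergence": 0.5},
}

def compare(a, b):
    # Assumed pairwise rule: the agent that is better on more criteria wins.
    wins_a = sum(metrics[a][k] > metrics[b][k] for k in metrics[a])
    wins_b = sum(metrics[b][k] > metrics[a][k] for k in metrics[a])
    return (wins_a > wins_b) - (wins_a < wins_b)

scores = copeland_scores(list(metrics), compare)
print(scores)  # {'agent_a': 2, 'agent_b': -1, 'agent_c': -1}
```

Here `agent_b` and `agent_c` each win one criterion against the other, so that pair is a tie and both end on -1.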

### Borda count
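Rank agents on each criterion separately, then sum the ranks. For example, an agent ranked 1st on tests, 2nd on convergence, and 1st on correctness scores 1+2+1 = 4 (lower is better). This is the rank-sum form; the classic Borda count awards N - rank points per criterion so that higher is better, which yields the same ordering.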

**Pros:** Simple, interpretable, handles heterogeneous criteria.
**Cons:** Violates independence of irrelevant alternatives: adding or removing a non-winning agent can change which agent comes out on top.
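
The rank-sum variant can be sketched as follows. The metric names and values are made up so that `agent_a` reproduces the 1+2+1 = 4 example; ties share the better rank, which is one of several reasonable tie rules.

```python
def borda_rank_sums(metrics):
    """Sum each agent's per-criterion rank (1 = best); lower total is better.

    metrics: {agent: {criterion: value}}, higher value is better.
    An agent's rank is 1 + the number of agents strictly better on that criterion.
    """
    agents = list(metrics)
    criteria = list(next(iter(metrics.values())))
    totals = {a: 0 for a in agents}
    for c in criteria:
        for a in agents:
            strictly_better = sum(metrics[o][c] > metrics[a][c] for o in agents)
            totals[a] += 1 + strictly_better
    return totals

# Hypothetical values for illustration only.
metrics = {
    "agent_a": {"tests": 10, "convergence": 0.7, "correctness": 0.9},
    "agent_b": {"tests": 8, "convergence": 0.9, "correctness": 0.8},
    "agent_c": {"tests": 6, "convergence": 0.5, "correctness": 0.6},
}

totals = borda_rank_sums(metrics)
print(totals)  # {'agent_a': 4, 'agent_b': 5, 'agent_c': 9}
```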

### LLM-as-judge (pairwise)
Use an LLM to compare pairs of diffs: "Which of these two solutions is better and why?" Aggregate pairwise preferences using Copeland or Elo rating.

**Pros:** Can capture semantic quality that metrics miss (better naming, cleaner error handling, more idiomatic code).
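**Cons:** Expensive: N*(N-1)/2 API calls for one comparison per pair, and twice that if each pair is judged in both orders to offset position bias. LLM judges also have known biases, such as favoring the first-listed or the longer answer.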
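
As a rough sketch of the Elo aggregation option, the loop below makes one judge call per unordered pair and applies the standard Elo update. The `judge` callable stands in for the LLM call and is a stub here; `k=32` and a starting rating of 1000 are conventional defaults chosen for illustration.

```python
from itertools import combinations

def expected_score(r_a, r_b):
    """Standard Elo expected score of a player rated r_a against r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_from_pairwise(agents, judge, k=32.0, start=1000.0):
    """Rate agents from one judged comparison per unordered pair.

    judge(a, b) must return the preferred agent (either a or b).
    Returns (ratings, number_of_judge_calls).
    """
    ratings = {a: start for a in agents}
    n_calls = 0
    for a, b in combinations(agents, 2):
        winner = judge(a, b)  # in practice: an LLM call comparing two diffs
        n_calls += 1
        s_a = 1.0 if winner == a else 0.0
        e_a = expected_score(ratings[a], ratings[b])
        delta = k * (s_a - e_a)
        ratings[a] += delta
        ratings[b] -= delta  # zero-sum update
    return ratings, n_calls

# Stub judge (assumption): always prefers the alphabetically first name.
def stub_judge(a, b):
    return min(a, b)

ratings, n_calls = elo_from_pairwise(["a", "b", "c", "d"], stub_judge)
print(n_calls)  # 6, i.e. 4*3/2
```

Elo results depend on the order in which pairs are processed, so with a single pass per pair Copeland may be the more predictable aggregator.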

### Multi-criteria Decision Analysis (MCDA)
Formal methods like TOPSIS, PROMETHEE, or AHP that handle multiple conflicting criteria with weights.
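
To make one of these concrete, here is a minimal TOPSIS sketch: vector-normalize each criterion, apply weights, then score each agent by its relative closeness to the ideal point. The criteria (tests passed, convergence, diff size), the weights, and the values are assumptions for illustration. The sketch does not guard against an all-zero column or identical rows, which would divide by zero.

```python
import math

def topsis(matrix, weights, benefit):
    """Return a closeness score in [0, 1] per row (higher is better).

    matrix:  one row per agent, one column per criterion
    weights: one weight per criterion
    benefit: True if higher is better for that criterion, False if lower is better
    """
    n_crit = len(weights)
    # Vector normalization: divide each column by its Euclidean norm.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    cols = list(zip(*v))
    best = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        scores.append(d_worst / (d_best + d_worst))
    return scores

# Hypothetical data: [tests_passed, convergence, diff_size_lines]
matrix = [
    [10, 0.9, 120],  # agent_a
    [8, 0.7, 40],    # agent_b
    [6, 0.5, 400],   # agent_c
]
scores = topsis(matrix, weights=[0.5, 0.3, 0.2], benefit=[True, True, False])
```

In this data `agent_c` is worst on every criterion, so it coincides with the anti-ideal point and scores 0.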

### Ensemble of judges
Use multiple evaluation methods (current scoring + LLM judge + code complexity metrics) and aggregate their results via voting (a meta-ensemble).
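
One simple way to aggregate is a plurality vote over each method's top pick. This is only a sketch: the method names and winners below are hypothetical, and ties are returned as a list rather than broken.

```python
from collections import Counter

def plurality_vote(winners_by_method):
    """Return the agent(s) picked as winner by the most methods, sorted by name."""
    counts = Counter(winners_by_method.values())
    top = max(counts.values())
    return sorted(agent for agent, n in counts.items() if n == top)

# Hypothetical winners from three evaluation methods.
picks = {
    "weighted_sum": "agent_a",
    "llm_judge": "agent_b",
    "complexity": "agent_a",
}
result = plurality_vote(picks)
print(result)  # ['agent_a']
```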

## Relevant academic work
- Kambhampati's LLM-Modulo framework (external verifiers for LLM outputs)
- AlphaCode's clustering + filtering approach
- Social choice theory (Arrow, Condorcet, Borda)
- MCDA textbooks (Belton & Stewart)

## Proposed first step
Add Copeland pairwise scoring as an alternative to the weighted sum. It is simple to implement and well understood, and it handles the "smaller isn't always better" problem naturally: each criterion is compared independently, and only who wins each comparison matters, not by how much.

## Acceptance criteria
- [ ] Research and document at least 3 approaches
- [ ] Implement Copeland scoring as `--scoring copeland` option
- [ ] Compare results: does Copeland recommend different agents than weighted sum?
- [ ] Update docs/architecture.md
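
For the `--scoring copeland` item, the flag could be wired up roughly as below, assuming the CLI uses `argparse`. The program name `recommend` and the default value name `weighted` are placeholders, not existing names in this project.

```python
import argparse

def build_parser():
    # "recommend" and "weighted" are hypothetical names for illustration.
    parser = argparse.ArgumentParser(prog="recommend")
    parser.add_argument(
        "--scoring",
        choices=["weighted", "copeland"],
        default="weighted",
        help="scoring method used to pick the recommended agent",
    )
    return parser

args = build_parser().parse_args(["--scoring", "copeland"])
default_args = build_parser().parse_args([])
```

Keeping the weighted sum as the default would let the comparison in the third item run both methods on the same inputs without changing existing behavior.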
