Add evaluate command: compare scoring methods across all runs #106

Merged
that-github-user merged 1 commit into main from issue-105-scoring-evaluation on Mar 28, 2026

Conversation

@that-github-user

Owner

Summary

  • thinktank evaluate re-scores all past runs with weighted, Copeland, and Borda
  • Shows per-run comparison table and agreement rates
  • Implements Borda count as third scoring method (local to evaluate)
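For readers unfamiliar with the two social choice methods, here is a minimal sketch of how Borda and Copeland scores can be computed. It is not the code from this PR: the `Candidate` shape (one numeric score per criterion, higher is better), the function names, and the sample data are all assumptions made for illustration, and thinktank's real inputs may look different.

```typescript
// Hypothetical shape: one candidate's scores per criterion, higher is better.
type Candidate = { id: string; scores: number[] };

// Borda count: on each criterion, a candidate earns one point for every
// rival it strictly beats. Ties earn nothing in this simplified variant.
function bordaScores(candidates: Candidate[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const a of candidates) {
    let points = 0;
    a.scores.forEach((score, k) => {
      for (const b of candidates) {
        if (b.id !== a.id && score > b.scores[k]) points++;
      }
    });
    totals[a.id] = points;
  }
  return totals;
}

// Copeland: compare every pair head to head. A candidate wins the pairing
// if it is ahead on more criteria than it is behind; +1 per pairwise win,
// -1 per pairwise loss, 0 for a drawn pairing.
function copelandScores(candidates: Candidate[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const a of candidates) {
    let total = 0;
    for (const b of candidates) {
      if (b.id === a.id) continue;
      let ahead = 0;
      let behind = 0;
      a.scores.forEach((score, k) => {
        if (score > b.scores[k]) ahead++;
        else if (score < b.scores[k]) behind++;
      });
      if (ahead > behind) total += 1;
      else if (ahead < behind) total -= 1;
    }
    totals[a.id] = total;
  }
  return totals;
}

// Made-up data: three candidates scored on three criteria.
const sample: Candidate[] = [
  { id: "A", scores: [9, 8, 2] },
  { id: "B", scores: [7, 7, 7] },
  { id: "C", scores: [1, 9, 9] },
];

const borda = bordaScores(sample);       // { A: 3, B: 2, C: 4 }
const copeland = copelandScores(sample); // { A: 0, B: -2, C: 2 }
console.log(borda, copeland);
```

In this sample both methods pick C, even though A has the single highest score on the first criterion.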

Key findings from 21 real runs

  • All three agree: 52% — scoring method matters nearly half the time
  • Copeland = Borda: 86% — two independent social choice methods converge
  • Weighted disagrees with both in ~40% of runs, which makes weighted the outlier
  • Implication: Copeland should likely become the default scoring method
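The agreement rates above are simple fractions of runs in which two (or all three) methods pick the same winner. A minimal sketch of that calculation follows, assuming each run has already been reduced to one winner per method. The `RunWinners` type, the `agreementRates` function, and the four sample runs are hypothetical and are not the 21 real runs reported in this PR.

```typescript
// Hypothetical record: the winning candidate id under each scoring method.
type RunWinners = { weighted: string; copeland: string; borda: string };

type AgreementRates = {
  allAgree: number;
  copelandBorda: number;
  weightedCopeland: number;
  weightedBorda: number;
};

// Fraction of runs for which each pair (or all three) of methods agree.
function agreementRates(runs: RunWinners[]): AgreementRates {
  const n = runs.length;
  const frac = (pred: (r: RunWinners) => boolean): number =>
    n === 0 ? 0 : runs.filter(pred).length / n;
  return {
    allAgree: frac((r) => r.weighted === r.copeland && r.copeland === r.borda),
    copelandBorda: frac((r) => r.copeland === r.borda),
    weightedCopeland: frac((r) => r.weighted === r.copeland),
    weightedBorda: frac((r) => r.weighted === r.borda),
  };
}

// Made-up runs, for illustration only.
const sampleRuns: RunWinners[] = [
  { weighted: "a", copeland: "a", borda: "a" },
  { weighted: "b", copeland: "a", borda: "a" },
  { weighted: "a", copeland: "a", borda: "b" },
  { weighted: "c", copeland: "c", borda: "c" },
];

const rates = agreementRates(sampleRuns);
console.log(rates);
// allAgree 0.5, copelandBorda 0.75, weightedCopeland 0.75, weightedBorda 0.5
```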

Change type

  • New feature

Related issue

Partial fix for #105

How to test

npm test      # 126 tests pass
thinktank evaluate  # shows scoring comparison

Breaking changes

  • This PR introduces breaking changes

🤖 Generated with Claude Code

thinktank evaluate re-scores all past runs with weighted, Copeland,
and Borda methods, showing agreement rates and disagreements.

Key finding from 21 runs: Copeland and Borda agree 86% of the time,
while weighted disagrees with both ~40%. This suggests weighted scoring
is the outlier — Copeland should likely be the default.

Partial fix for #105

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
that-github-user merged commit 251bc50 into main on Mar 28, 2026
4 checks passed
that-github-user deleted the issue-105-scoring-evaluation branch on March 28, 2026, 22:54