Add evaluate command: compare scoring methods across all runs by that-github-user · Pull Request #106 · that-github-user/thinktank

that-github-user · 2026-03-28T22:53:02Z

Summary

thinktank evaluate re-scores all past runs with the weighted, Copeland, and Borda methods
Shows a per-run comparison table and agreement rates
Implements Borda count as a third scoring method (local to evaluate)
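
For readers unfamiliar with the two social choice methods named above, here is a minimal sketch of textbook Borda and Copeland scoring. It is not the code from this PR: the input shape (one best-first ranking of candidate ids per criterion) and the names `Ranking`, `Scores`, `bordaScores`, `copelandScores`, and `winner` are assumptions made for illustration.

```typescript
// Hypothetical input shape: one best-first ranking of candidate ids per criterion.
type Ranking = string[];
type Scores = Record<string, number>;

// Borda count: in a ranking of n candidates, position p (0-based) earns n - 1 - p points.
function bordaScores(rankings: Ranking[]): Scores {
  const scores: Scores = {};
  for (const ranking of rankings) {
    const n = ranking.length;
    ranking.forEach((id, position) => {
      scores[id] = (scores[id] || 0) + (n - 1 - position);
    });
  }
  return scores;
}

// Copeland: +1 for each pairwise majority win, -1 for each loss, 0 for a tie.
function copelandScores(rankings: Ranking[]): Scores {
  const scores: Scores = {};
  for (const ranking of rankings) {
    for (const id of ranking) scores[id] = 0;
  }
  const ids = Object.keys(scores);
  for (let i = 0; i < ids.length; i++) {
    for (let j = i + 1; j < ids.length; j++) {
      const a = ids[i];
      const b = ids[j];
      let margin = 0; // rankings preferring a minus rankings preferring b
      for (const ranking of rankings) {
        const pa = ranking.indexOf(a);
        const pb = ranking.indexOf(b);
        if (pa === -1 || pb === -1) continue; // skip rankings missing either candidate
        margin += pa < pb ? 1 : -1;
      }
      if (margin > 0) {
        scores[a] += 1;
        scores[b] -= 1;
      } else if (margin < 0) {
        scores[a] -= 1;
        scores[b] += 1;
      }
    }
  }
  return scores;
}

// Highest score wins; ties go to whichever candidate was seen first.
function winner(scores: Scores): string | undefined {
  let best: string | undefined;
  for (const id of Object.keys(scores)) {
    if (best === undefined || scores[id] > scores[best]) best = id;
  }
  return best;
}

// Toy rankings for illustration only.
const rankings: Ranking[] = [
  ["a", "b", "c"],
  ["a", "c", "b"],
  ["b", "a", "c"],
];
console.log(bordaScores(rankings));    // a: 5, b: 3, c: 1
console.log(copelandScores(rankings)); // a: 2, b: 0, c: -2
```

Borda rewards consistently high placement, while Copeland only counts who beats whom head to head, so the two can pick different winners even though they agree on this toy input.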

Key findings from 21 real runs

All three agree: 52%, so the choice of scoring method matters nearly half the time
Copeland = Borda: 86%; two independent social choice methods largely converge
Weighted disagrees with both ~40% of the time, making weighted the outlier
Implication: Copeland should likely become the default scoring method
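
As a rough illustration of how agreement rates like these could be derived, the sketch below counts how often the methods pick the same winner. The `RunWinners` record, the `agreementRates` function, and the toy data are all hypothetical; the actual evaluate command may compute or report these figures differently.

```typescript
// Hypothetical record: the winning candidate id chosen by each method for one run.
interface RunWinners {
  weighted: string;
  copeland: string;
  borda: string;
}

interface AgreementRates {
  allThree: number;
  copelandBorda: number;
  weightedCopeland: number;
  weightedBorda: number;
}

function agreementRates(runs: RunWinners[]): AgreementRates {
  const total = runs.length;
  // Fraction of runs satisfying a predicate; 0 when there are no runs.
  const rate = (pred: (r: RunWinners) => boolean): number =>
    total === 0 ? 0 : runs.filter(pred).length / total;
  return {
    allThree: rate((r) => r.weighted === r.copeland && r.copeland === r.borda),
    copelandBorda: rate((r) => r.copeland === r.borda),
    weightedCopeland: rate((r) => r.weighted === r.copeland),
    weightedBorda: rate((r) => r.weighted === r.borda),
  };
}

// Toy data for illustration only, not the 21 runs from this PR.
const toyRuns: RunWinners[] = [
  { weighted: "a", copeland: "a", borda: "a" },
  { weighted: "a", copeland: "b", borda: "b" },
  { weighted: "a", copeland: "a", borda: "b" },
  { weighted: "b", copeland: "b", borda: "b" },
];
console.log(agreementRates(toyRuns));
// allThree = 0.5, copelandBorda = 0.75, weightedCopeland = 0.75, weightedBorda = 0.5
```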

Change type

New feature

Related issue

Partial fix for #105

How to test

npm test      # 126 tests pass
thinktank evaluate  # shows scoring comparison

Breaking changes

This PR introduces breaking changes

🤖 Generated with Claude Code

thinktank evaluate re-scores all past runs with weighted, Copeland, and Borda methods, showing agreement rates and disagreements. Key finding from 21 runs: Copeland and Borda agree 86% of the time, while weighted disagrees with both ~40%. This suggests weighted scoring is the outlier, so Copeland should likely be the default.

Partial fix for #105

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

that-github-user merged commit 251bc50 into main Mar 28, 2026
4 checks passed

that-github-user deleted the issue-105-scoring-evaluation branch March 28, 2026 22:54

This was referenced Mar 28, 2026

Evaluate scoring methods: statistical comparison of weighted vs Copeland vs Borda #105

Closed

Switch default scoring method from weighted to copeland #109

Closed

that-github-user merged 1 commit into main from issue-105-scoring-evaluation
