Feat: elo credit assignment by positonic · Pull Request #2 · evalscience/deepgov-gg23-advisors

positonic · 2025-04-24T11:48:21Z

Exploration of "Tournament-Style Scoring (Ranked-Choice + Elo) which comes from chess tournaments"

Goal of this PR:
Establish if this could be useful for ranking projects, and differentiating which are the most effective.

Conclusion:
In this implementation, Elo scoring is used to rank grant applications by simulating pairwise comparisons between projects based on their application data, AI-generated reviews, research summaries, karmaGap data and Hypercerts data.

For each matchup, an LLM is prompted to decide which project deserves more funding, and the Elo ratings are updated accordingly. After all comparisons, the Elo scores are normalized so they sum to 1, which lets a fixed prize pool (e.g., $25k) be allocated in proportion to each project's relative standing. This approach establishes a funding ranking, but it reflects relative preference, not absolute impact, so high-impact projects may not receive proportionally more funding unless further adjustments are made.
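The flow described above (round-robin matchups, Elo update, normalization, proportional payout) can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/credit-assignment-elo.ts: the `Judge` callback stands in for the LLM call, and `BASE_RATING`, `K_FACTOR`, and all function names are assumptions chosen for the example.

```typescript
// Sketch only: constants and names are assumptions, not taken from the PR's script.
const BASE_RATING = 1000; // assumed starting rating
const K_FACTOR = 32; // assumed K-factor (a common chess default)

type Winner = "A" | "B";
// Hypothetical stand-in for the LLM judge: returns which of the two projects wins.
type Judge = (projectA: string, projectB: string) => Winner;

// Standard Elo expected score of a player rated `ra` against one rated `rb`.
function expectedScore(ra: number, rb: number): number {
  return 1 / (1 + Math.pow(10, (rb - ra) / 400));
}

// Round-robin: every project meets every other project exactly once.
// Returns ratings in the same order as `ids`.
function runEloTournament(ids: string[], judge: Judge): number[] {
  const ratings: number[] = ids.map(() => BASE_RATING);
  for (let i = 0; i < ids.length; i++) {
    for (let j = i + 1; j < ids.length; j++) {
      const scoreA = judge(ids[i], ids[j]) === "A" ? 1 : 0;
      const delta = K_FACTOR * (scoreA - expectedScore(ratings[i], ratings[j]));
      ratings[i] += delta; // winner gains what the loser gives up
      ratings[j] -= delta;
    }
  }
  return ratings;
}

// Normalize ratings to sum to 1, then split a fixed pool proportionally.
function allocate(ratings: number[], pool: number): number[] {
  const total = ratings.reduce((sum, r) => sum + r, 0);
  return ratings.map((r) => (r / total) * pool);
}
```

With these assumed constants, a project that wins four straight matches against opponents still at the base rating ends at about 1059.7. Because each update is zero-sum, the ratings always sum to `BASE_RATING` times the number of projects, so normalized shares stay close to an even split.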

To test this, run:

bun run scripts/credit-assignment-elo.ts

This approach is good for ranking projects, but in its current form it is not suitable for answering the question:

“Which projects should get more funding, not just higher ranking?”

| Elo is good at | but not good at |
| --- | --- |
| Establishing relative order ("A beats B") | Measuring how much better A is vs B in $ terms |
| Reflecting surprises in matchups | Giving absolute scores of value or quality |
| Adapting to local context (who beats whom) | Comparing against a fixed standard or objective |

This PR also includes the MAPRank approach.

Magnitude-Adaptive Pairwise Ranking (MAPRank) is designed to use a combination of magnitude and an adaptive K-factor for pairwise rankings.

See credit-assignment-map-rank.ts.

Summary of the Magnitude-Adaptive Pairwise Ranking (MAPRank) Approach:

This system evaluates applications through a series of pairwise comparisons within distinct "agent" contexts (representing different review models or perspectives).

For each matchup:
An AI model (the creditAssignmentAgent) determines not just which of the two applications is better, but also how much better it is on a scale from 0.5 (projects are roughly equal) to 1.0 (winner is significantly better).

This "how much better" score is translated into an actual_score for each project in the pair (e.g., if A wins with 0.8, A gets 0.8 and B gets 0.2).
Applications start with a BASE_RATING. Ratings are then updated directly based on the outcome and magnitude of this comparison. The update formula is:
new_rating = old_rating + effective_K * (actual_score_for_project - 0.5)

The effective_K factor is dynamic: it's calculated as BASE_K_FACTOR / sqrt(number_of_opponents_in_tournament). This makes the system more sensitive to individual match outcomes in smaller tournaments and more stable in larger ones, preventing ratings from changing too drastically or bottoming out.
After all pairwise comparisons for an agent are complete, the resulting raw scores are normalized to sum to 1, representing a proportional share (e.g., for funding allocation).
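The update rule and adaptive K-factor described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of credit-assignment-map-rank.ts: the numeric values of `BASE_RATING` and `BASE_K_FACTOR`, the `Verdict` shape, and the function names are all hypothetical.

```typescript
// Sketch only: constant values are assumptions; the names mirror the description above.
const BASE_RATING = 1000; // assumed value
const BASE_K_FACTOR = 32; // assumed value

// Hypothetical judge output: who won, and by how much
// (0.5 = roughly equal, 1.0 = winner is significantly better).
interface Verdict {
  winner: "A" | "B";
  magnitude: number;
}

// effective_K = BASE_K_FACTOR / sqrt(number_of_opponents_in_tournament)
function effectiveK(numOpponents: number): number {
  return BASE_K_FACTOR / Math.sqrt(numOpponents);
}

// Apply one matchup and return the updated [ratingA, ratingB].
// new_rating = old_rating + effective_K * (actual_score_for_project - 0.5)
function applyMatch(ra: number, rb: number, v: Verdict, k: number): [number, number] {
  const m = Math.min(1, Math.max(0.5, v.magnitude)); // keep magnitude in [0.5, 1.0]
  const actualA = v.winner === "A" ? m : 1 - m;
  const actualB = 1 - actualA;
  return [ra + k * (actualA - 0.5), rb + k * (actualB - 0.5)];
}

// Normalize raw ratings so they sum to 1 (proportional shares).
function normalize(ratings: number[]): number[] {
  const total = ratings.reduce((sum, r) => sum + r, 0);
  return ratings.map((r) => r / total);
}
```

As a worked example under these assumed constants: with four opponents, `effectiveK(4)` is 16, so if A wins with 0.8, `applyMatch(BASE_RATING, BASE_RATING, { winner: "A", magnitude: 0.8 }, 16)` moves A up by 16 × 0.3 = 4.8 and B down by the same amount.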

This method differs from traditional Elo because it doesn't rely on an "expected score" based on rating differences. Instead, it directly incorporates the magnitude of the perceived difference from the AI judge into the rating updates, providing a more nuanced assessment of relative quality.

To run this:

bun run scripts/credit-assignment-map-rank.ts

positonic · 2025-05-22T19:32:52Z

I've been testing this with the following in my .env file:

EVAL_DATASET="42161/867/0x62f25a11c2ae5a2af563cc5b1f772b3aebe1bd4a0a82e41a78e61e1db972ad7e,42161/867/0xd089724cd73c932413bce5c797aee7d2fbcd1ad282f24cff790977e77908fdca,42161/867/0x5a35dc4ee0fd8cf69eb9f227b626c0c093c3efc5a6f1b518a3792d5e8b721860,42161/867/0xe573019b9f23a496663f5944a83c8acdc99792bfc5f5ad603ee8f6cb0f46f9fe,42161/865/0x9119659eb8173b32bb4423f83702ee30c1e1db49ae0c07b00263bf3ea7f4d4ef"
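Each entry in the value above has three slash-separated parts, and the rating keys in the test output below use the same parts joined with hyphens. A rough sketch of that parsing follows; the field names (`chainId`, `roundId`, `applicationId`) and both helper functions are assumptions made for illustration and may not match how the script reads the variable.

```typescript
// Sketch only: field names are guesses at what the three parts mean.
interface ApplicationRef {
  chainId: string;
  roundId: string;
  applicationId: string;
}

// Split a comma-separated list of "a/b/c" entries into structured references.
function parseEvalDataset(raw: string): ApplicationRef[] {
  return raw
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((entry) => {
      const parts = entry.split("/");
      if (parts.length !== 3) {
        throw new Error("Malformed EVAL_DATASET entry: " + entry);
      }
      const [chainId, roundId, applicationId] = parts;
      return { chainId, roundId, applicationId };
    });
}

// Build the hyphen-joined form that appears as the key in the logged ratings.
function toRatingKey(ref: ApplicationRef): string {
  return [ref.chainId, ref.roundId, ref.applicationId].join("-");
}
```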

With the following test results:

Processing 5 applications...
Pre-loading application data (app, research, karmagap)...

🎯 Running Elo tournament for agent: gitcoin-communist
Match: GainForest vs Treegens DAO🌳 -> Winner: A
Match: GainForest vs ÆRTH - Planetary AI -> Winner: A
Match: GainForest vs Hydrapad -> Winner: A
Match: GainForest vs Deep Funding -> Winner: A
Match: Treegens DAO🌳 vs ÆRTH - Planetary AI -> Winner: B
Match: Treegens DAO🌳 vs Hydrapad -> Winner: B
Match: Treegens DAO🌳 vs Deep Funding -> Winner: B
Match: ÆRTH - Planetary AI vs Hydrapad -> Winner: A
Match: ÆRTH - Planetary AI vs Deep Funding -> Winner: B
Match: Hydrapad vs Deep Funding -> Winner: B
Raw final ratings: {
  "42161-867-0x62f25a11c2ae5a2af563cc5b1f772b3aebe1bd4a0a82e41a78e61e1db972ad7e": 1059.729526119354,
  "42161-867-0xd089724cd73c932413bce5c797aee7d2fbcd1ad282f24cff790977e77908fdca": 938.3618138141832,
  "42161-867-0x5a35dc4ee0fd8cf69eb9f227b626c0c093c3efc5a6f1b518a3792d5e8b721860": 999.9571041686318,
  "42161-867-0xe573019b9f23a496663f5944a83c8acdc99792bfc5f5ad603ee8f6cb0f46f9fe": 970.1385703893195,
  "42161-865-0x9119659eb8173b32bb4423f83702ee30c1e1db49ae0c07b00263bf3ea7f4d4ef": 1031.8129855085115
}
✅ Saved results for gitcoin-communist

🎯 Running Elo tournament for agent: open-source-capitalist
Match: GainForest vs Treegens DAO🌳 -> Winner: A
Match: GainForest vs ÆRTH - Planetary AI -> Winner: A
Match: GainForest vs Hydrapad -> Winner: A
Match: GainForest vs Deep Funding -> Winner: A
Match: Treegens DAO🌳 vs ÆRTH - Planetary AI -> Winner: B
Match: Treegens DAO🌳 vs Hydrapad -> Winner: B
Match: Treegens DAO🌳 vs Deep Funding -> Winner: B
Match: ÆRTH - Planetary AI vs Hydrapad -> Winner: A
Match: ÆRTH - Planetary AI vs Deep Funding -> Winner: A
Match: Hydrapad vs Deep Funding -> Winner: B
Raw final ratings: {
  "42161-867-0x62f25a11c2ae5a2af563cc5b1f772b3aebe1bd4a0a82e41a78e61e1db972ad7e": 1059.729526119354,
  "42161-867-0xd089724cd73c932413bce5c797aee7d2fbcd1ad282f24cff790977e77908fdca": 938.3618138141832,
  "42161-867-0x5a35dc4ee0fd8cf69eb9f227b626c0c093c3efc5a6f1b518a3792d5e8b721860": 1031.9571041686318,
  "42161-867-0xe573019b9f23a496663f5944a83c8acdc99792bfc5f5ad603ee8f6cb0f46f9fe": 968.6693365924307,
  "42161-865-0x9119659eb8173b32bb4423f83702ee30c1e1db49ae0c07b00263bf3ea7f4d4ef": 1001.2822193054003
}
✅ Saved results for open-source-capitalist

🎯 Running Elo tournament for agent: regenerator
Match: GainForest vs Treegens DAO🌳 -> Winner: A
Match: GainForest vs ÆRTH - Planetary AI -> Winner: A
Match: GainForest vs Hydrapad -> Winner: A
Match: GainForest vs Deep Funding -> Winner: A
Match: Treegens DAO🌳 vs ÆRTH - Planetary AI -> Winner: B
Match: Treegens DAO🌳 vs Hydrapad -> Winner: B
Match: Treegens DAO🌳 vs Deep Funding -> Winner: B
Match: ÆRTH - Planetary AI vs Hydrapad -> Winner: A
Match: ÆRTH - Planetary AI vs Deep Funding -> Winner: B
Match: Hydrapad vs Deep Funding -> Winner: B
Raw final ratings: {
  "42161-867-0x62f25a11c2ae5a2af563cc5b1f772b3aebe1bd4a0a82e41a78e61e1db972ad7e": 1059.729526119354,
  "42161-867-0xd089724cd73c932413bce5c797aee7d2fbcd1ad282f24cff790977e77908fdca": 938.3618138141832,
  "42161-867-0x5a35dc4ee0fd8cf69eb9f227b626c0c093c3efc5a6f1b518a3792d5e8b721860": 999.9571041686318,
  "42161-867-0xe573019b9f23a496663f5944a83c8acdc99792bfc5f5ad603ee8f6cb0f46f9fe": 970.1385703893195,
  "42161-865-0x9119659eb8173b32bb4423f83702ee30c1e1db49ae0c07b00263bf3ea7f4d4ef": 1031.8129855085115
}
✅ Saved results for regenerator
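To see what these raw ratings mean for funding, the normalization step from the PR description can be applied to the gitcoin-communist ratings above. The snippet is an illustrative calculation only; the $25k pool is the example figure from the description, and the variable names are made up for this sketch.

```typescript
// Ratings copied from the gitcoin-communist run above, in log order.
const ratings = [
  1059.729526119354,
  938.3618138141832,
  999.9571041686318,
  970.1385703893195,
  1031.8129855085115,
];
const POOL = 25000; // example pool size from the PR description

const total = ratings.reduce((sum, r) => sum + r, 0);
const shares = ratings.map((r) => r / total); // normalized, sums to 1
const payouts = shares.map((s) => s * POOL);

console.log(shares.map((s) => (s * 100).toFixed(2) + "%"));
console.log(payouts.map((p) => "$" + p.toFixed(2)));
```

The highest-rated application ends up with about 21.2% of the pool (roughly $5,299) and the lowest with about 18.8% (roughly $4,692), even though one won all four of its matches and the other lost all four. This illustrates the point made in the description: normalized Elo gives a clear order but only a narrow spread in dollar terms.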

daviddao · 2025-05-22T19:36:51Z

This looks so cool, I'm excited to test it out!

positonic · 2025-05-22T22:06:37Z

This PR also includes the MAPRank approach.

Magnitude-Adaptive Pairwise Ranking (MAPRank) is designed to use a combination of magnitude and an adaptive K-factor for pairwise rankings.

See credit-assignment-map-rank.ts.

Summary of the Magnitude-Adaptive Pairwise Ranking (MAPRank) Approach:

This system evaluates applications through a series of pairwise comparisons within distinct "agent" contexts (representing different review models or perspectives).

For each matchup:
An AI model (the creditAssignmentAgent) determines not just which of the two applications is better, but also how much better it is on a scale from 0.5 (projects are roughly equal) to 1.0 (winner is significantly better).

This "how much better" score is translated into an actual_score for each project in the pair (e.g., if A wins with 0.8, A gets 0.8 and B gets 0.2).
Applications start with a BASE_RATING. Ratings are then updated directly based on the outcome and magnitude of this comparison. The update formula is:
new_rating = old_rating + effective_K * (actual_score_for_project - 0.5)

The effective_K factor is dynamic: it's calculated as BASE_K_FACTOR / sqrt(number_of_opponents_in_tournament). This makes the system more sensitive to individual match outcomes in smaller tournaments and more stable in larger ones, preventing ratings from changing too drastically or bottoming out.
After all pairwise comparisons for an agent are complete, the resulting raw scores are normalized to sum to 1, representing a proportional share (e.g., for funding allocation).

This method differs from traditional Elo because it doesn't rely on an "expected score" based on rating differences. Instead, it directly incorporates the magnitude of the perceived difference from the AI judge into the rating updates, providing a more nuanced assessment of relative quality.

To run this:

bun run scripts/credit-assignment-map-rank.ts

positonic changed the title ~~Feat elo credit assignment [DRAFT - don't merge]~~ Feat: elo credit assignment [DRAFT - don't merge] Apr 25, 2025

positonic added 8 commits May 22, 2025 10:36

feat: update gitignore for IDE

bce2daf

feat: implement Elo-based credit assignment for grant applications

f9bbee4

refactor: switch credit assignment model to OpenAI and remove unused pairwise script

1655239

refactor: standardize application ID usage in credit assignment script

ec51cc7

fix: update scores in Elo credit assignment CSV files for consistency

cb51ec5

refactor: enhance Elo scoring logic and update CSV scores for improved accuracy

c49d0fc

refactor: enhance project comparison logic by including ethics and scoring details in the review process

27edafa

refactor: integrate hypercert data into application processing and enhance scoring logic for improved evaluation

27a6e48

positonic force-pushed the feat-elo-credit-assignment branch from 2bf00df to 27a6e48 Compare May 22, 2025 19:26

refactor: remove temporary debug logging from credit assignment script for cleaner code

e16a217

positonic changed the title ~~Feat: elo credit assignment [DRAFT - don't merge]~~ Feat: elo credit assignment May 22, 2025

positonic wants to merge 9 commits into evalscience:main from positonic:feat-elo-credit-assignment
