Feat: elo credit assignment #2

Open
positonic wants to merge 9 commits into evalscience:main from positonic:feat-elo-credit-assignment

@positonic commented Apr 24, 2025

Exploration of "Tournament-Style Scoring (Ranked-Choice + Elo)", an approach that comes from chess tournaments.

Goal of this PR:
Establish whether this could be useful for ranking projects and for differentiating which are the most effective.

Conclusion:
In this implementation, Elo scoring is used to rank grant applications by simulating pairwise comparisons between projects based on their application data, AI-generated reviews, research summaries, karmaGap data and Hypercerts data.

For each matchup, an LLM is prompted to decide which project deserves more funding, and Elo ratings are updated accordingly. After all comparisons, the Elo scores are normalized so they sum to 1, so that a fixed prize pool (e.g., $25k) can be allocated in proportion to each project’s relative standing. This approach effectively establishes a funding ranking, but it reflects relative preference, not absolute impact, so high-impact projects may not receive proportionally more funding unless further adjustments are made.
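
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/credit-assignment-elo.ts: every name is hypothetical, the `judge` callback is a synchronous stand-in for the LLM call, and K = 32 with a base rating of 1000 are assumed values (they appear consistent with the raw ratings logged later in this conversation, but the script's actual constants are not shown here).

```typescript
// Sketch only: all names are hypothetical. K = 32 and a base rating of 1000
// are assumptions, not values confirmed by the PR's script.
type Ratings = Record<string, number>;
type Judge = (a: string, b: string) => "A" | "B"; // stand-in for the LLM call

const K = 32;
const BASE_RATING = 1000;

// Standard Elo expected score for a project rated `ra` facing one rated `rb`.
function expectedScore(ra: number, rb: number): number {
  return 1 / (1 + Math.pow(10, (rb - ra) / 400));
}

// Round-robin tournament: each pair of projects is compared exactly once.
function runEloTournament(ids: string[], judge: Judge): Ratings {
  const ratings: Ratings = {};
  for (const id of ids) ratings[id] = BASE_RATING;

  for (let i = 0; i < ids.length; i++) {
    for (let j = i + 1; j < ids.length; j++) {
      const a = ids[i];
      const b = ids[j];
      const expectedA = expectedScore(ratings[a], ratings[b]);
      const actualA = judge(a, b) === "A" ? 1 : 0;
      const delta = K * (actualA - expectedA);
      ratings[a] += delta; // zero-sum update: b moves by the opposite amount
      ratings[b] -= delta;
    }
  }
  return ratings;
}

// Scale ratings so they sum to 1, then split a fixed pool proportionally.
function allocate(ratings: Ratings, pool: number): Ratings {
  const ids = Object.keys(ratings);
  const total = ids.reduce((sum, id) => sum + ratings[id], 0);
  const out: Ratings = {};
  for (const id of ids) out[id] = (ratings[id] / total) * pool;
  return out;
}
```

For example, `allocate(runEloTournament(ids, judge), 25000)` would return a dollar amount per project. Because every project starts at the same base and each update is zero-sum, the shares stay close to an even split, which is one way to see why ranking alone does not translate into strongly differentiated funding.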

To test the Elo script, run:

bun run scripts/credit-assignment-elo.ts

It is good for ranking projects, but it is not suitable in its current form to answer the question:

“Which projects should get more funding, not just higher ranking?”

| Elo is good at | Elo is not good at |
| --- | --- |
| Establishing relative order ("A beats B") | Measuring how much better A is vs B in $ terms |
| Reflecting surprises in matchups | Giving absolute scores of value or quality |
| Adapting to local context (who beats whom) | Comparing against a fixed standard or objective |

This PR also includes the MAPRank approach.

Magnitude-Adaptive Pairwise Ranking (MAPRank) is designed to use a combination of magnitude and an adaptive K-factor for pairwise rankings.

See credit-assignment-map-rank.ts

Summary of the Magnitude-Adaptive Pairwise Ranking (MAPRank) Approach:

This system evaluates applications through a series of pairwise comparisons within distinct "agent" contexts (representing different review models or perspectives).

For each matchup:
An AI model (the creditAssignmentAgent) determines not just which of the two applications is better, but also how much better it is on a scale from 0.5 (projects are roughly equal) to 1.0 (winner is significantly better).

This "how much better" score is translated into an actual_score for each project in the pair (e.g., if A wins with 0.8, A gets 0.8 and B gets 0.2).
Applications start with a BASE_RATING. Ratings are then updated directly based on the outcome and magnitude of this comparison. The update formula is:
new_rating = old_rating + effective_K * (actual_score_for_project - 0.5)

The effective_K factor is dynamic: it's calculated as BASE_K_FACTOR / sqrt(number_of_opponents_in_tournament). This makes the system more sensitive to individual match outcomes in smaller tournaments and more stable in larger ones, preventing ratings from changing too drastically or bottoming out.
After all pairwise comparisons for an agent are complete, the resulting raw scores are normalized to sum to 1, representing a proportional share (e.g., for funding allocation).
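
The MAPRank update described above can be sketched as below. This is a rough illustration, not the contents of credit-assignment-map-rank.ts: the function and type names, the numeric values chosen for BASE_RATING and BASE_K_FACTOR, the clamping step, and the reading of "number of opponents" as "the other projects in the tournament" are all assumptions. The `judge` callback stands in for creditAssignmentAgent.

```typescript
// Sketch only: names and constants are hypothetical, not taken from the PR's script.
const BASE_RATING = 1000; // assumed starting rating
const BASE_K_FACTOR = 32; // assumed base K

// Stand-in for creditAssignmentAgent: picks a winner and a magnitude in [0.5, 1.0].
type Verdict = { winner: "A" | "B"; magnitude: number };
type MagnitudeJudge = (a: string, b: string) => Verdict;

function runMapRank(ids: string[], judge: MagnitudeJudge): Record<string, number> {
  // Assumption: "number of opponents" means the other projects in the tournament.
  const opponents = Math.max(1, ids.length - 1);
  const effectiveK = BASE_K_FACTOR / Math.sqrt(opponents);

  const ratings: Record<string, number> = {};
  for (const id of ids) ratings[id] = BASE_RATING;

  for (let i = 0; i < ids.length; i++) {
    for (let j = i + 1; j < ids.length; j++) {
      const a = ids[i];
      const b = ids[j];
      const { winner, magnitude } = judge(a, b);
      // Clamp so a malformed verdict cannot push scores outside [0, 1].
      const m = Math.min(1, Math.max(0.5, magnitude));
      const scoreA = winner === "A" ? m : 1 - m; // B's score is 1 - scoreA
      ratings[a] += effectiveK * (scoreA - 0.5);
      ratings[b] += effectiveK * (1 - scoreA - 0.5);
    }
  }
  return ratings;
}

// Normalize raw ratings so they sum to 1 (a proportional share per project).
function toShares(ratings: Record<string, number>): Record<string, number> {
  const ids = Object.keys(ratings);
  const total = ids.reduce((sum, id) => sum + ratings[id], 0);
  const shares: Record<string, number> = {};
  for (const id of ids) shares[id] = ratings[id] / total;
  return shares;
}
```

One consequence of this formula is that an update never depends on the current ratings, so under these assumptions a project's final rating is simply BASE_RATING plus effective_K times the sum of its (actual_score - 0.5) terms, regardless of match order.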

This method differs from traditional Elo because it doesn't rely on an "expected score" based on rating differences. Instead, it directly incorporates the magnitude of the perceived difference from the AI judge into the rating updates, providing a more nuanced assessment of relative quality.

To run the MAPRank script:

bun run scripts/credit-assignment-map-rank.ts 

@positonic changed the title from "Feat elo credit assignment [DRAFT - don't merge]" to "Feat: elo credit assignment [DRAFT - don't merge]" on Apr 25, 2025
@positonic force-pushed the feat-elo-credit-assignment branch from 2bf00df to 27a6e48 on May 22, 2025 19:26
@positonic changed the title from "Feat: elo credit assignment [DRAFT - don't merge]" to "Feat: elo credit assignment" on May 22, 2025
@positonic
Author

I've been testing this with the following in my .env file:

EVAL_DATASET="42161/867/0x62f25a11c2ae5a2af563cc5b1f772b3aebe1bd4a0a82e41a78e61e1db972ad7e,42161/867/0xd089724cd73c932413bce5c797aee7d2fbcd1ad282f24cff790977e77908fdca,42161/867/0x5a35dc4ee0fd8cf69eb9f227b626c0c093c3efc5a6f1b518a3792d5e8b721860,42161/867/0xe573019b9f23a496663f5944a83c8acdc99792bfc5f5ad603ee8f6cb0f46f9fe,42161/865/0x9119659eb8173b32bb4423f83702ee30c1e1db49ae0c07b00263bf3ea7f4d4ef"
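
For readers unfamiliar with the format, here is a small sketch of how a value like this could be split into entries. It is a guess at the structure, not the project's actual loader: the field names `chainId`, `roundId` and `applicationId` are assumptions about what the three path segments mean, and `parseEvalDataset` is a hypothetical helper. The dash-joined `key` mirrors the keys that appear in the raw ratings output in this conversation.

```typescript
// Sketch only: field names are guesses at what each path segment means.
type DatasetEntry = {
  chainId: number;
  roundId: string;
  applicationId: string;
  key: string; // dash-joined form, as seen in the logged ratings
};

function parseEvalDataset(raw: string): DatasetEntry[] {
  return raw
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .map((entry) => {
      const parts = entry.split("/");
      if (parts.length !== 3) throw new Error("Malformed entry: " + entry);
      return {
        chainId: Number(parts[0]),
        roundId: parts[1],
        applicationId: parts[2],
        key: parts.join("-"),
      };
    });
}
```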

Running the Elo script with this dataset gives the following output:

Processing 5 applications...
Pre-loading application data (app, research, karmagap)...

🎯 Running Elo tournament for agent: gitcoin-communist
Match: GainForest vs Treegens DAO🌳 -> Winner: A
Match: GainForest vs ÆRTH - Planetary AI -> Winner: A
Match: GainForest vs Hydrapad -> Winner: A
Match: GainForest vs Deep Funding -> Winner: A
Match: Treegens DAO🌳 vs ÆRTH - Planetary AI -> Winner: B
Match: Treegens DAO🌳 vs Hydrapad -> Winner: B
Match: Treegens DAO🌳 vs Deep Funding -> Winner: B
Match: ÆRTH - Planetary AI vs Hydrapad -> Winner: A
Match: ÆRTH - Planetary AI vs Deep Funding -> Winner: B
Match: Hydrapad vs Deep Funding -> Winner: B
Raw final ratings: {
  "42161-867-0x62f25a11c2ae5a2af563cc5b1f772b3aebe1bd4a0a82e41a78e61e1db972ad7e": 1059.729526119354,
  "42161-867-0xd089724cd73c932413bce5c797aee7d2fbcd1ad282f24cff790977e77908fdca": 938.3618138141832,
  "42161-867-0x5a35dc4ee0fd8cf69eb9f227b626c0c093c3efc5a6f1b518a3792d5e8b721860": 999.9571041686318,
  "42161-867-0xe573019b9f23a496663f5944a83c8acdc99792bfc5f5ad603ee8f6cb0f46f9fe": 970.1385703893195,
  "42161-865-0x9119659eb8173b32bb4423f83702ee30c1e1db49ae0c07b00263bf3ea7f4d4ef": 1031.8129855085115
}
✅ Saved results for gitcoin-communist

🎯 Running Elo tournament for agent: open-source-capitalist
Match: GainForest vs Treegens DAO🌳 -> Winner: A
Match: GainForest vs ÆRTH - Planetary AI -> Winner: A
Match: GainForest vs Hydrapad -> Winner: A
Match: GainForest vs Deep Funding -> Winner: A
Match: Treegens DAO🌳 vs ÆRTH - Planetary AI -> Winner: B
Match: Treegens DAO🌳 vs Hydrapad -> Winner: B
Match: Treegens DAO🌳 vs Deep Funding -> Winner: B
Match: ÆRTH - Planetary AI vs Hydrapad -> Winner: A
Match: ÆRTH - Planetary AI vs Deep Funding -> Winner: A
Match: Hydrapad vs Deep Funding -> Winner: B
Raw final ratings: {
  "42161-867-0x62f25a11c2ae5a2af563cc5b1f772b3aebe1bd4a0a82e41a78e61e1db972ad7e": 1059.729526119354,
  "42161-867-0xd089724cd73c932413bce5c797aee7d2fbcd1ad282f24cff790977e77908fdca": 938.3618138141832,
  "42161-867-0x5a35dc4ee0fd8cf69eb9f227b626c0c093c3efc5a6f1b518a3792d5e8b721860": 1031.9571041686318,
  "42161-867-0xe573019b9f23a496663f5944a83c8acdc99792bfc5f5ad603ee8f6cb0f46f9fe": 968.6693365924307,
  "42161-865-0x9119659eb8173b32bb4423f83702ee30c1e1db49ae0c07b00263bf3ea7f4d4ef": 1001.2822193054003
}
✅ Saved results for open-source-capitalist

🎯 Running Elo tournament for agent: regenerator
Match: GainForest vs Treegens DAO🌳 -> Winner: A
Match: GainForest vs ÆRTH - Planetary AI -> Winner: A
Match: GainForest vs Hydrapad -> Winner: A
Match: GainForest vs Deep Funding -> Winner: A
Match: Treegens DAO🌳 vs ÆRTH - Planetary AI -> Winner: B
Match: Treegens DAO🌳 vs Hydrapad -> Winner: B
Match: Treegens DAO🌳 vs Deep Funding -> Winner: B
Match: ÆRTH - Planetary AI vs Hydrapad -> Winner: A
Match: ÆRTH - Planetary AI vs Deep Funding -> Winner: B
Match: Hydrapad vs Deep Funding -> Winner: B
Raw final ratings: {
  "42161-867-0x62f25a11c2ae5a2af563cc5b1f772b3aebe1bd4a0a82e41a78e61e1db972ad7e": 1059.729526119354,
  "42161-867-0xd089724cd73c932413bce5c797aee7d2fbcd1ad282f24cff790977e77908fdca": 938.3618138141832,
  "42161-867-0x5a35dc4ee0fd8cf69eb9f227b626c0c093c3efc5a6f1b518a3792d5e8b721860": 999.9571041686318,
  "42161-867-0xe573019b9f23a496663f5944a83c8acdc99792bfc5f5ad603ee8f6cb0f46f9fe": 970.1385703893195,
  "42161-865-0x9119659eb8173b32bb4423f83702ee30c1e1db49ae0c07b00263bf3ea7f4d4ef": 1031.8129855085115
}
✅ Saved results for regenerator
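
To make the funding implication concrete, the sketch below normalizes the gitcoin-communist ratings from the output above (identical to the regenerator ratings) and splits the $25k example pool from the PR description. The ratings are rounded to four decimals, and labelling them by project name assumes the ratings are listed in the same order as the projects appear in the matches.

```typescript
// Raw ratings copied from the gitcoin-communist output above, rounded to 4 decimals.
// Assumption: ratings are listed in the same order as the projects appear in the matches.
const rawRatings: Record<string, number> = {
  "GainForest": 1059.7295,
  "Treegens DAO": 938.3618,
  "ÆRTH - Planetary AI": 999.9571,
  "Hydrapad": 970.1386,
  "Deep Funding": 1031.813,
};
const POOL = 25000; // example pool size from the PR description

const names = Object.keys(rawRatings);
const totalRating = names.reduce((sum, name) => sum + rawRatings[name], 0);

// Proportional share of the pool for each project.
const allocation: Record<string, number> = {};
for (const name of names) {
  allocation[name] = (rawRatings[name] / totalRating) * POOL;
}

for (const name of names) {
  console.log(name + ": $" + allocation[name].toFixed(2));
}
```

With these numbers, GainForest (which won all four of its matches) would receive roughly $5,299 and Treegens DAO (which lost all four) roughly $4,692, so the top-ranked project gets only about 13% more than the bottom-ranked one. This illustrates the limitation noted in the PR description: normalized Elo ratings give an ordering but only weakly differentiate funding amounts.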

@daviddao
Member

This looks so cool, I'm excited to test it out!
