Currently, our benchmark ranking system uses simple sorting: a higher score means a better rank. This produces misleadingly precise rankings when score differences are negligible (e.g., 99.92 vs. 99.93 receive different ranks, even though the gap is statistically insignificant or irrelevant in practice).
🧩 Problem
The current behavior:
- Ranks are assigned based on the strict ordering of scores.
- Very close scores (e.g., due to noise, floating-point jitter, or micro-optimizations) are given separate ranks.
- This creates noise in the leaderboard and overemphasizes tiny performance differences.
✅ Desired behavior
Group almost-equal scores into the same rank, to reflect meaningful differences only.
Use either:
- An epsilon tolerance: if two scores differ by less than `epsilon`, they are considered equal.
- Upper-bound buckets: scores are grouped under shared bucket labels before ranking.
🛠️ Proposed solution
**Option A: Epsilon-tolerant rank**
```typescript
type BenchmarkResult = { name: string; score: number };

// Competition-style ranking with an epsilon tolerance: adjacent scores that
// differ by no more than `epsilon` share a rank (e.g., 1, 1, 3).
function tolerantRank(data: Array<BenchmarkResult>, epsilon = 0.01) {
  const sorted = [...data].sort((a, b) => b.score - a.score);
  const ranked: Array<{ rank: number; name: string; score: number }> = [];
  let currentRank = 1;
  for (let i = 0; i < sorted.length; i++) {
    // Start a new rank group only when the gap to the previous score exceeds epsilon.
    if (
      i > 0 &&
      Math.abs(sorted[i].score - sorted[i - 1].score) > epsilon
    ) {
      // Skip over tied entries, as in standard competition ranking.
      currentRank = ranked.length + 1;
    }
    ranked.push({
      rank: currentRank,
      name: sorted[i].name,
      score: sorted[i].score,
    });
  }
  return ranked;
}
```
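A quick usage sketch with made-up scores (the model names and values below are illustrative only):

```typescript
const results: Array<BenchmarkResult> = [
  { name: "model-a", score: 99.93 },
  { name: "model-b", score: 99.92 }, // within epsilon of model-a → shares rank 1
  { name: "model-c", score: 98.5 },
];

console.log(tolerantRank(results));
// [
//   { rank: 1, name: "model-a", score: 99.93 },
//   { rank: 1, name: "model-b", score: 99.92 },
//   { rank: 3, name: "model-c", score: 98.5 },
// ]
```

One caveat worth flagging: because the comparison is between adjacent scores, a chain of entries each within `epsilon` of its neighbor can end up sharing a rank even when the endpoints differ by more than `epsilon`.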
**Option B: Upper-bound bucket rank**
```typescript
// Ascending upper bounds; a score maps to the first bucket it does not exceed.
const buckets = [95, 98, 99.5, 99.9, 100];

function upperBoundRank(score: number) {
  for (const bucket of buckets) {
    if (score <= bucket) return bucket;
  }
  // Fallback for scores above the last bound (e.g., rounding past 100).
  return 100;
}
```
Apply this during score preprocessing, and then rank the buckets instead of raw scores.
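A minimal sketch of what that preprocessing could look like, reusing `upperBoundRank` and the `BenchmarkResult` type from Option A (`bucketRank` is a hypothetical name, not an existing API):

```typescript
type RankedRow = { rank: number; name: string; score: number; bucket: number };

// Bucket each score first, then assign competition-style ranks by bucket.
function bucketRank(data: Array<BenchmarkResult>): Array<RankedRow> {
  const sorted = [...data].sort((a, b) => b.score - a.score);
  const rows: Array<RankedRow> = [];
  let currentRank = 1;
  for (let i = 0; i < sorted.length; i++) {
    const bucket = upperBoundRank(sorted[i].score);
    // A new rank starts only when the bucket changes, not on every raw-score gap.
    if (i > 0 && bucket !== upperBoundRank(sorted[i - 1].score)) {
      currentRank = rows.length + 1;
    }
    rows.push({ rank: currentRank, name: sorted[i].name, score: sorted[i].score, bucket });
  }
  return rows;
}
```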
💬 Open questions
- Should the grouping tolerance (`epsilon`) be static, or adaptive based on the score range? (See the sketch after this list.)
- Do we prefer visually clear bucket labels (e.g. `≤ 99.5%`) or fuzzy equality?
- Should we show confidence intervals or standard deviations in the future?
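On the first question, one possible shape for an adaptive tolerance, assuming we scale `epsilon` with the observed score spread (`adaptiveEpsilon` and the `0.001` fraction are illustrative, not decided):

```typescript
// Hypothetical: derive epsilon as a small fraction of the observed score range,
// so the tolerance tightens on dense leaderboards and loosens on spread-out ones.
// Assumes a non-empty score list.
function adaptiveEpsilon(scores: number[], fraction = 0.001): number {
  const range = Math.max(...scores) - Math.min(...scores);
  return range * fraction;
}

// Usage: tolerantRank(data, adaptiveEpsilon(data.map((r) => r.score)));
```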
🏁 Goal
- Make benchmark ranking more trustworthy, human-readable, and resilient to noise.
We could also build anomaly detection on top of this. The inspiration is this leaderboard: https://web.lmarena.ai/leaderboard