Currently, our benchmark ranking system uses simple sorting: a higher score means a better rank. This produces misleadingly precise rankings when score differences are negligible (e.g., 99.92 vs. 99.93 receive different ranks, even though the gap is statistically insignificant or irrelevant in practice).
🧩 Problem
The current behavior:
- Ranks are assigned based on the strict ordering of scores.
- Very close scores (e.g., due to noise, floating-point jitter, or micro-optimizations) are given separate ranks.
- This creates noise in the leaderboard and overemphasizes tiny performance differences.
✅ Desired behavior
Group almost-equal scores into the same rank, to reflect meaningful differences only.
Use either:
- An epsilon tolerance: if two scores differ by less than `epsilon`, they are considered equal.
- Upper-bound buckets: scores are grouped under shared bucket labels before ranking.
🛠️ Proposed solution
**Option A: Epsilon-tolerant rank**
```typescript
type BenchmarkResult = { name: string; score: number };

// Competition-style ranking with an epsilon tolerance: adjacent scores that
// differ by no more than `epsilon` share a rank (e.g., 1, 1, 3).
function tolerantRank(data: Array<BenchmarkResult>, epsilon = 0.01) {
  const sorted = [...data].sort((a, b) => b.score - a.score);
  const ranked: Array<{ rank: number; name: string; score: number }> = [];
  let currentRank = 1;
  for (let i = 0; i < sorted.length; i++) {
    // Start a new rank group only when the gap to the previous score exceeds epsilon.
    if (
      i > 0 &&
      Math.abs(sorted[i].score - sorted[i - 1].score) > epsilon
    ) {
      // Skip over tied entries, as in standard competition ranking.
      currentRank = ranked.length + 1;
    }
    ranked.push({
      rank: currentRank,
      name: sorted[i].name,
      score: sorted[i].score,
    });
  }
  return ranked;
}
```
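A quick usage sketch with made-up scores (the model names and values below are illustrative only):

```typescript
const results: Array<BenchmarkResult> = [
  { name: "model-a", score: 99.93 },
  { name: "model-b", score: 99.92 }, // within epsilon of model-a → shares rank 1
  { name: "model-c", score: 98.5 },
];

console.log(tolerantRank(results));
// [
//   { rank: 1, name: "model-a", score: 99.93 },
//   { rank: 1, name: "model-b", score: 99.92 },
//   { rank: 3, name: "model-c", score: 98.5 },
// ]
```

One caveat worth flagging: because the comparison is between adjacent scores, a chain of entries each within `epsilon` of its neighbor can end up sharing a rank even when the endpoints differ by more than `epsilon`.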
**Option B: Upper-bound bucket rank**
```typescript
// Ascending upper bounds; a score maps to the first bucket it does not exceed.
const buckets = [95, 98, 99.5, 99.9, 100];

function upperBoundRank(score: number) {
  for (const bucket of buckets) {
    if (score <= bucket) return bucket;
  }
  // Fallback for scores above the last bound (e.g., rounding past 100).
  return 100;
}
```
Apply this during score preprocessing, and then rank the buckets instead of raw scores.
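A minimal sketch of what that preprocessing could look like, reusing `upperBoundRank` and the `BenchmarkResult` type from Option A (`bucketRank` is a hypothetical name, not an existing API):

```typescript
type RankedRow = { rank: number; name: string; score: number; bucket: number };

// Bucket each score first, then assign competition-style ranks by bucket.
function bucketRank(data: Array<BenchmarkResult>): Array<RankedRow> {
  const sorted = [...data].sort((a, b) => b.score - a.score);
  const rows: Array<RankedRow> = [];
  let currentRank = 1;
  for (let i = 0; i < sorted.length; i++) {
    const bucket = upperBoundRank(sorted[i].score);
    // A new rank starts only when the bucket changes, not on every raw-score gap.
    if (i > 0 && bucket !== upperBoundRank(sorted[i - 1].score)) {
      currentRank = rows.length + 1;
    }
    rows.push({ rank: currentRank, name: sorted[i].name, score: sorted[i].score, bucket });
  }
  return rows;
}
```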
💬 Open questions
- Should the grouping tolerance (`epsilon`) be static, or adaptive based on the score range? (See the sketch after this list.)
- Do we prefer visually clear bucket labels (e.g. `≤ 99.5%`) or fuzzy equality?
- Should we show confidence intervals or standard deviations in the future?
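On the first question, one possible shape for an adaptive tolerance, assuming we scale `epsilon` with the observed score spread (`adaptiveEpsilon` and the `0.001` fraction are illustrative, not decided):

```typescript
// Hypothetical: derive epsilon as a small fraction of the observed score range,
// so the tolerance tightens on dense leaderboards and loosens on spread-out ones.
// Assumes a non-empty score list.
function adaptiveEpsilon(scores: number[], fraction = 0.001): number {
  const range = Math.max(...scores) - Math.min(...scores);
  return range * fraction;
}

// Usage: tolerantRank(data, adaptiveEpsilon(data.map((r) => r.score)));
```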
🏁 Goal
- Make benchmark ranking more trustworthy, human-readable, and resilient to noise.
We could also build anomaly detection on top of this. The inspiration is this leaderboard: https://web.lmarena.ai/leaderboard