No GPU? TIDE works in pure PyTorch (CPU fallback, no CUDA kernels needed).

All benchmarks on **NVIDIA A100-SXM4-40GB**, bf16, 2000 WikiText calibration samples.
16 prompts (8 reasoning/math + 8 general knowledge).

### Prefill: 100% Exit Rate

Every token finds an early exit point. On reasoning + general prompts:

```
Model                   Layers  Exit Rate  Early Exits (before last checkpoint)
======================  ======  =========  =====================================
DeepSeek R1 Distill 8B  32      100%       5% exit at Layer 11 (1/3 depth)
Qwen3 8B                36      100%       10% exit across L11 + L23 (1/3-2/3)
```

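The exit decision behind these rates is a per-token convergence test at each checkpoint layer. A minimal sketch of one plausible test, assuming convergence is measured as cosine similarity between hidden states at consecutive checkpoints (the function names and the criterion are illustrative, not TIDE's exact implementation):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two hidden-state vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def should_exit(h_prev, h_curr, threshold):
    """Exit early once the hidden state has stopped changing between
    checkpoint layers, i.e. similarity clears the threshold."""
    return cosine_similarity(h_prev, h_curr) >= threshold

# A token whose state barely moved between checkpoints exits at 0.85;
# one whose state is still rotating does not.
print(should_exit([1.0, 2.0, 3.0], [1.01, 2.0, 3.0], 0.85))  # True
print(should_exit([1.0, 2.0, 3.0], [3.0, -2.0, 1.0], 0.85))  # False
```

In a real PyTorch implementation this would be a batched tensor op over `[batch, seq, hidden]` states rather than per-vector Python.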
### Latency: Up to 7% Faster Prefill

Single reasoning prompt, 20 runs averaged on A100:

```
Model                   Baseline  TIDE     Speedup
======================  ========  =======  =======
DeepSeek R1 Distill 8B  39.08ms   36.26ms  -7.2%
Qwen3 8B (36 layers)    46.82ms   44.14ms  -5.7%
```

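A "20 runs averaged" latency number is simple to reproduce. A sketch of the harness, where `run_prefill` is a stand-in for the model's prefill pass (on GPU you would also synchronize before each timestamp):

```python
import time

def mean_latency_ms(run_prefill, warmup=3, runs=20):
    """Average wall-clock latency in ms over `runs` timed calls,
    after `warmup` untimed calls to stabilize caches and clocks."""
    for _ in range(warmup):
        run_prefill()
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        run_prefill()
        total += time.perf_counter() - start
    return 1000.0 * total / runs

# Stand-in workload instead of a real model forward pass
latency = mean_latency_ms(lambda: sum(i * i for i in range(50_000)))
print(f"{latency:.2f}ms")
```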
### Throughput: Up to 8% More Tokens/sec

```
Model                   Batch  Baseline     TIDE         Gain
======================  =====  ===========  ===========  =====
DeepSeek R1 Distill 8B  1      973 tok/s    1,037 tok/s  +6.5%
Qwen3 8B                1      258 tok/s    271 tok/s    +5.0%
Qwen3 8B                8      1,781 tok/s  1,926 tok/s  +8.1%
```

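The Gain column is plain relative change in tokens per second; for example, the batch-8 row:

```python
def throughput_gain(baseline_tok_s, tide_tok_s):
    """Percent change in throughput relative to the baseline."""
    return 100.0 * (tide_tok_s - baseline_tok_s) / baseline_tok_s

# Qwen3 8B at batch 8, figures from the table above
print(round(throughput_gain(1781, 1926), 1))  # 8.1
```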
### Decode: 99% of Reasoning Tokens Exit Early

DeepSeek R1 Distill 8B solving a math problem, 256 tokens, `temperature=0`:

```
Threshold  Decode Exit Rate  Unique Tokens  Quality
=========  ================  =============  =========================
1.0 (off)  0%                99             Correct solution
0.85       98%               95             Correct solution
0.70       99%               95             Correct solution (stable)
0.50       99.6%             95             Correct solution (stable)
```

**99% of decode tokens exit early** while the model still solves the math
problem correctly. Output remains coherent with 95+ unique tokens.

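The exit-rate and unique-token columns are bookkeeping over the decoded sequence. A sketch, assuming the decoder records a per-token exit layer (`None` meaning the token ran the full model; this instrumentation format is an assumption, not TIDE's API):

```python
def decode_stats(exit_layers, token_ids):
    """Summarize one greedy decode: % of tokens that exited early
    and the number of distinct token ids produced."""
    exited = sum(1 for layer in exit_layers if layer is not None)
    exit_rate = 100.0 * exited / len(exit_layers)
    return exit_rate, len(set(token_ids))

# Toy trace: 3 of 4 tokens exit early, 3 distinct token ids
rate, unique = decode_stats([11, 31, 31, None], [5, 9, 9, 12])
print(rate, unique)  # 75.0 3
```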
### Convergence: 340K Tokens Analyzed

```
Model                   Layers  Tokens   Finding
======================  ======  =======  =====================
DeepSeek R1 Distill 8B  32      339,853  100% converge by L31
Qwen3 8B                36      314,530  100% converge by L35
GPT-2 (124M)            12      78,843   100% converge by L11
```

The penultimate checkpoint captures the full model output for every token:
the last few layers contribute negligible change to the hidden-state
representations.
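A convergence sweep like this can be summarized per checkpoint. A sketch, assuming per-token cosine similarities against the final-layer hidden state have already been collected (the data layout here is illustrative):

```python
def convergence_by_checkpoint(per_token_sims, threshold=0.85):
    """per_token_sims maps checkpoint layer -> list of per-token cosine
    similarities to the final hidden state. Returns the % of tokens
    converged (similarity >= threshold) at each checkpoint."""
    return {
        layer: 100.0 * sum(s >= threshold for s in sims) / len(sims)
        for layer, sims in per_token_sims.items()
    }

# Toy measurements: half the tokens converge by L11, all by L31
sims = {11: [0.99, 0.62, 0.91, 0.70], 31: [1.0, 0.99, 1.0, 0.98]}
print(convergence_by_checkpoint(sims))  # {11: 50.0, 31: 100.0}
```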

## Tuning the Threshold
