No GPU? TIDE works in pure PyTorch (CPU fallback, no CUDA kernels needed).

All benchmarks on **NVIDIA A100-SXM4-40GB**, bf16, 2000 WikiText calibration samples.
16 prompts (8 reasoning/math + 8 general knowledge).

### Prefill: 100% Exit Rate

Every token finds an early exit point. On reasoning + general prompts:

```
Model                   Layers  Exit Rate  Early Exits (before last checkpoint)
======================  ======  =========  =====================================
DeepSeek R1 Distill 8B  32      100%       5% exit at Layer 11 (1/3 depth)
Qwen3 8B                36      100%       10% exit across L11 + L23 (1/3-2/3)
```

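The exit decision behind these rates is a per-token convergence test at each checkpoint layer. A minimal sketch of one plausible test, assuming convergence is measured as cosine similarity between hidden states at consecutive checkpoints (the function names and the criterion are illustrative, not TIDE's exact implementation):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two hidden-state vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def should_exit(h_prev, h_curr, threshold):
    """Exit early once the hidden state has stopped changing between
    checkpoint layers, i.e. similarity clears the threshold."""
    return cosine_similarity(h_prev, h_curr) >= threshold

# A token whose state barely moved between checkpoints exits at 0.85;
# one whose state is still rotating does not.
print(should_exit([1.0, 2.0, 3.0], [1.01, 2.0, 3.0], 0.85))  # True
print(should_exit([1.0, 2.0, 3.0], [3.0, -2.0, 1.0], 0.85))  # False
```

In a real PyTorch implementation this would be a batched tensor op over `[batch, seq, hidden]` states rather than per-vector Python.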
### Latency: Up to 7% Faster Prefill

Single reasoning prompt, 20 runs averaged on A100:

```
Model                   Baseline  TIDE     Speedup
======================  ========  =======  =======
DeepSeek R1 Distill 8B  39.08ms   36.26ms  -7.2%
Qwen3 8B (36 layers)    46.82ms   44.14ms  -5.7%
```

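A "20 runs averaged" latency number is simple to reproduce. A sketch of the harness, where `run_prefill` is a stand-in for the model's prefill pass (on GPU you would also synchronize before each timestamp):

```python
import time

def mean_latency_ms(run_prefill, warmup=3, runs=20):
    """Average wall-clock latency in ms over `runs` timed calls,
    after `warmup` untimed calls to stabilize caches and clocks."""
    for _ in range(warmup):
        run_prefill()
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        run_prefill()
        total += time.perf_counter() - start
    return 1000.0 * total / runs

# Stand-in workload instead of a real model forward pass
latency = mean_latency_ms(lambda: sum(i * i for i in range(50_000)))
print(f"{latency:.2f}ms")
```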
### Throughput: Up to 8% More Tokens/sec

```
Model                   Batch  Baseline     TIDE         Gain
======================  =====  ===========  ===========  =====
DeepSeek R1 Distill 8B  1      973 tok/s    1,037 tok/s  +6.5%
Qwen3 8B                1      258 tok/s    271 tok/s    +5.0%
Qwen3 8B                8      1,781 tok/s  1,926 tok/s  +8.1%
```

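The Gain column is plain relative change in tokens per second; for example, the batch-8 row:

```python
def throughput_gain(baseline_tok_s, tide_tok_s):
    """Percent change in throughput relative to the baseline."""
    return 100.0 * (tide_tok_s - baseline_tok_s) / baseline_tok_s

# Qwen3 8B at batch 8, figures from the table above
print(round(throughput_gain(1781, 1926), 1))  # 8.1
```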
### Decode: 99% of Reasoning Tokens Exit Early

DeepSeek R1 Distill 8B solving a math problem, 256 tokens, `temperature=0`:

```
Threshold  Decode Exit Rate  Unique Tokens  Quality
=========  ================  =============  =========================
1.0 (off)  0%                99             Correct solution
0.85       98%               95             Correct solution
0.70       99%               95             Correct solution (stable)
0.50       99.6%             95             Correct solution (stable)
```

**99% of decode tokens exit early** while the model still solves the math
problem correctly. Output remains coherent with 95+ unique tokens.

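The exit-rate and unique-token columns are bookkeeping over the decoded sequence. A sketch, assuming the decoder records a per-token exit layer (`None` meaning the token ran the full model; this instrumentation format is an assumption, not TIDE's API):

```python
def decode_stats(exit_layers, token_ids):
    """Summarize one greedy decode: % of tokens that exited early
    and the number of distinct token ids produced."""
    exited = sum(1 for layer in exit_layers if layer is not None)
    exit_rate = 100.0 * exited / len(exit_layers)
    return exit_rate, len(set(token_ids))

# Toy trace: 3 of 4 tokens exit early, 3 distinct token ids
rate, unique = decode_stats([11, 31, 31, None], [5, 9, 9, 12])
print(rate, unique)  # 75.0 3
```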
### Convergence: 340K Tokens Analyzed

```
Model                   Layers  Tokens   Finding
======================  ======  =======  =====================
DeepSeek R1 Distill 8B  32      339,853  100% converge by L31
Qwen3 8B                36      314,530  100% converge by L35
GPT-2 (124M)            12      78,843   100% converge by L11
```

The penultimate checkpoint captures the full model output for every token:
the last few layers contribute negligible change to the hidden-state
representations.
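A convergence sweep like this can be summarized per checkpoint. A sketch, assuming per-token cosine similarities against the final-layer hidden state have already been collected (the data layout here is illustrative):

```python
def convergence_by_checkpoint(per_token_sims, threshold=0.85):
    """per_token_sims maps checkpoint layer -> list of per-token cosine
    similarities to the final hidden state. Returns the % of tokens
    converged (similarity >= threshold) at each checkpoint."""
    return {
        layer: 100.0 * sum(s >= threshold for s in sims) / len(sims)
        for layer, sims in per_token_sims.items()
    }

# Toy measurements: half the tokens converge by L11, all by L31
sims = {11: [0.99, 0.62, 0.91, 0.70], 31: [1.0, 0.99, 1.0, 0.98]}
print(convergence_by_checkpoint(sims))  # {11: 50.0, 31: 100.0}
```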

## Tuning the Threshold
