<p>Start with your constraints: (1) Can you send data to a third-party API, or do you need self-hosting? (2) What is your latency budget? (3) What is your cost ceiling per request? (4) Do you need vision, long context, or strong reasoning? These four questions will narrow the field to 2-3 candidates. Then prototype with your actual data and measure what matters for your specific use case.</p>
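<p>To make the last step concrete, a minimal prototyping harness is sketched below. It assumes a hypothetical <code>call_model</code> callable standing in for whichever client you are evaluating (a hosted API SDK or a self-hosted endpoint); latency and cost numbers only mean something when measured on your own prompts.</p>
<pre><code>import time

def benchmark(call_model, prompts, price_per_1k_output_tokens):
    """Rough prototyping harness: call_model is a placeholder for your client.

    Returns median latency and an approximate per-request output cost so
    candidate models can be compared on your own data.
    """
    latencies, output_tokens = [], 0
    for prompt in prompts:
        start = time.perf_counter()
        reply = call_model(prompt)              # your task, your data
        latencies.append(time.perf_counter() - start)
        output_tokens += len(reply.split())     # crude proxy; swap in a real tokenizer
    latencies.sort()
    return {
        "p50_latency_s": latencies[len(latencies) // 2],
        "approx_cost_per_request": (output_tokens / len(prompts)) / 1000
        * price_per_1k_output_tokens,
    }
</code></pre>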
</div>
<div class="section-break">* * *</div>
<h2>Documentation Frameworks: Model Cards, Datasheets, and Data Cards</h2>
<p>The model cards in this appendix follow a tradition of structured documentation for ML artifacts. Three complementary frameworks have emerged as standards, each addressing a different artifact and audience. Understanding their differences helps teams choose the right documentation strategy for their projects.</p>
<h3>Model Cards (Mitchell et al., 2019)</h3>
<p>
Model cards document the <em>model itself</em>: intended use cases, performance metrics disaggregated by demographic group, known limitations, and ethical considerations. Originally proposed at FAT* 2019, model cards are now standard on Hugging Face, where every model repository includes a card rendered from a structured <code>README.md</code>. Model cards answer the question: "Should I use this model for my task, and what should I watch out for?"
</p>
<h3>Datasheets for Datasets (Gebru et al., 2021)</h3>
<p>
Datasheets document the <em>training or evaluation data</em> behind a model. The framework organizes documentation into seven sections: <strong>motivation</strong> (why the dataset was created), <strong>composition</strong> (what the data contains), <strong>collection process</strong> (how it was gathered and by whom), <strong>preprocessing</strong> (cleaning, filtering, labeling steps), <strong>uses</strong> (intended and prohibited applications), <strong>distribution</strong> (how the dataset is shared), and <strong>maintenance</strong> (who maintains it and how to report issues). Datasheets answer the question: "Can I trust this data, and is it appropriate for my use case?"
</p>
<h3>Data Cards (Google, 2022)</h3>
<p>
Google's Data Cards Playbook extends the datasheet concept with a more structured, template-driven approach designed for enterprise adoption. Data cards include quantitative summaries (dataset size, label distributions, demographic breakdowns) alongside qualitative descriptions, making them easier to generate semi-automatically from metadata. The playbook provides fillable templates and review checklists that integrate into MLOps workflows.
</p>
<h3>Comparison: Documentation Frameworks</h3>
<div class="comparison-table">
<div class="comparison-table-title">Model Cards vs. Datasheets vs. Data Cards</div>
<table>
<thead>
<tr>
<th scope="col">Framework</th>
<th scope="col">Artifact Type</th>
<th scope="col">Primary Audience</th>
<th scope="col">Key Sections</th>
<th scope="col">Adoption</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Model Cards</strong></td>
<td>Trained model</td>
<td>Downstream developers, auditors</td>
<td>Intended use, metrics by group, limitations, ethical considerations</td>
<td>Widespread (Hugging Face, major providers)</td>
</tr>
<tr>
<td><strong>Datasheets</strong></td>
<td>Training or evaluation dataset</td>
<td>Dataset consumers, practitioners, auditors</td>
<td>Motivation, composition, collection process, preprocessing, uses, distribution, maintenance</td>
<td>Widely referenced; reflected in Hugging Face dataset cards</td>
</tr>
<tr>
<td><strong>Data Cards</strong></td>
<td>Dataset (enterprise/production settings)</td>
<td>Product and MLOps teams, reviewers</td>
<td>Quantitative summaries, qualitative descriptions, review checklists</td>
<td>Google products; Data Cards Playbook templates</td>
</tr>
</tbody>
</table>
</div>
<h3>Operationalizing Documentation in Training Pipelines</h3>
<p>
Documentation should not be a manual afterthought. Modern MLOps pipelines can generate documentation artifacts automatically. Hugging Face's <code>huggingface_hub</code> library provides <code>ModelCard</code> and <code>DatasetCard</code> classes that populate templates from training metadata (metrics, hyperparameters, dataset statistics). Google's Data Cards Playbook includes scripts that extract schema information and compute summary statistics directly from data files. The goal is to make documentation a build artifact: generated during training, versioned alongside model weights, and reviewed during the deployment approval process.
</p>
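<p>As a concrete illustration, the sketch below assembles a minimal model card from training metadata with <code>huggingface_hub</code>. The model name, dataset, and metric value are placeholders; a real pipeline would pull them from the training run.</p>
<pre><code>from huggingface_hub import ModelCard, ModelCardData

# Placeholder metadata; a real pipeline would fill this from the training run.
card_data = ModelCardData(
    language="en",
    license="apache-2.0",
    library_name="transformers",
    tags=["text-classification"],
    datasets=["imdb"],
)

content = f"""---
{card_data.to_yaml()}
---

# sentiment-classifier (example)

## Intended Use
Binary sentiment classification of English movie reviews.

## Evaluation
Accuracy 0.91 on the held-out test split (placeholder figure).

## Limitations
Reviewed by a human before release; automated generation covers metadata only.
"""

card = ModelCard(content)
card.save("README.md")   # versioned alongside the model weights
</code></pre>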
<h3>Tools for Documentation</h3>
<p>
<strong>Hugging Face Dataset Cards:</strong> Every dataset on the Hub includes a structured card with YAML metadata (task type, languages, license) and freeform sections. The <code>datasets</code> library can auto-generate skeleton cards from dataset metadata. <strong>Google Data Cards Playbook:</strong> Provides PDF and digital templates, a facilitator guide for team workshops, and example cards for reference datasets. Both tools lower the barrier to producing useful documentation, though human review remains essential for nuanced content like limitation descriptions and ethical considerations.
</p>
<divclass="callout tip">
<divclass="callout-title">Tip</div>
<p>Documentation is a living artifact. Automate what you can (statistics, schema, performance metrics), but reserve human judgment for what you must (limitations, ethical considerations, known biases). Schedule quarterly reviews of model and dataset cards, especially after retraining or data pipeline changes. Stale documentation is worse than no documentation because it creates false confidence.</p>
</div>
<nav class="chapter-nav">
<a class="prev" href="section-h.2.html">Open-Weight Model Families</a>
<!-- part-2-understanding-llms/module-09-inference-optimization/section-9.2.html -->
Batch 32: 2587.1 tok/s (1.24s for 3200 tokens)</div>
<divclass="code-caption"><strong>Code Fragment 9.2.5:</strong> Using a key-value cache to avoid redundant computation during autoregressive generation, dramatically speeding up inference.</div>
<h2>8. Long-Context Techniques <span class="level-badge advanced" title="Advanced">Advanced</span></h2>
<p>Standard transformer context windows are limited by the quadratic cost of attention and the linear growth of the KV cache. A family of techniques now extends context lengths from the typical 4K to 128K range into millions of tokens, enabling applications like full-codebase reasoning, book-length summarization, and long-horizon agent memory.</p>
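<p>The memory pressure is easy to quantify. A quick sketch of the KV-cache size calculation, using illustrative parameter values for a 7B-class model with grouped-query attention and FP16 storage:</p>
<pre><code>def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # 2x for keys and values, stored per layer, per KV head, per position.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative 7B-class configuration with grouped-query attention (8 KV heads):
gb = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=128_000) / 1e9
print(f"{gb:.1f} GB per sequence")   # about 16.8 GB at 128K tokens, growing linearly
</code></pre>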
<h3>8.1 RoPE Extension Methods</h3>
<p>Most modern LLMs use Rotary Position Embeddings (RoPE), which encode position through rotation matrices applied to query and key vectors. Naively extrapolating beyond the trained context length causes quality collapse because the model encounters rotation angles it has never seen. Several methods address this.</p>
<p><strong>YaRN (Yet another RoPE extensioN)</strong> uses NTK-aware interpolation, which adjusts the frequency basis of RoPE so that high-frequency components (encoding local position) are preserved while low-frequency components (encoding global position) are compressed. With minimal fine-tuning (400 to 1,000 steps), YaRN extends a model trained at 4K context to 64K or 128K with negligible perplexity increase.</p>
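<p>A simplified sketch of the underlying idea: plain RoPE derives per-dimension frequencies from a fixed base, and "NTK-aware" extension enlarges that base so low-frequency (long-range) components are stretched while high-frequency (local) ones barely move. Full YaRN additionally ramps between interpolation and extrapolation per dimension and rescales attention temperature, which is omitted here.</p>
<pre><code>import torch

def rope_inv_freq(head_dim, base=10000.0):
    # Standard RoPE: one inverse frequency per pair of channels.
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def ntk_scaled_inv_freq(head_dim, base=10000.0, scale=8.0):
    # "NTK-aware" scaling: raise the base so rotation angles at long range stay
    # within the distribution seen during training (scale = new_len / train_len).
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return 1.0 / (new_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
</code></pre>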
<p><strong>LongRoPE</strong> takes a progressive extension approach: rather than a single interpolation step, it extends context in stages (e.g., 4K to 256K to 2M+), fine-tuning briefly at each stage. LongRoPE also searches for non-uniform interpolation factors per RoPE dimension, finding that some dimensions tolerate more compression than others. This progressive strategy achieves extensions to 2 million tokens.</p>
<h3>8.2 Distributed and Compressed Attention</h3>
<p><strong>Ring Attention</strong> (Liu et al., 2023) distributes attention computation across multiple devices by partitioning the sequence into blocks assigned to different hosts. Each device computes attention over its local block while passing KV blocks to the next device in a ring topology, overlapping communication with computation. The effective context length scales linearly with the number of devices, enabling million-token sequences on a cluster of GPUs without any single device holding the full KV cache.</p>
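<p>What makes this exact is that softmax attention can be accumulated block by block using running statistics, so each device only needs the KV block it currently holds. The sketch below simulates that accumulation for a single query on one device; the actual system overlaps this loop with ring communication across hosts.</p>
<pre><code>import torch

def blockwise_attention(q, kv_blocks):
    """Online-softmax accumulation over (K, V) blocks for one query vector.

    Each block could live on a different device; only running statistics
    (max, denominator, weighted sum) are carried between steps.
    """
    m = torch.tensor(float("-inf"))   # running max of attention logits
    denom = torch.tensor(0.0)         # running softmax denominator
    acc = torch.zeros_like(q)         # running weighted sum of values
    scale = q.shape[-1] ** -0.5
    for K, V in kv_blocks:
        logits = (K @ q) * scale
        new_m = torch.maximum(m, logits.max())
        correction = torch.exp(m - new_m)     # rescale previous partial results
        p = torch.exp(logits - new_m)
        denom = denom * correction + p.sum()
        acc = acc * correction + p @ V
        m = new_m
    return acc / denom
</code></pre>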
<p><strong>Infini-attention</strong> (Munkhdalai et al., 2024) augments standard attention with a compressive memory that accumulates information from previous segments. The model maintains both a local attention window (for recent tokens) and a compressed summary (for distant context), combining them via a learned gating mechanism. This allows processing unbounded input lengths with bounded memory, at the cost of lossy compression for distant information.</p>
<p><strong>Sliding window attention</strong>, used in Mistral and other models, restricts each token to attend only to a fixed window of recent tokens (e.g., 4,096). Information beyond the window propagates through stacked layers: if layer 1 attends to a 4K window, layer 2's window covers tokens that already incorporated 4K of context, effectively reaching 8K. Combined with a small number of global attention tokens or "sink" tokens, sliding window approaches handle long sequences efficiently while keeping per-layer memory constant.</p>
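<p>In code, the pattern is just a banded causal mask. A minimal sketch (boolean mask, True where attention is allowed):</p>
<pre><code>import torch

def sliding_window_mask(seq_len, window):
    # Token i may attend to tokens j in the half-open window (i - window, i].
    idx = torch.arange(seq_len)
    rel = idx.unsqueeze(0) - idx.unsqueeze(1)          # rel[i, j] = j - i
    return torch.logical_and(rel.le(0), rel.gt(-window))

print(sliding_window_mask(6, 3).int())  # lower-triangular band of width 3
</code></pre>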
<div class="comparison-table">
<div class="comparison-table-title">Long-Context Methods at a Glance</div>
<table>
<tr><th scope="col">Method</th><th scope="col">Maximum Context</th><th scope="col">Hardware</th><th scope="col">Quality</th></tr>
<tr><td><strong>YaRN</strong></td><td>128K</td><td>Single GPU (fine-tune only)</td><td>Near-lossless up to 8x extension</td></tr>
<tr><td><strong>LongRoPE</strong></td><td>2M+</td><td>Single GPU (progressive fine-tune)</td><td>Good with per-dimension tuning</td></tr>
<tr><td><strong>Ring Attention</strong></td><td>1M+ (scales with devices)</td><td>Multi-GPU cluster</td><td>Exact (no approximation)</td></tr>
<tr><td><strong>Infini-attention</strong></td><td>Unbounded (streaming)</td><td>Single GPU</td><td>Lossy for distant context</td></tr>
<tr><td><strong>Sliding Window</strong></td><td>Unbounded (streaming)</td><td>Single GPU</td><td>Local context exact; distant via layer stacking</td></tr>
</table>
</div>
<divclass="callout research-frontier">
<divclass="callout-title">Research Frontier: The Convergence Toward Million-Token Context</div>
<p>By 2025, Gemini 1.5 Pro (2M tokens), Claude 3 (200K), and GPT-4 Turbo (128K) demonstrated that very long contexts are production-viable. The research trajectory is clear: RoPE extensions and efficient attention mechanisms are converging toward million-token windows as a default capability. The open question is not whether models can process long contexts, but whether they can <em>reliably use</em> information at arbitrary positions within them. Retrieval accuracy degrades in the middle of very long contexts (the "lost in the middle" phenomenon), suggesting that raw context length alone is insufficient without architectural improvements to attention patterns.</p>
</div>
<h2>9. Research Frontiers <span class="level-badge advanced" title="Advanced">Advanced</span></h2>
<h3>9.1 Test-Time Training (TTT)</h3>
<p>Test-Time Training (Sun et al., 2024) proposes a radical alternative to the KV cache. Instead of storing explicit key-value pairs for all past tokens, TTT layers <em>compress</em> the context into updated model weights. During inference, when processing a long context, a TTT layer performs a mini training step: it updates a small set of internal parameters via <aclass="cross-ref" href="../../part-1-foundations/module-00-ml-pytorch-foundations/section-0.2.html">gradient descent</a> on a next-token-prediction loss over the recent context. These updated parameters implicitly encode the contextual information that would otherwise require an explicit KV cache.</p>
<p>The result is dramatic: TTT achieves up to 35x speedup over full attention at 2 million token context length, because the "cache" is a fixed-size set of model parameters rather than a linearly-growing tensor. However, the approach blurs the traditional boundary between training and inference, since gradient computation occurs at every forward pass. Unlike <ahref="../../part-4-training-adapting/module-14-fine-tuning-fundamentals/index.html" class="cross-ref">fine-tuning</a>, which permanently updates model weights for reuse across many requests, TTT creates temporary weight updates for a single inference request. The model compresses long-context information into these ephemeral weights, then discards them entirely once generation is complete. This makes TTT a form of adaptive inference rather than a training procedure.</p>
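<p>A toy illustration of the core mechanic, stripped of everything else (a didactic sketch, not the architecture from the paper): the "cache" is a fast-weight matrix updated by one gradient step of a reconstruction loss per token, so memory stays constant no matter how long the context grows.</p>
<pre><code>import torch

def ttt_linear(k, v, q, lr=0.1):
    """Compress context into fast weights instead of a growing KV cache.

    k, v, q: [seq_len, dim]. W uses O(dim^2) memory regardless of seq_len,
    and is discarded once the request finishes.
    """
    seq_len, dim = k.shape
    W = torch.zeros(dim, dim)
    out = torch.empty_like(q)
    for t in range(seq_len):
        err = k[t] @ W - v[t]                 # self-supervised reconstruction error
        W = W - lr * torch.outer(k[t], err)   # one gradient step on 0.5 * ||k W - v||^2
        out[t] = q[t] @ W                     # read out with the just-updated weights
    return out
</code></pre>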
<h3>9.2 DeepSeek Sparse Attention (DSA)</h3>
<p>Introduced in DeepSeek V3.2, DeepSeek Sparse Attention addresses long-context inference through a hierarchical two-stage pipeline. The first stage, called the <strong>Lightning indexer</strong>, performs a coarse scan over the full context to identify which segments are most relevant to the current query. The second stage applies fine-grained token-level attention only within the selected segments. This two-stage approach reduces inference compute by approximately 70% for long contexts while maintaining quality comparable to full attention.</p>
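<p>The sketch below illustrates the two-stage structure with a deliberately naive indexer (mean-pooled block keys scored against the query). It shows the hierarchical selection pattern only; it is not DeepSeek's actual Lightning indexer or kernel implementation.</p>
<pre><code>import torch

def two_stage_sparse_attention(q, K, V, block_size=64, top_k=4):
    """Coarse block selection followed by exact attention inside selected blocks.

    q: [dim]; K, V: [seq_len, dim]. Assumes seq_len is a multiple of block_size.
    """
    seq_len, dim = K.shape
    n_blocks = seq_len // block_size
    Kb = K.reshape(n_blocks, block_size, dim)
    Vb = V.reshape(n_blocks, block_size, dim)

    # Stage 1: cheap indexer -- score each block by its mean key against the query.
    block_scores = Kb.mean(dim=1) @ q                       # [n_blocks]
    selected = block_scores.topk(min(top_k, n_blocks)).indices

    # Stage 2: fine-grained attention over tokens in the selected blocks only.
    K_sel = Kb[selected].reshape(-1, dim)
    V_sel = Vb[selected].reshape(-1, dim)
    weights = torch.softmax((K_sel @ q) * dim ** -0.5, dim=0)
    return weights @ V_sel
</code></pre>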