blog/2023-03-30-vicuna.md (1 addition, 1 deletion)
@@ -120,7 +120,7 @@ Vicuna is created by fine-tuning a LLaMA base model using approximately 70K user
Our training recipe builds on top of [Stanford’s alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) with the following improvements.
- **Multi-turn conversations:** We adjust the training loss to account for multi-turn conversations and compute the fine-tuning loss solely on the chatbot's output.
- **Memory Optimizations:** To enable Vicuna's understanding of long context, we expand the max context length from 512 in alpaca to 2048, which substantially increases GPU memory requirements. We tackle the memory pressure by utilizing [gradient checkpointing](https://arxiv.org/abs/1604.06174) and [flash attention](https://arxiv.org/abs/2205.14135).
- - **Cost Reduction via Spot Instance:** The 40x larger dataset and 4x sequence length for training pose a considerable challenge in training expenses. We employ [SkyPilot](https://github.com/skypilot-org/skypilot) [managed spot](https://skypilot.readthedocs.io/en/latest/examples/spot-jobs.html) to reduce the cost by leveraging cheaper spot instances with auto-recovery for preemptions and automatic zone switching. This solution slashes costs for training the 7B model from $500 to around $140 and the 13B model from around $1K to $300.
+ - **Cost Reduction via Spot Instance:** The 40x larger dataset and 4x sequence length for training pose a considerable challenge in training expenses. We employ [SkyPilot](https://github.com/skypilot-org/skypilot) [managed spot](https://skypilot.readthedocs.io/en/latest/examples/spot-jobs.html) to reduce the cost by leveraging cheaper spot instances with auto-recovery for preemptions and automatic zone switching. This solution slashes costs for training the 7B model from `$500` to around `$140` and the 13B model from around `$1K` to `$300`.
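As a concrete illustration of the multi-turn loss masking described above, here is a minimal PyTorch-style sketch; the tensor names and the `-100` ignore-index convention are assumptions for illustration, not FastChat's actual implementation:

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # tokens with this label are excluded from the loss

def mask_non_assistant_tokens(input_ids, is_assistant_token):
    """Build labels so the loss is computed only on the chatbot's replies.

    input_ids:          LongTensor [seq_len] of token ids for the whole
                        multi-turn conversation (user and assistant turns).
    is_assistant_token: BoolTensor [seq_len], True where the token belongs
                        to an assistant reply.
    """
    labels = input_ids.clone()
    labels[~is_assistant_token] = IGNORE_INDEX  # ignore user/system tokens
    return labels

def fine_tuning_loss(logits, labels):
    # Standard next-token prediction: predict token t+1 from the prefix up to t.
    shift_logits = logits[:-1, :]
    shift_labels = labels[1:]
    return F.cross_entropy(shift_logits, shift_labels,
                           ignore_index=IGNORE_INDEX)
```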
blog/2023-06-29-longchat.md (1 addition, 1 deletion)
@@ -70,7 +70,7 @@ After condensing the embedding, we perform the finetuning procedure on our curat
We reuse our collected user-shared conversations previously used for training Vicuna.
We clean the data using the FastChat data pipeline and truncate these conversations so they are no longer than 16K tokens.
We fine-tune the model using the standard next-token prediction loss, training the 7B and 13B models with 80k and 18k conversations, respectively.
- To save memory, we use PyTorch FSDP and Flash Attention. Assuming an A100 costs \$3/hour on the cloud, the 7B model costs ~\$300 and the 13B model ~\$700 to train.
+ To save memory, we use PyTorch FSDP and Flash Attention. Assuming an A100 costs `$3/hour` on the cloud, the 7B model costs `~$300` and the 13B model `~$700` to train.
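The 16K truncation step might look like the following minimal sketch; the tokenizer checkpoint and whole-turn truncation policy are assumptions for illustration, and the actual FastChat pipeline differs in its details:

```python
from transformers import AutoTokenizer

MAX_LEN = 16 * 1024  # 16K-token budget per conversation

# Hypothetical checkpoint name; any LLaMA-compatible tokenizer would do.
tokenizer = AutoTokenizer.from_pretrained("lmsys/longchat-7b-16k")

def truncate_conversation(turns, max_len=MAX_LEN):
    """Keep whole turns from the start until the token budget is exhausted.

    turns: list of strings, alternating user/assistant messages.
    """
    kept, used = [], 0
    for turn in turns:
        n_tokens = len(tokenizer(turn, add_special_tokens=False).input_ids)
        if used + n_tokens > max_len:
            break
        kept.append(turn)
        used += n_tokens
    return kept
```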
## Evaluation toolkits: LongEval
Recently, commercial and open-source models have continued to tout their abilities to support expanded context length (from 8K, 32K, 84K, to 100K) in their latest releases, but how can we verify these claims?
blog/2023-11-21-lookahead-decoding.md (1 addition, 1 deletion)
@@ -115,7 +115,7 @@ For powerful GPUs (e.g., A100), lookahead decoding can better squeeze its perfor
Increasing $N$ and $W$ together achieves the most balanced performance and avoids the theoretical cap that is hit when only one of them is increased. Our experimental results show that on A100, the configurations in Table 1 work well in most cases. The 7B, 13B, and 33B models require 120x, 80x, and 56x extra FLOPs per step, respectively. However, because LLM decoding is memory-bound, these extra FLOPs add only a small per-step cost, while the visible step compression ratio yields a notable speedup.
- <p style="color:gray; text-align: center;">Table 1. Good configurations for window size $W$ and N-gram size $N$ on A100.</p>
+ <p style="color:gray; text-align: center;">Table 1. Good configurations for window size W and N-gram size N on A100.</p>
blog/2024-05-02-kaggle-competition.md (1 addition, 1 deletion)
@@ -18,4 +18,4 @@ Current LLM benchmarks often fail to capture real-world LLM usage, resulting in
### Competition Details
- The competition will run until August 5th, **with a total prize of $100,000**, featuring a $25,000 prize for 1st place, 20,000 prizes for 2nd through 4th places, and a 15,000 prize for 5th place. This is your opportunity to contribute to the advancement of human-aligned language models while gaining valuable insights into human preferences and decision-making. These insights could provide value to both the computer science and psychology communities, shedding light on the factors that shape human preferences in conversational AI.
+ The competition will run until August 5th, **with a total prize of `$100,000`**, featuring a `$25,000` prize for 1st place, `$20,000` prizes for 2nd through 4th places, and a `$15,000` prize for 5th place. This is your opportunity to contribute to the advancement of human-aligned language models while gaining valuable insights into human preferences and decision-making. These insights could provide value to both the computer science and psychology communities, shedding light on the factors that shape human preferences in conversational AI.
This form might look familiar: it is the same type of model as the Arena Score, a logistic model, just with a different, _additive_ structure. The model scores $\beta^{\rm Model}$ and prompt scores $\beta^{\rm Prompt}$ combine additively to give a notion of total strength for each model-prompt pair. The player scores $\beta^{\rm Player}$ have a similar interpretation to the standard Elo score, and we let $\beta$ denote the concatenation $(\beta^{\rm Player}, \beta^{\rm Model}, \beta^{\rm Prompt})$. For lack of a better term, we call this model “Extended Elo”.
What problem is this new model solving that the old Elo algorithm couldn’t? The answer is in the efficiency of estimation. The standard Elo algorithm could apply in our setting by simply calling every model-prompt pair a distinct “opponent” for the purposes of calculating the leaderboard. However, this approach has two issues:
@@ -47,9 +49,11 @@ There are $M\times R$ model-prompt pairs, and only $M+R$ distinct models and pro
Now, we solve this logistic regression problem _online_. That is, letting $\ell(x,y;\beta)$ be the binary cross-entropy loss, we use the iteration
This is a generalization of the Elo update. In fact, if one removes the prompt coefficient, it reduces exactly to the Elo update between players and models, as if these were 1-1 games.
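For concreteness, here is a minimal sketch of what such an online iteration could look like, assuming plain stochastic gradient descent with learning rate $\eta$ and predicted win probability $\hat{p} = \sigma(\beta^{\rm Player}_i - \beta^{\rm Model}_m - \beta^{\rm Prompt}_r)$ for a battle between player $i$ and model $m$ on prompt $r$ with outcome $y \in \{0,1\}$; the sign convention and learning rate are illustrative assumptions, not the blog's exact formulation:

$$
\beta_{t+1} = \beta_t - \eta \nabla_\beta \ell(x_t, y_t; \beta_t),
$$

which, for a single battle, touches only the three coefficients involved:

$$
\beta^{\rm Player}_i \leftarrow \beta^{\rm Player}_i + \eta (y - \hat{p}), \qquad
\beta^{\rm Model}_m \leftarrow \beta^{\rm Model}_m - \eta (y - \hat{p}), \qquad
\beta^{\rm Prompt}_r \leftarrow \beta^{\rm Prompt}_r - \eta (y - \hat{p}).
$$

With the prompt coefficient dropped, the player and model updates are exactly the familiar Elo rating exchange with $K$-factor $\eta$.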
blog/2026-01-15-chunked-pipeline.md (12 additions, 4 deletions)
@@ -36,13 +36,19 @@ The primary bottleneck in distributed inference scaling is inter-device communic
Assume $B$ stands for the Batch Size (often 1 for ultra-long-context inference), $S$ for the total Sequence Length, $H$ for the Hidden State dimension, $L$ for the total Layer Number, $M$ for the number of Micro-batches, and that the activation precision is FP8 (1 byte). Based on this, we analyze the communication volume of different parallel strategies.
* **TP:** TP splits individual weight tensors across multiple devices within a single layer. As a result, TP incurs high communication overhead because synchronization is required after both the Attention Block and the MLP Block, so the communication volume scales linearly with the number of layers. This frequent **All-Reduce** synchronization makes TP bandwidth-bound, limiting its scalability across large clusters.
- $$\text{Commu Volume}({TP}) = 2 \cdot (TP_{Size} - 1) \cdot \left( B \cdot S \cdot \frac{H}{TP_{Size}} \right) \cdot 2 \cdot L \cdot \text{bytes} \approx 4 \cdot B \cdot S \cdot H \cdot L \cdot \text{bytes}$$
+ $$
+ \text{Commu Volume}({TP}) = 2 \cdot (TP_{Size} - 1) \cdot \left( B \cdot S \cdot \frac{H}{TP_{Size}} \right) \cdot 2 \cdot L \cdot \text{bytes} \approx 4 \cdot B \cdot S \cdot H \cdot L \cdot \text{bytes}
+ $$
(Note: Each All-Reduce involves $2 \times$ the data size in a ring-based implementation. Each layer involves $2 \times$ All-Reduce operations, one after the Attention Block, and one after the MLP Block.)
* **CP:** Similarly, CP requires extensive synchronization to aggregate Key-Value (KV) states across devices. Typically, CP utilizes **All-Gather** at every layer, resulting in significant latency penalties in bandwidth-constrained environments.
- $$\text{Commu Volume}({CP}) = (CP_{Size} - 1) \cdot \left( B \cdot \frac{S}{CP_{Size}} \cdot 2 \cdot H_{KV} \right) \cdot L \cdot \text{bytes} \approx 2 \cdot B \cdot S \cdot H_{KV} \cdot L \cdot \text{bytes}$$
+ $$
+ \text{Commu Volume}({CP}) = (CP_{Size} - 1) \cdot \left( B \cdot \frac{S}{CP_{Size}} \cdot 2 \cdot H_{KV} \right) \cdot L \cdot \text{bytes} \approx 2 \cdot B \cdot S \cdot H_{KV} \cdot L \cdot \text{bytes}
+ $$
(Note: We assume CP utilizes a Ring-Attention-based solution. For models using GQA, $H_{KV}$ is smaller than $H$, which reduces CP's communication volume.)
* **PP:** In contrast, PP exhibits a significantly reduced communication footprint. Data is transferred **only at the boundaries** of pipeline stages, using **Point-to-Point (P2P)** primitives rather than collective operations. Since a stage typically contains multiple layers, the communication frequency is determined by the number of stages ($P$), not the total number of layers ($L$). Crucially, for a fixed model, the per-boundary communication volume stays constant as we increase the number of layers per stage.
- $$\text{Commu Volume}({PP}) = M \cdot \left( \frac{B}{M} \cdot S \cdot H \right) \cdot (P-1) \cdot \text{bytes} = B \cdot S \cdot H \cdot (P-1) \cdot \text{bytes}$$
+ $$
+ \text{Commu Volume}({PP}) = M \cdot \left( \frac{B}{M} \cdot S \cdot H \right) \cdot (P-1) \cdot \text{bytes} = B \cdot S \cdot H \cdot (P-1) \cdot \text{bytes}
+ $$
(Note: In multi-node deployments where $P \ll L$, PP achieves a nearly order-of-magnitude reduction in total communication volume compared to TP.)
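To make the three formulas above easy to compare, here is a small sketch that evaluates them for a hypothetical configuration; all numbers are illustrative placeholders, not measurements from the blog:

```python
def comm_volume_bytes(B, S, H, L, H_kv, M, tp_size, cp_size, pp_size):
    """Activation communication volume (bytes) per forward pass, following
    the TP / CP / PP formulas above with FP8 (1-byte) activations."""
    tp = 2 * (tp_size - 1) * (B * S * H / tp_size) * 2 * L   # All-Reduce, 2x per layer
    cp = (cp_size - 1) * (B * S / cp_size * 2 * H_kv) * L    # KV All-Gather per layer
    pp = M * (B / M * S * H) * (pp_size - 1)                 # P2P at stage boundaries
    return tp, cp, pp

# Hypothetical 7B-like model on an 8-way parallel setup, 128K-token prefill.
tp, cp, pp = comm_volume_bytes(B=1, S=128 * 1024, H=4096, L=32, H_kv=1024,
                               M=4, tp_size=8, cp_size=8, pp_size=8)
for name, volume in [("TP", tp), ("CP", cp), ("PP", pp)]:
    print(f"{name}: {volume / 1e9:.1f} GB")
```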
### **2. The Bubble Ratio Trade-off**
@@ -122,7 +128,9 @@ We tested different models using a large PP size and found that they all conform
Therefore, if SGLang still **utilizes a fixed chunked prefill size for CPP, the pipeline bubble ratio will be greater than the theoretical expectation (i.e., $\frac{P - 1}{P - 1 + M}$)**.
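As a quick illustration with hypothetical numbers: with $P = 8$ pipeline stages and $M = 16$ micro-batches, the theoretical bubble ratio would be $\frac{8 - 1}{8 - 1 + 16} = \frac{7}{23} \approx 30\%$, and a fixed chunk size would push the observed ratio above that.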
To address this issue, SGLang introduces a dynamic chunking mechanism to predict the optimal size for the next chunk such that it satisfies this condition:
where $L$ denotes the Prefix Sequence Length, and $\Delta L$ denotes the Next Chunk Size. By profiling a series of requests with different ITLs, we model the cumulative runtime as a quadratic function of sequence length. Using this model, we solve the optimal next chunk size $\Delta L$ for any given prefix length $L$. Since the computation/communication complexity of the Attention mechanism scales with $L$, the next chunk size will be progressively reduced as $L$ grows to maintain an aligned chunk execution time across pipeline stages.
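The dynamic chunking idea can be sketched as follows, assuming the profiled cumulative runtime is fit as a quadratic $T(L) = aL^2 + bL + c$ and each chunk should take roughly the same target time; the coefficient values and the target-time criterion are illustrative assumptions, not SGLang's exact implementation:

```python
import math

# Hypothetical quadratic runtime model T(L) = a*L^2 + b*L + c (milliseconds),
# obtained by profiling chunks at different prefix lengths.
a, b, c = 1.5e-7, 2.0e-3, 5.0

def next_chunk_size(L, target_ms, min_chunk=128):
    """Choose dL so the next chunk's runtime T(L + dL) - T(L) ~= target_ms.

    Expanding T(L + dL) - T(L) = target_ms gives a quadratic in dL:
        a*dL^2 + (2*a*L + b)*dL - target_ms = 0,
    which we solve with the positive root of the quadratic formula.
    """
    A, B, C = a, 2 * a * L + b, -target_ms
    dL = (-B + math.sqrt(B * B - 4 * A * C)) / (2 * A)
    return max(min_chunk, int(dL))

# As the prefix grows, the chunk shrinks to keep per-chunk time aligned.
for L in (0, 16_384, 65_536, 131_072):
    print(L, next_chunk_size(L, target_ms=50.0))
```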