Commit 4e82041

[Format] Fix tex format and dollar sign (#330)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
1 parent d8c1612 commit 4e82041

8 files changed: 22 additions & 10 deletions

blog/2023-03-30-vicuna.md

Lines changed: 1 addition & 1 deletion
@@ -120,7 +120,7 @@ Vicuna is created by fine-tuning a LLaMA base model using approximately 70K user
Our training recipe builds on top of [Stanford’s alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) with the following improvements.
- **Multi-turn conversations:** We adjust the training loss to account for multi-turn conversations and compute the fine-tuning loss solely on the chatbot's output.
- **Memory Optimizations:** To enable Vicuna's understanding of long context, we expand the max context length from 512 in alpaca to 2048, which substantially increases GPU memory requirements. We tackle the memory pressure by utilizing [gradient checkpointing](https://arxiv.org/abs/1604.06174) and [flash attention](https://arxiv.org/abs/2205.14135).
-- **Cost Reduction via Spot Instance:** The 40x larger dataset and 4x sequence length for training poses a considerable challenge in training expenses. We employ [SkyPilot](https://github.com/skypilot-org/skypilot) [managed spot](https://skypilot.readthedocs.io/en/latest/examples/spot-jobs.html) to reduce the cost by leveraging the cheaper spot instances with auto-recovery for preemptions and auto zone switch. This solution slashes costs for training the 7B model from $500 to around $140 and the 13B model from around $1K to $300.
+- **Cost Reduction via Spot Instance:** The 40x larger dataset and 4x sequence length for training poses a considerable challenge in training expenses. We employ [SkyPilot](https://github.com/skypilot-org/skypilot) [managed spot](https://skypilot.readthedocs.io/en/latest/examples/spot-jobs.html) to reduce the cost by leveraging the cheaper spot instances with auto-recovery for preemptions and auto zone switch. This solution slashes costs for training the 7B model from `$500` to around `$140` and the 13B model from around `$1K` to `$300`.


## Serving

blog/2023-06-29-longchat.md

Lines changed: 1 addition & 1 deletion
@@ -70,7 +70,7 @@ After condensing the embedding, we perform the finetuning procedure on our curat
We reuse our collected user-shared conversations previously used for training Vicuna.
We clean the data using FastChat data pipeline, and truncate these conversations so they are no longer than 16K.
We finetune the model using standard next-token prediction loss. We fine-tune the 7B and 13B models with 80k and 18k conversations, respectively.
-To save memory, we use Pytorch FSDP and Flash Attention. Assume A100 is \\$3/hour on Cloud, the 7B model costs ~\\$300, and the 13B model costs ~\\$700.
+To save memory, we use Pytorch FSDP and Flash Attention. Assume A100 is `$3/hour` on Cloud, the 7B model costs `~$300`, and the 13B model costs `~$700`.

## Evaluation toolkits: LongEval
Recently, commercial and open-source models have continued to tout their abilities to support expanded context length (from 8K, 32K, 84K, to 100K) in their latest releases, but how can we verify these claims?

blog/2023-11-21-lookahead-decoding.md

Lines changed: 1 addition & 1 deletion
@@ -115,7 +115,7 @@ For powerful GPUs (e.g., A100), lookahead decoding can better squeeze its perfor
Increasing $N$ together with $W$ would be best to achieve balanced performance, avoiding hitting a theoretical cap if only increasing one side. Our experimental results show that on A100, the following configs in Table 1 work well in most cases. The 7B, 13B, and 33B models require 120x, 80x, and 56x extra FLOPs per step, respectively. However, because of the memory-intensive bound characteristic of the LLM decoding, these extra FLOPs only bring little per-step cost and a visible step compression ratio, resulting in a notable speedup.


-<p style="color:gray; text-align: center;">Table 1. Good configurations for window size $W$ and N-gram size $N$ on A100. </p>
+<p style="color:gray; text-align: center;">Table 1. Good configurations for window size W and N-gram size N on A100. </p>

<style>
.tg {border-collapse:collapse;border-spacing:0;margin:0px auto;}

blog/2024-05-02-kaggle-competition.md

Lines changed: 1 addition & 1 deletion
@@ -18,4 +18,4 @@ Current LLM benchmarks often fail to capture real-world LLM usage, resulting in

### Competition Details

-The competition will run until August 5th, **with a total prize of $100,000**, featuring a $25,000 prize for 1st place, 20,000 prizes for 2nd through 4th places, and a 15,000 prize for 5th place. This is your opportunity to contribute to the advancement of human-aligned language models while gaining valuable insights into human preferences and decision-making. These insights could provide value to both the computer science and psychology communities, shedding light on the factors that shape human preferences in conversational AI.
+The competition will run until August 5th, **with a total prize of `$100,000`**, featuring a `$25,000` prize for 1st place, `$20,000` prizes for 2nd through 4th places, and a `$15,000` prize for 5th place. This is your opportunity to contribute to the advancement of human-aligned language models while gaining valuable insights into human preferences and decision-making. These insights could provide value to both the computer science and psychology communities, shedding light on the factors that shape human preferences in conversational AI.

blog/2024-09-13-redteam-arena.md

Lines changed: 4 additions & 0 deletions
@@ -36,9 +36,11 @@ People have been asking how we compute the leaderboard of players, models, and p
* $Y_i \in \{0,1\}$, a binary outcome taking the value 1 if the player won (or forfeited) and 0 otherwise.

We then model the win probability of the player as
+$$
\begin{equation}
\mathbb{P}(Y_i = 1 | X_i^{\rm Model}, X_i^{\rm Player}, X_i^{\rm Prompt}) = \frac{e^{X_i^{\rm Player}\beta^{\rm Player}}}{e^{X_i^{\rm Player}\beta^{\rm Player}} + e^{X_i^{\rm Model}\beta^{\rm Model} + X_i^{\rm Prompt}\beta^{\rm Prompt}}}.
\end{equation}
+$$
This form might look familiar, since it is the same type of model as the Arena Score: a logistic model. This is just a logistic model with a different, _additive_ structure—the model scores $\beta^{\rm Model}$ and prompt scores $\beta^{\rm Prompt}$ combine additively to generate a notion of total strength for the model-prompt pair. The player scores $\beta^{\rm Player}$ have a similar interpretation as the standard Elo score, and we let $\beta$ denote the concatenation $(\beta^{\rm Player}, \beta^{\rm Model}, \beta^{\rm Prompt})$. For lack of a better term, we call this model “Extended Elo”.

What problem is this new model solving that the old Elo algorithm couldn’t? The answer is in the efficiency of estimation. The standard Elo algorithm could apply in our setting by simply calling every model-prompt pair a distinct “opponent” for the purposes of calculating the leaderboard. However, this approach has two issues:
@@ -47,9 +49,11 @@ There are $M\times R$ model-prompt pairs, and only $M+R$ distinct models and pro


Now, we solve this logistic regression problem _online_. That is, letting $\ell(x,y;\beta)$ be the binary cross-entropy loss, we use the iteration
+$$
\begin{equation}
\beta_n = \beta_{n-1} - \eta \nabla_\beta \ell(X_{n-1}, Y_{n-1}; \beta_{n-1}),
\end{equation}
+$$
for some learning rate $\eta$.
This is a generalization of the Elo update. In fact, if one removes the prompt coefficient, it reduces exactly to the Elo update between players and models, as if these were 1-1 games.

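The two equations in this hunk, the additive logistic win probability and the online SGD iteration, fit together in a few lines; a minimal sketch, where the class name, learning rate, and player/model/prompt identifiers are hypothetical and not taken from the Red-Team Arena codebase:

```python
import math
from collections import defaultdict

# Sketch of "Extended Elo": a logistic model whose score is
# beta_player - (beta_model + beta_prompt), fit online by SGD on the
# binary cross-entropy loss.

class ExtendedElo:
    def __init__(self, lr=0.1):
        self.lr = lr
        self.player = defaultdict(float)   # beta^Player
        self.model = defaultdict(float)    # beta^Model
        self.prompt = defaultdict(float)   # beta^Prompt

    def win_prob(self, p, m, r):
        # P(Y=1) = e^{bp} / (e^{bp} + e^{bm+br}) = sigmoid(bp - bm - br)
        s = self.player[p] - (self.model[m] + self.prompt[r])
        return 1.0 / (1.0 + math.exp(-s))

    def update(self, p, m, r, y):
        # beta_n = beta_{n-1} - lr * grad(loss); d(loss)/d(score) = prob - y
        g = self.win_prob(p, m, r) - y
        self.player[p] -= self.lr * g
        self.model[m] += self.lr * g
        self.prompt[r] += self.lr * g

elo = ExtendedElo()
elo.update("alice", "model-a", "prompt-1", y=1)   # the player won this game
assert elo.player["alice"] > 0.0 > elo.model["model-a"]
```

Dropping the prompt term recovers the plain Elo update between players and models, matching the reduction noted in the text.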
blog/2025-07-20-k2-large-scale-ep.md

Lines changed: 1 addition & 1 deletion
@@ -197,7 +197,7 @@ Note: The prefill-to-decode ratio is workload-dependent. We prioritized decode n
| --- | --- |
| **Prefill Throughput** | **224k tokens/sec (4 P Nodes)** |
| **Decode Throughput** | **288k tokens/sec (12 D Nodes)** |
-| **Cost per 1M Output Tokens** | **~$0.21**(**H200 $2.3/hour**) |
+| **Cost per 1M Output Tokens** | **`~$0.21`**(**H200 `$2.3/hour`**) |

---

blog/2025-09-01-sglang-longcat-flash.md

Lines changed: 1 addition & 1 deletion
@@ -100,7 +100,7 @@ python3 -m sglang.launch_server \
#### **Multi-Node Deployment(** 16xH800-80G ****

In a multi-node setup, Tensor Parallelism and Expert Parallelism are employed, with additional parallel strategies planned for future implementation.
-Replace $NODE_RANK and $MASTER_IP with the specific values for your cluster.
+Replace `$NODE_RANK` and `$MASTER_IP` with the specific values for your cluster.
```Shell
python3 -m sglang.launch_server \
--model meituan-longcat/LongCat-Flash-Chat \

blog/2026-01-15-chunked-pipeline.md

Lines changed: 12 additions & 4 deletions
@@ -36,13 +36,19 @@ The primary bottleneck in distributed inference scaling is inter-device communic
Assuming $B$ stands for the Batch Size (often 1 for ultra-long context inference), $S$ for the total Sequence Length, $H$ for the Hidden State dimension, $L$ for the total Layer Number, $M$ for the Micro-batches size, and the activation precision is FP8 (1 byte). Based on this, we analyzed the communication volume of different parallel strategies.

* **TP:** TP splits individual weight tensors across multiple devices within a single layer. Due to this, TP incurs high communication overhead due to the necessity of synchronization after both the Attention Block and MLP Block. Consequently, the communication volume scales linearly with the number of layers. This frequent **All-Reduce** synchronization makes TP bandwidth-bound, limiting its scalability across large clusters.
-$$\text{Commu Volume}({TP}) = 2 \cdot (TP_{Size} - 1) \cdot \left( B \cdot S \cdot \frac{H}{TP_{Size}} \right) \cdot 2 \cdot L \cdot \text{bytes} \approx 4 \cdot B \cdot S \cdot H \cdot L \cdot \text{bytes}$$
+$$
+\text{Commu Volume}({TP}) = 2 \cdot (TP_{Size} - 1) \cdot \left( B \cdot S \cdot \frac{H}{TP_{Size}} \right) \cdot 2 \cdot L \cdot \text{bytes} \approx 4 \cdot B \cdot S \cdot H \cdot L \cdot \text{bytes}
+$$
(Note: Each All-Reduce involves $2 \times$ the data size in a ring-based implementation. Each layer involves $2 \times$ All-Reduce operations, one after the Attention Block, and one after the MLP Block.)
* **CP:** Similarly, CP requires extensive synchronization communication to aggregate Key-Value (KV) states across devices. Typically, CP utilizes **All-Gather** at every layer, resulting in significant latency penalties in bandwidth-constrained environments.
-$$\text{Commu Volume}({CP}) = (CP_{Size} - 1) \cdot \left( B \cdot \frac{S}{CP_{Size}} \cdot 2 \cdot H_{KV} \right) \cdot L \cdot \text{bytes} \approx 2 \cdot B \cdot S \cdot H_{KV} \cdot L \cdot \text{bytes}$$
+$$
+\text{Commu Volume}({CP}) = (CP_{Size} - 1) \cdot \left( B \cdot \frac{S}{CP_{Size}} \cdot 2 \cdot H_{KV} \right) \cdot L \cdot \text{bytes} \approx 2 \cdot B \cdot S \cdot H_{KV} \cdot L \cdot \text{bytes}
+$$
(Note: Assuming CP utilizes Ring-Attention-based solution. For models utilizing GQA, $H_{KV}$ is smaller than $H$, which reduces CP's communication volume.)
* **PP:** In contrast, PP exhibits a significantly reduced communication footprint. Data is transferred **only at the boundaries** of pipeline stages, using **Point-to-Point (P2P)** primitives rather than collective operations. Since a stage typically contains multiple layers, the communication frequency is determined by the number of stages ($P$), not the total number of layers ($L$). Crucially, for a fixed model, as we increase the number of layers per stage, the communication volume remains constant at the boundaries.
-$$\text{Commu Volume}({PP}) = M \cdot \left( \frac{B}{M} \cdot S \cdot H \right) \cdot (P-1) \cdot \text{bytes} = B \cdot S \cdot H \cdot (P-1) \cdot \text{bytes}$$
+$$
+\text{Commu Volume}({PP}) = M \cdot \left( \frac{B}{M} \cdot S \cdot H \right) \cdot (P-1) \cdot \text{bytes} = B \cdot S \cdot H \cdot (P-1) \cdot \text{bytes}
+$$
(Note: In multi-node deployments where $P \ll L$, PP achieves a nearly order-of-magnitude reduction in total communication volume compared to TP.)

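As a sanity check on the three closed-form volumes above, a small sketch that evaluates them numerically; the dimensions and parallel sizes below are illustrative assumptions (roughly 7B-class), not values taken from the post:

```python
# Sketch: evaluate the TP/CP/PP communication-volume formulas in bytes
# (FP8 activations, 1 byte per element). All inputs are assumed examples.

def tp_volume(B, S, H, L, tp):
    # 2 All-Reduces per layer; a ring All-Reduce moves 2*(n-1)/n of the data
    return 2 * (tp - 1) * (B * S * H / tp) * 2 * L

def cp_volume(B, S, H_kv, L, cp):
    # Per-layer All-Gather of KV states (ring-attention style); GQA: H_kv < H
    return (cp - 1) * (B * (S / cp) * 2 * H_kv) * L

def pp_volume(B, S, H, M, P):
    # P2P activation transfers only at the P-1 stage boundaries
    return M * ((B / M) * S * H) * (P - 1)

B, S, H, L = 1, 131072, 4096, 32     # batch, seq len, hidden dim, layers
print(f"TP=8: {tp_volume(B, S, H, L, 8) / 2**30:.1f} GiB")       # 56.0 GiB
print(f"CP=8: {cp_volume(B, S, H // 8, L, 8) / 2**30:.1f} GiB")  # 3.5 GiB
print(f"PP=4: {pp_volume(B, S, H, 4, 4) / 2**30:.1f} GiB")       # 1.5 GiB
```

At these sizes PP moves well over an order of magnitude less data than TP, consistent with the note above.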
### **2. The Bubble Ratio Trade-off**
@@ -122,7 +128,9 @@ We tested different models using a large PP size and found that they all conform
Therefore, if SGLang still **utilizes a fixed chunked prefill size for CPP, the pipeline bubble ratio will be greater than the theoretical expectation (i.e., $\frac{P - 1}{P - 1 + M}$)**.

To address this issue, SGLang introduces a dynamic chunking mechanism to predict the optimal size for the next chunk such that it satisfies this condition:
-<center>$$ \text{Runtime}(L + \Delta L) - \text{Runtime}(L) = \text{Runtime}(\text{Initial Chunk Size}) $$</center>
+$$
+\text{Runtime}(L + \Delta L) - \text{Runtime}(L) = \text{Runtime}(\text{Initial Chunk Size})
+$$

where $L$ denotes the Prefix Sequence Length, and $\Delta L$ denotes the Next Chunk Size. By profiling a series of requests with different ITLs, we model the cumulative runtime as a quadratic function of sequence length. Using this model, we solve the optimal next chunk size $\Delta L$ for any given prefix length $L$. Since the computation/communication complexity of the Attention mechanism scales with $L$, the next chunk size will be progressively reduced as $L$ grows to maintain an aligned chunk execution time across pipeline stages.
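With a quadratic runtime fit, the condition above has a closed-form solution for the next chunk size; a sketch under assumed (not profiled) coefficients, with `next_chunk_size` a hypothetical helper:

```python
import math

# Sketch: solve the dynamic-chunking condition in closed form. Cumulative
# prefill runtime is modeled as T(L) = a*L^2 + b*L (any constant term
# cancels in the difference). Coefficients a, b would come from profiling;
# the values below are illustrative assumptions.

def next_chunk_size(L, a, b, c0):
    # Solve T(L + dL) - T(L) = T(c0) for dL, i.e. take the positive root of
    # a*dL^2 + (2*a*L + b)*dL - T(c0) = 0.
    target = a * c0 * c0 + b * c0    # runtime of the initial chunk size c0
    p = 2 * a * L + b
    return (-p + math.sqrt(p * p + 4 * a * target)) / (2 * a)

a, b, c0 = 1e-9, 1e-4, 8192          # assumed profiled coefficients
L = 0.0
for _ in range(4):
    dL = next_chunk_size(L, a, b, c0)
    print(f"prefix {L:9.0f} -> next chunk {dL:6.0f}")
    L += dL
# The first chunk equals c0; later chunks shrink as the prefix (and hence
# per-chunk attention cost) grows, keeping chunk runtimes aligned.
```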
