blog/2023-03-30-vicuna.md (1 addition, 1 deletion)
@@ -120,7 +120,7 @@ Vicuna is created by fine-tuning a LLaMA base model using approximately 70K user
Our training recipe builds on top of [Stanford’s alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) with the following improvements.
- **Multi-turn conversations:** We adjust the training loss to account for multi-turn conversations and compute the fine-tuning loss solely on the chatbot's output.
- **Memory Optimizations:** To enable Vicuna's understanding of long context, we expand the max context length from 512 in alpaca to 2048, which substantially increases GPU memory requirements. We tackle the memory pressure by utilizing [gradient checkpointing](https://arxiv.org/abs/1604.06174) and [flash attention](https://arxiv.org/abs/2205.14135).
- - **Cost Reduction via Spot Instance:** The 40x larger dataset and 4x sequence length for training pose a considerable challenge in training expenses. We employ [SkyPilot](https://github.com/skypilot-org/skypilot) [managed spot](https://skypilot.readthedocs.io/en/latest/examples/spot-jobs.html) to reduce the cost by leveraging cheaper spot instances with auto-recovery for preemptions and automatic zone switching. This solution slashes costs for training the 7B model from $500 to around $140 and the 13B model from around $1K to $300.
+ - **Cost Reduction via Spot Instance:** The 40x larger dataset and 4x sequence length for training pose a considerable challenge in training expenses. We employ [SkyPilot](https://github.com/skypilot-org/skypilot) [managed spot](https://skypilot.readthedocs.io/en/latest/examples/spot-jobs.html) to reduce the cost by leveraging cheaper spot instances with auto-recovery for preemptions and automatic zone switching. This solution slashes costs for training the 7B model from `$500` to around `$140` and the 13B model from around `$1K` to `$300`.
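As a concrete illustration of the multi-turn loss masking described above, here is a minimal PyTorch-style sketch; the tensor names and the `-100` ignore-index convention are assumptions for illustration, not FastChat's actual implementation:

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # tokens with this label are excluded from the loss

def mask_non_assistant_tokens(input_ids, is_assistant_token):
    """Build labels so the loss is computed only on the chatbot's replies.

    input_ids:          LongTensor [seq_len] of token ids for the whole
                        multi-turn conversation (user and assistant turns).
    is_assistant_token: BoolTensor [seq_len], True where the token belongs
                        to an assistant reply.
    """
    labels = input_ids.clone()
    labels[~is_assistant_token] = IGNORE_INDEX  # ignore user/system tokens
    return labels

def fine_tuning_loss(logits, labels):
    # Standard next-token prediction: predict token t+1 from the prefix up to t.
    shift_logits = logits[:-1, :]
    shift_labels = labels[1:]
    return F.cross_entropy(shift_logits, shift_labels,
                           ignore_index=IGNORE_INDEX)
```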
blog/2023-06-29-longchat.md (1 addition, 1 deletion)
@@ -70,7 +70,7 @@ After condensing the embedding, we perform the finetuning procedure on our curat
We reuse our collected user-shared conversations previously used for training Vicuna.
We clean the data using the FastChat data pipeline and truncate these conversations so they are no longer than 16K tokens.
We fine-tune the model using the standard next-token prediction loss, training the 7B and 13B models with 80k and 18k conversations, respectively.
- To save memory, we use PyTorch FSDP and Flash Attention. Assuming an A100 costs \$3/hour on the cloud, the 7B model costs ~\$300 and the 13B model ~\$700 to train.
+ To save memory, we use PyTorch FSDP and Flash Attention. Assuming an A100 costs `$3/hour` on the cloud, the 7B model costs `~$300` and the 13B model `~$700` to train.
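The 16K truncation step might look like the following minimal sketch; the tokenizer checkpoint and whole-turn truncation policy are assumptions for illustration, and the actual FastChat pipeline differs in its details:

```python
from transformers import AutoTokenizer

MAX_LEN = 16 * 1024  # 16K-token budget per conversation

# Hypothetical checkpoint name; any LLaMA-compatible tokenizer would do.
tokenizer = AutoTokenizer.from_pretrained("lmsys/longchat-7b-16k")

def truncate_conversation(turns, max_len=MAX_LEN):
    """Keep whole turns from the start until the token budget is exhausted.

    turns: list of strings, alternating user/assistant messages.
    """
    kept, used = [], 0
    for turn in turns:
        n_tokens = len(tokenizer(turn, add_special_tokens=False).input_ids)
        if used + n_tokens > max_len:
            break
        kept.append(turn)
        used += n_tokens
    return kept
```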
## Evaluation toolkits: LongEval
Recently, commercial and open-source models have continued to tout their abilities to support expanded context length (from 8K, 32K, 84K, to 100K) in their latest releases, but how can we verify these claims?
blog/2023-11-21-lookahead-decoding.md (1 addition, 1 deletion)
@@ -115,7 +115,7 @@ For powerful GPUs (e.g., A100), lookahead decoding can better squeeze its perfor
Increasing $N$ and $W$ together achieves the most balanced performance and avoids the theoretical cap that is hit when only one of them is increased. Our experimental results show that on A100, the configurations in Table 1 work well in most cases. The 7B, 13B, and 33B models require 120x, 80x, and 56x extra FLOPs per step, respectively. However, because LLM decoding is memory-bound, these extra FLOPs add only a small per-step cost, while the visible step compression ratio yields a notable speedup.
- <p style="color:gray; text-align: center;">Table 1. Good configurations for window size $W$ and N-gram size $N$ on A100.</p>
+ <p style="color:gray; text-align: center;">Table 1. Good configurations for window size W and N-gram size N on A100.</p>
blog/2024-05-02-kaggle-competition.md (1 addition, 1 deletion)
@@ -18,4 +18,4 @@ Current LLM benchmarks often fail to capture real-world LLM usage, resulting in
### Competition Details
- The competition will run until August 5th, **with a total prize of $100,000**, featuring a $25,000 prize for 1st place, 20,000 prizes for 2nd through 4th places, and a 15,000 prize for 5th place. This is your opportunity to contribute to the advancement of human-aligned language models while gaining valuable insights into human preferences and decision-making. These insights could provide value to both the computer science and psychology communities, shedding light on the factors that shape human preferences in conversational AI.
+ The competition will run until August 5th, **with a total prize of `$100,000`**, featuring a `$25,000` prize for 1st place, `$20,000` prizes for 2nd through 4th places, and a `$15,000` prize for 5th place. This is your opportunity to contribute to the advancement of human-aligned language models while gaining valuable insights into human preferences and decision-making. These insights could provide value to both the computer science and psychology communities, shedding light on the factors that shape human preferences in conversational AI.
This form might look familiar: it is the same type of model as the Arena Score, a logistic model, just with a different, _additive_ structure. The model scores $\beta^{\rm Model}$ and prompt scores $\beta^{\rm Prompt}$ combine additively to give a notion of total strength for each model-prompt pair. The player scores $\beta^{\rm Player}$ have a similar interpretation to the standard Elo score, and we let $\beta$ denote the concatenation $(\beta^{\rm Player}, \beta^{\rm Model}, \beta^{\rm Prompt})$. For lack of a better term, we call this model “Extended Elo”.
What problem is this new model solving that the old Elo algorithm couldn’t? The answer is in the efficiency of estimation. The standard Elo algorithm could apply in our setting by simply calling every model-prompt pair a distinct “opponent” for the purposes of calculating the leaderboard. However, this approach has two issues:
@@ -47,9 +49,11 @@ There are $M\times R$ model-prompt pairs, and only $M+R$ distinct models and pro
Now, we solve this logistic regression problem _online_. That is, letting $\ell(x,y;\beta)$ be the binary cross-entropy loss, we use the iteration
This is a generalization of the Elo update. In fact, if one removes the prompt coefficient, it reduces exactly to the Elo update between players and models, as if these were 1-1 games.
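For concreteness, here is a minimal sketch of what such an online iteration could look like, assuming plain stochastic gradient descent with learning rate $\eta$ and predicted win probability $\hat{p} = \sigma(\beta^{\rm Player}_i - \beta^{\rm Model}_m - \beta^{\rm Prompt}_r)$ for a battle between player $i$ and model $m$ on prompt $r$ with outcome $y \in \{0,1\}$; the sign convention and learning rate are illustrative assumptions, not the blog's exact formulation:

$$
\beta_{t+1} = \beta_t - \eta \nabla_\beta \ell(x_t, y_t; \beta_t),
$$

which, for a single battle, touches only the three coefficients involved:

$$
\beta^{\rm Player}_i \leftarrow \beta^{\rm Player}_i + \eta (y - \hat{p}), \qquad
\beta^{\rm Model}_m \leftarrow \beta^{\rm Model}_m - \eta (y - \hat{p}), \qquad
\beta^{\rm Prompt}_r \leftarrow \beta^{\rm Prompt}_r - \eta (y - \hat{p}).
$$

With the prompt coefficient dropped, the player and model updates are exactly the familiar Elo rating exchange with $K$-factor $\eta$.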
blog/2026-01-15-chunked-pipeline.md (12 additions, 4 deletions)
@@ -36,13 +36,19 @@ The primary bottleneck in distributed inference scaling is inter-device communic
Assume $B$ stands for the Batch Size (often 1 for ultra-long-context inference), $S$ for the total Sequence Length, $H$ for the Hidden State dimension, $L$ for the total Layer Number, $M$ for the number of Micro-batches, and that the activation precision is FP8 (1 byte). Based on this, we analyze the communication volume of different parallel strategies.
* **TP:** TP splits individual weight tensors across multiple devices within a single layer. As a result, TP incurs high communication overhead because synchronization is required after both the Attention Block and the MLP Block, so the communication volume scales linearly with the number of layers. This frequent **All-Reduce** synchronization makes TP bandwidth-bound, limiting its scalability across large clusters.
- $$\text{Commu Volume}({TP}) = 2 \cdot (TP_{Size} - 1) \cdot \left( B \cdot S \cdot \frac{H}{TP_{Size}} \right) \cdot 2 \cdot L \cdot \text{bytes} \approx 4 \cdot B \cdot S \cdot H \cdot L \cdot \text{bytes}$$
+ $$
+ \text{Commu Volume}({TP}) = 2 \cdot (TP_{Size} - 1) \cdot \left( B \cdot S \cdot \frac{H}{TP_{Size}} \right) \cdot 2 \cdot L \cdot \text{bytes} \approx 4 \cdot B \cdot S \cdot H \cdot L \cdot \text{bytes}
+ $$
(Note: Each All-Reduce involves $2 \times$ the data size in a ring-based implementation. Each layer involves $2 \times$ All-Reduce operations, one after the Attention Block, and one after the MLP Block.)
* **CP:** Similarly, CP requires extensive synchronization to aggregate Key-Value (KV) states across devices. Typically, CP utilizes **All-Gather** at every layer, resulting in significant latency penalties in bandwidth-constrained environments.
- $$\text{Commu Volume}({CP}) = (CP_{Size} - 1) \cdot \left( B \cdot \frac{S}{CP_{Size}} \cdot 2 \cdot H_{KV} \right) \cdot L \cdot \text{bytes} \approx 2 \cdot B \cdot S \cdot H_{KV} \cdot L \cdot \text{bytes}$$
+ $$
+ \text{Commu Volume}({CP}) = (CP_{Size} - 1) \cdot \left( B \cdot \frac{S}{CP_{Size}} \cdot 2 \cdot H_{KV} \right) \cdot L \cdot \text{bytes} \approx 2 \cdot B \cdot S \cdot H_{KV} \cdot L \cdot \text{bytes}
+ $$
(Note: We assume CP utilizes a Ring-Attention-based solution. For models using GQA, $H_{KV}$ is smaller than $H$, which reduces CP's communication volume.)
* **PP:** In contrast, PP exhibits a significantly reduced communication footprint. Data is transferred **only at the boundaries** of pipeline stages, using **Point-to-Point (P2P)** primitives rather than collective operations. Since a stage typically contains multiple layers, the communication frequency is determined by the number of stages ($P$), not the total number of layers ($L$). Crucially, for a fixed model, the per-boundary communication volume stays constant as we increase the number of layers per stage.
- $$\text{Commu Volume}({PP}) = M \cdot \left( \frac{B}{M} \cdot S \cdot H \right) \cdot (P-1) \cdot \text{bytes} = B \cdot S \cdot H \cdot (P-1) \cdot \text{bytes}$$
+ $$
+ \text{Commu Volume}({PP}) = M \cdot \left( \frac{B}{M} \cdot S \cdot H \right) \cdot (P-1) \cdot \text{bytes} = B \cdot S \cdot H \cdot (P-1) \cdot \text{bytes}
+ $$
(Note: In multi-node deployments where $P \ll L$, PP achieves a nearly order-of-magnitude reduction in total communication volume compared to TP.)
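To make the three formulas above easy to compare, here is a small sketch that evaluates them for a hypothetical configuration; all numbers are illustrative placeholders, not measurements from the blog:

```python
def comm_volume_bytes(B, S, H, L, H_kv, M, tp_size, cp_size, pp_size):
    """Activation communication volume (bytes) per forward pass, following
    the TP / CP / PP formulas above with FP8 (1-byte) activations."""
    tp = 2 * (tp_size - 1) * (B * S * H / tp_size) * 2 * L   # All-Reduce, 2x per layer
    cp = (cp_size - 1) * (B * S / cp_size * 2 * H_kv) * L    # KV All-Gather per layer
    pp = M * (B / M * S * H) * (pp_size - 1)                 # P2P at stage boundaries
    return tp, cp, pp

# Hypothetical 7B-like model on an 8-way parallel setup, 128K-token prefill.
tp, cp, pp = comm_volume_bytes(B=1, S=128 * 1024, H=4096, L=32, H_kv=1024,
                               M=4, tp_size=8, cp_size=8, pp_size=8)
for name, volume in [("TP", tp), ("CP", cp), ("PP", pp)]:
    print(f"{name}: {volume / 1e9:.1f} GB")
```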
### **2. The Bubble Ratio Trade-off**
@@ -122,7 +128,9 @@ We tested different models using a large PP size and found that they all conform
Therefore, if SGLang still **utilizes a fixed chunked prefill size for CPP, the pipeline bubble ratio will be greater than the theoretical expectation (i.e., $\frac{P - 1}{P - 1 + M}$)**.
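As a quick illustration with hypothetical numbers: with $P = 8$ pipeline stages and $M = 16$ micro-batches, the theoretical bubble ratio would be $\frac{8 - 1}{8 - 1 + 16} = \frac{7}{23} \approx 30\%$, and a fixed chunk size would push the observed ratio above that.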
To address this issue, SGLang introduces a dynamic chunking mechanism to predict the optimal size for the next chunk such that it satisfies this condition:
where $L$ denotes the Prefix Sequence Length, and $\Delta L$ denotes the Next Chunk Size. By profiling a series of requests with different ITLs, we model the cumulative runtime as a quadratic function of sequence length. Using this model, we solve the optimal next chunk size $\Delta L$ for any given prefix length $L$. Since the computation/communication complexity of the Attention mechanism scales with $L$, the next chunk size will be progressively reduced as $L$ grows to maintain an aligned chunk execution time across pipeline stages.
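The dynamic chunking idea can be sketched as follows, assuming the profiled cumulative runtime is fit as a quadratic $T(L) = aL^2 + bL + c$ and each chunk should take roughly the same target time; the coefficient values and the target-time criterion are illustrative assumptions, not SGLang's exact implementation:

```python
import math

# Hypothetical quadratic runtime model T(L) = a*L^2 + b*L + c (milliseconds),
# obtained by profiling chunks at different prefix lengths.
a, b, c = 1.5e-7, 2.0e-3, 5.0

def next_chunk_size(L, target_ms, min_chunk=128):
    """Choose dL so the next chunk's runtime T(L + dL) - T(L) ~= target_ms.

    Expanding T(L + dL) - T(L) = target_ms gives a quadratic in dL:
        a*dL^2 + (2*a*L + b)*dL - target_ms = 0,
    which we solve with the positive root of the quadratic formula.
    """
    A, B, C = a, 2 * a * L + b, -target_ms
    dL = (-B + math.sqrt(B * B - 4 * A * C)) / (2 * A)
    return max(min_chunk, int(dL))

# As the prefix grows, the chunk shrinks to keep per-chunk time aligned.
for L in (0, 16_384, 65_536, 131_072):
    print(L, next_chunk_size(L, target_ms=50.0))
```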