Add Qwen 3.5 MoE to cuda-perf CI and add prefill throughput tracking (#18903)
- Add PyTorchObserver stats output to qwen3_5_moe runner (enables
cuda_benchmark.py parsing), --prompt_file flag, and GPU memory stats
- Add prefill_throughput metric to cuda_benchmark.py (prefill tok/s
alongside existing decode tok/s)
- Add Qwen3.5-35B-A3B-HQQ-INT4 to cuda-perf.yml with >1000 token prompt
and 512 output tokens, on linux.aws.a100
- Create a new ciflow tag for cuda-perf CI
- Remove random model selection and schedule trigger; always run all
models when triggered
---------
Co-authored-by: gasoonjia <gasoonjia@fb.com>
The Transformer architecture, introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need," fundamentally reshaped the landscape of natural language processing and, more broadly, machine learning. Prior to its introduction, sequence modeling had been dominated by recurrent neural networks such as the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU), as well as by convolutional approaches that attempted to capture local context efficiently. While these architectures achieved respectable results on a variety of tasks, they suffered from inherent limitations: recurrent computations could not be easily parallelized along the time dimension, gradients tended to vanish or explode across very long sequences, and convolutional models had difficulty capturing global dependencies without resorting to deep stacks of layers or specialized dilated kernels.
The Transformer addressed these limitations by replacing recurrence with a mechanism known as self-attention. In self-attention, every token in a sequence computes a weighted sum of representations from every other token, where the weights are determined by a compatibility function between learned queries and keys. This formulation has several appealing properties. First, the computation across positions is fully parallelizable, which dramatically accelerates training on modern accelerator hardware such as graphics processing units and tensor processing units. Second, the path length between any two tokens is constant, allowing the model to capture long-range dependencies far more directly than recurrent networks ever could. Third, the attention weights themselves carry interpretable information about which parts of the input the model is using to produce each output, offering a window into model behavior that earlier architectures rarely provided.
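The weighted sum described above can be made concrete with a minimal sketch of scaled dot-product self-attention. This is an illustrative toy in pure Python (the function names `softmax` and `self_attention` are ours, not from any library); each query scores every key, the scores become a probability distribution, and the output is the corresponding weighted sum of value vectors.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(Q, K, V):
    """Scaled dot-product attention.
    Q, K, V are lists of d-dimensional token vectors (one per position)."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Compatibility of this query with every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # attention distribution over all positions
        # Weighted sum of value vectors, coordinate by coordinate.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Note that every position's output is computed independently inside the loop over `Q`, which is exactly the property that makes the real, batched matrix-multiply formulation parallelizable across positions.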
A standard Transformer block stacks multi-head self-attention with a position-wise feed-forward network, surrounded by residual connections and layer normalization. The multi-head design allows the model to attend to information from multiple representation subspaces simultaneously: each head learns its own projection of the input into queries, keys, and values, computes its own attention pattern, and contributes a slice of the final output. Empirically, different heads often specialize in different linguistic phenomena, with some focusing on syntactic relationships such as subject-verb agreement, others on coreference, and still others on broader semantic context.
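The phrase "each head contributes a slice of the final output" can be made precise with a small bookkeeping sketch (function names are illustrative, not a real API): the model dimension is split into `h` equal slices, each head operates on its slice, and the head outputs are concatenated back into full-width token vectors. Real implementations also apply learned projections per head, which this sketch omits.

```python
def split_heads(X, h):
    """Split each d_model-dim token vector into h contiguous head slices.
    X: list of token vectors; returns a list of h per-head sequences."""
    d = len(X[0]) // h  # per-head dimension
    return [[x[i * d:(i + 1) * d] for x in X] for i in range(h)]

def concat_heads(heads):
    """Concatenate per-head outputs back into full-width token vectors."""
    return [sum((head[t] for head in heads), []) for t in range(len(heads[0]))]
```

Splitting followed by concatenation is an exact round trip, which is why multi-head attention with `h` heads of dimension `d_model / h` costs roughly the same as a single head of dimension `d_model`.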
Because self-attention is permutation invariant, the Transformer must inject information about the order of tokens explicitly. The original paper used fixed sinusoidal positional encodings added to the token embeddings, but subsequent work has explored a wide variety of alternatives. Learned positional embeddings, relative position representations, rotary positional embeddings, and attention-with-linear-biases schemes have all been proposed and adopted in different families of models. Each approach offers a different trade-off between expressiveness, generalization to longer contexts, and computational cost.
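The fixed sinusoidal scheme from the original paper is simple enough to sketch directly: each position gets a vector of sine/cosine pairs at geometrically spaced frequencies, following PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). The function name below is ours.

```python
import math

def sinusoidal_pe(num_positions, d_model):
    """Fixed sinusoidal positional encodings (d_model assumed even here)."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model // 2):
            # Frequency falls geometrically with the dimension index i.
            angle = pos / (10000 ** (2 * i / d_model))
            row.extend([math.sin(angle), math.cos(angle)])
        pe.append(row)
    return pe
```

These vectors are simply added to the token embeddings before the first layer; because the encoding is a fixed function of position, it requires no learned parameters and extends to any sequence length.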
The original Transformer was an encoder-decoder model designed for machine translation, but the architecture quickly proved general enough to power a remarkable range of subsequent systems. Encoder-only models such as BERT framed pretraining as a masked language modeling task and dramatically improved performance on classification, question answering, and information retrieval. Decoder-only models such as the GPT family treated language modeling as next-token prediction at very large scale, and demonstrated that with sufficient data and compute, a single architecture could exhibit emergent capabilities such as few-shot learning, code generation, and chain-of-thought reasoning. Encoder-decoder variants such as T5 and BART unified many tasks under a common text-to-text framing, simplifying the engineering of multitask systems.
Scaling laws, established through systematic empirical study, have shown that Transformer performance improves predictably as a function of model parameters, dataset size, and compute budget. This insight motivated a wave of increasingly large models, culminating in dense networks with hundreds of billions of parameters and sparsely activated mixture-of-experts networks with trillions. Mixture-of-experts approaches in particular have become attractive because they decouple total parameter count from the per-token computation cost: a router network selects a small subset of experts for each token, allowing the model to grow in capacity without a proportional increase in inference latency. This trade-off makes mixture-of-experts especially appealing for deployment scenarios where memory bandwidth dominates compute as the bottleneck.
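The routing idea behind mixture-of-experts can be sketched in a few lines. This is a simplified top-k router (the function name is ours; production routers also add load-balancing losses and capacity limits): it keeps only the k highest-scoring experts per token and renormalizes their gates, so per-token compute depends on k, not on the total expert count.

```python
import math

def route_top_k(logits, k=2):
    """Select the k experts with the highest router logits for one token.
    Returns (expert_index, gate_weight) pairs; gates sum to 1."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the selected experts.
    es = [math.exp(logits[i]) for i in top]
    s = sum(es)
    return [(i, e / s) for i, e in zip(top, es)]
```

Adding more experts grows the list of `logits` (and total parameters) but leaves k, and therefore the per-token expert computation, unchanged — the decoupling the paragraph above describes.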
Beyond text, the Transformer has been adapted to images, audio, video, proteins, source code, and combinations thereof. Vision Transformers split images into patches and treat each patch as a token; speech models tokenize raw audio or spectrogram features; multimodal systems share a common Transformer backbone across modalities by aligning their embedding spaces. The flexibility of the architecture, combined with the practical advantages of large-scale pretraining followed by task-specific fine-tuning or in-context prompting, has made the Transformer the de facto foundation of contemporary deep learning research and applications.
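The patch-as-token idea used by Vision Transformers amounts to a reshape. A minimal sketch (function name ours; real models project each flattened patch through a learned linear layer afterward):

```python
def patchify(image, patch):
    """Split an HxW image (list of pixel rows) into flattened patch tokens,
    scanning patches left-to-right, top-to-bottom. Assumes patch divides H and W."""
    H, W = len(image), len(image[0])
    tokens = []
    for r in range(0, H, patch):
        for c in range(0, W, patch):
            tokens.append([image[r + i][c + j]
                           for i in range(patch) for j in range(patch)])
    return tokens
```

After this step the image is just a sequence of vectors, and the standard Transformer machinery — attention, positional information, classification heads — applies unchanged.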
Of course, the Transformer is not without its drawbacks. The quadratic memory and compute cost of standard self-attention with respect to sequence length remains a significant practical limitation. A vibrant subfield of research focuses on efficient attention variants, including sparse, low-rank, and kernelized formulations, as well as state-space models that recover linear complexity while preserving competitive accuracy. Long-context inference also stresses the key-value cache, motivating techniques such as paged attention, grouped-query attention, sliding-window attention, and aggressive quantization. Quantization in particular has become a critical tool for deploying large models on commodity hardware, with low-bit integer formats and weight-only quantization schemes enabling inference on consumer GPUs and edge devices that would otherwise be unable to host a model.
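Weight-only quantization, mentioned above as a deployment workhorse, reduces in its simplest symmetric per-tensor form to choosing one scale factor and rounding. A toy sketch (function names ours; real schemes use per-channel or per-group scales and low-bit packed storage):

```python
def quantize_int8(weights):
    """Symmetric per-tensor weight-only quantization to int8 range [-127, 127]."""
    amax = max(abs(w) for w in weights)
    scale = amax / 127 if amax > 0 else 1.0
    q = [round(w / scale) for w in weights]  # integer codes
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights at matmul time."""
    return [qi * scale for qi in q]
```

Storage drops from 4 (or 2) bytes per weight to 1, which is precisely why this helps in the memory-bandwidth-bound decode regime: fewer bytes move per token, at the cost of a rounding error bounded by half the scale.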
Tooling has co-evolved with the architecture. Compilers and runtimes such as ExecuTorch, TensorRT, vLLM, and various ONNX-based stacks specialize in lowering Transformer graphs onto target accelerators while applying optimizations such as kernel fusion, operator scheduling, and memory planning. These systems make it feasible to take a research model trained in a high-level framework and deploy it efficiently on production hardware ranging from data center GPUs to mobile system-on-chips. The end-to-end pipeline of training, fine-tuning, quantization, export, and runtime execution has become a recognizable engineering discipline in its own right.
Training a frontier-scale Transformer is itself a substantial systems undertaking. Modern pretraining runs combine data parallelism, tensor parallelism, pipeline parallelism, and sequence parallelism, often coordinated through libraries such as PyTorch FSDP, Megatron, and DeepSpeed ZeRO. Practitioners must balance compute and memory carefully, choosing micro-batch sizes that maximize accelerator utilization without exceeding device memory, designing checkpointing schemes that survive node failures over runs that can last for months, and overlapping communication with computation to hide network latency. Activation checkpointing trades extra computation for reduced memory pressure, while mixed precision training with bfloat16 or FP8 formats shrinks memory bandwidth requirements and unlocks newer hardware features.
Inference brings its own set of challenges. The autoregressive nature of decoder-only Transformers means each generated token requires a full forward pass, and the dominant cost shifts from raw matrix multiplication during prefill to memory-bandwidth-bound key-value cache reads during decode. Techniques such as speculative decoding, continuous batching, and prefix caching attempt to claw back utilization. For latency-sensitive deployments, careful kernel fusion, paged attention, and ahead-of-time compilation can reduce per-token overhead substantially, and the rise of small distilled or sparsely activated models offers an alternative path to acceptable quality at a fraction of the cost.
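The key-value cache at the heart of the decode phase can be sketched as follows (class and method names ours, and this toy keeps a single head with unprojected vectors): each decode step appends exactly one new key/value pair and attends over everything cached so far, so earlier tokens are never re-encoded — but every step must read the entire cache, which is the memory-bandwidth bottleneck described above.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

class KVCache:
    """Grows by one entry per decoded token; each step attends over the cache."""
    def __init__(self):
        self.keys, self.values = [], []

    def decode_step(self, q, k, v):
        # Cache this token's key/value once; never recomputed later.
        self.keys.append(k)
        self.values.append(v)
        d = len(q)
        # One query against ALL cached keys: reads scale with sequence length.
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
                  for key in self.keys]
        w = softmax(scores)
        return [sum(wi * val[j] for wi, val in zip(w, self.values))
                for j in range(len(v))]
```

Techniques like grouped-query attention shrink the cache (fewer key/value heads) and paged attention manages its memory in blocks, but the per-step "read the whole cache" access pattern is what they are all optimizing.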
Looking ahead, the Transformer's dominance is being challenged by alternative architectures such as state-space models, linear recurrent networks, and hybrid designs that interleave attention with other mixing primitives. Whether any of these will displace the Transformer entirely remains to be seen, but it is already clear that the ideas the Transformer popularized — content-based mixing of tokens, parallelizable training, and large-scale pretraining followed by adaptation — will continue to shape the field for years to come.