Training a frontier-scale Transformer is itself a substantial systems undertaking. Modern pretraining runs combine data parallelism, tensor parallelism, pipeline parallelism, and sequence parallelism, often coordinated through libraries such as PyTorch FSDP, Megatron, and DeepSpeed ZeRO. Practitioners must balance compute and memory carefully, choosing micro-batch sizes that maximize accelerator utilization without exceeding device memory, designing checkpointing schemes that survive node failures over runs that can last for months, and overlapping communication with computation to hide network latency. Activation checkpointing trades extra computation for reduced memory pressure, while mixed-precision training with bfloat16 or FP8 formats shrinks memory bandwidth requirements and unlocks newer hardware features.
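The micro-batch/memory tradeoff above can be sketched with back-of-envelope arithmetic. The following is an illustrative, simplified model (not code from FSDP, Megatron, or DeepSpeed): it counts only one activation tensor per layer, uses made-up example dimensions, and assumes full checkpointing keeps a single layer's activations live at a time while the rest are recomputed during the backward pass.

```python
def activation_bytes(layers, seq_len, hidden, micro_batch,
                     bytes_per_elem=2, checkpoint=False):
    """Rough activation footprint of a Transformer stack.

    Without checkpointing, every layer's activations stay resident for
    the backward pass. With full activation checkpointing, roughly one
    layer's activations are live at a time; the others are recomputed,
    trading extra FLOPs for memory. Real layers store several tensors
    (attention scores, MLP intermediates), so this undercounts.
    """
    per_layer = seq_len * hidden * micro_batch * bytes_per_elem
    live_layers = 1 if checkpoint else layers
    return per_layer * live_layers


def largest_micro_batch(budget_bytes, layers, seq_len, hidden, checkpoint):
    """Largest micro-batch whose activations fit in the given budget."""
    mb = 1
    while activation_bytes(layers, seq_len, hidden, mb + 1,
                           checkpoint=checkpoint) <= budget_bytes:
        mb += 1
    return mb


if __name__ == "__main__":
    budget = 8 * 1024**3              # pretend 8 GiB remains for activations
    layers, seq_len, hidden = 32, 4096, 4096   # hypothetical model shape
    plain = largest_micro_batch(budget, layers, seq_len, hidden, False)
    ckpt = largest_micro_batch(budget, layers, seq_len, hidden, True)
    print(plain, ckpt)
```

Under these toy numbers, checkpointing raises the feasible micro-batch from 8 to 256, which is why the technique is standard at frontier scale despite the roughly one-third extra forward compute it incurs.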