# Attention Sparsity for HuggingFace Models

In this tutorial, we demonstrate how to use NVIDIA Model Optimizer to apply attention sparsity to HuggingFace models. Attention sparsity reduces computational cost by skipping near-zero attention scores during the softmax computation.
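
To make the idea concrete, here is a toy sketch of skipping near-zero scores. It is illustrative only, not the library's actual flash-attention kernel: entries whose softmax probability falls below a threshold are dropped and the rest renormalized.

```python
# Toy illustration of the skip-softmax idea, not Model Optimizer's kernel.
import torch

scores = torch.tensor([[8.0, 7.5, 1.0, -3.0]])  # raw attention logits
threshold = 1e-3

probs = torch.softmax(scores, dim=-1)
mask = probs >= threshold  # keep only non-negligible entries
sparse_probs = torch.where(mask, probs, torch.zeros_like(probs))
sparse_probs = sparse_probs / sparse_probs.sum(dim=-1, keepdim=True)  # renormalize

print(probs)         # dense softmax
print(sparse_probs)  # last two entries skipped as effectively zero
```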

## Getting Started

### Quick Example

```python
import torch
from transformers import AutoModelForCausalLM

import modelopt.torch.sparsity.attention_sparsity as mtsa
from modelopt.torch.sparsity.attention_sparsity.config import SKIP_SOFTMAX_DEFAULT

# Load your model
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    attn_implementation="eager",  # Required for sparse attention
    torch_dtype=torch.bfloat16,
)

# Apply sparse attention
model = mtsa.sparsify(model, config=SKIP_SOFTMAX_DEFAULT)
```

> [!NOTE]
> `attn_implementation="eager"` is required for sparse attention to work properly. Flash Attention 2 or SDPA would bypass the softmax patching needed for stats collection.

## Configuration Options

Two pre-defined configurations are available:

### 1. Fixed Threshold (SKIP_SOFTMAX_DEFAULT)

Uses a fixed threshold value. Simple, but may not be optimal for all sequence lengths.

```python
from modelopt.torch.sparsity.attention_sparsity.config import SKIP_SOFTMAX_DEFAULT

model = mtsa.sparsify(model, config=SKIP_SOFTMAX_DEFAULT)
```

### 2. Calibrated Threshold (SKIP_SOFTMAX_CALIB)

Uses RULER-based calibration to determine an optimal dynamic threshold that adapts to sequence length. Recommended for production use.

```python
from modelopt.torch.sparsity.attention_sparsity.config import SKIP_SOFTMAX_CALIB

model = mtsa.sparsify(model, config=SKIP_SOFTMAX_CALIB)
```

## Prerequisites

### Local Installation

For Hugging Face models, install Model Optimizer with the `hf` dependencies from [PyPI](https://pypi.org/project/nvidia-modelopt/), along with the example's requirements:

```bash
pip install nvidia-modelopt[hf]
```

### Download RULER Calibration Data (Required for Calibration)

If using `SKIP_SOFTMAX_CALIB`, you need to download the RULER calibration dataset first:

```bash
bash ./download_ruler_data.sh
```

This downloads the Paul Graham essays dataset used for generating calibration samples.

## Run Sparse Attention on HuggingFace Models

### Basic Usage (Without Calibration)

Apply sparse attention with a fixed threshold:

```bash
python hf_sa.py \
    --pyt_ckpt_path Qwen/Qwen3-8B \
    --sparse_attn skip_softmax
```

### With RULER Calibration

Apply sparse attention with calibrated thresholds for optimal sparsity:

```bash
python hf_sa.py \
    --pyt_ckpt_path Qwen/Qwen3-8B \
    --sparse_attn skip_softmax_calib
```

The calibration process:

1. Generates RULER calibration samples
2. Collects attention statistics during forward passes
3. Determines the optimal threshold scale factor for the target sparsity ratio (see the sketch below)
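
Conceptually, the last step picks the candidate threshold whose measured sparsity comes closest to the target. Here is a toy sketch of that selection logic with made-up numbers; the function name and data layout are assumptions, not Model Optimizer's internals:

```python
# Illustrative only: picking a threshold from measured sparsity ratios.
# `measured` maps candidate threshold -> achieved sparsity (assumed layout).
def pick_threshold(measured: dict[float, float], target_ratio: float) -> float:
    # Choose the candidate whose achieved sparsity is closest to the target.
    return min(measured, key=lambda t: abs(measured[t] - target_ratio))

stats = {1e-4: 0.21, 1e-3: 0.43, 5e-3: 0.52, 1e-2: 0.61}  # toy numbers
print(pick_threshold(stats, target_ratio=0.5))  # -> 0.005
```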

### Command Line Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `--pyt_ckpt_path` | Required | HuggingFace model path or name |
| `--sparse_attn` | `skip_softmax` | Configuration: `skip_softmax` or `skip_softmax_calib` |
| `--backend` | `pytorch` | Backend (currently `pytorch` is the only supported option) |
| `--seq_len` | `2048` | Maximum sequence length for input prompts |
| `--export_dir` | `None` | Directory to export the sparsified model |

## Output Comparison

The script automatically compares outputs before and after applying sparse attention (a simplified sketch of the same flow follows the list):

1. Loads a test sample from the NarrativeQA dataset
2. Generates text before sparse attention is applied
3. Applies sparse attention (with optional calibration)
4. Generates text after sparse attention is applied
5. Compares and displays both outputs
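
A minimal sketch of that flow using standard HuggingFace APIs; the prompt, tokenizer usage, and generation settings here are illustrative assumptions, not the script's exact behavior:

```python
# Illustrative before/after comparison; assumes `model` is freshly loaded
# with attn_implementation="eager" as in the Quick Example above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
inputs = tokenizer("Summarize the story in one sentence:", return_tensors="pt").to(model.device)

baseline_ids = model.generate(**inputs, max_new_tokens=64)  # before sparsify
model = mtsa.sparsify(model, config=SKIP_SOFTMAX_DEFAULT)
sparse_ids = model.generate(**inputs, max_new_tokens=64)    # after sparsify

print("before:", tokenizer.decode(baseline_ids[0], skip_special_tokens=True))
print("after: ", tokenizer.decode(sparse_ids[0], skip_special_tokens=True))
```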

## Export Model

Export the sparsified model to a HuggingFace checkpoint:

```bash
python hf_sa.py \
    --pyt_ckpt_path Qwen/Qwen3-8B \
    --sparse_attn skip_softmax_calib \
    --export_dir ./exported_sparse_model
```

The exported model can be loaded and used with standard HuggingFace APIs.
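
For example, assuming the export produces a standard checkpoint layout under `./exported_sparse_model`:

```python
# Load the exported checkpoint with standard HuggingFace APIs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "./exported_sparse_model", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("./exported_sparse_model")
```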

## Custom Configuration

You can create custom sparse attention configurations:

```python
custom_config = {
    "sparse_cfg": {
        "calibration": {  # Optional: omit for fixed threshold
            "target_sparse_ratio": {"prefill": 0.5, "decode": 0.5},  # Target 50% sparsity
            "samples": 128,  # Number of calibration samples
            "max_seqlen": 8192,  # Maximum sequence length
            # Optional: customize threshold trials for calibration
            "threshold_trials": [1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 2e-2, 5e-2, 1e-1, 2e-1, 3e-1, 5e-1, 7e-1],
        },
        "*attn*": {  # Pattern to match attention modules
            "method": "flash_skip_softmax",
            "threshold": {"prefill": 1e-3, "decode": 1e-4},  # Phase-specific thresholds (ignored if calibration is used)
            "br": 128,  # Flash Attention block rows
            "bc": 128,  # Flash Attention block columns
            "backend": "pytorch",
            "collect_stats": True,
            "enable": True,
        },
        "default": {"enable": False},
    },
}

model = mtsa.sparsify(model, config=custom_config)
```

## References

- [Model Optimizer Documentation](https://nvidia.github.io/Model-Optimizer/)
- [RULER: What's the Real Context Size of Your Long-Context Language Models?](https://github.com/NVIDIA/RULER)