
Commit b98a595

Add vLLM-based runtime statistics for subblock latency measurement (#1358)
## Summary by CodeRabbit

* **New Features**
  * Runtime-based latency optimization: collect vLLM-measured inference latency to constrain optimization.
* **Configuration**
  * New runtime config/template for Llama-3.1-8B pruning (runtime stats enabled, NCCL timeout templating, MIP target-latency).
  * Validation sample defaults adjusted (one flow: 128 → 8; runtime flow uses 128).
  * Human constraint key renamed to target_latency_seconds.
* **Documentation**
  * README section describing runtime-based latency optimization setup and usage.
* **Tests**
  * Added GPU end-to-end test for runtime stats collection.

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
1 parent 01415c2 commit b98a595

18 files changed

Lines changed: 1036 additions & 155 deletions


docs/source/conf.py

Lines changed: 1 addition & 1 deletion
@@ -124,7 +124,7 @@
 
 
 # Mock imports for autodoc
-autodoc_mock_imports = ["mpi4py", "tensorrt_llm", "triton"]
+autodoc_mock_imports = ["mpi4py", "tensorrt_llm", "triton", "vllm"]
 
 autosummary_generate = True
 autosummary_imported_members = False

examples/puzzletron/README.md

Lines changed: 40 additions & 0 deletions
@@ -343,6 +343,46 @@ See [Megatron-Bridge distillation](../megatron_bridge/README.md#distillation) fo
 
 For distillation results on Puzzletron-compressed models, see [examples/pruning/puzzletron/](../pruning/puzzletron/README.md).
 
+## Runtime-Based Latency Optimization
+
+You can enable **runtime stats** to measure actual inference latency via vLLM, which unlocks latency-based MIP constraints.
+
+A ready-to-run example config is included at [`configs/llama-3_1-8B_pruneffn_runtime/`](./configs/llama-3_1-8B_pruneffn_runtime/llama-3_1-8B_pruneffn_runtime.yaml). The following key fields enable and control execution of the runtime statistics in the `llama-3_1-8B_pruneffn_runtime.yaml` config file:
+
+```yaml
+calc_subblock_stats:
+  runtime_stats:
+    enabled: true
+    num_warmup_iters: 2
+    num_iters: 10
+```
+
+The runtime constraint is specified in the `human_constraints` section of the config `Llama-3_1-8B.yaml`:
+
+```yaml
+human_constraints:
+  target_latency_seconds: 21
+```
+
+Run the pipeline against this config the same way as the memory-constrained variant:
+
+```bash
+torchrun --nproc_per_node 2 examples/puzzletron/main.py \
+    --config examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/llama-3_1-8B_pruneffn_runtime.yaml 2>&1 | tee ./log.txt | grep "Puzzletron Progress"
+```
+
+The MIP solver will now search for a heterogeneous architecture whose measured end-to-end latency is at or below `target_latency_seconds`, instead of optimizing for a memory budget.
+
+Because vLLM startup adds substantial overhead during stats collection, extend the distributed process group timeout accordingly (already included in the example config):
+
+```yaml
+nccl_timeout_minutes: 90 # default is 10 if omitted
+```
+
+This field is supported in any Puzzletron YAML config and overrides the default 10-minute distributed timeout.
+
+Due to non-linear extension of the runtime stats of single subblocks to the total runtime of the model, the `target_latency_seconds` value should be set to a value that is slightly lower than the desired latency. For example, in our experiments, the `target_latency_seconds` value of 5 resulted in a final model latency of 5.4 seconds.
+
 ## Advanced Usage
 
 Modify `llama-3_1-8B_pruneffn_memory.yaml` file for advanced compression scenarios.
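
The README's closing note (a target of 5 s yielding a measured 5.4 s) suggests picking `target_latency_seconds` with some headroom. The sketch below is one way to do that arithmetic. It assumes the overshoot ratio from that single reported run carries over to other targets, which the README does not claim; `adjusted_target_latency` is a hypothetical helper, not part of Puzzletron.

```python
def adjusted_target_latency(
    desired_seconds: float,
    reference_target_seconds: float = 5.0,    # target used in the README's experiment
    reference_measured_seconds: float = 5.4,  # final latency reported for that target
) -> float:
    """Shrink the desired latency by the overshoot ratio seen in a reference run.

    Assumption: the ratio measured/target stays roughly constant across targets.
    """
    if min(desired_seconds, reference_target_seconds, reference_measured_seconds) <= 0:
        raise ValueError("latencies must be positive")
    overshoot = reference_measured_seconds / reference_target_seconds  # 1.08 here
    return desired_seconds / overshoot


if __name__ == "__main__":
    # Value to try for target_latency_seconds when the final model should land near 10 s.
    print(round(adjusted_target_latency(10.0), 2))
```

Treat the result as a starting point and re-measure the realized model, since the relationship between summed subblock runtimes and end-to-end latency is described as non-linear.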

examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/Llama-3_1-8B.yaml

Lines changed: 1 addition & 1 deletion
@@ -42,7 +42,7 @@ scoring:
   teacher_dir: ${to_path:${teacher_dir}}
   output_dir: ${puzzle_dir}/single_sequence_replacement_solutions--validation
 
-  eval_samples: 128
+  eval_samples: 8
   micro_batch_size: 1
   seed: 42
   shuffle_seed: 444

examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/validate_model_defaults.yaml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ autocast_dtype: torch.bfloat16 # dtype for torch.autocast for validate_model
 block_size: 8192
 bos_rate: 0.5
 data_column: messages
-val_dataset_name: valid
+val_dataset_name: validation
 shuffle_seed: 81436
 seed: 42
 fim_rate: 0
Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
+defaults:
+  - ../llama-3_1-8B_pruneffn_memory/pruning/ffn_pruning@pruning
+  - ../llama-3_1-8B_pruneffn_memory/validate_solutions_defaults@scoring
+  - ../llama-3_1-8B_pruneffn_memory/validate_solutions_defaults@realize_model
+  - bypass:
+  - override hydra/hydra_logging: disabled
+  - _self_
+
+puzzle_dir: ???
+descriptor: llama
+teacher_dir: ${puzzle_dir}/ckpts/teacher/
+replacement_library_path: ${puzzle_dir}/replacement_library.json
+dataset_path: ??? # ppath to Nemotron-Post-Training-Dataset-v2
+
+skip_realize_model: false
+
+build_replacement_library:
+  add_ffn_no_ops: true
+  add_attention_no_ops: true
+
+calc_subblock_stats:
+  batch_sizes: [1, 4]
+  prefill_seq_len: 1024
+  generation_seq_len: 1024
+  num_active_tokens_override: # Optional override for sequence lengths
+  prefill_queue_size: 0
+  allocate_prefill_query: false
+  merge_with_existing_stats: false
+  subblock_stats_filename: "subblock_stats.json"
+  moe_stats_filename: "moe_stats.json"
+
+scoring:
+  descriptor: ${descriptor}
+  solutions_to_validate:
+  skip_existing_solutions: true
+
+  replacement_library_path: ${replacement_library_path}
+  solutions_path: ${to_path:${puzzle_dir}/single_sequence_replacement_solutions.json}
+  teacher_dir: ${to_path:${teacher_dir}}
+  output_dir: ${puzzle_dir}/single_sequence_replacement_solutions--validation
+
+  eval_samples: 128
+  micro_batch_size: 1
+  seed: 42
+  shuffle_seed: 444
+  dataset_path: ${dataset_path}
+
+mip:
+  single_block_replacement_validation_dir: ${to_path:${scoring.output_dir}}
+  subblock_stats_path: ${to_path:${puzzle_dir}/${calc_subblock_stats.subblock_stats_filename}}
+  output_path: ${to_path:${puzzle_dir}/mip/puzzle_solutions}
+  gathered_metrics_path:
+  puzzle_profile:
+
+  # puzzle_profile:
+  objective: metrics.cosine_embedding_loss_hidden_states
+  bigger_is_better: false
+
+  subblock_stats_args:
+    - batch_size: 1
+      weights_dtype: torch.bfloat16
+
+  report_additional_costs:
+    - stats.memory_mib
+    - stats.num_params
+    - stats.num_kv_heads
+    - stats.has_attention
+    - stats.has_ffn
+    - stats.kv_cache_memory_mib
+    - stats.attention_memory_mib
+    - stats.ffn_memory_mib
+    - stats.ffn_num_params
+    - stats.attention_num_params
+
+  human_constraints:
+    target_latency_seconds: 5
+
+  mip_constraints:
+    metric_overrides:
+    max_seconds_per_solution: 60
+
+realize_model:
+  descriptor: ${descriptor}
+  teacher_dir: ${to_path:${teacher_dir}}
+  tokenizer_name: ${to_path:${teacher_dir}}
+  replacement_library_path: ${replacement_library_path}
+  save_models: true
+  solutions_path: # Filled dynamically
+
+  # Validate params
+  skip_validation: false # To enable validation of the model solution set `skip_validation` as False
+  eval_samples: 128
+  micro_batch_size: 1
+  seed: 42
+  shuffle_seed: 444
+  dataset_path: ${dataset_path}
+
+nccl_timeout_minutes: ${timedelta_minutes:120}
+
+# This section redirects Hydra outputs
+hydra:
+  run:
+    dir: ${puzzle_dir}/hydra_logs/${now:%Y-%m-%d}/${now:%H-%M-%S}
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+defaults:
+  - Llama-3_1-8B
+  - _self_
+
+# Input Hugging Face model to compress
+input_hf_model_path: /workspace/hf_models/meta-llama/Llama-3.1-8B-Instruct
+
+# Dataset path for pruning and NAS scoring
+dataset_path: /workspace/datasets/Nemotron-Post-Training-Dataset-v2
+
+# Working directory for puzzletron outputs
+puzzle_dir: /workspace/puzzle_dir
+
+calc_subblock_stats:
+  runtime_stats:
+    enabled: true
+    num_warmup_iters: 2
+    num_iters: 10
+
+# FFN intermediate sizes to search over (heterogeneous architecture)
+pruning:
+  intermediate_size_list: [3072, 5888, 8704, 11520] # teacher_intermediate_size is 14336
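
The `num_warmup_iters` and `num_iters` fields above follow a common benchmarking pattern: run a few untimed iterations first, then average the timed ones. The sketch below illustrates that pattern only. It is not the vLLM-based collector added by this commit, whose code is not part of this excerpt; `measure_latency_seconds` and `run_once` are hypothetical names, and using the mean (rather than, say, the median) is an assumption.

```python
import time
from statistics import mean
from typing import Callable


def measure_latency_seconds(
    run_once: Callable[[], object],
    num_warmup_iters: int = 2,
    num_iters: int = 10,
) -> float:
    """Return the mean wall-clock time of `run_once` over `num_iters` timed calls."""
    if num_iters <= 0:
        raise ValueError("num_iters must be positive")

    # Warmup calls absorb one-time costs (caches, lazy initialization); not timed.
    for _ in range(num_warmup_iters):
        run_once()

    timings = []
    for _ in range(num_iters):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return mean(timings)


if __name__ == "__main__":
    # Stand-in workload; a real run would invoke the model under test instead.
    print(measure_latency_seconds(lambda: sum(range(100_000))))
```

On a GPU the timed callable would also need to synchronize the device before returning, otherwise the loop measures only kernel launch time.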

examples/puzzletron/main.py

Lines changed: 8 additions & 1 deletion
@@ -68,7 +68,6 @@ def run_full_puzzletron(hydra_config_path: str):
         config_path: Path to the YAML configuration file
     """
     mtpz.tools.mprint("Puzzletron Progress 1/8: starting puzzletron pipeline")
-    dist.setup(timeout=timedelta(minutes=10))
 
     # Register Hydra custom resolvers (needed for config resolution)
     mtpz.tools.register_hydra_resolvers()
@@ -84,6 +83,14 @@ def run_full_puzzletron(hydra_config_path: str):
         overrides=[],
     )
 
+    # Default timeout: 10 minutes, or extended to nccl_timeout_minutes if set in config
+    if hasattr(hydra_cfg, "nccl_timeout_minutes"):
+        timeout_minutes = hydra_cfg.nccl_timeout_minutes
+    else:
+        timeout_minutes = timedelta(minutes=10)
+
+    dist.setup(timeout=timeout_minutes)
+
     # Convert model (convert from HF to DeciLM, score pruning activations,
     # prune the model and save pruned checkpoints)
     input_model = mtpz.puzzletron_nas_plugin.PuzzletronModel()
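
The hunk above passes the config value straight to `dist.setup`, so the value must already be a `timedelta` when it arrives. The example config writes it as `${timedelta_minutes:120}`, while the README snippet shows a bare `90`. The sketch below shows one defensive way to accept both forms. `resolve_dist_timeout` is a hypothetical helper, not code from this commit, and it uses a plain namespace in place of the Hydra config object.

```python
from datetime import timedelta
from types import SimpleNamespace

DEFAULT_TIMEOUT = timedelta(minutes=10)  # matches the default stated in the diff comment


def resolve_dist_timeout(cfg, default: timedelta = DEFAULT_TIMEOUT) -> timedelta:
    """Return the distributed timeout as a timedelta.

    Accepts a missing field, a ready-made timedelta, or a number of minutes.
    """
    value = getattr(cfg, "nccl_timeout_minutes", None)
    if value is None:
        return default
    if isinstance(value, timedelta):
        return value
    return timedelta(minutes=float(value))


if __name__ == "__main__":
    print(resolve_dist_timeout(SimpleNamespace()))                         # field omitted
    print(resolve_dist_timeout(SimpleNamespace(nccl_timeout_minutes=90)))  # bare number
```

Normalizing in one place keeps the call site identical regardless of how the YAML expresses the value.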

modelopt/torch/kernels/sparsity/attention/calibrate.py

Lines changed: 7 additions & 6 deletions
@@ -200,17 +200,18 @@ def attention_calibrate(
     measuring how many KV tiles would be skipped at each threshold in
     ``threshold_trials``. No autograd — forward only.
 
+    All arguments except ``threshold_trials`` match
+    :func:`modelopt.torch.kernels.common.attention.attention`.
+
     Args:
-        q, k, v, b_start_loc, b_seq_len, max_input_len, is_causal,
-        softmax_scale, b_start_loc_k, b_seq_len_k, max_input_len_k:
-            Same as :func:`modelopt.torch.kernels.common.attention.attention`.
         threshold_trials: List of threshold values to measure sparsity for.
             Each value is converted to log2-scaled space for the kernel.
 
     Returns:
-        Tuple of (output, sparsity_counters):
-        - output: ``[total_q_tokens, num_q_heads, head_dim]``
-        - sparsity_counters: ``[num_thresholds, 2]`` int64 tensor where
+        Tuple of ``(output, sparsity_counters)``:
+
+        - ``output``: ``[total_q_tokens, num_q_heads, head_dim]``
+        - ``sparsity_counters``: ``[num_thresholds, 2]`` int64 tensor where
           ``[:, 0]`` = total tile evaluations, ``[:, 1]`` = skipped tiles.
           Sparsity per threshold = ``counters[:, 1] / counters[:, 0]``.
     """

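The docstring above defines sparsity per threshold as `counters[:, 1] / counters[:, 0]`. The sketch below applies that formula to plain Python lists so it runs without a GPU or PyTorch; the real function returns an int64 tensor. `sparsity_per_threshold` is a hypothetical helper, and returning 0.0 when no tiles were evaluated is an assumption made here to avoid division by zero.

```python
from typing import List, Sequence


def sparsity_per_threshold(counters: Sequence[Sequence[int]]) -> List[float]:
    """Convert rows of [total_tile_evaluations, skipped_tiles] into skip ratios."""
    ratios = []
    for total, skipped in counters:
        # Assumption: report 0.0 sparsity when no tiles were evaluated.
        ratios.append(skipped / total if total else 0.0)
    return ratios


if __name__ == "__main__":
    # Illustrative counters only, one row per entry in threshold_trials.
    print(sparsity_per_threshold([[100, 25], [100, 60]]))
```

With a tensor result the same ratio is a single elementwise division, but the zero-total case still needs explicit handling.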
modelopt/torch/puzzletron/mip/run_puzzle.py

Lines changed: 3 additions & 3 deletions
@@ -79,7 +79,7 @@ class Type(enum.Enum):
 _ALLOWED_HUMAN_CONSTRAINTS = {
     "target_memory",
     "target_throughput",
-    "target_latency",
+    "target_latency_seconds",
     "target_time_to_first_token",
     "num_params",
     "stats.has_attention",
@@ -175,8 +175,8 @@ def to_mip_constraints(self, subblock_stats_args) -> dict[str, Any]:
             throughput_constraints.append(
                 batch_size * generation_seq_len / self.constraints["target_throughput"]
             )
-        if "target_latency" in self.constraints:
-            throughput_constraints.append(self.constraints["target_latency"])
+        if "target_latency_seconds" in self.constraints:
+            throughput_constraints.append(self.constraints["target_latency_seconds"])
         if throughput_constraints:
             mip_constraints["stats.runtime_ms"] = 1000 * min(throughput_constraints)
 
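
The second hunk shows how a throughput target and a latency target collapse into one `stats.runtime_ms` bound: each becomes a duration in seconds, the tighter one wins, and the result is scaled to milliseconds. The standalone sketch below reproduces only the lines visible in the diff. `runtime_budget_ms` is a hypothetical name, and any handling in the rest of `to_mip_constraints` (for example `target_time_to_first_token`) is not shown in this commit excerpt and is omitted.

```python
from typing import Mapping, Optional


def runtime_budget_ms(
    constraints: Mapping[str, float],
    batch_size: int,
    generation_seq_len: int,
) -> Optional[float]:
    """Return the runtime bound in milliseconds, or None if no target is set."""
    candidates_seconds = []
    if "target_throughput" in constraints:
        # tokens generated per batch / tokens per second = seconds allowed
        candidates_seconds.append(
            batch_size * generation_seq_len / constraints["target_throughput"]
        )
    if "target_latency_seconds" in constraints:
        candidates_seconds.append(constraints["target_latency_seconds"])
    if not candidates_seconds:
        return None
    # The strictest (smallest) duration becomes the constraint.
    return 1000 * min(candidates_seconds)


if __name__ == "__main__":
    # With the runtime config's batch_size=1 and generation_seq_len=1024.
    print(runtime_budget_ms({"target_latency_seconds": 5}, 1, 1024))
```

Because the key is now `target_latency_seconds`, a config that still uses the old `target_latency` key would no longer feed this branch.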

modelopt/torch/puzzletron/subblock_stats/__init__.py

Lines changed: 0 additions & 1 deletion
@@ -15,5 +15,4 @@
 
 """Subblock statistics collection for Puzzletron."""
 
-from .calc_subblock_params_and_memory import *
 from .calc_subblock_stats import *
