Commit 26f37ae

Add support and documentation for AnyModel checkpoints with NeMo Evaluator

Signed-off-by: jrausch <jrausch@nvidia.com>

1 parent 4e744d4

4 files changed: 790 additions & 9 deletions

.pre-commit-config.yaml (1 addition & 0 deletions)
````diff
@@ -109,6 +109,7 @@ repos:
 examples/speculative_decoding/main.py|
 examples/speculative_decoding/medusa_utils.py|
 examples/speculative_decoding/server_generate.py|
+examples/puzzletron/evaluation/hf_deployable_anymodel\.py|
 modelopt/torch/puzzletron/decilm/deci_lm_hf_code/transformers_.*\.py|
 )$
````
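The added entry is one alternative inside the larger `(...)$` exclude regex shown in the hunk. pre-commit evaluates `exclude` with Python's `re` module; as a rough sanity check of such a fragment, a `grep -E` approximation (the `(...)$` wrapper below stands in for the full surrounding pattern) can confirm it matches the intended path:

```shell
# Approximate the pre-commit exclude check: does the new fragment,
# wrapped in the (...)$ grouping from the config, match its target file?
echo "examples/puzzletron/evaluation/hf_deployable_anymodel.py" \
  | grep -E '(examples/puzzletron/evaluation/hf_deployable_anymodel\.py)$'
```

A non-matching path (e.g. one with a different suffix) would produce no output, since `grep` only echoes matching lines.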

examples/puzzletron/README.md (63 additions & 8 deletions)
````diff
@@ -15,11 +15,11 @@ In this example, we compress the [Llama-3.1-8B-Instruct](https://huggingface.co/
 
 ## Environment
 
-- Install Model-Optimizer in editable mode with the corresponding dependencies:
+- Install Model-Optimizer in editable mode with the corresponding dependencies (run from the repo root):
 
 ```bash
 pip install -e .[hf,puzzletron]
-pip install -r requirements.txt
+pip install -r examples/puzzletron/requirements.txt
 ```
 
 - For this example we are using 2x NVIDIA H100 80GB HBM3 to show multi-GPU steps. You can also use a single GPU.
````
````diff
@@ -199,16 +199,71 @@ block_14: attention no_op ffn intermediate_3072
 
 ## Evaluation
 
-Once the model is ready, you can evaluate it using [Language Model Evaluation Harness](https://pypi.org/project/lm-eval/). For example, run the following to evaluate the model on the [Massive Multitask Language Understanding](https://huggingface.co/datasets/cais/mmlu) benchmark.
+### Local Evaluation with NeMo Evaluator (AnyModel)
+
+AnyModel checkpoints are currently supported via the patched NeMo Evaluator deployable
+in [`examples/puzzletron/evaluation/`](./evaluation/). This deploys a local OpenAI-style completions endpoint against which evaluations can be run.
+
+> **Note:** This flow requires Ray. If it is missing, install it in the container/venv:
+>
+> ```bash
+> pip install ray
+> ```
+>
+**Deploy the model locally on an interactive node (2-GPU example):**
 
 ```bash
-lm_eval --model hf \
-    --model_args pretrained=path/to/model,dtype=bfloat16,trust_remote_code=true,parallelize=True \
-    --tasks mmlu \
-    --num_fewshot 5 \
-    --batch_size 4
+# Repo root (not puzzle_dir)
+export MODELOPT_WORKDIR=/path/to/Model-Optimizer
+export NEMO_EXPORT_DEPLOY_DIR=/opt/Export-Deploy  # In a NeMo container, this is where Export-Deploy is located; adjust if needed
+
+# Example 1: teacher checkpoint
+export CHECKPOINT_PATH=/path/to/ckpts/teacher
+
+# Example 2: pruned checkpoint (solution_0)
+# For pruned checkpoints, point at the solution directory, for example:
+export CHECKPOINT_PATH=/path/to/ckpts/ffn_8704_attn_no_op
+
+# First time only: back up the original deployable
+cp $NEMO_EXPORT_DEPLOY_DIR/nemo_deploy/llm/hf_deployable.py $NEMO_EXPORT_DEPLOY_DIR/nemo_deploy/llm/hf_deployable.py.bak
+
+cp examples/puzzletron/evaluation/hf_deployable_anymodel.py $NEMO_EXPORT_DEPLOY_DIR/nemo_deploy/llm/hf_deployable.py
+ray start --head --num-gpus 2 --port 6379 --disable-usage-stats
+
+# Run in a separate terminal (blocks while the server is up)
+python $NEMO_EXPORT_DEPLOY_DIR/scripts/deploy/nlp/deploy_ray_hf.py \
+    --model_path $CHECKPOINT_PATH \
+    --model_id anymodel-hf \
+    --num_replicas 1 \
+    --num_gpus 2 \
+    --num_gpus_per_replica 2 \
+    --num_cpus_per_replica 16 \
+    --trust_remote_code \
+    --port 8083 \
+    --device_map "auto" \
+    --cuda_visible_devices "0,1"
 ```
 
+Note: `deploy_ray_hf.py` runs a long-lived server. Keep it running in another terminal
+or background it (e.g., in tmux) while you run the eval. Adjust GPU counts and `cuda_visible_devices` to
+match your node.
+
+**Run MMLU (full run on the interactive node):**
+
+```bash
+eval-factory run_eval \
+    --eval_type mmlu \
+    --model_id anymodel-hf \
+    --model_type completions \
+    --model_url http://0.0.0.0:8083/v1/completions/ \
+    --output_dir $PUZZLE_DIR/evals/mmlu_anymodel \
+    --overrides "config.params.parallelism=2,config.params.task=mmlu,config.params.extra.tokenizer=$CHECKPOINT_PATH,config.params.extra.tokenizer_backend=huggingface,config.params.request_timeout=6000"
+```
+
+Note: For a quick debug run, add `,config.params.limit_samples=5` to the `--overrides` list.
+
+Results can be viewed in the generated `results.yml` file.
+
 ## Inference Performance Benchmarking
 
 Now let's evaluate how much speedup we get with the compressed model in terms of throughput and latency.
````
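Once the `deploy_ray_hf.py` server from the documented flow is up, a quick sanity check against the completions endpoint can look like the sketch below. The prompt and `max_tokens` value are illustrative; the model id, port, and route come from the deploy flags in the README hunk.

```shell
# Illustrative request against the locally deployed OpenAI-style endpoint.
# "anymodel-hf" and port 8083 match the deploy_ray_hf.py flags above.
PAYLOAD='{"model": "anymodel-hf", "prompt": "The capital of France is", "max_tokens": 8}'
# `|| true` keeps the check non-fatal when the server is not up yet.
curl -s http://0.0.0.0:8083/v1/completions/ \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || true
```

An empty response here usually just means the Ray Serve replica is still loading the checkpoint; retry after the deploy log reports the endpoint as ready.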
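The documented flow overwrites `hf_deployable.py` in the Export-Deploy tree after backing it up. A guarded sketch for undoing the patch later (assumes the same `NEMO_EXPORT_DEPLOY_DIR` as above, defaulting to the NeMo container path; a no-op when the backup does not exist):

```shell
# Restore the stock NeMo deployable from the backup created earlier.
BAK="${NEMO_EXPORT_DEPLOY_DIR:-/opt/Export-Deploy}/nemo_deploy/llm/hf_deployable.py.bak"
if [ -f "$BAK" ]; then
  # ${BAK%.bak} strips the .bak suffix to recover the original path
  cp "$BAK" "${BAK%.bak}"
fi
```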
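After `eval-factory run_eval` completes, scores land in `results.yml` under the `--output_dir` from the eval command. A guarded sketch for printing them, assuming `$PUZZLE_DIR` is set as elsewhere in the README (falls back to the current directory when unset):

```shell
# Path matches --output_dir in the eval command above.
RESULTS="${PUZZLE_DIR:-.}/evals/mmlu_anymodel/results.yml"
if [ -f "$RESULTS" ]; then
  cat "$RESULTS"   # full results, including per-task scores
else
  echo "no results yet at $RESULTS"
fi
```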
