
Commit 0190f75

jrausch committed
Add lm-eval wrapper for direct AnyModel checkpoint evaluation
Signed-off-by: jrausch <jrausch@nvidia.com>
1 parent f69fc7a commit 0190f75

3 files changed

Lines changed: 189 additions & 25 deletions


examples/puzzletron/README.md

Lines changed: 12 additions & 25 deletions
@@ -229,37 +229,24 @@ The plot shows how token accuracy changes with different compression rates. High
 
 ## Evaluation
 
-Evaluate AnyModel checkpoints by deploying a local OpenAI-compatible completions endpoint and running benchmarks against it.
+Evaluate AnyModel checkpoints using [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) directly — no deployment server or Ray needed. The wrapper script handles the heterogeneous layer loading automatically.
 
-**1. Deploy the model (2 GPUs example):**
+> **Note:** NeMo containers ship `nvidia_lm_eval`, an NVIDIA fork that occupies the same
+> `lm_eval` namespace. If installed, uninstall it first: `pip uninstall nvidia-lm-eval -y`
 
 ```bash
-# Install the AnyModel-patched deployable (first time only: backs up the original)
-# /opt/Export-Deploy is the default path in NeMo containers — adjust if needed
-cp /opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py /opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py.bak
-cp examples/puzzletron/evaluation/hf_deployable_anymodel.py /opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py
-
-# Start the server (blocks while running — use a separate terminal)
-ray start --head --num-gpus 2 --port 6379 --disable-usage-stats
-python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
-    --model_path path/to/checkpoint \
-    --model_id anymodel-hf \
-    --num_gpus 2 --num_gpus_per_replica 2 --num_cpus_per_replica 16 \
-    --trust_remote_code --port 8083 --device_map "auto" --cuda_visible_devices "0,1"
+python examples/puzzletron/evaluation/lm_eval_anymodel.py \
+    --model hf \
+    --model_args pretrained=path/to/checkpoint,dtype=bfloat16,parallelize=True \
+    --tasks mmlu \
+    --num_fewshot 5 \
+    --batch_size 4
 ```
 
-**2. Run MMLU:**
+For a quick smoke test, add `--limit 10`. All standard [lm-eval flags](https://github.com/EleutherAI/lm-evaluation-harness?tab=readme-ov-file#basic-usage) are supported.
 
-```bash
-eval-factory run_eval \
-    --eval_type mmlu \
-    --model_id anymodel-hf \
-    --model_type completions \
-    --model_url http://0.0.0.0:8083/v1/completions/ \
-    --output_dir examples/puzzletron/evals/mmlu_anymodel
-```
-
-For a quick debug run, add `--overrides "config.params.limit_samples=5"`.
+> **Alternative:** For server-based evaluation via an OpenAI-compatible endpoint,
+> see [evaluation/nemo_evaluator_instructions.md](./evaluation/nemo_evaluator_instructions.md).
 
 ## Inference Performance Benchmarking
 
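
The note in the new README text warns that the `nvidia_lm_eval` fork shadows the upstream `lm_eval` namespace. A quick way to confirm which distribution actually resolves at import time, as a minimal sketch (not part of this commit):

```python
# Sanity check (sketch, not part of this commit): show which installed package
# owns the `lm_eval` namespace before running the wrapper script.
import lm_eval

print(lm_eval.__file__)  # install path reveals upstream lm-eval vs. the NVIDIA fork
print(getattr(lm_eval, "__version__", "unknown"))  # hedged: not every version exposes __version__
```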
examples/puzzletron/evaluation/lm_eval_anymodel.py (new file)

Lines changed: 108 additions & 0 deletions
```python
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Run lm-eval directly on AnyModel (puzzletron) checkpoints without a deployment server.

AnyModel checkpoints have heterogeneous decoder layers; this script patches
lm-eval's HFLM to wrap model loading with deci_x_patcher so they load correctly.

Usage:
    # From the repo root (requires: pip install -e ".[hf,puzzletron]")
    # Descriptor is auto-detected from the checkpoint's config.json model_type.
    python examples/puzzletron/evaluation/lm_eval_anymodel.py \
        --model hf \
        --model_args pretrained=/path/to/anymodel_checkpoint,dtype=bfloat16,parallelize=True \
        --tasks mmlu \
        --num_fewshot 5 \
        --batch_size 4

    # With sample limit for smoke tests
    python examples/puzzletron/evaluation/lm_eval_anymodel.py \
        --model hf \
        --model_args pretrained=/path/to/anymodel_checkpoint,dtype=bfloat16,parallelize=True \
        --tasks mmlu \
        --limit 10
"""

from lm_eval.__main__ import cli_evaluate
from lm_eval.api.model import T
from lm_eval.models.huggingface import HFLM

# Trigger factory registration for all model descriptors
import modelopt.torch.puzzletron.anymodel.models  # noqa: F401
from modelopt.torch.puzzletron.anymodel.model_descriptor import ModelDescriptorFactory
from modelopt.torch.puzzletron.anymodel.puzzformer import deci_x_patcher

# Map from HuggingFace config.model_type (in checkpoint config.json) to ModelDescriptorFactory name.
# Local to this script; add entries when supporting new model types for auto-detection.
_MODEL_TYPE_TO_DESCRIPTOR = {
    "llama": "llama",
    "mistral": "mistral_small",
    "qwen2": "qwen2",
    "qwen3": "qwen3",
    "nemotron_h": "nemotron_h",
    "nemotron_h_v2": "nemotron_h_v2",
    "gpt_oss_20b": "gpt_oss_20b",
}


def _resolve_descriptor_from_pretrained(pretrained: str | None):
    """Resolve the model descriptor by loading the checkpoint config and mapping model_type."""
    if not pretrained:
        raise ValueError(
            "pretrained must be set in --model_args "
            "(e.g. --model_args pretrained=/path/to/checkpoint,dtype=bfloat16)."
        )
    from transformers import AutoConfig

    config = AutoConfig.from_pretrained(pretrained, trust_remote_code=True)
    model_type = getattr(config, "model_type", None)

    if model_type and model_type in _MODEL_TYPE_TO_DESCRIPTOR:
        detected = _MODEL_TYPE_TO_DESCRIPTOR[model_type]
        print(
            f"[lm_eval_anymodel] Auto-detected model_type='{model_type}' → descriptor='{detected}'"
        )
        return ModelDescriptorFactory.get(detected)

    known = sorted(_MODEL_TYPE_TO_DESCRIPTOR.keys())
    raise ValueError(
        f"Cannot auto-detect descriptor for model_type='{model_type}'. "
        f"Known model types: {known}. Add this model_type to _MODEL_TYPE_TO_DESCRIPTOR if supported."
    )


def create_from_arg_obj(cls: type[T], arg_dict: dict, additional_config: dict | None = None) -> T:
    """Override HFLM.create_from_arg_obj to wrap model loading with deci_x_patcher."""
    pretrained = arg_dict.get("pretrained")
    descriptor = _resolve_descriptor_from_pretrained(pretrained)

    additional_config = {} if additional_config is None else additional_config
    additional_config = {k: v for k, v in additional_config.items() if v is not None}

    # The patcher must be active during HFLM.__init__ because that's where
    # AutoModelForCausalLM.from_pretrained() is called internally.
    with deci_x_patcher(model_descriptor=descriptor):
        model_obj = cls(**arg_dict, **additional_config)

    return model_obj


# Monkey-patch HFLM so lm-eval uses our patched model loading
HFLM.create_from_arg_obj = classmethod(create_from_arg_obj)


if __name__ == "__main__":
    cli_evaluate()
```
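
Descriptor auto-detection can be exercised on its own before committing GPU time to a full run. A minimal sketch, assuming it is run from the script's directory so `lm_eval_anymodel` is importable (the helper is private and may change):

```python
# Pre-flight check (sketch): confirm the checkpoint's config.json model_type
# maps to a known descriptor; raises ValueError with the supported list if not.
from lm_eval_anymodel import _resolve_descriptor_from_pretrained

descriptor = _resolve_descriptor_from_pretrained("/path/to/anymodel_checkpoint")
print(descriptor)  # printed after the "[lm_eval_anymodel] Auto-detected ..." log line
```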
examples/puzzletron/evaluation/nemo_evaluator_instructions.md (new file)

Lines changed: 69 additions & 0 deletions
# Evaluation with NeMo Evaluator (Alternative)

> **Recommended approach:** Use lm-eval for direct evaluation without a
> deployment server. See the main [README](../README.md#evaluation) for details.

This document describes an alternative evaluation flow using NeMo Evaluator
via `eval-factory`. It deploys the checkpoint as a local OpenAI-style completions
endpoint and runs evaluation against it.

## Prerequisites

- NeMo container (e.g. `nemo:25.11`) with NeMo Evaluator and NeMo Export-Deploy
- The AnyModel deploy patch: `examples/puzzletron/evaluation/hf_deployable_anymodel.py`

## Deploy the Model Locally (example: interactive node, 2 GPUs)

```bash
# Repo root (not puzzle_dir)
export MODELOPT_WORKDIR=/path/to/Model-Optimizer
export NEMO_EXPORT_DEPLOY_DIR=/opt/Export-Deploy  # NeMo container default; adjust if needed

# Choose a checkpoint
export CHECKPOINT_PATH=/path/to/ckpts/teacher
# or a pruned checkpoint:
# export CHECKPOINT_PATH=/path/to/ckpts/ffn_8704_attn_no_op

# First time only: back up the original deployable
cp $NEMO_EXPORT_DEPLOY_DIR/nemo_deploy/llm/hf_deployable.py \
   $NEMO_EXPORT_DEPLOY_DIR/nemo_deploy/llm/hf_deployable.py.bak

# Patch the deployable for AnyModel support
cp examples/puzzletron/evaluation/hf_deployable_anymodel.py \
   $NEMO_EXPORT_DEPLOY_DIR/nemo_deploy/llm/hf_deployable.py

ray start --head --num-gpus 2 --port 6379 --disable-usage-stats

# Run in a separate terminal (blocks while server is up)
python $NEMO_EXPORT_DEPLOY_DIR/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path $CHECKPOINT_PATH \
    --model_id anymodel-hf \
    --num_replicas 1 \
    --num_gpus 2 \
    --num_gpus_per_replica 2 \
    --num_cpus_per_replica 16 \
    --trust_remote_code \
    --port 8083 \
    --device_map "auto" \
    --cuda_visible_devices "0,1"
```

`deploy_ray_hf.py` runs a long-lived server. Keep it running in another terminal
or background it (e.g., tmux) while you run eval. Adjust GPU counts and
`cuda_visible_devices` to match your node.
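
Before pointing `eval-factory` at the server, it helps to confirm the endpoint answers. A minimal sketch using the standard OpenAI-style completions payload (assumes `requests` is installed; the exact response schema depends on the deployment):

```python
# Sanity check (sketch, not part of this commit): send a single completion
# request to the freshly deployed endpoint.
import requests

resp = requests.post(
    "http://0.0.0.0:8083/v1/completions/",  # --port 8083 from the deploy command
    json={
        "model": "anymodel-hf",  # matches --model_id above
        "prompt": "The capital of France is",
        "max_tokens": 8,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```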
## Run MMLU

```bash
eval-factory run_eval \
    --eval_type mmlu \
    --model_id anymodel-hf \
    --model_type completions \
    --model_url http://0.0.0.0:8083/v1/completions/ \
    --output_dir $PUZZLE_DIR/evals/mmlu_anymodel \
    --overrides "config.params.parallelism=2,config.params.task=mmlu,config.params.extra.tokenizer=$CHECKPOINT_PATH,config.params.extra.tokenizer_backend=huggingface,config.params.request_timeout=6000"
```

For a quick debug run, add `,config.params.limit_samples=5` to the `--overrides` list.

Results can be viewed in the generated `results.yml` file.
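
To pull scores out of `results.yml` programmatically, something like the following works; a sketch that assumes PyYAML is available and makes no claim about the exact schema, which depends on the NeMo Evaluator version:

```python
# Sketch: load the generated results and dump the full structure, since the
# exact key layout varies across NeMo Evaluator versions.
from pathlib import Path

import yaml

# Path is the --output_dir used above plus the generated file name.
results = yaml.safe_load(Path("evals/mmlu_anymodel/results.yml").read_text())
print(yaml.safe_dump(results, sort_keys=False))
```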
