
Commit 0190f75

jrausch committed
Add lm-eval wrapper for direct AnyModel checkpoint evaluation
Signed-off-by: jrausch <jrausch@nvidia.com>
1 parent f69fc7a commit 0190f75

3 files changed

Lines changed: 189 additions & 25 deletions


examples/puzzletron/README.md

Lines changed: 12 additions & 25 deletions
@@ -229,37 +229,24 @@ The plot shows how token accuracy changes with different compression rates. High
 
 ## Evaluation
 
-Evaluate AnyModel checkpoints by deploying a local OpenAI-compatible completions endpoint and running benchmarks against it.
+Evaluate AnyModel checkpoints using [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) directly — no deployment server or Ray needed. The wrapper script handles the heterogeneous layer loading automatically.
 
-**1. Deploy the model (2 GPUs example):**
+> **Note:** NeMo containers ship `nvidia_lm_eval`, an NVIDIA fork that occupies the same
+> `lm_eval` namespace. If installed, uninstall it first: `pip uninstall nvidia-lm-eval -y`
 
 ```bash
-# Install the AnyModel-patched deployable (first time only: backs up the original)
-# /opt/Export-Deploy is the default path in NeMo containers — adjust if needed
-cp /opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py /opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py.bak
-cp examples/puzzletron/evaluation/hf_deployable_anymodel.py /opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py
-
-# Start the server (blocks while running — use a separate terminal)
-ray start --head --num-gpus 2 --port 6379 --disable-usage-stats
-python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
-    --model_path path/to/checkpoint \
-    --model_id anymodel-hf \
-    --num_gpus 2 --num_gpus_per_replica 2 --num_cpus_per_replica 16 \
-    --trust_remote_code --port 8083 --device_map "auto" --cuda_visible_devices "0,1"
+python examples/puzzletron/evaluation/lm_eval_anymodel.py \
+    --model hf \
+    --model_args pretrained=path/to/checkpoint,dtype=bfloat16,parallelize=True \
+    --tasks mmlu \
+    --num_fewshot 5 \
+    --batch_size 4
 ```
 
-**2. Run MMLU:**
+For a quick smoke test, add `--limit 10`. All standard [lm-eval flags](https://github.com/EleutherAI/lm-evaluation-harness?tab=readme-ov-file#basic-usage) are supported.
 
-```bash
-eval-factory run_eval \
-    --eval_type mmlu \
-    --model_id anymodel-hf \
-    --model_type completions \
-    --model_url http://0.0.0.0:8083/v1/completions/ \
-    --output_dir examples/puzzletron/evals/mmlu_anymodel
-```
-
-For a quick debug run, add `--overrides "config.params.limit_samples=5"`.
+> **Alternative:** For server-based evaluation via an OpenAI-compatible endpoint,
+> see [evaluation/nemo_evaluator_instructions.md](./evaluation/nemo_evaluator_instructions.md).
 
 ## Inference Performance Benchmarking
 
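
The note in the new README text warns that the `nvidia_lm_eval` fork shadows the upstream `lm_eval` namespace. A quick way to confirm which distribution actually resolves at import time, as a minimal sketch (not part of this commit):

```python
# Sanity check (sketch, not part of this commit): show which installed package
# owns the `lm_eval` namespace before running the wrapper script.
import lm_eval

print(lm_eval.__file__)  # install path reveals upstream lm-eval vs. the NVIDIA fork
print(getattr(lm_eval, "__version__", "unknown"))  # hedged: not every version exposes __version__
```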
examples/puzzletron/evaluation/lm_eval_anymodel.py (new file)

Lines changed: 108 additions & 0 deletions
```python
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Run lm-eval directly on AnyModel (puzzletron) checkpoints without a deployment server.

AnyModel checkpoints have heterogeneous decoder layers; this script patches
lm-eval's HFLM to wrap model loading with deci_x_patcher so they load correctly.

Usage:
    # From the repo root (requires: pip install -e ".[hf,puzzletron]")
    # Descriptor is auto-detected from the checkpoint's config.json model_type.
    python examples/puzzletron/evaluation/lm_eval_anymodel.py \
        --model hf \
        --model_args pretrained=/path/to/anymodel_checkpoint,dtype=bfloat16,parallelize=True \
        --tasks mmlu \
        --num_fewshot 5 \
        --batch_size 4

    # With sample limit for smoke tests
    python examples/puzzletron/evaluation/lm_eval_anymodel.py \
        --model hf \
        --model_args pretrained=/path/to/anymodel_checkpoint,dtype=bfloat16,parallelize=True \
        --tasks mmlu \
        --limit 10
"""

from lm_eval.__main__ import cli_evaluate
from lm_eval.api.model import T
from lm_eval.models.huggingface import HFLM

# Trigger factory registration for all model descriptors
import modelopt.torch.puzzletron.anymodel.models  # noqa: F401
from modelopt.torch.puzzletron.anymodel.model_descriptor import ModelDescriptorFactory
from modelopt.torch.puzzletron.anymodel.puzzformer import deci_x_patcher

# Map from HuggingFace config.model_type (in checkpoint config.json) to ModelDescriptorFactory name.
# Local to this script; add entries when supporting new model types for auto-detection.
_MODEL_TYPE_TO_DESCRIPTOR = {
    "llama": "llama",
    "mistral": "mistral_small",
    "qwen2": "qwen2",
    "qwen3": "qwen3",
    "nemotron_h": "nemotron_h",
    "nemotron_h_v2": "nemotron_h_v2",
    "gpt_oss_20b": "gpt_oss_20b",
}


def _resolve_descriptor_from_pretrained(pretrained: str | None):
    """Resolve the model descriptor by loading the checkpoint config and mapping model_type."""
    if not pretrained:
        raise ValueError(
            "pretrained must be set in --model_args "
            "(e.g. --model_args pretrained=/path/to/checkpoint,dtype=bfloat16)."
        )
    from transformers import AutoConfig

    config = AutoConfig.from_pretrained(pretrained, trust_remote_code=True)
    model_type = getattr(config, "model_type", None)

    if model_type and model_type in _MODEL_TYPE_TO_DESCRIPTOR:
        detected = _MODEL_TYPE_TO_DESCRIPTOR[model_type]
        print(
            f"[lm_eval_anymodel] Auto-detected model_type='{model_type}' → descriptor='{detected}'"
        )
        return ModelDescriptorFactory.get(detected)

    known = sorted(_MODEL_TYPE_TO_DESCRIPTOR.keys())
    raise ValueError(
        f"Cannot auto-detect descriptor for model_type='{model_type}'. "
        f"Known model types: {known}. Add this model_type to _MODEL_TYPE_TO_DESCRIPTOR if supported."
    )


def create_from_arg_obj(cls: type[T], arg_dict: dict, additional_config: dict | None = None) -> T:
    """Override HFLM.create_from_arg_obj to wrap model loading with deci_x_patcher."""
    pretrained = arg_dict.get("pretrained")
    descriptor = _resolve_descriptor_from_pretrained(pretrained)

    additional_config = {} if additional_config is None else additional_config
    additional_config = {k: v for k, v in additional_config.items() if v is not None}

    # The patcher must be active during HFLM.__init__ because that's where
    # AutoModelForCausalLM.from_pretrained() is called internally.
    with deci_x_patcher(model_descriptor=descriptor):
        model_obj = cls(**arg_dict, **additional_config)

    return model_obj


# Monkey-patch HFLM so lm-eval uses our patched model loading
HFLM.create_from_arg_obj = classmethod(create_from_arg_obj)


if __name__ == "__main__":
    cli_evaluate()
```
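
Descriptor auto-detection can be exercised on its own before committing GPU time to a full run. A minimal sketch, assuming it is run from the script's directory so `lm_eval_anymodel` is importable (the helper is private and may change):

```python
# Pre-flight check (sketch): confirm the checkpoint's config.json model_type
# maps to a known descriptor; raises ValueError with the supported list if not.
from lm_eval_anymodel import _resolve_descriptor_from_pretrained

descriptor = _resolve_descriptor_from_pretrained("/path/to/anymodel_checkpoint")
print(descriptor)  # printed after the "[lm_eval_anymodel] Auto-detected ..." log line
```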
examples/puzzletron/evaluation/nemo_evaluator_instructions.md (new file)

Lines changed: 69 additions & 0 deletions
# Evaluation with NeMo Evaluator (Alternative)

> **Recommended approach:** Use lm-eval for direct evaluation without a
> deployment server. See the main [README](../README.md#evaluation) for details.

This document describes an alternative evaluation flow using NeMo Evaluator
via `eval-factory`. It deploys the checkpoint as a local OpenAI-style completions
endpoint and runs evaluation against it.

## Prerequisites

- NeMo container (e.g. `nemo:25.11`) with NeMo Evaluator and NeMo Export-Deploy
- The AnyModel deploy patch: `examples/puzzletron/evaluation/hf_deployable_anymodel.py`

## Deploy the Model Locally (example: interactive node, 2 GPUs)

```bash
# Repo root (not puzzle_dir)
export MODELOPT_WORKDIR=/path/to/Model-Optimizer
export NEMO_EXPORT_DEPLOY_DIR=/opt/Export-Deploy  # NeMo container default; adjust if needed

# Choose a checkpoint
export CHECKPOINT_PATH=/path/to/ckpts/teacher
# or a pruned checkpoint:
# export CHECKPOINT_PATH=/path/to/ckpts/ffn_8704_attn_no_op

# First time only: back up the original deployable
cp $NEMO_EXPORT_DEPLOY_DIR/nemo_deploy/llm/hf_deployable.py \
   $NEMO_EXPORT_DEPLOY_DIR/nemo_deploy/llm/hf_deployable.py.bak

# Patch the deployable for AnyModel support
cp examples/puzzletron/evaluation/hf_deployable_anymodel.py \
   $NEMO_EXPORT_DEPLOY_DIR/nemo_deploy/llm/hf_deployable.py

ray start --head --num-gpus 2 --port 6379 --disable-usage-stats

# Run in a separate terminal (blocks while server is up)
python $NEMO_EXPORT_DEPLOY_DIR/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path $CHECKPOINT_PATH \
    --model_id anymodel-hf \
    --num_replicas 1 \
    --num_gpus 2 \
    --num_gpus_per_replica 2 \
    --num_cpus_per_replica 16 \
    --trust_remote_code \
    --port 8083 \
    --device_map "auto" \
    --cuda_visible_devices "0,1"
```

`deploy_ray_hf.py` runs a long-lived server. Keep it running in another terminal
or background it (e.g., tmux) while you run eval. Adjust GPU counts and
`cuda_visible_devices` to match your node.
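
Before pointing `eval-factory` at the server, it helps to confirm the endpoint answers. A minimal sketch using the standard OpenAI-style completions payload (assumes `requests` is installed; the exact response schema depends on the deployment):

```python
# Sanity check (sketch, not part of this commit): send a single completion
# request to the freshly deployed endpoint.
import requests

resp = requests.post(
    "http://0.0.0.0:8083/v1/completions/",  # --port 8083 from the deploy command
    json={
        "model": "anymodel-hf",  # matches --model_id above
        "prompt": "The capital of France is",
        "max_tokens": 8,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```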
## Run MMLU

```bash
eval-factory run_eval \
    --eval_type mmlu \
    --model_id anymodel-hf \
    --model_type completions \
    --model_url http://0.0.0.0:8083/v1/completions/ \
    --output_dir $PUZZLE_DIR/evals/mmlu_anymodel \
    --overrides "config.params.parallelism=2,config.params.task=mmlu,config.params.extra.tokenizer=$CHECKPOINT_PATH,config.params.extra.tokenizer_backend=huggingface,config.params.request_timeout=6000"
```

For a quick debug run, add `,config.params.limit_samples=5` to the `--overrides` list.

Results can be viewed in the generated `results.yml` file.
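
To pull scores out of `results.yml` programmatically, something like the following works; a sketch that assumes PyYAML is available and makes no claim about the exact schema, which depends on the NeMo Evaluator version:

```python
# Sketch: load the generated results and dump the full structure, since the
# exact key layout varies across NeMo Evaluator versions.
from pathlib import Path

import yaml

# Path is the --output_dir used above plus the generated file name.
results = yaml.safe_load(Path("evals/mmlu_anymodel/results.yml").read_text())
print(yaml.safe_dump(results, sort_keys=False))
```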
