Commit 26f37ae

Add support and documentation for AnyModel checkpoints with NeMo Evaluator

Signed-off-by: jrausch <jrausch@nvidia.com>

1 parent 4e744d4

4 files changed: 790 additions & 9 deletions

.pre-commit-config.yaml (1 addition & 0 deletions)
````diff
@@ -109,6 +109,7 @@ repos:
 examples/speculative_decoding/main.py|
 examples/speculative_decoding/medusa_utils.py|
 examples/speculative_decoding/server_generate.py|
+examples/puzzletron/evaluation/hf_deployable_anymodel\.py|
 modelopt/torch/puzzletron/decilm/deci_lm_hf_code/transformers_.*\.py|
 )$
````
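The added entry is one alternative inside the larger `(...)$` exclude regex shown in the hunk. pre-commit evaluates `exclude` with Python's `re` module; as a rough sanity check of such a fragment, a `grep -E` approximation (the `(...)$` wrapper below stands in for the full surrounding pattern) can confirm it matches the intended path:

```shell
# Approximate the pre-commit exclude check: does the new fragment,
# wrapped in the (...)$ grouping from the config, match its target file?
echo "examples/puzzletron/evaluation/hf_deployable_anymodel.py" \
  | grep -E '(examples/puzzletron/evaluation/hf_deployable_anymodel\.py)$'
```

A non-matching path (e.g. one with a different suffix) would produce no output, since `grep` only echoes matching lines.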

examples/puzzletron/README.md (63 additions & 8 deletions)
````diff
@@ -15,11 +15,11 @@ In this example, we compress the [Llama-3.1-8B-Instruct](https://huggingface.co/
 
 ## Environment
 
-- Install Model-Optimizer in editable mode with the corresponding dependencies:
+- Install Model-Optimizer in editable mode with the corresponding dependencies (run from the repo root):
 
 ```bash
 pip install -e .[hf,puzzletron]
-pip install -r requirements.txt
+pip install -r examples/puzzletron/requirements.txt
 ```
 
 - For this example we are using 2x NVIDIA H100 80GB HBM3 to show multi-GPU steps. You can also use a single GPU.
````
````diff
@@ -199,16 +199,71 @@ block_14: attention no_op ffn intermediate_3072
 
 ## Evaluation
 
-Once the model is ready, you can evaluate it using [Language Model Evaluation Harness](https://pypi.org/project/lm-eval/). For example, run the following to evaluate the model on the [Massive Multitask Language Understanding](https://huggingface.co/datasets/cais/mmlu) benchmark.
+### Local Evaluation with NeMo Evaluator (AnyModel)
+
+AnyModel checkpoints are currently supported via the patched NeMo Evaluator deployable
+in [`examples/puzzletron/evaluation/`](./evaluation/). This deploys a local OpenAI-style completions endpoint against which evaluations can be run.
+
+> **Note:** This flow requires Ray. If it is missing, install it in the container/venv:
+>
+> ```bash
+> pip install ray
+> ```
+>
+**Deploy the model locally on an interactive node (2-GPU example):**
 
 ```bash
-lm_eval --model hf \
-    --model_args pretrained=path/to/model,dtype=bfloat16,trust_remote_code=true,parallelize=True \
-    --tasks mmlu \
-    --num_fewshot 5 \
-    --batch_size 4
+# Repo root (not puzzle_dir)
+export MODELOPT_WORKDIR=/path/to/Model-Optimizer
+export NEMO_EXPORT_DEPLOY_DIR=/opt/Export-Deploy  # In a NeMo container, this is where Export-Deploy is located; adjust if needed
+
+# Example 1: teacher checkpoint
+export CHECKPOINT_PATH=/path/to/ckpts/teacher
+
+# Example 2: pruned checkpoint (solution_0)
+# For pruned checkpoints, point at the solution directory, for example:
+export CHECKPOINT_PATH=/path/to/ckpts/ffn_8704_attn_no_op
+
+# First time only: back up the original deployable
+cp $NEMO_EXPORT_DEPLOY_DIR/nemo_deploy/llm/hf_deployable.py $NEMO_EXPORT_DEPLOY_DIR/nemo_deploy/llm/hf_deployable.py.bak
+
+cp examples/puzzletron/evaluation/hf_deployable_anymodel.py $NEMO_EXPORT_DEPLOY_DIR/nemo_deploy/llm/hf_deployable.py
+ray start --head --num-gpus 2 --port 6379 --disable-usage-stats
+
+# Run in a separate terminal (blocks while the server is up)
+python $NEMO_EXPORT_DEPLOY_DIR/scripts/deploy/nlp/deploy_ray_hf.py \
+    --model_path $CHECKPOINT_PATH \
+    --model_id anymodel-hf \
+    --num_replicas 1 \
+    --num_gpus 2 \
+    --num_gpus_per_replica 2 \
+    --num_cpus_per_replica 16 \
+    --trust_remote_code \
+    --port 8083 \
+    --device_map "auto" \
+    --cuda_visible_devices "0,1"
 ```
 
+Note: `deploy_ray_hf.py` runs a long-lived server. Keep it running in another terminal
+or background it (e.g., in tmux) while you run the eval. Adjust GPU counts and `cuda_visible_devices` to
+match your node.
+
+**Run MMLU (full run on the interactive node):**
+
+```bash
+eval-factory run_eval \
+    --eval_type mmlu \
+    --model_id anymodel-hf \
+    --model_type completions \
+    --model_url http://0.0.0.0:8083/v1/completions/ \
+    --output_dir $PUZZLE_DIR/evals/mmlu_anymodel \
+    --overrides "config.params.parallelism=2,config.params.task=mmlu,config.params.extra.tokenizer=$CHECKPOINT_PATH,config.params.extra.tokenizer_backend=huggingface,config.params.request_timeout=6000"
+```
+
+Note: For a quick debug run, add `,config.params.limit_samples=5` to the `--overrides` list.
+
+Results can be viewed in the generated `results.yml` file.
+
 ## Inference Performance Benchmarking
 
 Now let's evaluate how much speedup we get with the compressed model in terms of throughput and latency.
````
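Once the `deploy_ray_hf.py` server from the documented flow is up, a quick sanity check against the completions endpoint can look like the sketch below. The prompt and `max_tokens` value are illustrative; the model id, port, and route come from the deploy flags in the README hunk.

```shell
# Illustrative request against the locally deployed OpenAI-style endpoint.
# "anymodel-hf" and port 8083 match the deploy_ray_hf.py flags above.
PAYLOAD='{"model": "anymodel-hf", "prompt": "The capital of France is", "max_tokens": 8}'
# `|| true` keeps the check non-fatal when the server is not up yet.
curl -s http://0.0.0.0:8083/v1/completions/ \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || true
```

An empty response here usually just means the Ray Serve replica is still loading the checkpoint; retry after the deploy log reports the endpoint as ready.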
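The documented flow overwrites `hf_deployable.py` in the Export-Deploy tree after backing it up. A guarded sketch for undoing the patch later (assumes the same `NEMO_EXPORT_DEPLOY_DIR` as above, defaulting to the NeMo container path; a no-op when the backup does not exist):

```shell
# Restore the stock NeMo deployable from the backup created earlier.
BAK="${NEMO_EXPORT_DEPLOY_DIR:-/opt/Export-Deploy}/nemo_deploy/llm/hf_deployable.py.bak"
if [ -f "$BAK" ]; then
  # ${BAK%.bak} strips the .bak suffix to recover the original path
  cp "$BAK" "${BAK%.bak}"
fi
```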
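After `eval-factory run_eval` completes, scores land in `results.yml` under the `--output_dir` from the eval command. A guarded sketch for printing them, assuming `$PUZZLE_DIR` is set as elsewhere in the README (falls back to the current directory when unset):

```shell
# Path matches --output_dir in the eval command above.
RESULTS="${PUZZLE_DIR:-.}/evals/mmlu_anymodel/results.yml"
if [ -f "$RESULTS" ]; then
  cat "$RESULTS"   # full results, including per-task scores
else
  echo "no results yet at $RESULTS"
fi
```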
