> **Recommended approach:** Use lm-eval for direct evaluation without a
> deployment server. See the main [README](../README.md#evaluation) for details.

This document describes an alternative evaluation flow using NeMo Evaluator.
It deploys the checkpoint as a local OpenAI-compatible completions endpoint
and runs evaluation against it.

## Prerequisites

- NeMo container (e.g. `nemo:25.11`) with NeMo Evaluator and NeMo Export-Deploy
- Ray (`pip install -r examples/puzzletron/requirements.txt`)

## 1. Deploy the model (2 GPUs example)

```bash
# Install the AnyModel-patched deployable (first time only: backs up the original)
# /opt/Export-Deploy is the default path in NeMo containers; adjust if needed
cp /opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py /opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py.bak
cp examples/puzzletron/evaluation/hf_deployable_anymodel.py /opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py

# Start the server (blocks while running; use a separate terminal)
ray start --head --num-gpus 2 --port 6379 --disable-usage-stats
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path path/to/checkpoint \
    --model_id anymodel-hf \
    --num_gpus 2 --num_gpus_per_replica 2 --num_cpus_per_replica 16 \
    --trust_remote_code --port 8083 --device_map "auto" --cuda_visible_devices "0,1"
```

Adjust GPU counts and `cuda_visible_devices` to match your node.
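
Before launching the eval, it can help to confirm the endpoint answers. Below is a minimal smoke-test sketch; it assumes the server above is listening on port 8083 and was registered under `--model_id anymodel-hf`, and the prompt and sampling values are arbitrary:

```python
"""Smoke-test the deployed completions endpoint (sketch).

Assumes the deploy step above is running on http://0.0.0.0:8083 with
--model_id anymodel-hf; endpoint path and payload follow the OpenAI-style
completions convention the server exposes.
"""
import json
import os
import urllib.request

URL = "http://0.0.0.0:8083/v1/completions/"

# Minimal completions payload; "model" must match the --model_id used at deploy.
payload = {
    "model": "anymodel-hf",
    "prompt": "The capital of France is",
    "max_tokens": 8,
    "temperature": 0.0,
}


def smoke_test(url: str = URL) -> dict:
    """POST the payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    # Only hit the network when a server is actually up.
    if os.environ.get("RUN_SMOKE_TEST"):
        print(smoke_test())
    else:
        print(json.dumps(payload))
```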

## 2. Run MMLU

```bash
eval-factory run_eval \
    --eval_type mmlu \
    --model_id anymodel-hf \
    --model_type completions \
    --model_url http://0.0.0.0:8083/v1/completions/ \
    --output_dir examples/puzzletron/evals/mmlu_anymodel \
    --overrides "config.params.task=mmlu,config.params.extra.tokenizer=path/to/checkpoint,config.params.extra.tokenizer_backend=huggingface"
```

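When the run finishes, scores land under `--output_dir` (a `results.yml` file in this setup). A small sketch to locate the result files; `find_results` is a hypothetical helper, not part of eval-factory, and the exact directory layout may vary by version:

```python
"""Locate and print eval-factory result files (illustrative sketch).

find_results is a hypothetical helper; the layout under --output_dir
depends on the eval-factory version, so this just walks the tree.
"""
from pathlib import Path


def find_results(output_dir: str) -> list:
    """Return all results.yml files under output_dir, recursively."""
    return sorted(Path(output_dir).rglob("results.yml"))


if __name__ == "__main__":
    for path in find_results("examples/puzzletron/evals/mmlu_anymodel"):
        print(f"== {path} ==")
        print(path.read_text())
```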
For a quick debug run, add `,config.params.limit_samples=5` to `--overrides`.
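
The `--overrides` string is a comma-separated list of dotted `key=value` pairs that address nested config fields. As an illustration of that mapping, here is a simplified re-implementation — not eval-factory's actual parser, which may handle quoting and type coercion differently:

```python
"""Illustrative parser for a comma-separated overrides string.

Simplified sketch of how dotted key=value pairs map onto a nested
config; NOT eval-factory's actual implementation.
"""


def parse_overrides(s: str) -> dict:
    """Turn "a.b=1,a.c=x" into {"a": {"b": "1", "c": "x"}}."""
    config: dict = {}
    for pair in s.split(","):
        dotted, _, value = pair.partition("=")
        node = config
        *parents, leaf = dotted.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return config


if __name__ == "__main__":
    cfg = parse_overrides("config.params.task=mmlu,config.params.limit_samples=5")
    print(cfg["config"]["params"])  # {'task': 'mmlu', 'limit_samples': '5'}
```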