Recommended approach: Use lm-eval for direct evaluation without a deployment server. See the main README for details.
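For reference, a direct run with the lm-evaluation-harness CLI looks roughly like this (an illustrative sketch rather than the exact command from the main README; the checkpoint path, task, and output directory are placeholders):

# Hypothetical direct evaluation with lm-evaluation-harness (no deployment server required)
lm_eval --model hf \
  --model_args pretrained=path/to/checkpoint,trust_remote_code=True \
  --tasks mmlu \
  --batch_size 8 \
  --output_path examples/puzzletron/evals/mmlu_lm_eval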
Evaluate AnyModel checkpoints by deploying a local OpenAI-compatible completions endpoint and running benchmarks against it.
This flow requires Ray for serving the model and NeMo Export-Deploy (included in NeMo containers):
pip install -r examples/puzzletron/requirements.txt

1. Deploy the model (2 GPUs example):
The hf_deployable.py script from Export-Deploy needs to be patched for AnyModel. The easiest way is to mount the patched file when starting the container with docker run:
export MODELOPT_DIR=${PWD}/Model-Optimizer # or set to your local Model-Optimizer repository path if you have cloned it
if [ ! -d "${MODELOPT_DIR}" ]; then
git clone https://github.com/NVIDIA/Model-Optimizer.git ${MODELOPT_DIR}
fi
export DOCKER_IMAGE=nvcr.io/nvidia/nemo:26.02
docker run \
--gpus all \
--shm-size=16GB \
--net=host \
--ulimit memlock=-1 \
--rm -it \
-v ${MODELOPT_DIR}:/opt/Model-Optimizer \
-v ${MODELOPT_DIR}/modelopt:/opt/venv/lib/python3.12/site-packages/modelopt \
-v ${MODELOPT_DIR}/examples/puzzletron/evaluation/hf_deployable_anymodel.py:/opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py \
-w /opt/Model-Optimizer/examples/megatron_bridge \
${DOCKER_IMAGE} bash

Alternatively, you can manually update the file inside the container:
# Install the AnyModel-patched deployable (first time only: backs up the original)
# /opt/Export-Deploy is the default path in NeMo containers — adjust if needed
cp /opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py /opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py.bak
cp examples/puzzletron/evaluation/hf_deployable_anymodel.py /opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py
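If you need to undo the patch later, restore the backup created above (paths assume the default NeMo container layout):

# Restore the original deployable from the backup
cp /opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py.bak /opt/Export-Deploy/nemo_deploy/llm/hf_deployable.py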
Now start the Ray server and deploy the model:

# Start the server (blocks while running — use a separate terminal)
ray start --head --num-gpus 2 --port 6379 --disable-usage-stats
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
--model_path path/to/checkpoint \
--model_id anymodel-hf \
--num_gpus 2 --num_gpus_per_replica 2 --num_cpus_per_replica 16 \
--trust_remote_code --port 8083 --device_map "auto" --cuda_visible_devices "0,1"
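Before running the benchmark, you can sanity-check that the endpoint responds (a hedged example; the prompt and token count are arbitrary):

# Quick smoke test against the OpenAI-compatible completions endpoint
curl -s http://0.0.0.0:8083/v1/completions/ \
  -H "Content-Type: application/json" \
  -d '{"model": "anymodel-hf", "prompt": "The capital of France is", "max_tokens": 8}'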
2. Run MMLU:

eval-factory run_eval \
--eval_type mmlu \
--model_id anymodel-hf \
--model_type completions \
--model_url http://0.0.0.0:8083/v1/completions/ \
--output_dir examples/puzzletron/evals/mmlu_anymodel

For a quick debug run, add --overrides "config.params.limit_samples=5".
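When you are finished evaluating, you can shut down the Ray cluster started earlier (optional cleanup):

# Stop the local Ray cluster
ray stop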