Commit b0748dd

Vendor launching-evals and accessing-mlflow skills from NVIDIA-NeMo/Evaluator
Both are vendored verbatim from commit 01899f8 with SHA-pin provenance in frontmatter. `launching-evals` covers run/monitor/debug/analyze flows for NEL evaluations; `accessing-mlflow` covers MLflow run querying via mlflow-mcp. These complement (do not duplicate) our existing `evaluation` skill, which remains focused on config generation with ModelOpt-specific additions.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
1 parent 8176fc7 commit b0748dd

File tree: 10 files changed (+911, -0 lines)

---
name: accessing-mlflow
description: Query and browse evaluation results stored in MLflow. Use when the user wants to look up runs by invocation ID, compare metrics across models, fetch artifacts (configs, logs, results), or set up the MLflow MCP server. ALWAYS triggers on mentions of MLflow, experiment results, run comparison, invocation IDs in the context of results, or MLflow MCP setup.
license: Apache-2.0
# Vendored verbatim from NVIDIA NeMo Evaluator (commit 01899f8)
# https://github.com/NVIDIA-NeMo/Evaluator/tree/01899f89e8f31116efbca56e8f87fbd8513e24ac/packages/nemo-evaluator-launcher/.claude/skills/accessing-mlflow
# To re-sync: scripts/sync-upstream-skills.sh
# Note: this skill depends on the mlflow-mcp MCP server (https://github.com/kkruglik/mlflow-mcp)
# configured in the user's Claude Code setup.
---

# Accessing MLflow

## MCP Server

[mlflow-mcp](https://github.com/kkruglik/mlflow-mcp) gives agents direct access to MLflow — query runs, compare metrics, browse artifacts, all through natural language.

## ID Convention

When the user provides a hex ID (e.g. `71f3f3199ea5e1f0`) without specifying what it is, assume it is an **invocation_id** (not an MLflow run_id). An invocation_id identifies a launcher invocation and is stored as both a tag and a param on MLflow runs. One invocation can produce multiple MLflow runs (one per task). You may need to search across multiple experiments if you don't know which experiment the run belongs to.
## Querying Runs

```python
# Find runs by invocation_id
MLflow:search_runs_by_tags(experiment_id, {"invocation_id": "<invocation_id>"})

# Query runs for a given model or task
MLflow:query_runs(experiment_id, "tags.model LIKE '%<model>%'")
MLflow:query_runs(experiment_id, "tags.task_name LIKE '%<task_name>%'")

# Get a config from the run's artifacts
MLflow:get_artifact_content(run_id, "config.yml")

# Get nested stats from the run's artifacts
MLflow:get_artifact_content(run_id, "artifacts/eval_factory_metrics.json")
```
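The strings passed to `query_runs` use MLflow's standard search filter syntax (`tags.<key>` with `=` or `LIKE`). A small helper for composing tag-equality filters can keep these queries consistent; `tag_filter` is a hypothetical convenience, not part of mlflow-mcp:

```python
def tag_filter(**tags):
    """Compose an MLflow search filter_string from tag equality constraints.

    tag_filter(invocation_id="71f3f3199ea5e1f0") returns
    "tags.invocation_id = '71f3f3199ea5e1f0'".
    """
    # Sort keys so the output is deterministic.
    return " AND ".join(f"tags.{key} = '{value}'" for key, value in sorted(tags.items()))
```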
NOTE: You WILL NOT find PENDING, RUNNING, KILLED, or FAILED runs in MLflow! Only SUCCESSFUL runs are exported to MLflow.

## Workflow Tips

When comparing metrics across runs, fetch the data via MCP, then run the computation in Python for exact results rather than doing math in-context:

```bash
uv run --with pandas python3 << 'EOF'
import pandas as pd
# ... compute deltas, averages, etc.
EOF
```
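For instance, a minimal delta computation (assumes pandas; the scores here are made-up stand-ins for values fetched via the MCP tools above):

```python
import pandas as pd

# Hypothetical accuracies standing in for values fetched via MLflow:query_runs.
runs = pd.DataFrame([
    {"model": "model-a", "task": "mmlu", "accuracy": 0.71},
    {"model": "model-b", "task": "mmlu", "accuracy": 0.68},
    {"model": "model-a", "task": "gsm8k", "accuracy": 0.88},
    {"model": "model-b", "task": "gsm8k", "accuracy": 0.90},
])

# One row per task, one column per model, plus an exact per-task delta.
pivot = runs.pivot(index="task", columns="model", values="accuracy")
pivot["delta"] = pivot["model-a"] - pivot["model-b"]
print(pivot.round(3))
```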
## Artifacts Structure

```
<harness>.<task>/
├── artifacts/
│   ├── config.yml                       # Fully resolved config used during the evaluation
│   ├── launcher_unresolved_config.yaml  # Unresolved config passed to the launcher
│   ├── results.yml                      # All results in YAML format
│   ├── eval_factory_metrics.json        # Runtime stats (latency, token counts, memory)
│   ├── report.html                      # Request-response pair samples in HTML format (if enabled)
│   └── report.json                      # Request-response pair samples in JSON format (if enabled)
└── logs/
    ├── client-*.log                     # Evaluation client
    ├── server-*-N.log                   # Deployment per node
    ├── slurm-*.log                      # Slurm job
    └── proxy-*.log                      # Request proxy
```
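Once an `artifacts/` directory is copied locally, the two machine-readable files load in a few lines. A sketch assuming PyYAML is available; `load_artifacts` is a hypothetical helper, and result schemas vary by harness:

```python
import json
from pathlib import Path

import yaml  # PyYAML; assumed available in the analysis environment


def load_artifacts(artifacts_dir):
    """Read results.yml and eval_factory_metrics.json from a copied artifacts/ dir."""
    art = Path(artifacts_dir)
    results = yaml.safe_load((art / "results.yml").read_text())
    metrics = json.loads((art / "eval_factory_metrics.json").read_text())
    return results, metrics
```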
## Troubleshooting

If the MLflow MCP server fails to load or its tools are unavailable:

1. **`uvx` not found** — install [uv](https://docs.astral.sh/uv/getting-started/installation/):
   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   ```
2. **MCP server not configured** — add the config and restart the agent:

   **For Claude Code** — add to `.claude/settings.json` (project or user level), under `"mcpServers"`:
   ```json
   "MLflow": {
     "command": "uvx",
     "args": ["mlflow-mcp"],
     "env": {
       "MLFLOW_TRACKING_URI": "https://<your-mlflow-server>/"
     }
   }
   ```

   **For Cursor** — edit `~/.cursor/mcp.json` (Settings > Tools & MCP > New MCP Server):
   ```json
   {
     "mcpServers": {
       "MLflow": {
         "command": "uvx",
         "args": ["mlflow-mcp"],
         "env": {
           "MLFLOW_TRACKING_URI": "https://<your-mlflow-server>/"
         }
       }
     }
   }
   ```
---
name: launching-evals
description: Run, monitor, analyze, and debug LLM evaluations via nemo-evaluator-launcher. Covers running evaluations, checking status and live progress, debugging failed runs, exporting artifacts and logs, and analyzing results. ALWAYS triggers on mentions of running evaluations, checking progress, debugging failed evals, analyzing or analysing runs or results, run directories or artifact paths on clusters, Slurm job issues, invocation IDs, or inspecting logs (client logs, server logs, SSH to cluster, tail logs, grep logs). Do NOT use for creating or modifying evaluation configs.
license: Apache-2.0
# Vendored verbatim from NVIDIA NeMo Evaluator (commit 01899f8)
# https://github.com/NVIDIA-NeMo/Evaluator/tree/01899f89e8f31116efbca56e8f87fbd8513e24ac/packages/nemo-evaluator-launcher/.claude/skills/launching-evals
# To re-sync: scripts/sync-upstream-skills.sh
---

# NeMo Evaluator Skill

## Quick Reference

### nemo-evaluator-launcher CLI

```bash
# Run evaluation
uv run nemo-evaluator-launcher run --config <path.yaml>
uv run nemo-evaluator-launcher run --config <path.yaml> -t <task_name>   # run a single task by name
uv run nemo-evaluator-launcher run --config <path.yaml> -t <task_name_1> -t <task_name_2> ...
uv run nemo-evaluator-launcher run --config <path.yaml> -o evaluation.nemo_evaluator_config.config.params.limit_samples=10 ...

# Preview the resolved config and the sbatch script without running the evaluation
uv run nemo-evaluator-launcher run --config <path.yaml> --dry-run

# Check status (--json for machine-readable output)
uv run nemo-evaluator-launcher status <invocation_id> --json

# Get evaluation run info (output paths, Slurm job IDs, cluster hostname, etc.)
uv run nemo-evaluator-launcher info <invocation_id>

# Copy just the logs (quick — good for debugging)
uv run nemo-evaluator-launcher info <invocation_id> --copy-logs ./evaluation-results/

# For artifacts: use `nel info` to discover paths. If remote, SSH to explore and rsync what you need.
# If local, just read directly from the paths shown by `nel info`.
# ssh <user>@<hostname> "ls <artifacts_path>/"
# rsync -avzP <user>@<hostname>:<artifacts_path>/{results.yml,eval_factory_metrics.json,config.yml} ./evaluation-results/<invocation_id>.<job_index>/artifacts/

# List past runs
uv run nemo-evaluator-launcher ls runs --since 1d

# List available evaluation tasks (by default, only shows tasks from the latest released containers)
uv run nemo-evaluator-launcher ls tasks
uv run nemo-evaluator-launcher ls tasks --from_container gitlab-master.nvidia.com/dl/joc/competitive_evaluation/nvidia-core-evals/ci-llm/long-context-eval:dev-2025-12-16T14-37-1693de28-amd64
```
## Workflow

The complete evaluation workflow is divided into the following steps, which you should follow IN ORDER.

1. Create or modify a config using the `nel-assistant` skill. If the user provides a past run, use its `config.yml` artifact as a starting point.
2. Run the evaluation. See `references/run-evaluation.md` when executing this step.
3. Check progress (while RUNNING). See `references/check-progress.md` when executing this step.
4. Post-run actions (when a terminal state is reached):
   1. When the evaluation status is `SUCCESS`, analyze the results. See `references/analyze-results.md` when executing this step.
   2. When the evaluation status is `FAILED`, debug the failed run. See `references/debug-failed-runs.md` when executing this step.
# Key Facts

- Benchmark-specific info learned during launching/analyzing evals should be added to `references/benchmarks/`.
- **PPP** = Slurm account (the `account` field in cluster_config.yaml). When the user says "change PPP to X", update the account value (e.g., `coreai_dlalgo_compeval` → `coreai_dlalgo_llm`).
- **Slurm job pairs**: NEL (nemo-evaluator-launcher) submits paired Slurm jobs — a RUNNING job + a PENDING restart job (for when the 4h walltime expires). Never cancel the pending restart jobs — they are expected and necessary.
- **HF cache requirement**: For configs with `HF_HUB_OFFLINE=1`, models must be pre-downloaded to the HF cache on each cluster before launching. **Before running a model on a new cluster, always ask the user if the model is already cached there.** If not, on the cluster login node: `python3 -m venv hf_cli && source hf_cli/bin/activate && pip install huggingface_hub`, then `HF_HOME=/lustre/fsw/portfolios/coreai/users/<username>/cache/huggingface hf download <model>`. Without this, vLLM will fail with `LocalEntryNotFoundError`.
- **`data_parallel_size` is per node**: `dp_size=1` with `num_nodes=8` means 8 model instances total (one per node), load-balanced by haproxy. Do NOT interpret `dp_size` as the global replica count.
- **`payload_modifier` interceptor**: The `params_to_remove` list (e.g. `[max_tokens, max_completion_tokens]`) strips those fields from the outgoing payload, intentionally lifting output length limits so reasoning models can think as long as they need.
- **Auto-export git workaround**: The export container (`python:3.12-slim`) lacks `git`. When installing the launcher from a git URL, set `auto_export.launcher_install_cmd` to install git first (e.g., `apt-get update -qq && apt-get install -qq -y git && pip install "nemo-evaluator-launcher[all] @ git+...#subdirectory=packages/nemo-evaluator-launcher"`).
- **Do NOT use `nemo-evaluator-launcher export --dest local`** — it only writes a summary JSON (`processed_results.json`); it does NOT copy actual logs or artifacts despite accepting `--copy_logs` and `--copy-artifacts` flags. `nel info --copy-artifacts` works but copies everything (very slow for large benchmarks). Preferred approach: use `nel info` to discover paths — if local, read directly; if remote, SSH to explore and rsync only what you need. Note that `nel info` prints standard artifacts, but benchmarks produce additional artifacts in subdirs — explore to find them.
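The per-node semantics of `data_parallel_size` can be made concrete with the numbers from the bullet above:

```python
# dp_size counts replicas PER NODE; the global replica count multiplies in num_nodes.
num_nodes = 8
data_parallel_size = 1  # per node

total_instances = num_nodes * data_parallel_size
print(total_instances)  # 8 model instances, one per node, load-balanced by haproxy
```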
# Analyze the results

Copy this checklist and track your progress:

```
Analysis progress:
- [ ] Step 1: Gather information
- [ ] Step 2: Scan logs for runtime problems (per run)
- [ ] Step 3: Validate config and methodology (per run)
- [ ] Step 4: Report findings
```

Steps 2-3 are executed for EACH run separately.
## Step 1: Gather information

**IMPORTANT**: Copy what you need (and only what you need) locally BEFORE analysis — each SSH command requires user approval, so remote one-by-one reads are disruptive, and copying too much is slow.

- Get one or more successful invocation IDs to analyze from the user. You might already have the invocation ID in your memory from the previous step.
- Get paths: `uv run nemo-evaluator-launcher info <invocation_id>`
- If artifacts are local, read them directly from the paths shown by `nel info`.
- If artifacts are remote:
  - Copy logs: `uv run nemo-evaluator-launcher info <invocation_id> --copy-logs ./evaluation-results/`
  - Rsync analysis-relevant artifacts: `rsync -avzP <user>@<host>:<artifacts_path>/{results.yml,eval_factory_metrics.json,config.yml} ./evaluation-results/<invocation_id>.<job_index>/artifacts/`
- For MLflow access, see the `accessing-mlflow` skill.
- Read benchmark-specific analysis notes from `references/benchmarks/` if available for the evaluated benchmarks.
- For Terminal Bench agent trace analysis, follow the procedure in `references/benchmarks/terminal-bench-trace-analysis.md`.
## Step 2: Scan logs for runtime problems

Access logs from the locally copied files (`./evaluation-results/<invocation_id>.<job_index>/logs/`). Do NOT read logs via SSH — use the local copies from Step 1.

Check logs for silent errors that may invalidate results:

1. **Tool calling failures**: Search `client-*.log` for "failed" tests, `server-*.log` for "invalid tool call"
2. **Unfinished reasoning**: Check `server-*.log` for `finish_reason: length`, or truncation warnings in `client-*.log`
3. **API errors**: HTTP status != 200 in `client-*.log`; trace to `server-*.log` or `proxy-*.log`
4. **Config mismatches**: Compare `config.yml` params with actual values in the `server-*.log` startup and the `client-*.log` command
5. **Performance anomalies**: Low throughput, 0% prefix cache hit rate in `server-*.log`
6. **Cached responses**: Count "Returning cached response" in `client-*.log`
7. **KV cache preemptions**: Search `server-*.log` for `PreemptionMode.RECOMPUTE`. If found, consider increasing `tensor_parallel_size` (even at the cost of `data_parallel_size`) to relieve KV cache memory pressure.
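Several of these checks reduce to greps over the copied logs. A sketch of such a scan; `scan_logs` is a hypothetical helper, and the path in the usage comment is the Step 1 copy layout:

```shell
# Hedged sketch: grep a locally copied logs directory for a few of the
# silent-error patterns above (unfinished reasoning, cached responses,
# KV cache preemptions). Extend with the other patterns as needed.
scan_logs() {
  local logs="$1"
  grep -l "finish_reason: length" "$logs"/server-*.log 2>/dev/null && echo "^ unfinished reasoning"
  grep -c "Returning cached response" "$logs"/client-*.log 2>/dev/null   # cached-response count per file
  grep -l "PreemptionMode.RECOMPUTE" "$logs"/server-*.log 2>/dev/null && echo "^ KV cache preemptions"
  return 0
}

# Usage (path placeholder):
# scan_logs ./evaluation-results/<invocation_id>.<job_index>/logs
```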
## Step 3: Validate config and methodology

1. **Methodology consistency**: Verify the same benchmark versions, prompt templates, sampling params, and infrastructure across all models. Flag discrepancies.
2. **HF model card compliance**: Read the model's HuggingFace model card. Flag any deviations in inference parameters (temperature, top_p, max_new_tokens, deployment args, reasoning flags, etc.).
3. **Reasoning model validation**: Verify temp > 0, top_p > 0, `max_tokens` = null (allow full output length).
   NOTE: `use_reasoning: False` in adapter_config does NOT mean reasoning is disabled — it only controls the reasoning interceptor. Whether reasoning is active depends on the model's own controls (deployment args, system prompt, API payload fields, etc.).
4. **Non-reasoning model validation**: Verify `max_tokens` = 16k.
5. **Max model length**: Verify `max-model-len` = 131072 (leaderboard-recommended). Long-context benchmarks (AA LCR, RULER) and agentic benchmarks may require a longer `max-model-len`.
6. **RULER tasks**: Check that thinking is disabled, walltime = 4h, and rope-scaling is set for Qwen models.
7. **AA baseline comparison**: Compare results against Artificial Analysis published scores. An exact match is not expected — flag significant deviations.
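The reasoning-model checks above can be mechanized. A sketch; `check_reasoning_sampling` is a hypothetical helper, and the key names mirror common sampling params rather than the exact resolved-config schema:

```python
def check_reasoning_sampling(params):
    """Return a list of problems with a reasoning model's sampling params.

    Empty list means the params pass the Step 3 reasoning-model checks.
    """
    problems = []
    # temp > 0 and top_p > 0 are required; treat missing values as failing.
    if not (params.get("temperature") or 0) > 0:
        problems.append("temperature must be > 0 for reasoning models")
    if not (params.get("top_p") or 0) > 0:
        problems.append("top_p must be > 0 for reasoning models")
    # max_tokens must be null so the model can use its full output length.
    if params.get("max_tokens") is not None:
        problems.append("max_tokens should be null to allow full output length")
    return problems
```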
## Step 4: Report findings

Present key metrics from `results.yml` in a table and summarize the metrics from `eval_factory_metrics.json` concisely (include only the most important metrics or anomalies). If there are multiple runs, include a side-by-side comparison of metrics (e.g. accuracy, latency, token counts, memory). Summarize any issues found. Recommend improvements if applicable.
