Commit b0748dd

Vendor launching-evals and accessing-mlflow skills from NVIDIA-NeMo/Evaluator
Both are vendored verbatim from commit 01899f8 with SHA-pin provenance in frontmatter. `launching-evals` covers run/monitor/debug/analyze flows for NEL evaluations; `accessing-mlflow` covers MLflow run querying via mlflow-mcp. These complement (do not duplicate) our existing `evaluation` skill, which remains focused on config generation with ModelOpt-specific additions.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
1 parent 8176fc7 commit b0748dd

File tree: 10 files changed (+911, -0 lines)

---
name: accessing-mlflow
description: Query and browse evaluation results stored in MLflow. Use when the user wants to look up runs by invocation ID, compare metrics across models, fetch artifacts (configs, logs, results), or set up the MLflow MCP server. ALWAYS triggers on mentions of MLflow, experiment results, run comparison, invocation IDs in the context of results, or MLflow MCP setup.
license: Apache-2.0
# Vendored verbatim from NVIDIA NeMo Evaluator (commit 01899f8)
# https://github.com/NVIDIA-NeMo/Evaluator/tree/01899f89e8f31116efbca56e8f87fbd8513e24ac/packages/nemo-evaluator-launcher/.claude/skills/accessing-mlflow
# To re-sync: scripts/sync-upstream-skills.sh
# Note: this skill depends on the mlflow-mcp MCP server (https://github.com/kkruglik/mlflow-mcp)
# configured in the user's Claude Code setup.
---

# Accessing MLflow

## MCP Server

[mlflow-mcp](https://github.com/kkruglik/mlflow-mcp) gives agents direct access to MLflow — query runs, compare metrics, browse artifacts, all through natural language.

## ID Convention

When the user provides a hex ID (e.g. `71f3f3199ea5e1f0`) without specifying what it is, assume it is an **invocation_id** (not an MLflow run_id). An invocation_id identifies a launcher invocation and is stored as both a tag and a param on MLflow runs. One invocation can produce multiple MLflow runs (one per task). You may need to search across multiple experiments if you don't know which experiment the run belongs to.
## Querying Runs

```python
# Find runs by invocation_id
MLflow:search_runs_by_tags(experiment_id, {"invocation_id": "<invocation_id>"})

# Query runs for a given model or task
MLflow:query_runs(experiment_id, "tags.model LIKE '%<model>%'")
MLflow:query_runs(experiment_id, "tags.task_name LIKE '%<task_name>%'")

# Get a config from the run's artifacts
MLflow:get_artifact_content(run_id, "config.yml")

# Get nested stats from the run's artifacts
MLflow:get_artifact_content(run_id, "artifacts/eval_factory_metrics.json")
```
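The strings passed to `query_runs` use MLflow's standard search filter syntax (`tags.<key>` with `=` or `LIKE`). A small helper for composing tag-equality filters can keep these queries consistent; `tag_filter` is a hypothetical convenience, not part of mlflow-mcp:

```python
def tag_filter(**tags):
    """Compose an MLflow search filter_string from tag equality constraints.

    tag_filter(invocation_id="71f3f3199ea5e1f0") returns
    "tags.invocation_id = '71f3f3199ea5e1f0'".
    """
    # Sort keys so the output is deterministic.
    return " AND ".join(f"tags.{key} = '{value}'" for key, value in sorted(tags.items()))
```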
NOTE: You WILL NOT find PENDING, RUNNING, KILLED, or FAILED runs in MLflow! Only SUCCESSFUL runs are exported to MLflow.

## Workflow Tips

When comparing metrics across runs, fetch the data via MCP, then run the computation in Python for exact results rather than doing math in-context:

```bash
uv run --with pandas python3 << 'EOF'
import pandas as pd
# ... compute deltas, averages, etc.
EOF
```
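For instance, a minimal delta computation (assumes pandas; the scores here are made-up stand-ins for values fetched via the MCP tools above):

```python
import pandas as pd

# Hypothetical accuracies standing in for values fetched via MLflow:query_runs.
runs = pd.DataFrame([
    {"model": "model-a", "task": "mmlu", "accuracy": 0.71},
    {"model": "model-b", "task": "mmlu", "accuracy": 0.68},
    {"model": "model-a", "task": "gsm8k", "accuracy": 0.88},
    {"model": "model-b", "task": "gsm8k", "accuracy": 0.90},
])

# One row per task, one column per model, plus an exact per-task delta.
pivot = runs.pivot(index="task", columns="model", values="accuracy")
pivot["delta"] = pivot["model-a"] - pivot["model-b"]
print(pivot.round(3))
```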
## Artifacts Structure

```
<harness>.<task>/
├── artifacts/
│   ├── config.yml                       # Fully resolved config used during the evaluation
│   ├── launcher_unresolved_config.yaml  # Unresolved config passed to the launcher
│   ├── results.yml                      # All results in YAML format
│   ├── eval_factory_metrics.json        # Runtime stats (latency, token counts, memory)
│   ├── report.html                      # Request-response pair samples in HTML format (if enabled)
│   └── report.json                      # Request-response pair samples in JSON format (if enabled)
└── logs/
    ├── client-*.log                     # Evaluation client
    ├── server-*-N.log                   # Deployment per node
    ├── slurm-*.log                      # Slurm job
    └── proxy-*.log                      # Request proxy
```
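Once an `artifacts/` directory is copied locally, the two machine-readable files load in a few lines. A sketch assuming PyYAML is available; `load_artifacts` is a hypothetical helper, and result schemas vary by harness:

```python
import json
from pathlib import Path

import yaml  # PyYAML; assumed available in the analysis environment


def load_artifacts(artifacts_dir):
    """Read results.yml and eval_factory_metrics.json from a copied artifacts/ dir."""
    art = Path(artifacts_dir)
    results = yaml.safe_load((art / "results.yml").read_text())
    metrics = json.loads((art / "eval_factory_metrics.json").read_text())
    return results, metrics
```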
## Troubleshooting

If the MLflow MCP server fails to load or its tools are unavailable:

1. **`uvx` not found** — install [uv](https://docs.astral.sh/uv/getting-started/installation/):
   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   ```
2. **MCP server not configured** — add the config and restart the agent:

   **For Claude Code** — add to `.claude/settings.json` (project or user level), under `"mcpServers"`:
   ```json
   "MLflow": {
     "command": "uvx",
     "args": ["mlflow-mcp"],
     "env": {
       "MLFLOW_TRACKING_URI": "https://<your-mlflow-server>/"
     }
   }
   ```

   **For Cursor** — edit `~/.cursor/mcp.json` (Settings > Tools & MCP > New MCP Server):
   ```json
   {
     "mcpServers": {
       "MLflow": {
         "command": "uvx",
         "args": ["mlflow-mcp"],
         "env": {
           "MLFLOW_TRACKING_URI": "https://<your-mlflow-server>/"
         }
       }
     }
   }
   ```
---
name: launching-evals
description: Run, monitor, analyze, and debug LLM evaluations via nemo-evaluator-launcher. Covers running evaluations, checking status and live progress, debugging failed runs, exporting artifacts and logs, and analyzing results. ALWAYS triggers on mentions of running evaluations, checking progress, debugging failed evals, analyzing or analysing runs or results, run directories or artifact paths on clusters, Slurm job issues, invocation IDs, or inspecting logs (client logs, server logs, SSH to cluster, tail logs, grep logs). Do NOT use for creating or modifying evaluation configs.
license: Apache-2.0
# Vendored verbatim from NVIDIA NeMo Evaluator (commit 01899f8)
# https://github.com/NVIDIA-NeMo/Evaluator/tree/01899f89e8f31116efbca56e8f87fbd8513e24ac/packages/nemo-evaluator-launcher/.claude/skills/launching-evals
# To re-sync: scripts/sync-upstream-skills.sh
---

# NeMo Evaluator Skill

## Quick Reference

### nemo-evaluator-launcher CLI

```bash
# Run evaluation
uv run nemo-evaluator-launcher run --config <path.yaml>
uv run nemo-evaluator-launcher run --config <path.yaml> -t <task_name>   # run a single task by name
uv run nemo-evaluator-launcher run --config <path.yaml> -t <task_name_1> -t <task_name_2> ...
uv run nemo-evaluator-launcher run --config <path.yaml> -o evaluation.nemo_evaluator_config.config.params.limit_samples=10 ...

# Preview the resolved config and the sbatch script without running the evaluation
uv run nemo-evaluator-launcher run --config <path.yaml> --dry-run

# Check status (--json for machine-readable output)
uv run nemo-evaluator-launcher status <invocation_id> --json

# Get evaluation run info (output paths, Slurm job IDs, cluster hostname, etc.)
uv run nemo-evaluator-launcher info <invocation_id>

# Copy just the logs (quick — good for debugging)
uv run nemo-evaluator-launcher info <invocation_id> --copy-logs ./evaluation-results/

# For artifacts: use `nel info` to discover paths. If remote, SSH to explore and rsync what you need.
# If local, just read directly from the paths shown by `nel info`.
# ssh <user>@<hostname> "ls <artifacts_path>/"
# rsync -avzP <user>@<hostname>:<artifacts_path>/{results.yml,eval_factory_metrics.json,config.yml} ./evaluation-results/<invocation_id>.<job_index>/artifacts/

# List past runs
uv run nemo-evaluator-launcher ls runs --since 1d

# List available evaluation tasks (by default, only shows tasks from the latest released containers)
uv run nemo-evaluator-launcher ls tasks
uv run nemo-evaluator-launcher ls tasks --from_container gitlab-master.nvidia.com/dl/joc/competitive_evaluation/nvidia-core-evals/ci-llm/long-context-eval:dev-2025-12-16T14-37-1693de28-amd64
```
## Workflow

The complete evaluation workflow is divided into the following steps, which you should follow IN ORDER.

1. Create or modify a config using the `nel-assistant` skill. If the user provides a past run, use its `config.yml` artifact as a starting point.
2. Run the evaluation. See `references/run-evaluation.md` when executing this step.
3. Check progress (while RUNNING). See `references/check-progress.md` when executing this step.
4. Post-run actions (when a terminal state is reached):
   1. When the evaluation status is `SUCCESS`, analyze the results. See `references/analyze-results.md` when executing this step.
   2. When the evaluation status is `FAILED`, debug the failed run. See `references/debug-failed-runs.md` when executing this step.
# Key Facts

- Benchmark-specific info learned during launching/analyzing evals should be added to `references/benchmarks/`.
- **PPP** = Slurm account (the `account` field in cluster_config.yaml). When the user says "change PPP to X", update the account value (e.g., `coreai_dlalgo_compeval` → `coreai_dlalgo_llm`).
- **Slurm job pairs**: NEL (nemo-evaluator-launcher) submits paired Slurm jobs — a RUNNING job + a PENDING restart job (for when the 4h walltime expires). Never cancel the pending restart jobs — they are expected and necessary.
- **HF cache requirement**: For configs with `HF_HUB_OFFLINE=1`, models must be pre-downloaded to the HF cache on each cluster before launching. **Before running a model on a new cluster, always ask the user if the model is already cached there.** If not, on the cluster login node: `python3 -m venv hf_cli && source hf_cli/bin/activate && pip install huggingface_hub`, then `HF_HOME=/lustre/fsw/portfolios/coreai/users/<username>/cache/huggingface hf download <model>`. Without this, vLLM will fail with `LocalEntryNotFoundError`.
- **`data_parallel_size` is per node**: `dp_size=1` with `num_nodes=8` means 8 model instances total (one per node), load-balanced by haproxy. Do NOT interpret `dp_size` as the global replica count.
- **`payload_modifier` interceptor**: The `params_to_remove` list (e.g. `[max_tokens, max_completion_tokens]`) strips those fields from the outgoing payload, intentionally lifting output length limits so reasoning models can think as long as they need.
- **Auto-export git workaround**: The export container (`python:3.12-slim`) lacks `git`. When installing the launcher from a git URL, set `auto_export.launcher_install_cmd` to install git first (e.g., `apt-get update -qq && apt-get install -qq -y git && pip install "nemo-evaluator-launcher[all] @ git+...#subdirectory=packages/nemo-evaluator-launcher"`).
- **Do NOT use `nemo-evaluator-launcher export --dest local`** — it only writes a summary JSON (`processed_results.json`); it does NOT copy actual logs or artifacts despite accepting `--copy_logs` and `--copy-artifacts` flags. `nel info --copy-artifacts` works but copies everything (very slow for large benchmarks). Preferred approach: use `nel info` to discover paths — if local, read directly; if remote, SSH to explore and rsync only what you need. Note that `nel info` prints standard artifacts, but benchmarks produce additional artifacts in subdirs — explore to find them.
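The per-node semantics of `data_parallel_size` can be made concrete with the numbers from the bullet above:

```python
# dp_size counts replicas PER NODE; the global replica count multiplies in num_nodes.
num_nodes = 8
data_parallel_size = 1  # per node

total_instances = num_nodes * data_parallel_size
print(total_instances)  # 8 model instances, one per node, load-balanced by haproxy
```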
# Analyze the results

Copy this checklist and track your progress:

```
Analysis progress:
- [ ] Step 1: Gather information
- [ ] Step 2: Scan logs for runtime problems (per run)
- [ ] Step 3: Validate config and methodology (per run)
- [ ] Step 4: Report findings
```

Steps 2-3 are executed for EACH run separately.
## Step 1: Gather information

**IMPORTANT**: Copy what you need (and only what you need) locally BEFORE analysis — each SSH command requires user approval, so remote one-by-one reads are disruptive, and copying too much is slow.

- Get one or more successful invocation IDs to analyze from the user. You might already have the invocation ID in your memory from the previous step.
- Get paths: `uv run nemo-evaluator-launcher info <invocation_id>`
- If artifacts are local, read them directly from the paths shown by `nel info`.
- If artifacts are remote:
  - Copy logs: `uv run nemo-evaluator-launcher info <invocation_id> --copy-logs ./evaluation-results/`
  - Rsync analysis-relevant artifacts: `rsync -avzP <user>@<host>:<artifacts_path>/{results.yml,eval_factory_metrics.json,config.yml} ./evaluation-results/<invocation_id>.<job_index>/artifacts/`
- For MLflow access, see the `accessing-mlflow` skill.
- Read benchmark-specific analysis notes from `references/benchmarks/` if available for the evaluated benchmarks.
- For Terminal Bench agent trace analysis, follow the procedure in `references/benchmarks/terminal-bench-trace-analysis.md`.
## Step 2: Scan logs for runtime problems

Access logs from the locally copied files (`./evaluation-results/<invocation_id>.<job_index>/logs/`). Do NOT read logs via SSH — use the local copies from Step 1.

Check logs for silent errors that may invalidate results:

1. **Tool calling failures**: Search `client-*.log` for "failed" tests, `server-*.log` for "invalid tool call"
2. **Unfinished reasoning**: Check `server-*.log` for `finish_reason: length`, or truncation warnings in `client-*.log`
3. **API errors**: HTTP status != 200 in `client-*.log`; trace to `server-*.log` or `proxy-*.log`
4. **Config mismatches**: Compare `config.yml` params with actual values in the `server-*.log` startup and the `client-*.log` command
5. **Performance anomalies**: Low throughput, 0% prefix cache hit rate in `server-*.log`
6. **Cached responses**: Count "Returning cached response" in `client-*.log`
7. **KV cache preemptions**: Search `server-*.log` for `PreemptionMode.RECOMPUTE`. If found, consider increasing `tensor_parallel_size` (even at the cost of `data_parallel_size`) to relieve KV cache memory pressure.
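Several of these checks reduce to greps over the copied logs. A sketch of such a scan; `scan_logs` is a hypothetical helper, and the path in the usage comment is the Step 1 copy layout:

```shell
# Hedged sketch: grep a locally copied logs directory for a few of the
# silent-error patterns above (unfinished reasoning, cached responses,
# KV cache preemptions). Extend with the other patterns as needed.
scan_logs() {
  local logs="$1"
  grep -l "finish_reason: length" "$logs"/server-*.log 2>/dev/null && echo "^ unfinished reasoning"
  grep -c "Returning cached response" "$logs"/client-*.log 2>/dev/null   # cached-response count per file
  grep -l "PreemptionMode.RECOMPUTE" "$logs"/server-*.log 2>/dev/null && echo "^ KV cache preemptions"
  return 0
}

# Usage (path placeholder):
# scan_logs ./evaluation-results/<invocation_id>.<job_index>/logs
```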
## Step 3: Validate config and methodology

1. **Methodology consistency**: Verify the same benchmark versions, prompt templates, sampling params, and infrastructure across all models. Flag discrepancies.
2. **HF model card compliance**: Read the model's HuggingFace model card. Flag any deviations in inference parameters (temperature, top_p, max_new_tokens, deployment args, reasoning flags, etc.).
3. **Reasoning model validation**: Verify temp > 0, top_p > 0, `max_tokens` = null (allow full output length).
   NOTE: `use_reasoning: False` in adapter_config does NOT mean reasoning is disabled — it only controls the reasoning interceptor. Whether reasoning is active depends on the model's own controls (deployment args, system prompt, API payload fields, etc.).
4. **Non-reasoning model validation**: Verify `max_tokens` = 16k.
5. **Max model length**: Verify `max-model-len` = 131072 (leaderboard-recommended). Long-context benchmarks (AA LCR, RULER) and agentic benchmarks may require a longer `max-model-len`.
6. **RULER tasks**: Check that thinking is disabled, walltime = 4h, and rope-scaling is set for Qwen models.
7. **AA baseline comparison**: Compare results against Artificial Analysis published scores. An exact match is not expected — flag significant deviations.
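The reasoning-model checks above can be mechanized. A sketch; `check_reasoning_sampling` is a hypothetical helper, and the key names mirror common sampling params rather than the exact resolved-config schema:

```python
def check_reasoning_sampling(params):
    """Return a list of problems with a reasoning model's sampling params.

    Empty list means the params pass the Step 3 reasoning-model checks.
    """
    problems = []
    # temp > 0 and top_p > 0 are required; treat missing values as failing.
    if not (params.get("temperature") or 0) > 0:
        problems.append("temperature must be > 0 for reasoning models")
    if not (params.get("top_p") or 0) > 0:
        problems.append("top_p must be > 0 for reasoning models")
    # max_tokens must be null so the model can use its full output length.
    if params.get("max_tokens") is not None:
        problems.append("max_tokens should be null to allow full output length")
    return problems
```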
## Step 4: Report findings

Present key metrics from `results.yml` in a table and summarize the metrics from `eval_factory_metrics.json` concisely (include only the most important metrics or anomalies). If there are multiple runs, include a side-by-side comparison of metrics (e.g. accuracy, latency, token counts, memory). Summarize any issues found. Recommend improvements if applicable.
