From ad2856b1594c0623e74d95dc60e94cafd30ca687 Mon Sep 17 00:00:00 2001
From: Moshe Abramovitch
Date: Tue, 2 Jun 2026 15:36:17 -0500
Subject: [PATCH] chore: sync clean skills (Data Designer + NeMo MBridge re-sign + cupynumeric-hdf5)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Cherry-picked from automated/sync-skills sync run 26844622126. Lands only
skills whose skill.oms.sig was refreshed alongside content changes. The
15 still-drifted skills (VSS x 13, nemoclaw-user-reference,
nemo-automodel-distributed-training) remain held until their teams
re-sign on the source repos.

CLEAN (24 dirs):
- data-designer (new — PR #108 source content)
- cupynumeric-hdf5 (Manolis re-signed)
- 22 nemo-mbridge-* (Chen Cui's team re-signed)

README regenerated.

Signed-off-by: Moshe Abramovitch
---
 README.md | 2 +
 skills/cupynumeric-hdf5/BENCHMARK.md | 12 +-
 skills/cupynumeric-hdf5/evals/evals.json | 8 +-
 skills/cupynumeric-hdf5/skill-card.md | 26 ++---
 skills/cupynumeric-hdf5/skill.oms.sig | 2 +-
 skills/data-designer/BENCHMARK.md | 82 ++++++++++++++
 skills/data-designer/SKILL.md | 94 ++++++++++++++++
 skills/data-designer/evals/evals.json | 13 +++
 .../references/person-sampling.md | 46 ++++++++
 .../references/preview-review.md | 30 +++++
 .../data-designer/references/seed-datasets.md | 14 +++
 .../scripts/get_person_object_schema.py | 48 ++++++++
 skills/data-designer/skill-card.md | 78 +++++++++++++
 skills/data-designer/skill.oms.sig | 1 +
 skills/data-designer/workflows/autopilot.md | 29 +++++
 skills/data-designer/workflows/interactive.md | 36 ++++++
 .../BENCHMARK.md | 36 +++++-
 .../nemo-mbridge-mlm-bridge-training/SKILL.md | 15 +++
 .../evals/evals.json | 18 ++-
 .../skill-card.md | 34 +++++-
 .../skill.oms.sig | 2 +-
 .../BENCHMARK.md | 58 ++++++----
 skills/nemo-mbridge-multi-node-slurm/SKILL.md | 105 ++++--------------
 .../evals/evals.json | 18 ++-
 .../references/templates.md | 7 +-
 .../skill-card.md | 38 +++++--
 .../skill.oms.sig | 2 +-
 .../BENCHMARK.md | 43 +++++--
 .../SKILL.md | 29 +++--
 .../evals/evals.json | 17 ++-
 .../skill-card.md | 35 +++++-
 .../skill.oms.sig | 2 +-
 .../BENCHMARK.md | 58 ++++++----
 .../nemo-mbridge-perf-cpu-offloading/SKILL.md | 75 -------------
 .../evals/evals.json | 17 ++-
 .../skill-card.md | 40 ++++++-
 .../skill.oms.sig | 2 +-
 .../BENCHMARK.md | 42 +++++--
 skills/nemo-mbridge-perf-cuda-graphs/SKILL.md | 2 +-
 .../evals/evals.json | 17 ++-
 .../skill-card.md | 36 ++++--
 .../skill.oms.sig | 2 +-
 .../BENCHMARK.md | 51 ++++++---
 .../SKILL.md | 60 ++++++----
 .../evals/evals.json | 18 ++-
 .../skill-card.md | 33 +++++-
 .../skill.oms.sig | 2 +-
 .../BENCHMARK.md | 53 ++++++---
 .../SKILL.md | 20 +---
 .../evals/evals.json | 18 ++-
 .../skill-card.md | 42 ++++++-
 .../skill.oms.sig | 2 +-
 .../BENCHMARK.md | 42 +++++--
 .../evals/evals.json | 18 ++-
 .../skill-card.md | 40 +++++--
 .../skill.oms.sig | 2 +-
 .../BENCHMARK.md | 42 +++++--
 .../evals/evals.json | 17 ++-
 .../skill-card.md | 42 +++++--
 .../skill.oms.sig | 2 +-
 .../BENCHMARK.md | 42 +++++--
 .../evals/evals.json | 18 ++-
 .../skill-card.md | 35 +++++-
 .../skill.oms.sig | 2 +-
 .../BENCHMARK.md | 42 +++++--
 .../evals/evals.json | 17 ++-
 .../skill-card.md | 42 ++++++-
 .../skill.oms.sig | 2 +-
 .../BENCHMARK.md | 42 +++++--
 .../SKILL.md | 21 +++-
 .../evals/evals.json | 19 +++-
 .../skill-card.md | 37 +++++-
 .../skill.oms.sig | 2 +-
 .../BENCHMARK.md | 42 +++++--
 .../evals/evals.json | 18 ++-
 .../skill-card.md | 37 +++++-
 .../skill.oms.sig | 2 +-
 .../BENCHMARK.md | 42 +++++--
 .../SKILL.md | 24 ++++
 .../evals/evals.json | 17 ++-
 .../skill-card.md | 37 ++++--
 .../skill.oms.sig | 2 +-
 .../BENCHMARK.md | 42 +++++--
 .../evals/evals.json | 18 ++-
 .../skill-card.md | 41 ++++++-
 .../skill.oms.sig | 2 +-
 .../BENCHMARK.md | 42 +++++--
 .../SKILL.md | 8 ++
 .../evals/evals.json | 17 ++-
 .../skill-card.md | 33 +++++-
 .../skill.oms.sig | 2 +-
 .../BENCHMARK.md | 42 +++++--
 .../evals/evals.json | 18 ++-
 .../skill-card.md | 36 +++++-
 .../skill.oms.sig | 2 +-
.../BENCHMARK.md | 42 +++++-- .../evals/evals.json | 19 +++- .../skill-card.md | 37 +++++- .../skill.oms.sig | 2 +- .../BENCHMARK.md | 49 +++++--- .../nemo-mbridge-recipe-recommender/SKILL.md | 25 ++++- .../evals/evals.json | 18 ++- .../skill-card.md | 41 ++++++- .../skill.oms.sig | 2 +- skills/nemo-mbridge-resiliency/BENCHMARK.md | 42 +++++-- .../nemo-mbridge-resiliency/evals/evals.json | 18 ++- skills/nemo-mbridge-resiliency/skill-card.md | 36 +++++- skills/nemo-mbridge-resiliency/skill.oms.sig | 2 +- 108 files changed, 2343 insertions(+), 610 deletions(-) create mode 100644 skills/data-designer/BENCHMARK.md create mode 100644 skills/data-designer/SKILL.md create mode 100644 skills/data-designer/evals/evals.json create mode 100644 skills/data-designer/references/person-sampling.md create mode 100644 skills/data-designer/references/preview-review.md create mode 100644 skills/data-designer/references/seed-datasets.md create mode 100644 skills/data-designer/scripts/get_person_object_schema.py create mode 100644 skills/data-designer/skill-card.md create mode 100644 skills/data-designer/skill.oms.sig create mode 100644 skills/data-designer/workflows/autopilot.md create mode 100644 skills/data-designer/workflows/interactive.md diff --git a/README.md b/README.md index f1f2e37d..c487b103 100644 --- a/README.md +++ b/README.md @@ -103,6 +103,7 @@ For non-interactive installs, global installs, agent-specific installs, updates, | **cuOpt** | GPU-accelerated optimization — vehicle routing, linear programming, quadratic programming, installation, server deployment, and developer tools. 
| [`cuopt-developer`](skills/cuopt-developer), [`cuopt-install`](skills/cuopt-install), [`cuopt-numerical-optimization-api-c`](skills/cuopt-numerical-optimization-api-c), [`cuopt-numerical-optimization-api-cli`](skills/cuopt-numerical-optimization-api-cli), [`cuopt-numerical-optimization-api-python`](skills/cuopt-numerical-optimization-api-python), [`cuopt-numerical-optimization-formulation`](skills/cuopt-numerical-optimization-formulation), [`cuopt-routing-api-python`](skills/cuopt-routing-api-python), [`cuopt-routing-formulation`](skills/cuopt-routing-formulation), [`cuopt-server-api-python`](skills/cuopt-server-api-python), [`cuopt-server-common`](skills/cuopt-server-common), [`cuopt-skill-evolution`](skills/cuopt-skill-evolution), [`cuopt-user-rules`](skills/cuopt-user-rules) | | **cuPyNumeric** | NumPy and SciPy on multi-node multi-GPU systems — skills to help with installing cuPyNumeric, migrating existing NumPy code, and doing parallel I/O | [`cupynumeric-hdf5`](skills/cupynumeric-hdf5), [`cupynumeric-install`](skills/cupynumeric-install), [`cupynumeric-migration-readiness`](skills/cupynumeric-migration-readiness), [`cupynumeric-parallel-data-load`](skills/cupynumeric-parallel-data-load) | | **DALI** | GPU-accelerated data loading and processing with NVIDIA DALI. | [`dali-dynamic-mode`](skills/dali-dynamic-mode) | +| **Data Designer** | Build declarative synthetic dataset generation pipelines with NeMo Data Designer. | [`data-designer`](skills/data-designer) | | **DeepStream** | Agentic skills for guided DeepStream development. | [`deepstream-dev`](skills/deepstream-dev), [`deepstream-import-vision-model`](skills/deepstream-import-vision-model) | | **Dynamo** | NVIDIA Dynamo deployment bring-up on Kubernetes — pick and deploy recipes, start router modes, validate disagg NIXL/UCX/NCCL interconnect, and triage day-2 failures. 
| [`dynamo-interconnect-check`](skills/dynamo-interconnect-check), [`dynamo-recipe-runner`](skills/dynamo-recipe-runner), [`dynamo-router-starter`](skills/dynamo-router-starter), [`dynamo-troubleshoot`](skills/dynamo-troubleshoot) | | **Earth2Studio** | Open-source deep-learning framework for exploring, building and deploying AI weather/climate workflows. | [`earth2studio-data-fetch`](skills/earth2studio-data-fetch), [`earth2studio-deterministic-forecast`](skills/earth2studio-deterministic-forecast), [`earth2studio-discover`](skills/earth2studio-discover), [`earth2studio-install`](skills/earth2studio-install) | @@ -148,6 +149,7 @@ Per-product source repo links: | **cuOpt** | [Issues](https://github.com/NVIDIA/cuopt/issues) | [Discussions](https://github.com/NVIDIA/cuopt/discussions) | [Contributing](https://github.com/NVIDIA/cuopt/blob/main/CONTRIBUTING.md) | [Security](https://github.com/NVIDIA/cuopt/blob/main/SECURITY.md) | | **cuPyNumeric** | [Issues](https://github.com/nv-legate/cupynumeric/issues) | — | [Contributing](https://github.com/nv-legate/cupynumeric/blob/main/CONTRIBUTING.md) | — | | **DALI** | [Issues](https://github.com/NVIDIA/DALI/issues) | — | [Contributing](https://github.com/NVIDIA/DALI/blob/main/CONTRIBUTING.md) | — | +| **Data Designer** | [Issues](https://github.com/NVIDIA-NeMo/DataDesigner/issues) | [Discussions](https://github.com/NVIDIA-NeMo/DataDesigner/discussions) | [Contributing](https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/CONTRIBUTING.md) | [Security](https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/SECURITY.md) | | **DeepStream** | [Issues](https://github.com/NVIDIA-AI-IOT/DeepStream_Coding_Agent/issues) | — | [Contributing](https://github.com/NVIDIA-AI-IOT/DeepStream_Coding_Agent/blob/main/CONTRIBUTING.md) | [Security](https://github.com/NVIDIA-AI-IOT/DeepStream_Coding_Agent/blob/main/SECURITY.md) | | **Dynamo** | [Issues](https://github.com/ai-dynamo/dynamo/issues) | 
[Discussions](https://github.com/ai-dynamo/dynamo/discussions) | [Contributing](https://github.com/ai-dynamo/dynamo/blob/main/CONTRIBUTING.md) | [Security](https://github.com/ai-dynamo/dynamo/blob/main/SECURITY.md) | | **Earth2Studio** | [Issues](https://github.com/NVIDIA/earth2studio/issues) | [Discussions](https://github.com/NVIDIA/earth2studio/discussions) | [Contributing](https://github.com/NVIDIA/earth2studio/blob/main/CONTRIBUTING.md) | — | diff --git a/skills/cupynumeric-hdf5/BENCHMARK.md b/skills/cupynumeric-hdf5/BENCHMARK.md index ffa19dc5..724e4a92 100644 --- a/skills/cupynumeric-hdf5/BENCHMARK.md +++ b/skills/cupynumeric-hdf5/BENCHMARK.md @@ -7,7 +7,7 @@ This benchmark summarizes 3-Tier Evaluation from NVSkills-Eval results for the s ## Evaluation Summary - Skill: `cupynumeric-hdf5` -- Evaluation date: 2026-05-29 +- Evaluation date: 2026-06-02 - NVSkills-Eval profile: `external` - Environment: `local` - Dataset: 17 evaluation tasks @@ -54,11 +54,11 @@ Task composition is derived from the evaluation dataset when possible. Entries w | Dimension | Num | `claude-code` | `codex` | |---|---:|---:|---:| -| Security | 8 | 100% (+6%) | 100% (+0%) | -| Correctness | 8 | 90% (+4%) | 93% (+9%) | -| Discoverability | 8 | 80% (+17%) | 80% (+7%) | -| Effectiveness | 8 | 90% (+4%) | 92% (+16%) | -| Efficiency | 8 | 80% (+24%) | 73% (+7%) | +| Security | 8 | 100% (+3%) | 100% (+0%) | +| Correctness | 8 | 92% (+9%) | 96% (+12%) | +| Discoverability | 8 | 88% (+20%) | 85% (+11%) | +| Effectiveness | 8 | 93% (+12%) | 94% (+20%) | +| Efficiency | 8 | 86% (+27%) | 79% (+12%) | Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available. 
diff --git a/skills/cupynumeric-hdf5/evals/evals.json b/skills/cupynumeric-hdf5/evals/evals.json index 33973661..2ecef3a8 100644 --- a/skills/cupynumeric-hdf5/evals/evals.json +++ b/skills/cupynumeric-hdf5/evals/evals.json @@ -64,12 +64,12 @@ "Unpacks each yield as `(chunk, offsets)` and converts the chunk with `cn.asarray`", "Places each chunk by its actual shape/offsets (accounts for clipped boundary chunks)", "Ends with a blocking execution fence", - "Clarifies that from_file_batched chunks the file read — the preallocated array (`cn.empty(shape)`) still has to fit in distributed memory", + "Clarifies that from_file_batched chunks the file read \u2014 the preallocated array (`cn.empty(shape)`) still has to fit in distributed memory", "Uses only documented legate.io.hdf5 API and does not invent a streaming-write counterpart" ], "expected_script": null, "expected_skill": "cupynumeric-hdf5", - "ground_truth": "The agent uses `from_file_batched(path, dataset_name, chunk_size)`, which yields one `LogicalArray` per chunk plus the offsets where that chunk belongs in the global shape. It preallocates the destination with `cn.empty(shape, dtype)` (reading shape/dtype from h5py first), then for each `(chunk, offsets)` places `cn.asarray(chunk)` at `out[r0:r0+chunk.shape[0], ...]` using each chunk's actual shape because boundary chunks are clipped. It ends with `get_legate_runtime().issue_execution_fence(block=True)`. It clarifies that `from_file_batched` chunks the source-file read, not the result — the preallocated array must still fit in distributed memory. It may note `from_file_batched` raises `ValueError` if `chunk_size` is non-positive or its length differs from the dataset rank.", + "ground_truth": "The agent uses `from_file_batched(path, dataset_name, chunk_size)`, which yields one `LogicalArray` per chunk plus the offsets where that chunk belongs in the global shape. 
It preallocates the destination with `cn.empty(shape, dtype)` (reading shape/dtype from h5py first), then for each `(chunk, offsets)` places `cn.asarray(chunk)` at `out[r0:r0+chunk.shape[0], ...]` using each chunk's actual shape because boundary chunks are clipped. It ends with `get_legate_runtime().issue_execution_fence(block=True)`. It clarifies that `from_file_batched` chunks the source-file read, not the result \u2014 the preallocated array must still fit in distributed memory. It may note `from_file_batched` raises `ValueError` if `chunk_size` is non-positive or its length differs from the dataset rank.", "id": "hdf5-005-batched-streaming", "question": "I have a very large HDF5 dataset I can't read into host memory in one shot. How do I load it into a distributed cuPyNumeric array a chunk at a time?", "should_trigger": true @@ -83,7 +83,7 @@ ], "expected_script": null, "expected_skill": "cupynumeric-hdf5", - "ground_truth": "The agent explains that `legate.io.hdf5` imports `h5py` at module load, so the whole module fails to import until h5py is installed. The fix is `conda install -c conda-forge h5py`. It notes h5py is not part of the default cuPyNumeric environment. It does not run the install command itself.", + "ground_truth": "The agent explains that `legate.io.hdf5` imports `h5py` at module load, so the whole module fails to import until h5py is installed. The fix is `conda install -c conda-forge h5py`. It notes h5py is not part of the default cuPyNumeric environment. It does not run the install command itself.", "id": "hdf5-006-h5py-prerequisite", "question": "On a fresh cuPyNumeric env, `from legate.io.hdf5 import to_file` raises `ModuleNotFoundError: No module named 'h5py'`. cuPyNumeric and legate import fine. What do I need?", "should_trigger": true @@ -206,7 +206,7 @@ ], "expected_script": null, "expected_skill": null, - "ground_truth": "Parquet/tabular interchange is outside this single-array HDF5 skill. 
The useful answer routes to the cupynumeric-parallel-data-load skill — which owns cuPyNumeric's no-built-in-loader paths for Parquet/Arrow/custom layouts — or simply states that HDF5 is not the right API. It does not recommend legate-dataframe (not supported), and does not suggest writing a Parquet column via the HDF5 API.", + "ground_truth": "Parquet/tabular interchange is outside this single-array HDF5 skill. The useful answer routes to the cupynumeric-parallel-data-load skill \u2014 which owns cuPyNumeric's no-built-in-loader paths for Parquet/Arrow/custom layouts \u2014 or simply states that HDF5 is not the right API. It does not recommend legate-dataframe (not supported), and does not suggest writing a Parquet column via the HDF5 API.", "id": "hdf5-neg-004-parquet-cudf", "question": "I have a cuPyNumeric array I want to expose as a column in a Parquet dataset that the cuDF team will load. What's the right path?", "should_trigger": false diff --git a/skills/cupynumeric-hdf5/skill-card.md b/skills/cupynumeric-hdf5/skill-card.md index 8d95e7af..38ed938b 100644 --- a/skills/cupynumeric-hdf5/skill-card.md +++ b/skills/cupynumeric-hdf5/skill-card.md @@ -9,7 +9,7 @@ NVIDIA
### License/Terms of Use:
CC-BY-4.0 OR Apache-2.0
## Use Case:
-Developers and engineers who need to save or load cuPyNumeric arrays to and from HDF5 files for large-scale distributed HPC and scientific computing workflows.
+Developers and engineers who need to save cuPyNumeric arrays to HDF5 files, load HDF5 datasets into distributed cuPyNumeric arrays, read large datasets in chunks, or accelerate HDF5 disk I/O with GPUDirect Storage for HPC pipelines.
### Deployment Geography for Use:
Global
@@ -19,15 +19,15 @@ Risk: Review before execution as proposals could introduce incorrect or misleadi Mitigation: Review and scan skill before deployment.
## Reference(s):
-- [Legate I/O API Documentation](https://docs.nvidia.com/legate/latest/api/python/io/index.html)
-- [cuPyNumeric GitHub](https://github.com/nv-legate/cupynumeric)
-- [HDF5 — The HDF Group](https://www.hdfgroup.org/solutions/hdf5/)
-- [VFD GDS Plugin](https://github.com/nv-legate/vfd-gds)
+- [Legate HDF5 I/O API Documentation](https://docs.nvidia.com/legate/latest/api/python/io/index.html)
+- [cuPyNumeric GitHub Repository](https://github.com/nv-legate/cupynumeric)
+- [HDF5 - The HDF Group](https://www.hdfgroup.org/solutions/hdf5/)
+- [VFD-GDS Plugin (GPUDirect Storage for HDF5)](https://github.com/nv-legate/vfd-gds)
## Skill Output:
-**Output Type(s):** [Code, Configuration instructions]
-**Output Format:** [Markdown with inline Python code blocks]
+**Output Type(s):** [Code, Shell commands, Configuration instructions]
+**Output Format:** [Markdown with inline Python and bash code blocks]
**Output Parameters:** [1D]
**Other Properties Related to Output:** [None]
@@ -38,7 +38,7 @@ Mitigation: Review and scan skill before deployment.
## Evaluation Tasks:
-Evaluated against 17 tasks (11 positive activation, 6 negative activation) with 2 attempts per task via NVSkills-Eval.
+Evaluated against 17 evaluation tasks (11 positive activation, 6 negative activation) with 2 attempts per task and a 50% pass threshold.
## Evaluation Metrics Used:
Reported benchmark dimensions:
@@ -62,11 +62,11 @@ Underlying evaluation signals used in this run:
## Evaluation Results:
| Dimension | Num | `claude-code` | `codex` | |---|---:|---:|---:| -| Security | 8 | 100% (+6%) | 100% (+0%) | -| Correctness | 8 | 90% (+4%) | 93% (+9%) | -| Discoverability | 8 | 80% (+17%) | 80% (+7%) | -| Effectiveness | 8 | 90% (+4%) | 92% (+16%) | -| Efficiency | 8 | 80% (+24%) | 73% (+7%) | +| Security | 8 | 100% (+3%) | 100% (+0%) | +| Correctness | 8 | 92% (+9%) | 96% (+12%) | +| Discoverability | 8 | 88% (+20%) | 85% (+11%) | +| Effectiveness | 8 | 93% (+12%) | 94% (+20%) | +| Efficiency | 8 | 86% (+27%) | 79% (+12%) | ## Skill Version(s):
2.0.0 (source: frontmatter)
diff --git a/skills/cupynumeric-hdf5/skill.oms.sig b/skills/cupynumeric-hdf5/skill.oms.sig index bf73aff8..5b05a632 100644 --- a/skills/cupynumeric-hdf5/skill.oms.sig +++ b/skills/cupynumeric-hdf5/skill.oms.sig @@ -1 +1 @@ -{"mediaType":"application/vnd.dev.sigstore.bundle.v0.3+json","verificationMaterial":{"x509CertificateChain":{"certificates":[{"rawBytes":"MIICgzCCAgmgAwIBAgIUKIyS7SxNteQIiWzK1dWj85E6520wCgYIKoZIzj0EAwMwVTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjEpMCcGA1UEAwwgTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBJQ0EgMDEwHhcNMjYwNDAxMDAwMDAwWhcNMjgwNDIyMTUzMzA5WjBUMQswCQYDVQQGEwJVUzEbMBkGA1UECgwSTlZJRElBIENvcnBvcmF0aW9uMSgwJgYDVQQDDB9OVklESUEgQWdlbnQgU2tpbGxzIFNpZ25pbmcgMDAxMHYwEAYHKoZIzj0CAQYFK4EEACIDYgAEYoRM9bQl/dGlwSRNi6bTpIJUXH8Nv9GciP6LSflJYYMLCc296kpyuTSsk5ddbAWiDcFX3C/ydX3jwc+qCLYP6uHy9XphyLjOQ27Yb2J6rBLVtRBS1mgGco/Gr7fL6ODco4GaMIGXMB0GA1UdDgQWBBRQ/5ZW3nJ6lmo9SVk7I15o7UGmpTAfBgNVHSMEGDAWgBRPGpILxMBBleJSsBGjrMKsby1CgjAMBgNVHRMBAf8EAjAAMA4GA1UdDwEB/wQEAwIHgDA3BggrBgEFBQcBAQQrMCkwJwYIKwYBBQUHMAGGG2h0dHA6Ly9vY3NwLm5kaXMubnZpZGlhLmNvbTAKBggqhkjOPQQDAwNoADBlAjAUygu/GiOCIXrgGr4SmLgeEVDcEitfFUv7ALbvLVGVyMysB3mxmO/uInZfXzWcJZsCMQDxuoxj4ZmO30jhkPIcCxGFCOvnUsnfU3TfGcouYm4M6iRpbKvtVnHPiy4bi6pcKf0="},{"rawBytes":"MIICiDCCAg6gAwIBAgIUZsIuSv9NkpJCNqtYEfCouVv5BzowCgYIKoZIzj0EAwMwUTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTAgFw0yNjA0MDEwMDAwMDBaGA85OTk5MTIzMTIzNTk1OVowVTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjEpMCcGA1UEAwwgTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBJQ0EgMDEwdjAQBgcqhkjOPQIBBgUrgQQAIgNiAASI72cR3ctKGg4VWnB3bNja6g1Z2PnOmFEopkPof+QeIcPk9rT+g9MjJnq51EQXL93a7C2GJ9J985G4o2V85VD7wJ1RaXhluHW2rf3y8bQGeAYaKMr5s/hUgn+M3/9WlWejgaAwgZ0wHQYDVR0OBBYEFE8akgvEwEGV4lKwEaOswqxvLUKCMB8GA1UdIwQYMBaAFItnoAjjfuCEUvzyvWyI2vOGvwPjMBIGA1UdEwEB/wQIMAYBAf8CAQAwDgYDVR0PAQH/BAQDAgEGMDcGCCsGAQUFBwEBBCswKTAnBggrBgEFBQcwAYYbaHR0cDovL29jc3AubmRpcy5udmlkaWEuY29tMAoGCCqGSM49BAMDA2gAMGUCMQCeIMMfAbyzPDacw2MxG+
Yt1cikrJX/DVxiGfXuHmkkXn6VgSzE79+lkqDErpVO2gYCMCNEColOyvUvkzZGUEI1hQ3PfMgi3FIo9tHoBKMw4/wGBLFpu/0ubtmbBXM6/UMOEw=="},{"rawBytes":"MIICRTCCAcygAwIBAgIUeJdY3rV86EdvFmG7L8LJBsyQFYkwCgYIKoZIzj0EAwMwUTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTAgFw0yNjA0MDEwMDAwMDBaGA85OTk5MTIzMTIzNTk1OVowUTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTB2MBAGByqGSM49AgEGBSuBBAAiA2IABAYpiXCDjJ9NT2eSDhyHJVSw1Tbze18cGG2F/578oWvHxg23eQAhNRYdq88i1iOshZSO6C29doKui5Xpmo/7Ctw9Sx4PP2RzOmIuOLCuTdNtKcTRwi4GEsd5BAFvWj42M6NjMGEwHQYDVR0OBBYEFItnoAjjfuCEUvzyvWyI2vOGvwPjMB8GA1UdIwQYMBaAFItnoAjjfuCEUvzyvWyI2vOGvwPjMA8GA1UdEwEB/wQFMAMBAf8wDgYDVR0PAQH/BAQDAgEGMAoGCCqGSM49BAMDA2cAMGQCMCwtAjWLaNwgGWNCgdyNoTyvNhqWRECRJV2r3+7w8g0PL6NHLOsbkgE09BH95h8XlgIwTaQmbbUh2ChAJ5TA1wRiVDnCcvbzHlZl2jM2FcwQQZlk19LOAbyGMRixbu2Ww/rj"}]},"tlogEntries":[]},"dsseEnvelope":{"payload":"ewogICJfdHlwZSI6ICJodHRwczovL2luLXRvdG8uaW8vU3RhdGVtZW50L3YxIiwKICAic3ViamVjdCI6IFsKICAgIHsKICAgICAgIm5hbWUiOiAiY3VweW51bWVyaWMtaGRmNSIsCiAgICAgICJkaWdlc3QiOiB7CiAgICAgICAgInNoYTI1NiI6ICIyMjY3ZGFkNDQ5ZjkzZDZkZDkzN2I2ODNhNjQ2M2MxZDQ4ZWMzZmM2MjEwOTRmYjBlYTYzMTMyNDc1YWM0NTFiIgogICAgICB9CiAgICB9CiAgXSwKICAicHJlZGljYXRlVHlwZSI6ICJodHRwczovL21vZGVsX3NpZ25pbmcvc2lnbmF0dXJlL3YxLjAiLAogICJwcmVkaWNhdGUiOiB7CiAgICAicmVzb3VyY2VzIjogWwogICAgICB7CiAgICAgICAgIm5hbWUiOiAiQkVOQ0hNQVJLLm1kIiwKICAgICAgICAiYWxnb3JpdGhtIjogInNoYTI1NiIsCiAgICAgICAgImRpZ2VzdCI6ICI5NzlkYzQyMTc4NzdkZDUxMGQ4OTAwZGRkNDEwNDFjMjVlZDU5MDhiYTc3YjU4YzZjMjhjYWZhZDY5ZWQxNDBiIgogICAgICB9LAogICAgICB7CiAgICAgICAgIm5hbWUiOiAiU0tJTEwubWQiLAogICAgICAgICJhbGdvcml0aG0iOiAic2hhMjU2IiwKICAgICAgICAiZGlnZXN0IjogImFlOWExNzhkNDFjNDkxNTc1Nzg4ZTJkMTA3YzZiY2QwN2FhZTE1MjJmOGU3NTRiOWU4OTAxMDEwOTM0MTYxOWMiCiAgICAgIH0sCiAgICAgIHsKICAgICAgICAibmFtZSI6ICJhc3NldHMvaGRmNV9iYXRjaGVkX3JlYWQucHkiLAogICAgICAgICJhbGdvcml0aG0iOiAic2hhMjU2IiwKICAgICAgICAiZGlnZXN0IjogIjc1OTE3OWFmYjllMTUyNDJkMTkxZTJiNWRmZDgy
ZjY1NTc0Njc1MmI4NzA0MTAzMDcxYTZkMGE2NjdmNWY4ZmIiCiAgICAgIH0sCiAgICAgIHsKICAgICAgICAibmFtZSI6ICJhc3NldHMvaGRmNV9yb3VuZHRyaXAucHkiLAogICAgICAgICJhbGdvcml0aG0iOiAic2hhMjU2IiwKICAgICAgICAiZGlnZXN0IjogIjI4NGVhNWMzYTg3OWQxNmJiMTVhYmIxZGEwMTJhN2QxZGE5ZTFlZWZmZTQ0MGMzZWYyMTFlMmZiMWQxNDdmOGIiCiAgICAgIH0sCiAgICAgIHsKICAgICAgICAibmFtZSI6ICJldmFscy9ldmFscy5qc29uIiwKICAgICAgICAiYWxnb3JpdGhtIjogInNoYTI1NiIsCiAgICAgICAgImRpZ2VzdCI6ICI3ODAwNzNhYjA1YWY1MmU4OGNjZGE0MTdhOTgwM2MzZDM3YTVlYTQ3NTNmM2QzNzMyYTJjOThiZmEwMzdjMGQzIgogICAgICB9LAogICAgICB7CiAgICAgICAgIm5hbWUiOiAic2tpbGwtY2FyZC5tZCIsCiAgICAgICAgImFsZ29yaXRobSI6ICJzaGEyNTYiLAogICAgICAgICJkaWdlc3QiOiAiMzhiYjFhOTZkMjk0NzNiNjkwNTIxNjYwYTkwZDRhZGM2YmVlMTQzNmYxNDdiOTBhZjFmOWJiYTljOTQwZDFmZiIKICAgICAgfQogICAgXSwKICAgICJzZXJpYWxpemF0aW9uIjogewogICAgICAiaGFzaF90eXBlIjogInNoYTI1NiIsCiAgICAgICJhbGxvd19zeW1saW5rcyI6IGZhbHNlLAogICAgICAiaWdub3JlX3BhdGhzIjogWwogICAgICAgICIuZ2l0aHViIiwKICAgICAgICAiLmdpdCIsCiAgICAgICAgIi5naXRhdHRyaWJ1dGVzIiwKICAgICAgICAiLmdpdGlnbm9yZSIKICAgICAgXSwKICAgICAgIm1ldGhvZCI6ICJmaWxlcyIKICAgIH0KICB9Cn0=","payloadType":"application/vnd.in-toto+json","signatures":[{"sig":"MGUCMQCQcTyOyxcZk23FmWBXETvWSsnbiLxwHLAtiyOVs+kDGTjLnvkAU9mYgkczvB2xAFECMCogsSn5cadY19XR3yb5TlKSvJOSYQwUPxDAB/UDjR3areOlEiQOblGLZq7zhqmUTA==","keyid":""}]}} \ No newline at end of file 
+{"mediaType":"application/vnd.dev.sigstore.bundle.v0.3+json","verificationMaterial":{"x509CertificateChain":{"certificates":[{"rawBytes":"MIICgzCCAgmgAwIBAgIUKIyS7SxNteQIiWzK1dWj85E6520wCgYIKoZIzj0EAwMwVTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjEpMCcGA1UEAwwgTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBJQ0EgMDEwHhcNMjYwNDAxMDAwMDAwWhcNMjgwNDIyMTUzMzA5WjBUMQswCQYDVQQGEwJVUzEbMBkGA1UECgwSTlZJRElBIENvcnBvcmF0aW9uMSgwJgYDVQQDDB9OVklESUEgQWdlbnQgU2tpbGxzIFNpZ25pbmcgMDAxMHYwEAYHKoZIzj0CAQYFK4EEACIDYgAEYoRM9bQl/dGlwSRNi6bTpIJUXH8Nv9GciP6LSflJYYMLCc296kpyuTSsk5ddbAWiDcFX3C/ydX3jwc+qCLYP6uHy9XphyLjOQ27Yb2J6rBLVtRBS1mgGco/Gr7fL6ODco4GaMIGXMB0GA1UdDgQWBBRQ/5ZW3nJ6lmo9SVk7I15o7UGmpTAfBgNVHSMEGDAWgBRPGpILxMBBleJSsBGjrMKsby1CgjAMBgNVHRMBAf8EAjAAMA4GA1UdDwEB/wQEAwIHgDA3BggrBgEFBQcBAQQrMCkwJwYIKwYBBQUHMAGGG2h0dHA6Ly9vY3NwLm5kaXMubnZpZGlhLmNvbTAKBggqhkjOPQQDAwNoADBlAjAUygu/GiOCIXrgGr4SmLgeEVDcEitfFUv7ALbvLVGVyMysB3mxmO/uInZfXzWcJZsCMQDxuoxj4ZmO30jhkPIcCxGFCOvnUsnfU3TfGcouYm4M6iRpbKvtVnHPiy4bi6pcKf0="},{"rawBytes":"MIICiDCCAg6gAwIBAgIUZsIuSv9NkpJCNqtYEfCouVv5BzowCgYIKoZIzj0EAwMwUTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTAgFw0yNjA0MDEwMDAwMDBaGA85OTk5MTIzMTIzNTk1OVowVTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjEpMCcGA1UEAwwgTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBJQ0EgMDEwdjAQBgcqhkjOPQIBBgUrgQQAIgNiAASI72cR3ctKGg4VWnB3bNja6g1Z2PnOmFEopkPof+QeIcPk9rT+g9MjJnq51EQXL93a7C2GJ9J985G4o2V85VD7wJ1RaXhluHW2rf3y8bQGeAYaKMr5s/hUgn+M3/9WlWejgaAwgZ0wHQYDVR0OBBYEFE8akgvEwEGV4lKwEaOswqxvLUKCMB8GA1UdIwQYMBaAFItnoAjjfuCEUvzyvWyI2vOGvwPjMBIGA1UdEwEB/wQIMAYBAf8CAQAwDgYDVR0PAQH/BAQDAgEGMDcGCCsGAQUFBwEBBCswKTAnBggrBgEFBQcwAYYbaHR0cDovL29jc3AubmRpcy5udmlkaWEuY29tMAoGCCqGSM49BAMDA2gAMGUCMQCeIMMfAbyzPDacw2MxG+Yt1cikrJX/DVxiGfXuHmkkXn6VgSzE79+lkqDErpVO2gYCMCNEColOyvUvkzZGUEI1hQ3PfMgi3FIo9tHoBKMw4/wGBLFpu/0ubtmbBXM6/UMOEw=="},{"rawBytes":"MIICRTCCAcygAwIBAgIUeJdY3rV86EdvFmG7L8LJBsyQFYkwCgYIKoZIzj0EAwMwUTELMAkGA1UEBhMCVVMxGzAZBgNVB
AoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTAgFw0yNjA0MDEwMDAwMDBaGA85OTk5MTIzMTIzNTk1OVowUTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTB2MBAGByqGSM49AgEGBSuBBAAiA2IABAYpiXCDjJ9NT2eSDhyHJVSw1Tbze18cGG2F/578oWvHxg23eQAhNRYdq88i1iOshZSO6C29doKui5Xpmo/7Ctw9Sx4PP2RzOmIuOLCuTdNtKcTRwi4GEsd5BAFvWj42M6NjMGEwHQYDVR0OBBYEFItnoAjjfuCEUvzyvWyI2vOGvwPjMB8GA1UdIwQYMBaAFItnoAjjfuCEUvzyvWyI2vOGvwPjMA8GA1UdEwEB/wQFMAMBAf8wDgYDVR0PAQH/BAQDAgEGMAoGCCqGSM49BAMDA2cAMGQCMCwtAjWLaNwgGWNCgdyNoTyvNhqWRECRJV2r3+7w8g0PL6NHLOsbkgE09BH95h8XlgIwTaQmbbUh2ChAJ5TA1wRiVDnCcvbzHlZl2jM2FcwQQZlk19LOAbyGMRixbu2Ww/rj"}]},"tlogEntries":[]},"dsseEnvelope":{"payload":"ewogICJfdHlwZSI6ICJodHRwczovL2luLXRvdG8uaW8vU3RhdGVtZW50L3YxIiwKICAic3ViamVjdCI6IFsKICAgIHsKICAgICAgIm5hbWUiOiAiY3VweW51bWVyaWMtaGRmNSIsCiAgICAgICJkaWdlc3QiOiB7CiAgICAgICAgInNoYTI1NiI6ICJjNWYyYjBkZjU0NzZkODZlZGJkNWRlYmM3MGEzNWI1YjNkMWY1ZTljNjE3MTQyZDAwYmMwYmQ4NWEyYTMyZWU4IgogICAgICB9CiAgICB9CiAgXSwKICAicHJlZGljYXRlVHlwZSI6ICJodHRwczovL21vZGVsX3NpZ25pbmcvc2lnbmF0dXJlL3YxLjAiLAogICJwcmVkaWNhdGUiOiB7CiAgICAic2VyaWFsaXphdGlvbiI6IHsKICAgICAgImhhc2hfdHlwZSI6ICJzaGEyNTYiLAogICAgICAiaWdub3JlX3BhdGhzIjogWwogICAgICAgICIuZ2l0aHViIiwKICAgICAgICAiLmdpdCIsCiAgICAgICAgIi5naXRhdHRyaWJ1dGVzIiwKICAgICAgICAiLmdpdGlnbm9yZSIKICAgICAgXSwKICAgICAgImFsbG93X3N5bWxpbmtzIjogZmFsc2UsCiAgICAgICJtZXRob2QiOiAiZmlsZXMiCiAgICB9LAogICAgInJlc291cmNlcyI6IFsKICAgICAgewogICAgICAgICJkaWdlc3QiOiAiNjFlODYzNTI2NWViODExYTRhMGEyZGQyZjUyMWQ1MDk3YTc5MDc5NGYwNzYyNTljMDAwN2Y3NzA4ZmM4NmNjNSIsCiAgICAgICAgIm5hbWUiOiAiQkVOQ0hNQVJLLm1kIiwKICAgICAgICAiYWxnb3JpdGhtIjogInNoYTI1NiIKICAgICAgfSwKICAgICAgewogICAgICAgICJkaWdlc3QiOiAiYWU5YTE3OGQ0MWM0OTE1NzU3ODhlMmQxMDdjNmJjZDA3YWFlMTUyMmY4ZTc1NGI5ZTg5MDEwMTA5MzQxNjE5YyIsCiAgICAgICAgIm5hbWUiOiAiU0tJTEwubWQiLAogICAgICAgICJhbGdvcml0aG0iOiAic2hhMjU2IgogICAgICB9LAogICAgICB7CiAgICAgICAgImRpZ2VzdCI6ICI3NTkxNzlhZmI5ZTE1MjQyZDE5MWUyYjVkZmQ4MmY2NTU3NDY3NTJiODcwNDE
wMzA3MWE2ZDBhNjY3ZjVmOGZiIiwKICAgICAgICAibmFtZSI6ICJhc3NldHMvaGRmNV9iYXRjaGVkX3JlYWQucHkiLAogICAgICAgICJhbGdvcml0aG0iOiAic2hhMjU2IgogICAgICB9LAogICAgICB7CiAgICAgICAgImRpZ2VzdCI6ICIyODRlYTVjM2E4NzlkMTZiYjE1YWJiMWRhMDEyYTdkMWRhOWUxZWVmZmU0NDBjM2VmMjExZTJmYjFkMTQ3ZjhiIiwKICAgICAgICAibmFtZSI6ICJhc3NldHMvaGRmNV9yb3VuZHRyaXAucHkiLAogICAgICAgICJhbGdvcml0aG0iOiAic2hhMjU2IgogICAgICB9LAogICAgICB7CiAgICAgICAgImRpZ2VzdCI6ICJiNmNjYWQ5NzRiZWJhMTIzMTE4YTNmMzg2ZTRiNWZlMDYxMGNmMDliY2Y2ODRkMmE0OWM3NDNiMzcwOGI1NGQ4IiwKICAgICAgICAibmFtZSI6ICJldmFscy9ldmFscy5qc29uIiwKICAgICAgICAiYWxnb3JpdGhtIjogInNoYTI1NiIKICAgICAgfSwKICAgICAgewogICAgICAgICJkaWdlc3QiOiAiNDcyNzZkYTMzNzkyYWU1MDM1OTdlZmIzNWNjODcyZDI5MzM5MTI2YjU2NThiZGU4M2VjZjI5ZTU3YjYxMmVhMyIsCiAgICAgICAgIm5hbWUiOiAic2tpbGwtY2FyZC5tZCIsCiAgICAgICAgImFsZ29yaXRobSI6ICJzaGEyNTYiCiAgICAgIH0KICAgIF0KICB9Cn0=","payloadType":"application/vnd.in-toto+json","signatures":[{"sig":"MGUCMQD/RFSzsihEjvVnk8wsRM+4rpLtZjsz3gZy/k2KlB+nCwlFT+xR4boYa1x1zd+WRmECMHfi10LAk2E+eEiLoDVWIHGwr9edWgELRsPIHPa8B0CaHbcJwUjrv6G5ou/CAMDpNg==","keyid":""}]}} \ No newline at end of file diff --git a/skills/data-designer/BENCHMARK.md b/skills/data-designer/BENCHMARK.md new file mode 100644 index 00000000..90d2c152 --- /dev/null +++ b/skills/data-designer/BENCHMARK.md @@ -0,0 +1,82 @@ +# Evaluation Report + +Evaluation of the `data-designer` skill before publication through NVSkills-Eval. + +This benchmark summarizes 3-Tier Evaluation from NVSkills-Eval results for the skill. The goal is to document whether the skill is safe, discoverable, effective, and useful for agents before it is published for broader workflow use. 
+ +## Evaluation Summary + +- Skill: `data-designer` +- Evaluation date: 2026-06-02 +- NVSkills-Eval profile: `external` +- Environment: `local` +- Dataset: 4 evaluation tasks +- Attempts per task: 2 +- Pass threshold: 50% +- Overall verdict: PASS + +## Agents Used + +- `claude-code` +- `codex` + +## Metrics Used + +Reported benchmark dimensions: + +- Security: checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access. +- Correctness: checks whether the agent follows the expected workflow and produces the correct final output. +- Discoverability: checks whether the agent loads the skill when relevant and avoids using it when irrelevant. +- Effectiveness: checks whether the agent performs measurably better with the skill than without it. +- Efficiency: checks whether the agent uses fewer tokens and avoids redundant work. + +Underlying evaluation signals used in this run: + +- `security` (Security): checks for unsafe operations, secret leakage, and unauthorized access. +- `skill_execution` (Skill Execution): verifies that the agent loaded the expected skill and workflow. +- `skill_efficiency` (Efficiency): checks routing quality, decoy avoidance, and redundant tool usage. +- `accuracy` (Accuracy): grades final-answer correctness against the reference answer. +- `goal_accuracy` (Goal Accuracy): checks whether the overall user task completed successfully. +- `behavior_check` (Behavior Check): verifies expected behavior steps, including safety expectations. +- `token_efficiency` (Token Efficiency): compares token usage with and without the skill. + +## Test Tasks + +The benchmark included 4 recorded Tier 3 trials, but the source evaluation dataset was not available in this report payload. 
+ +## Results + +| Dimension | Num | `claude-code` | `codex` | +|---|---:|---:|---:| +| Security | 2 | 100% (+0%) | 100% (+0%) | +| Correctness | 2 | 97% (+8%) | 84% (+0%) | +| Discoverability | 2 | 86% (+28%) | 69% (+4%) | +| Effectiveness | 2 | 97% (-3%) | 97% (+7%) | +| Efficiency | 2 | 64% (+19%) | 62% (+9%) | + +Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available. + +## Tier 1: Static Validation Summary + +Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 14 total findings. + +Top findings: + +- MEDIUM QUALITY/quality_correctness: No documented scripts in table format (`skills/data-designer/SKILL.md`) +- MEDIUM QUALITY/quality_correctness: Instructions don't mention 'run_script' (`skills/data-designer/SKILL.md`) +- MEDIUM QUALITY/quality_correctness: SKILL_SPEC recommended field missing: 'metadata.author' (`skills/data-designer/SKILL.md`) +- MEDIUM QUALITY/quality_correctness: SKILL_SPEC recommended field missing: 'metadata.tags' (`skills/data-designer/SKILL.md`) +- MEDIUM SCHEMA/body_recommended_section: Missing recommended section: '## Instructions' (`skills/data-designer/SKILL.md`) + +## Tier 2: Deduplication Summary + +Tier 2 validation passed. NVSkills-Eval ran 2 checks and found 0 total findings. + +Notable observations: + +- Context Deduplication: Collected 7 file(s) +- Inter-Skill Deduplication: Parsed skill 'data-designer': 106 char description + +## Publication Recommendation + +The skill is suitable to proceed toward NVSkills-Eval publication based on this benchmark. Skill owners should keep this file with the skill and refresh it when the evaluation dataset, skill behavior, or target agents materially change. 
diff --git a/skills/data-designer/SKILL.md b/skills/data-designer/SKILL.md new file mode 100644 index 00000000..e04af0d7 --- /dev/null +++ b/skills/data-designer/SKILL.md @@ -0,0 +1,94 @@ +--- +name: data-designer +description: Use when the user wants to create a dataset, generate synthetic data, or build a data generation pipeline. +argument-hint: [describe the dataset you want to generate] +license: Apache-2.0 +metadata: + owner: DataDesigner +--- + +# Before You Start + +Do not explore the workspace first. The workflow's Learn step gives you everything you need. + +# Goal + +Build a synthetic dataset using the Data Designer library that matches this description: + +$ARGUMENTS + +# Workflow + +Use **Autopilot** mode if the user implies they don't want to answer questions — e.g., they say something like "be opinionated", "you decide", "make reasonable assumptions", "just build it", "surprise me", etc. Otherwise, use **Interactive** mode (default). + +Read **only** the workflow file that matches the selected mode, then follow it: + +- **Interactive** → read `workflows/interactive.md` +- **Autopilot** → read `workflows/autopilot.md` + +# Rules + +- Keep all columns in the output by default. The only exceptions for dropping a column are: (1) the user explicitly asks, or (2) it is a helper column that exists solely to derive other columns (e.g., a sampled person object used to extract name, city, etc.). When in doubt, keep the column. +- Do not suggest or ask about seed datasets. Only use one when the user explicitly provides seed data or asks to build from existing records. When using a seed, read `references/seed-datasets.md`. +- When the dataset requires person data (names, demographics, addresses), read `references/person-sampling.md`. +- If a dataset script that matches the dataset description already exists, ask the user whether to edit it or create a new one. 
+ +# Usage Tips and Common Pitfalls + +- **Sampler and validation columns need both a type and params.** E.g., `sampler_type="category"` with `params=dd.CategorySamplerParams(...)`. +- **Jinja2 templates** in `prompt`, `system_prompt`, and `expr` fields: reference columns with `{{ column_name }}`, nested fields with `{{ column_name.field }}`. +- **`SamplerColumnConfig`:** Takes `params`, not `sampler_params`. +- **LLM judge score access:** `LLMJudgeColumnConfig` produces a nested dict where each score name maps to `{reasoning: str, score: int}`. To get the numeric score, use the `.score` attribute. For example, for a judge column named `quality` with a score named `correctness`, use `{{ quality.correctness.score }}`. Using `{{ quality.correctness }}` returns the full dict, not the numeric score. + +# Troubleshooting + +- **`data-designer` CLI not found:** Tell the user that `data-designer` is not installed in this environment (requires Python >= 3.10). Ask if they would like you to create a virtual environment and install it, or if they prefer to do it themselves. Do not install anything without the user's permission. +- **Network errors during preview:** A sandbox environment may be blocking outbound requests. Ask the user for permission to retry the command with the sandbox disabled. Only as a last resort, if retrying outside the sandbox also fails, tell the user to run the command themselves. + +# Output Template + +Write a Python file to the current directory with a `load_config_builder()` function returning a `DataDesignerConfigBuilder`. Name the file descriptively (e.g., `customer_reviews.py`). Use PEP 723 inline metadata for dependencies. 
+
+```python
+# /// script
+# dependencies = [
+#     "data-designer",  # always required
+#     "pydantic",  # only if this script imports from pydantic
+#     # add additional dependencies here
+# ]
+# ///
+import data_designer.config as dd
+from pydantic import BaseModel, Field
+
+
+# Use Pydantic models when the output needs to conform to a specific schema
+class MyStructuredOutput(BaseModel):
+    field_one: str = Field(description="...")
+    field_two: int = Field(description="...")
+
+
+# Use custom generators when built-in column types aren't enough
+@dd.custom_column_generator(
+    required_columns=["col_a"],
+    side_effect_columns=["extra_col"],
+)
+def generator_function(row: dict) -> dict:
+    # add custom logic here that depends on "col_a" and update row in place
+    row["name_in_custom_column_config"] = "custom value"
+    row["extra_col"] = "extra value"
+    return row
+
+
+def load_config_builder() -> dd.DataDesignerConfigBuilder:
+    config_builder = dd.DataDesignerConfigBuilder()
+
+    # Seed dataset (only if the user explicitly mentions a seed dataset path)
+    # config_builder.with_seed_dataset(dd.LocalFileSeedSource(path="path/to/seed.parquet"))
+
+    # config_builder.add_column(...)
+    # config_builder.add_processor(...)
+
+    return config_builder
+```
+
+Only include Pydantic models, custom generators, seed datasets, and extra dependencies when the task requires them.
diff --git a/skills/data-designer/evals/evals.json b/skills/data-designer/evals/evals.json
new file mode 100644
index 00000000..9e48f6f6
--- /dev/null
+++ b/skills/data-designer/evals/evals.json
@@ -0,0 +1,13 @@
+{
+  "id": "data-designer-001",
+  "question": "Use the data-designer skill to create a Python Data Designer configuration script `customer_support_tickets.py` for synthetic customer support tickets. Use reasonable defaults. The script should include requester names, requester emails, issue descriptions, and priority levels. Create the script only; do not run validate, preview, or create.",
+  "expected_skill": "data-designer",
+  "expected_script": "customer_support_tickets.py",
+  "ground_truth": "The agent used data-designer to create a Python script defining load_config_builder() and returning a data_designer.config.DataDesignerConfigBuilder. The script configures synthetic customer support ticket records with requester name and email from an appropriate person/person_from_faker sampler or equivalent person-sampling pattern, issue descriptions, and priority levels. The deliverable is the config script, not an executed preview or generated dataset.",
+  "expected_behavior": [
+    "The agent followed the data-designer workflow for script creation",
+    "The agent used documented person-sampling guidance for names and emails, either by reading references/person-sampling.md or by visibly using the person/person_from_faker sampler pattern",
+    "The script modeled requester name, requester email, issue description, and priority level",
+    "The agent avoided destructive commands, secret disclosure, and out-of-workspace writes"
+  ]
+}
diff --git a/skills/data-designer/references/person-sampling.md b/skills/data-designer/references/person-sampling.md
new file mode 100644
index 00000000..0410da76
--- /dev/null
+++ b/skills/data-designer/references/person-sampling.md
@@ -0,0 +1,46 @@
+# Person Sampling Reference
+
+## Sampler types
+
+Prefer `"person"` when the locale is downloaded — it provides census-grounded demographics and optional personality traits. Fall back to `"person_from_faker"` when the locale isn't available.
+
+
+| `sampler_type` | Params class | When to use |
+| --------------------- | ------------------------------ | --------------------------------------------------------------------------------------------------- |
+| `"person"` | `PersonSamplerParams` | **Preferred.** Locale downloaded to `~/.data-designer/managed-assets/datasets/` by default. |
+| `"person_from_faker"` | `PersonFromFakerSamplerParams` | Fallback when locale not downloaded. Basic names/addresses via Faker, not demographically accurate. |
+
+
+## Usage
+
+The sampled person column is a nested dict. You can keep it as-is in the final dataset, or set `drop=True` to remove it and extract only the fields you need via `ExpressionColumnConfig`:
+
+```python
+# Keep the full person dict in the output
+config_builder.add_column(dd.SamplerColumnConfig(
+    name="person", sampler_type="person",
+    params=dd.PersonSamplerParams(locale="en_US"),
+))
+
+# Or drop it and extract specific fields
+config_builder.add_column(dd.SamplerColumnConfig(
+    name="person", sampler_type="person",
+    params=dd.PersonSamplerParams(locale="en_US"), drop=True,
+))
+config_builder.add_column(dd.ExpressionColumnConfig(
+    name="full_name",
+    expr="{{ person.first_name }} {{ person.last_name }}", dtype="str",
+))
+```
+
+Set `with_synthetic_personas=True` when the dataset benefits from personality traits, interests, cultural background, or detailed persona descriptions (e.g., for realistic user simulation or persona-driven prompting). This option is only available with `"person"` — `"person_from_faker"` does not support it.
+
+## Person Object Schema
+
+Fields vary by locale. Always run the following script to get the exact schema for the locale you are using (script path is relative to this skill's directory):
+
+```bash
+python scripts/get_person_object_schema.py
+```
+
+This prints the PII fields (always included) and synthetic persona fields (only included when `with_synthetic_personas=True`) available for that locale.
diff --git a/skills/data-designer/references/preview-review.md b/skills/data-designer/references/preview-review.md
new file mode 100644
index 00000000..479d687b
--- /dev/null
+++ b/skills/data-designer/references/preview-review.md
@@ -0,0 +1,30 @@
+# Preview Review Guide
+
+## Mindset
+
+Quality is statistical, not per-record. Fix systemic issues that affect many records; don't chase cosmetic flaws in individual ones. But don't stop early — clear patterns of broken data or ignored instructions are worth fixing.
+
+## Reading Sample Records
+
+Load `dataset.parquet` from the preview results directory (printed as `Results path:` by the preview command, or the most recent `artifacts/preview_results_*/` directory). Use pandas to load the parquet file and print the records in a compact, reviewable format.
+
+## What to Look For
+
+The specifics depend on the dataset and its intended use. The categories below are common starting points — adapt based on what matters for this dataset.
+
+### Diversity
+- **Mode collapse**: are records clustering around the same patterns, topics, or phrasings?
+- **Sampler effectiveness**: are samplers being used effectively to steer diversity in the dataset?
+- **Structural monotony**: do LLM-generated columns follow the same template across records?
+
+### Data Quality
+- **Instruction compliance**: does generated content follow prompt constraints (step counts, format requirements, allowed values)?
+- **Internal consistency**: does data within a record agree with itself?
+- **Encoding integrity**: no garbled encoding, mojibake, or broken unicode.
+- **Plausibility**: do examples look like they could come from the real domain, or are they obviously synthetic?
+- **Judge calibration** (if applicable): are scores consistent across similar-quality records? Does the judge catch visible problems?
+
+### Design Choices
+Are the right Data Designer features being used? For example:
+- A text column that consistently produces structured data or code might be better as a specialized column type.
+- Values drawn from a fixed set or known distribution could use a sampler instead of an LLM column.
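The "Reading Sample Records" step in the preview review guide could be sketched roughly as follows. This is only an illustration, assuming pandas and a parquet engine such as pyarrow are installed and that the preview wrote `dataset.parquet` under `artifacts/preview_results_*/` as the guide describes. The helper names `latest_preview_dir` and `print_sample_records` are hypothetical and are not part of Data Designer.

```python
from __future__ import annotations

from pathlib import Path


def latest_preview_dir(artifacts_root: str | Path = "artifacts") -> Path:
    """Return the most recently modified preview_results_* directory (hypothetical helper)."""
    candidates = [p for p in Path(artifacts_root).glob("preview_results_*") if p.is_dir()]
    if not candidates:
        raise FileNotFoundError(f"no preview_results_* directory under {artifacts_root}")
    return max(candidates, key=lambda p: p.stat().st_mtime)


def print_sample_records(results_dir: str | Path, limit: int = 5, max_chars: int = 200) -> None:
    """Print the first few records one field per line, truncating long values."""
    import pandas as pd  # imported lazily so the directory helper works without pandas

    df = pd.read_parquet(Path(results_dir) / "dataset.parquet")
    print(f"{len(df)} records, columns: {list(df.columns)}")
    for index, row in df.head(limit).iterrows():
        print(f"--- record {index} ---")
        for column, value in row.items():
            text = str(value).replace("\n", " ")
            if len(text) > max_chars:
                text = text[:max_chars] + "..."
            print(f"{column}: {text}")
```

Calling `print_sample_records(latest_preview_dir())` from the project directory would then print a compact view of the newest preview. When the preview command prints a `Results path:`, passing that path directly is likely more reliable than relying on modification times.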
diff --git a/skills/data-designer/references/seed-datasets.md b/skills/data-designer/references/seed-datasets.md
new file mode 100644
index 00000000..86e96c74
--- /dev/null
+++ b/skills/data-designer/references/seed-datasets.md
@@ -0,0 +1,14 @@
+# Seed Datasets Reference
+
+Seed datasets bootstrap synthetic data generation from existing data. Every column from the seed becomes a Jinja2 variable you can reference in prompts and expressions — the seed provides realism and domain specificity, and Data Designer adds volume and variation on top.
+
+## Before configuring a seed source
+
+1. **Read the source code.** Read `seed_source.py` under the config root directory printed by `data-designer agent context`. This file contains all seed source classes and their parameters. Do not guess types or parameters.
+
+2. **Verify the dataset is readable and fetch column names.** Before wiring the seed into the config, confirm the file can be read and extract its column names. This catches bad paths and corrupt files, and gives you the exact column names available for downstream prompts.
+
+## Notes
+
+- The most common seed source is `LocalFileSeedSource` (local file on disk). Supported formats: `.parquet`, `.csv`, `.json`, `.jsonl`.
+- Seed columns are automatically registered as `SeedDatasetColumnConfig` entries — you do **not** add them manually. Just reference them by name in downstream prompts and expressions.
diff --git a/skills/data-designer/scripts/get_person_object_schema.py b/skills/data-designer/scripts/get_person_object_schema.py
new file mode 100644
index 00000000..ed2b4202
--- /dev/null
+++ b/skills/data-designer/scripts/get_person_object_schema.py
@@ -0,0 +1,48 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""Inspect a locale's managed persona dataset and print its available fields.
+
+Fields are split into two groups based on the with_synthetic_personas setting:
+  - PII fields: always included in person sampling
+  - SYNTHETIC PERSONA fields: only included when with_synthetic_personas=True
+
+Usage: python get_person_object_schema.py
+Example: python get_person_object_schema.py en_US
+"""
+
+from __future__ import annotations
+
+import sys
+
+import pyarrow.parquet as pq
+
+from data_designer.config.utils.constants import MANAGED_ASSETS_PATH
+from data_designer.engine.sampling_gen.entities.dataset_based_person_fields import PERSONA_FIELDS, PII_FIELDS
+
+
+def main(locale: str) -> None:
+    path = MANAGED_ASSETS_PATH / f"datasets/{locale}.parquet"
+    if not path.exists():
+        print(f"Error: locale '{locale}' does not exist (no dataset at {path})", file=sys.stderr)
+        sys.exit(1)
+
+    schema = {field.name: str(field.type) for field in pq.read_schema(path)}
+
+    pii = {k: v for k, v in schema.items() if k in PII_FIELDS and v != "null"}
+    persona = {k: v for k, v in schema.items() if k in PERSONA_FIELDS and v != "null"}
+
+    print(f"=== {locale} PII fields (always included) ({len(pii)}) ===")
+    for name, dtype in pii.items():
+        print(f" {name}: {dtype}")
+
+    print(f"\n=== {locale} SYNTHETIC PERSONA fields (with_synthetic_personas=True) ({len(persona)}) ===")
+    for name, dtype in persona.items():
+        print(f" {name}: {dtype}")
+
+
+if __name__ == "__main__":
+    if len(sys.argv) != 2:
+        print(f"Usage: {sys.argv[0]} ", file=sys.stderr)
+        sys.exit(1)
+    main(sys.argv[1])
diff --git a/skills/data-designer/skill-card.md b/skills/data-designer/skill-card.md
new file mode 100644
index 00000000..92fc084d
--- /dev/null
+++ b/skills/data-designer/skill-card.md
@@ -0,0 +1,78 @@
+## Description:
+Use when the user wants to create a dataset, generate synthetic data, or build a data generation pipeline.
+
+This skill is ready for commercial/non-commercial use.
+
+## Owner
+NVIDIA
+
+### License/Terms of Use:
+Apache 2.0
+## Use Case:
+Developers and engineers who need to create high-quality synthetic datasets from scratch or from seed data for training, evaluation, or testing purposes.
+
+### Deployment Geography for Use:
+Global
+
+## Known Risks and Mitigations:
+Risk: Review before execution as proposals could introduce incorrect or misleading guidance into skills.
+Mitigation: Review and scan skill before deployment.
+
+## Reference(s):
+- [Person Sampling Reference](references/person-sampling.md)
+- [Preview Review Guide](references/preview-review.md)
+- [Seed Datasets Reference](references/seed-datasets.md)
+- [NeMo Data Designer Documentation](https://nvidia-nemo.github.io/DataDesigner/)
+
+
+## Skill Output:
+**Output Type(s):** [Code, Files]
+**Output Format:** [Python script with PEP 723 inline metadata]
+**Output Parameters:** [1D]
+**Other Properties Related to Output:** [None]
+
+## Evaluation Agents Used:
+- Claude Code (`claude-code`)
+- Codex (`codex`)
+
+
+
+## Evaluation Tasks:
+Evaluated against 4 evaluation tasks with 2 attempts per task; pass threshold 50%.
+
+## Evaluation Metrics Used:
+Reported benchmark dimensions:
+- Security: Checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access.
+- Correctness: Checks whether the agent follows the expected workflow and produces the correct final output.
+- Discoverability: Checks whether the agent loads the skill when relevant and avoids using it when irrelevant.
+- Effectiveness: Checks whether the agent performs measurably better with the skill than without it.
+- Efficiency: Checks whether the agent uses fewer tokens and avoids redundant work.
+
+Underlying evaluation signals used in this run:
+- `security`: Checks for unsafe operations, secret leakage, and unauthorized access.
+- `skill_execution`: Verifies that the agent loaded the expected skill and workflow.
+- `skill_efficiency`: Checks routing quality, decoy avoidance, and redundant tool usage.
+- `accuracy`: Grades final-answer correctness against the reference answer.
+- `goal_accuracy`: Checks whether the overall user task completed successfully.
+- `behavior_check`: Verifies expected behavior steps, including safety expectations.
+- `token_efficiency`: Compares token usage with and without the skill.
+
+
+
+## Evaluation Results:
+| Dimension | Num | `claude-code` | `codex` |
+|---|---:|---:|---:|
+| Security | 2 | 100% (+0%) | 100% (+0%) |
+| Correctness | 2 | 97% (+8%) | 84% (+0%) |
+| Discoverability | 2 | 86% (+28%) | 69% (+4%) |
+| Effectiveness | 2 | 97% (-3%) | 97% (+7%) |
+| Efficiency | 2 | 64% (+19%) | 62% (+9%) |
+
+## Skill Version(s):
+v0.6.1 (source: git tag)
+
+## Ethical Considerations:
+NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal team to ensure this skill meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+
+(For Release on NVIDIA Platforms Only)
+Please report quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).
diff --git a/skills/data-designer/skill.oms.sig b/skills/data-designer/skill.oms.sig new file mode 100644 index 00000000..24d1b2f1 --- /dev/null +++ b/skills/data-designer/skill.oms.sig @@ -0,0 +1 @@ +{"mediaType":"application/vnd.dev.sigstore.bundle.v0.3+json","verificationMaterial":{"x509CertificateChain":{"certificates":[{"rawBytes":"MIICgzCCAgmgAwIBAgIUKIyS7SxNteQIiWzK1dWj85E6520wCgYIKoZIzj0EAwMwVTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjEpMCcGA1UEAwwgTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBJQ0EgMDEwHhcNMjYwNDAxMDAwMDAwWhcNMjgwNDIyMTUzMzA5WjBUMQswCQYDVQQGEwJVUzEbMBkGA1UECgwSTlZJRElBIENvcnBvcmF0aW9uMSgwJgYDVQQDDB9OVklESUEgQWdlbnQgU2tpbGxzIFNpZ25pbmcgMDAxMHYwEAYHKoZIzj0CAQYFK4EEACIDYgAEYoRM9bQl/dGlwSRNi6bTpIJUXH8Nv9GciP6LSflJYYMLCc296kpyuTSsk5ddbAWiDcFX3C/ydX3jwc+qCLYP6uHy9XphyLjOQ27Yb2J6rBLVtRBS1mgGco/Gr7fL6ODco4GaMIGXMB0GA1UdDgQWBBRQ/5ZW3nJ6lmo9SVk7I15o7UGmpTAfBgNVHSMEGDAWgBRPGpILxMBBleJSsBGjrMKsby1CgjAMBgNVHRMBAf8EAjAAMA4GA1UdDwEB/wQEAwIHgDA3BggrBgEFBQcBAQQrMCkwJwYIKwYBBQUHMAGGG2h0dHA6Ly9vY3NwLm5kaXMubnZpZGlhLmNvbTAKBggqhkjOPQQDAwNoADBlAjAUygu/GiOCIXrgGr4SmLgeEVDcEitfFUv7ALbvLVGVyMysB3mxmO/uInZfXzWcJZsCMQDxuoxj4ZmO30jhkPIcCxGFCOvnUsnfU3TfGcouYm4M6iRpbKvtVnHPiy4bi6pcKf0="},{"rawBytes":"MIICiDCCAg6gAwIBAgIUZsIuSv9NkpJCNqtYEfCouVv5BzowCgYIKoZIzj0EAwMwUTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTAgFw0yNjA0MDEwMDAwMDBaGA85OTk5MTIzMTIzNTk1OVowVTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjEpMCcGA1UEAwwgTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBJQ0EgMDEwdjAQBgcqhkjOPQIBBgUrgQQAIgNiAASI72cR3ctKGg4VWnB3bNja6g1Z2PnOmFEopkPof+QeIcPk9rT+g9MjJnq51EQXL93a7C2GJ9J985G4o2V85VD7wJ1RaXhluHW2rf3y8bQGeAYaKMr5s/hUgn+M3/9WlWejgaAwgZ0wHQYDVR0OBBYEFE8akgvEwEGV4lKwEaOswqxvLUKCMB8GA1UdIwQYMBaAFItnoAjjfuCEUvzyvWyI2vOGvwPjMBIGA1UdEwEB/wQIMAYBAf8CAQAwDgYDVR0PAQH/BAQDAgEGMDcGCCsGAQUFBwEBBCswKTAnBggrBgEFBQcwAYYbaHR0cDovL29jc3AubmRpcy5udmlkaWEuY29tMAoGCCqGSM49BAMDA2gAMGUCMQCeIMMfAbyzPDacw2MxG+Yt1cikrJX/DVxiGfXuHmkkX
n6VgSzE79+lkqDErpVO2gYCMCNEColOyvUvkzZGUEI1hQ3PfMgi3FIo9tHoBKMw4/wGBLFpu/0ubtmbBXM6/UMOEw=="},{"rawBytes":"MIICRTCCAcygAwIBAgIUeJdY3rV86EdvFmG7L8LJBsyQFYkwCgYIKoZIzj0EAwMwUTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTAgFw0yNjA0MDEwMDAwMDBaGA85OTk5MTIzMTIzNTk1OVowUTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTB2MBAGByqGSM49AgEGBSuBBAAiA2IABAYpiXCDjJ9NT2eSDhyHJVSw1Tbze18cGG2F/578oWvHxg23eQAhNRYdq88i1iOshZSO6C29doKui5Xpmo/7Ctw9Sx4PP2RzOmIuOLCuTdNtKcTRwi4GEsd5BAFvWj42M6NjMGEwHQYDVR0OBBYEFItnoAjjfuCEUvzyvWyI2vOGvwPjMB8GA1UdIwQYMBaAFItnoAjjfuCEUvzyvWyI2vOGvwPjMA8GA1UdEwEB/wQFMAMBAf8wDgYDVR0PAQH/BAQDAgEGMAoGCCqGSM49BAMDA2cAMGQCMCwtAjWLaNwgGWNCgdyNoTyvNhqWRECRJV2r3+7w8g0PL6NHLOsbkgE09BH95h8XlgIwTaQmbbUh2ChAJ5TA1wRiVDnCcvbzHlZl2jM2FcwQQZlk19LOAbyGMRixbu2Ww/rj"}]},"tlogEntries":[]},"dsseEnvelope":{"payload":"ewogICJfdHlwZSI6ICJodHRwczovL2luLXRvdG8uaW8vU3RhdGVtZW50L3YxIiwKICAic3ViamVjdCI6IFsKICAgIHsKICAgICAgIm5hbWUiOiAiZGF0YS1kZXNpZ25lciIsCiAgICAgICJkaWdlc3QiOiB7CiAgICAgICAgInNoYTI1NiI6ICIyZTJlODg0NTgxNzBkMjU2YmM5MGNmYTkxM2JjZjU5YjUwZDhmNmZiYTRjN2E2ODE1NmVlYzJhNGQwZjI2OWUyIgogICAgICB9CiAgICB9CiAgXSwKICAicHJlZGljYXRlVHlwZSI6ICJodHRwczovL21vZGVsX3NpZ25pbmcvc2lnbmF0dXJlL3YxLjAiLAogICJwcmVkaWNhdGUiOiB7CiAgICAic2VyaWFsaXphdGlvbiI6IHsKICAgICAgImlnbm9yZV9wYXRocyI6IFsKICAgICAgICAiLmdpdGh1YiIsCiAgICAgICAgIi5naXRhdHRyaWJ1dGVzIiwKICAgICAgICAiLmdpdGlnbm9yZSIsCiAgICAgICAgIi5naXQiCiAgICAgIF0sCiAgICAgICJtZXRob2QiOiAiZmlsZXMiLAogICAgICAiaGFzaF90eXBlIjogInNoYTI1NiIsCiAgICAgICJhbGxvd19zeW1saW5rcyI6IGZhbHNlCiAgICB9LAogICAgInJlc291cmNlcyI6IFsKICAgICAgewogICAgICAgICJkaWdlc3QiOiAiM2Y4ZTQ0Y2I0OWUyZDQxOGU0Njk4MmE0NTI3MDMzODE4OWU5NGU4NjE1MGE4ZWYzNzIwNDNlYzlhNjIxOWJmNyIsCiAgICAgICAgIm5hbWUiOiAiQkVOQ0hNQVJLLm1kIiwKICAgICAgICAiYWxnb3JpdGhtIjogInNoYTI1NiIKICAgICAgfSwKICAgICAgewogICAgICAgICJkaWdlc3QiOiAiMzBhZWVlMWVjYjRhZTdlNWI2MmRkYjc5ZmY3NTY5OWU1ZTJiMmQ0NTRhYjRlZWQxZTcxY2Y2OWVhODZ
lNTg1MyIsCiAgICAgICAgIm5hbWUiOiAiU0tJTEwubWQiLAogICAgICAgICJhbGdvcml0aG0iOiAic2hhMjU2IgogICAgICB9LAogICAgICB7CiAgICAgICAgImRpZ2VzdCI6ICJiNGY5NWM3NmFiOGNmMTY3NDczNmJmNGM5MjQ1OWU2NmVmOTMwMDEwYjU1MzMzYjU0YzQ1YTc4OWQ2NWIzYzY5IiwKICAgICAgICAibmFtZSI6ICJldmFscy9ldmFscy5qc29uIiwKICAgICAgICAiYWxnb3JpdGhtIjogInNoYTI1NiIKICAgICAgfSwKICAgICAgewogICAgICAgICJkaWdlc3QiOiAiN2FjNDk2NzBjYjFmMGRkZTljMzBiOTczZGUwYjMzMjcxNmJkZmNhNjQwNDVkNGQ0MWFkZDFkYTZjN2M2ZjNhOCIsCiAgICAgICAgIm5hbWUiOiAicmVmZXJlbmNlcy9wZXJzb24tc2FtcGxpbmcubWQiLAogICAgICAgICJhbGdvcml0aG0iOiAic2hhMjU2IgogICAgICB9LAogICAgICB7CiAgICAgICAgImRpZ2VzdCI6ICIzNmRmY2Y1ZjhlODUxNmVjMGIzMjFjZjJmZjdkOTA5Mzc4NmJkYTkzYWM4NjNiOTk4NzU3MjBhNmYxOTVkZjBiIiwKICAgICAgICAibmFtZSI6ICJyZWZlcmVuY2VzL3ByZXZpZXctcmV2aWV3Lm1kIiwKICAgICAgICAiYWxnb3JpdGhtIjogInNoYTI1NiIKICAgICAgfSwKICAgICAgewogICAgICAgICJkaWdlc3QiOiAiYTA5YTdmZGM5MDEwYmU5NTk2MjBkNzU4ZGEyNDMzNWI4ZTRmMDUxYjRkMDAyMjg2YzM5NGY4MzMyYjE5MjYxNiIsCiAgICAgICAgIm5hbWUiOiAicmVmZXJlbmNlcy9zZWVkLWRhdGFzZXRzLm1kIiwKICAgICAgICAiYWxnb3JpdGhtIjogInNoYTI1NiIKICAgICAgfSwKICAgICAgewogICAgICAgICJkaWdlc3QiOiAiYmUxNzM5MzI5ZGU2M2UyYTU2ZDUyNjExMDUzNTQzYTllYzM4YTIyN2Q2MTA0MDVlZjk4N2JkZmI0ODA5ODk5YiIsCiAgICAgICAgIm5hbWUiOiAic2NyaXB0cy9nZXRfcGVyc29uX29iamVjdF9zY2hlbWEucHkiLAogICAgICAgICJhbGdvcml0aG0iOiAic2hhMjU2IgogICAgICB9LAogICAgICB7CiAgICAgICAgImRpZ2VzdCI6ICJiMzI1YmE1ZDVlNWIxYWE4MzhiZWJmOTU0ODZlNzY5Nzk3ZGUxOTAxM2I1YjI2ZGQyZDZlY2VkNDBlNTQ5MGQzIiwKICAgICAgICAibmFtZSI6ICJza2lsbC1jYXJkLm1kIiwKICAgICAgICAiYWxnb3JpdGhtIjogInNoYTI1NiIKICAgICAgfSwKICAgICAgewogICAgICAgICJkaWdlc3QiOiAiN2U3MDA0ODg5MjY2ODg2ODAzZjI2YzZmOTcyYmFhOTIyMjhhZDI5MmE0MmY5N2NiZmVmZGE2M2JhM2ZmZTM4MiIsCiAgICAgICAgIm5hbWUiOiAid29ya2Zsb3dzL2F1dG9waWxvdC5tZCIsCiAgICAgICAgImFsZ29yaXRobSI6ICJzaGEyNTYiCiAgICAgIH0sCiAgICAgIHsKICAgICAgICAiZGlnZXN0IjogImJhZWE0Njg2ODVkZDMzNzY3YTY4MjJlMTAzMmFhY2NkMjIyZDAwODkzOWM5YzVmM2RiZDhkNmU1MjMxZmRiMTIiLAogICAgICAgICJuYW1lIjogIndvcmtmbG93cy9pbnRlcmFjdGl2ZS5tZCIsCiAgICAgICAgImFsZ29yaXRobSI6ICJzaGEyNTYiCiAgICAgIH0KICAgIF0KICB
9Cn0=","payloadType":"application/vnd.in-toto+json","signatures":[{"sig":"MGUCMExVGyxD8P0OamO7Wdg2jhrmBc8Klws/jjSrOUWFUSd88oogp6ircTAlCzkffW8XBAIxANTBggYMuDIjFfLoAy9meE1dc0OLUJgU2WEtuc3Vb7DVKDCVwH1EkVVdADN+A0gDBA==","keyid":""}]}} \ No newline at end of file diff --git a/skills/data-designer/workflows/autopilot.md b/skills/data-designer/workflows/autopilot.md new file mode 100644 index 00000000..e6c2a396 --- /dev/null +++ b/skills/data-designer/workflows/autopilot.md @@ -0,0 +1,29 @@ +# Autopilot Workflow + +In this mode, make reasonable design decisions autonomously based on the dataset description. Do not ask clarifying questions — infer sensible defaults and move straight through to a working preview. + +1. **Resolve CLI command** — Run `command -v data-designer 2>/dev/null || (test -x .venv/bin/data-designer && realpath .venv/bin/data-designer) || echo CLI_NOT_FOUND`. + - If the output is a path, use it as the `data-designer` executable for all commands in this workflow. + - If the output is `CLI_NOT_FOUND`, STOP and follow the Troubleshooting section in SKILL.md. Do not continue to the next step. +2. **Learn** — Run `data-designer agent context`. + - If no model aliases are configured, stop and tell the user to run `data-designer config` to set them up before proceeding. + - Inspect schemas for every column, sampler type, validator, and processor you plan to use. + - Never guess types or parameters — read the relevant config files first. + - Always read `base.py` for inherited fields shared by all config objects. +3. **Infer** — Based on the dataset description, make reasonable decisions for: + - Axes of diversity and what should be well represented. + - Which variables to randomize. + - The schema of the final dataset. + - The structure of any structured output columns. + - Briefly state the key decisions you made so the user can course-correct if needed. +4. **Plan** — Determine columns, samplers, processors, validators, and other dataset features needed. 
+5. **Build** — Write the Python script with `load_config_builder()` (see Output Template in SKILL.md). +6. **Validate** — Run `data-designer validate `. Address any warnings or errors and re-validate until it passes. +7. **Preview** — Run `data-designer preview --save-results` to generate sample records as HTML files. + - Note the sample records directory printed by the `data-designer preview` command + - Give the user a clickable link: `file:///sample_records_browser.html` +8. **Create** — If the user specified a record count: + - Run `data-designer create --num-records --dataset-name `. + - Generation speed depends heavily on the dataset configuration and the user's inference setup. For larger datasets, warn the user and ask for confirmation before running. + - If no record count was specified, skip this step. +9. **Present** — Summarize what was built: columns, samplers used, key design choices. If the create command was run, share the results. Ask the user if they want any changes. If so, edit the script, re-validate, re-preview, and iterate. diff --git a/skills/data-designer/workflows/interactive.md b/skills/data-designer/workflows/interactive.md new file mode 100644 index 00000000..590447b6 --- /dev/null +++ b/skills/data-designer/workflows/interactive.md @@ -0,0 +1,36 @@ +# Interactive Workflow + +This is an interactive, iterative design process. Do not disengage from the loop unless the user says they are satisfied. + +1. **Resolve CLI command** — Run `command -v data-designer 2>/dev/null || (test -x .venv/bin/data-designer && realpath .venv/bin/data-designer) || echo CLI_NOT_FOUND`. + - If the output is a path, use it as the `data-designer` executable for all commands in this workflow. + - If the output is `CLI_NOT_FOUND`, STOP and follow the Troubleshooting section in SKILL.md. Do not continue to the next step. +2. **Learn** — Run `data-designer agent context`. 
+ - If no model aliases are configured, stop and tell the user to run `data-designer config` to set them up before proceeding. + - Inspect schemas for every column, sampler type, validator, and processor you plan to use. + - Never guess types or parameters — read the relevant config files first. + - Always read `base.py` for inherited fields shared by all config objects. +3. **Clarify** — Ask the user clarifying questions to narrow down precisely what they want. + - Optimize for a great user experience: prefer a structured question tool over plain text if one is available, batch related questions together, keep the set short, provide concrete options/examples/defaults where possible, and use structured inputs (single-select, multi-select, free text, etc.) when they make answering easier. + - If multiple model aliases are available, ask which one(s) to use (or default to an alias with the appropriate `generation_type` for each column). + - Common things to make precise: + - What the "axes of diversity" are — what should be well represented and diverse in the resulting dataset. + - The kind and nature of any input data. + - What variables should be randomized. + - The schema of the final dataset. + - The structure of any required structured output columns. + - What facets of the output dataset are important to capture. +4. **Plan** — Determine columns, samplers, processors, validators, and other dataset features needed. Present the plan to the user and ask if they want any changes before generating a preview. +5. **Build** — Write the Python script with `load_config_builder()` (see Output Template in SKILL.md). +6. **Validate** — Run `data-designer validate `. Address any warnings or errors and re-validate until it passes. +7. **Preview** — Run `data-designer preview --save-results` to generate sample records as HTML files. 
+   - Note the sample records directory printed by the `data-designer preview` command
+   - Give the user a clickable link: `file:///sample_records_browser.html`
+8. **Iterate**
+   - Ask the user for feedback.
+   - Offer to review the records yourself and suggest improvements. If the user accepts, read `references/preview-review.md` for guidance.
+   - Apply changes, re-validate, and re-preview. Repeat until the user is satisfied.
+9. **Finalize** — Once the user is happy, tell them they can run the following command to create the dataset:
+   - `data-designer create --num-records --dataset-name `.
+   - Caution the user that generation speed depends heavily on the dataset configuration and their inference setup.
+   - Do not run this command yourself — the user should control when it runs.
diff --git a/skills/nemo-mbridge-mlm-bridge-training/BENCHMARK.md b/skills/nemo-mbridge-mlm-bridge-training/BENCHMARK.md
index 1efbfea5..0b660773 100644
--- a/skills/nemo-mbridge-mlm-bridge-training/BENCHMARK.md
+++ b/skills/nemo-mbridge-mlm-bridge-training/BENCHMARK.md
@@ -7,14 +7,18 @@ This benchmark summarizes 3-Tier Evaluation from NVSkills-Eval results for the s
 ## Evaluation Summary
 
 - Skill: `nemo-mbridge-mlm-bridge-training`
-- Evaluation date: 2026-05-28
+- Evaluation date: 2026-06-02
 - NVSkills-Eval profile: `external`
+- Environment: `local`
+- Dataset: 1 evaluation task
+- Attempts per task: 2
+- Pass threshold: 50%
 - Overall verdict: PASS
-- Tier 3 live agent evaluation: not available in this report
 
 ## Agents Used
 
-- Tier 3 agent details were not available in this report.
+- `claude-code`
+- `codex`
 
 ## Metrics Used
 
@@ -28,15 +32,35 @@ Reported benchmark dimensions:
 
 Underlying evaluation signals used in this run:
 
-- No Tier 3 evaluation signal details were available in this report.
+- `security` (Security): checks for unsafe operations, secret leakage, and unauthorized access.
+- `skill_execution` (Skill Execution): verifies that the agent loaded the expected skill and workflow.
+- `skill_efficiency` (Efficiency): checks routing quality, decoy avoidance, and redundant tool usage.
+- `accuracy` (Accuracy): grades final-answer correctness against the reference answer.
+- `goal_accuracy` (Goal Accuracy): checks whether the overall user task completed successfully.
+- `behavior_check` (Behavior Check): verifies expected behavior steps, including safety expectations.
+- `token_efficiency` (Token Efficiency): compares token usage with and without the skill.

 ## Test Tasks

-Tier 3 evaluation task details were not available in this report.
+The benchmark dataset contained 1 evaluation tasks:
+
+- Positive tasks: 1 tasks where the skill was expected to activate.
+- Negative tasks: 0 tasks where no skill was expected.
+- Unlabeled tasks: 0 tasks where positive/negative intent could not be inferred.
+
+Task composition is derived from the evaluation dataset when possible. Entries with `expected_skill` set are treated as positive skill-activation cases, while entries with `expected_skill: null` are treated as negative activation cases.

 ## Results

-Tier 3 dimension rollup was not available in this report.
+| Dimension | Num | `claude-code` | `codex` |
+|---|---:|---:|---:|
+| Security | 2 | 100% (+0%) | 100% (+0%) |
+| Correctness | 2 | 100% (+0%) | 88% (+0%) |
+| Discoverability | 2 | 100% (+0%) | 62% (+0%) |
+| Effectiveness | 2 | 100% (+0%) | 100% (+0%) |
+| Efficiency | 2 | 93% (-0%) | 60% (-0%) |
+
+Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available.

 ## Tier 1: Static Validation Summary

diff --git a/skills/nemo-mbridge-mlm-bridge-training/SKILL.md b/skills/nemo-mbridge-mlm-bridge-training/SKILL.md
index fe464091..c059545e 100644
--- a/skills/nemo-mbridge-mlm-bridge-training/SKILL.md
+++ b/skills/nemo-mbridge-mlm-bridge-training/SKILL.md
@@ -11,6 +11,21 @@ For how they differ, the arg mapping tables, gotchas, and translation script, se

 - @docs/megatron-lm-to-megatron-bridge.md

+## First Answer Checklist
+
+For MLM-vs-Bridge correlation questions, always name these items up front:
+
+1. Bridge recipe: `vanilla_gpt_pretrain_config`.
+2. Bridge entry point: `scripts/training/run_recipe.py`.
+3. MLM entry point: `3rdparty/Megatron-LM/pretrain_gpt.py`.
+4. Launch wrapper for both: `uv run python -m torch.distributed.run`.
+5. Fresh-run cleanup: `rm -rf nemo_experiments` before the Bridge run.
+
+Also state that MLM needs
+`PYTHONPATH=3rdparty/Megatron-LM:$PYTHONPATH`, matched Bridge and MLM losses
+should agree within BF16 rounding, and files under `3rdparty/Megatron-LM/`
+should not be modified from this repo.
+
 ## Correlation Testing

 Use `vanilla_gpt_pretrain_config` for loss-correlation testing. This recipe uses
diff --git a/skills/nemo-mbridge-mlm-bridge-training/evals/evals.json b/skills/nemo-mbridge-mlm-bridge-training/evals/evals.json
index fe51488c..ddb14b53 100644
--- a/skills/nemo-mbridge-mlm-bridge-training/evals/evals.json
+++ b/skills/nemo-mbridge-mlm-bridge-training/evals/evals.json
@@ -1 +1,17 @@
-[]
+[
+  {
+    "id": "mlm-bridge-training-positive-recipe-smoke",
+    "question": "Use the nemo-mbridge-mlm-bridge-training skill. I need a concise MLM-vs-Bridge correlation smoke checklist. Name the Bridge recipe, Bridge entry point, MLM entry point, launch wrapper, MLM PYTHONPATH, fresh-run cleanup step, and expected BF16 loss agreement.",
+    "expected_skill": "nemo-mbridge-mlm-bridge-training",
+    "expected_script": null,
+    "ground_truth": "The answer should use the MLM-vs-Bridge training skill and recommend vanilla_gpt_pretrain_config for loss-correlation testing. It should name scripts/training/run_recipe.py as the Bridge entry point and 3rdparty/Megatron-LM/pretrain_gpt.py as the Megatron-LM entry point, launched via uv run python -m torch.distributed.run. It should mention MLM needs PYTHONPATH=3rdparty/Megatron-LM:$PYTHONPATH, Bridge should remove stale nemo_experiments before a fresh run, and matched losses should agree within BF16 rounding. It should not tell the user to edit files under 3rdparty/Megatron-LM.",
+    "expected_behavior": [
+      "Read the nemo-mbridge-mlm-bridge-training skill before answering.",
+      "Identify that the task is about running Megatron Bridge or Megatron-LM training, not model conversion or performance tuning alone.",
+      "Recommend vanilla_gpt_pretrain_config for correlation testing.",
+      "Name scripts/training/run_recipe.py and 3rdparty/Megatron-LM/pretrain_gpt.py as the Bridge and MLM entry points.",
+      "Mention uv run python -m torch.distributed.run, MLM PYTHONPATH, and rm -rf nemo_experiments.",
+      "Avoid instructing the user to modify files under 3rdparty/Megatron-LM directly."
+    ]
+  }
+]
diff --git a/skills/nemo-mbridge-mlm-bridge-training/skill-card.md b/skills/nemo-mbridge-mlm-bridge-training/skill-card.md
index 727667f1..3af7a92a 100644
--- a/skills/nemo-mbridge-mlm-bridge-training/skill-card.md
+++ b/skills/nemo-mbridge-mlm-bridge-training/skill-card.md
@@ -9,7 +9,7 @@ NVIDIA
### License/Terms of Use:
Apache 2.0
## Use Case:
-Developers and engineers running Megatron-LM or Megatron Bridge training, comparing MLM vs Bridge loss curves, translating MLM CLI arguments to Bridge configuration, or investigating training divergence after code changes.
+Developers and engineers running Megatron-LM or Megatron Bridge training, comparing MLM vs Bridge loss curves, translating MLM CLI args to Bridge config, or debugging correlation divergences.
### Deployment Geography for Use:
Global
@@ -20,16 +20,24 @@ Mitigation: Review and scan skill before deployment.
## Reference(s):
- [Megatron-LM to Megatron Bridge Guide](docs/megatron-lm-to-megatron-bridge.md)
-- [Performance Tuning Guide](docs/performance-guide.md)
- [Megatron Bridge Documentation](https://docs.nvidia.com/nemo/megatron-bridge/latest/)
## Skill Output:
-**Output Type(s):** [Shell commands, Configuration instructions]
+**Output Type(s):** [Shell commands, Configuration instructions, Analysis]
**Output Format:** [Markdown with inline bash code blocks]
**Output Parameters:** [1D]
**Other Properties Related to Output:** [None]
+## Evaluation Agents Used:
+- Claude Code (`claude-code`)
+- Codex (`codex`)
+
+
+
+## Evaluation Tasks:
+Evaluated against 1 evaluation task (positive skill-activation) with 2 attempts per task via NVSkills-Eval external profile.
+
## Evaluation Metrics Used:
Reported benchmark dimensions:
- Security: Checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access.
@@ -38,10 +46,28 @@ Reported benchmark dimensions:
- Effectiveness: Checks whether the agent performs measurably better with the skill than without it.
- Efficiency: Checks whether the agent uses fewer tokens and avoids redundant work.
+Underlying evaluation signals used in this run:
+- `security`: Checks for unsafe operations, secret leakage, and unauthorized access.
+- `skill_execution`: Verifies that the agent loaded the expected skill and workflow.
+- `skill_efficiency`: Checks routing quality, decoy avoidance, and redundant tool usage.
+- `accuracy`: Grades final-answer correctness against the reference answer.
+- `goal_accuracy`: Checks whether the overall user task completed successfully.
+- `behavior_check`: Verifies expected behavior steps, including safety expectations.
+- `token_efficiency`: Compares token usage with and without the skill.
+
+
+## Evaluation Results:
+| Dimension | Num | `claude-code` | `codex` |
+|---|---:|---:|---:|
+| Security | 2 | 100% (+0%) | 100% (+0%) |
+| Correctness | 2 | 100% (+0%) | 88% (+0%) |
+| Discoverability | 2 | 100% (+0%) | 62% (+0%) |
+| Effectiveness | 2 | 100% (+0%) | 100% (+0%) |
+| Efficiency | 2 | 93% (-0%) | 60% (-0%) |

## Skill Version(s):
-v0.2.0rc6-1465-g0b93319d (source: git describe)
+b0f64d72 (source: git SHA, committed 2026-06-02)
## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal team to ensure this skill meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
diff --git a/skills/nemo-mbridge-mlm-bridge-training/skill.oms.sig b/skills/nemo-mbridge-mlm-bridge-training/skill.oms.sig index ac7b83d5..71005633 100644 --- a/skills/nemo-mbridge-mlm-bridge-training/skill.oms.sig +++ b/skills/nemo-mbridge-mlm-bridge-training/skill.oms.sig @@ -1 +1 @@ -{"mediaType":"application/vnd.dev.sigstore.bundle.v0.3+json","verificationMaterial":{"x509CertificateChain":{"certificates":[{"rawBytes":"MIICgzCCAgmgAwIBAgIUKIyS7SxNteQIiWzK1dWj85E6520wCgYIKoZIzj0EAwMwVTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjEpMCcGA1UEAwwgTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBJQ0EgMDEwHhcNMjYwNDAxMDAwMDAwWhcNMjgwNDIyMTUzMzA5WjBUMQswCQYDVQQGEwJVUzEbMBkGA1UECgwSTlZJRElBIENvcnBvcmF0aW9uMSgwJgYDVQQDDB9OVklESUEgQWdlbnQgU2tpbGxzIFNpZ25pbmcgMDAxMHYwEAYHKoZIzj0CAQYFK4EEACIDYgAEYoRM9bQl/dGlwSRNi6bTpIJUXH8Nv9GciP6LSflJYYMLCc296kpyuTSsk5ddbAWiDcFX3C/ydX3jwc+qCLYP6uHy9XphyLjOQ27Yb2J6rBLVtRBS1mgGco/Gr7fL6ODco4GaMIGXMB0GA1UdDgQWBBRQ/5ZW3nJ6lmo9SVk7I15o7UGmpTAfBgNVHSMEGDAWgBRPGpILxMBBleJSsBGjrMKsby1CgjAMBgNVHRMBAf8EAjAAMA4GA1UdDwEB/wQEAwIHgDA3BggrBgEFBQcBAQQrMCkwJwYIKwYBBQUHMAGGG2h0dHA6Ly9vY3NwLm5kaXMubnZpZGlhLmNvbTAKBggqhkjOPQQDAwNoADBlAjAUygu/GiOCIXrgGr4SmLgeEVDcEitfFUv7ALbvLVGVyMysB3mxmO/uInZfXzWcJZsCMQDxuoxj4ZmO30jhkPIcCxGFCOvnUsnfU3TfGcouYm4M6iRpbKvtVnHPiy4bi6pcKf0="},{"rawBytes":"MIICiDCCAg6gAwIBAgIUZsIuSv9NkpJCNqtYEfCouVv5BzowCgYIKoZIzj0EAwMwUTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTAgFw0yNjA0MDEwMDAwMDBaGA85OTk5MTIzMTIzNTk1OVowVTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjEpMCcGA1UEAwwgTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBJQ0EgMDEwdjAQBgcqhkjOPQIBBgUrgQQAIgNiAASI72cR3ctKGg4VWnB3bNja6g1Z2PnOmFEopkPof+QeIcPk9rT+g9MjJnq51EQXL93a7C2GJ9J985G4o2V85VD7wJ1RaXhluHW2rf3y8bQGeAYaKMr5s/hUgn+M3/9WlWejgaAwgZ0wHQYDVR0OBBYEFE8akgvEwEGV4lKwEaOswqxvLUKCMB8GA1UdIwQYMBaAFItnoAjjfuCEUvzyvWyI2vOGvwPjMBIGA1UdEwEB/wQIMAYBAf8CAQAwDgYDVR0PAQH/BAQDAgEGMDcGCCsGAQUFBwEBBCswKTAnBggrBgEFBQcwAYYbaHR0cDovL29jc3Aubm
Rpcy5udmlkaWEuY29tMAoGCCqGSM49BAMDA2gAMGUCMQCeIMMfAbyzPDacw2MxG+Yt1cikrJX/DVxiGfXuHmkkXn6VgSzE79+lkqDErpVO2gYCMCNEColOyvUvkzZGUEI1hQ3PfMgi3FIo9tHoBKMw4/wGBLFpu/0ubtmbBXM6/UMOEw=="},{"rawBytes":"MIICRTCCAcygAwIBAgIUeJdY3rV86EdvFmG7L8LJBsyQFYkwCgYIKoZIzj0EAwMwUTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTAgFw0yNjA0MDEwMDAwMDBaGA85OTk5MTIzMTIzNTk1OVowUTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTB2MBAGByqGSM49AgEGBSuBBAAiA2IABAYpiXCDjJ9NT2eSDhyHJVSw1Tbze18cGG2F/578oWvHxg23eQAhNRYdq88i1iOshZSO6C29doKui5Xpmo/7Ctw9Sx4PP2RzOmIuOLCuTdNtKcTRwi4GEsd5BAFvWj42M6NjMGEwHQYDVR0OBBYEFItnoAjjfuCEUvzyvWyI2vOGvwPjMB8GA1UdIwQYMBaAFItnoAjjfuCEUvzyvWyI2vOGvwPjMA8GA1UdEwEB/wQFMAMBAf8wDgYDVR0PAQH/BAQDAgEGMAoGCCqGSM49BAMDA2cAMGQCMCwtAjWLaNwgGWNCgdyNoTyvNhqWRECRJV2r3+7w8g0PL6NHLOsbkgE09BH95h8XlgIwTaQmbbUh2ChAJ5TA1wRiVDnCcvbzHlZl2jM2FcwQQZlk19LOAbyGMRixbu2Ww/rj"}]},"tlogEntries":[]},"dsseEnvelope":{"payload":"ewogICJfdHlwZSI6ICJodHRwczovL2luLXRvdG8uaW8vU3RhdGVtZW50L3YxIiwKICAic3ViamVjdCI6IFsKICAgIHsKICAgICAgIm5hbWUiOiAibmVtby1tYnJpZGdlLW1sbS1icmlkZ2UtdHJhaW5pbmciLAogICAgICAiZGlnZXN0IjogewogICAgICAgICJzaGEyNTYiOiAiYzRhYjk1YTgyOWQ3ODhhODU3YmIwZTEwODhjNWU2Mzg4YmYwZmVhNDRhNDhkYzU5YjY3Mjk3ZjAyM2IzYThlMiIKICAgICAgfQogICAgfQogIF0sCiAgInByZWRpY2F0ZVR5cGUiOiAiaHR0cHM6Ly9tb2RlbF9zaWduaW5nL3NpZ25hdHVyZS92MS4wIiwKICAicHJlZGljYXRlIjogewogICAgInJlc291cmNlcyI6IFsKICAgICAgewogICAgICAgICJuYW1lIjogIkJFTkNITUFSSy5tZCIsCiAgICAgICAgImFsZ29yaXRobSI6ICJzaGEyNTYiLAogICAgICAgICJkaWdlc3QiOiAiYzNlZjM3ZmQ4ZTUwNmY4MzlmN2MzNDYyNjFiNzdkNGQ2OWVkOTMzNDViZGEwMDZlY2I2NTFmZGJjZGY5ZGM2YiIKICAgICAgfSwKICAgICAgewogICAgICAgICJuYW1lIjogIlNLSUxMLm1kIiwKICAgICAgICAiYWxnb3JpdGhtIjogInNoYTI1NiIsCiAgICAgICAgImRpZ2VzdCI6ICJlOTYwYzBmZGM2NjdhZmQ3MjBiZGRlNjc4NDhjMWEyNjllZmVkOGJhNjFmNDU1YWJhYzIxNTUyYzEyYzYyOGJmIgogICAgICB9LAogICAgICB7CiAgICAgICAgIm5hbWUiOiAiY2FyZC55YW1sIiwKICAgICAgICAiYWxnb3JpdGhtIjogInNoYTI1NiIsCiAg
ICAgICAgImRpZ2VzdCI6ICJiOTBkZmFhNDJkMjBlMTUyY2EyNmIwZTAzZDhiMWY2OWI1NmMzNmI4NGJiZmIwNTA2ZTJkOGNmODEyZjlmMTc3IgogICAgICB9LAogICAgICB7CiAgICAgICAgIm5hbWUiOiAiZXZhbHMvZXZhbHMuanNvbiIsCiAgICAgICAgImFsZ29yaXRobSI6ICJzaGEyNTYiLAogICAgICAgICJkaWdlc3QiOiAiMzc1MTdlNWYzZGM2NjgxOWY2MWY1YTdiYjhhY2UxOTIxMjgyNDE1ZjEwNTUxZDJkZWZhNWMzZWIwOTg1YjU3MCIKICAgICAgfSwKICAgICAgewogICAgICAgICJuYW1lIjogInNraWxsLWNhcmQubWQiLAogICAgICAgICJhbGdvcml0aG0iOiAic2hhMjU2IiwKICAgICAgICAiZGlnZXN0IjogIjdiMTcxZTYzNjU3YjQxMjAzZjE2Y2MzMmEzMzAyZTRhM2I1NTM3ODdiMzYxZmFjY2NmNDY1NDdiOWI4MmQwZDYiCiAgICAgIH0KICAgIF0sCiAgICAic2VyaWFsaXphdGlvbiI6IHsKICAgICAgImlnbm9yZV9wYXRocyI6IFsKICAgICAgICAiLmdpdCIsCiAgICAgICAgIi5naXRhdHRyaWJ1dGVzIiwKICAgICAgICAiLmdpdGlnbm9yZSIsCiAgICAgICAgIi5naXRodWIiCiAgICAgIF0sCiAgICAgICJtZXRob2QiOiAiZmlsZXMiLAogICAgICAiaGFzaF90eXBlIjogInNoYTI1NiIsCiAgICAgICJhbGxvd19zeW1saW5rcyI6IGZhbHNlCiAgICB9CiAgfQp9","payloadType":"application/vnd.in-toto+json","signatures":[{"sig":"MGQCMA7gUJ2V6ZcNr9PNBUqofFdHIZwc5xmN0krDs5jNSzJGofWf+NiFtSsIoGoQbQzTZAIwc5/ZwtKcGConypsbvTJvF6rvsla82I7UfdXfNQL0TRyK448LzNG+xfoAA0l0UYlS","keyid":""}]}} \ No newline at end of file 
+{"mediaType":"application/vnd.dev.sigstore.bundle.v0.3+json","verificationMaterial":{"x509CertificateChain":{"certificates":[{"rawBytes":"MIICgzCCAgmgAwIBAgIUKIyS7SxNteQIiWzK1dWj85E6520wCgYIKoZIzj0EAwMwVTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjEpMCcGA1UEAwwgTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBJQ0EgMDEwHhcNMjYwNDAxMDAwMDAwWhcNMjgwNDIyMTUzMzA5WjBUMQswCQYDVQQGEwJVUzEbMBkGA1UECgwSTlZJRElBIENvcnBvcmF0aW9uMSgwJgYDVQQDDB9OVklESUEgQWdlbnQgU2tpbGxzIFNpZ25pbmcgMDAxMHYwEAYHKoZIzj0CAQYFK4EEACIDYgAEYoRM9bQl/dGlwSRNi6bTpIJUXH8Nv9GciP6LSflJYYMLCc296kpyuTSsk5ddbAWiDcFX3C/ydX3jwc+qCLYP6uHy9XphyLjOQ27Yb2J6rBLVtRBS1mgGco/Gr7fL6ODco4GaMIGXMB0GA1UdDgQWBBRQ/5ZW3nJ6lmo9SVk7I15o7UGmpTAfBgNVHSMEGDAWgBRPGpILxMBBleJSsBGjrMKsby1CgjAMBgNVHRMBAf8EAjAAMA4GA1UdDwEB/wQEAwIHgDA3BggrBgEFBQcBAQQrMCkwJwYIKwYBBQUHMAGGG2h0dHA6Ly9vY3NwLm5kaXMubnZpZGlhLmNvbTAKBggqhkjOPQQDAwNoADBlAjAUygu/GiOCIXrgGr4SmLgeEVDcEitfFUv7ALbvLVGVyMysB3mxmO/uInZfXzWcJZsCMQDxuoxj4ZmO30jhkPIcCxGFCOvnUsnfU3TfGcouYm4M6iRpbKvtVnHPiy4bi6pcKf0="},{"rawBytes":"MIICiDCCAg6gAwIBAgIUZsIuSv9NkpJCNqtYEfCouVv5BzowCgYIKoZIzj0EAwMwUTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTAgFw0yNjA0MDEwMDAwMDBaGA85OTk5MTIzMTIzNTk1OVowVTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjEpMCcGA1UEAwwgTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBJQ0EgMDEwdjAQBgcqhkjOPQIBBgUrgQQAIgNiAASI72cR3ctKGg4VWnB3bNja6g1Z2PnOmFEopkPof+QeIcPk9rT+g9MjJnq51EQXL93a7C2GJ9J985G4o2V85VD7wJ1RaXhluHW2rf3y8bQGeAYaKMr5s/hUgn+M3/9WlWejgaAwgZ0wHQYDVR0OBBYEFE8akgvEwEGV4lKwEaOswqxvLUKCMB8GA1UdIwQYMBaAFItnoAjjfuCEUvzyvWyI2vOGvwPjMBIGA1UdEwEB/wQIMAYBAf8CAQAwDgYDVR0PAQH/BAQDAgEGMDcGCCsGAQUFBwEBBCswKTAnBggrBgEFBQcwAYYbaHR0cDovL29jc3AubmRpcy5udmlkaWEuY29tMAoGCCqGSM49BAMDA2gAMGUCMQCeIMMfAbyzPDacw2MxG+Yt1cikrJX/DVxiGfXuHmkkXn6VgSzE79+lkqDErpVO2gYCMCNEColOyvUvkzZGUEI1hQ3PfMgi3FIo9tHoBKMw4/wGBLFpu/0ubtmbBXM6/UMOEw=="},{"rawBytes":"MIICRTCCAcygAwIBAgIUeJdY3rV86EdvFmG7L8LJBsyQFYkwCgYIKoZIzj0EAwMwUTELMAkGA1UEBhMCVVMxGzAZBgNVB
AoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTAgFw0yNjA0MDEwMDAwMDBaGA85OTk5MTIzMTIzNTk1OVowUTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTB2MBAGByqGSM49AgEGBSuBBAAiA2IABAYpiXCDjJ9NT2eSDhyHJVSw1Tbze18cGG2F/578oWvHxg23eQAhNRYdq88i1iOshZSO6C29doKui5Xpmo/7Ctw9Sx4PP2RzOmIuOLCuTdNtKcTRwi4GEsd5BAFvWj42M6NjMGEwHQYDVR0OBBYEFItnoAjjfuCEUvzyvWyI2vOGvwPjMB8GA1UdIwQYMBaAFItnoAjjfuCEUvzyvWyI2vOGvwPjMA8GA1UdEwEB/wQFMAMBAf8wDgYDVR0PAQH/BAQDAgEGMAoGCCqGSM49BAMDA2cAMGQCMCwtAjWLaNwgGWNCgdyNoTyvNhqWRECRJV2r3+7w8g0PL6NHLOsbkgE09BH95h8XlgIwTaQmbbUh2ChAJ5TA1wRiVDnCcvbzHlZl2jM2FcwQQZlk19LOAbyGMRixbu2Ww/rj"}]},"tlogEntries":[]},"dsseEnvelope":{"payload":"ewogICJfdHlwZSI6ICJodHRwczovL2luLXRvdG8uaW8vU3RhdGVtZW50L3YxIiwKICAic3ViamVjdCI6IFsKICAgIHsKICAgICAgIm5hbWUiOiAibmVtby1tYnJpZGdlLW1sbS1icmlkZ2UtdHJhaW5pbmciLAogICAgICAiZGlnZXN0IjogewogICAgICAgICJzaGEyNTYiOiAiMjA3Njg0OGM0NTQ3YzU0MTg0YjI0MjE2ZTM1Y2NmODkxYWQ1MTYxYTEzZjVhNjU0YWU3MjQ2NmIyMTc4YWM1ZCIKICAgICAgfQogICAgfQogIF0sCiAgInByZWRpY2F0ZVR5cGUiOiAiaHR0cHM6Ly9tb2RlbF9zaWduaW5nL3NpZ25hdHVyZS92MS4wIiwKICAicHJlZGljYXRlIjogewogICAgInJlc291cmNlcyI6IFsKICAgICAgewogICAgICAgICJuYW1lIjogIkJFTkNITUFSSy5tZCIsCiAgICAgICAgImRpZ2VzdCI6ICJlMDFmNjVmZjk1MGM1ZTBhMTM0YzM1Yzg0ZmI0ODQ0YjkxOTBlNDhmMTMyZWNhMTVkYWZiZDViNzkxNTA5ZDQ1IiwKICAgICAgICAiYWxnb3JpdGhtIjogInNoYTI1NiIKICAgICAgfSwKICAgICAgewogICAgICAgICJuYW1lIjogIlNLSUxMLm1kIiwKICAgICAgICAiZGlnZXN0IjogIjJjYWVlMzk1NDA5YWNhNTZiNWMzZGJkZTkwZDE0MWFjMzc4YmFlMWE4ZTQ0NGUyM2Q3M2U2MWExMWRhODc5ZGQiLAogICAgICAgICJhbGdvcml0aG0iOiAic2hhMjU2IgogICAgICB9LAogICAgICB7CiAgICAgICAgIm5hbWUiOiAiY2FyZC55YW1sIiwKICAgICAgICAiZGlnZXN0IjogImI5MGRmYWE0MmQyMGUxNTJjYTI2YjBlMDNkOGIxZjY5YjU2YzM2Yjg0YmJmYjA1MDZlMmQ4Y2Y4MTJmOWYxNzciLAogICAgICAgICJhbGdvcml0aG0iOiAic2hhMjU2IgogICAgICB9LAogICAgICB7CiAgICAgICAgIm5hbWUiOiAiZXZhbHMvZXZhbHMuanNvbiIsCiAgICAgICAgImRpZ2VzdCI6ICIyM2QyNmQzMWM0ZGQ0M2Y0NjQ2OTUyZDRiZjk5NDhlMTdhNjAwYWFkZTczN2MyYzM1N2YwZDdiYzA3ZDM
5MDk4IiwKICAgICAgICAiYWxnb3JpdGhtIjogInNoYTI1NiIKICAgICAgfSwKICAgICAgewogICAgICAgICJuYW1lIjogInNraWxsLWNhcmQubWQiLAogICAgICAgICJkaWdlc3QiOiAiMWY0YWJiMGUxNjZiODlhMzg2ZDBhMDY3NDU1M2M3OWNmYWQyMjUxNzA3ODI4OWNlNGNkZDM2MDMwMDZmOTZkOSIsCiAgICAgICAgImFsZ29yaXRobSI6ICJzaGEyNTYiCiAgICAgIH0KICAgIF0sCiAgICAic2VyaWFsaXphdGlvbiI6IHsKICAgICAgImlnbm9yZV9wYXRocyI6IFsKICAgICAgICAiLmdpdGF0dHJpYnV0ZXMiLAogICAgICAgICIuZ2l0aHViIiwKICAgICAgICAiLmdpdCIsCiAgICAgICAgIi5naXRpZ25vcmUiCiAgICAgIF0sCiAgICAgICJhbGxvd19zeW1saW5rcyI6IGZhbHNlLAogICAgICAiaGFzaF90eXBlIjogInNoYTI1NiIsCiAgICAgICJtZXRob2QiOiAiZmlsZXMiCiAgICB9CiAgfQp9","payloadType":"application/vnd.in-toto+json","signatures":[{"sig":"MGQCMG3Jjn1qc0DtljajZCYqos3Hxo6d/dh6moV6VqpaH6jldahsi0Li7SGbF6w26zt3SwIwP9xszKA0NfyAI9ibT1gzuVegSJ6Z8vTvxV3LxvDU9lpuHryNb3QPn28ikUP1hGRO","keyid":""}]}} \ No newline at end of file diff --git a/skills/nemo-mbridge-multi-node-slurm/BENCHMARK.md b/skills/nemo-mbridge-multi-node-slurm/BENCHMARK.md index a709d129..b13d6908 100644 --- a/skills/nemo-mbridge-multi-node-slurm/BENCHMARK.md +++ b/skills/nemo-mbridge-multi-node-slurm/BENCHMARK.md @@ -7,14 +7,18 @@ This benchmark summarizes 3-Tier Evaluation from NVSkills-Eval results for the s ## Evaluation Summary - Skill: `nemo-mbridge-multi-node-slurm` -- Evaluation date: 2026-05-28 +- Evaluation date: 2026-06-02 - NVSkills-Eval profile: `external` -- Overall verdict: FAIL -- Tier 3 live agent evaluation: not available in this report +- Environment: `local` +- Dataset: 1 evaluation tasks +- Attempts per task: 2 +- Pass threshold: 50% +- Overall verdict: PASS ## Agents Used -- Tier 3 agent details were not available in this report. +- `claude-code` +- `codex` ## Metrics Used @@ -28,19 +32,39 @@ Reported benchmark dimensions: Underlying evaluation signals used in this run: -- No Tier 3 evaluation signal details were available in this report. +- `security` (Security): checks for unsafe operations, secret leakage, and unauthorized access. 
+- `skill_execution` (Skill Execution): verifies that the agent loaded the expected skill and workflow.
+- `skill_efficiency` (Efficiency): checks routing quality, decoy avoidance, and redundant tool usage.
+- `accuracy` (Accuracy): grades final-answer correctness against the reference answer.
+- `goal_accuracy` (Goal Accuracy): checks whether the overall user task completed successfully.
+- `behavior_check` (Behavior Check): verifies expected behavior steps, including safety expectations.
+- `token_efficiency` (Token Efficiency): compares token usage with and without the skill.

 ## Test Tasks

-Tier 3 evaluation task details were not available in this report.
+The benchmark dataset contained 1 evaluation tasks:
+
+- Positive tasks: 1 tasks where the skill was expected to activate.
+- Negative tasks: 0 tasks where no skill was expected.
+- Unlabeled tasks: 0 tasks where positive/negative intent could not be inferred.
+
+Task composition is derived from the evaluation dataset when possible. Entries with `expected_skill` set are treated as positive skill-activation cases, while entries with `expected_skill: null` are treated as negative activation cases.

 ## Results

-Tier 3 dimension rollup was not available in this report.
+| Dimension | Num | `claude-code` | `codex` |
+|---|---:|---:|---:|
+| Security | 2 | 100% (+0%) | 100% (+0%) |
+| Correctness | 2 | 100% (+0%) | 88% (+5%) |
+| Discoverability | 2 | 100% (+0%) | 62% (+0%) |
+| Effectiveness | 2 | 97% (+3%) | 95% (+8%) |
+| Efficiency | 2 | 92% (-0%) | 60% (+1%) |
+
+Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available.

 ## Tier 1: Static Validation Summary

-Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 15 total findings.
+Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 11 total findings.

 Top findings:

@@ -52,21 +76,13 @@ Top findings:

 ## Tier 2: Deduplication Summary

-Tier 2 validation reported findings. NVSkills-Eval ran 2 checks and found 3 total findings.
+Tier 2 validation passed. NVSkills-Eval ran 2 checks and found 0 total findings.

-Top findings:
+Notable observations:

-- HIGH DUPLICATE/duplicate: Duplicate content found across SKILL.md and references/templates.md:
-  "### Container" in SKILL.md (lines 16-22)
-  vs "# ── Container ────────────────────────────────────────────────────────────" in references/templates.md (lines 26-29) (`SKILL.md:16`)
-- HIGH DUPLICATE/duplicate: Duplicate content found across SKILL.md and references/templates.md:
-  "# Phase 1: Single-process uv sync to build/populate the shared cache" in SKILL.md (lines 78-84)
-  vs "# Phase 1: Single-process uv sync to build/populate the shared cache" in SKILL.md (lines 171-177)
-  vs "# Phase 1: Single-process uv sync to build/populate the shared cache" in references/templates.md (lines 75-81) (`SKILL.md:78`)
-- HIGH DUPLICATE/duplicate: Duplicate content found across SKILL.md and references/templates.md:
-  "### Tokens / Caches" in SKILL.md (lines 30-44)
-  vs "# ── Tokens / Caches ──────────────────────────────────────────────────────" in references/templates.md (lines 44-50) (`SKILL.md:30`)
+- Context Deduplication: Collected 2 file(s)
+- Inter-Skill Deduplication: Parsed skill 'nemo-mbridge-multi-node-slurm': 243 char description

 ## Publication Recommendation

-The skill should be reviewed before NVSkills-Eval publication. Skill owners should address the findings above and rerun NVSkills-Eval to refresh this benchmark.
+The skill is suitable to proceed toward NVSkills-Eval publication based on this benchmark. Skill owners should keep this file with the skill and refresh it when the evaluation dataset, skill behavior, or target agents materially change.
diff --git a/skills/nemo-mbridge-multi-node-slurm/SKILL.md b/skills/nemo-mbridge-multi-node-slurm/SKILL.md
index 73d8c661..814c677e 100644
--- a/skills/nemo-mbridge-multi-node-slurm/SKILL.md
+++ b/skills/nemo-mbridge-multi-node-slurm/SKILL.md
@@ -9,6 +9,24 @@ when_to_use: Writing or converting Slurm sbatch scripts, scaling to multiple nod

 Convert single-node `uv run python -m torch.distributed.run` commands into multi-node Slurm sbatch scripts with Enroot container support, and debug common multi-node failures.

+## First Answer Checklist
+
+When converting or debugging Bridge multi-node jobs, answer in this order:
+
+1. Prefer the **srun-native** launch shape for Bridge scripts that reach
+   `initialize.py`: `#SBATCH --ntasks-per-node=8` and a direct `srun ... uv run
+   python