feat: make claude code agent runtime capabilities configurable by cwing-nvidia · Pull Request #1603 · NVIDIA-NeMo/Gym

cwing-nvidia · 2026-06-16T03:19:37Z

Summary

Replaces the hardcoded --bare flag in the Claude Code agent with config knobs, so skills / MCP servers / custom settings can be enabled via config instead of editing app.py. Defaults preserve today's isolated, reproducible behavior.

- Adds bare (default true), mcp_config, and settings to ClaudeCodeAgentConfig.
- Extracts testable _build_command, _build_settings, and _setup_config_dir helpers from _run_claude_code. --bare is only passed when bare is true; --mcp-config is explicit and applies regardless. The per-run CLAUDE_CONFIG_DIR is the staging seam reused by skills evaluation (feat: agent skill evaluation infrastructure #1256).
- Unit tests for the new helpers and _run_claude_code wiring/timeout (subprocess mocked); README + config template document the knobs and the reproducibility trade-off.

A note on API shape: Claude Code's --bare is all-or-nothing for auto-discovery (skills/hooks/plugins/MCP/memory/CLAUDE.md), while --mcp-config and settings are explicit. So the minimal honest API is three knobs (bare, mcp_config, settings) rather than a flag per feature — bare: false enables auto-discovery as a group.
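To make the three-knob shape concrete, here is a minimal Python sketch of how a command builder could map the knobs onto CLI flags. The names `AgentKnobs` and `build_command_sketch` are hypothetical, and the argument order is an assumption; only the `claude` binary and the flags `-p`, `--bare`, `--mcp-config`, and `--dangerously-skip-permissions` come from this PR's description and discussion. It is not the PR's actual `_build_command`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AgentKnobs:
    """Hypothetical stand-in for the three knobs added to ClaudeCodeAgentConfig."""

    bare: bool = True                 # default keeps auto-discovery off
    mcp_config: Optional[str] = None  # path to an MCP config file, if any
    settings: Optional[str] = None    # staged via the config dir, so not a flag here


def build_command_sketch(knobs: AgentKnobs, prompt: str) -> List[str]:
    """Assemble an argv list; the flag order is an assumption of this sketch."""
    cmd = ["claude", "-p", prompt, "--dangerously-skip-permissions"]
    if knobs.bare:
        # --bare is passed only when the knob is true.
        cmd.append("--bare")
    if knobs.mcp_config is not None:
        # --mcp-config is explicit, so it is added regardless of `bare`.
        cmd.extend(["--mcp-config", knobs.mcp_config])
    return cmd
```

With the defaults, the sketch emits `--bare` and no `--mcp-config`; setting `bare=False` drops only that one flag, which mirrors the "auto-discovery as a group" behavior described above.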

Implements #1602. Unblocks the agent-half of #1256.

Test plan

Unit tests pass (27 passed, ruff clean). Unit coverage of app.py rose from 62% to 79%; all added/changed code is covered. The remaining gaps (responses/run/_resolve_base_url) are pre-existing methods covered by ng_test's live-server run.

Unit tests mock asyncio.create_subprocess_exec, so the real claude binary is not exercised. Smoke test before marking ready (pending API key):

- [x] Default (bare: true): ng_run the reasoning_gym claude_code_agent config, ng_collect_rollouts one task; confirm a non-error row with reward, populated response.output, and usage. Confirms the refactor works against the real CLI.
- [x] mcp_config: set it to a minimal MCP server config; run a task needing the tool; confirm a tool call to the MCP tool in the output (and that it coexists with default bare: true).
- [ ] bare: false: confirm a task still completes with --bare dropped. (Not run live, since it only omits a flag; covered by the unit test test_bare_false_omits_flag.)
- [x] settings: set a settings JSON with a unique env var and have the model echo it via Bash; the value appears in the answer, proving the merged settings.json is read from the staged config dir.
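The unit-test approach mentioned above (mocking `asyncio.create_subprocess_exec` so the real `claude` binary never runs) can be sketched as follows. `run_cli_sketch` and `run_with_fake_process` are hypothetical helpers written for this illustration, not functions from `app.py`; the sketch only shows how the spawn arguments and a timeout path can be asserted without a live CLI.

```python
import asyncio
from typing import List, Optional
from unittest.mock import AsyncMock, MagicMock, patch


async def run_cli_sketch(cmd: List[str], timeout: float) -> Optional[str]:
    """Spawn a command and return its stdout, or None if it times out."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()  # stop the child so it does not outlive the run
        return None
    return stdout.decode()


def run_with_fake_process(cmd, communicate, timeout=5.0):
    """Run run_cli_sketch with the subprocess spawn replaced by a mock."""
    fake_proc = MagicMock()
    fake_proc.communicate = communicate
    spawn = AsyncMock(return_value=fake_proc)
    with patch("asyncio.create_subprocess_exec", new=spawn):
        output = asyncio.run(run_cli_sketch(cmd, timeout))
    return output, spawn, fake_proc


async def never_finishes():
    """Stand-in for a hung CLI process."""
    await asyncio.sleep(60)
```

A test can then inspect `spawn.call_args.args` to confirm that `--bare` is absent when the knob is off, and pass `never_finishes` with a small timeout to exercise the timeout branch.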

Smoke test details (reproducible)

Run against an endpoint serving the Anthropic Messages API. env.yaml:

anthropic_api_key: <key>
anthropic_model_name: aws/anthropic/bedrock-claude-sonnet-4-6
anthropic_base_url: https://<host>   # no /v1 suffix; the CLI appends /v1/messages
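Because the CLI appends /v1/messages itself, a base URL that already ends in /v1 would produce a doubled path segment. A small guard like the one below could catch that before a run. Both function names are hypothetical, and this is not the PR's `_resolve_base_url`, whose behavior is not shown in this description.

```python
def normalize_base_url_sketch(url: str) -> str:
    """Strip a trailing slash and a trailing /v1 so the CLI can append its own path."""
    trimmed = url.rstrip("/")
    if trimmed.endswith("/v1"):
        trimmed = trimmed[: -len("/v1")]
    return trimmed


def messages_endpoint_sketch(url: str) -> str:
    """Show the endpoint the CLI would end up calling for a given base URL."""
    return normalize_base_url_sketch(url) + "/v1/messages"
```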

1. Default (bare: true) — confirms the refactor works against the real CLI

ng_run "+config_paths=[resources_servers/reasoning_gym/configs/reasoning_gym_claude_code_agent.yaml]"

ng_collect_rollouts \
  +agent_name=reasoning_gym_claude_code_agent \
  +input_jsonl_fpath=resources_servers/reasoning_gym/data/example.jsonl \
  +output_jsonl_fpath=/tmp/cc_default.jsonl +limit=1

Result: rollout completed end-to-end (2 turns, tokens reported, full verify pipeline ran). ✅
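The acceptance criteria for this step (a row with reward, a populated response.output, and usage) can be checked mechanically. The sketch below assumes each output line is a JSON object with a top-level `reward` and a `response` object holding `output` and `usage`; that layout is an assumption based on the field names in the test plan, not a documented schema, so the usage check also accepts a top-level `usage`.

```python
import json
from typing import Any, Dict, Iterable, List


def row_problems(row: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems for one rollout row."""
    problems = []
    if "reward" not in row:
        problems.append("missing reward")
    response = row.get("response") or {}
    if not response.get("output"):
        problems.append("empty response.output")
    # The location of `usage` is an assumption; accept either spot.
    if not (response.get("usage") or row.get("usage")):
        problems.append("missing usage")
    return problems


def check_rollout_lines(lines: Iterable[str]) -> List[List[str]]:
    """Parse JSONL lines and report problems per non-empty row."""
    return [row_problems(json.loads(line)) for line in lines if line.strip()]
```

For example, `check_rollout_lines(open("/tmp/cc_default.jsonl"))` returns one list per row, and an empty list means the row met all three criteria.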

2. mcp_config with default bare: true — confirms --mcp-config is honored even with --bare

MCP config (/tmp/mcp.json) — the official reference "everything" server, which exposes an add tool:

{
  "mcpServers": {
    "everything": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-everything"]
    }
  }
}
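The same config can be generated from Python, which also makes it easy to derive the tool name to look for in the output. The `mcp__<server>__<tool>` naming follows the `mcp__everything__add` example used in the input task below; the helper names are hypothetical.

```python
import json
from typing import Any, Dict

# Mirrors the /tmp/mcp.json shown above.
MCP_CONFIG: Dict[str, Any] = {
    "mcpServers": {
        "everything": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-everything"],
        }
    }
}


def dump_mcp_config(config: Dict[str, Any]) -> str:
    """Serialize the config after a basic shape check on `mcpServers`."""
    servers = config.get("mcpServers")
    if not isinstance(servers, dict) or not servers:
        raise ValueError("expected a non-empty 'mcpServers' mapping")
    return json.dumps(config, indent=2)


def expected_tool_name(server: str, tool: str) -> str:
    """Name under which a server's tool is exposed, per the example in this PR."""
    return f"mcp__{server}__{tool}"
```

Writing the file is then a single call such as `Path("/tmp/mcp.json").write_text(dump_mcp_config(MCP_CONFIG))`.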

Input task (/tmp/cc_mcp_input.jsonl, one row) instructs the model to use the MCP add tool (not Bash) and box the result; metadata.source_dataset=basic_arithmetic so the reasoning_gym verifier scores it:

{"responses_create_params": {"input": [{"role": "user", "content": "You have an MCP tool named 'add' (exposed as mcp__everything__add). You MUST use it to compute 12345 + 67890 — do not calculate it yourself or use Bash. Put only the final number inside <answer></answer> tags."}]}, "question": "Use the MCP add tool to compute 12345 + 67890.", "answer": "80235", "metadata": {"source_dataset": "basic_arithmetic", "source_index": 0, "expression": "12345 + 67890", "num_terms": 2, "num_digits": 5, "difficulty": {"num_terms": [2, 6], "num_digits": [1, 5]}}, "agent_ref": {"type": "responses_api_agents", "name": "reasoning_gym_claude_code_agent"}}

Run with mcp_config set via a Hydra override (note bare stays at its default true):

ng_run "+config_paths=[resources_servers/reasoning_gym/configs/reasoning_gym_claude_code_agent.yaml]" \
  +reasoning_gym_claude_code_agent.responses_api_agents.claude_code_agent.mcp_config=/tmp/mcp.json

ng_collect_rollouts \
  +agent_name=reasoning_gym_claude_code_agent \
  +input_jsonl_fpath=/tmp/cc_mcp_input.jsonl \
  +output_jsonl_fpath=/tmp/cc_mcp.jsonl +limit=1

grep -o 'mcp__[A-Za-z0-9_-]*' /tmp/cc_mcp.jsonl | sort -u

Result: the model invoked mcp__everything__add, reward=1.0, extracted_answer=80235, 3 turns. The MCP server loaded and its tool was used with --bare still active, validating the design choice to treat explicit --mcp-config as independent of the auto-discovery (bare) knob. ✅
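For use inside a test or notebook, the `grep -o ... | sort -u` check above has a direct Python equivalent. The function name is hypothetical; the pattern is the one from the grep command.

```python
import re
from typing import List

# Same pattern as the grep command: "mcp__" followed by name characters.
MCP_TOOL_PATTERN = re.compile(r"mcp__[A-Za-z0-9_-]*")


def mcp_tool_names(text: str) -> List[str]:
    """Return the distinct MCP tool names found in the text, sorted."""
    return sorted(set(MCP_TOOL_PATTERN.findall(text)))
```

Calling `mcp_tool_names(Path("/tmp/cc_mcp.jsonl").read_text())` would list every MCP tool referenced in the rollout; note that Python's `sorted` orders by code point, which can differ from a locale-aware `sort`.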

3. settings — confirms a user settings JSON is merged into the staged CLAUDE_CONFIG_DIR and its env reaches tools

An env-var signal is used instead of a permissions.deny rule: the agent always passes --dangerously-skip-permissions, which would bypass deny rules and give a misleading result.

Settings file (/tmp/cc_settings.json):

{
  "env": {
    "NG_SMOKE_TOKEN": "ng-smoke-7f3a9"
  }
}

Input task (/tmp/cc_settings_input.jsonl, one row) tells the model to echo "$NG_SMOKE_TOKEN" via Bash and box the result; metadata.source_dataset=basic_arithmetic keeps the verifier happy:

{"responses_create_params": {"input": [{"role": "user", "content": "Use the Bash tool to run exactly this command: echo \"$NG_SMOKE_TOKEN\"  Then put the exact printed value inside <answer></answer> tags. Do not guess the value; obtain it by actually running the command."}]}, "question": "Echo the NG_SMOKE_TOKEN environment variable via Bash.", "answer": "ng-smoke-7f3a9", "metadata": {"source_dataset": "basic_arithmetic", "source_index": 0, "expression": "0 + 0", "num_terms": 2, "num_digits": 1, "difficulty": {"num_terms": [2, 6], "num_digits": [1, 1]}}, "agent_ref": {"type": "responses_api_agents", "name": "reasoning_gym_claude_code_agent"}}

Run with settings set via a Hydra override:

ng_run "+config_paths=[resources_servers/reasoning_gym/configs/reasoning_gym_claude_code_agent.yaml]" \
  +reasoning_gym_claude_code_agent.responses_api_agents.claude_code_agent.settings=/tmp/cc_settings.json

ng_collect_rollouts \
  +agent_name=reasoning_gym_claude_code_agent \
  +input_jsonl_fpath=/tmp/cc_settings_input.jsonl \
  +output_jsonl_fpath=/tmp/cc_settings.jsonl +limit=1

grep -o 'ng-smoke-7f3a9' /tmp/cc_settings.jsonl | head -1

Result: the answer contained ng-smoke-7f3a9, proving the merged settings.json was read and its env reached the Bash tool environment. ✅
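The staging behavior exercised by this step can be sketched as a merge followed by a write into a fresh per-run directory. All three function names are hypothetical, and the recursive merge is an assumption: this description does not say whether the real `_build_settings` merges shallowly or deeply. Only `CLAUDE_CONFIG_DIR` and the `settings.json` file name come from the PR text.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


def merge_settings_sketch(base: Dict[str, Any], user: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively merge user settings over base settings (assumed semantics)."""
    merged = dict(base)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings_sketch(merged[key], value)
        else:
            merged[key] = value
    return merged


def stage_config_dir_sketch(
    base_settings: Dict[str, Any], user_settings_path: Optional[str] = None
) -> Path:
    """Create a fresh config dir and write the merged settings.json into it."""
    config_dir = Path(tempfile.mkdtemp(prefix="claude_cfg_"))
    user: Dict[str, Any] = {}
    if user_settings_path is not None:
        user = json.loads(Path(user_settings_path).read_text())
    merged = merge_settings_sketch(base_settings, user)
    (config_dir / "settings.json").write_text(json.dumps(merged, indent=2))
    return config_dir


def subprocess_env_sketch(config_dir: Path) -> Dict[str, str]:
    """Environment for the CLI subprocess, pointing it at the staged dir."""
    return {**os.environ, "CLAUDE_CONFIG_DIR": str(config_dir)}
```

Under these assumptions, staging `/tmp/cc_settings.json` yields a `settings.json` whose `env` block contains `NG_SMOKE_TOKEN`, which is the value the smoke test expects the Bash tool to echo.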

Replace the hardcoded `--bare` flag with config knobs so skills, MCP servers, and custom settings can be enabled via config instead of editing app.py. Defaults (bare=true, no mcp_config/settings) preserve the current isolated, reproducible behavior.

- Add `bare`, `mcp_config`, and `settings` to ClaudeCodeAgentConfig.
- Extract testable `_build_command`, `_build_settings`, and `_setup_config_dir` helpers from `_run_claude_code`. `--bare` is only passed when `bare` is true; `--mcp-config` is explicit and applies regardless. The per-run CLAUDE_CONFIG_DIR is the staging seam reused by skills evaluation (#1256).
- Cover the new helpers and `_run_claude_code` wiring/timeout paths with unit tests (subprocess mocked); document the knobs and the reproducibility trade-off in the README and config template.

Relates to #1602.

Signed-off-by: Chris Wing <cwing@nvidia.com>

copy-pr-bot · 2026-06-16T03:19:41Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

cmunley1 · 2026-06-16T08:31:20Z

looks good to me

cmunley1 · 2026-06-16T08:35:26Z

btw, anthropic docs say

--bare is the recommended mode for scripted and SDK calls, and will become the default for -p in a future release.
https://code.claude.com/docs/en/headless#start-faster-with-bare-mode

maybe worth mentioning their recommendation in the README.

also, i think there is not much for the agent to auto-discover, since it gets a fresh tmpdir per run, but i'm not fully sure

Note that bare is the recommended mode for scripted/SDK calls (per Claude docs), reframe the Runtime capabilities section to reference the config knobs above, standardize the auto-discovery list, and tidy wording.

Signed-off-by: Chris Wing <cwing@nvidia.com>

Signed-off-by: Chris Wing <cwing@nvidia.com>
Co-authored-by: Christian Munley <cmunley@nvidia.com>
Signed-off-by: Rita Fernandes Neves <rfernandesne@nvidia.com>

cwing-nvidia requested review from Glorf and cmunley1 June 16, 2026 03:21

cwing-nvidia marked this pull request as ready for review June 16, 2026 04:07

cwing-nvidia changed the title ~~feat(claude-code-agent): make runtime capabilities configurable~~ feat: make claude code agent runtime capabilities configurable Jun 16, 2026

copy-pr-bot Bot temporarily deployed to public June 16, 2026 04:08 Inactive

copy-pr-bot Bot temporarily deployed to public June 16, 2026 04:09 Inactive

copy-pr-bot Bot temporarily deployed to public June 16, 2026 04:10 Inactive

cwing-nvidia linked an issue Jun 16, 2026 that may be closed by this pull request

feat: make Claude Code agent runtime configurable (replace hardcoded --bare) #1602

Closed

cwing-nvidia added this to NeMo Gym 0.4.0 - July 1 Jun 16, 2026

github-project-automation Bot moved this to Dev Todo in NeMo Gym 0.4.0 - July 1 Jun 16, 2026

cwing-nvidia added the agents label Jun 16, 2026

cmunley1 reviewed Jun 16, 2026


Comment thread responses_api_agents/claude_code_agent/README.md Outdated

cmunley1 reviewed Jun 16, 2026


Comment thread responses_api_agents/claude_code_agent/README.md

copy-pr-bot Bot temporarily deployed to public June 22, 2026 19:09 Inactive

copy-pr-bot Bot temporarily deployed to public June 22, 2026 19:10 Inactive

cwing-nvidia requested a review from cmunley1 June 22, 2026 19:29

cmunley1 approved these changes Jun 22, 2026


Merge branch 'main' into cwing/claude-code-configurable-runtime

3fcc287

copy-pr-bot Bot temporarily deployed to public June 22, 2026 21:35 Inactive

copy-pr-bot Bot temporarily deployed to public June 22, 2026 21:36 Inactive

cwing-nvidia merged commit 3dec25f into main Jun 22, 2026
24 checks passed

cwing-nvidia deleted the cwing/claude-code-configurable-runtime branch June 22, 2026 22:44

github-project-automation Bot moved this from Dev Todo to Done in NeMo Gym 0.4.0 - July 1 Jun 22, 2026

cwing-nvidia mentioned this pull request Jun 23, 2026

Built-in harness integration: Claude Code #1387

Closed


cwing-nvidia merged 3 commits into main from cwing/claude-code-configurable-runtime
