Skip to content

feat: make claude code agent runtime capabilities configurable#1603

Merged
cwing-nvidia merged 3 commits into
mainfrom
cwing/claude-code-configurable-runtime
Jun 22, 2026
Merged

feat: make claude code agent runtime capabilities configurable#1603
cwing-nvidia merged 3 commits into
mainfrom
cwing/claude-code-configurable-runtime

Conversation

@cwing-nvidia

@cwing-nvidia cwing-nvidia commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

Replaces the hardcoded --bare flag in the Claude Code agent with config knobs, so skills / MCP servers / custom settings can be enabled via config instead of editing app.py. Defaults preserve today's isolated, reproducible behavior.

  • Adds bare (default true), mcp_config, and settings to ClaudeCodeAgentConfig.
  • Extracts testable _build_command, _build_settings, and _setup_config_dir helpers from _run_claude_code. --bare is only passed when bare is true; --mcp-config is explicit and applies regardless. The per-run CLAUDE_CONFIG_DIR is the staging seam reused by skills evaluation (feat: agent skill evaluation infrastructure #1256).
  • Unit tests for the new helpers and _run_claude_code wiring/timeout (subprocess mocked); README + config template document the knobs and the reproducibility trade-off.

A note on API shape: Claude Code's --bare is all-or-nothing for auto-discovery (skills/hooks/plugins/MCP/memory/CLAUDE.md), while --mcp-config and settings are explicit. So the minimal honest API is three knobs (bare, mcp_config, settings) rather than a flag per feature — bare: false enables auto-discovery as a group.

Implements #1602. Unblocks the agent-half of #1256.

Test plan

Unit tests pass (27 passed, ruff clean). Unit coverage of app.py raised 62% → 79%; all added/changed code is covered. The remaining gaps (responses/run/_resolve_base_url) are pre-existing methods covered by ng_test's live-server run.

Unit tests mock asyncio.create_subprocess_exec, so the real claude binary is not exercised. Smoke test before marking ready (pending API key):

  • Default (bare: true): ng_run the reasoning_gym claude_code_agent config, ng_collect_rollouts one task; confirm a non-error row with reward, populated response.output, and usage. Confirms the refactor works against the real CLI.
  • mcp_config: set it to a minimal MCP server config; run a task needing the tool; confirm a tool call to the MCP tool in the output (and that it coexists with default bare: true).
  • bare: false: confirm a task still completes with --bare dropped. (Not run live — only omits a flag; covered by unit tests test_bare_false_omits_flag.)
  • settings: set a settings JSON with a unique env var and have the model echo it via Bash; the value appears in the answer, proving the merged settings.json is read from the staged config dir.

Smoke test details (reproducible)

Run against an endpoint serving the Anthropic Messages API. env.yaml:

anthropic_api_key: <key>
anthropic_model_name: aws/anthropic/bedrock-claude-sonnet-4-6
anthropic_base_url: https://<host>   # no /v1 suffix; the CLI appends /v1/messages

1. Default (bare: true) — confirms the refactor works against the real CLI

ng_run "+config_paths=[resources_servers/reasoning_gym/configs/reasoning_gym_claude_code_agent.yaml]"

ng_collect_rollouts \
  +agent_name=reasoning_gym_claude_code_agent \
  +input_jsonl_fpath=resources_servers/reasoning_gym/data/example.jsonl \
  +output_jsonl_fpath=/tmp/cc_default.jsonl +limit=1

Result: rollout completed end-to-end (2 turns, tokens reported, full verify pipeline ran). ✅

2. mcp_config with default bare: true — confirms --mcp-config is honored even with --bare

MCP config (/tmp/mcp.json) — the official reference "everything" server, which exposes an add tool:

{
  "mcpServers": {
    "everything": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-everything"]
    }
  }
}

Input task (/tmp/cc_mcp_input.jsonl, one row) instructs the model to use the MCP add tool (not Bash) and box the result; metadata.source_dataset=basic_arithmetic so the reasoning_gym verifier scores it:

{"responses_create_params": {"input": [{"role": "user", "content": "You have an MCP tool named 'add' (exposed as mcp__everything__add). You MUST use it to compute 12345 + 67890 — do not calculate it yourself or use Bash. Put only the final number inside <answer></answer> tags."}]}, "question": "Use the MCP add tool to compute 12345 + 67890.", "answer": "80235", "metadata": {"source_dataset": "basic_arithmetic", "source_index": 0, "expression": "12345 + 67890", "num_terms": 2, "num_digits": 5, "difficulty": {"num_terms": [2, 6], "num_digits": [1, 5]}}, "agent_ref": {"type": "responses_api_agents", "name": "reasoning_gym_claude_code_agent"}}

Run with mcp_config set via a Hydra override (note bare stays at its default true):

ng_run "+config_paths=[resources_servers/reasoning_gym/configs/reasoning_gym_claude_code_agent.yaml]" \
  +reasoning_gym_claude_code_agent.responses_api_agents.claude_code_agent.mcp_config=/tmp/mcp.json

ng_collect_rollouts \
  +agent_name=reasoning_gym_claude_code_agent \
  +input_jsonl_fpath=/tmp/cc_mcp_input.jsonl \
  +output_jsonl_fpath=/tmp/cc_mcp.jsonl +limit=1

grep -o 'mcp__[A-Za-z0-9_-]*' /tmp/cc_mcp.jsonl | sort -u

Result: the model invoked mcp__everything__add, reward=1.0, extracted_answer=80235, 3 turns. The MCP server loaded and its tool was used with --bare still active, validating the design choice to treat explicit --mcp-config as independent of the auto-discovery (bare) knob. ✅

3. settings — confirms a user settings JSON is merged into the staged CLAUDE_CONFIG_DIR and its env reaches tools

An env-var signal is used instead of a permissions.deny rule: the agent always passes --dangerously-skip-permissions, which would bypass deny rules and give a misleading result.

Settings file (/tmp/cc_settings.json):

{
  "env": {
    "NG_SMOKE_TOKEN": "ng-smoke-7f3a9"
  }
}

Input task (/tmp/cc_settings_input.jsonl, one row) tells the model to echo "$NG_SMOKE_TOKEN" via Bash and box the result; metadata.source_dataset=basic_arithmetic keeps the verifier happy:

{"responses_create_params": {"input": [{"role": "user", "content": "Use the Bash tool to run exactly this command: echo \"$NG_SMOKE_TOKEN\"  Then put the exact printed value inside <answer></answer> tags. Do not guess the value; obtain it by actually running the command."}]}, "question": "Echo the NG_SMOKE_TOKEN environment variable via Bash.", "answer": "ng-smoke-7f3a9", "metadata": {"source_dataset": "basic_arithmetic", "source_index": 0, "expression": "0 + 0", "num_terms": 2, "num_digits": 1, "difficulty": {"num_terms": [2, 6], "num_digits": [1, 1]}}, "agent_ref": {"type": "responses_api_agents", "name": "reasoning_gym_claude_code_agent"}}

Run with settings set via a Hydra override:

ng_run "+config_paths=[resources_servers/reasoning_gym/configs/reasoning_gym_claude_code_agent.yaml]" \
  +reasoning_gym_claude_code_agent.responses_api_agents.claude_code_agent.settings=/tmp/cc_settings.json

ng_collect_rollouts \
  +agent_name=reasoning_gym_claude_code_agent \
  +input_jsonl_fpath=/tmp/cc_settings_input.jsonl \
  +output_jsonl_fpath=/tmp/cc_settings.jsonl +limit=1

grep -o 'ng-smoke-7f3a9' /tmp/cc_settings.jsonl | head -1

Result: the answer contained ng-smoke-7f3a9, proving the merged settings.json was read and its env reached the Bash tool environment. ✅

Replace the hardcoded `--bare` flag with config knobs so skills, MCP
servers, and custom settings can be enabled via config instead of editing
app.py. Defaults (bare=true, no mcp_config/settings) preserve the current
isolated, reproducible behavior.

- Add `bare`, `mcp_config`, and `settings` to ClaudeCodeAgentConfig.
- Extract testable `_build_command`, `_build_settings`, and
  `_setup_config_dir` helpers from `_run_claude_code`. `--bare` is only
  passed when `bare` is true; `--mcp-config` is explicit and applies
  regardless. The per-run CLAUDE_CONFIG_DIR is the staging seam reused by
  skills evaluation (#1256).
- Cover the new helpers and `_run_claude_code` wiring/timeout paths with
  unit tests (subprocess mocked); document the knobs and the
  reproducibility trade-off in the README and config template.

Relates to #1602.

Signed-off-by: Chris Wing <cwing@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 16, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cwing-nvidia cwing-nvidia requested review from Glorf and cmunley1 June 16, 2026 03:21
@cwing-nvidia cwing-nvidia marked this pull request as ready for review June 16, 2026 04:07
@cwing-nvidia cwing-nvidia changed the title feat(claude-code-agent): make runtime capabilities configurable feat: make claude code agent runtime capabilities configurable Jun 16, 2026
Comment thread responses_api_agents/claude_code_agent/README.md Outdated
Comment thread responses_api_agents/claude_code_agent/README.md
@cmunley1

Copy link
Copy Markdown
Contributor

looks good to me

@cmunley1

cmunley1 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

btw, anthropic docs say

--bare is the recommended mode for scripted and SDK calls, and will become the default for -p in a future release.
https://code.claude.com/docs/en/headless#start-faster-with-bare-mode

maybe worth mentioning their recommendation in readme.

also, i think there is not much for agent to autodiscover as it gets a fresh tmpdir per run but not fully sure

Note that bare is the recommended mode for scripted/SDK calls (per Claude
docs), reframe the Runtime capabilities section to reference the config
knobs above, standardize the auto-discovery list, and tidy wording.

Signed-off-by: Chris Wing <cwing@nvidia.com>
@cwing-nvidia cwing-nvidia merged commit 3dec25f into main Jun 22, 2026
24 checks passed
@cwing-nvidia cwing-nvidia deleted the cwing/claude-code-configurable-runtime branch June 22, 2026 22:44
@github-project-automation github-project-automation Bot moved this from Dev Todo to Done in NeMo Gym 0.4.0 - July 1 Jun 22, 2026
ritaneves pushed a commit that referenced this pull request Jun 25, 2026
## Summary

Replaces the hardcoded `--bare` flag in the Claude Code agent with
config knobs, so skills / MCP servers / custom settings can be enabled
via config instead of editing `app.py`. Defaults preserve today's
isolated, reproducible behavior.

- Adds `bare` (default `true`), `mcp_config`, and `settings` to
`ClaudeCodeAgentConfig`.
- Extracts testable `_build_command`, `_build_settings`, and
`_setup_config_dir` helpers from `_run_claude_code`. `--bare` is only
passed when `bare` is `true`; `--mcp-config` is explicit and applies
regardless. The per-run `CLAUDE_CONFIG_DIR` is the staging seam reused
by skills evaluation (#1256).
- Unit tests for the new helpers and `_run_claude_code` wiring/timeout
(subprocess mocked); README + config template document the knobs and the
reproducibility trade-off.

A note on API shape: Claude Code's `--bare` is all-or-nothing for
*auto-discovery* (skills/hooks/plugins/MCP/memory/CLAUDE.md), while
`--mcp-config` and settings are *explicit*. So the minimal honest API is
three knobs (`bare`, `mcp_config`, `settings`) rather than a flag per
feature — `bare: false` enables auto-discovery as a group.

Implements #1602. Unblocks the agent-half of #1256.

## Test plan

Unit tests pass (`27 passed`, ruff clean). Unit coverage of `app.py`
raised 62% → 79%; all added/changed code is covered. The remaining gaps
(`responses`/`run`/`_resolve_base_url`) are pre-existing methods covered
by `ng_test`'s live-server run.

Unit tests mock `asyncio.create_subprocess_exec`, so the real `claude`
binary is not exercised. **Smoke test before marking ready** (pending
API key):

- [x] **Default (`bare: true`)**: `ng_run` the reasoning_gym
claude_code_agent config, `ng_collect_rollouts` one task; confirm a
non-error row with `reward`, populated `response.output`, and `usage`.
Confirms the refactor works against the real CLI.
- [x] **`mcp_config`**: set it to a minimal MCP server config; run a
task needing the tool; confirm a tool call to the MCP tool in the output
(and that it coexists with default `bare: true`).
- [ ] **`bare: false`**: confirm a task still completes with `--bare`
dropped. *(Not run live — only omits a flag; covered by unit tests
`test_bare_false_omits_flag`.)*
- [x] **`settings`**: set a settings JSON with a unique `env` var and
have the model echo it via Bash; the value appears in the answer,
proving the merged `settings.json` is read from the staged config dir.

### Smoke test details (reproducible)

Run against an endpoint serving the Anthropic Messages API. `env.yaml`:

```yaml
anthropic_api_key: <key>
anthropic_model_name: aws/anthropic/bedrock-claude-sonnet-4-6
anthropic_base_url: https://<host>   # no /v1 suffix; the CLI appends /v1/messages
```

**1. Default (`bare: true`) — confirms the refactor works against the
real CLI**

```bash
ng_run "+config_paths=[resources_servers/reasoning_gym/configs/reasoning_gym_claude_code_agent.yaml]"

ng_collect_rollouts \
  +agent_name=reasoning_gym_claude_code_agent \
  +input_jsonl_fpath=resources_servers/reasoning_gym/data/example.jsonl \
  +output_jsonl_fpath=/tmp/cc_default.jsonl +limit=1
```

Result: rollout completed end-to-end (2 turns, tokens reported, full
verify pipeline ran). ✅

**2. `mcp_config` with default `bare: true` — confirms `--mcp-config` is
honored even with `--bare`**

MCP config (`/tmp/mcp.json`) — the official reference "everything"
server, which exposes an `add` tool:

```json
{
  "mcpServers": {
    "everything": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-everything"]
    }
  }
}
```

Input task (`/tmp/cc_mcp_input.jsonl`, one row) instructs the model to
use the MCP `add` tool (not Bash) and box the result;
`metadata.source_dataset=basic_arithmetic` so the reasoning_gym verifier
scores it:

```json
{"responses_create_params": {"input": [{"role": "user", "content": "You have an MCP tool named 'add' (exposed as mcp__everything__add). You MUST use it to compute 12345 + 67890 — do not calculate it yourself or use Bash. Put only the final number inside <answer></answer> tags."}]}, "question": "Use the MCP add tool to compute 12345 + 67890.", "answer": "80235", "metadata": {"source_dataset": "basic_arithmetic", "source_index": 0, "expression": "12345 + 67890", "num_terms": 2, "num_digits": 5, "difficulty": {"num_terms": [2, 6], "num_digits": [1, 5]}}, "agent_ref": {"type": "responses_api_agents", "name": "reasoning_gym_claude_code_agent"}}
```

Run with `mcp_config` set via a Hydra override (note `bare` stays at its
default `true`):

```bash
ng_run "+config_paths=[resources_servers/reasoning_gym/configs/reasoning_gym_claude_code_agent.yaml]" \
  +reasoning_gym_claude_code_agent.responses_api_agents.claude_code_agent.mcp_config=/tmp/mcp.json

ng_collect_rollouts \
  +agent_name=reasoning_gym_claude_code_agent \
  +input_jsonl_fpath=/tmp/cc_mcp_input.jsonl \
  +output_jsonl_fpath=/tmp/cc_mcp.jsonl +limit=1

grep -o 'mcp__[A-Za-z0-9_-]*' /tmp/cc_mcp.jsonl | sort -u
```

Result: the model invoked `mcp__everything__add`, `reward=1.0`,
`extracted_answer=80235`, 3 turns. The MCP server loaded and its tool
was used **with `--bare` still active**, validating the design choice to
treat explicit `--mcp-config` as independent of the auto-discovery
(`bare`) knob. ✅

**3. `settings` — confirms a user settings JSON is merged into the
staged `CLAUDE_CONFIG_DIR` and its `env` reaches tools**

> An `env`-var signal is used instead of a `permissions.deny` rule: the
agent always passes `--dangerously-skip-permissions`, which would bypass
deny rules and give a misleading result.

Settings file (`/tmp/cc_settings.json`):

```json
{
  "env": {
    "NG_SMOKE_TOKEN": "ng-smoke-7f3a9"
  }
}
```

Input task (`/tmp/cc_settings_input.jsonl`, one row) tells the model to
`echo "$NG_SMOKE_TOKEN"` via Bash and box the result;
`metadata.source_dataset=basic_arithmetic` keeps the verifier happy:

```json
{"responses_create_params": {"input": [{"role": "user", "content": "Use the Bash tool to run exactly this command: echo \"$NG_SMOKE_TOKEN\"  Then put the exact printed value inside <answer></answer> tags. Do not guess the value; obtain it by actually running the command."}]}, "question": "Echo the NG_SMOKE_TOKEN environment variable via Bash.", "answer": "ng-smoke-7f3a9", "metadata": {"source_dataset": "basic_arithmetic", "source_index": 0, "expression": "0 + 0", "num_terms": 2, "num_digits": 1, "difficulty": {"num_terms": [2, 6], "num_digits": [1, 1]}}, "agent_ref": {"type": "responses_api_agents", "name": "reasoning_gym_claude_code_agent"}}
```

Run with `settings` set via a Hydra override:

```bash
ng_run "+config_paths=[resources_servers/reasoning_gym/configs/reasoning_gym_claude_code_agent.yaml]" \
  +reasoning_gym_claude_code_agent.responses_api_agents.claude_code_agent.settings=/tmp/cc_settings.json

ng_collect_rollouts \
  +agent_name=reasoning_gym_claude_code_agent \
  +input_jsonl_fpath=/tmp/cc_settings_input.jsonl \
  +output_jsonl_fpath=/tmp/cc_settings.jsonl +limit=1

grep -o 'ng-smoke-7f3a9' /tmp/cc_settings.jsonl | head -1
```

Result: the answer contained `ng-smoke-7f3a9`, proving the merged
`settings.json` was read and its `env` reached the Bash tool
environment. ✅

---------

Signed-off-by: Chris Wing <cwing@nvidia.com>
Co-authored-by: Christian Munley <cmunley@nvidia.com>
Signed-off-by: Rita Fernandes Neves <rfernandesne@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

feat: make Claude Code agent runtime configurable (replace hardcoded --bare)

2 participants