feat: make claude code agent runtime capabilities configurable #1603
Merged
Replace the hardcoded `--bare` flag with config knobs so skills, MCP servers, and custom settings can be enabled via config instead of editing app.py. Defaults (bare=true, no mcp_config/settings) preserve the current isolated, reproducible behavior.

- Add `bare`, `mcp_config`, and `settings` to ClaudeCodeAgentConfig.
- Extract testable `_build_command`, `_build_settings`, and `_setup_config_dir` helpers from `_run_claude_code`. `--bare` is only passed when `bare` is true; `--mcp-config` is explicit and applies regardless. The per-run CLAUDE_CONFIG_DIR is the staging seam reused by skills evaluation (#1256).
- Cover the new helpers and `_run_claude_code` wiring/timeout paths with unit tests (subprocess mocked); document the knobs and the reproducibility trade-off in the README and config template.

Relates to #1602.

Signed-off-by: Chris Wing <cwing@nvidia.com>
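The flag logic described above can be illustrated with a minimal sketch. This is not the PR's actual `_build_command`: `AgentConfigSketch` and `build_command_sketch` are hypothetical names, and the base invocation (`claude -p <prompt>`) is an assumption about how the CLI is called. Only the three knobs and the `--bare` / `--mcp-config` behavior come from the description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AgentConfigSketch:
    """Hypothetical stand-in for the three knobs added to ClaudeCodeAgentConfig."""

    bare: bool = True                 # default keeps auto-discovery off
    mcp_config: Optional[str] = None  # path to an MCP server config, if any
    settings: Optional[str] = None    # path to a settings JSON, if any (not a CLI flag here)


def build_command_sketch(config: AgentConfigSketch, prompt: str) -> List[str]:
    """Assemble an argv list; the base command is an assumption, not the PR's code."""
    cmd = ["claude", "-p", prompt]
    if config.bare:
        # --bare is only passed when the knob is true.
        cmd.append("--bare")
    if config.mcp_config is not None:
        # --mcp-config is explicit, so it applies whether or not --bare is set.
        cmd.extend(["--mcp-config", config.mcp_config])
    return cmd


default_cmd = build_command_sketch(AgentConfigSketch(), "hi")
mcp_cmd = build_command_sketch(AgentConfigSketch(mcp_config="/tmp/mcp.json"), "hi")
open_cmd = build_command_sketch(AgentConfigSketch(bare=False), "hi")
```

In this sketch `settings` does not appear on the command line, since the description says settings are staged through the per-run CLAUDE_CONFIG_DIR rather than passed as a flag.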
cmunley1 reviewed Jun 16, 2026
cmunley1 reviewed Jun 16, 2026
Contributor
looks good to me
Contributor
btw, anthropic docs say
maybe worth mentioning their recommendation in the README. also, i think there is not much for the agent to auto-discover, as it gets a fresh tmpdir per run, but i'm not fully sure
Note that `bare` is the recommended mode for scripted/SDK calls (per Claude docs), reframe the Runtime capabilities section to reference the config knobs above, standardize the auto-discovery list, and tidy wording.

Signed-off-by: Chris Wing <cwing@nvidia.com>
cmunley1 approved these changes Jun 22, 2026
ritaneves pushed a commit that referenced this pull request Jun 25, 2026
## Summary

Replaces the hardcoded `--bare` flag in the Claude Code agent with config knobs, so skills / MCP servers / custom settings can be enabled via config instead of editing `app.py`. Defaults preserve today's isolated, reproducible behavior.

- Adds `bare` (default `true`), `mcp_config`, and `settings` to `ClaudeCodeAgentConfig`.
- Extracts testable `_build_command`, `_build_settings`, and `_setup_config_dir` helpers from `_run_claude_code`. `--bare` is only passed when `bare` is `true`; `--mcp-config` is explicit and applies regardless. The per-run `CLAUDE_CONFIG_DIR` is the staging seam reused by skills evaluation (#1256).
- Unit tests for the new helpers and `_run_claude_code` wiring/timeout (subprocess mocked); README + config template document the knobs and the reproducibility trade-off.

A note on API shape: Claude Code's `--bare` is all-or-nothing for *auto-discovery* (skills/hooks/plugins/MCP/memory/CLAUDE.md), while `--mcp-config` and settings are *explicit*. So the minimal honest API is three knobs (`bare`, `mcp_config`, `settings`) rather than a flag per feature; `bare: false` enables auto-discovery as a group.

Implements #1602. Unblocks the agent-half of #1256.

## Test plan

Unit tests pass (`27 passed`, ruff clean). Unit coverage of `app.py` rose from 62% to 79%; all added/changed code is covered. The remaining gaps (`responses`/`run`/`_resolve_base_url`) are pre-existing methods covered by `ng_test`'s live-server run.

Unit tests mock `asyncio.create_subprocess_exec`, so the real `claude` binary is not exercised.

**Smoke test before marking ready** (pending API key):

- [x] **Default (`bare: true`)**: `ng_run` the reasoning_gym claude_code_agent config, `ng_collect_rollouts` one task; confirm a non-error row with `reward`, populated `response.output`, and `usage`. Confirms the refactor works against the real CLI.
- [x] **`mcp_config`**: set it to a minimal MCP server config; run a task needing the tool; confirm a tool call to the MCP tool in the output (and that it coexists with default `bare: true`).
- [ ] **`bare: false`**: confirm a task still completes with `--bare` dropped. *(Not run live, since it only omits a flag; covered by the unit test `test_bare_false_omits_flag`.)*
- [x] **`settings`**: set a settings JSON with a unique `env` var and have the model echo it via Bash; the value appears in the answer, proving the merged `settings.json` is read from the staged config dir.

### Smoke test details (reproducible)

Run against an endpoint serving the Anthropic Messages API. `env.yaml`:

```yaml
anthropic_api_key: <key>
anthropic_model_name: aws/anthropic/bedrock-claude-sonnet-4-6
anthropic_base_url: https://<host>  # no /v1 suffix; the CLI appends /v1/messages
```

**1. Default (`bare: true`): confirms the refactor works against the real CLI**

```bash
ng_run "+config_paths=[resources_servers/reasoning_gym/configs/reasoning_gym_claude_code_agent.yaml]"

ng_collect_rollouts \
  +agent_name=reasoning_gym_claude_code_agent \
  +input_jsonl_fpath=resources_servers/reasoning_gym/data/example.jsonl \
  +output_jsonl_fpath=/tmp/cc_default.jsonl +limit=1
```

Result: rollout completed end-to-end (2 turns, tokens reported, full verify pipeline ran). ✅

**2. `mcp_config` with default `bare: true`: confirms `--mcp-config` is honored even with `--bare`**

MCP config (`/tmp/mcp.json`), using the official reference "everything" server, which exposes an `add` tool:

```json
{
  "mcpServers": {
    "everything": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-everything"]
    }
  }
}
```

Input task (`/tmp/cc_mcp_input.jsonl`, one row) instructs the model to use the MCP `add` tool (not Bash) and put the result inside `<answer></answer>` tags; `metadata.source_dataset=basic_arithmetic` so the reasoning_gym verifier scores it:

```json
{"responses_create_params": {"input": [{"role": "user", "content": "You have an MCP tool named 'add' (exposed as mcp__everything__add). You MUST use it to compute 12345 + 67890 — do not calculate it yourself or use Bash. Put only the final number inside <answer></answer> tags."}]}, "question": "Use the MCP add tool to compute 12345 + 67890.", "answer": "80235", "metadata": {"source_dataset": "basic_arithmetic", "source_index": 0, "expression": "12345 + 67890", "num_terms": 2, "num_digits": 5, "difficulty": {"num_terms": [2, 6], "num_digits": [1, 5]}}, "agent_ref": {"type": "responses_api_agents", "name": "reasoning_gym_claude_code_agent"}}
```

Run with `mcp_config` set via a Hydra override (note `bare` stays at its default `true`):

```bash
ng_run "+config_paths=[resources_servers/reasoning_gym/configs/reasoning_gym_claude_code_agent.yaml]" \
  +reasoning_gym_claude_code_agent.responses_api_agents.claude_code_agent.mcp_config=/tmp/mcp.json

ng_collect_rollouts \
  +agent_name=reasoning_gym_claude_code_agent \
  +input_jsonl_fpath=/tmp/cc_mcp_input.jsonl \
  +output_jsonl_fpath=/tmp/cc_mcp.jsonl +limit=1

grep -o 'mcp__[A-Za-z0-9_-]*' /tmp/cc_mcp.jsonl | sort -u
```

Result: the model invoked `mcp__everything__add`, `reward=1.0`, `extracted_answer=80235`, 3 turns. The MCP server loaded and its tool was used **with `--bare` still active**, validating the design choice to treat explicit `--mcp-config` as independent of the auto-discovery (`bare`) knob. ✅

**3. `settings`: confirms a user settings JSON is merged into the staged `CLAUDE_CONFIG_DIR` and its `env` reaches tools**

> An `env`-var signal is used instead of a `permissions.deny` rule: the agent always passes `--dangerously-skip-permissions`, which would bypass deny rules and give a misleading result.

Settings file (`/tmp/cc_settings.json`):

```json
{ "env": { "NG_SMOKE_TOKEN": "ng-smoke-7f3a9" } }
```

Input task (`/tmp/cc_settings_input.jsonl`, one row) tells the model to `echo "$NG_SMOKE_TOKEN"` via Bash and put the result inside `<answer></answer>` tags; `metadata.source_dataset=basic_arithmetic` keeps the verifier happy:

```json
{"responses_create_params": {"input": [{"role": "user", "content": "Use the Bash tool to run exactly this command: echo \"$NG_SMOKE_TOKEN\" Then put the exact printed value inside <answer></answer> tags. Do not guess the value; obtain it by actually running the command."}]}, "question": "Echo the NG_SMOKE_TOKEN environment variable via Bash.", "answer": "ng-smoke-7f3a9", "metadata": {"source_dataset": "basic_arithmetic", "source_index": 0, "expression": "0 + 0", "num_terms": 2, "num_digits": 1, "difficulty": {"num_terms": [2, 6], "num_digits": [1, 1]}}, "agent_ref": {"type": "responses_api_agents", "name": "reasoning_gym_claude_code_agent"}}
```

Run with `settings` set via a Hydra override:

```bash
ng_run "+config_paths=[resources_servers/reasoning_gym/configs/reasoning_gym_claude_code_agent.yaml]" \
  +reasoning_gym_claude_code_agent.responses_api_agents.claude_code_agent.settings=/tmp/cc_settings.json

ng_collect_rollouts \
  +agent_name=reasoning_gym_claude_code_agent \
  +input_jsonl_fpath=/tmp/cc_settings_input.jsonl \
  +output_jsonl_fpath=/tmp/cc_settings.jsonl +limit=1

grep -o 'ng-smoke-7f3a9' /tmp/cc_settings.jsonl | head -1
```

Result: the answer contained `ng-smoke-7f3a9`, proving the merged `settings.json` was read and its `env` reached the Bash tool environment. ✅

---------

Signed-off-by: Chris Wing <cwing@nvidia.com>
Co-authored-by: Christian Munley <cmunley@nvidia.com>
Signed-off-by: Rita Fernandes Neves <rfernandesne@nvidia.com>
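The settings smoke test relies on a user settings JSON being merged and written into a per-run config directory. The sketch below shows one way that staging could work; it is not the PR's `_build_settings` or `_setup_config_dir`. The function names are hypothetical, the shallow merge is an assumption (the real merge depth is not stated), and temp-dir cleanup is left out for brevity.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, Optional


def merge_settings_sketch(base: Dict, user_settings_path: Optional[str]) -> Dict:
    """Shallow-merge a user settings JSON over base settings (merge depth is an assumption)."""
    merged = dict(base)
    if user_settings_path is not None:
        with open(user_settings_path, encoding="utf-8") as f:
            merged.update(json.load(f))
    return merged


def stage_config_dir_sketch(settings: Dict) -> Dict[str, str]:
    """Write settings.json into a fresh temp dir and return an env that points at it."""
    config_dir = Path(tempfile.mkdtemp(prefix="claude_config_"))
    (config_dir / "settings.json").write_text(json.dumps(settings), encoding="utf-8")
    env = dict(os.environ)
    env["CLAUDE_CONFIG_DIR"] = str(config_dir)  # env var named in the PR description
    return env


# Mirror the smoke test: a settings file carrying a unique env token.
with tempfile.TemporaryDirectory() as tmp:
    user_path = Path(tmp) / "cc_settings.json"
    user_path.write_text(
        json.dumps({"env": {"NG_SMOKE_TOKEN": "ng-smoke-7f3a9"}}), encoding="utf-8"
    )
    merged = merge_settings_sketch({}, str(user_path))

env = stage_config_dir_sketch(merged)
staged_path = Path(env["CLAUDE_CONFIG_DIR"]) / "settings.json"
staged = json.loads(staged_path.read_text(encoding="utf-8"))
```

A subprocess launched with `env` would then see the staged directory through `CLAUDE_CONFIG_DIR`, which is the property the `NG_SMOKE_TOKEN` check exercises end to end.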