Commit d4911f0

Add external public API integration runner (#69)
* Capture install command output in tests

Capture install CLI child output in the dual-backend integration tests instead of inheriting the test runner streams. The rejection tests now assert the exact stderr messages, and the success tests assert their stdout status messages without printing them during passing runs.

Validation:
cargo check
cargo build
cargo clippy --all-targets --all-features -- -D warnings
cargo test
cargo +nightly fmt

* Add external public API runner smoke slice

Add scripts/public_api_suite.py as a Python MCP stdio runner that accepts a prebuilt mcp-repl binary and verifies the R console smoke path through the public repl tool. Document the runner, record the active migration plan, and remove duplicate raw Rust R console smoke coverage now covered externally.

Validation:
- python3 scripts/public_api_suite.py --binary target/debug/mcp-repl
- python3 -m py_compile scripts/public_api_suite.py
- cargo test --test server_smoke
- cargo test --test docs_contracts docs_index_lists_main_docs
- cargo check
- cargo build
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test
- cargo +nightly fmt
- cargo test --test docs_contracts plans_layout_exists
- git diff --staged --check

* Migrate timeout recovery smoke to public API runner

Add an external public API case that verifies R timeout, busy-input discard, and later recovery through the real MCP stdio surface. Remove the duplicate Rust snapshot smoke test now covered by the external runner, and update testing docs plus the active runner plan.

Validation:
- python3 scripts/public_api_suite.py --binary target/debug/mcp-repl --case r-timeout-busy-recovers
- python3 scripts/public_api_suite.py --binary target/debug/mcp-repl
- python3 -m py_compile scripts/public_api_suite.py
- cargo test --test write_stdin_batch
- cargo check
- cargo build
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test
- cargo +nightly fmt
- cargo insta pending-snapshots
- git diff --check

* Migrate reset state check to public API runner

Move the R repl_reset state-clearing check into scripts/public_api_suite.py so it runs against a built server over MCP stdio. Drop the duplicate Rust public surface test and update testing docs/plan to point at the runner coverage.

Validation:
- python3 scripts/public_api_suite.py --binary target/debug/mcp-repl --case r-reset-clears-state
- python3 scripts/public_api_suite.py --binary target/debug/mcp-repl
- cargo test --test repl_surface
- cargo check
- cargo build
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test
- cargo +nightly fmt

* Run public API suite in CI

Add the external Python public API suite as its own post-build CI step on every matrix target. Use the platform-specific debug binary path so the suite runs against the binary Cargo just built. Update testing docs and the active migration plan to record that CI now runs this external public API check.

Validation:
- cargo build
- python3 scripts/public_api_suite.py --binary target/debug/mcp-repl
- cargo check
- cargo build
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test
- cargo +nightly fmt

* Reduce slow timeout waits in tests

Shorten artificial timeout sleeps in write_stdin coverage and batch compatible echo-prefix checks into one public session so the slow behavior suite avoids extra fixed waits and real server startups. Also tighten the external public API timeout recovery case while preserving the busy-discard and recovery assertions.

Validation:
- cargo check
- cargo build
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test
- cargo +nightly fmt

* Silence routine server startup stderr

Remove the routine server startup stderr notice so passing real-server tests do not inherit status noise. Add a CLI regression that starts the real server with closed stdin, asserts the routine notice is absent, and verifies stderr diagnostics remain.

Validation:
- cargo check
- cargo build
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test
- cargo +nightly fmt

* Add quiet nextest CI profile

Add a checked-in nextest ci profile with quiet passing-test output and a default filter that keeps real client integration binaries out of the ordinary suite. Switch CI's ordinary Rust test step to `cargo nextest run --profile ci --show-progress none`, preserving the Windows serial constraint and the separate real Codex integration `cargo test` step.

Validation:
- cargo nextest run --profile ci --show-progress none
- cargo test --test docs_contracts ci_uses_quiet_nextest_profile_for_routine_suite
- cargo check
- cargo build
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test
- cargo +nightly fmt

* Serialize REPL integration tests under nextest

Add a nextest `repl-integration` group with `max-threads = 1` and assign the process-backed REPL integration binaries to it. Document that the group is for timing-sensitive server/worker transitions, and extend the docs contract test so the CI profile keeps this scheduling rule.

Review finding addressed:
- [P1] Serialize timing-sensitive tests under nextest - /Users/tomasz/github/posit-dev/mcp-repl/.github/workflows/ci.yml:98-98
  On non-Windows this new nextest step runs individual tests concurrently, but these integration tests still use in-process `lock_test_mutex()` to serialize timing-sensitive REPL sessions; nextest launches each test in a separate process, so the lock no longer protects them. I can reproduce `cargo nextest run --profile ci --show-progress none --test write_stdin_behavior` failing on macOS in `follow_up_after_timeout_spills_when_prefix_and_reply_exceed_threshold`, while the same run with `--test-threads 1` passes. Please serialize these tests globally or with nextest test groups before switching CI.

Response: The CI profile now serializes the affected REPL integration binaries with a nextest test group, preserving per-test server isolation while limiting host concurrency for the timing-sensitive suites.

Validation:
- cargo test --test docs_contracts ci_uses_quiet_nextest_profile_for_routine_suite
- cargo nextest show-config test-groups --profile ci --groups repl-integration
- cargo nextest run --profile ci --show-progress none --test write_stdin_behavior
- cargo +nightly fmt
- cargo check
- cargo build
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test

* Move pager smoke into public API runner

Add a pager command scenario to the external Python runner, including per-case server args and environment so it starts mcp-repl in pager mode with a small page size. Remove the duplicated Rust pager smoke test and update the testing docs.

Validation:
- python3 scripts/public_api_suite.py --binary target/debug/mcp-repl
- cargo +nightly fmt
- cargo check
- cargo build
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test

* Move interrupt prefix coverage to public runner

Add an external public API case that exercises R control-D restart state clearing and control-C interruption with remaining input against the built MCP server. The interrupt path waits for an explicit marker before sending the control prefix, so the check does not rely on a tiny sleep race. Remove the matching Rust prefix tests and update the public runner docs.

Validation:
- python3 scripts/public_api_suite.py --binary target/debug/mcp-repl --case r-interrupt-restart-prefixes --timeout 45
- cargo test --test interrupt
- cargo check
- cargo build
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test
- cargo +nightly fmt
- python3 scripts/public_api_suite.py --binary target/debug/mcp-repl --timeout 45

* Move output bundle coverage to public runner

Add Python public API scenarios for files-mode output bundles, including text-only bundles, count pruning, timeout backfill through a gate, and size-cap omission. Remove duplicated broad Rust integration tests while keeping narrower Rust edge coverage in place.

Validation:
- python3 scripts/public_api_suite.py --binary target/debug/mcp-repl --case r-output-bundle-files --case r-output-bundle-size-limit
- cargo test --test write_stdin_behavior timeout_output_bundle_is_disclosed_only_after_poll_crosses_hard_spill_threshold -- --exact
- cargo check
- cargo build
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test
- cargo +nightly fmt
- python3 scripts/public_api_suite.py --binary target/debug/mcp-repl
- python3 -m py_compile scripts/public_api_suite.py

* Narrow nextest REPL residue group

Limit the CI nextest `repl-integration` group to the remaining timing-sensitive Rust residue now that pager smoke, interrupt/restart prefixes, and broad files-mode output bundle coverage have moved into the Python public runner. Update the testing docs, active migration plan, and docs contract so pager flag, R manual, and R vignette binaries stay in the ordinary nextest pool.

Validation:
- cargo test --test docs_contracts ci_uses_quiet_nextest_profile_for_routine_suite
- cargo nextest show-config test-groups --profile ci --groups repl-integration
- cargo nextest run --profile ci --show-progress none --test pager_flags --test r_manuals --test r_vignettes
- cargo check
- cargo build
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test
- cargo +nightly fmt
- cargo nextest run --profile ci --show-progress none

* Pin R public API suite cases

Pin every r-* public API suite case with --interpreter r so an inherited MCP_REPL_INTERPRETER value cannot redirect those cases to another backend. Add a regression test that checks all R suite case metadata carries the explicit R interpreter flag.

Review finding:
- [P2] Pin the R interpreter for R public API cases - /Users/tomasz/github/posit-dev/mcp-repl/scripts/public_api_suite.py:782-782
  When `MCP_REPL_INTERPRETER` is present in the environment, these `r-*` cases inherit it because the command only adds `--sandbox` and per-case args. Setting `MCP_REPL_INTERPRETER=python` makes the suite launch the Python backend (`r-console-basic` can still pass on `1+1`, while later R-specific cases fail), so the new public API suite no longer reliably exercises R; pass `--interpreter r` for these cases or scrub that env var.

Response: Added an R suite-case constructor that prepends --interpreter r to every r-* case, including cases with additional per-case server args.

Validation:
- python3 -m unittest scripts/test_public_api_suite.py (failed before the fix, passes after)
- MCP_REPL_INTERPRETER=python python3 scripts/public_api_suite.py --binary target/debug/mcp-repl
- cargo check
- cargo build
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test
- cargo +nightly fmt

* Separate local and CI nextest profiles

Keep the default nextest profile as the local full test loop so authenticated Codex and Claude integration binaries are included. Add a CI profile filter for unauthenticated workflow runs and remove the CI Codex install/integration step. Refresh agent/testing docs around the public API suite, nextest usage, validation surfaces, and snapshot workflow. Pin the Codex integration fixture to the Spark model while keeping Claude on haiku, and lock the contract in docs_contracts.

Validation:
- cargo check
- cargo build
- python3 scripts/public_api_suite.py --binary target/debug/mcp-repl
- cargo clippy --all-targets --all-features -- -D warnings
- cargo nextest run --show-progress none
- cargo test
- cargo +nightly fmt
- cargo test --test docs_contracts
- git diff --check

* Detect client auth before live integrations

Codex and Claude live integration tests now preflight CLI launch/auth state and print explicit skip banners when a client cannot run. The local nextest profile shows successful output for those binaries so skip reasons are visible during normal local runs. Claude preserves the minimal host CLI environment needed for first-party auth probes while keeping test config isolated.

Validation:
- cargo check
- cargo build
- python3 scripts/public_api_suite.py --binary target/debug/mcp-repl
- cargo clippy --all-targets --all-features -- -D warnings
- cargo nextest run --show-progress none
- cargo test
- cargo +nightly fmt
- git diff --check

* Remove obsolete nextest serial scheduling

Drop the repl-integration nextest test group so local runs use normal nextest scheduling. Update the docs and docs contract to describe the CI profile as only filtering unauthenticated real-client integrations. Normalize optional Codex workspace metadata from the wire snapshot so the snapshot is stable under normal parallel scheduling.

Validation:
- cargo test --test docs_contracts
- cargo test --test codex_approvals_tui unix_impl::normalize_wire_snapshot_drops_volatile_turn_metadata_fields
- cargo check
- cargo build
- python3 scripts/public_api_suite.py --binary target/debug/mcp-repl
- cargo clippy --all-targets --all-features -- -D warnings
- cargo nextest run --show-progress none
- cargo test
- cargo +nightly fmt
- git diff --check

* Rename integration test runner

Move the external public API runner under tests/run_integration_tests.py so the real-binary suite lives with the test harnesses. Update CI, AGENTS.md, and testing docs to reference the new path, and extend docs contracts to reject the stale script path.

Validation:
- cargo check
- cargo build
- python3 tests/run_integration_tests.py --binary target/debug/mcp-repl
- cargo clippy --all-targets --all-features -- -D warnings
- cargo nextest run --show-progress none
- cargo test
- cargo +nightly fmt
- git diff --check

* Split cargo test from explicit Rust test suite

Make plain cargo test a small Cargo compatibility check by opting the binary unit-test target and integration targets out of default Cargo discovery. Add an explicit Python wrapper around nextest to run those Rust targets, including a CI profile that excludes live client integrations and a clippy mode for the opted out targets. Also stabilize the interrupt prompt-shaped-output test with an explicit marker instead of a sleep-only fixture, and update CI, AGENTS, docs, and docs-contract coverage for the new split test entry points.

Validation:
- cargo check
- cargo build
- python3 tests/run_integration_tests.py --binary target/debug/mcp-repl
- cargo clippy --all-targets --all-features -- -D warnings
- python3 tests/run_rust_tests.py --clippy
- python3 tests/run_rust_tests.py --profile default
- cargo test
- cargo +nightly fmt --all -- --check

* Restore real Codex CI integration coverage

Run CI's ordinary Rust suite with nextest directly while keeping all Rust test targets discoverable by Cargo. Remove the transitional explicit Rust test wrapper and the Cargo.toml opt-outs that made `cargo test` incomplete. Install the real Codex CLI in CI and run the Codex integration binary separately. The Codex smoke test now defaults to backend auto-selection: use live Spark when `codex login status`, model availability, and local auth are present; otherwise use the mocked provider. The forced mock and live paths are documented for targeted validation.

Validation:
- cargo check
- cargo build
- python3 tests/run_integration_tests.py --binary target/debug/mcp-repl
- cargo clippy --all-targets --all-features -- -D warnings
- cargo nextest run --show-progress none
- cargo test
- cargo +nightly fmt --all -- --check
- MCP_REPL_CODEX_BACKEND=mock cargo test -j 1 --test codex_approvals_tui codex_exec_auto_backend_smoke -- --test-threads=1

* Use cargo test for CI Rust suite

Rename the Codex client integration target from codex_approvals_tui to codex_integration so the name matches its broader exec, install, sandbox metadata, and mock-provider coverage. Remove the nextest profile and CI dependency. CI now installs Codex before the Rust suite, forces MCP_REPL_CODEX_BACKEND=mock, and runs the standard cargo test path with quiet output. This keeps Rust test discovery on Cargo's default path instead of maintaining two Rust runners.

Validation:
- cargo check
- cargo build
- python3 tests/run_integration_tests.py --binary target/debug/mcp-repl
- cargo clippy --all-targets --all-features -- -D warnings
- MCP_REPL_CODEX_BACKEND=mock cargo test -j 1 --test codex_integration codex_exec_auto_backend_smoke -- --test-threads=1
- cargo test --test docs_contracts
- cargo test --quiet
- cargo +nightly fmt --all -- --check

* Shorten interrupt drain setup timeout

Use a 5ms timeout for the prompt-shaped child-output interrupt drain setup request. This keeps the setup call focused on entering the busy state before the child output is likely to be drained into the initial timeout response.

Validation:
- MCP_REPL_CODEX_BACKEND=mock cargo test --quiet --test interrupt files_interrupt_drain_preserves_prompt_shaped_child_stdout -- --nocapture
- cargo check
- cargo build
- python3 tests/run_integration_tests.py --binary target/debug/mcp-repl
- cargo clippy --all-targets --all-features -- -D warnings
- MCP_REPL_CODEX_BACKEND=mock cargo test --quiet
- cargo +nightly fmt

* Stabilize timeout bundle CI tests

Shorten the initial timeout in the hidden timeout bundle spill tests so slow first-call startup cannot let the initial reply reach the oversized output disclosure path before the busy follow-up. Update the Linux Codex wire snapshot for the current Codex CLI metadata, which no longer includes x-codex-turn-metadata.workspaces.

Validation:
- MCP_REPL_CODEX_BACKEND=mock cargo test --test write_stdin_behavior busy_follow_up_reuses_hidden_timeout_bundle_when_it_first_spills -- --nocapture
- MCP_REPL_CODEX_BACKEND=mock cargo test --test codex_integration codex_exec_wire_sandbox_state_meta -- --nocapture
- cargo check
- cargo build
- python3 tests/run_integration_tests.py --binary target/debug/mcp-repl
- cargo clippy --all-targets --all-features -- -D warnings
- MCP_REPL_CODEX_BACKEND=mock cargo test --quiet
- cargo +nightly fmt

* Clarify public API runner sandbox plan

Clarify that the external Python runner is not itself sandboxed, but each spawned mcp-repl binary still owns and must exercise the sandbox contract. The next slice now calls for workspace-write sandbox coverage in the Python runner before continuing representative real-binary migration out of Rust.

Validation:
cargo test --test docs_contracts
1 parent 23da2b7 commit d4911f0
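
The "Pin R public API suite cases" entry in the commit message describes a constructor that prepends `--interpreter r` to every `r-*` case. A minimal sketch of that idea follows, assuming a hypothetical `SuiteCase` record and helper names (`r_case`, `build_command`) that are not taken from the runner itself; the extra flag in the usage note is also a placeholder.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SuiteCase:
    """Hypothetical case metadata: a case name plus extra server arguments."""
    name: str
    server_args: Tuple[str, ...] = ()


def r_case(name: str, *extra_args: str) -> SuiteCase:
    """Build an r-* case whose server args always start with --interpreter r.

    Passing the flag explicitly means an inherited MCP_REPL_INTERPRETER
    value is not the only thing deciding which backend the case runs on.
    """
    if not name.startswith("r-"):
        raise ValueError(f"R case names must start with 'r-': {name!r}")
    return SuiteCase(name=name, server_args=("--interpreter", "r", *extra_args))


def build_command(binary: str, case: SuiteCase) -> List[str]:
    """Assemble the argv for one case: binary first, then the case's args."""
    return [binary, *case.server_args]
```

For example, `build_command("target/debug/mcp-repl", r_case("r-console-basic"))` yields the binary path followed by `--interpreter r`; a case with additional per-case arguments keeps the interpreter flag in front of them. A regression check in the same spirit can iterate over all `r-*` cases and assert that the first two server args are `--interpreter` and `r`.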

28 files changed

Lines changed: 2041 additions & 968 deletions

.github/workflows/ci.yml

Lines changed: 16 additions & 11 deletions
@@ -78,16 +78,17 @@ jobs:
       - name: cargo build
         run: cargo build

-      - name: cargo clippy
-        run: cargo clippy --all-targets --all-features -- -D warnings
-
-      - name: cargo test
+      - name: Python public API suite
         if: matrix.os != 'windows-2022'
-        run: cargo test
+        run: python3 tests/run_integration_tests.py --binary target/debug/mcp-repl

-      - name: cargo test (windows serial)
+      - name: Python public API suite (windows)
         if: matrix.os == 'windows-2022'
-        run: cargo test -j 1 -- --test-threads=1
+        shell: pwsh
+        run: python tests/run_integration_tests.py --binary target/debug/mcp-repl.exe
+
+      - name: cargo clippy
+        run: cargo clippy --all-targets --all-features -- -D warnings

       - name: Install Codex CLI
         if: matrix.os != 'windows-2022'
@@ -110,13 +111,17 @@ jobs:
           $env:PATH = "$npmPrefix;$env:PATH"
           & (Join-Path $npmPrefix "codex.cmd") --version

-      - name: cargo test (real codex integrations)
+      - name: cargo test
         if: matrix.os != 'windows-2022'
-        run: cargo test -j 1 --test codex_approvals_tui -- --test-threads=1
+        env:
+          MCP_REPL_CODEX_BACKEND: mock
+        run: cargo test --quiet

-      - name: cargo test (real codex integrations, windows serial)
+      - name: cargo test (windows serial)
         if: matrix.os == 'windows-2022'
-        run: cargo test -j 1 --test codex_approvals_tui -- --test-threads=1
+        env:
+          MCP_REPL_CODEX_BACKEND: mock
+        run: cargo test -j 1 --quiet -- --test-threads=1

       - name: cargo +nightly fmt
         run: cargo +nightly fmt --all -- --check
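
The workflow above passes a different debug binary path on Windows (`target/debug/mcp-repl.exe`) than on the other matrix targets. As a rough illustration of the same choice made from a local script, the sketch below picks the path from the platform string; the function names are hypothetical, and only the two paths, the runner path, and the `--binary` flag come from the workflow.

```python
import sys
from typing import List


def debug_binary_path(platform: str = sys.platform) -> str:
    """Return the Cargo debug binary path used for the given platform string."""
    # Cargo appends ".exe" to binaries on Windows ("win32" in sys.platform terms).
    suffix = ".exe" if platform.startswith("win") else ""
    return f"target/debug/mcp-repl{suffix}"


def suite_command(python: str = sys.executable, platform: str = sys.platform) -> List[str]:
    """Build the argv for running the external suite against the debug binary."""
    return [
        python,
        "tests/run_integration_tests.py",
        "--binary",
        debug_binary_path(platform),
    ]
```

Passing the result of `suite_command()` to `subprocess.run(..., check=True)` after `cargo build` would mirror the CI ordering, where the suite runs against the binary Cargo just built.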

AGENTS.md

Lines changed: 15 additions & 5 deletions
@@ -7,19 +7,26 @@ Keep this file short. It is a table of contents, not the full manual.
 - If you modified code, run all required checks before replying:
   - `cargo check`
   - `cargo build`
+  - `python3 tests/run_integration_tests.py --binary target/debug/mcp-repl`
   - `cargo clippy --all-targets --all-features -- -D warnings`
-  - `cargo test`
+  - `cargo test --quiet`
   - `cargo +nightly fmt`
+- For docs-only changes, run the narrow docs validation that covers the edited
+  files, usually `cargo test --test docs_contracts`.
+- When changing Codex backend selection or CI real-client wiring, also run:
+  - `MCP_REPL_CODEX_BACKEND=mock cargo test -j 1 --test codex_integration codex_exec_auto_backend_smoke -- --test-threads=1`
 - Treat all clippy warnings as failures. Do not leave warning cleanup for later.
 - Never pass `--vanilla` to `R` or `Rscript` unless the user explicitly asks for it.

 ## Start Here

 - `docs/index.md`: source-of-truth map for repository docs.
-- `docs/architecture.md`: subsystem map for the binary, worker, sandbox, and eval surfaces.
+- `docs/architecture.md`: subsystem map for the CLI, server, worker, sandbox, output, and validation surfaces.
 - `docs/testing.md`: public verification surface and snapshot workflow.
 - `docs/debugging.md`: debug logs, `--debug-repl`, and stdio tracing.
 - `docs/sandbox.md`: sandbox modes and writable-root policy.
+- `docs/output_timeline.md`: visible output ordering across sideband and raw streams.
+- `docs/worker_sideband_protocol.md`: current server/worker IPC contract.
 - `docs/plans/AGENTS.md`: when to create checked-in execution plans.

 ## Glossary
@@ -41,8 +48,8 @@ Keep this file short. It is a table of contents, not the full manual.
 - Sandbox metadata: Codex per-tool-call `_meta["codex/sandbox-state-meta"]` used by `--sandbox inherit` to choose the effective worker sandbox for that call.
 - Writable root: An absolute path that a `workspace-write` worker may write, subject to forced read-only subpaths like `.git`, `.codex`, and `.agents`.
 - Session temp directory: The server-allocated per-session temp path exposed to the worker as `TMPDIR` and `MCP_REPL_R_SESSION_TMPDIR`.
-- Sideband IPC: The JSON-lines server/worker pipe for structural facts such as `readline_start`, `readline_result`, `plot_image`, `request_end`, and `session_end`.
-- stdout/stderr pipes: The normal process output streams captured by the server. They are the authoritative visible text source; sideband only helps interpret them.
+- Sideband IPC: The JSON-lines server/worker pipe for structural facts such as `readline_start`, `readline_input`, `readline_discard`, `output_text`, `plot_image`, and `session_end`.
+- Raw output capture: The stdout/stderr pipes or PTY stream captured by the server for unowned visible text. Sideband carries worker-owned text and structural facts.
 - Output timeline: The server-side reconstruction of visible output order from captured stdout/stderr plus sideband facts.
 - Server-owned: State, files, or notices created and retained by the main server process, not by the runtime or the worker. Use this for output bundles, response finalization, debug logs, and server temp roots.
 - Worker-originated text: Text that came from the worker REPL or worker child processes and can be written to `transcript.txt`.
@@ -60,7 +67,10 @@ Keep this file short. It is a table of contents, not the full manual.
 - `cargo insta test`
 - `cargo insta pending-snapshots`
 - `cargo insta review` or `cargo insta accept` / `cargo insta reject`
-- CI-style validation: `cargo insta test --check --unreferenced=reject`
+- CI-style validation: `cargo insta test --check`
+- Do not add `--unreferenced=reject` to the general snapshot check; this
+  repository keeps valid platform-specific snapshots that are unreferenced on
+  other platforms.
 - For broad intentional snapshot migrations: `cargo insta test --force-update-snapshots --accept`
 - Do not delete `tests/snapshots/*.snap.new` manually. Use `cargo insta reject`.

docs/architecture.md

Lines changed: 9 additions & 1 deletion
@@ -63,7 +63,15 @@ The repository is organized around a few concrete subsystems rather than deep pa

 ### Validation harnesses

-- `tests/` is the primary public validation surface. The tests exercise tool behavior, snapshots, sandboxing, and client integrations through the exposed MCP interface.
+- `tests/run_integration_tests.py` starts an already-built `mcp-repl` binary and
+  exercises public MCP tools over stdio. It covers representative real-binary
+  behavior that should not depend on Rust internals.
+- `tests/` contains the Rust public API, snapshot, sandbox, backend, install,
+  protocol-worker, and client-integration suites. Most tests exercise behavior
+  through the exposed MCP interface using the shared harness in `tests/common/`.
+- CI uses Cargo's standard Rust test runner after installing the real Codex CLI,
+  with the Codex backend forced to the mocked provider. The tests should not
+  depend on special local scheduling.

 ## Design Constraints
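
The architecture note above says CI forces the Codex backend to the mocked provider, and the commit message describes auto-selection otherwise: live when `codex login status`, model availability, and local auth are all present, mock when any is missing. The sketch below restates that decision in Python purely for illustration (the tests themselves are Rust). The `MCP_REPL_CODEX_BACKEND=mock` setting comes from the diff; the `live` value, the fallback to auto-selection for any other value, and the function name are assumptions.

```python
from typing import Mapping


def select_codex_backend(
    env: Mapping[str, str],
    login_ok: bool,
    model_available: bool,
    local_auth_present: bool,
) -> str:
    """Return "mock" or "live" for the Codex smoke test backend.

    A forced value in MCP_REPL_CODEX_BACKEND wins. The "live" spelling is
    an assumption; only "mock" appears in the workflow and AGENTS.md.
    """
    forced = env.get("MCP_REPL_CODEX_BACKEND", "").strip().lower()
    if forced in ("mock", "live"):
        return forced
    # Auto-selection: every precondition must hold before using the live backend.
    if login_ok and model_available and local_auth_present:
        return "live"
    return "mock"
```

With `{"MCP_REPL_CODEX_BACKEND": "mock"}` the function returns `"mock"` regardless of auth state, which matches the CI wiring; with an empty environment it falls back to the mocked provider unless all three checks pass.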

docs/debugging.md

Lines changed: 1 addition & 1 deletion
@@ -84,7 +84,7 @@ Useful environment variables:

 ## External wire trace proxy

-The built-in event log only sees what reaches `mcp-repl` after startup. If you need the exact stdio traffic between an MCP client and the server, use the external proxy in [scripts/mcp-stdio-trace.py](/Users/tomasz/github/t-kalinowski/mcp-repl/scripts/mcp-stdio-trace.py).
+The built-in event log only sees what reaches `mcp-repl` after startup. If you need the exact stdio traffic between an MCP client and the server, use the external proxy in [scripts/mcp-stdio-trace.py](../scripts/mcp-stdio-trace.py).

 What it does:

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
+# External Public API Runner
+
+## Summary
+
+- Move public MCP behavior checks, including sandbox-visible real-binary behavior, toward an external Python runner that starts a built `mcp-repl` binary over stdio.
+- Keep Rust tests for unit contracts, snapshot normalization, protocol-worker conformance, platform-specific mechanics, and behavior that is not yet covered externally.
+
+## Status
+
+- State: active
+- Last updated: 2026-05-18
+- Current phase: implementation
+
+## Current Direction
+
+- Grow the minimal Python runner with small, real-client scenarios that speak MCP directly with newline-delimited JSON-RPC.
+- Treat sandboxing as product behavior for the external suite. The test runner process is outside the sandbox, but each case starts a built `mcp-repl` binary with an explicit sandbox state and verifies the worker is launched inside that policy through public MCP calls.
+- Reintroduce sandbox coverage in the Python runner now, starting with the default `workspace-write` behavior and then adding read-only or full-access contrasts where they prove public behavior.
+- Keep each migrated case focused enough that matching Rust integration coverage can be removed or reduced in the same change.
+- Use `danger-full-access` only for individual external cases whose purpose is unrelated to sandboxing and where disabling sandbox enforcement does not hide the product behavior under test.
+- Keep existing Rust tests discoverable by `cargo test` until their scenario is migrated or removed in the same change that adds equivalent Python coverage.
+
+## Long-Term Direction
+
+- Migrate representative public API integration scenarios out of Rust when the Python runner covers the same real-binary behavior, including sandbox behavior that is observable through public MCP tool calls.
+- Keep protocol-worker conformance tests, Rust-only contract tests, and deeply platform-specific sandbox launch mechanics in Rust unless there is a clearer public external scenario for the same contract.
+
+## Phase Status
+
+- Phase 0: completed - add the runner shell and first R console smoke case.
+- Phase 1: completed - migrate another small real-client scenario with timeout or busy-worker behavior.
+- Phase 2: completed - run the external suite in CI after the debug binary is built.
+- Phase 3: pending - reintroduce sandbox scenarios in the Python runner and continue migrating duplicate real-binary Rust integration coverage case by case.
+
+## Locked Decisions
+
+- The external suite must accept a prebuilt binary path instead of building the binary itself.
+- The runner should call MCP tools over stdio and avoid internal Rust helpers.
+- CI runs the external suite as its own step after `cargo build` on each matrix target.
+- Do not opt Rust test targets out of Cargo discovery in anticipation of future migration work.
+
+## Open Questions
+
+- Which sandbox scenarios have public external equivalents and which should remain Rust-only launch or platform-mechanics coverage.
+- Which additional public scenarios should migrate into the external suite before the parent migration is complete.
+
+## Next Safe Slice
+
+- Add a Python-runner sandbox case that starts the binary under `workspace-write`, proves an in-workspace write succeeds, and proves an out-of-policy write is blocked through the public `repl` tool.
+- In the same or next small slice, migrate another representative real-binary Rust integration scenario to the Python runner and remove or reduce only the matching Rust coverage.
+
+## Stop Conditions
+
+- Stop if a migrated scenario requires internal server state inspection instead of public MCP requests.
+- Stop if runner behavior needs platform-specific process supervision beyond the simple stdio client.
+
+## Decision Log
+
+- 2026-05-17: Chose a narrow first slice with one R `repl` smoke case to prove the runner can initialize the real binary and call public tools before moving more complex scenarios.
+- 2026-05-17: Added an R timeout/busy/recovery case to the external runner and removed the matching Rust snapshot smoke test.
+- 2026-05-17: Added an R `repl_reset` state-clearing case to the external runner and removed the duplicate Rust public surface test.
+- 2026-05-17: Added the external public API suite to the cross-platform CI workflow as a separate post-build step.
+- 2026-05-17: Added an R interrupt/restart-prefix scenario with explicit interrupt readiness polling and removed duplicate Rust prefix tests.
+- 2026-05-17: Added files-mode output-bundle scenarios for text bundles, pruning, timeout backfill, and size-cap omission, then removed duplicate broad Rust integration coverage.
+- 2026-05-17: Removed obsolete serial scheduling after verifying the remaining Rust REPL binaries pass under normal Cargo test scheduling.
+- 2026-05-18: Reaffirmed that unmigrated Rust scenarios must remain discoverable by `cargo test`; migrations should replace Rust coverage with equivalent Python coverage in the same change, not disable tests ahead of time.
+- 2026-05-18: Clarified that the external runner itself is not sandboxed, but the spawned `mcp-repl` binary still owns the sandbox contract; the next slice should restore sandbox coverage in the Python runner starting with `workspace-write`.
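
The plan above has the runner speak MCP with newline-delimited JSON-RPC over stdio. A minimal sketch of that framing follows, using an in-memory stream in place of the server's pipes. The helper names are illustrative rather than taken from the runner, and the `input` argument name for the `repl` tool is an assumption; `tools/call` is the standard MCP method for invoking a tool.

```python
import io
import json
from typing import Any, Dict, Optional, TextIO


def make_request(request_id: int, method: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request object."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}


def write_message(stream: TextIO, message: Dict[str, Any]) -> None:
    """Write one message as exactly one line, then flush.

    json.dumps escapes newlines inside strings, so the only raw newline
    in the output is the frame terminator added here.
    """
    stream.write(json.dumps(message, separators=(",", ":")) + "\n")
    stream.flush()


def read_message(stream: TextIO) -> Optional[Dict[str, Any]]:
    """Read one line and parse it as JSON; return None at end of stream."""
    line = stream.readline()
    if not line:
        return None
    return json.loads(line)


if __name__ == "__main__":
    # Round-trip a tools/call request through an in-memory buffer.
    buffer = io.StringIO()
    request = make_request(
        1, "tools/call", {"name": "repl", "arguments": {"input": "1+1\n"}}  # "input" is assumed
    )
    write_message(buffer, request)
    buffer.seek(0)
    print(read_message(buffer) == request)
```

In a real client the same two helpers would wrap the `stdin` and `stdout` of a `subprocess.Popen` started in text mode on the prebuilt binary, which keeps the runner independent of any internal Rust helpers.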

docs/testing.md

Lines changed: 104 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -5,14 +5,14 @@ This file is the entrypoint for deciding how to verify a change.
 
 ## Core Test Surface
 
-- `tests/repl_surface.rs`: basic `repl` and `repl_reset` behavior.
-- `tests/repl_surface.rs` and `tests/python_backend.rs`: IPC ownership coverage. Only the main worker may own sideband fds; user-spawned children must not. `tests/python_backend.rs` also covers detached-idle oversized-output behavior, Unix Python PTY-backed C stdio, CPython `input()` through the readline path, and the absence of direct-fd stdin shims through the public `repl` API.
-- `tests/server_smoke.rs`: end-to-end MCP session smoke coverage.
-- `tests/write_stdin_behavior.rs`: timeout polling, oversized text replies, and transcript-file behavior through the public `repl` API.
+- `tests/run_integration_tests.py`: external real-binary checks over MCP stdio, including basic R `repl`, pager command handling, files-mode output bundles, timeout/busy recovery, interrupt/restart prefixes, and `repl_reset` state clearing.
+- `tests/common/`: shared Rust MCP harness for public tool calls, transcript snapshots, sandbox assertions, and client-install fixtures.
+- `tests/repl_surface.rs`, `tests/server_smoke.rs`, `tests/mcp_transcripts.rs`, and `tests/write_stdin_*.rs`: core `repl`/`repl_reset` behavior, timeout polling, oversized text replies, transcript-file behavior, and snapshot coverage through the public tool API.
+- `tests/pager*.rs` and `tests/oversized_output_cli.rs`: pager mode, files mode, and oversized-output CLI behavior.
+- `tests/python_*.rs`, `tests/r_*.rs`, `tests/plot_images.rs`, and `tests/python_plot_images.rs`: backend-specific public behavior, help/manual surfaces, PTY-backed Python readline behavior, and image output.
 - `tests/zod_protocol.rs`: protocol-worker conformance, including PTY launch with sideband IPC kept separate from visible PTY output.
 - `tests/sandbox.rs` and `tests/sandbox_state_updates.rs`: sandbox policy behavior and Codex per-tool-call sandbox metadata.
-- `tests/plot_images.rs` and `tests/python_plot_images.rs`: plot/image behavior through the public tool surface.
-- `tests/codex_approvals_tui.rs` and `tests/claude_integration.rs`: client integration coverage.
+- `tests/install_*.rs`, `tests/codex_integration.rs`, and `tests/claude_integration.rs`: install-path and real client integration coverage.
 - `tests/docs_contracts.rs`: docs map and snapshot-facing documentation contracts.
 
 ## Snapshot Workflow
@@ -22,18 +22,115 @@ This file is the entrypoint for deciding how to verify a change.
 - `cargo insta test`
 - `cargo insta pending-snapshots`
 - `cargo insta review` or `cargo insta accept` / `cargo insta reject`
+- CI-style validation: `cargo insta test --check`
+- Do not add `--unreferenced=reject` to the general snapshot check; this
+repository keeps valid platform-specific snapshots that are unreferenced on
+other platforms.
 - Do not delete `tests/snapshots/*.snap.new` manually. Use `cargo insta reject`.
 
+## External Public API Suite
+
+Build the binary first, then run the Python suite:
+
+```sh
+cargo build
+python3 tests/run_integration_tests.py --binary target/debug/mcp-repl
+```
+
+The runner starts the real server over MCP stdio and calls public tools only. It
+uses `--sandbox danger-full-access` by default so the suite stays focused on
+client protocol behavior rather than sandbox policy.
+
+Use `--case <name>` to run one public API case while iterating.
+
+CI runs this suite after `cargo build` in the main cross-platform workflow,
+using the debug binary built for each matrix target.
+
+## Rust Suite
+
+Use Cargo's standard Rust test runner:
+
+```sh
+cargo test
+```
+
+The Rust suite uses plain `cargo test` as its single runner. Plain `cargo test`
+remains the full Cargo compatibility path. It must continue to discover the
+binary unit tests and Rust integration targets. CI passes Cargo's `--quiet`
+flag to keep successful logs compact.
+
+```sh
+cargo test --quiet
+```
+
+CI installs Codex before `cargo test` and sets `MCP_REPL_CODEX_BACKEND=mock`,
+so the Codex integration target runs through the mocked provider as part of the
+ordinary Rust suite. Windows keeps the Rust suite fully serial with `-j 1` and
+`--test-threads=1`.
+
+Do not opt Rust test targets out of Cargo discovery in anticipation of a future
+Python migration; migrate a scenario only when the Rust coverage is deleted or
+reduced in the same change that adds equivalent external coverage.
+
+## Real Client Integrations
+
+CI installs Codex before the Rust suite. The Codex CI integration does not
+require OpenAI authentication because the test config points Codex at a local
+mock provider.
+
+By default, the Codex integration uses `MCP_REPL_CODEX_BACKEND=auto`: it checks
+whether Codex is logged in, checks whether `gpt-5.3-codex-spark` is available,
+and uses that live backend when both checks pass. Otherwise it uses the mocked
+provider. Set `MCP_REPL_CODEX_BACKEND=live` or `MCP_REPL_CODEX_BACKEND=mock`
+to force one path.
+
+When changing Codex backend selection or CI real-client wiring, run the forced
+mock path explicitly:
+
+```sh
+MCP_REPL_CODEX_BACKEND=mock cargo test -j 1 --test codex_integration codex_exec_auto_backend_smoke -- --test-threads=1
+```
+
+To validate the authenticated live path directly on a machine with Spark access:
+
+```sh
+MCP_REPL_CODEX_BACKEND=live cargo test -j 1 --test codex_integration codex_exec_auto_backend_smoke -- --test-threads=1
+```
+
+Local full verification includes the Codex and Claude integration binaries when
+those clients are installed. Codex uses the Spark model
+(`gpt-5.3-codex-spark`) in its isolated test config. Claude uses `haiku`.
+If a required client binary is unavailable, the matching integration test prints
+a skip banner with the reason. Codex backend selection prints a `CODEX` banner
+showing whether the test selected live Spark or the mocked provider.
+
+To run only those integrations:
+
+```sh
+cargo test --quiet --test codex_integration --test claude_integration
+```
+
+CI runs the Codex integration target as part of `cargo test`; Claude integration
+remains local because provider authentication is unavailable in CI.
+
 ## Full Verification Before Replying
 
 If you modify code, run:
 
 - `cargo check`
 - `cargo build`
+- `python3 tests/run_integration_tests.py --binary target/debug/mcp-repl`
 - `cargo clippy --all-targets --all-features -- -D warnings`
-- `cargo test`
+- `cargo test --quiet`
 - `cargo +nightly fmt`
 
+For docs-only changes, run the narrow validation that covers the edited docs.
+For agent-facing docs, that is usually:
+
+```sh
+cargo test --test docs_contracts
+```
+
 ## Debug-Then-Validate Loop
 
 When behavior is unclear:

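Note on the `MCP_REPL_CODEX_BACKEND` rule documented in the docs/testing.md change above: the selection logic can be read as a small decision function. The following is an illustrative Python sketch only, not code from this commit. `select_codex_backend` and its two boolean parameters are hypothetical names standing in for whatever login and model-availability checks the real integration test performs, and raising on an unrecognized value is an assumption, since the docs do not say how such values are handled.

```python
import os


def select_codex_backend(logged_in, spark_available, env=None):
    """Return "live" or "mock" following the rule described in docs/testing.md.

    Hypothetical helper: the two booleans stand in for the real checks
    (Codex login state, `gpt-5.3-codex-spark` availability).
    """
    env = os.environ if env is None else env
    mode = env.get("MCP_REPL_CODEX_BACKEND", "auto")
    if mode in ("live", "mock"):
        # An explicit setting forces one path regardless of the checks.
        return mode
    if mode != "auto":
        # Assumption: the documented values are auto, live, and mock only.
        raise ValueError(f"unsupported MCP_REPL_CODEX_BACKEND value: {mode!r}")
    # auto: use the live backend only when both checks pass.
    return "live" if logged_in and spark_available else "mock"
```

Under this reading, CI's `MCP_REPL_CODEX_BACKEND=mock` setting short-circuits both checks, which matches the statement that the CI run does not require OpenAI authentication.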
src/server.rs

Lines changed: 0 additions & 1 deletion
@@ -785,7 +785,6 @@ pub async fn run(
     sandbox_plan: SandboxCliPlan,
     oversized_output: OversizedOutputMode,
 ) -> Result<(), Box<dyn std::error::Error>> {
-    eprintln!("starting mcp-repl server");
     let backend = worker_launch.builtin_backend().unwrap_or(Backend::R);
     crate::event_log::log(
         "server_run_begin",

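For context on how an external runner reaches this server over MCP stdio: the stdio transport exchanges JSON-RPC 2.0 messages, one JSON object per line. The Python sketch below shows only the message framing for an initialize handshake followed by a `repl` tool call. It is not taken from the commit's runner: the helper names, the `clientInfo` values, the protocol version string, and the `{"input": ...}` argument shape for `repl` are all assumptions made for illustration.

```python
import json


def encode_message(message):
    """Frame one JSON-RPC message as a single newline-terminated line."""
    # json.dumps escapes newlines inside strings, so the payload stays on one line.
    line = json.dumps(message, separators=(",", ":"))
    return line.encode("utf-8") + b"\n"


def initialize_request(request_id):
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05",  # assumed; the server may negotiate another
            "capabilities": {},
            "clientInfo": {"name": "example-runner", "version": "0.0.0"},  # hypothetical
        },
    }


def tool_call_request(request_id, tool, arguments):
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# Sent by the client after it receives the initialize response.
INITIALIZED_NOTIFICATION = {"jsonrpc": "2.0", "method": "notifications/initialized"}

# Hypothetical argument shape: the real `repl` tool schema is not shown in this diff.
frames = [
    encode_message(initialize_request(1)),
    encode_message(INITIALIZED_NOTIFICATION),
    encode_message(tool_call_request(2, "repl", {"input": "1 + 1\n"})),
]
```

A client would write these frames to the child process's stdin and read one response line from its stdout per request. Because stdout carries the protocol in this transport, human-readable diagnostics belong on stderr or in a log.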