Migrate integration tests to exact MCP responses#70

Merged: t-kalinowski merged 13 commits into main from golden-mcp-integration-responses on May 18, 2026

t-kalinowski commented May 18, 2026

Summary

This PR tightens the public integration runner so R integration cases assert exact MCP tools/call response shapes instead of searching loosely through concatenated text. It adds normalized golden expectations for timeout/busy replies, output bundle paths, capped output previews, reset behavior, interrupt handling, and pager commands.

The sandbox-focused tests are also made stricter and more representative: sandbox availability now fails fast where coverage depends on it, reticulate/keras JAX setup prewarms host caches before the sandboxed run, and Python discovery sandbox coverage now observes the failed write from the sandboxed Python side before verifying no marker file was created.

Public-facing changes

None intended. This changes test coverage for the public repl and repl_reset MCP tool responses, but does not change runtime behavior.

Internal-only changes

  • Refactors tests/run_integration_tests.py around repl() / repl_reset() helpers, exact tool_result() expectations, and response normalization.
  • Adds unit coverage for the integration runner helpers and transient busy interrupt polling.
  • Tightens sandbox test setup and skip behavior across Rust integration tests.
  • Tightens Python discovery sandbox coverage with a fake .venv/bin/python that attempts to write a marker in $HOME, asserts the sandboxed Python exception appears in the public reply, and verifies the marker file was not created.
  • Stabilizes timeout, output-bundle, pager, and idle protocol tests by replacing timing sleeps with workspace-visible gates and deterministic delayed protocol errors.
  • Updates the Codex TUI permissions test to use /permissions and probe outside-workspace writes under $HOME.
  • Keeps the direct /dev/stdout fallback test skippable when opening the fd path fails.

Diff composition

Measured against origin/main, this PR is 803 insertions and 437 deletions across 11 files.

  • runtime src/: +2/-0 (0.2% of churn)
  • tests in tests/: +801/-437 (99.8% of churn)
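As a quick check of the percentages above, the shares can be recomputed from the line counts, taking churn to mean insertions plus deletions. This is only a sketch; the `churn_shares` helper is hypothetical and not part of the PR.

```python
# Line counts copied from the "Diff composition" figures above.
CHURN = {
    "runtime src/": (2, 0),
    "tests in tests/": (801, 437),
}


def churn_shares(groups):
    """Return each group's share of total churn (insertions + deletions), in percent."""
    total = sum(ins + dels for ins, dels in groups.values())
    return {
        name: round(100 * (ins + dels) / total, 1)
        for name, (ins, dels) in groups.items()
    }


shares = churn_shares(CHURN)
print(shares)
```

The two groups add up to the stated 803 insertions and 437 deletions, for 1240 changed lines in total.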

Largest files:

  • tests/run_integration_tests.py: +393/-291
  • tests/sandbox.rs: +102/-89
  • tests/write_stdin_behavior.rs: +115/-23
  • tests/test_run_integration_tests.py: +129/-0
  • tests/python_backend.rs: +42/-10

Testing

  • cargo +nightly fmt
  • cargo check
  • cargo build
  • python3 tests/run_integration_tests.py --binary target/debug/mcp-repl
  • cargo clippy --all-targets --all-features -- -D warnings
  • cargo test --quiet

Restructure the real-binary Python integration harness so each test calls the explicit MCP client methods and compares parsed tool responses against expected dictionaries built with small helpers. This keeps assertions on the public MCP result shape instead of regexing or loosely parsing response text.
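A minimal sketch of the exact-comparison style described above, assuming a tools/call result made of text content items plus an `isError` flag. The helper names (`text_result`, `normalize`), the `<BUNDLE>` placeholder, and the bundle path format are hypothetical. The runner's real `tool_result()` helper and its normalization rules are not shown in this PR description.

```python
import re


def text_result(*texts, is_error=False):
    """Build an expected tools/call result from a list of text items."""
    return {
        "content": [{"type": "text", "text": t} for t in texts],
        "isError": is_error,
    }


# Hypothetical volatile detail: a per-run output bundle directory.
_BUNDLE_DIR = re.compile(r"/\S*?/output-\d+")


def normalize(result):
    """Return a copy of the result with volatile paths replaced by a placeholder."""
    content = []
    for item in result["content"]:
        item = dict(item)  # copy so the caller's response is not mutated
        if item.get("type") == "text":
            item["text"] = _BUNDLE_DIR.sub("<BUNDLE>", item["text"])
        content.append(item)
    return {**result, "content": content}


# Made-up response text, used only to show the comparison.
actual = {
    "content": [
        {"type": "text", "text": "full output: /tmp/run-abc/output-0001/transcript.txt\n"}
    ],
    "isError": False,
}
expected = text_result("full output: <BUNDLE>/transcript.txt\n")
assert normalize(actual) == expected
```

Comparing whole dictionaries means an extra content item, a changed `isError` flag, or reordered text fails the case, which a substring search over concatenated text would miss.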

Validation:
- python3 -m unittest tests/test_run_integration_tests.py
- cargo check
- cargo build
- python3 tests/run_integration_tests.py --binary target/debug/mcp-repl
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test --quiet
- cargo +nightly fmt
- git diff --check

The timeout-dependent Python integration cases no longer assert exact output captured before a fixed one-second deadline. The interrupt case now polls busy responses until INTERRUPT_READY is observed, and the gated bundle case now checks stable busy/no-bundle facts plus final transcript contents.
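The readiness polling could look roughly like the sketch below. It assumes each poll returns a dictionary with `busy` and `text` keys, which is a simplification invented for this example and not the runner's actual response shape. The helper name `poll_until_marker` is also hypothetical.

```python
import time


def poll_until_marker(poll, marker, timeout=20.0, interval=0.05):
    """Poll a busy worker until `marker` appears in the accumulated output.

    Raises AssertionError if the worker finishes before the marker shows up,
    and TimeoutError if the marker never appears within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    seen = ""
    while True:
        reply = poll()
        seen += reply["text"]
        if marker in seen:
            return seen
        if not reply["busy"]:
            raise AssertionError(
                f"worker finished before {marker!r} appeared: {seen!r}"
            )
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not observed within {timeout}s")
        time.sleep(interval)


# Scripted replies stand in for a real worker.
replies = iter(
    [
        {"busy": True, "text": "> setup line\n"},
        {"busy": True, "text": "INTERRUPT_READY\n"},
    ]
)
output = poll_until_marker(lambda: next(replies), "INTERRUPT_READY", interval=0)
```

Only after the marker is observed does it make sense to send the interrupt and assert on the reply, since the amount of output captured before that point depends on runner load.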

Validation:
- python3 -m unittest tests/test_run_integration_tests.py
- python3 -m py_compile tests/run_integration_tests.py tests/test_run_integration_tests.py
- git diff --check
- python3 tests/run_integration_tests.py --binary target/debug/mcp-repl --case r-interrupt-restart-prefixes --case r-output-bundle-files
- cargo check
- cargo build
- python3 tests/run_integration_tests.py --binary target/debug/mcp-repl
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test --quiet
- cargo +nightly fmt

Review finding addressed:
- [P2] Wait for interrupt readiness before exact checks — /Users/tomasz/github/posit-dev/mcp-repl/tests/run_integration_tests.py:576-577
  When this runs on a slower or loaded runner, the 1s timeout can expire before R has echoed all setup lines or printed `INTERRUPT_READY`; the previous helper polled up to 20s for that marker. The new exact assertion then fails even though the worker is correctly still busy, so keep the readiness polling or assert the exact response only after the marker is observed.
  Response: added busy-response polling for `INTERRUPT_READY` before sending the interrupt, with helper coverage for polling and early-finish failure.

Review finding addressed:
- [P2] Avoid pinning output split across timeout — /Users/tomasz/github/posit-dev/mcp-repl/tests/run_integration_tests.py:665-666
  When the gated bundle case runs on a slow runner, this 1s timeout can fire before or after a different amount of the pre-gate output has been captured. The exact assertion here, and the hard-coded poll preview below, then fail although the behavior is correct; assert stable facts like busy/no bundle and final transcript contents instead of requiring a fixed split across the timeout boundary.
  Response: replaced the timeout-bound exact setup and poll-preview assertions with success, busy/no-bundle, settled, and full-transcript content checks.

Handle transient busy responses after a Ctrl-C prefixed tail in the
r-interrupt-restart-prefixes public API case by polling until the expected
interrupt output settles. Add a Python regression test for the runner path that
returns busy for the Ctrl-C tail before producing interrupt received and
AFTER_INTERRUPT on the next empty poll.

Review finding:
- [P2] Poll after Ctrl-C busy replies -- /Users/tomasz/github/posit-dev/mcp-repl/tests/run_integration_tests.py:600-608
  When the Ctrl-C follow-up returns a transient busy response, this exact assertion fails immediately instead of draining until the worker settles. This scenario is already handled in the Windows-specific interrupt test by polling after `\u0003...`, and the Python public API suite runs on Windows CI, so `r-interrupt-restart-prefixes` can fail even though the worker later produces `interrupt received` and `AFTER_INTERRUPT`.

Addressed by polling after busy Ctrl-C-tail replies before asserting the settled public API output.
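The drain-until-settled pattern can be sketched as follows, again assuming a simplified reply dictionary with `busy` and `text` keys. `call_until_settled` and `make_fake_repl` are hypothetical names, and the scripted replies only imitate the scenario in the review finding.

```python
def call_until_settled(repl, first_input, max_polls=50):
    """Send `first_input`, then poll with empty input while replies are busy."""
    reply = repl(first_input)
    texts = [reply["text"]]
    polls = 0
    while reply["busy"]:
        if polls == max_polls:
            raise TimeoutError("worker stayed busy after the interrupt")
        reply = repl("")  # an empty input polls for more output
        texts.append(reply["text"])
        polls += 1
    return "".join(texts)


def make_fake_repl():
    """Return a scripted repl: busy for the Ctrl-C tail, settled on the next poll."""
    scripted = iter(
        [
            {"busy": True, "text": ""},
            {"busy": False, "text": "interrupt received\nAFTER_INTERRUPT\n"},
        ]
    )
    calls = []

    def repl(text):
        calls.append(text)
        return next(scripted)

    return repl, calls


repl, calls = make_fake_repl()
settled = call_until_settled(repl, "\u0003cat('AFTER_INTERRUPT\\n')")
```

The exact assertion then runs against `settled`, not against whichever intermediate busy reply happened to arrive first.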

Validation:
- python3 -m unittest tests.test_run_integration_tests.RunIntegrationTestsCaseTests.test_r_interrupt_restart_prefixes_polls_after_transient_busy_interrupt
- python3 -m unittest tests.test_run_integration_tests
- cargo check
- cargo build
- python3 tests/run_integration_tests.py --binary target/debug/mcp-repl
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test --quiet
- cargo +nightly fmt
- git diff --check

Format long R snippets and busy-response fixtures with textwrap.dedent() so expected test inputs stay indented with the surrounding Python code.
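For readers unfamiliar with the pattern, here is a small example of `textwrap.dedent()` applied to an indented R snippet. The R code itself is made up for illustration.

```python
import textwrap


def r_snippet():
    # The R code is indented to match the surrounding Python. dedent() strips
    # the common leading whitespace, and the backslash after the opening
    # quotes avoids a leading blank line.
    return textwrap.dedent(
        """\
        cat("start\\n")
        for (i in 1:3) {
          cat(i, "\\n")
        }
        """
    )


print(r_snippet())
```

Only the whitespace shared by every line is removed, so the relative indentation inside the R loop body is preserved.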

Validation:
- python3 tests/test_run_integration_tests.py
- cargo check
- cargo build
- python3 tests/run_integration_tests.py --binary target/debug/mcp-repl
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test --quiet
- cargo +nightly fmt

Stop hiding macOS sandbox availability behind CODEX_SANDBOX and probe sandbox-exec with a policy that can actually execute the test command. Convert the sandbox availability skips to assertions, remove the Codex full-access TUI env gate, and make Python/Codex outside-workspace probes target home-directory paths instead of temp paths that the sandbox intentionally permits.

Allow macOS sandboxed workers to write inherited stdout/stderr through /dev/stdout, /dev/stderr, and /dev/fd/[12], which surfaced once the default sandboxed test paths were enabled.

Validation:
- cargo +nightly fmt
- cargo check
- cargo build
- python3 tests/run_integration_tests.py --binary target/debug/mcp-repl
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test --quiet
- MCP_REPL_CODEX_BACKEND=mock cargo test -j 1 --test codex_integration codex_exec_auto_backend_smoke -- --test-threads=1

Run the reticulate/keras JAX probe once on the host before entering the
network-disabled sandbox. If the host probe cannot materialize the backend
because Rscript, reticulate, keras3, or the backend cache/download setup is
unavailable, skip the sandbox test before the sandboxed assertion.

Reuse the same probe inside the sandbox so the test checks the cached backend
path without depending on network access inside the worker sandbox.

Also remove the macOS Seatbelt /dev/stdout, /dev/stderr, and /dev/fd alias
allowance from this PR. There is no real package or user workflow here that
justifies reopening inherited stdio by path, so keep that denied and make the
artificial direct-fd surface test skip when the sandbox rejects it.

Review finding:
[P2] Preserve skip for missing reticulate cache — /Users/tomasz/github/posit-dev/mcp-repl/tests/sandbox.rs:897-899
When `reticulate` and `keras3` are installed but the JAX backend is not already present in the reticulate/uv cache, this test now fails because the sandbox disables network and `use_backend("jax")` reports cache/download errors; those outcomes were previously skipped. Please keep detecting the offline-cache-missing case so `cargo test` does not depend on a prepopulated user cache.

Response:
The test now pre-populates the reticulate/keras JAX cache with a host-side
Rscript probe and skips before sandbox execution when that warm-up cannot
complete. The sandboxed test then validates the cached path with network access
disabled.
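The control flow of that test can be summarized in a few lines. The real test is written in Rust in tests/sandbox.rs; the Python below is only a sketch of the ordering, with hypothetical names and an in-memory set standing in for the reticulate/uv cache.

```python
import unittest


def run_cached_backend_case(host_probe, sandboxed_probe):
    """Warm the cache on the host, skip if that fails, then require the sandboxed run."""
    if not host_probe():
        # Missing Rscript, packages, or network: nothing meaningful to assert.
        raise unittest.SkipTest("host probe could not materialize the backend")
    assert sandboxed_probe(), "sandboxed probe failed despite a warm cache"


# In-memory stand-ins: the host probe may "download"; the sandboxed one may not.
cache = set()


def host_probe():
    cache.add("jax")
    return True


def sandboxed_probe():
    return "jax" in cache  # no network here, so only the cache can satisfy it


run_cached_backend_case(host_probe, sandboxed_probe)
```

A failure inside the sandbox now points at sandbox behavior, since a cold cache is ruled out before the sandboxed assertion runs.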

Validation:
- cargo test --test sandbox sandbox_reticulate_keras_backend -- --nocapture
- cargo test --test repl_surface direct_stdout_fd_write_falls_back_to_raw_stream -- --nocapture
- cargo +nightly fmt
- cargo check
- cargo build
- python3 tests/run_integration_tests.py --binary target/debug/mcp-repl
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test --quiet

Change python_discovery_keeps_venv_probe_inside_sandbox to skip when
MCP_REPL_PYTHON_EXECUTABLE is set, matching the adjacent Python discovery
tests. The explicit executable override bypasses discovery, so the sandbox
probe coverage is not meaningful in that environment.

Review finding:
- [P2] Skip discovery test under explicit Python override — /Users/tomasz/github/posit-dev/mcp-repl/tests/python_backend.rs:238-241
  When `MCP_REPL_PYTHON_EXECUTABLE` is set, this assertion makes `cargo test --test python_backend python_discovery_keeps_venv_probe_inside_sandbox` fail before exercising sandbox discovery. That variable is a supported override and the adjacent Python discovery tests still skip because it bypasses discovery, so this test should skip or clear the env for the spawned server instead of panicking.

Response:
- Replaced the assertion with an explicit skip when the supported Python executable override is present.

Validation:
- MCP_REPL_PYTHON_EXECUTABLE=/bin/false cargo test --test python_backend python_discovery_keeps_venv_probe_inside_sandbox -- --exact --nocapture
- cargo test --test python_backend python_discovery_keeps_venv_probe_inside_sandbox -- --exact --nocapture
- cargo +nightly fmt
- cargo check
- cargo build
- python3 tests/run_integration_tests.py --binary target/debug/mcp-repl
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test --quiet

Replace the passive outside-workspace marker probe with a fake venv Python that attempts the write from inside the sandbox and reports the caught exception through the public reply. Force discovery to use that fake candidate by setting PATH to a temporary empty directory.

The test still verifies that the marker file was not created, but now also proves the sandboxed Python side observed the write denial.
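The two-sided check can be sketched as below. This is not the fake `.venv/bin/python` from the PR. The function names and the `WRITE_FAILED` / `WRITE_OK` strings are hypothetical, and a missing parent directory stands in for the sandbox denial so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def attempt_marker_write(marker: Path) -> str:
    """Try the write and report the outcome as text, the way a probe would print it."""
    try:
        marker.write_text("escaped\n")
    except OSError as exc:
        return f"WRITE_FAILED {type(exc).__name__}"
    return "WRITE_OK"


def check_write_denied(marker: Path) -> str:
    """Assert both sides: the probe saw the failure and no file was left behind."""
    reply = attempt_marker_write(marker)
    assert reply.startswith("WRITE_FAILED"), reply
    assert not marker.exists()
    return reply


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a denied location: the parent directory does not exist.
    denied_reply = check_write_denied(Path(tmp) / "missing" / "marker")
    allowed_reply = attempt_marker_write(Path(tmp) / "marker")
```

Checking the reported exception as well as the missing file distinguishes a write that was attempted and denied from a probe that never ran.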

Validation:
- cargo +nightly fmt
- cargo check
- cargo build
- python3 tests/run_integration_tests.py --binary target/debug/mcp-repl
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test --quiet

Replace sleep-based synchronization in the timeout spill and pager follow-up tests with host gate files and R-side marker files. This keeps the same public API coverage while making the macOS timing boundary deterministic under CI load.
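A minimal sketch of gate-file synchronization, with a Python thread standing in for the R worker. The file names and helpers are hypothetical. The PR's tests use R-side marker files and host-created gate files, but their exact layout is not shown here.

```python
import tempfile
import threading
import time
from pathlib import Path


def wait_for_file(path, timeout=10.0, interval=0.01):
    """Block until `path` exists, failing loudly rather than hanging forever."""
    deadline = time.monotonic() + timeout
    while not path.exists():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{path} did not appear within {timeout}s")
        time.sleep(interval)


def worker(ready, gate, log):
    log.append("before gate")
    ready.touch()        # marker: the worker reached the interesting point
    wait_for_file(gate)  # park here until the host opens the gate
    log.append("after gate")


def run_gated():
    log = []
    with tempfile.TemporaryDirectory() as tmp:
        ready, gate = Path(tmp) / "ready", Path(tmp) / "gate"
        thread = threading.Thread(target=worker, args=(ready, gate, log))
        thread.start()
        wait_for_file(ready)
        snapshot = list(log)  # the worker cannot pass the gate yet
        gate.touch()          # host opens the gate
        thread.join(timeout=10)
    return snapshot, log


snapshot, final_log = run_gated()
```

The host observes the worker at a known point regardless of machine speed, where a fixed sleep would only make that ordering likely.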

Validation:
- cargo check
- cargo build
- python3 tests/run_integration_tests.py --binary target/debug/mcp-repl
- cargo clippy --all-targets --all-features -- -D warnings
- cargo test --quiet
- cargo +nightly fmt

t-kalinowski marked this pull request as ready for review May 18, 2026 23:06
t-kalinowski merged commit 2dbc6ac into main May 18, 2026
5 checks passed