Skip to content

Commit 2dbc6ac

Browse files
authored
Migrate integration tests to exact MCP responses (#70)
* Refactor Python integration tests around golden tool results Restructure the real-binary Python integration harness so each test calls the explicit MCP client methods and compares parsed tool responses against expected dictionaries built with small helpers. This keeps assertions on the public MCP result shape instead of regexing or loosely parsing response text. Validation: - python3 -m unittest tests/test_run_integration_tests.py - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet - cargo +nightly fmt - git diff --check * Stabilize timeout-dependent integration checks The timeout-dependent Python integration cases no longer assert exact output captured before a fixed one-second deadline. The interrupt case now polls busy responses until INTERRUPT_READY is observed, and the gated bundle case now checks stable busy/no-bundle facts plus final transcript contents. Validation: - python3 -m unittest tests/test_run_integration_tests.py - python3 -m py_compile tests/run_integration_tests.py tests/test_run_integration_tests.py - git diff --check - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl --case r-interrupt-restart-prefixes --case r-output-bundle-files - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet - cargo +nightly fmt Review finding addressed: - [P2] Wait for interrupt readiness before exact checks — /Users/tomasz/github/posit-dev/mcp-repl/tests/run_integration_tests.py:576-577 When this runs on a slower or loaded runner, the 1s timeout can expire before R has echoed all setup lines or printed `INTERRUPT_READY`; the previous helper polled up to 20s for that marker. The new exact assertion then fails even though the worker is correctly still busy, so keep the readiness polling or assert the exact response only after the marker is observed. Response: added busy-response polling for `INTERRUPT_READY` before sending the interrupt, with helper coverage for polling and early-finish failure. Review finding addressed: - [P2] Avoid pinning output split across timeout — /Users/tomasz/github/posit-dev/mcp-repl/tests/run_integration_tests.py:665-666 When the gated bundle case runs on a slow runner, this 1s timeout can fire before or after a different amount of the pre-gate output has been captured. The exact assertion here, and the hard-coded poll preview below, then fail although the behavior is correct; assert stable facts like busy/no bundle and final transcript contents instead of requiring a fixed split across the timeout boundary. Response: replaced the timeout-bound exact setup and poll-preview assertions with success, busy/no-bundle, settled, and full-transcript content checks. * Poll interrupt-prefix busy replies in public API runner Handle transient busy responses after a Ctrl-C prefixed tail in the r-interrupt-restart-prefixes public API case by polling until the expected interrupt output settles. Add a Python regression test for the runner path that returns busy for the Ctrl-C tail before producing interrupt received and AFTER_INTERRUPT on the next empty poll. Review finding: - [P2] Poll after Ctrl-C busy replies -- /Users/tomasz/github/posit-dev/mcp-repl/tests/run_integration_tests.py:600-608 When the Ctrl-C follow-up returns a transient busy response, this exact assertion fails immediately instead of draining until the worker settles. This scenario is already handled in the Windows-specific interrupt test by polling after `\u0003...`, and the Python public API suite runs on Windows CI, so `r-interrupt-restart-prefixes` can fail even though the worker later produces `interrupt received` and `AFTER_INTERRUPT`. Addressed by polling after busy Ctrl-C-tail replies before asserting the settled public API output. Validation: - python3 -m unittest tests.test_run_integration_tests.RunIntegrationTestsCaseTests.test_r_interrupt_restart_prefixes_polls_after_transient_busy_interrupt - python3 -m unittest tests.test_run_integration_tests - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet - cargo +nightly fmt - git diff --check * Increase timeout duration in busy recovery test for improved stability * Use dedent for long integration test strings Format long R snippets and busy-response fixtures with textwrap.dedent() so expected test inputs stay indented with the surrounding Python code. Validation: - python3 tests/test_run_integration_tests.py - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet - cargo +nightly fmt * Enable sandbox coverage in integration tests Stop hiding macOS sandbox availability behind CODEX_SANDBOX and probe sandbox-exec with a policy that can actually execute the test command. Convert the sandbox availability skips to assertions, remove the Codex full-access TUI env gate, and make Python/Codex outside-workspace probes target home-directory paths instead of temp paths that the sandbox intentionally permits. Allow macOS sandboxed workers to write inherited stdout/stderr through /dev/stdout, /dev/stderr, and /dev/fd/[12], which surfaced once the default sandboxed test paths were enabled. Validation: - cargo +nightly fmt - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet - MCP_REPL_CODEX_BACKEND=mock cargo test -j 1 --test codex_integration codex_exec_auto_backend_smoke -- --test-threads=1 * Prewarm reticulate JAX sandbox cache Run the reticulate/keras JAX probe once on the host before entering the network-disabled sandbox. If the host probe cannot materialize the backend because Rscript, reticulate, keras3, or the backend cache/download setup is unavailable, skip the sandbox test before the sandboxed assertion. Reuse the same probe inside the sandbox so the test checks the cached backend path without depending on network access inside the worker sandbox. Also remove the macOS Seatbelt /dev/stdout, /dev/stderr, and /dev/fd alias allowance from this PR. There is no real package or user workflow here that justifies reopening inherited stdio by path, so keep that denied and make the artificial direct-fd surface test skip when the sandbox rejects it. Review finding: [P2] Preserve skip for missing reticulate cache — /Users/tomasz/github/posit-dev/mcp-repl/tests/sandbox.rs:897-899 When `reticulate` and `keras3` are installed but the JAX backend is not already present in the reticulate/uv cache, this test now fails because the sandbox disables network and `use_backend("jax")` reports cache/download errors; those outcomes were previously skipped. Please keep detecting the offline-cache-missing case so `cargo test` does not depend on a prepopulated user cache. Response: The test now pre-populates the reticulate/keras JAX cache with a host-side Rscript probe and skips before sandbox execution when that warm-up cannot complete. The sandboxed test then validates the cached path with network access disabled. Validation: - cargo test --test sandbox sandbox_reticulate_keras_backend -- --nocapture - cargo test --test repl_surface direct_stdout_fd_write_falls_back_to_raw_stream -- --nocapture - cargo +nightly fmt - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet * Skip Python discovery sandbox test under override Change python_discovery_keeps_venv_probe_inside_sandbox to skip when MCP_REPL_PYTHON_EXECUTABLE is set, matching the adjacent Python discovery tests. The explicit executable override bypasses discovery, so the sandbox probe coverage is not meaningful in that environment. Review finding: - [P2] Skip discovery test under explicit Python override — /Users/tomasz/github/posit-dev/mcp-repl/tests/python_backend.rs:238-241 When `MCP_REPL_PYTHON_EXECUTABLE` is set, this assertion makes `cargo test --test python_backend python_discovery_keeps_venv_probe_inside_sandbox` fail before exercising sandbox discovery. That variable is a supported override and the adjacent Python discovery tests still skip because it bypasses discovery, so this test should skip or clear the env for the spawned server instead of panicking. Response: - Replaced the assertion with an explicit skip when the supported Python executable override is present. Validation: - MCP_REPL_PYTHON_EXECUTABLE=/bin/false cargo test --test python_backend python_discovery_keeps_venv_probe_inside_sandbox -- --exact --nocapture - cargo test --test python_backend python_discovery_keeps_venv_probe_inside_sandbox -- --exact --nocapture - cargo +nightly fmt - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet * Assert Python discovery sandbox write failure Replace the passive outside-workspace marker probe with a fake venv Python that attempts the write from inside the sandbox and reports the caught exception through the public reply. Force discovery to use that fake candidate by clearing PATH to a temporary empty directory. The test still verifies that the marker file was not created, but now also proves the sandboxed Python side observed the write denial. Validation: cargo +nightly fmt; cargo check; cargo build; python3 tests/run_integration_tests.py --binary target/debug/mcp-repl; cargo clippy --all-targets --all-features -- -D warnings; cargo test --quiet * Stabilize timeout spill tests Replace sleep-based synchronization in the timeout spill and pager follow-up tests with host gate files and R-side marker files. This keeps the same public API coverage while making the macOS timing boundary deterministic under CI load. Validation: - cargo check - cargo build - python3 tests/run_integration_tests.py --binary target/debug/mcp-repl - cargo clippy --all-targets --all-features -- -D warnings - cargo test --quiet - cargo +nightly fmt * Use workspace-visible timeout test gates * Stabilize idle protocol error test * Gate output bundle cleanup tests
1 parent d4911f0 commit 2dbc6ac

11 files changed

Lines changed: 815 additions & 449 deletions

src/sandbox/seatbelt_base_policy.sbpl

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,8 @@
2222
(require-all
2323
(path "/dev/null")
2424
(vnode-type CHARACTER-DEVICE)))
25+
; Do not allow /dev/stdout, /dev/stderr, or /dev/fd aliases without a real
26+
; package or user workflow that requires reopening inherited stdio by path.
2527
; Allow uv's SystemConfiguration probe to touch /dev/dtracehelper on macOS.
2628
(allow file-write-data
2729
(require-all

tests/codex_integration.rs

Lines changed: 13 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -40,9 +40,6 @@ mod unix_impl {
4040
const MOCK_TOOL_INPUT: &str = "cat(\"CODEX_MOCK_MCP_OK\\n\")\n";
4141
const LIVE_FINAL_MARKER: &str = "CODEX_LIVE_DONE";
4242
const INSTALL_SCRIPTED_TOOL_CALL_MARKER: &str = "INSTALL_SCRIPTED_TOOL_CALL";
43-
#[cfg(any(target_os = "macos", target_os = "linux"))]
44-
const FULL_ACCESS_TEST_ENV: &str = "MCP_REPL_ENABLE_FULL_ACCESS_TUI_TEST";
45-
4643
fn codex_exec_test_mutex() -> &'static tokio::sync::Mutex<()> {
4744
static TEST_MUTEX: OnceLock<tokio::sync::Mutex<()>> = OnceLock::new();
4845
TEST_MUTEX.get_or_init(|| tokio::sync::Mutex::new(()))
@@ -426,17 +423,10 @@ mod unix_impl {
426423
#[cfg(any(target_os = "macos", target_os = "linux"))]
427424
pub(super) async fn run_codex_tui_full_access_sandbox_update() -> TestResult<()> {
428425
const TEST_NAME: &str = "codex_tui_full_access_sandbox_update";
429-
if !full_access_test_enabled() {
430-
print_skip_banner(TEST_NAME, &format!("{FULL_ACCESS_TEST_ENV} is not set"));
431-
return Ok(());
432-
}
433426
if !codex_client_ready(TEST_NAME) {
434427
return Ok(());
435428
}
436-
if !common::sandbox_exec_available() {
437-
print_skip_banner(TEST_NAME, "sandbox-exec unavailable");
438-
return Ok(());
439-
}
429+
assert!(common::sandbox_exec_available(), "sandbox-exec unavailable");
440430
if !loopback_bind_available().await {
441431
print_skip_banner(TEST_NAME, "loopback TCP bind unavailable");
442432
return Ok(());
@@ -459,34 +449,33 @@ mod unix_impl {
459449
driver.drain(Duration::from_millis(800));
460450
driver.ensure_running("after startup")?;
461451
driver.wait_for_warmup(Duration::from_secs(10))?;
462-
wait_for_log_contains(&env.debug_dir, "workspace-write", Duration::from_secs(10))?;
463452

464453
driver.send_line(&format!(
465454
"{FULL_ACCESS_MARKER}: probe write before full access"
466455
))?;
467456
driver.wait_for_contains("WRITE_ERROR:", Duration::from_secs(20))?;
468457
driver.wait_for_contains("Tool call 1 completed", Duration::from_secs(20))?;
458+
wait_for_log_contains(&env.debug_dir, "workspace-write", Duration::from_secs(10))?;
469459

470-
driver.send_line("/approvals")?;
460+
driver.send_line("/permissions")?;
471461
driver.wait_for_contains("Update Model Permissions", Duration::from_secs(15))?;
472462
driver.send("3")?;
473463
driver.wait_for_contains(
474464
"Permissions updated to Full Access",
475465
Duration::from_secs(15),
476466
)?;
477467

468+
driver.send_line(&format!(
469+
"{FULL_ACCESS_MARKER}: probe write after full access"
470+
))?;
471+
driver.wait_for_contains("WRITE_OK", Duration::from_secs(20))?;
472+
driver.wait_for_contains("Tool call 2 completed", Duration::from_secs(20))?;
478473
wait_for_log_contains(
479474
&env.debug_dir,
480475
"danger-full-access",
481476
Duration::from_secs(20),
482477
)?;
483478
wait_for_log_contains(&env.debug_dir, "tool-call-meta", Duration::from_secs(20))?;
484-
485-
driver.send_line(&format!(
486-
"{FULL_ACCESS_MARKER}: probe write after full access"
487-
))?;
488-
driver.wait_for_contains("WRITE_OK", Duration::from_secs(20))?;
489-
driver.wait_for_contains("Tool call 2 completed", Duration::from_secs(20))?;
490479
driver.kill()?;
491480

492481
let outputs = mock_server.function_call_outputs().await;
@@ -538,11 +527,6 @@ mod unix_impl {
538527
Ok(())
539528
}
540529

541-
#[cfg(any(target_os = "macos", target_os = "linux"))]
542-
fn full_access_test_enabled() -> bool {
543-
std::env::var_os(FULL_ACCESS_TEST_ENV).is_some()
544-
}
545-
546530
async fn loopback_bind_available() -> bool {
547531
TcpListener::bind("127.0.0.1:0").await.is_ok()
548532
}
@@ -1513,7 +1497,11 @@ trust_level = "trusted"
15131497
let nanos = std::time::SystemTime::now()
15141498
.duration_since(std::time::UNIX_EPOCH)?
15151499
.as_nanos();
1516-
let target = std::env::temp_dir().join(format!("mcp-repl-codex-probe-{nanos}.txt"));
1500+
let home = std::env::var_os("HOME")
1501+
.or_else(|| std::env::var_os("USERPROFILE"))
1502+
.map(PathBuf::from)
1503+
.ok_or_else(|| "missing HOME/USERPROFILE for outside-workspace probe".to_string())?;
1504+
let target = home.join(format!(".mcp-repl-codex-probe-{nanos}.txt"));
15171505
let target_literal = serde_json::to_string(&target.to_string_lossy().to_string())
15181506
.map_err(|err| format!("failed to encode target path: {err}"))?;
15191507
Ok(r#"target <- __TARGET__

tests/common/mod.rs

Lines changed: 6 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -212,11 +212,13 @@ impl Drop for SuiteServerLockToken {
212212
pub fn sandbox_exec_available() -> bool {
213213
static AVAILABLE: OnceLock<bool> = OnceLock::new();
214214
*AVAILABLE.get_or_init(|| {
215-
if std::env::var_os("CODEX_SANDBOX").is_some() {
216-
return false;
217-
}
218215
std::process::Command::new("/usr/bin/sandbox-exec")
219-
.args(["-p", "(version 1)", "--", "/usr/bin/true"])
216+
.args([
217+
"-p",
218+
"(version 1)(allow process-exec)(allow file-read*)",
219+
"--",
220+
"/usr/bin/true",
221+
])
220222
.stdin(std::process::Stdio::null())
221223
.stdout(std::process::Stdio::null())
222224
.stderr(std::process::Stdio::null())

tests/debug_repl_prompt.rs

Lines changed: 6 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -12,11 +12,13 @@ type TestResult<T> = Result<T, Box<dyn std::error::Error + Send + Sync>>;
1212
#[cfg(target_os = "macos")]
1313
fn sandbox_exec_available() -> bool {
1414
// Mirror tests/common/mod.rs: sandbox-exec may exist but be unusable (status 71).
15-
if std::env::var_os("CODEX_SANDBOX").is_some() {
16-
return false;
17-
}
1815
Command::new("/usr/bin/sandbox-exec")
19-
.args(["-p", "(version 1)", "--", "/usr/bin/true"])
16+
.args([
17+
"-p",
18+
"(version 1)(allow process-exec)(allow file-read*)",
19+
"--",
20+
"/usr/bin/true",
21+
])
2022
.stdin(Stdio::null())
2123
.stdout(Stdio::null())
2224
.stderr(Stdio::null())

tests/python_backend.rs

Lines changed: 42 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -234,19 +234,26 @@ fn real_python_executable() -> TestResult<String> {
234234
async fn python_discovery_keeps_venv_probe_inside_sandbox() -> TestResult<()> {
235235
use std::os::unix::fs::PermissionsExt;
236236

237-
if !common::sandbox_exec_available() {
238-
eprintln!("sandbox not available; skipping");
239-
return Ok(());
240-
}
241237
if std::env::var_os("MCP_REPL_PYTHON_EXECUTABLE").is_some() {
242-
eprintln!("explicit Python executable set; skipping discovery test");
238+
eprintln!("explicit Python executable set; skipping discovery sandbox coverage test");
243239
return Ok(());
244240
}
241+
assert!(common::sandbox_exec_available(), "sandbox unavailable");
245242

246243
let _guard = lock_test_mutex();
244+
let real_python = real_python_executable()?;
247245
let workspace = tempdir()?;
248-
let outside = tempdir()?;
249-
let marker = outside.path().join("python-discovery-marker");
246+
let empty_bin = workspace.path().join("empty-bin");
247+
fs::create_dir_all(&empty_bin)?;
248+
let home = std::env::var_os("HOME")
249+
.or_else(|| std::env::var_os("USERPROFILE"))
250+
.map(PathBuf::from)
251+
.ok_or("missing HOME/USERPROFILE for Python discovery sandbox marker")?;
252+
let marker = home.join(format!(
253+
".mcp-repl-python-discovery-marker-{}",
254+
std::process::id()
255+
));
256+
let _ = fs::remove_file(&marker);
250257
let marker_text = marker
251258
.to_str()
252259
.ok_or("marker path must be valid utf-8")?
@@ -256,7 +263,22 @@ async fn python_discovery_keeps_venv_probe_inside_sandbox() -> TestResult<()> {
256263
let shim = venv_bin.join("python");
257264
fs::write(
258265
&shim,
259-
"#!/bin/sh\nprintf probe > \"$MCP_REPL_TEST_PYTHON_PROBE_MARKER\"\nexit 1\n",
266+
concat!(
267+
"#!/bin/sh\n",
268+
"exec \"$MCP_REPL_REAL_PYTHON\" - <<'PY'\n",
269+
"import os\n",
270+
"import sys\n",
271+
"from pathlib import Path\n",
272+
"\n",
273+
"try:\n",
274+
" Path(os.environ['MCP_REPL_TEST_PYTHON_PROBE_MARKER']).write_text('probe')\n",
275+
"except Exception as err:\n",
276+
" print(f'MCP_REPL_TEST_PYTHON_PROBE_WRITE_ERROR:{type(err).__name__}', file=sys.stderr)\n",
277+
"else:\n",
278+
" print('MCP_REPL_TEST_PYTHON_PROBE_WRITE_OK', file=sys.stderr)\n",
279+
"raise SystemExit(1)\n",
280+
"PY\n",
281+
),
260282
)?;
261283
let mut permissions = fs::metadata(&shim)?.permissions();
262284
permissions.set_mode(0o755);
@@ -269,16 +291,26 @@ async fn python_discovery_keeps_venv_probe_inside_sandbox() -> TestResult<()> {
269291
"--sandbox".to_string(),
270292
"read-only".to_string(),
271293
],
272-
vec![("MCP_REPL_TEST_PYTHON_PROBE_MARKER".to_string(), marker_text)],
294+
vec![
295+
("PATH".to_string(), empty_bin.display().to_string()),
296+
("MCP_REPL_REAL_PYTHON".to_string(), real_python),
297+
("MCP_REPL_TEST_PYTHON_PROBE_MARKER".to_string(), marker_text),
298+
],
273299
Some(workspace.path().to_path_buf()),
274300
)
275301
.await?;
276302
let result = session.write_stdin_raw_with("1+1", Some(5.0)).await?;
277303
let text = result_text(&result);
278304
session.cancel().await?;
305+
let marker_exists = marker.exists();
306+
let _ = fs::remove_file(&marker);
279307

280308
assert!(
281-
!marker.exists(),
309+
text.contains("MCP_REPL_TEST_PYTHON_PROBE_WRITE_ERROR:"),
310+
"expected Python discovery probe write failure in reply, got: {text:?}"
311+
);
312+
assert!(
313+
!marker_exists,
282314
"Python discovery probe wrote outside the sandbox; reply was: {text:?}"
283315
);
284316
Ok(())

tests/repl_surface.rs

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -336,10 +336,14 @@ async fn direct_stdout_fd_write_falls_back_to_raw_stream() -> TestResult<()> {
336336
"if (!file.exists('/dev/stdout')) { ",
337337
"cat('SKIP_DIRECT_FD\\n') ",
338338
"} else { ",
339-
"con <- suppressWarnings(file('/dev/stdout', open = 'wb')); ",
339+
"con <- tryCatch(suppressWarnings(file('/dev/stdout', open = 'wb')), error = function(e) NULL); ",
340+
"if (is.null(con)) { ",
341+
"cat('SKIP_DIRECT_FD\\n') ",
342+
"} else { ",
340343
"writeBin(charToRaw('direct-fd\\n'), con); ",
341344
"flush(con); close(con); ",
342345
"cat('r-owned\\n') ",
346+
"} ",
343347
"}",
344348
);
345349
let result = session.write_stdin_raw_with(input, Some(30.0)).await?;

0 commit comments

Comments
 (0)