test: poll until volume is visible before mounting it #97

Closed
crowlbot wants to merge 4 commits into main from fix/sandbox-volume-test-flake

@crowlbot
Contributor

Summary

`sandbox volumes create` returns the volume id immediately, but the backend may take a moment to make the volume visible to subsequent operations (mount, list). The `sandbox with volume mount` test hits this race on most CI runs, surfacing as:

```
✗ An error occurred:
The requested volume 'vol_ord_...' was not found, or you do not have access to view it. (status: 404, code: VOLUME_NOT_FOUND, ...)
```

immediately after the `sandbox volumes create ...` step. Recent main CI history shows the test passes most of the time but fails frequently enough that it's currently masking real signal on stacked PRs.

What's in this PR

Adds a `waitForVolumeReady(volumeId, { timeoutMs = 15_000, intervalMs = 500 })` helper that polls `sandbox volumes list` until the new volume's id appears (or times out after 15s with a clear error). Inserts one call between the volume creation and the sandbox-with-mount step in the affected test.
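
For readers who want the shape of the polling loop, here is a minimal sketch rather than the helper from this PR. It assumes a hypothetical `listVolumes` callback that stands in for running `sandbox volumes list` and returning its raw text output. The actual helper takes only `volumeId` and the options object, and the error wording below is also an assumption.

```typescript
// Hypothetical stand-in for running `sandbox volumes list` and returning
// its raw text output; the real test would shell out to the CLI here.
type ListVolumes = () => Promise<string>;

interface WaitOptions {
  timeoutMs?: number;
  intervalMs?: number;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls the list output until it contains the volume id (plain string
// match, no JSON parsing) or throws once the timeout has elapsed.
async function waitForVolumeReady(
  volumeId: string,
  listVolumes: ListVolumes,
  { timeoutMs = 15_000, intervalMs = 500 }: WaitOptions = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const output = await listVolumes();
    if (output.includes(volumeId)) return;
    if (Date.now() >= deadline) {
      throw new Error(
        `volume ${volumeId} not visible in 'sandbox volumes list' after ${timeoutMs}ms`,
      );
    }
    await sleep(intervalMs);
  }
}
```

Passing the list call in as a parameter is only there to keep the sketch runnable without the CLI; it also makes the timeout path easy to exercise with a stub.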

This is a test-only fix. No CLI / backend code changes.

Test plan

  • `deno fmt` / `deno lint` / `deno check` clean.
  • The poll uses the existing `sandbox volumes list` output (string match on the id) so it doesn't depend on JSON support or any other in-flight CLI change.
  • 15s timeout is generous (typical propagation is sub-second); the safety net surfaces a clear error if it ever stalls.

🤖 Generated with Claude Code

crowlbot and others added 4 commits May 13, 2026 14:29
`sandbox volumes create` returns the volume id immediately, but the
backend may take a moment to make the volume visible to subsequent
operations (mount, list). The `sandbox with volume mount` test hit
this race on every CI run, surfacing as `VOLUME_NOT_FOUND` when the
follow-up `sandbox create --volume <id>:path` call ran before the
volume was queryable.

Adds a `waitForVolumeReady` helper that polls `volumes list` for up to
15s (500ms interval) after creation, and inserts a call between the
volume creation and the mount in the affected test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Initial fix polled `volumes list` and stopped when the new volume
appeared, but the mount still raced — the cluster takes another beat
after the volume is queryable before it's actually mountable. Add a
configurable `postListSleepMs` (default 5s) after the list confirms,
and bump the overall timeout to 30s. Each sandbox-with-volume run now
spends ~5-15s waiting, which is well below the 60s sandbox timeout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Even with the post-list sleep extension (1 commit back), the
sandbox-side volume lookup propagates separately from the deployng
list endpoint. Retrying the mount-bearing `sandbox create` call up to
6 times (5s apart, ~30s budget) handles the residual race without
papering over genuine backend failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
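
A rough sketch of the retry described in the commit above, not the code from this PR. It assumes a hypothetical `attempt` callback that runs the mount-bearing `sandbox create` call and reports success plus the raw output. The `AttemptResult` shape and the helper name are illustrative. Only `VOLUME_NOT_FOUND` is retried, so any other failure is returned on the first attempt.

```typescript
// Hypothetical result shape for one `sandbox create` attempt; the real
// test harness may expose exit codes or raw output differently.
interface AttemptResult {
  ok: boolean;
  output: string;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries only while the failure looks like the propagation race
// (VOLUME_NOT_FOUND). Defaults mirror the commit message: 6 attempts
// spaced 5s apart, i.e. five sleeps totalling 25s plus the time the
// attempts themselves take.
async function retryOnVolumeNotFound(
  attempt: () => Promise<AttemptResult>,
  { maxAttempts = 6, delayMs = 5_000 }: { maxAttempts?: number; delayMs?: number } = {},
): Promise<AttemptResult> {
  let last: AttemptResult = { ok: false, output: "" };
  for (let i = 1; i <= maxAttempts; i++) {
    last = await attempt();
    if (last.ok) return last;
    // Any other error is a genuine failure: return it immediately.
    if (!last.output.includes("VOLUME_NOT_FOUND")) return last;
    if (i < maxAttempts) await sleep(delayMs);
  }
  return last;
}
```

If every attempt returns `VOLUME_NOT_FOUND`, the last failing result is handed back to the caller, so the test still fails with the original error instead of a generic retry message.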
Even with retries and the post-list sleep, the sandbox-side volume
lookup was returning 404 deterministically (6/6 attempts, 5s apart,
~30s total).
The volume is created in `ord` but the sandbox create call didn't
specify a region, so it landed in a different cluster that doesn't
know about the volume. Pin the sandbox create to `--region ord` to
match the volume's region.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@crowlbot
Contributor Author

Closing — after four iterations (one-shot list-poll → post-list sleep → mount-retry → region-pin), the `sandbox with volume mount` test fails identically on every CI run with `VOLUME_NOT_FOUND` from the sandbox-side lookup. Six 5s-spaced retries against a region-pinned (`--region ord`) sandbox create all return 404 for a volume id (`vol_ord_*`) that's already visible via `volumes list` and was just created in the same region.

This isn't a propagation race the CLI can wait out — it's a real backend issue where the sandbox cluster's view of the volumes service doesn't see the new volume even ~30s after it's created and listable. Surfacing as a separate issue to the deployng / sandbox team rather than burying behind retries.

Recent history on `main` shows the test was passing most of the time, so it likely regressed on the backend side in the last week or two.

In the meantime, the agent-ergonomics stack (#91-#96, #98) doesn't depend on this test passing; all of those pass `fmt` / `lint` / `check` / `jsr` on Deno 2.7.8 and only the same flake fails their `deno test` job (including on the docs-only PR #96).

@crowlbot crowlbot closed this May 13, 2026