[CI Fix] Skip test_batch_diffusion on Azure and Nebius (no L4 GPUs) by lloyd-brown · Pull Request #9657 · skypilot-org/skypilot

lloyd-brown · 2026-05-19T18:34:19Z

Summary

Test: test_batch_diffusion on azure (and nebius)
Classification: Resource catalog mismatch
Confidence: HIGH

Root Cause

examples/batch/diffusion/pool.yaml hardcodes L4:1 accelerators. Neither Azure nor Nebius catalogs offer L4 GPUs (Azure offers T4/A10/A100/H100; Nebius offers L40S). The pool apply step fails with sky.exceptions.ResourcesUnavailableError: Catalog does not contain any instances satisfying the request: 1x Azure({'L4': 1}). The same test already carries @pytest.mark.no_kubernetes for the identical reason.

Fix

Add @pytest.mark.no_azure and @pytest.mark.no_nebius markers to test_batch_diffusion so the test is skipped on clouds whose catalogs do not contain L4 GPUs. This mirrors the existing @no_kubernetes decorator on the same test.

Buildkite Failure

https://buildkite.com/skypilot-1/full-smoke-tests-run/builds/116#019e3d93-14b7-47fc-bba5-291608d6809d

Note: This fix was auto-generated. Confidence level: HIGH.

Generated with Claude Code

Root cause: examples/batch/diffusion/pool.yaml hardcodes L4:1 GPUs, but neither Azure nor Nebius catalogs offer L4 accelerators (Azure has T4/A10/ A100/H100; Nebius offers L40S). The test already carries @no_kubernetes for the same reason; this PR adds @no_azure and @no_nebius to match. Buildkite failure: https://buildkite.com/skypilot-1/full-smoke-tests-run/builds/116#019e3d93-14b7-47fc-bba5-291608d6809d Classification: UNKNOWN (resource catalog mismatch) Confidence: HIGH Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

lloyd-brown · 2026-05-19T18:34:24Z

FILE_PATTERN=test_batch.py /smoke-test -k test_batch_diffusion --azure

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

gemini-code-assist

Code Review

This pull request updates the smoke tests in tests/smoke_tests/test_batch.py by adding decorators to skip test_batch_diffusion on Azure and Nebius. These exclusions are necessary because the test configuration hardcodes L4 GPUs, which are not available in the Azure or Nebius catalogs. I have no feedback to provide as there are no review comments.

…iffusion Instead of skipping test_batch_diffusion on Azure, Nebius, and Kubernetes (which don't have L4 GPUs), use the same per-cloud accelerator substitution pattern already established in test_min_gpt and test_vllm_pool. This rewrites pool.yaml with the appropriate GPU for each cloud (T4 for Azure, L40S for Nebius, runtime detection for K8s/Slurm, L4 default for others). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

lloyd-brown · 2026-05-19T21:52:38Z

FILE_PATTERN=test_batch.py /smoke-test -k test_batch_diffusion --azure --nebius

devin-ai-integration

Devin Review found 1 new potential issue.

View 4 additional findings in Devin Review.

devin-ai-integration · 2026-05-19T21:54:45Z

+        accelerator_str = smoke_tests_utils.get_available_gpus(
+            infra=generic_cloud)
+        if not accelerator_str:
+            pytest.fail(f'No GPUs available for {generic_cloud}.')


🟡 pytest.fail used instead of pytest.skip when GPUs are unavailable, inconsistent with codebase pattern

When no GPUs are available for kubernetes/slurm, the test calls pytest.fail() (line 152) which marks the test as FAILED. The established pattern throughout the test suite (e.g., tests/smoke_tests/test_cluster_job.py:63-64) is to use pytest.skip() in this scenario, since the test's prerequisites aren't met rather than the test itself being broken. This will cause CI false-failures on k8s/slurm environments without GPUs.

Was this helpful? React with 👍 or 👎 to provide feedback.

_storage_cmds() was falling through to bare AWS S3 commands for Nebius, missing the required --profile=nebius flag that all other Nebius storage tests use. Also fix SKY_BATCH_STORE env var to use 'nebius' prefix instead of 's3' for Nebius across all batch tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

lloyd-brown · 2026-05-19T22:15:53Z

FILE_PATTERN=test_batch.py /smoke-test -k test_batch_diffusion --azure --nebius

lloyd-brown · 2026-05-19T22:19:41Z

FILE_PATTERN=test_batch.py /smoke-test -k test_batch_diffusion --azure

lloyd-brown · 2026-05-19T22:19:42Z

FILE_PATTERN=test_batch.py /smoke-test -k test_batch_diffusion --nebius

devin-ai-integration

Devin Review found 2 new potential issues.

View 8 additional findings in Devin Review.

devin-ai-integration · 2026-05-19T22:22:24Z

 # option to not allow shared env tests.


+_CLOUD_TO_STORE = {'gcp': 'gs', 'nebius': 'nebius'}


🔴 _CLOUD_TO_STORE maps nebius to 'nebius' but batch IO classes only accept s3:// and gs:// prefixes

Setting 'nebius': 'nebius' in _CLOUD_TO_STORE causes the test to set SKY_BATCH_STORE='nebius'. The Python example scripts (e.g., examples/batch/simple/double_text.py:81) then construct paths like nebius://bucket/test.jsonl. However, JsonReader.__post_init__ (sky/batch/io_formats.py:223-227), JsonWriter.__post_init__ (sky/batch/io_formats.py:281-285), and ImageWriter.__post_init__ (sky/batch/io_formats.py:327-331) all explicitly reject any prefix other than s3:// or gs://, raising ValueError. Additionally, parse_cloud_path (sky/batch/utils.py:179-187) also only handles s3:// and gs://. This means every batch test (test_batch_simple, test_batch_diffusion, test_batch_cancel, test_batch_custom_formats, test_batch_ha_kill_running) will fail immediately on nebius with a ValueError when the example script creates a JsonReader or JsonWriter with a nebius:// path.

Prompt for agents

The _CLOUD_TO_STORE mapping sets nebius store to 'nebius', which causes example scripts to construct nebius://bucket/... paths. But the batch IO format classes (JsonReader, JsonWriter, ImageWriter in sky/batch/io_formats.py) and utility functions (parse_cloud_path in sky/batch/utils.py) only support s3:// and gs:// prefixes. There are two possible approaches: 1. Remove 'nebius' from _CLOUD_TO_STORE so it falls through to the default 's3', since Nebius uses S3-compatible storage. This requires verifying that the batch framework's internal S3 operations (using aws.client('s3') via sky/adaptors/aws) work correctly with Nebius credentials on Nebius worker machines. 2. Update the batch IO format classes and utility functions to support nebius:// paths, translating them to s3:// internally (similar to how sky/cloud_stores.py:589-590 does source.replace('nebius://', 's3://')). Option 1 is simpler if Nebius workers already have proper S3-compatible credentials configured.

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-05-19T22:22:25Z

                'gsutil rm', 'gsutil ls', lambda p: f'gsutil rm -r {p}')
+    if generic_cloud == 'nebius':
+        # Nebius uses S3-compatible storage but requires --profile=nebius
+        from sky.adaptors import nebius


🟡 In-function import of nebius module without explanatory comment violates AGENTS.md rule

AGENTS.md states: "Always place imports at the top of the file, not inside function definitions. [...] Only as a last resort, place the import inside the function with a comment explaining why." The from sky.adaptors import nebius import at line 46 is inside the _storage_cmds function with no comment. The module sky/adaptors/nebius.py only imports lightweight SkyPilot internals (no heavy third-party SDK at module level), so there's no practical reason this can't be at the top of the file — indeed, tests/smoke_tests/test_mount_and_storage.py:47 already imports it at the top level.

Prompt for agents

Move the import 'from sky.adaptors import nebius' from inside _storage_cmds() (line 46) to the top-level imports section of the file (around line 18, with the other sky imports). The nebius adaptor module only imports lightweight SkyPilot internals at module level, so top-level import is safe. See tests/smoke_tests/test_mount_and_storage.py:47 for the existing pattern.

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration Bot reviewed May 19, 2026

View reviewed changes

gemini-code-assist Bot reviewed May 19, 2026

View reviewed changes

devin-ai-integration Bot reviewed May 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI Fix] Skip test_batch_diffusion on Azure and Nebius (no L4 GPUs)#9657

[CI Fix] Skip test_batch_diffusion on Azure and Nebius (no L4 GPUs)#9657
lloyd-brown wants to merge 3 commits into
masterfrom
lloyd/fix-test-batch-diffusion-skip-l4-unavailable-clouds

lloyd-brown commented May 19, 2026 •

edited by devin-ai-integration Bot

Loading

Uh oh!

lloyd-brown commented May 19, 2026

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

lloyd-brown commented May 19, 2026

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot May 19, 2026 •

edited

Loading

Uh oh!

lloyd-brown commented May 19, 2026

Uh oh!

lloyd-brown commented May 19, 2026

Uh oh!

lloyd-brown commented May 19, 2026

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot May 19, 2026

Uh oh!

devin-ai-integration Bot May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		# option to not allow shared env tests.


		_CLOUD_TO_STORE = {'gcp': 'gs', 'nebius': 'nebius'}

Conversation

lloyd-brown commented May 19, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root Cause

Fix

Buildkite Failure

Uh oh!

lloyd-brown commented May 19, 2026

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

✅ Devin Review: No Issues Found

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

lloyd-brown commented May 19, 2026

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lloyd-brown commented May 19, 2026

Uh oh!

lloyd-brown commented May 19, 2026

Uh oh!

lloyd-brown commented May 19, 2026

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lloyd-brown commented May 19, 2026 •

edited by devin-ai-integration Bot

Loading

devin-ai-integration Bot May 19, 2026 •

edited

Loading