[CI Fix] Skip test_batch_diffusion on Azure and Nebius (no L4 GPUs)#9657
[CI Fix] Skip test_batch_diffusion on Azure and Nebius (no L4 GPUs)#9657lloyd-brown wants to merge 3 commits into
Conversation
Root cause: examples/batch/diffusion/pool.yaml hardcodes L4:1 GPUs, but neither Azure nor Nebius catalogs offer L4 accelerators (Azure has T4/A10/ A100/H100; Nebius offers L40S). The test already carries @no_kubernetes for the same reason; this PR adds @no_azure and @no_nebius to match. Buildkite failure: https://buildkite.com/skypilot-1/full-smoke-tests-run/builds/116#019e3d93-14b7-47fc-bba5-291608d6809d Classification: UNKNOWN (resource catalog mismatch) Confidence: HIGH Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
FILE_PATTERN=test_batch.py /smoke-test -k test_batch_diffusion --azure |
There was a problem hiding this comment.
Code Review
This pull request updates the smoke tests in tests/smoke_tests/test_batch.py by adding decorators to skip test_batch_diffusion on Azure and Nebius. These exclusions are necessary because the test configuration hardcodes L4 GPUs, which are not available in the Azure or Nebius catalogs. I have no feedback to provide as there are no review comments.
…iffusion Instead of skipping test_batch_diffusion on Azure, Nebius, and Kubernetes (which don't have L4 GPUs), use the same per-cloud accelerator substitution pattern already established in test_min_gpt and test_vllm_pool. This rewrites pool.yaml with the appropriate GPU for each cloud (T4 for Azure, L40S for Nebius, runtime detection for K8s/Slurm, L4 default for others). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
FILE_PATTERN=test_batch.py /smoke-test -k test_batch_diffusion --azure --nebius |
| accelerator_str = smoke_tests_utils.get_available_gpus( | ||
| infra=generic_cloud) | ||
| if not accelerator_str: | ||
| pytest.fail(f'No GPUs available for {generic_cloud}.') |
There was a problem hiding this comment.
🟡 pytest.fail used instead of pytest.skip when GPUs are unavailable, inconsistent with codebase pattern
When no GPUs are available for kubernetes/slurm, the test calls pytest.fail() (line 152) which marks the test as FAILED. The established pattern throughout the test suite (e.g., tests/smoke_tests/test_cluster_job.py:63-64) is to use pytest.skip() in this scenario, since the test's prerequisites aren't met rather than the test itself being broken. This will cause CI false-failures on k8s/slurm environments without GPUs.
Was this helpful? React with 👍 or 👎 to provide feedback.
_storage_cmds() was falling through to bare AWS S3 commands for Nebius, missing the required --profile=nebius flag that all other Nebius storage tests use. Also fix SKY_BATCH_STORE env var to use 'nebius' prefix instead of 's3' for Nebius across all batch tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
FILE_PATTERN=test_batch.py /smoke-test -k test_batch_diffusion --azure --nebius |
|
FILE_PATTERN=test_batch.py /smoke-test -k test_batch_diffusion --azure |
|
FILE_PATTERN=test_batch.py /smoke-test -k test_batch_diffusion --nebius |
| # option to not allow shared env tests. | ||
|
|
||
|
|
||
| _CLOUD_TO_STORE = {'gcp': 'gs', 'nebius': 'nebius'} |
There was a problem hiding this comment.
🔴 _CLOUD_TO_STORE maps nebius to 'nebius' but batch IO classes only accept s3:// and gs:// prefixes
Setting 'nebius': 'nebius' in _CLOUD_TO_STORE causes the test to set SKY_BATCH_STORE='nebius'. The Python example scripts (e.g., examples/batch/simple/double_text.py:81) then construct paths like nebius://bucket/test.jsonl. However, JsonReader.__post_init__ (sky/batch/io_formats.py:223-227), JsonWriter.__post_init__ (sky/batch/io_formats.py:281-285), and ImageWriter.__post_init__ (sky/batch/io_formats.py:327-331) all explicitly reject any prefix other than s3:// or gs://, raising ValueError. Additionally, parse_cloud_path (sky/batch/utils.py:179-187) also only handles s3:// and gs://. This means every batch test (test_batch_simple, test_batch_diffusion, test_batch_cancel, test_batch_custom_formats, test_batch_ha_kill_running) will fail immediately on nebius with a ValueError when the example script creates a JsonReader or JsonWriter with a nebius:// path.
Prompt for agents
The _CLOUD_TO_STORE mapping sets nebius store to 'nebius', which causes example scripts to construct nebius://bucket/... paths. But the batch IO format classes (JsonReader, JsonWriter, ImageWriter in sky/batch/io_formats.py) and utility functions (parse_cloud_path in sky/batch/utils.py) only support s3:// and gs:// prefixes. There are two possible approaches:
1. Remove 'nebius' from _CLOUD_TO_STORE so it falls through to the default 's3', since Nebius uses S3-compatible storage. This requires verifying that the batch framework's internal S3 operations (using aws.client('s3') via sky/adaptors/aws) work correctly with Nebius credentials on Nebius worker machines.
2. Update the batch IO format classes and utility functions to support nebius:// paths, translating them to s3:// internally (similar to how sky/cloud_stores.py:589-590 does source.replace('nebius://', 's3://')).
Option 1 is simpler if Nebius workers already have proper S3-compatible credentials configured.
Was this helpful? React with 👍 or 👎 to provide feedback.
| 'gsutil rm', 'gsutil ls', lambda p: f'gsutil rm -r {p}') | ||
| if generic_cloud == 'nebius': | ||
| # Nebius uses S3-compatible storage but requires --profile=nebius | ||
| from sky.adaptors import nebius |
There was a problem hiding this comment.
🟡 In-function import of nebius module without explanatory comment violates AGENTS.md rule
AGENTS.md states: "Always place imports at the top of the file, not inside function definitions. [...] Only as a last resort, place the import inside the function with a comment explaining why." The from sky.adaptors import nebius import at line 46 is inside the _storage_cmds function with no comment. The module sky/adaptors/nebius.py only imports lightweight SkyPilot internals (no heavy third-party SDK at module level), so there's no practical reason this can't be at the top of the file — indeed, tests/smoke_tests/test_mount_and_storage.py:47 already imports it at the top level.
Prompt for agents
Move the import 'from sky.adaptors import nebius' from inside _storage_cmds() (line 46) to the top-level imports section of the file (around line 18, with the other sky imports). The nebius adaptor module only imports lightweight SkyPilot internals at module level, so top-level import is safe. See tests/smoke_tests/test_mount_and_storage.py:47 for the existing pattern.
Was this helpful? React with 👍 or 👎 to provide feedback.
Summary
test_batch_diffusiononazure(andnebius)Root Cause
examples/batch/diffusion/pool.yamlhardcodesL4:1accelerators. Neither Azure nor Nebius catalogs offer L4 GPUs (Azure offers T4/A10/A100/H100; Nebius offers L40S). The pool apply step fails withsky.exceptions.ResourcesUnavailableError: Catalog does not contain any instances satisfying the request: 1x Azure({'L4': 1}). The same test already carries@pytest.mark.no_kubernetesfor the identical reason.Fix
Add
@pytest.mark.no_azureand@pytest.mark.no_nebiusmarkers totest_batch_diffusionso the test is skipped on clouds whose catalogs do not contain L4 GPUs. This mirrors the existing@no_kubernetesdecorator on the same test.Buildkite Failure
https://buildkite.com/skypilot-1/full-smoke-tests-run/builds/116#019e3d93-14b7-47fc-bba5-291608d6809d
Generated with Claude Code