Skip to content

[CI Fix] Skip test_batch_diffusion on Azure and Nebius (no L4 GPUs)#9657

Open
lloyd-brown wants to merge 3 commits into
masterfrom
lloyd/fix-test-batch-diffusion-skip-l4-unavailable-clouds
Open

[CI Fix] Skip test_batch_diffusion on Azure and Nebius (no L4 GPUs)#9657
lloyd-brown wants to merge 3 commits into
masterfrom
lloyd/fix-test-batch-diffusion-skip-l4-unavailable-clouds

Conversation

@lloyd-brown
Copy link
Copy Markdown
Collaborator

@lloyd-brown lloyd-brown commented May 19, 2026

Summary

  • Test: test_batch_diffusion on azure (and nebius)
  • Classification: Resource catalog mismatch
  • Confidence: HIGH

Root Cause

examples/batch/diffusion/pool.yaml hardcodes L4:1 accelerators. Neither Azure nor Nebius catalogs offer L4 GPUs (Azure offers T4/A10/A100/H100; Nebius offers L40S). The pool apply step fails with sky.exceptions.ResourcesUnavailableError: Catalog does not contain any instances satisfying the request: 1x Azure({'L4': 1}). The same test already carries @pytest.mark.no_kubernetes for the identical reason.

Fix

Add @pytest.mark.no_azure and @pytest.mark.no_nebius markers to test_batch_diffusion so the test is skipped on clouds whose catalogs do not contain L4 GPUs. This mirrors the existing @no_kubernetes decorator on the same test.

Buildkite Failure

https://buildkite.com/skypilot-1/full-smoke-tests-run/builds/116#019e3d93-14b7-47fc-bba5-291608d6809d


Note: This fix was auto-generated. Confidence level: HIGH.

Generated with Claude Code


Open in Devin Review

Root cause: examples/batch/diffusion/pool.yaml hardcodes L4:1 GPUs, but
neither Azure nor Nebius catalogs offer L4 accelerators (Azure has T4/A10/
A100/H100; Nebius offers L40S). The test already carries @no_kubernetes
for the same reason; this PR adds @no_azure and @no_nebius to match.

Buildkite failure: https://buildkite.com/skypilot-1/full-smoke-tests-run/builds/116#019e3d93-14b7-47fc-bba5-291608d6809d
Classification: UNKNOWN (resource catalog mismatch)
Confidence: HIGH

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@lloyd-brown
Copy link
Copy Markdown
Collaborator Author

FILE_PATTERN=test_batch.py /smoke-test -k test_batch_diffusion --azure

Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Open in Devin Review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the smoke tests in tests/smoke_tests/test_batch.py by adding decorators to skip test_batch_diffusion on Azure and Nebius. These exclusions are necessary because the test configuration hardcodes L4 GPUs, which are not available in the Azure or Nebius catalogs. I have no feedback to provide as there are no review comments.

…iffusion

Instead of skipping test_batch_diffusion on Azure, Nebius, and Kubernetes
(which don't have L4 GPUs), use the same per-cloud accelerator substitution
pattern already established in test_min_gpt and test_vllm_pool. This rewrites
pool.yaml with the appropriate GPU for each cloud (T4 for Azure, L40S for
Nebius, runtime detection for K8s/Slurm, L4 default for others).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lloyd-brown
Copy link
Copy Markdown
Collaborator Author

FILE_PATTERN=test_batch.py /smoke-test -k test_batch_diffusion --azure --nebius

Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 4 additional findings in Devin Review.

Open in Devin Review

accelerator_str = smoke_tests_utils.get_available_gpus(
infra=generic_cloud)
if not accelerator_str:
pytest.fail(f'No GPUs available for {generic_cloud}.')
Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration Bot May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 pytest.fail used instead of pytest.skip when GPUs are unavailable, inconsistent with codebase pattern

When no GPUs are available for kubernetes/slurm, the test calls pytest.fail() (line 152) which marks the test as FAILED. The established pattern throughout the test suite (e.g., tests/smoke_tests/test_cluster_job.py:63-64) is to use pytest.skip() in this scenario, since the test's prerequisites aren't met rather than the test itself being broken. This will cause CI false-failures on k8s/slurm environments without GPUs.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

_storage_cmds() was falling through to bare AWS S3 commands for Nebius,
missing the required --profile=nebius flag that all other Nebius storage
tests use. Also fix SKY_BATCH_STORE env var to use 'nebius' prefix
instead of 's3' for Nebius across all batch tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lloyd-brown
Copy link
Copy Markdown
Collaborator Author

FILE_PATTERN=test_batch.py /smoke-test -k test_batch_diffusion --azure --nebius

@lloyd-brown
Copy link
Copy Markdown
Collaborator Author

FILE_PATTERN=test_batch.py /smoke-test -k test_batch_diffusion --azure

@lloyd-brown
Copy link
Copy Markdown
Collaborator Author

FILE_PATTERN=test_batch.py /smoke-test -k test_batch_diffusion --nebius

Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 new potential issues.

View 8 additional findings in Devin Review.

Open in Devin Review

# option to not allow shared env tests.


_CLOUD_TO_STORE = {'gcp': 'gs', 'nebius': 'nebius'}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 _CLOUD_TO_STORE maps nebius to 'nebius' but batch IO classes only accept s3:// and gs:// prefixes

Setting 'nebius': 'nebius' in _CLOUD_TO_STORE causes the test to set SKY_BATCH_STORE='nebius'. The Python example scripts (e.g., examples/batch/simple/double_text.py:81) then construct paths like nebius://bucket/test.jsonl. However, JsonReader.__post_init__ (sky/batch/io_formats.py:223-227), JsonWriter.__post_init__ (sky/batch/io_formats.py:281-285), and ImageWriter.__post_init__ (sky/batch/io_formats.py:327-331) all explicitly reject any prefix other than s3:// or gs://, raising ValueError. Additionally, parse_cloud_path (sky/batch/utils.py:179-187) also only handles s3:// and gs://. This means every batch test (test_batch_simple, test_batch_diffusion, test_batch_cancel, test_batch_custom_formats, test_batch_ha_kill_running) will fail immediately on nebius with a ValueError when the example script creates a JsonReader or JsonWriter with a nebius:// path.

Prompt for agents
The _CLOUD_TO_STORE mapping sets nebius store to 'nebius', which causes example scripts to construct nebius://bucket/... paths. But the batch IO format classes (JsonReader, JsonWriter, ImageWriter in sky/batch/io_formats.py) and utility functions (parse_cloud_path in sky/batch/utils.py) only support s3:// and gs:// prefixes. There are two possible approaches:

1. Remove 'nebius' from _CLOUD_TO_STORE so it falls through to the default 's3', since Nebius uses S3-compatible storage. This requires verifying that the batch framework's internal S3 operations (using aws.client('s3') via sky/adaptors/aws) work correctly with Nebius credentials on Nebius worker machines.

2. Update the batch IO format classes and utility functions to support nebius:// paths, translating them to s3:// internally (similar to how sky/cloud_stores.py:589-590 does source.replace('nebius://', 's3://')).

Option 1 is simpler if Nebius workers already have proper S3-compatible credentials configured.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

'gsutil rm', 'gsutil ls', lambda p: f'gsutil rm -r {p}')
if generic_cloud == 'nebius':
# Nebius uses S3-compatible storage but requires --profile=nebius
from sky.adaptors import nebius
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 In-function import of nebius module without explanatory comment violates AGENTS.md rule

AGENTS.md states: "Always place imports at the top of the file, not inside function definitions. [...] Only as a last resort, place the import inside the function with a comment explaining why." The from sky.adaptors import nebius import at line 46 is inside the _storage_cmds function with no comment. The module sky/adaptors/nebius.py only imports lightweight SkyPilot internals (no heavy third-party SDK at module level), so there's no practical reason this can't be at the top of the file — indeed, tests/smoke_tests/test_mount_and_storage.py:47 already imports it at the top level.

Prompt for agents
Move the import 'from sky.adaptors import nebius' from inside _storage_cmds() (line 46) to the top-level imports section of the file (around line 18, with the other sky imports). The nebius adaptor module only imports lightweight SkyPilot internals at module level, so top-level import is safe. See tests/smoke_tests/test_mount_and_storage.py:47 for the existing pattern.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants