ci: add nightly integration tests for transformers, accelerate, peft #1923
Open
Titus-von-Koeller wants to merge 16 commits into main from
Conversation
Adds a new nightly workflow that runs each downstream library's own bnb-specific test suite against the latest main bnb wheel (from the `continuous-release_main` pre-release). Catches breakage in HF downstream integrations before it reaches users.

Architecture:
- Three parallel test jobs (transformers on T4, accelerate and peft on A10, to match each project's own CI)
- Each produces JUnit XML uploaded as an artifact
- A consolidated report job downloads all XMLs, generates a markdown summary, writes it to `$GITHUB_STEP_SUMMARY`, and uploads artifacts for inspection
- Slack posting is stubbed out until `SLACK_API_TOKEN` is provisioned

Triggers:
- `workflow_dispatch` (manual, available after merge)
- `pull_request` (runs automatically when the workflow or report script changes)
- `schedule` (commented out; enable in a follow-up PR once stable)

The report script (`scripts/integration_test_report.py`) parses JUnit XML, produces a markdown summary, and can post Slack Block Kit messages with threaded per-suite failure details (diffusers-style consolidated report). Full design rationale and implementation plan in `agents/integration_tests_guide.md`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
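The report script itself isn't shown on this page. As a hedged sketch, the JUnit-XML-to-markdown step it describes could look roughly like this (the function name and table layout are illustrative, not the actual `scripts/integration_test_report.py`):

```python
# Minimal sketch of JUnit XML -> markdown summarization, as described for
# the consolidated report job. Names and the table layout are illustrative,
# not the actual script's.
import xml.etree.ElementTree as ET

def summarize_junit(xml_text: str, suite_name: str) -> str:
    """Return one markdown table row: suite | passed | failed | skipped."""
    root = ET.fromstring(xml_text)
    # pytest emits <testsuites><testsuite .../></testsuites> or a bare <testsuite>
    suite = root if root.tag == "testsuite" else root.find("testsuite")
    tests = int(suite.get("tests", 0))
    failures = int(suite.get("failures", 0)) + int(suite.get("errors", 0))
    skipped = int(suite.get("skipped", 0))
    passed = tests - failures - skipped
    return f"| {suite_name} | {passed} | {failures} | {skipped} |"

if __name__ == "__main__":
    sample = '<testsuite tests="5" failures="1" errors="0" skipped="2"/>'
    print("| suite | passed | failed | skipped |")
    print("|---|---|---|---|")
    print(summarize_junit(sample, "transformers"))
```

A job like this would run the sketch once per downloaded XML artifact and concatenate the rows before writing them to the step summary.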
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
…steps

First run showed tests silently "passed" in 1-2 min — pytest wasn't installed (`--force-reinstall --no-deps` stripped bnb's test extras) and `| tee` masked the resulting exit code.

Align with the existing test-runner.yml pattern: install via the `[test]` extras so pytest and other test deps come along. Also add `shell: bash -o pipefail` to the pytest steps so pipe failures are surfaced.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
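The `| tee` masking described in this commit is easy to reproduce outside CI. This standalone snippet (assuming `bash` is available on the machine) shows how `-o pipefail` changes the exit code a workflow step would observe:

```python
# Demonstrates why `cmd | tee log` hides cmd's failure unless pipefail is
# set. Standalone illustration, not code from the workflow itself.
import subprocess

def exit_code(*bash_args: str) -> int:
    """Run bash with the given arguments and return its exit status."""
    return subprocess.run(["bash", *bash_args]).returncode

# Without pipefail, the pipeline's status is tee's (0), masking the failure.
masked = exit_code("-c", "false | tee /dev/null")

# With pipefail (what `shell: bash -o pipefail` gives each workflow step),
# the failing left-hand command's status propagates.
surfaced = exit_code("-o", "pipefail", "-c", "false | tee /dev/null")

print(masked, surfaced)
```

This is exactly the combination that made the first run look green: pytest was missing, the step's pipeline exited via `tee`, and the job reported success.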
… peft

Key changes after digging into each downstream project's own CI:

Runner updates:
- transformers: T4 → A10G (bandb-aws-g5-4xlarge-plus). Current upstream transformers quantization CI runs on g5.4xlarge (A10G); our earlier T4 choice came from a stale Feb-2024 fork.
- peft (single GPU): A10 → L4 (bandb-aws-g6-4xlarge-plus). Matches peft's aws-g6-4xlarge-plus runner group exactly.

PEFT filter:
- Switched from `-m "single_gpu_tests and bitsandbytes"` (both test files) to Benjamin Bossan's recommendation: `-m single_gpu_tests -k PeftBnbGPUExampleTests tests/test_gpu_examples.py`. Narrower scope (20 vs 86 tests) focused on the end-to-end QLoRA-style integration signal, less noise from tests where bnb is incidental.

New multi-GPU peft job:
- Uses bandb-aws-g6-12xlarge-plus (4× L4, CUDA_VISIBLE_DEVICES=0,1) — mirroring the legacy peft nightly-bnb.yml deleted in peft#2858.
- Filter: `-m multi_gpu_tests -k PeftBnbGPUExampleTests`.
- Note: this runner is being provisioned by infra; the job will fail to pick up a runner until that's done.

Accelerate:
- Added `-rs` to surface skip reasons. The previous run showed 26 silent skips that produced a false "pass"; `-rs` will print the reason for each.

Report job's `needs:` updated to include test-peft-multigpu.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The legacy peft nightly-bnb.yml (deleted in peft#2858) ran the full transformers bnb suite on both its single- and multi-GPU jobs, so the multi-GPU-marked tests in transformers' test_4bit.py / test_mixed_int8.py actually executed on the 2-GPU runner. This commit restores that coverage.

New job: test-transformers-multigpu
- Runner: bandb-aws-g6-12xlarge-plus (same multi-GPU runner as peft)
- CUDA_VISIBLE_DEVICES=0,1 (2 of the 4 L4s)
- Filter: `-k "MultiGpu or multi_gpu"` — runs ONLY the multi-GPU-marked tests, avoiding duplication with the single-GPU transformers job.

Report job's `needs:` updated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
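To see which test names `-k` filters like these actually select, here is a toy model of pytest's `-k` substring-expression matching (deliberately simplified, not pytest's real implementation):

```python
# Toy model of pytest -k: each bare word is a case-insensitive substring
# test against the test name, combined with and/or/not. Simplified; pytest's
# real matcher also handles parametrize ids, quoting, etc.
def k_matches(test_name: str, expr: str) -> bool:
    name = test_name.lower()
    py = " ".join(
        tok if tok in ("and", "or", "not", "(", ")")
        else repr(tok.lower() in name)
        for tok in expr.split()
    )
    # eval is safe here: the string contains only True/False/and/or/not.
    return eval(py)

# Multi-GPU job filter selects multi-GPU-marked names...
assert k_matches("test_llama_MultiGpu_int8", "MultiGpu or multi_gpu")
# ...while the single-GPU exclusion filter keeps everything else.
assert k_matches("test_serialization", "not MultiGpu and not multi_device")
```

Under this model the two filters partition the suite, which is why the multi-GPU job avoids re-running what the single-GPU transformers job already covers.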
…uide

- peft install: switch to `pip install "peft[test]"`, which pulls in parameterized, datasets, scipy, etc. via peft's own test extras. The previous run failed collection with ModuleNotFoundError: parameterized.
- Slack: enable posting via SLACK_CIFEEDBACK_BOT_TOKEN (bot token provisioned by infra, same secret name as transformers / diffusers). Posts to #bnb-daily-ci-collab. Uses our existing diffusers-style consolidated report script with threaded per-suite failure details.
- Update agents/integration_tests_guide.md: add a current-state section documenting the 5-job workflow, Benjamin's filter rationale, the legacy peft bnb CI reference, the build-reuse strategy, and the Slack setup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
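For reference, a minimal sketch of the kind of Block Kit payload such a consolidated Slack report might build (helper name and field layout are hypothetical; the actual script's format may differ):

```python
# Sketch of a Slack Block Kit payload for a consolidated nightly report.
# The header text and line format are illustrative assumptions; the real
# script posts via chat.postMessage using SLACK_CIFEEDBACK_BOT_TOKEN and
# adds threaded per-suite failure details.
import json

def build_report_blocks(results: dict) -> list:
    """results maps suite name -> (passed, failed)."""
    lines = [
        f"{'❌' if failed else '✅'} *{suite}*: {passed} passed, {failed} failed"
        for suite, (passed, failed) in results.items()
    ]
    return [
        {"type": "header",
         "text": {"type": "plain_text", "text": "bnb nightly integration report"}},
        {"type": "section",
         "text": {"type": "mrkdwn", "text": "\n".join(lines)}},
    ]

blocks = build_report_blocks({"transformers": (120, 0), "peft": (19, 1)})
print(json.dumps(blocks, indent=2))
```

A bot would send this as the `blocks` field of a `chat.postMessage` call and then reply in the resulting thread with per-suite failure detail.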
The bandb-aws-g6-12xlarge-plus runner doesn't exist yet. Queued jobs with no runner hang indefinitely, blocking the report job (which needs all jobs to complete). Set `if: false` on both multi-GPU jobs so the single-GPU tests and Slack reporting can run while infra provisions the multi-GPU runner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Following the transformers quantization-CI pattern:
- HF_HOME=/mnt/cache: if the runner has persistent storage at this path, model downloads are cached across runs. If not, HF hub falls back to the default ~/.cache/huggingface.
- CUDA_VISIBLE_DEVICES=0,1 at workflow level: on single-GPU runners, device 1 simply doesn't exist, so only GPU 0 is used. On multi-GPU runners (g6.12xlarge), both are visible. This eliminates per-step env overrides.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
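The single-setting trick relies on CUDA's documented handling of CUDA_VISIBLE_DEVICES: devices are taken in order, and the first invalid index ends the visible list. A pure-Python model of that behavior (the helper is hypothetical, not CUDA itself):

```python
# Models CUDA's documented CUDA_VISIBLE_DEVICES handling: indices are
# consumed left to right, and the first index that doesn't exist on the
# machine ends the visible list. Hypothetical helper for illustration.
def visible_gpus(env_value: str, physical_gpu_count: int) -> list:
    visible = []
    for token in env_value.split(","):
        idx = int(token)
        if idx >= physical_gpu_count:
            break  # invalid index: it and everything after it are ignored
        visible.append(idx)
    return visible

# Single-GPU runner: device 1 doesn't exist, so only GPU 0 is used.
print(visible_gpus("0,1", 1))
# g6.12xlarge multi-GPU runner: both requested devices are visible.
print(visible_gpus("0,1", 4))
```

This is why one workflow-level `CUDA_VISIBLE_DEVICES=0,1` works for both runner shapes without per-step overrides.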
Replace the accelerate job (which had 26/27 tests silently skipping) with diffusers, a much higher-value integration target for bnb.

Diffusers has two layers of bnb test coverage:
- tests/quantization/bnb/ — standalone 4-bit/8-bit quantization tests
- tests/models/transformers/ — 9 model classes (Flux, HunyuanVideo, Wan, etc.) that inherit BitsAndBytesTesterMixin, testing quantization on real diffusion model architectures

All are selected via `pytest -m bitsandbytes tests/`, which matches the @is_bitsandbytes decorator that applies pytest.mark.bitsandbytes.

Runner: L40S (bandb-aws-g6e-4xlarge-plus), matching diffusers' own CI runner (aws-g6e-xlarge-plus). L40S provides the 48GB VRAM needed for larger diffusion models like Flux and HunyuanVideo.

Also sets CUBLAS_WORKSPACE_CONFIG=:16:8 for determinism, matching the diffusers CI convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All tests failed with PermissionError at /mnt/cache. Unlike transformers' Docker-based CI, which mounts a persistent cache volume, our runners run bare-metal without /mnt/cache. Removing HF_HOME falls back to the default ~/.cache/huggingface, which is always writable. Models download fresh each nightly run — acceptable for a smoke test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Axolotl has no pytest marker for bnb, so we select the 3 relevant test targets explicitly:
- tests/e2e/kernels/test_quantize.py (4 tests — dequantize + QuantState)
- tests/e2e/kernels/test_lora.py (10 tests — LoRA autograd with QuantState)
- tests/e2e/kernels/test_lora_features.py::TestQuantizedModels (2 tests — NF4 QLoRA forward/backward)

Runner: A10G (bandb-aws-g5-4xlarge-plus). These are kernel-level tests that don't need large VRAM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use the diffusers/diffusers-pytorch-cuda container for the diffusers test job — it matches their own CI setup and has all system deps (libGL, etc.) pre-installed. Only override bnb with our continuous-release wheel.

The previous bare-metal approach failed with libGL.so.1 missing during pytest collection (opencv-python needs it, and bare-metal runners don't have it).

If the runner doesn't have Docker/nvidia-container-toolkit, the job will fail fast with a clear error and we'll fall back to installing opencv-python-headless instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Docker image provides system deps (libGL, etc.) but ships a dev version from main with no matching tag. Override with the PyPI release inside the container so we test against what real users have and the tag clone works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
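The pin-and-match strategy can be sketched as plain command construction (the helper name is hypothetical, and the `v` tag prefix and repo URL are assumptions about the target repo's conventions):

```python
# Builds the paired "install from PyPI + clone the matching tag" commands
# this strategy relies on. Purely illustrative; the helper name, the "v"
# tag prefix, and the URL are assumptions, not the workflow's actual code.
def pinned_commands(package: str, version: str, repo_url: str):
    install = f"pip install --force-reinstall --no-deps {package}=={version}"
    clone = f"git clone --depth 1 --branch v{version} {repo_url}"
    return install, clone

install_cmd, clone_cmd = pinned_commands(
    "diffusers", "0.37.1", "https://github.com/huggingface/diffusers.git"
)
print(install_cmd)
print(clone_cmd)
```

Keeping the installed wheel and the cloned test files at the same version is what makes the suite reflect what real users install.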
pip install diffusers inside the container was a no-op: the image has 0.38.0.dev0, which pip considers higher than the latest PyPI release (0.37.1), so it skips the install. Uninstall first to force the PyPI version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
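The ordering trap is PEP 440 behavior: `0.38.0.dev0` sorts below `0.38.0` but above any `0.37.x`, so pip sees the preinstalled dev build as already satisfying the requirement. A toy comparison key (not pip's actual resolver, which uses `packaging.version`) makes the ordering concrete:

```python
# Toy PEP 440-ish ordering for X.Y.Z and X.Y.Z.devN versions only.
# Not pip's real logic; just enough to show why an image with 0.38.0.dev0
# makes a plain `pip install diffusers` (latest: 0.37.1) a no-op.
def key(version: str):
    parts = version.split(".")
    if parts[-1].startswith("dev"):
        dev = int(parts[-1][3:])
        release = tuple(int(p) for p in parts[:-1])
        # dev releases sort just below the final release of the same number
        return release + (-1, dev)
    return tuple(int(p) for p in parts) + (0, 0)

print(key("0.38.0.dev0") < key("0.38.0"))  # dev precedes its own final
print(key("0.38.0.dev0") > key("0.37.1"))  # ...but beats older releases
```

Uninstalling first sidesteps the comparison entirely, which is why it forces the PyPI version.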
…ndant deps

The Docker image already has transformers/accelerate/peft. Only diffusers needs overriding (the image has a dev version higher than the latest PyPI release). `--force-reinstall --no-deps` does this in one line.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Transformers v5.x is released, and our integration tests were stuck on v4.57.6 due to the <5 cap in the `[test]` extras. Some test failures in the transformers bnb suite are already fixed in v5.x (see transformers#44604).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The continuous-release wheel is built from main, which still has transformers<5 in its metadata. pip install without -U sees 4.57.6 already satisfied and won't upgrade to v5.x. Adding -U forces the upgrade past the baked-in constraint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Adds a new nightly workflow that runs each downstream library's own bnb-specific test suite against the latest main bnb wheel (from the `continuous-release_main` pre-release). Catches breakage in HF downstream integrations before it reaches users.

Architecture

- `test-transformers` on T4 (matches transformers CI)
- `test-accelerate` on A10 (closest match to accelerate CI's L4)
- `test-peft` on A10 (closest match to peft CI's L4)
- Each job produces JUnit XML; a consolidated report job writes a markdown summary to `$GITHUB_STEP_SUMMARY` and uploads the full report + raw XMLs as artifacts for inspection
- Slack posting is stubbed out until `SLACK_API_TOKEN` is provisioned

Triggers
Scheduled cron is commented out. This PR should auto-run the workflow via the `pull_request` trigger so we can validate end-to-end before enabling nightly.

What gets tested

Uses the existing test suites from each downstream repo (not newly-written tests):
- transformers: `tests/quantization/bnb/` (4-bit + 8-bit load/inference/serialization/training)
- accelerate: `tests/test_quantization.py` (load_and_quantize_model, skip/keep modules, generation)
- peft: `tests/test_gpu_examples.py` + `tests/test_common_gpu.py`, filtered to `-m "single_gpu_tests and bitsandbytes"` (QLoRA training, all tuner types with bnb, merge/unmerge, DoRA, etc.)

Multi-GPU tests are excluded (`-k "not MultiGpu and not multi_device"`).

Downstream lib versions are pinned to the latest PyPI release, with the matching git tag cloned for the test files — so we test what real users install, not bleeding-edge `main`.

Follow-up PRs
- Enable Slack posting once the `SLACK_API_TOKEN` secret is provisioned (already a commented TODO in the workflow)

Design doc

Full research and design rationale in `agents/integration_tests_guide.md` — covers GPU selection, Slack reporting approaches evaluated across HF repos, iteration strategy, and known potential issues.

Test plan
- `pull_request` trigger fires on this PR
- `test-transformers` job passes on T4
- `test-accelerate` job passes on A10
- `test-peft` job passes on A10
- `report` job consolidates all three XMLs correctly
- Download the `consolidated-report` artifact and confirm markdown + Slack payload look right

🤖 Generated with Claude Code