ci: add nightly integration tests for transformers, accelerate, peft#1923

Open
Titus-von-Koeller wants to merge 16 commits into main from ci/nightly-integration-tests
Conversation

@Titus-von-Koeller
Collaborator

Summary

Adds a new nightly workflow that runs each downstream library's own bnb-specific test suite against the latest main bnb wheel (from the continuous-release_main pre-release). Catches breakage in HF downstream integrations before it reaches users.

Architecture

  • Three parallel test jobs, each running the downstream library's existing bnb test suite:
    • test-transformers on T4 (matches transformers CI)
    • test-accelerate on A10 (closest match to accelerate CI's L4)
    • test-peft on A10 (closest match to peft CI's L4)
  • Each job produces a JUnit XML uploaded as an artifact
  • Consolidated report job downloads all XMLs, generates a markdown summary, writes to $GITHUB_STEP_SUMMARY, and uploads the full report + raw XMLs as artifacts for inspection
  • Slack posting is stubbed out until SLACK_API_TOKEN is provisioned
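
A minimal sketch of what the consolidation step might do (the file layout, suite names, and markdown format here are assumptions for illustration; the real logic lives in scripts/integration_test_report.py):

```python
import glob
import xml.etree.ElementTree as ET

def summarize(xml_paths):
    """Collapse JUnit XML files into one markdown table (hypothetical format)."""
    rows = ["| suite | tests | failures | errors | skipped |",
            "|---|---|---|---|---|"]
    for path in xml_paths:
        root = ET.parse(path).getroot()
        # JUnit files may use <testsuites> as root or a bare <testsuite>
        suites = root.iter("testsuite") if root.tag == "testsuites" else [root]
        for s in suites:
            rows.append("| {} | {} | {} | {} | {} |".format(
                s.get("name", path),
                s.get("tests", "0"), s.get("failures", "0"),
                s.get("errors", "0"), s.get("skipped", "0")))
    return "\n".join(rows)

if __name__ == "__main__":
    # Hypothetical artifact layout: one XML per test job
    print(summarize(glob.glob("artifacts/**/*.xml", recursive=True)))
```

The same table text can be appended to $GITHUB_STEP_SUMMARY and reused as the body of a Slack message once posting is enabled.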

Triggers

on:
  workflow_dispatch:                # manual trigger (after merge)
  pull_request:                     # auto-runs when these files change
    paths:
      - '.github/workflows/tests-integration-nightly.yml'
      - 'scripts/integration_test_report.py'
  # schedule:                       # enable in follow-up PR once stable
  #   - cron: "30 3 * * *"

Scheduled cron is commented out. This PR should auto-run the workflow via the pull_request trigger so we can validate end-to-end before enabling nightly.

What gets tested

Uses the existing test suites from each downstream repo (not newly-written tests):

  • transformers: tests/quantization/bnb/ (4-bit + 8-bit load/inference/serialization/training)
  • accelerate: tests/test_quantization.py (load_and_quantize_model, skip/keep modules, generation)
  • peft: tests/test_gpu_examples.py + tests/test_common_gpu.py filtered to -m "single_gpu_tests and bitsandbytes" (QLoRA training, all tuner types with bnb, merge/unmerge, DoRA, etc.)

Multi-GPU tests are excluded (-k "not MultiGpu and not multi_device").

Downstream lib versions are pinned to the latest PyPI release, with matching git tag cloned for the test files — so we test what real users install, not bleeding-edge main.
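
The pinning logic amounts to "newest final release wins"; a hypothetical sketch (PyPI's JSON API layout and the v<version> tag convention are assumptions, not the workflow's actual code):

```python
def latest_stable(releases):
    """Pick the newest final release from a PyPI-style {version: files} mapping,
    skipping pre-/dev releases so we match what a plain `pip install` resolves."""
    finals = [v for v in releases if all(p.isdigit() for p in v.split("."))]
    # Compare numerically, not lexicographically (1.10.0 > 1.2.3)
    return max(finals, key=lambda v: tuple(int(p) for p in v.split(".")))

# e.g. from https://pypi.org/pypi/transformers/json -> data["releases"]
version = latest_stable({"4.57.6": [], "5.0.0": [], "5.1.0rc1": []})
# Test files then come from the matching tag, e.g.:
#   git clone --depth 1 --branch v<version> https://github.com/huggingface/transformers
```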

Follow-up PRs

  1. Enable Slack posting once SLACK_API_TOKEN secret is provisioned (already commented TODO in the workflow)
  2. Enable the nightly cron schedule once the workflow is stable

Design doc

Full research and design rationale in agents/integration_tests_guide.md — covers GPU selection, Slack reporting approaches evaluated across HF repos, iteration strategy, and known potential issues.

Test plan

  • Verify pull_request trigger fires on this PR
  • test-transformers job passes on T4
  • test-accelerate job passes on A10
  • test-peft job passes on A10
  • report job consolidates all three XMLs correctly
  • Download consolidated-report artifact and confirm markdown + Slack payload look right
  • Iterate on the Slack payload format against real data before provisioning the bot token

🤖 Generated with Claude Code

Adds a new nightly workflow that runs each downstream library's own
bnb-specific test suite against the latest main bnb wheel (from the
continuous-release_main pre-release). Catches breakage in HF downstream
integrations before it reaches users.

Architecture:
- Three parallel test jobs (transformers on T4, accelerate and peft on A10
  to match each project's own CI)
- Each produces JUnit XML uploaded as an artifact
- Consolidated report job downloads all XMLs, generates a markdown summary,
  writes to $GITHUB_STEP_SUMMARY, and uploads artifacts for inspection
- Slack posting is stubbed out until SLACK_API_TOKEN is provisioned

Triggers:
- workflow_dispatch (manual, available after merge)
- pull_request (runs automatically when workflow/report script changes)
- schedule (commented out; enable in follow-up PR once stable)

The report script (scripts/integration_test_report.py) parses JUnit XML,
produces a markdown summary, and can post Slack Block Kit messages with
threaded per-suite failure details (diffusers-style consolidated report).

Full design rationale and implementation plan in
agents/integration_tests_guide.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Titus-von-Koeller and others added 4 commits April 13, 2026 11:28
…steps

First run showed tests silently "passed" in 1-2 min — pytest wasn't installed
(--force-reinstall --no-deps stripped bnb's test extras) and | tee masked
the resulting exit code.

Align with the existing test-runner.yml pattern: install via the [test]
extras so pytest and other test deps come along. Also add `shell: bash -o
pipefail` to the pytest steps so pipe failures are surfaced.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
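
The exit-code masking is easy to reproduce; a small illustrative sketch (invoking bash from Python purely to show the status behavior, assuming bash is on PATH):

```python
import subprocess

# Without pipefail, the pipeline's status is tee's (0), hiding the failure
masked = subprocess.run(
    ["bash", "-c", "false | tee /dev/null"]).returncode
# With pipefail, the first failing command's status (1) is surfaced --
# the behavior `shell: bash -o pipefail` gives the pytest steps
surfaced = subprocess.run(
    ["bash", "-o", "pipefail", "-c", "false | tee /dev/null"]).returncode
```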
… peft

Key changes after digging into each downstream project's own CI:

Runner updates:
- transformers: T4 → A10G (bandb-aws-g5-4xlarge-plus). Current upstream
  transformers quantization CI runs on g5.4xlarge (A10G); our earlier T4
  choice came from a stale Feb-2024 fork.
- peft (single GPU): A10 → L4 (bandb-aws-g6-4xlarge-plus). Matches peft's
  aws-g6-4xlarge-plus runner group exactly.

PEFT filter:
- Switched from `-m "single_gpu_tests and bitsandbytes"` (both test files)
  to Benjamin Bossan's recommendation:
  `-m single_gpu_tests -k PeftBnbGPUExampleTests tests/test_gpu_examples.py`.
  Narrower scope (20 vs 86 tests) focused on the end-to-end QLoRA-style
  integration signal, less noise from tests where bnb is incidental.

New multi-GPU peft job:
- Uses bandb-aws-g6-12xlarge-plus (4× L4, CUDA_VISIBLE_DEVICES=0,1) —
  mirroring the legacy peft nightly-bnb.yml deleted in peft#2858.
- Filter: `-m multi_gpu_tests -k PeftBnbGPUExampleTests`.
- Note: this runner is being provisioned by infra; job will fail to pick
  up a runner until that's done.

Accelerate:
- Added `-rs` to surface skip reasons. Previous run showed 26 silent skips
  that produced a false "pass"; -rs will print the reason for each.

Report job's `needs:` updated to include test-peft-multigpu.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The legacy peft nightly-bnb.yml (deleted in peft#2858) ran the full
transformers bnb suite on both its single- and multi-GPU jobs, so the
multi-GPU-marked tests in transformers' test_4bit.py / test_mixed_int8.py
actually executed on the 2-GPU runner. This commit restores that coverage.

New job: test-transformers-multigpu
- Runner: bandb-aws-g6-12xlarge-plus (same multi-GPU runner as peft)
- CUDA_VISIBLE_DEVICES=0,1 (2 of the 4 L4s)
- Filter: `-k "MultiGpu or multi_gpu"` — runs ONLY the multi-GPU-marked
  tests, avoiding duplication with the single-GPU transformers job.

Report job's `needs:` updated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uide

- peft install: switch to `pip install "peft[test]"` which pulls in
  parameterized, datasets, scipy, etc. via peft's own test extras.
  Previous run failed collection with ModuleNotFoundError: parameterized.

- Slack: enable posting via SLACK_CIFEEDBACK_BOT_TOKEN (bot token provisioned
  by infra, same secret name as transformers / diffusers). Posts to
  #bnb-daily-ci-collab. Uses our existing diffusers-style consolidated
  report script with threaded per-suite failure details.

- Update agents/integration_tests_guide.md: add current-state section
  documenting the 5-job workflow, Benjamin's filter rationale, legacy
  peft bnb CI reference, build-reuse strategy, and Slack setup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Titus-von-Koeller and others added 11 commits April 14, 2026 10:16
The bandb-aws-g6-12xlarge-plus runner doesn't exist yet. Queued jobs
with no runner hang indefinitely, blocking the report job (which needs
all jobs to complete). Set `if: false` on both multi-GPU jobs so the
single-GPU tests and Slack reporting can run while infra provisions the
multi-GPU runner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Following the transformers quantization-CI pattern:

- HF_HOME=/mnt/cache: if the runner has persistent storage at this path,
  model downloads are cached across runs. If not, HF hub falls back to
  the default ~/.cache/huggingface.

- CUDA_VISIBLE_DEVICES=0,1 at workflow level: on single-GPU runners,
  device 1 simply doesn't exist so only GPU 0 is used. On multi-GPU
  runners (g6.12xlarge), both are visible. This eliminates per-step env
  overrides.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the accelerate job (which had 26/27 tests silently skipping) with
diffusers, a much higher-value integration target for bnb.

Diffusers has two layers of bnb test coverage:
- tests/quantization/bnb/ — standalone 4-bit/8-bit quantization tests
- tests/models/transformers/ — 9 model classes (Flux, HunyuanVideo, Wan,
  etc.) that inherit BitsAndBytesTesterMixin, testing quantization on
  real diffusion model architectures

All are selected via `pytest -m bitsandbytes tests/` which matches the
@is_bitsandbytes decorator that applies pytest.mark.bitsandbytes.

Runner: L40S (bandb-aws-g6e-4xlarge-plus) matching diffusers' own CI
runner (aws-g6e-xlarge-plus). L40S provides 48GB VRAM needed for larger
diffusion models like Flux and HunyuanVideo.

Also sets CUBLAS_WORKSPACE_CONFIG=:16:8 for determinism, matching the
diffusers CI convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All tests failed with PermissionError at /mnt/cache. Unlike transformers'
Docker-based CI which mounts a persistent cache volume, our runners run
bare-metal without /mnt/cache. Removing HF_HOME falls back to the default
~/.cache/huggingface which is always writable. Models download fresh each
nightly run — acceptable for a smoke test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Axolotl has no pytest marker for bnb, so we select the 3 relevant
test targets explicitly:
- tests/e2e/kernels/test_quantize.py (4 tests — dequantize + QuantState)
- tests/e2e/kernels/test_lora.py (10 tests — LoRA autograd with QuantState)
- tests/e2e/kernels/test_lora_features.py::TestQuantizedModels (2 tests —
  NF4 QLoRA forward/backward)

Runner: A10G (bandb-aws-g5-4xlarge-plus). These are kernel-level tests
that don't need large VRAM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use diffusers/diffusers-pytorch-cuda container for the diffusers test
job — matches their own CI setup and has all system deps (libGL, etc.)
pre-installed. Only override bnb with our continuous-release wheel.
Previous bare-metal approach failed with libGL.so.1 missing during
pytest collection (opencv-python needs it, and bare-metal runners
don't have it).

If the runner doesn't have Docker/nvidia-container-toolkit, the job
will fail fast with a clear error and we'll fall back to installing
opencv-python-headless instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Docker image provides system deps (libGL, etc.) but ships a dev
version from main with no matching tag. Override with PyPI release
inside the container so we test against what real users have, and the
tag clone works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pip install diffusers inside the container was a no-op: the image has
0.38.0.dev0 which pip considers higher than the latest PyPI release
(0.37.1), so it skips the install. Uninstall first to force the
PyPI version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
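
PEP 440 ordering explains the no-op; a quick check with the packaging library (assumed available here; it ships as a dependency of pip/setuptools) mirrors pip's comparison:

```python
from packaging.version import Version  # same PEP 440 ordering pip uses

# The image's dev build sorts above the latest PyPI release...
assert Version("0.38.0.dev0") > Version("0.37.1")
# ...but below the final release it leads up to
assert Version("0.38.0.dev0") < Version("0.38.0")
# Hence `pip install diffusers` sees the dev build as already newest and
# no-ops; uninstalling first is what forces the 0.37.1 PyPI release.
```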
…ndant deps

The Docker image already has transformers/accelerate/peft. Only diffusers
needs overriding (image has dev version > latest PyPI release).
--force-reinstall --no-deps does this in one line.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Transformers v5.x is released and our integration tests were stuck on
v4.57.6 due to the <5 cap in [test] extras. Some test failures in the
transformers bnb suite are already fixed in v5.x (see transformers#44604).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The continuous-release wheel is built from main which still has
transformers<5 in its metadata. pip install without -U sees 4.57.6
already satisfied and won't upgrade to v5.x. Adding -U forces the
upgrade past the baked-in constraint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>