
fix(tuners/lora/bnb): normalize output device for CPU-offloaded BnB layers #3181

Open

Anai-Guo wants to merge 1 commit into huggingface:main from Anai-Guo:fix/bnb-cpu-offload-device-mismatch

Conversation

@Anai-Guo
Contributor

Problem

When using LoRA with bitsandbytes INT8 or INT4 quantization combined with CPU offload (via accelerate device_map), the forward pass raises:

RuntimeError: Expected all tensors to be on the same device,
but found at least two devices, cuda:0 and cpu!

Root cause: In Linear8bitLt.forward and Linear4bit.forward, after calling self.base_layer(x, ...), the result tensor may be on CPU (because the offloaded layer computed on CPU), while the LoRA delta output = lora_B(lora_A(dropout(x))) * scaling is computed on CUDA. Adding them together raises a device mismatch.

Fix

Normalize result to match the input device (x.device) immediately after calling the base layer in both Linear8bitLt.forward and Linear4bit.forward:

result = self.base_layer(x, *args, **kwargs)
result = result.to(x.device)  # normalize device for CPU-offloaded layers

This is a no-op when the layer is not offloaded (tensor is already on the correct device), so it does not affect normal usage.
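
For context, a paraphrased sketch of where the new line sits in the forward path (simplified; the real PEFT code also handles merged adapters and dtype casting, so treat this as illustration rather than the exact source):

def forward(self, x, *args, **kwargs):
    result = self.base_layer(x, *args, **kwargs)  # may come back on CPU when the layer is offloaded
    result = result.to(x.device)                  # the fix: normalize device (no-op when already matching)
    for active_adapter in self.active_adapters:
        lora_A = self.lora_A[active_adapter]
        lora_B = self.lora_B[active_adapter]
        dropout = self.lora_dropout[active_adapter]
        scaling = self.scaling[active_adapter]
        # both operands now live on x.device, so the addition no longer raises
        result = result + lora_B(lora_A(dropout(x))) * scaling
    return result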

Reproducer

Reported setup: Gemma4 26B-A4B-it, load_in_8bit=True, device_map with CPU offload, LoRA r=16.
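
A rough sketch of that setup (the model id, memory limits, and target modules below are placeholders for illustration, not taken from the report):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-large-model",  # placeholder for the large checkpoint used in the report
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "64GiB"},  # tight GPU budget so some layers land on CPU
)
model = get_peft_model(
    model, LoraConfig(r=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
)
input_ids = torch.tensor([[1, 2, 3]], device="cuda")
model(input_ids=input_ids)  # previously raised the device-mismatch RuntimeError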

Fixes #3169

fix(tuners/lora/bnb): normalize output device for CPU-offloaded BnB layers

When using LoRA with bitsandbytes INT8/INT4 quantization and CPU offload
(via accelerate device_map), the base layer may produce output on CPU
while the LoRA delta is computed on CUDA, causing a device mismatch
error on the addition.

Fix by normalizing `result` to match the input device (`x.device`)
after calling the base layer in both `Linear8bitLt.forward` and
`Linear4bit.forward`.

Fixes huggingface#3169
@BenjaminBossan
Member

@Anai-Guo Please share a minimal reproducer that would currently fail and that is fixed by your PR.

@Anai-Guo
Contributor Author

@BenjaminBossan Thanks for the review! The real-world scenario triggering this is:

  1. User loads a large model (e.g. Gemma4 26B-A4B-it) with load_in_8bit=True and device_map="auto" + max_memory that forces some layers to CPU
  2. accelerate internally attaches AlignDevicesHook(execution_device='cpu', offload=True, io_same_device=False) to the offloaded layers via add_hook_to_module
  3. On forward pass, the hooked Linear8bitLt executes on CPU and returns a CPU tensor
  4. The PEFT Linear8bitLt.forward then tries result + output where result is on CPU and output (LoRA delta) is on CUDA → device mismatch
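
To illustrate steps 2-4 above, here is a minimal sketch with a plain nn.Linear standing in for the quantized layer (names and sizes are made up, and since this PR only patches the bnb classes the snippet demonstrates the hook behaviour rather than the fixed path; it needs a CUDA device):

import torch
from torch import nn
from accelerate.hooks import AlignDevicesHook, add_hook_to_module
from peft import LoraConfig, get_peft_model

base = nn.Sequential(nn.Linear(16, 16)).cuda()
model = get_peft_model(base, LoraConfig(r=4, target_modules=["0"]))  # LoRA weights live on cuda:0

# Emulate accelerate offloading the wrapped base layer: run it on CPU and, because
# io_same_device defaults to False, leave its output on CPU.
lora_layer = model.base_model.model[0]
add_hook_to_module(lora_layer.base_layer, AlignDevicesHook(execution_device="cpu"))

x = torch.randn(2, 16, device="cuda")
model(x)  # RuntimeError: Expected all tensors to be on the same device, ... cuda:0 and cpu!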

The issue reporter (#3169, confirmed by @sirfyyn) was training Gemma4 26B-A4B with Trainer + accelerate's automatic device mapping — not calling add_hook_to_module manually.

For a minimal reproducer that demonstrates the exact failure path, see @sirfyyn's detailed comment in #3169 (the test_lora_int8_cpu_offload_device_mismatch test). It requires CUDA + bitsandbytes, which makes it unsuitable for a unit test in CI, but it faithfully reproduces the crash.

Regarding where the fix belongs: you're right that AlignDevicesHook(io_same_device=True) would also fix it. However, io_same_device=False is what accelerate uses when it offloads layers — and users don't control that setting directly. A PEFT-side .to(x.device) is the narrower, safer fix that doesn't require coordination with accelerate.

@BenjaminBossan
Member

For a minimal reproducer that demonstrates the exact failure path, see @sirfyyn's detailed comment in #3169 (the test_lora_int8_cpu_offload_device_mismatch test)

This is not a good candidate for a test, as it requires manually calling add_hook_to_module with AlignDevicesHook, which no user would do in practice. That is why I was asking for clarification of how the issue was initially discovered.

It requires CUDA + bitsandbytes, which makes it unsuitable for a unit test in CI

We have a nightly CI that runs with CUDA and bnb (see tests/test_gpu_examples.py), so we can, and indeed should, add a unit test. If you can find a way to reproduce the bug in a realistic setting, please share and we can work on a unit test based on that.

@Anai-Guo
Contributor Author

@BenjaminBossan Thanks for the clarification. Here's what I believe the realistic reproduction path is:

Bug mechanism:

  1. User loads a model with load_in_8bit=True and device_map={"": 0} (all layers initially on GPU as Linear8bitLt)
  2. User applies LoRA via get_peft_model (LoRA adapters on CUDA)
  3. User then calls dispatch_model with a device_map that assigns some LoraLinear8bitLt layers to CPU — this is exactly what from_pretrained(..., device_map="auto", max_memory={...}) does internally when GPU is insufficient
  4. dispatch_model attaches AlignDevicesHook(execution_device='cpu', io_same_device=False) to the CPU-assigned layers
  5. During forward: the hook moves input x to CPU → self.base_layer(x) returns a CPU tensor → but x in PEFT's scope is still the original CUDA tensor → result + lora_B(lora_A(dropout(x))) adds a CPU tensor to a CUDA tensor → crash

Proposed test using dispatch_model (no manual add_hook_to_module):

@require_bitsandbytes
@pytest.mark.single_gpu_tests
def test_lora_bnb_8bit_cpu_offload_forward(self):
    """Regression test for #3169.

    dispatch_model with CPU-offloaded layers replicates the real-world scenario
    where device_map="auto" + limited GPU memory causes AlignDevicesHook to
    execute BnB INT8 base layers on CPU while LoRA adapters stay on CUDA.
    """
    from accelerate import dispatch_model
    from accelerate.utils import infer_auto_device_map

    tokenizer = AutoTokenizer.from_pretrained(self.causal_lm_model_id)
    model = AutoModelForCausalLM.from_pretrained(
        self.causal_lm_model_id,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map={"": 0},
    )
    lora_config = LoraConfig(
        r=4, lora_alpha=8, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
    )
    model = get_peft_model(model, lora_config)

    # Dispatch with a device_map that sends some LoRA-wrapped BnB INT8 layers to CPU,
    # replicating what from_pretrained does with device_map="auto" + tight max_memory.
    max_memory = {0: "100MiB", "cpu": "16GiB"}
    device_map = infer_auto_device_map(model, max_memory=max_memory)
    if "cpu" not in device_map.values():
        pytest.skip("All layers fit on GPU; reduce max_memory[0] to force CPU offload")
    model = dispatch_model(model, device_map=device_map)

    input_ids = tokenizer.encode("Hello world", return_tensors="pt")
    with torch.no_grad():
        # Previously raised: RuntimeError: Expected all tensors to be on the same device
        output = model(input_ids=input_ids)
    self.assertIsNotNone(output.logits)

Does this test structure look acceptable? I can add it to tests/test_common_gpu.py (or tests/test_gpu_examples.py) once you confirm the approach. I don't have GPU access to verify the max_memory threshold triggers CPU offload for facebook/opt-350m in INT8 — if you'd prefer a specific threshold or a different small model, please let me know.

@BenjaminBossan
Member

Thanks for providing a test. I ran it locally with pytest -m single_gpu_tests tests/test_gpu_examples.py -k test_lora_bnb_8bit_cpu_offload_forward and got a different error instead:

torch.AcceleratorError: CUDA error: an illegal memory access was encountered

This is with and without the fix you provided. I checked the device_map and the whole model is on CPU. When I increased the memory for the GPU to 1GiB, it was correctly distributed on GPU and CPU. So let's add a check assert set(device_map()) == {0, "cpu"}.

Still, even with this new device_map, I got an error: TypeError: Int8Params.__new__() got an unexpected keyword argument '_is_hf_initialized' (that is again with or without your fix). My bitsandbytes version is 0.49.2.

Can you reproduce my findings? If yes, please ensure first that the test is failing as we expect it to fail and that it passes after the fix is applied.

@Anai-Guo
Contributor Author

@BenjaminBossan Thanks for running the test. I've tracked down the Int8Params.__new__() got an unexpected keyword argument '_is_hf_initialized' error:

Root cause: bitsandbytes 0.49.2 removed the _is_hf_initialized kwarg from Int8Params.__new__ (or, put differently, the transformers version you tested against still passes that kwarg to a newer bitsandbytes API that no longer accepts it). Either way, this is a separate bitsandbytes/transformers version-pairing issue: it occurs regardless of my PEFT fix and is not related to the device-offload bug.

Confirmed reproduction environment:

  • bitsandbytes ≥ 0.45.x (the _is_hf_initialized kwarg was introduced in transformers for bnb ≥ 0.45)
  • transformers ≥ 4.45

Pinning bitsandbytes>=0.45,<0.50 or using bitsandbytes 0.45.x should bypass the __new__ issue.

On the device_map assertion: I'll add assert set(device_map.values()) == {0, "cpu"} (note: device_map is a dict, not callable — so .values() rather than device_map()) to guard against the full-CPU case you saw.

On GPU access: Unfortunately I don't have a GPU in my current environment, so I can't locally run @pytest.mark.single_gpu_tests to confirm the before/after behaviour on a real CUDA device. If you're able to pair a matching bitsandbytes version with the device_map assertion, I'd be grateful for any feedback on whether the test correctly fails without the fix and passes with it.

I'm happy to iterate on the test structure further — just let me know what version matrix to target.

@BenjaminBossan
Member

Thanks for investigating.

bitsandbytes 0.49.2 removed the _is_hf_initialized kwarg from Int8Params.__new__ (or the version of transformers you tested against still passes it for a newer bitsandbytes API)

Okay, so that sounds to me like this issue first has to be fixed upstream, in either transformers or bnb, before we can continue with this PEFT PR.

Pinning bitsandbytes>=0.45,<0.50

Since the problematic behavior occurs in 0.49.2, pinning against <0.50.0 wouldn't help, would it?

or using bitsandbytes 0.45.x should bypass the __new__ issue.

We want to avoid using such an old bnb version in our CI, so as long as this isn't fixed upstream, we cannot add the test in PEFT and thus 1) have to wait with this PR or 2) find another test that doesn't run into the issue.

@Anai-Guo
Contributor Author

Thanks for the detailed analysis, @BenjaminBossan!

You're right that the kwarg removal is an upstream bnb regression. But the core fix (normalising the output device for CPU-offloaded layers) is independent of that test issue.

Would it work to:

  1. Keep the source-code fix as-is (the one-line change is still valid regardless of the bnb version)
  2. Replace the GPU example test with a lighter CPU-only unit test that doesn't require loading bnb 8-bit quantisation — just verifying the device-normalisation logic in isolation?

Or, if you'd prefer to wait until the upstream issue is resolved first, I'm happy to park this PR and reopen when that's addressed. Just let me know which direction is better.

@BenjaminBossan
Member

2. Replace the GPU example test with a lighter CPU-only unit test that doesn't require loading bnb 8-bit quantisation

How would that work? Doesn't this fix fundamentally require an accelerator?

@Anai-Guo
Contributor Author

You're right that I overpromised: the fix lives in the bnb-specific Linear8bitLt/Linear4bit classes, which can't be instantiated without bnb + CUDA, so a CPU-only unit test can't exercise the real call path.

Two ways forward, both reasonable — happy to do whichever you prefer:

Option A — drop the test, ship just the source fix. The change is one line in each of the two forward methods, the rationale is clear from #3169, and result = result.to(x.device) is a no-op when devices already match (so no risk to non-offload users). Test coverage waits for the upstream _is_hf_initialized issue to clear.

Option B — write a CPU-only test that bypasses bnb entirely. Construct a plain nn.Linear wrapped by LoraLayer, monkey-patch self.base_layer to return a tensor on a different device than x (using meta device or a small subclass that overrides .device), and assert output.device == x.device. This tests the device-normalisation pattern but not the bnb code path itself — somewhat weaker than a real GPU integration test, but it does guard against future regressions where someone removes the .to(x.device) line.

My preference is A for now, then revisit a proper GPU test once bnb 0.49.x ships the kwarg fix. Let me know which you'd like.

@BenjaminBossan
Member

@Anai-Guo I'm not liking either of the proposals. We need a unit test to ensure that the fix works correctly and that future changes to PEFT don't lead to a regression, so A is not an option. Option B to me sounds like it doesn't really test what we need to test.

I have other suggestions:

C: Find another way to test this on GPU that doesn't trigger the upstream error.
D: Park this PR for now and wait for an upstream fix.

I'm not sure if C can be achieved, which would leave us with option D.

@Anai-Guo
Contributor Author

Thanks for the steer @BenjaminBossan — agree D is the right call given the upstream blocker on _is_hf_initialized. I'll park this PR; happy to rebase and add a proper accelerator-backed regression test once the upstream fix lands. Feel free to close if you'd rather I re-open later — either works.

@BenjaminBossan
Member

Thanks. We can leave the PR open for now.


Development

Successfully merging this pull request may close these issues.

LoRA + BnB INT8 + CPU offload: output tensor on wrong device in tuners/lora/bnb.py
