LoRA + BnB INT8 + CPU offload: output tensor on wrong device in tuners/lora/bnb.py #3169

@sirfyyn

Bug: LoRA output on wrong device when base weight is CPU-offloaded

Environment

  • peft latest
  • LoRA r=16, task_type=CAUSAL_LM
  • BnB INT8 (load_in_8bit=True)
  • accelerate device_map with CPU offload
  • Gemma4 26B-A4B-it on RTX 4090

Problem

In peft/tuners/lora/bnb.py, the LoRA forward pass computes the low-rank update on CUDA, but when the layer has been CPU-offloaded the base INT8 linear output may still be on CPU. Adding the LoRA delta to that output then raises a device mismatch.

There are two sites in bnb.py that need .to(x.device) normalization on the output:

  1. Linear8bitLt.forward — after the base matmul, before adding the LoRA delta
  2. A second occurrence in the same file (a different branch / mixed-precision path)

Fix (P8):

# Both sites — normalize output device to match input:
result = result.to(x.device)

Without this, training fails with:

RuntimeError: Expected all tensors to be on the same device,
but found at least two devices, cuda:0 and cpu!
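
For illustration, here is a minimal sketch of the failure and the proposed normalization that runs without a GPU. `FakeTensor` and `add_lora_delta` are hypothetical names made up for this example, not part of peft or torch. `FakeTensor` only mimics the two things that matter here: a `.device` label and a `.to(device)` method. The helper relies on nothing else, so under that assumption it should behave the same way with real `torch.Tensor` objects.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for torch.Tensor: a value plus a device label."""
    value: float
    device: str

    def to(self, device):
        # Like torch.Tensor.to, return self unchanged when already on `device`.
        if device == self.device:
            return self
        return FakeTensor(self.value, device)

    def __add__(self, other):
        # Mimic the device check that produces the error quoted in this issue.
        if self.device != other.device:
            raise RuntimeError(
                "Expected all tensors to be on the same device, but found "
                f"at least two devices, {self.device} and {other.device}!"
            )
        return FakeTensor(self.value + other.value, self.device)


def add_lora_delta(result, lora_delta, x):
    """Add the LoRA delta after moving the base output to the input's device."""
    result = result.to(x.device)  # the proposed one-line normalization
    return result + lora_delta


x = FakeTensor(1.0, "cuda:0")         # layer input on the GPU
base_out = FakeTensor(2.0, "cpu")     # base output left on CPU by offload
delta = FakeTensor(0.5, "cuda:0")     # LoRA update computed on the GPU

out = add_lora_delta(base_out, delta, x)
print(out.device)
```

When the base output is already on the input's device, `.to(x.device)` hands back the same object, which is why the normalization should not change behavior for non-offload setups. That still needs confirming against the real code paths.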

Full context

Discovered while making Gemma4 26B-A4B train on a single RTX 4090 (BnB INT8 + LoRA + Gradient Checkpointing + CPU offload). All patches + complete training example:

https://github.com/sirfyyn/consumer-llm-patches

Happy to submit a PR. The fix is a one-liner per site but needs testing against non-offload setups to confirm no regression.
