Bug: LoRA output on wrong device when base weight is CPU-offloaded
Environment
- peft latest
- LoRA r=16, task_type=CAUSAL_LM
- BnB INT8 (load_in_8bit=True)
- accelerate device_map with CPU offload
- Gemma4 26B-A4B-it on RTX 4090
Problem
In peft/tuners/lora/bnb.py, the LoRA forward pass computes the low-rank update on CUDA, but the base INT8 linear output may be on CPU because the layer was offloaded. Adding the two tensors then raises a device mismatch.
There are two sites in bnb.py that need .to(x.device) normalization on the output:
- Linear8bitLt.forward: after the base matmul, before adding the LoRA delta
- A second occurrence in the same file (a different branch / mixed-precision path)
Fix (P8):
# Both sites — normalize output device to match input:
result = result.to(x.device)
Without this, training fails with:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
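To show where the normalization sits relative to the addition, here is a minimal sketch in plain PyTorch. It is not peft's code: ToyLoRALinear and its base_device argument are hypothetical names made up for illustration, and the offload is only imitated by keeping the base layer on CPU. On a machine without CUDA everything runs on CPU and the .to(x.device) call is a no-op.

```python
import torch
import torch.nn as nn


class ToyLoRALinear(nn.Module):
    """Toy stand-in for a LoRA-wrapped linear layer (hypothetical, NOT peft's implementation)."""

    def __init__(self, in_features, out_features, r=16, base_device="cpu"):
        super().__init__()
        # The base layer stays on `base_device` to imitate a CPU-offloaded weight.
        self.base = nn.Linear(in_features, out_features, bias=False).to(base_device)
        self.lora_A = nn.Linear(in_features, r, bias=False)
        self.lora_B = nn.Linear(r, out_features, bias=False)

    def forward(self, x):
        # The base output lands on whatever device holds the base weight.
        result = self.base(x.to(self.base.weight.device))
        # The fix: normalize the output device to match the input
        # before adding the LoRA delta.
        result = result.to(x.device)
        return result + self.lora_B(self.lora_A(x))


compute_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

layer = ToyLoRALinear(32, 8)
# Only the adapter weights move to the compute device; the base stays on CPU.
layer.lora_A.to(compute_device)
layer.lora_B.to(compute_device)

x = torch.randn(4, 32, device=compute_device)
out = layer(x)
print(out.device, tuple(out.shape))
```

With a CUDA device present, commenting out the `result = result.to(x.device)` line in this toy should reproduce the same kind of RuntimeError, since the addition would then mix a CPU tensor with a CUDA tensor.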
Full context
Discovered while making Gemma4 26B-A4B train on a single RTX 4090 (BnB INT8 + LoRA + Gradient Checkpointing + CPU offload). All patches + complete training example:
https://github.com/sirfyyn/consumer-llm-patches
Happy to submit a PR. The fix is a one-liner per site but needs testing against non-offload setups to confirm no regression.
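As a starting point for that regression check, the sketch below covers only the simplest non-offload case: when the output is already on the input's device, Tensor.to returns the same tensor object, so the added line should not copy anything. The helper normalize_output_device is a hypothetical wrapper around the one-line fix, not a function in peft, and this does not replace testing against real bitsandbytes layers.

```python
import torch


def normalize_output_device(result, x):
    """Hypothetical helper wrapping the one-line fix."""
    return result.to(x.device)


# Non-offload case: input and base output already share a device.
x = torch.randn(4, 32)
result = torch.randn(4, 8, requires_grad=True)

normalized = normalize_output_device(result, x)

# Tensor.to returns self when dtype and device already match,
# so no copy is made and the autograd graph is untouched.
assert normalized is result
print("no-op on matching devices:", normalized is result)
```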