fix(tuners/lora/bnb): normalize output device for CPU-offloaded BnB layers #3181
Anai-Guo wants to merge 1 commit into
Conversation
When using LoRA with bitsandbytes INT8/INT4 quantization and CPU offload (via accelerate device_map), the base layer may produce output on CPU while the LoRA delta is computed on CUDA, causing a device mismatch error on the addition. Fix by normalizing `result` to match the input device (`x.device`) after calling the base layer in both `Linear8bitLt.forward` and `Linear4bit.forward`. Fixes huggingface#3169
@Anai-Guo Please share a minimal reproducer that would currently fail and that is fixed by your PR.
@BenjaminBossan Thanks for the review! The real-world scenario triggering this is:

The issue reporter (#3169, confirmed by @sirfyyn) was training Gemma4 26B-A4B with `load_in_8bit=True`, a `device_map` with CPU offload, and LoRA r=16. For a minimal reproducer that demonstrates the exact failure path, see @sirfyyn's detailed comment in #3169.

Regarding where the fix belongs: you're right that …
This is not a good candidate for a test, as it requires manually calling …

We have a nightly CI that runs with CUDA and bnb (see …).
@BenjaminBossan Thanks for the clarification. Here's what I believe the realistic reproduction path is:

Bug mechanism: …

Proposed test:

```python
@require_bitsandbytes
@pytest.mark.single_gpu_tests
def test_lora_bnb_8bit_cpu_offload_forward(self):
    """Regression test for #3169.

    dispatch_model with CPU-offloaded layers replicates the real-world scenario
    where device_map="auto" + limited GPU memory causes AlignDevicesHook to
    execute BnB INT8 base layers on CPU while LoRA adapters stay on CUDA.
    """
    from accelerate import dispatch_model
    from accelerate.utils import infer_auto_device_map

    tokenizer = AutoTokenizer.from_pretrained(self.causal_lm_model_id)
    model = AutoModelForCausalLM.from_pretrained(
        self.causal_lm_model_id,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map={"": 0},
    )
    lora_config = LoraConfig(
        r=4, lora_alpha=8, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
    )
    model = get_peft_model(model, lora_config)
    # Dispatch with a device_map that sends some LoRA-wrapped BnB INT8 layers to CPU,
    # replicating what from_pretrained does with device_map="auto" + tight max_memory.
    max_memory = {0: "100MiB", "cpu": "16GiB"}
    device_map = infer_auto_device_map(model, max_memory=max_memory)
    if "cpu" not in device_map.values():
        pytest.skip("All layers fit on GPU; reduce max_memory[0] to force CPU offload")
    model = dispatch_model(model, device_map=device_map)
    input_ids = tokenizer.encode("Hello world", return_tensors="pt")
    with torch.no_grad():
        # Previously raised: RuntimeError: Expected all tensors to be on the same device
        output = model(input_ids=input_ids)
    self.assertIsNotNone(output.logits)
```

Does this test structure look acceptable? I can add it to …
Thanks for providing a test. I ran it locally with … and got a failure, both with and without the fix you provided. I checked the … Still, even with this new …

Can you reproduce my findings? If yes, please ensure first that the test is failing as we expect it to fail and that it passes after the fix is applied.
@BenjaminBossan Thanks for running the test. I've tracked down the failure.

Root cause: a bitsandbytes regression (a kwarg that the call path relies on was removed), so the test fails for reasons unrelated to this PR's change.

Confirmed reproduction environment: bitsandbytes 0.49.2, …

Pinning bitsandbytes to <0.50.0 could be a short-term workaround for the CI test.

On GPU access: unfortunately I don't have a GPU in my current environment, so I can't run the test locally myself. I'm happy to iterate on the test structure further — just let me know what version matrix to target.
Thanks for investigating.
Okay, so that sounds to me like this issue first has to be fixed upstream, in either transformers or bnb, before we can continue with this PEFT PR.
Since the problematic behavior occurs in 0.49.2, pinning against <0.50.0 wouldn't help, would it?
We want to avoid using such an old bnb version in our CI, so as long as this isn't fixed upstream, we cannot add the test in PEFT and thus either 1) have to wait with this PR or 2) find another test that doesn't run into the issue.
Thanks for the detailed analysis, @BenjaminBossan! You're right that the kwarg removal is an upstream bnb regression. But the core fix in `tuners/lora/bnb` (normalising the output device for CPU-offloaded layers) is independent of that test issue. Would it work to …?

Or, if you'd prefer to wait until the upstream issue is resolved first, I'm happy to park this PR and reopen when that's addressed. Just let me know which direction is better.
How would that work? Doesn't this fix fundamentally require an accelerator?
You're right that I overpromised — the fix lives in the bnb-specific forward methods, so exercising it for real does need an accelerator. Two ways forward, both reasonable — happy to do whichever you prefer:

Option A — drop the test, ship just the source fix. The change is one line in each of the two forward methods (`Linear8bitLt.forward` and `Linear4bit.forward`).

Option B — write a CPU-only test that bypasses bnb entirely. Construct a plain stand-in for the quantized base layer and simulate the device mismatch directly.

My preference is A for now, then revisit a proper GPU test once bnb 0.49.x ships the kwarg fix. Let me know which you'd like.
@Anai-Guo I'm not liking either of the proposals. We need a unit test to ensure that the fix works correctly and that future changes to PEFT don't lead to a regression, so A is not an option. Option B to me sounds like it doesn't really test what we need to test. I have other suggestions:

C: Find another way to test this on GPU that doesn't trigger the upstream error.
D: Wait for the upstream fix before moving forward with this PR.

I'm not sure if C can be achieved, which would leave us with option D.
Thanks for the steer @BenjaminBossan — agree D is the right call given the upstream blocker on the bnb side.
Thanks. We can leave the PR open for now.
Problem
When using LoRA with bitsandbytes INT8 or INT4 quantization combined with CPU offload (via an accelerate `device_map`), the forward pass raises `RuntimeError: Expected all tensors to be on the same device`.

Root cause: In `Linear8bitLt.forward` and `Linear4bit.forward`, after calling `self.base_layer(x, ...)`, the `result` tensor may be on CPU (because the offloaded layer computed on CPU), while the LoRA delta `output = lora_B(lora_A(dropout(x))) * scaling` is computed on CUDA. Adding them together raises a device mismatch.

Fix

Normalize `result` to match the input device (`x.device`) immediately after calling the base layer in both `Linear8bitLt.forward` and `Linear4bit.forward`. This is a no-op when the layer is not offloaded (the tensor is already on the correct device), so it does not affect normal usage.
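For illustration, a minimal sketch of what that normalization looks like inside the two forward methods (not the verbatim diff; the surrounding `self.base_layer(x, *args, **kwargs)` call is assumed from the description above):

```python
result = self.base_layer(x, *args, **kwargs)
# With CPU offload, the offloaded base layer may have produced `result` on CPU
# while `x` (and the LoRA weights) live on CUDA; realign before adding the LoRA delta.
if result.device != x.device:
    result = result.to(x.device)
```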
Reproducer
Reported setup: Gemma4 26B-A4B-it, `load_in_8bit=True`, a `device_map` with CPU offload, LoRA r=16 (a rough sketch of this setup follows below).

Fixes #3169
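Sketch of that setup for context (the model id, memory limits, and prompt below are placeholders, not the reporter's exact script; any causal LM large enough to trigger CPU offload will do):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "..."  # placeholder for the reported Gemma checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
    max_memory={0: "8GiB", "cpu": "64GiB"},  # tight GPU budget forces CPU offload of some layers
)
model = get_peft_model(model, LoraConfig(r=16, task_type="CAUSAL_LM"))

inputs = tokenizer("Hello world", return_tensors="pt")
with torch.no_grad():
    # Without the fix, this raised:
    # RuntimeError: Expected all tensors to be on the same device
    model(**inputs)
```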