Fix load_adapter OOM caused by full-model warmup sizing by Yooniel · Pull Request #46145 · huggingface/transformers

Yooniel · 2026-05-21T15:59:30Z

What does this PR do?

Fixes an OOM in load_adapter on configurations where the base model occupies more than ~half of GPU memory, e.g. Gemma-3-27B in bf16 on a single H100/H200 or Llama-70B on a single 80 GB GPU.

Root cause

load_adapter passes every named parameter on the model, base model included, as expected_keys to _load_pretrained_model. Downstream, caching_allocator_warmup sums those into a full base-model byte count and issues a single same-size allocation on top of the already-resident base model, OOMing.

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.87 GiB.
GPU 0 has a total capacity of 94.50 GiB of which 41.85 GiB is free.
Including non-PyTorch memory, this process has 52.64 GiB memory in use.

The allocation attempt, 51.87 GiB, is essentially the size of the base model already resident on the GPU.
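The arithmetic behind the failure can be sketched as follows. This is a simplified model, not the allocator's actual logic: it assumes a warmup succeeds only if the resident memory plus one contiguous warmup allocation fits in total capacity, and it ignores fragmentation and reserved-but-unused memory. The function name `warmup_fits` and the 0.05 GiB adapter size are hypothetical and chosen only for illustration.

```python
def warmup_fits(capacity_gib, resident_gib, warmup_gib):
    """Return True if one warmup allocation fits next to what is already resident."""
    return resident_gib + warmup_gib <= capacity_gib


# Figures taken from the error message above (GiB).
capacity = 94.50   # total GPU capacity
resident = 52.64   # memory already in use by this process
requested = 51.87  # warmup allocation sized from the full base model

print(warmup_fits(capacity, resident, requested))  # False

shortfall = resident + requested - capacity
print(f"short by {shortfall:.2f} GiB")  # short by 10.01 GiB

# Hypothetical adapter-only warmup size, for contrast.
adapter_only = 0.05
print(warmup_fits(capacity, resident, adapter_only))  # True
```

Under this model, a warmup roughly equal to the resident base model fails as soon as the base model exceeds about half of capacity, which matches the configurations listed in the description.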

Fix

Hoist the existing is_adapter_key helper above the _load_pretrained_model call and apply it to expected_keys, so warmup is sized only from adapter parameters. The downstream missing_keys filter that already used the helper is preserved.
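A minimal sketch of the idea, using plain Python instead of the real model objects. The body of `is_adapter_key` here is an assumption (a substring check on the adapter name); the actual helper in integrations/peft.py may be implemented differently. The parameter names, shapes, and the `warmup_bytes` function are likewise hypothetical stand-ins for the sizing that the warmup performs.

```python
BF16_BYTES = 2

# Hypothetical parameter table: name -> size in bytes.
param_sizes = {
    "model.layers.0.self_attn.q_proj.weight": 4096 * 4096 * BF16_BYTES,
    "model.layers.0.self_attn.q_proj.lora_A.default.weight": 8 * 4096 * BF16_BYTES,
    "model.layers.0.self_attn.q_proj.lora_B.default.weight": 4096 * 8 * BF16_BYTES,
}


def is_adapter_key(key, adapter_name):
    """Assumed predicate: treat a key as adapter-owned if it contains the adapter name."""
    return f".{adapter_name}." in key


def warmup_bytes(expected_keys):
    """Sum the byte sizes of the parameters named in expected_keys."""
    return sum(param_sizes[name] for name in expected_keys)


# Before the fix: every named parameter is passed as an expected key.
all_keys = list(param_sizes)
# After the fix: only adapter-owned keys are passed.
adapter_keys = [k for k in all_keys if is_adapter_key(k, "default")]

print(warmup_bytes(all_keys))      # 33685504
print(warmup_bytes(adapter_keys))  # 131072
```

Even in this toy table, the base weight dominates the total, so filtering the keys before sizing is what keeps the warmup proportional to the adapter.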

Tests

Adds a regression test that captures the device map passed to caching_allocator_warmup during load_adapter and asserts it contains only adapter-owned parameter names, not base-model names. Without the fix, the test fails with 84 base-model parameter names leaking into the warmup.

make style
RUN_SLOW=1 python -m unittest tests.peft_integration.test_peft_integration.PeftIntegrationTester.test_peft_load_adapter_warmup_uses_adapter_expected_keys -v

Also verified the original GH200 repro locally: before the fix, load_adapter tried to allocate 51.87 GiB and OOMed; after the fix, the adapter loads successfully.

Related

Accidentally allocating 2x memory in new caching_allocator_warmup #36483, restrict cache allocator to non quantized model #36428, Loading optimizations #36742: the same warmup, fixed for the base-model loading path only; the adapter path was untouched.
load_best_model_at_end reloads PEFT adapter weights onto CUDA and can OOM under low remaining GPU memory #44637 / Fix: avoid late CUDA OOM in load_best_model_at_end with PEFT models #44660: adjacent open issue/PR about a different load_adapter OOM (state-dict materialization in load_best_model_at_end), not warmup over-allocation.

No associated issue was filed; this is a focused bugfix PR with a local repro, root-cause analysis, and regression test.

Code Agent Policy

I confirm that this is not a pure code agent PR.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@Cyrilvallez (model loading): this change touches the caching_allocator_warmup path.
@BenjaminBossan (PEFT integration): this change is in integrations/peft.py and concerns adapter loading semantics.

load_adapter passed every named parameter on the model, including the base model, as expected_keys to _load_pretrained_model. Downstream, caching_allocator_warmup summed those into a full base-model byte count and issued a single same-size allocation on top of the already-resident base model, OOMing whenever the base model occupies more than ~half of GPU memory. The file already defined an is_adapter_key helper for identifying parameters belonging to the freshly-injected adapter, but it was declared after the _load_pretrained_model call. Hoist the helper above the call and apply it to expected_keys. Adds a regression test that captures the device map passed to caching_allocator_warmup during load_adapter and asserts it contains only adapter-owned parameter names, not base-model names.

BenjaminBossan

Thanks for creating this PR. I confirmed that memory usage is actually doubled without the amendment. Here is a small reproducer:

import os
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

path = "/tmp/peft/llama"
model_id = "meta-llama/Llama-3.2-3B"

if not os.path.exists(os.path.join(path, "adapter_model.safetensors")):
    model = AutoModelForCausalLM.from_pretrained(model_id)
    config = LoraConfig()
    model = get_peft_model(model, config)
    model.save_pretrained(path)
    del model
    print(f"LoRA adapter did not exist, saved it to {path}")

model = AutoModelForCausalLM.from_pretrained(model_id).to(0)
model.load_adapter(path)

Setting a breakpoint before and after the self._load_pretrained_model call, I can see that VRAM usage doubles. With the provided fix, this is no longer the case. Thus, from my side, the PR looks good; I just have some small comments.

As I'm not very knowledgeable about the overall weight loading machinery in Transformers, I defer to @Cyrilvallez to judge if this is the best solution to the problem.

BenjaminBossan · 2026-05-26T12:36:18Z

+                def capture_warmup(model, expanded_device_map, hf_quantizer):
+                    captured_device_maps.append(dict(expanded_device_map))
+
+                modeling_utils.caching_allocator_warmup = capture_warmup


How about using unittest.mock.patch?
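For illustration, the suggestion could look roughly like the sketch below. It does not use Transformers: `fake_modeling_utils` and `fake_load_adapter` are hypothetical stand-ins for the real module and method, and only the `capture_warmup` signature is taken from the diff above. The point is that `mock.patch.object` swaps the function in for the duration of the `with` block and restores it afterwards, so the manual assignment is no longer needed.

```python
import types
from unittest import mock


def caching_allocator_warmup(model, expanded_device_map, hf_quantizer):
    # Stand-in for the real warmup; it must not run while patched.
    raise RuntimeError("real warmup should not run in this sketch")


# Hypothetical stand-in for the module that owns the warmup function.
fake_modeling_utils = types.SimpleNamespace(
    caching_allocator_warmup=caching_allocator_warmup
)


def fake_load_adapter(expected_keys):
    # Hypothetical loader: builds a device map from the expected keys
    # and hands it to whatever warmup function the module currently exposes.
    device_map = {key: 0 for key in expected_keys}
    fake_modeling_utils.caching_allocator_warmup(None, device_map, None)


captured_device_maps = []


def capture_warmup(model, expanded_device_map, hf_quantizer):
    captured_device_maps.append(dict(expanded_device_map))


with mock.patch.object(
    fake_modeling_utils, "caching_allocator_warmup", side_effect=capture_warmup
):
    fake_load_adapter(["q_proj.lora_A.default.weight", "q_proj.lora_B.default.weight"])

# The original function is restored once the with block exits.
restored = fake_modeling_utils.caching_allocator_warmup is caching_allocator_warmup
print(captured_device_maps, restored)
```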

BenjaminBossan · 2026-05-26T12:39:16Z

                # after loading, no meta device should be remaining
                self.assertFalse(any((p.device.type == "meta") for p in model.parameters()))

+    def test_peft_load_adapter_warmup_uses_adapter_expected_keys(self):


The test somewhat relies on implementation details to work. As it's not easy to check the actual effect in terms of memory usage, I'd say it's fine. But if, for instance, caching_allocator_warmup is no longer used by _load_pretrained_model, the test would break and it would be hard to debug. So let's expand the test description to include how exactly this is being tested.

Cyrilvallez

Indeed, good catch! Looks good to me on the logic side. Will let @BenjaminBossan merge when he's happy with the test (I see a few comments there).

BenjaminBossan · 2026-05-28T16:15:22Z

@Yooniel Do you agree with my feedback regarding the test? If yes, would you please update it accordingly?

Yooniel · 2026-05-29T02:08:25Z

Thanks for the comments! I updated the test to use unittest.mock.patch and added a short note explaining what internal path the test relies on.

@BenjaminBossan

HuggingFaceDocBuilderDev · 2026-05-29T12:33:28Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

BenjaminBossan

Thanks for updating the test, PR LGTM.


BenjaminBossan reviewed May 26, 2026

Cyrilvallez approved these changes May 27, 2026

Address review: use unittest.mock.patch and expand test docstring

d4d1371

Merge branch 'main' into fix/load-adapter-warmup-expected-keys

683f43e

BenjaminBossan approved these changes May 29, 2026

BenjaminBossan added this pull request to the merge queue May 29, 2026

Merged via the queue into huggingface:main with commit c220ea9 May 29, 2026
29 checks passed

BenjaminBossan merged 3 commits into huggingface:main from Yooniel:fix/load-adapter-warmup-expected-keys


4 participants
