Fix load_adapter OOM caused by full-model warmup sizing #46145

Merged: BenjaminBossan merged 3 commits into huggingface:main from Yooniel:fix/load-adapter-warmup-expected-keys on May 29, 2026

@Yooniel Yooniel commented May 21, 2026

Contributor

What does this PR do?

Fixes an OOM in load_adapter on configurations where the base model occupies more than ~half of GPU memory, e.g. Gemma-3-27B in bf16 on a single H100/H200 or Llama-70B on a single 80 GB GPU.

Root cause

load_adapter passes every named parameter on the model, base model included, as expected_keys to _load_pretrained_model. Downstream, caching_allocator_warmup sums those into a full base-model byte count and issues a single same-size allocation on top of the already-resident base model, OOMing.

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.87 GiB.
GPU 0 has a total capacity of 94.50 GiB of which 41.85 GiB is free.
Including non-PyTorch memory, this process has 52.64 GiB memory in use.

The allocation attempt, 51.87 GiB, is essentially the size of the base model already resident on the GPU.
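The figures in the traceback can be checked with a back-of-the-envelope sketch. The helper name `warmup_fits` is hypothetical, and the model is deliberately simplified: it assumes the warmup requests one block equal to the summed size of the expected keys, as described above, and ignores allocator fragmentation.

```python
def warmup_fits(requested_gib: float, free_gib: float) -> bool:
    """Return True if a single allocation of requested_gib fits in free_gib."""
    return requested_gib <= free_gib


# Values copied from the traceback above.
total_gib = 94.50
free_gib = 41.85
in_use_gib = 52.64
requested_gib = 51.87  # roughly the size of the resident base model

# The base model already occupies more than half of the device ...
base_fraction = in_use_gib / total_gib
print(f"fraction in use: {base_fraction:.2f}")

# ... so a second allocation of about the same size cannot fit.
print(warmup_fits(requested_gib, free_gib))  # False
print(f"short by {requested_gib - free_gib:.2f} GiB")
```

Under this simplified model, any base model that takes up more than about half of the device fails the check, which matches the condition stated in the PR description.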

Fix

Hoist the existing is_adapter_key helper above the _load_pretrained_model call and apply it to expected_keys, so warmup is sized only from adapter parameters. The downstream missing_keys filter that already used the helper is preserved.
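The effect of filtering expected_keys can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `is_adapter_key` body is a stand-in (the real helper in integrations/peft.py may use a different rule), `warmup_bytes` is a hypothetical simplification of how the warmup sizes its allocation, and the parameter names and shapes are invented.

```python
def is_adapter_key(key: str) -> bool:
    # Hypothetical rule for illustration only; the real helper may differ.
    return "lora_" in key


def warmup_bytes(expected_keys, param_bytes):
    """Sum the byte sizes of the parameters the warmup is told to expect."""
    return sum(param_bytes[key] for key in expected_keys)


# Invented example: one bf16 base weight plus a rank-8 LoRA pair (2 bytes per element).
param_bytes = {
    "layers.0.q_proj.base_layer.weight": 4096 * 4096 * 2,
    "layers.0.q_proj.lora_A.default.weight": 8 * 4096 * 2,
    "layers.0.q_proj.lora_B.default.weight": 4096 * 8 * 2,
}

all_keys = list(param_bytes)                              # behaviour before the fix
adapter_keys = [k for k in all_keys if is_adapter_key(k)]  # behaviour after the fix

print(warmup_bytes(all_keys, param_bytes))      # dominated by the base weight
print(warmup_bytes(adapter_keys, param_bytes))  # adapter weights only
```

In this toy case the adapter-only sum is a few hundred times smaller than the unfiltered one, which is why sizing the warmup from adapter keys avoids the second base-model-sized allocation.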

Tests

Adds a regression test that captures the device map passed to caching_allocator_warmup during load_adapter and asserts it contains only adapter-owned parameter names, not base-model names. Without the fix, the test fails with 84 base-model parameter names leaking into the warmup.

make style
RUN_SLOW=1 python -m unittest tests.peft_integration.test_peft_integration.PeftIntegrationTester.test_peft_load_adapter_warmup_uses_adapter_expected_keys -v

Also verified the original GH200 repro locally: before the fix, load_adapter tried to allocate 51.87 GiB and OOMed; after the fix, the adapter loads successfully.

Related

No associated issue was filed; this is a focused bugfix PR with a local repro, root-cause analysis, and regression test.

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

  • @Cyrilvallez (model loading): this change touches the caching_allocator_warmup path.
  • @BenjaminBossan (PEFT integration): this change is in integrations/peft.py and concerns adapter loading semantics.

load_adapter passed every named parameter on the model, including the
base model, as expected_keys to _load_pretrained_model. Downstream,
caching_allocator_warmup summed those into a full base-model byte count
and issued a single same-size allocation on top of the already-resident
base model, OOMing whenever the base model occupies more than ~half of
GPU memory.

The file already defined an is_adapter_key helper for identifying
parameters belonging to the freshly-injected adapter, but it was declared
after the _load_pretrained_model call. Hoist the helper above the call
and apply it to expected_keys.

Adds a regression test that captures the device map passed to
caching_allocator_warmup during load_adapter and asserts it contains only
adapter-owned parameter names, not base-model names.

@BenjaminBossan BenjaminBossan left a comment

Member

Thanks for creating this PR. I confirmed that memory usage is actually doubled without the amendment. Here is a small reproducer:

import os
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

path = "/tmp/peft/llama"
model_id = "meta-llama/Llama-3.2-3B"

# Create and save a default LoRA adapter once, so there is something to load below.
if not os.path.exists(os.path.join(path, "adapter_model.safetensors")):
    model = AutoModelForCausalLM.from_pretrained(model_id)
    config = LoraConfig()
    model = get_peft_model(model, config)
    model.save_pretrained(path)
    del model
    print(f"LoRA adapter did not exist, saved it to {path}")

# Load the base model onto GPU 0, then load the adapter on top of it.
model = AutoModelForCausalLM.from_pretrained(model_id).to(0)
model.load_adapter(path)

Setting a breakpoint before and after the self._load_pretrained_model call, I can see that the VRAM usage doubles. With the provided fix, this is no longer the case. Thus, from my side, the PR looks good; I just have some small comments.

As I'm not very knowledgeable about the overall weight loading machinery in Transformers, I defer to @Cyrilvallez to judge if this is the best solution to the problem.

def capture_warmup(model, expanded_device_map, hf_quantizer):
    captured_device_maps.append(dict(expanded_device_map))

modeling_utils.caching_allocator_warmup = capture_warmup

Member

How about using unittest.mock.patch?
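As a rough sketch of what that suggestion could look like, the snippet below swaps the manual reassignment for `unittest.mock.patch.object`, which restores the original function automatically when the `with` block exits. To keep it runnable without Transformers, `fake_modeling_utils` and `load_adapter_stub` are hypothetical stand-ins for the real module and for `load_adapter`; only the `capture_warmup` signature is taken from the snippet above.

```python
import types
from unittest import mock


def caching_allocator_warmup(model, expanded_device_map, hf_quantizer):
    raise RuntimeError("the real warmup should not run inside the test")


# Hypothetical stand-in for the module that owns the warmup function.
fake_modeling_utils = types.SimpleNamespace(
    caching_allocator_warmup=caching_allocator_warmup
)


def load_adapter_stub(device_map):
    # Looks the function up on the module at call time, so the patch is seen.
    fake_modeling_utils.caching_allocator_warmup(None, device_map, None)


captured_device_maps = []


def capture_warmup(model, expanded_device_map, hf_quantizer):
    captured_device_maps.append(dict(expanded_device_map))


with mock.patch.object(
    fake_modeling_utils, "caching_allocator_warmup", side_effect=capture_warmup
) as patched:
    load_adapter_stub({"q_proj.lora_A.default.weight": 0})

# After the block exits, the original function is back in place.
restored = fake_modeling_utils.caching_allocator_warmup is caching_allocator_warmup
print(captured_device_maps, patched.call_count, restored)
```

Compared with assigning to the module attribute directly, the patch cannot leak into other tests if an assertion fails midway.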

# after loading, no meta device should be remaining
self.assertFalse(any((p.device.type == "meta") for p in model.parameters()))

def test_peft_load_adapter_warmup_uses_adapter_expected_keys(self):

Member

The test somewhat relies on implementation details to work. As it's not easy to check the actual effect in terms of memory usage, I'd say it's fine. But if, for instance, caching_allocator_warmup is no longer used by _load_pretrained_model, the test would break and it would be hard to debug. So let's expand the test description to include how exactly this is being tested.

@Cyrilvallez Cyrilvallez left a comment

Member

Indeed, good catch! Looks good to me on the logic side. Will let @BenjaminBossan merge when he's happy with the test (I see a few comments there).

@BenjaminBossan

Member

@Yooniel Do you agree with my feedback regarding the test? If yes, would you please update it accordingly?

Yooniel commented May 29, 2026

Contributor Author

Thanks for the comments! I updated the test to use unittest.mock.patch and added a short note explaining what internal path the test relies on.

@HuggingFaceDocBuilderDev


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan BenjaminBossan left a comment

Member

Thanks for updating the test, PR LGTM.

@BenjaminBossan BenjaminBossan added this pull request to the merge queue May 29, 2026
Merged via the queue into huggingface:main with commit c220ea9 May 29, 2026
29 checks passed
kashif pushed a commit to kashif/transformers that referenced this pull request Jun 1, 2026
* Address review: use unittest.mock.patch and expand test docstring
khushali9 pushed a commit to khushali9/transformers that referenced this pull request Jun 8, 2026
* Address review: use unittest.mock.patch and expand test docstring