FIX Error when prefix tuning Gemma 4 by BenjaminBossan · Pull Request #3205 · huggingface/peft

BenjaminBossan · 2026-04-30T13:10:58Z

There was an issue with applying prefix tuning to Gemma 4 because the model uses different head dimensions for layers that use sliding window attention. As prefix tuning only initializes a single projection matrix that is used for all layers, this would lead to a shape mismatch.

The solution is to "overprovision" the matrix and then slice the prefix down to size if the layer is smaller. This is not quite as parameter-efficient as it could be, but the overhead shouldn't be too large.
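The overprovision-then-slice idea can be sketched with plain nested lists. This is only an illustration: `slice_prefix`, the sizes, and the choice to keep the leading heads and dimensions are assumptions, not the PR's actual code.

```python
def slice_prefix(prefix, num_kv_heads, head_dim):
    """Trim a [heads][tokens][dim] nested list to the shape one layer expects.

    Hypothetical helper: keeps the first `num_kv_heads` heads and the first
    `head_dim` entries of every token vector.
    """
    return [[token[:head_dim] for token in head] for head in prefix[:num_kv_heads]]


# Made-up sizes: the prefix is provisioned for 2 heads x 8 dims, 3 virtual tokens.
provisioned = [[[float(d) for d in range(8)] for _ in range(3)] for _ in range(2)]

# A smaller layer that only expects 1 head with 4 dims.
small = slice_prefix(provisioned, num_kv_heads=1, head_dim=4)
shape = (len(small), len(small[0]), len(small[0][0]))
print(shape)
```

The unused part of the provisioned prefix is simply ignored for the smaller layer, which is where the parameter overhead comes from.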

For robustness, we also skip layers if the matrix is underprovisioned, but we warn about it and raise an error if all layers are skipped.
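The skip, warn, and raise behavior described above could look roughly like the following. `plan_injection` and its messages are hypothetical and only mirror the description; the real implementation operates on key/value tensors inside PEFT.

```python
import warnings


def plan_injection(provisioned, layer_shapes):
    """Split layer indices into (injected, skipped) for a provisioned prefix.

    `provisioned` and each entry of `layer_shapes` are (num_kv_heads, head_dim).
    A layer is skipped when the provisioned prefix is too small for it.
    """
    prov_heads, prov_dim = provisioned
    injected, skipped = [], []
    for idx, (heads, dim) in enumerate(layer_shapes):
        if heads <= prov_heads and dim <= prov_dim:
            injected.append(idx)  # fits, possibly after slicing down
        else:
            skipped.append(idx)  # underprovisioned for this layer
    if not injected:
        raise ValueError("Prefix does not fit any layer; nothing would be injected.")
    if skipped:
        warnings.warn(f"Prefix tuning injected into layers {injected}; skipped {skipped}.")
    return injected, skipped


# Made-up shapes: the last layer needs more heads than were provisioned.
print(plan_injection((2, 8), [(1, 4), (2, 8), (4, 8)]))
```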

Alternatively, we could implement one projection per layer, each with the right size, like in google-deepmind/gemma#631. However, this would be a big refactor and also very hard to make backwards compatible with existing checkpoints, so going with the less efficient solution is preferable.
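To get a feel for the trade-off, here is a back-of-the-envelope parameter count for the two options. The layer shapes are invented for illustration (they are not Gemma 4's real values), and the accounting is a simplification that assumes no prefix projection and one key plus one value vector per virtual token and layer.

```python
num_virtual_tokens = 20

# Invented (num_kv_heads, head_dim) per layer: three small layers, one large one.
layer_shapes = [(2, 64), (2, 64), (2, 64), (2, 128)]
kv_dims = [heads * dim for heads, dim in layer_shapes]

# Option 1: a single matrix sized for the largest layer, reused by every layer.
overprovisioned = num_virtual_tokens * len(layer_shapes) * 2 * max(kv_dims)

# Option 2: one exactly sized projection per layer.
per_layer = num_virtual_tokens * sum(2 * kv_dim for kv_dim in kv_dims)

print(overprovisioned, per_layer, overprovisioned / per_layer)
```

With these made-up numbers the overprovisioned variant stores 1.6x as many prefix parameters; the real ratio depends on how many layers use the smaller shape.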

This PR also contains an independent, single-line fix to a prefix tuning test that was referencing a nonexistent model.

HuggingFaceDocBuilderDev · 2026-04-30T13:15:01Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

stharrold · 2026-05-04T04:34:43Z

Tested peft#3205 (head c020c11c) against Gemma-4-E2B + SDPA + PrefixTuning on NVIDIA H100 80GB. The forward pass succeeds. ✅

Environment:

peft @ git+https://github.com/huggingface/peft.git@c020c11c397cdf2d66a34dccceab4246517a28c1 (reports 0.19.2.dev0)
transformers==5.7.0
torch==2.5.1+cu124, CUDA available
attn_implementation="sdpa", torch_dtype=torch.bfloat16, NVIDIA H100 80GB
model_id="google/gemma-4-e2b-it", PrefixTuningConfig(num_virtual_tokens=20, prefix_projection=False)

Result:

Forward returned loss=15.6689 (finite, no NaN)
input shape: torch.Size([1, 400])
logits shape: torch.Size([1, 400, 262144])

The skip-when-KV-mismatch logic from this PR kicked in cleanly. From the run output:

UserWarning: Prefix tuning injected into layers [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]; skipped [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34] due to KV shape mismatch or shared-KV layers.

Caveat on shape: our test input ("Hello, world." * 100 with max_length=1632) tokenized to 400 tokens, not the 1632 used by the reproducer in #3201 (finding 1). So the exact error, "expanded size (1632) must match the existing size (1652)", didn't surface at this shape. However, the forward path that previously crashed for prefix tuning + SDPA + Gemma 4 now runs end-to-end, and the PR's layer-skip logic is observable. Happy to retest with a longer literal prompt that reaches 1632+ tokens if you want exact-shape repro confirmation.
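One plausible reading of the two sizes in that error, not confirmed anywhere in this thread, is that the larger one is the input length plus the virtual prefix tokens. The arithmetic below uses only numbers quoted above; the variable names are illustrative.

```python
num_virtual_tokens = 20   # from PrefixTuningConfig(num_virtual_tokens=20)
input_len = 1632          # "expanded size" quoted from #3201
existing_size = 1652      # "existing size" quoted from #3201

# Assumption: the prefix extends the key/value length by num_virtual_tokens.
prefix_extended_len = input_len + num_virtual_tokens
print(prefix_extended_len == existing_size)
```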

For the other two findings in #3201 (eager-attn position_ids overflow at modeling_gemma4.py:2262; P-Tuning v2 PLE 1.69 TB OOM), this PR doesn't change behavior — they remain WONTFIX per the upstream consensus, and we're tracking them on synavistra's side via internal triage.

Thanks for the partial-fix PR — happy to test follow-up shapes (prefix_projection=True, multi-batch, longer sequence length) if useful for the merge.

BenjaminBossan · 2026-05-04T09:02:43Z

@zucchini-nlp It would be great if you could review the PR.

zucchini-nlp

Lgtm if we want to support gemma4. I think there is no uniform standard way for a unique arch, so hardcoding specific config names is fine

zucchini-nlp · 2026-05-05T09:42:31Z

+def _get_layer_kv_target_shape(base_config, layer_idx: int) -> tuple[int, int] | None:
+    """Per-layer (num_kv_heads, head_dim) for prefix-tuning injection, or None for uniform models.
+
+    Models with heterogeneous attention (e.g. Gemma4) expose `global_head_dim` / `num_global_key_value_heads` alongside
+    the sliding-layer `head_dim` / `num_key_value_heads`. The provisioned prefix is sized for the global footprint;
+    this returns the shape each layer actually expects so the caller can slice down or skip layers that don't fit.
+    """
+    layer_types = getattr(base_config, "layer_types", None)


so ig we're supporting specifically gemma4 with hardcoded attr names
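For readers without the full diff, this is a standalone sketch of what a per-layer shape lookup along the lines of that docstring might do. It is not the PR's implementation: the function name `layer_kv_shape`, the `"full_attention"` / `"sliding_attention"` strings, and the fallback when `num_global_key_value_heads` is unset are assumptions; only the config attribute names come from the hunk above.

```python
from types import SimpleNamespace


def layer_kv_shape(config, layer_idx):
    """Return (num_kv_heads, head_dim) for one layer, or None for uniform models."""
    layer_types = getattr(config, "layer_types", None)
    global_head_dim = getattr(config, "global_head_dim", None)
    if layer_types is None or global_head_dim is None:
        return None  # no heterogeneous attention: caller keeps the default path
    if layer_types[layer_idx] == "full_attention":  # assumed type name
        heads = getattr(config, "num_global_key_value_heads", None)
        # Assumed fallback: reuse the sliding-layer head count if unset.
        return (heads or config.num_key_value_heads, global_head_dim)
    return (config.num_key_value_heads, config.head_dim)


# Toy config with made-up sizes, standing in for a real text config.
toy = SimpleNamespace(
    layer_types=["sliding_attention", "sliding_attention", "full_attention"],
    head_dim=64,
    num_key_value_heads=2,
    global_head_dim=128,
    num_global_key_value_heads=1,
)
print([layer_kv_shape(toy, i) for i in range(3)])
```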

zucchini-nlp · 2026-05-05T09:44:03Z

+                    if num_kv_shared_layers > 0 and layer_idx >= first_kv_shared_layer_idx:
+                        skipped_layers.append(layer_idx)
+                        continue
+                    key_states, value_states = layer_past_key_values


nice, prev gemma3 also used to skip layers, so we shouldn't need a prefix cache for it
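The shared-KV skip in that hunk can be replayed in isolation. The layer counts below are inferred from the warning quoted earlier in this thread (layers 0 to 14 injected, 15 to 34 skipped), and computing `first_kv_shared_layer_idx` as total layers minus shared layers is an assumption about how the index is derived; the warning itself also allows for shape mismatches as a skip reason.

```python
num_hidden_layers = 35      # inferred: layers 0..34 appear in the warning
num_kv_shared_layers = 20   # inferred: 20 layers were skipped

# Assumption: the shared layers are the trailing ones.
first_kv_shared_layer_idx = num_hidden_layers - num_kv_shared_layers

injected_layers, skipped_layers = [], []
for layer_idx in range(num_hidden_layers):
    # Same condition as in the diff: shared-KV layers get no prefix of their own.
    if num_kv_shared_layers > 0 and layer_idx >= first_kv_shared_layer_idx:
        skipped_layers.append(layer_idx)
        continue
    injected_layers.append(layer_idx)

print(injected_layers[-1], skipped_layers[0], len(skipped_layers))
```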

zucchini-nlp · 2026-05-05T09:47:52Z

+            )
+            model = get_peft_model(model, config)
+            text_config = model.config.get_text_config()
+            text_config.num_key_value_heads = 999


curious about this. If configs dim are changed, doesn't it mean that key/value cache will also be a larger tensor?

BenjaminBossan

Thanks a lot for reviewing @zucchini-nlp. I replied to your comments. LMK if there is anything left, otherwise I'll go ahead and merge.

BenjaminBossan · 2026-05-05T10:01:35Z

+def _get_layer_kv_target_shape(base_config, layer_idx: int) -> tuple[int, int] | None:
+    """Per-layer (num_kv_heads, head_dim) for prefix-tuning injection, or None for uniform models.
+
+    Models with heterogeneous attention (e.g. Gemma4) expose `global_head_dim` / `num_global_key_value_heads` alongside
+    the sliding-layer `head_dim` / `num_key_value_heads`. The provisioned prefix is sized for the global footprint;
+    this returns the shape each layer actually expects so the caller can slice down or skip layers that don't fit.
+    """
+    layer_types = getattr(base_config, "layer_types", None)


Yeah, if there is a more general approach, LMK, otherwise I'm okay with a Gemma-specific solution.

BenjaminBossan · 2026-05-05T10:07:28Z

+            )
+            model = get_peft_model(model, config)
+            text_config = model.config.get_text_config()
+            text_config.num_key_value_heads = 999


I'm not 100% sure, but I think it works because the model is already initialized and the cache is already created at this point, so changing the config won't affect it. But I haven't checked the full code path.
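The general mechanism behind that guess can be shown with a toy class. This is not the transformers code path, which was not checked in this thread; `ToyAttention` is hypothetical and only demonstrates that a value copied from a config at construction time is unaffected by later edits to the config.

```python
from types import SimpleNamespace


class ToyAttention:
    """Hypothetical layer that reads the config once, at construction time."""

    def __init__(self, config):
        self.num_kv_heads = config.num_key_value_heads  # copied, not referenced


config = SimpleNamespace(num_key_value_heads=1)
layer = ToyAttention(config)

# Mutating the config afterwards, as the test does with 999.
config.num_key_value_heads = 999

print(layer.num_kv_heads, config.num_key_value_heads)
```

Code that reads the config lazily on every call would see the new value instead, so whether the real model behaves like this toy depends on where the attribute is read.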

zucchini-nlp · 2026-05-05T10:34:49Z

Yep, all good for me. The thing about bigger head dim seems to be a specific edge case :)

Change tolerances for Windows (!!!)

c020c11

Tests passed on Linux but not on Windows. Trying to guess tolerances that could work.

BenjaminBossan mentioned this pull request Apr 30, 2026

PrefixTuningConfig fails on google/gemma-4-e2b-it: tensor expand size mismatch in attention forward (peft 0.19.1, transformers 5.6.2) #3201

Closed

BenjaminBossan marked this pull request as ready for review May 4, 2026 09:02

zucchini-nlp approved these changes May 5, 2026

BenjaminBossan commented May 5, 2026

BenjaminBossan merged commit 17a7a16 into huggingface:main May 5, 2026
10 checks passed

BenjaminBossan deleted the fix-prefix-tuning-gemma4 branch May 5, 2026 10:44
