fix(merge_lora): silence false-positive '16-mixed' AMP warning on CPU #2252

Closed
jbbqqf wants to merge 1 commit into Lightning-AI:main from jbbqqf:fix/1242-merge-lora-precision-warning

jbbqqf commented May 23, 2026

Summary

Fixes #1242.

When a user finetunes LoRA with precision: 16-mixed (a common default for GPU runs), litgpt merge_lora runs Fabric on CPU with that precision and Fabric prints:

You passed `Fabric(accelerator='cpu', precision='16-mixed')` but AMP with fp16 is not supported on CPU. Using `precision='bf16-mixed'` instead.

The warning is a false positive: the merge step only loads weights and immediately overrides their dtype with model.to(dtype=lora_dtype, device="cpu") (merge_lora.py:75), so the precision passed to Fabric has no effect on the saved checkpoint. The warning still surfaces to users who configure Fabric in their training script and don't know what's happening under the hood, as flagged in the original issue.

This PR maps "16-mixed" to "bf16-mixed" before calling L.Fabric(...), matching what Fabric would do internally but without the noisy warning. All other precision values pass through unchanged, so behaviour is preserved.
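The mapping can be illustrated with a minimal sketch. This is not the actual diff: the helper name `cpu_safe_precision` is hypothetical, and the commented-out call site assumes that `merge_lora` constructs Fabric with `accelerator="cpu"`.

```python
from typing import Optional


def cpu_safe_precision(precision: Optional[str]) -> Optional[str]:
    """Hypothetical helper: replace fp16 AMP with bf16 AMP for CPU-only use.

    This mirrors the fallback named in the Fabric warning quoted above;
    every other value (including None) is returned unchanged.
    """
    if precision == "16-mixed":
        return "bf16-mixed"
    return precision


# Assumed call site, shown as a comment because it needs lightning installed:
# fabric = L.Fabric(devices=1, precision=cpu_safe_precision(precision), accelerator="cpu")
```

Keeping the mapping in one small function would make the "only `16-mixed` changes" guarantee easy to check in isolation.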

Reproduce BEFORE/AFTER yourself (copy-paste)

# In a fresh venv with: lightning torch huggingface-hub safetensors tokenizers tqdm jsonargparse

git clone https://github.com/Lightning-AI/litgpt.git litgpt-repro && cd litgpt-repro

# --- BEFORE (origin/main) ---
git checkout main
# Expected: 1 line containing "AMP with fp16 is not supported on CPU" printed by Fabric
python -c "
import warnings, tempfile, shutil, yaml, torch
from pathlib import Path
from litgpt.model import GPT
from litgpt.lora import GPT as LoRAGPT, lora_filter
from litgpt.scripts.merge_lora import merge_lora

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    base = tmp / 'base'; lora = tmp / 'lora'
    for d in (base, lora):
        d.mkdir()
        for f in ('lit_model.pth','model_config.yaml','tokenizer.json','tokenizer_config.json'):
            (d / f).touch()
    cfg = dict(block_size=128, padded_vocab_size=256, n_layer=3, n_head=8, n_embd=16)
    yaml.dump(cfg, open(base/'model_config.yaml','w'))
    yaml.dump(cfg, open(lora/'model_config.yaml','w'))
    torch.save(GPT.from_name('pythia-14m', **cfg).state_dict(), base/'lit_model.pth')
    lora_kwargs = dict(lora_r=8, lora_alpha=16, lora_dropout=0.0, lora_query=True, lora_value=True)
    m = LoRAGPT.from_name('pythia-14m', **cfg, **lora_kwargs)
    sd = {k:v for k,v in m.state_dict().items() if lora_filter(k,v)}
    torch.save(sd, lora/'lit_model.pth.lora')
    (lora/'lit_model.pth').unlink()
    hparams = dict(checkpoint_dir=str(base), precision='16-mixed', **lora_kwargs)
    yaml.dump(hparams, open(lora/'hyperparameters.yaml','w'))
    merge_lora(lora)
"

# --- AFTER (this branch) ---
git fetch origin pull/<this-pr>/head:fix-1242 && git checkout fix-1242
# Expected: NO "AMP with fp16 is not supported on CPU" line; merge completes silently
# (re-run the same Python snippet)

What I ran locally

Unit test under tests/test_merge_lora.py::test_merge_lora_downgrades_16_mixed_to_avoid_cpu_warning:

  • On origin/main: fails — the Fabric warning record matches "AMP with fp16 is not supported on CPU".
  • On this branch: passes — caplog captures no such record; lit_model.pth is written successfully.

The existing parametrised test_merge_lora (3 cases) still passes because the new branch only fires when precision == "16-mixed", which the existing tests don't exercise.
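The shape of the before/after check can be approximated with the standard library alone. The sketch below is not the test in this PR: `fake_fabric_init` is a hypothetical stand-in for Fabric construction, and it records warnings with `warnings.catch_warnings` rather than pytest's `caplog`, purely so the example runs without lightning or pytest installed.

```python
import warnings

FP16_CPU_MSG = "AMP with fp16 is not supported on CPU"


def fake_fabric_init(accelerator: str, precision: str) -> str:
    """Hypothetical stand-in that warns and falls back like the quoted message."""
    if accelerator == "cpu" and precision == "16-mixed":
        warnings.warn(f"{FP16_CPU_MSG}. Using `precision='bf16-mixed'` instead.")
        return "bf16-mixed"
    return precision


def fp16_cpu_warnings(precision: str) -> list:
    """Return the recorded warnings that match the false-positive message."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        fake_fabric_init("cpu", precision)
    return [w for w in caught if FP16_CPU_MSG in str(w.message)]


# "Before": the raw value reaches the constructor and one warning is recorded.
before = fp16_cpu_warnings("16-mixed")
# "After": the value was already mapped to bf16-mixed, so nothing is recorded.
after = fp16_cpu_warnings("bf16-mixed")
```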

Edge cases

| Input precision from hyperparameters.yaml | Old behaviour | New behaviour |
| --- | --- | --- |
| "16-mixed" | Fabric warns + downgrades to bf16-mixed | Silently downgraded to bf16-mixed here |
| "bf16-mixed" | passthrough | passthrough (unchanged) |
| "bf16-true", "32-true", "16-true", etc. | passthrough | passthrough (unchanged) |
| None (no precision in metadata) | passthrough → Fabric default | passthrough (unchanged) |
| User passes --precision 16-mixed explicitly on CLI | | warning also silenced: same physics, same intent |

The downgrade applies to both the metadata-driven path (lora_precision from hyperparameters.yaml) and the CLI-driven path (--precision), because the warning is purely a function of (precision, accelerator) and merge_lora is always CPU-bound here.
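How the two paths converge can be sketched as follows. The function name `resolve_merge_precision` is hypothetical, and the precedence (an explicit `--precision` wins over the value stored in `hyperparameters.yaml`) is an assumption about `merge_lora`, not something this description states.

```python
from typing import Optional


def resolve_merge_precision(
    cli_precision: Optional[str], metadata_precision: Optional[str]
) -> Optional[str]:
    """Hypothetical sketch: choose a precision, then apply the CPU-safe mapping."""
    # Assumed precedence: the CLI value overrides the metadata value when given.
    chosen = cli_precision if cli_precision is not None else metadata_precision
    # The mapping runs after the choice, so it covers both paths.
    return "bf16-mixed" if chosen == "16-mixed" else chosen


# (cli value, metadata value) -> precision that would be handed to Fabric
cases = {
    (None, "16-mixed"): "bf16-mixed",        # metadata-driven path
    ("16-mixed", "bf16-true"): "bf16-mixed",  # CLI-driven path
    (None, "bf16-true"): "bf16-true",         # passthrough
    (None, None): None,                       # falls through to Fabric's default
}
results = {args: resolve_merge_precision(*args) for args in cases}
```

Applying the mapping after the CLI/metadata choice is what lets a single branch handle both rows of the table that mention "16-mixed".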


PR drafted with assistance from Claude Code (Anthropic). The change was reviewed manually against litgpt/scripts/merge_lora.py and tests/test_merge_lora.py. The reproducer block above is the one I used during development; reviewers can paste it verbatim.

…Lightning-AI#1242)

When a LoRA checkpoint records `precision: 16-mixed` (a common default for
GPU finetune runs), `litgpt merge_lora` instantiates Fabric on CPU with that
precision and Fabric emits:

  "You passed `Fabric(accelerator='cpu', precision='16-mixed')` but AMP with
   fp16 is not supported on CPU. Using `precision='bf16-mixed'` instead."

The warning is a false positive: the merge step overrides the dtype with
`model.to(dtype=lora_dtype, device='cpu')` immediately after loading, so the
precision passed to Fabric has no effect on the saved checkpoint. The warning
confuses users who configure Fabric in their training script and don't know
what's happening under the hood (reported in Lightning-AI#1242).

Downgrade '16-mixed' to 'bf16-mixed' ourselves before constructing Fabric,
matching what Fabric would do internally but without the warning. All other
precision values pass through unchanged.

Add a regression test that exercises the fix by setting
`precision: 16-mixed` in `hyperparameters.yaml` and asserting the Fabric
warning is not captured by `caplog`.

Development

Successfully merging this pull request may close these issues.

False positive warning about mixed precision in merge_lora.py