fix(merge_lora): silence false-positive '16-mixed' AMP warning on CPU by jbbqqf · Pull Request #2252 · Lightning-AI/litgpt

jbbqqf · 2026-05-23T12:50:34Z

Summary

When a user finetunes LoRA with precision: 16-mixed (a common default for GPU runs), litgpt merge_lora runs Fabric on CPU with that precision and Fabric prints:

You passed `Fabric(accelerator='cpu', precision='16-mixed')` but AMP with fp16 is not supported on CPU. Using `precision='bf16-mixed'` instead.

The warning is a false positive: the merge step only loads weights and immediately overrides their dtype with model.to(dtype=lora_dtype, device="cpu") (merge_lora.py:75), so the precision passed to Fabric has no effect on the saved checkpoint. The warning still surfaces to users who configure Fabric in their training script and don't know what's happening under the hood, as flagged in the original issue.

This PR maps "16-mixed" to "bf16-mixed" itself before calling L.Fabric(...), which matches what Fabric would do internally but avoids the noisy warning. All other precision values pass through unchanged, so their behaviour is unaffected.
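The remap described above can be sketched as a small pure function. This is only an illustration: the helper name `cpu_safe_precision` is hypothetical, and the actual change may well inline the check in merge_lora.py rather than add a function.

```python
from typing import Optional


def cpu_safe_precision(precision: Optional[str]) -> Optional[str]:
    """Return the precision to hand to Fabric when the accelerator is CPU.

    Hypothetical helper name, used here for illustration only.
    """
    # Fabric falls back from fp16 AMP to bf16 AMP on CPU and warns about it,
    # so making the same substitution up front only removes the warning.
    if precision == "16-mixed":
        return "bf16-mixed"
    # Every other value, including None, passes through untouched.
    return precision
```

Keeping the substitution in one place means the Fabric constructor never sees the value that triggers the warning, while values such as "bf16-true" or None reach it exactly as before.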

Reproduce BEFORE/AFTER yourself (copy-paste)

# In a fresh venv with: lightning torch huggingface-hub safetensors tokenizers tqdm jsonargparse

git clone https://github.com/Lightning-AI/litgpt.git litgpt-repro && cd litgpt-repro

# --- BEFORE (origin/main) ---
git checkout main
# Expected: 1 line containing "AMP with fp16 is not supported on CPU" printed by Fabric
python -c "
import tempfile, yaml, torch
from pathlib import Path
from litgpt.model import GPT
from litgpt.lora import GPT as LoRAGPT, lora_filter
from litgpt.scripts.merge_lora import merge_lora

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    base = tmp / 'base'; lora = tmp / 'lora'
    for d in (base, lora):
        d.mkdir()
        for f in ('lit_model.pth','model_config.yaml','tokenizer.json','tokenizer_config.json'):
            (d / f).touch()
    cfg = dict(block_size=128, padded_vocab_size=256, n_layer=3, n_head=8, n_embd=16)
    yaml.dump(cfg, open(base/'model_config.yaml','w'))
    yaml.dump(cfg, open(lora/'model_config.yaml','w'))
    torch.save(GPT.from_name('pythia-14m', **cfg).state_dict(), base/'lit_model.pth')
    lora_kwargs = dict(lora_r=8, lora_alpha=16, lora_dropout=0.0, lora_query=True, lora_value=True)
    m = LoRAGPT.from_name('pythia-14m', **cfg, **lora_kwargs)
    sd = {k:v for k,v in m.state_dict().items() if lora_filter(k,v)}
    torch.save(sd, lora/'lit_model.pth.lora')
    (lora/'lit_model.pth').unlink()
    hparams = dict(checkpoint_dir=str(base), precision='16-mixed', **lora_kwargs)
    yaml.dump(hparams, open(lora/'hyperparameters.yaml','w'))
    merge_lora(lora)
"

# --- AFTER (this branch) ---
git fetch origin pull/<this-pr>/head:fix-1242 && git checkout fix-1242
# Expected: NO "AMP with fp16 is not supported on CPU" line; merge completes silently
# (re-run the same Python snippet)

What I ran locally

Unit test under tests/test_merge_lora.py::test_merge_lora_downgrades_16_mixed_to_avoid_cpu_warning:

On origin/main: fails — the Fabric warning record matches "AMP with fp16 is not supported on CPU".
On this branch: passes — caplog captures no such record; lit_model.pth is written successfully.

The existing parametrised test_merge_lora (3 cases) still passes because the new branch only fires when precision == "16-mixed", which the existing tests don't exercise.
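The before/after assertion can be illustrated without Lightning installed. The sketch below uses a hypothetical stand-in, `fake_fabric_init`, and assumes the message is emitted through Python's `warnings` module; the PR's real test relies on pytest's `caplog` instead, so treat this as the shape of the check rather than the test itself.

```python
import warnings

FP16_CPU_MSG = "AMP with fp16 is not supported on CPU"


def fake_fabric_init(accelerator: str, precision: str) -> str:
    """Hypothetical stand-in for Fabric's precision check; not Lightning code."""
    if accelerator == "cpu" and precision == "16-mixed":
        warnings.warn(f"{FP16_CPU_MSG}. Using precision='bf16-mixed' instead.")
        return "bf16-mixed"
    return precision


def captured_fp16_warnings(precision: str) -> list:
    """Run the stand-in on CPU and return the matching warning messages."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, no deduplication
        fake_fabric_init("cpu", precision)
    return [str(w.message) for w in caught if FP16_CPU_MSG in str(w.message)]


# Before the fix: the raw value reaches the constructor and one warning is recorded.
before = captured_fp16_warnings("16-mixed")
# After the fix: the value has already been remapped, so nothing is recorded.
after = captured_fp16_warnings("bf16-mixed")
```

Filtering on the message substring keeps the check from failing on unrelated warnings that other libraries might emit during the merge.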

Edge cases

| Input `precision` from `hyperparameters.yaml` | Old behaviour | New behaviour |
| --- | --- | --- |
| `"16-mixed"` | Fabric warns + downgrades to `bf16-mixed` | Silently downgraded to `bf16-mixed` here |
| `"bf16-mixed"` | passthrough | passthrough (unchanged) |
| `"bf16-true"`, `"32-true"`, `"16-true"`, etc. | passthrough | passthrough (unchanged) |
| `None` (no precision in metadata) | passthrough → Fabric default | passthrough (unchanged) |
| User passes `--precision 16-mixed` explicitly on CLI | warning | also silenced: the merge still runs on CPU, so the same remap applies |

The downgrade applies to both the metadata-driven path (lora_precision from hyperparameters.yaml) and the CLI-driven path (--precision), because the warning depends only on the (precision, accelerator) pair and merge_lora always runs Fabric on CPU here.
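A minimal sketch of how both paths end up covered, assuming the CLI value takes precedence over the one recorded in hyperparameters.yaml. The function name `resolve_merge_precision` and that precedence order are assumptions for illustration, not confirmed details of merge_lora.py.

```python
from typing import Optional


def resolve_merge_precision(
    cli_precision: Optional[str], metadata_precision: Optional[str]
) -> Optional[str]:
    """Hypothetical sketch: choose the precision, then apply the CPU remap."""
    # Assumed precedence: an explicit --precision wins over hyperparameters.yaml.
    chosen = cli_precision if cli_precision is not None else metadata_precision
    # The remap runs after the choice, so it covers both sources.
    return "bf16-mixed" if chosen == "16-mixed" else chosen


# (cli value, metadata value, value handed to Fabric)
cases = [
    (None, "16-mixed", "bf16-mixed"),      # metadata-driven path
    ("16-mixed", "32-true", "bf16-mixed"),  # CLI-driven path
    (None, "bf16-true", "bf16-true"),      # passthrough
    (None, None, None),                    # no precision anywhere
]
results = [resolve_merge_precision(cli, meta) for cli, meta, _ in cases]
```

Applying the remap after the source is chosen, rather than on each source separately, keeps a single code path responsible for what Fabric receives.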

PR drafted with assistance from Claude Code (Anthropic). The change was reviewed manually against litgpt/scripts/merge_lora.py and tests/test_merge_lora.py. The reproducer block above is the one I used during development; reviewers can paste it verbatim.

…Lightning-AI#1242)

When a LoRA checkpoint records `precision: 16-mixed` (a common default for GPU finetune runs), `litgpt merge_lora` instantiates Fabric on CPU with that precision and Fabric emits:

"You passed `Fabric(accelerator='cpu', precision='16-mixed')` but AMP with fp16 is not supported on CPU. Using `precision='bf16-mixed'` instead."

The warning is a false positive: the merge step overrides the dtype with `model.to(dtype=lora_dtype, device='cpu')` immediately after loading, so the precision passed to Fabric has no effect on the saved checkpoint. The warning confuses users who configure Fabric in their training script and don't know what's happening under the hood (reported in Lightning-AI#1242).

Downgrade '16-mixed' to 'bf16-mixed' ourselves before constructing Fabric, matching what Fabric would do internally but without the warning. All other precision values pass through unchanged.

Add a regression test that exercises the fix by setting `precision: 16-mixed` in `hyperparameters.yaml` and asserting the Fabric warning is not captured by `caplog`.

jbbqqf requested review from andyland, k223kim, lianakoleva and t-vi as code owners May 23, 2026 12:50

jbbqqf closed this May 24, 2026

jbbqqf wants to merge 1 commit into Lightning-AI:main from jbbqqf:fix/1242-merge-lora-precision-warning
