
GPT-OSS 120B HF load fails with "Missing key in checkpoint state_dict: lm_head.weight" #1952

@jqwang2373

Description


Summary

Loading openai/gpt-oss-120b with the current NeMo AutoModel source fails during base checkpoint load with:

RuntimeError: Missing key in checkpoint state_dict: lm_head.weight.

The run gets past:

container startup
distributed initialization
the transformers==5.5.0 version check
the HybridEPBuffer availability check
and then fails while loading the base HF checkpoint.

This does not look like a HybridEP installation issue anymore. It looks like a GPT-OSS checkpoint loading / state-dict adaptation issue.

Environment

NeMo AutoModel source version: 0.4.0+c9d1aa03
Runtime container: nvcr.io/nvidia/nemo-automodel:26.02.00 (converted to sqsh)
Hardware: GB200
CUDA: 13.0
PyTorch: 2.10.0a0+b558c986e8.nv25.11
Transformers: 5.5.0
Note: the stock 26.02.00 container ships with transformers==5.0.0, which caused an earlier gemma4 import failure. After upgrading to transformers==5.5.0, that issue is resolved; the next blocker is the lm_head.weight error reported here.

Reproduction
I reproduced this with the GPT-OSS 120B recipe on 8 nodes x 4 GPUs (WORLD_SIZE=32) using:

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: openai/gpt-oss-120b
  backend:
    _target_: nemo_automodel.components.models.common.BackendConfig
    dispatcher: hybridep
    experts: torch_mm
    enable_hf_state_dict_adapter: true
    enable_fsdp_optimizations: true
distributed:
  strategy: fsdp2
  tp_size: 1
  cp_size: 1
  pp_size: 1
  ep_size: 32
The job gets all the way to:

> initializing torch distributed with 32 workers.
NCCL version 2.28.8+cuda13.0
recipe: TrainFinetuneRecipeForNextTokenPrediction
and then fails during base checkpoint load.

Actual Error
The first relevant failure looks like this:

File "/opt/Automodel/nemo_automodel/_transformers/infrastructure.py", line 522, in apply_model_infrastructure
    checkpointer.load_base_model(
File "/opt/Automodel/nemo_automodel/components/checkpoint/checkpointing.py", line 701, in load_base_model
    self.load_model(
...
RuntimeError: Missing key in checkpoint state_dict: lm_head.weight.
I then see repeated distributed checkpoint failures like:

torch.distributed.checkpoint.api.CheckpointException
RuntimeError: Missing key in checkpoint state_dict: lm_head.weight.
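The error message pattern is consistent with a strict key check during state-dict load. A minimal illustration of that failure mode (this is a sketch, not the actual NeMo AutoModel code):

```python
def check_state_dict_keys(model_keys, checkpoint_keys):
    """Raise in the same style as the reported failure when a parameter
    expected by the model is absent from the loaded checkpoint."""
    for key in sorted(model_keys):
        if key not in checkpoint_keys:
            raise RuntimeError(f"Missing key in checkpoint state_dict: {key}.")

# The reported situation: the model expects lm_head.weight, but the
# state dict handed to the loader no longer contains it.
try:
    check_state_dict_keys({"lm_head.weight"}, {"model.embed_tokens.weight"})
except RuntimeError as err:
    print(err)  # Missing key in checkpoint state_dict: lm_head.weight.
```

If something like this is what raises, the question is which stage removed the key from `checkpoint_keys`, since the source checkpoint index still lists it.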
Additional Debugging
I verified the following:

transformers==5.5.0 is active during the run.
HybridEPBuffer is importable in the container, so this is not the earlier HybridEP installation problem.
HF config for openai/gpt-oss-120b reports:
"tie_word_embeddings": false
The HF checkpoint index still contains lm_head.weight (so the source checkpoint does appear to have that key).
nemo_automodel/components/models/gpt_oss/state_dict_adapter.py currently dequantizes tensors and applies key mapping, but does not appear to have any explicit lm_head.weight compatibility handling.
Other adapters (for example qwen2) do explicitly handle missing / tied lm_head.weight.
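The index check above can be scripted against `model.safetensors.index.json`, which maps each parameter name to its shard file. A minimal sketch (the `weight_map` fragment and shard filenames below are illustrative placeholders, not the real gpt-oss-120b index):

```python
import json

# Illustrative fragment of a sharded-checkpoint index; the real index
# for gpt-oss-120b contains many more entries.
index_json = """
{
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00015.safetensors",
    "lm_head.weight": "model-00015-of-00015.safetensors"
  }
}
"""

def index_has_key(index_text: str, key: str) -> bool:
    """Check whether a sharded-checkpoint index lists a parameter name."""
    return key in json.loads(index_text)["weight_map"]

print(index_has_key(index_json, "lm_head.weight"))  # True
```

Running the equivalent check against the downloaded HF index is what confirms the source checkpoint really carries `lm_head.weight`, so the key is lost later in the pipeline.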
Suspected Area
This looks like it may be in one of these code paths:

nemo_automodel/components/models/gpt_oss/state_dict_adapter.py
nemo_automodel/components/checkpoint/checkpointing.py
nemo_automodel/components/checkpoint/utils.py
In particular, it seems possible that lm_head.weight is present in the HF checkpoint but gets dropped or not materialized correctly somewhere between:

HF checkpoint load
GPT-OSS state-dict adaptation
distributed/full-state load into the model
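One plausible bug class in the adaptation step: a key-mapping adapter that only emits mapped keys silently drops anything without a mapping entry. A hypothetical sketch of the pass-through behavior the fix would need (function and key names here are illustrative, not the actual NeMo API; note that since `tie_word_embeddings` is false for GPT-OSS, the key must be preserved from the checkpoint rather than re-tied to the embeddings):

```python
def adapt_hf_state_dict(hf_state_dict: dict, key_map: dict) -> dict:
    """Hypothetical key-mapping adapter: rename keys per key_map, and fall
    back to the identity mapping so unmapped keys such as lm_head.weight
    pass through instead of being dropped."""
    return {key_map.get(k, k): v for k, v in hf_state_dict.items()}

hf_sd = {
    "model.embed_tokens.weight": "embed-tensor",
    "lm_head.weight": "head-tensor",
}
adapted = adapt_hf_state_dict(
    hf_sd, {"model.embed_tokens.weight": "embedding.weight"}
)
print("lm_head.weight" in adapted)  # True
```

If the GPT-OSS adapter instead builds its output only from explicitly mapped keys, `lm_head.weight` would vanish exactly as observed.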
Why this seems GPT-OSS-specific
This does not look like a generic container or cluster issue:

distributed setup succeeds
HF download succeeds
transformers==5.5.0 fixes the earlier gemma4 import problem
failure is specifically about lm_head.weight during GPT-OSS checkpoint load
Question
Is this a known issue in the GPT-OSS custom loading path?

If not, would you prefer a follow-up PR that adds explicit lm_head.weight handling for GPT-OSS in the HF adapter / checkpoint compatibility path?
