[Bug] DPA-2.4-7M checkpoint cannot be loaded by deepmd-kit v3.1.3 (registered in pretrained registry but incompatible) #5444

@SchrodingersCattt

Summary

DPA-2.4-7M is registered in the v3.1.3 pretrained model registry (deepmd/pretrained/registry.py, added via PR #5307), but deepmd-kit 3.1.3 cannot actually load this checkpoint. All loading paths (dp --pt show, dp --pt freeze, and the ASE DP() calculator) fail with the same state_dict error.

Error

RuntimeError: Error(s) in loading state_dict for ModelWrapper:
    Missing key(s) in state_dict:
      "model.Default.atomic_model.descriptor.repinit.type_embd_data",
      "model.Default.atomic_model.descriptor.repinit_three_body.type_embd_data",
      "model.Default.atomic_model.descriptor.repinit_three_body.compress_info.0",
      "model.Default.atomic_model.descriptor.repinit_three_body.compress_data.0"

Reproduction

Environment: registry.dp.tech/dptech/deepmd-kit:3.1.3
deepmd-kit: 3.1.3
PyTorch: 2.10.0
GPU: NVIDIA RTX 4090 (issue is not GPU-related)

Model file: dpa-2.4-7M.pt (80.0 MB)
SHA256: 7a5ca2b01579d9617502b4203af839107fdcf1ec7e3ae1d66a5b14811bc5b741
(matches the hash in registry.py; confirmed to be the same file)

Download source tested: https://bohrium.oss-cn-zhangjiakou.aliyuncs.com/13756/27666/store/upload/cd12300a-d3e6-4de9-9783-dd9899376cae/dpa-2.4-7M.pt
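
For anyone who wants to repeat the hash check above, here is a minimal Python sketch using only the standard library. The function names `sha256_of_file` and `matches_expected` are made up for this illustration and are not part of deepmd-kit. The expected digest is the one quoted in this issue.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so an 80 MB checkpoint is not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest quoted in this issue for dpa-2.4-7M.pt.
EXPECTED_SHA256 = "7a5ca2b01579d9617502b4203af839107fdcf1ec7e3ae1d66a5b14811bc5b741"


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's digest equals `expected` (hex, case-insensitive)."""
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_expected("dpa-2.4-7M.pt")` on the downloaded file should return True if the download is the same file described here.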

Steps to reproduce

# Each of these three paths fails:

# 1. dp show
dp --pt show dpa-2.4-7M.pt model-branch
# → RuntimeError: Missing key(s) in state_dict ...

# 2. dp freeze
dp --pt freeze -c dpa-2.4-7M.pt -o frozen.pth --head Omat24
# → same RuntimeError

# 3. ASE calculator
python3 -c "from deepmd.calculator import DP; calc = DP('dpa-2.4-7M.pt', head='Omat24')"
# → same RuntimeError

Control: DPA-3.1-3M works fine

In the same environment, DPA-3.1-3M loads, shows branches, freezes, and runs inference without any issues:

dp --pt show DPA-3.1-3M.pt model-branch   # ✅ lists 31 branches
dp --pt freeze -c DPA-3.1-3M.pt -o frozen_dpa3.pth --head Omat24  # ✅ 18.0 MB

Analysis

The error occurs at deepmd/pt/infer/deep_eval.py:173:

self.dp.load_state_dict(state_dict)

The current ModelWrapper definition expects four keys that do not exist in the DPA-2.4-7M checkpoint: the type-embedding data for repinit and repinit_three_body, and the compression info and data for repinit_three_body. This suggests that the checkpoint was saved with an older model definition, and that the model class shipped in deepmd-kit 3.1.3 includes these fields without a backward-compatible loading path (e.g., strict=False with default initialization, or a checkpoint migration step).
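
The key mismatch described above can be checked independently of the loader by comparing two sets of key names. The sketch below works on plain strings and does not need PyTorch. How the two key collections are obtained from deepmd-kit objects is not shown and would have to be worked out separately. `diff_state_dict_keys` is a hypothetical helper name, and the toy keys at the bottom are invented for the demonstration.

```python
from typing import Iterable, List, Tuple


def diff_state_dict_keys(
    expected_keys: Iterable[str],
    checkpoint_keys: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Compare key names and return (missing, unexpected), each sorted.

    missing:    keys the model expects but the checkpoint lacks
    unexpected: keys the checkpoint has but the model does not expect
    """
    expected = set(expected_keys)
    found = set(checkpoint_keys)
    return sorted(expected - found), sorted(found - expected)


if __name__ == "__main__":
    # Toy key names, invented for this demonstration only.
    expected = ["layer.weight", "layer.type_embd_data"]
    saved = ["layer.weight", "layer.legacy_buffer"]
    missing, unexpected = diff_state_dict_keys(expected, saved)
    print("missing:", missing)
    print("unexpected:", unexpected)
```

In this report only missing keys were listed in the error, which is consistent with fields having been added to the model class after the checkpoint was written.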

Suggested Fix

One or more of:

  1. Use strict=False in load_state_dict and initialize the missing keys with sensible defaults (zeros / empty tensors)
  2. Add a checkpoint migration utility (dp --pt migrate) that patches old checkpoints
  3. Re-export DPA-2.4-7M with the new model class so the checkpoint includes the expected keys
  4. Add version metadata to checkpoints and auto-migrate on load
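
As a rough illustration of options 1 and 2, the sketch below fills in absent keys in a checkpoint mapping before it is handed to a strict loader. It is written against plain Python mappings, not against deepmd-kit's actual checkpoint layout. `patch_missing_keys` and the default factories are hypothetical. Whether zero or empty defaults are correct for type_embd_data, compress_info and compress_data is an open question that the maintainers would need to confirm.

```python
from typing import Any, Callable, Dict, List, Mapping, Tuple


def patch_missing_keys(
    checkpoint: Mapping[str, Any],
    default_factories: Mapping[str, Callable[[], Any]],
) -> Tuple[Dict[str, Any], List[str]]:
    """Return a patched copy of `checkpoint` plus the list of keys added.

    Keys already present are never overwritten. The input mapping is not
    modified. Each default is built lazily by calling its factory.
    """
    patched = dict(checkpoint)  # shallow copy, original stays untouched
    added: List[str] = []
    for key, make_default in default_factories.items():
        if key not in patched:
            patched[key] = make_default()
            added.append(key)
    return patched, added


if __name__ == "__main__":
    # Toy data: one key present, one absent (assumed names, not real ones).
    old_checkpoint = {"layer.weight": [1.0, 2.0]}
    factories = {
        "layer.weight": lambda: [0.0, 0.0],   # present, so left unchanged
        "layer.type_embd_data": lambda: [],   # absent, so filled with default
    }
    new_checkpoint, added = patch_missing_keys(old_checkpoint, factories)
    print("added keys:", added)
```

For comparison, torch.nn.Module.load_state_dict with strict=False returns an object with missing_keys and unexpected_keys attributes, so a loader that takes option 1 could log those names instead of skipping them silently.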

Impact

  • Users cannot use DPA-2.4-7M with deepmd-kit 3.1.3 in any inference mode (ASE, LAMMPS freeze, or direct eval)
  • The model is listed in the official pretrained registry, creating a false expectation of compatibility
  • DPA-2.4-7M has domain-specific heads (e.g., H2O_H2O_PD, Electrolyte) that are not available in DPA-3.x models

Investigated and reported by MatMaster (AI agent for computational materials science)
Co-authored-by: @SchrodingersCattt
