Skip to content

add DeePMD property tools#5471

Open
zhaiwenxi wants to merge 3 commits into
deepmodeling:masterfrom
zhaiwenxi:add-property-tools
Open

add DeePMD property tools#5471
zhaiwenxi wants to merge 3 commits into
deepmodeling:masterfrom
zhaiwenxi:add-property-tools

Conversation

@zhaiwenxi
Copy link
Copy Markdown

@zhaiwenxi zhaiwenxi commented May 28, 2026

Summary

  • Add deepmd_property_tools for molecular property training and prediction.
  • Add modern pyproject.toml packaging with a CLI entry point.
  • Add pytest-based unit tests and GitHub Actions workflow.
  • Add small demo DATA files for examples.

Tests

  • python -m pytest tests -v

Summary by CodeRabbit

  • New Features

    • Added property prediction and training tools for molecular property modeling via Python API and command-line interface.
    • Introduced train and predict CLI commands for workflow automation.
    • Added support for molecular structure files (MOL format) in training and prediction pipelines.
    • Included comprehensive documentation on DPA3 fine-tuning hyperparameters and package usage.
  • Tests

    • Added unit test suite covering CLI, training, prediction, and data handling workflows.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

📝 Walkthrough

Walkthrough

This PR adds deepmd_property_tools, a new package providing end-to-end workflows for training and inference on molecular property prediction. It includes data preparation from CSV+MOL or direct coordinates, DeepMD dataset generation, fine-tuning orchestration, inference pipelines, CLI interface, comprehensive tests, and example scripts.

Changes

Property Tools Training and Inference

Layer / File(s) Summary
MOL file parsing and record building
deepmd/deepmd_property_tools/deepmd_property_tools/data/mol.py, tests/test_mol.py
Element lookup, property value parsing, MOL coordinate extraction with fixed-width and whitespace-delimited format support, record builders for CSV/MOL and direct data with overlap detection and zero-filtering.
Data conversion and DeepMD frame building
deepmd/deepmd_property_tools/deepmd_property_tools/data/converter.py, data/__init__.py
Converts parsed records into dpdata.LabeledSystem frames, writes train/valid mixed-npy directories, generates DeepMD input JSON, tracks failed rows, and returns comprehensive conversion metadata.
Configuration loading and defaults
deepmd/deepmd_property_tools/deepmd_property_tools/config/config_handler.py, config/default.json, config/__init__.py
JSON configuration persistence with deep recursive merging, default DPA3 model/training/loss settings for property prediction fine-tuning.
PropertyTrain high-level interface
deepmd/deepmd_property_tools/deepmd_property_tools/train.py
Validates training arguments, derives step counts from epochs, resolves finetune weights via WeightHub, merges batch/metric configurations, orchestrates DataHub + Trainer pipeline.
Trainer subprocess and checkpoint management
deepmd/deepmd_property_tools/deepmd_property_tools/tasks/trainer.py, tasks/__init__.py, tests/test_trainer.py
Executes DeepMD training (single-process or torchrun multi-process), handles optional model freezing, discovers and selects newest checkpoints by modification time.
PropertyPredict initialization and config resolution
deepmd/deepmd_property_tools/deepmd_property_tools/predict.py
Resolves model paths (preferring frozen models), loads type_map and property_name from property_tools_config.json, validates required parameters.
Predictor model evaluation and output
deepmd/deepmd_property_tools/deepmd_property_tools/predictor.py, tests/test_predict.py
Standardizes atoms/coordinates into padded float32/int32 tensors, evaluates PropertyModel, writes predictions to CSV with auto-incremented run indices and multi-output support.
PropertyModel and DeepProperty wrapper
deepmd/deepmd_property_tools/deepmd_property_tools/models/property_model.py, models/__init__.py
Lightweight wrapper around DeepMD DeepProperty with no_jit=True, exposing unified eval() interface.
DataHub train/inference mode dispatcher
deepmd/deepmd_property_tools/deepmd_property_tools/data/datahub.py
Unified entry point that branches to training (via prepare_property_data) or inference (predict_records_from_data) based on is_train flag.
CLI entrypoint and utility modules
deepmd/deepmd_property_tools/deepmd_property_tools/cli.py, utils/, tests/test_cli.py
Argparse-based CLI with train and predict subcommands; logging setup, metrics (MAE/RMSE), and directory utilities.
WeightHub pretrained model resolution
deepmd/deepmd_property_tools/deepmd_property_tools/weights/weighthub.py, weights/__init__.py
Registry-based model name resolution supporting local paths and built-in model lookup with caching.
Package initialization and build config
deepmd/deepmd_property_tools/deepmd_property_tools/__init__.py, pyproject.toml, MANIFEST.in
Main package re-exports, PEP 621 metadata with deepmd-kit[torch]==3.1.3 dependency, console script entry, JSON data inclusion.
Unit tests for core components
deepmd/deepmd_property_tools/tests/test_*.py
Test coverage for CLI argument handling, config deep-merge, MOL parsing robustness, checkpoint selection, subprocess command generation, and training parameter validation.
Example training and prediction scripts
deepmd/deepmd_property_tools/train_property_20.py, predict_property_20.py
Runnable end-to-end workflows demonstrating DPA3 fine-tuning with learning-rate scheduling and subsequent inference on test data.
Molecular structure test data (40 MOL files)
deepmd/deepmd_property_tools/DATA/mol_convert/id*.mol
RDKit V2000 MOL format structures (17–57 atoms) with 3D coordinates and formal charges for validation and example workflows.
Package README and finetune hyperparameter guide
deepmd/deepmd_property_tools/README.md, DPA3_finetune_hyperparameters.md
User documentation covering installation, Python/CLI usage, data format expectations, and detailed guidance on inheriting vs customizing DPA3 parameters during fine-tuning.
GitHub Actions test workflow
.github/workflows/property_tools_tests.yml
CI pipeline running pytest on push/pull_request for changes under deepmd/deepmd_property_tools/ with Python 3.10 and dependencies.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

Examples

Suggested reviewers

  • njzjz
  • iProzd
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 2.94% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'add DeePMD property tools' directly and clearly describes the main change: the addition of a new property tools package to the DeePMD project.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (4)
deepmd/deepmd_property_tools/README.md (1)

15-20: ⚡ Quick win

Add a single-test pytest example for faster local iteration.

This section currently only shows running the full test set. Please also document a single-test invocation pattern so contributors follow the repo’s preferred quick loop.

As per coding guidelines, "**/tests/**/*.py: Use pytest for testing single test cases ... instead of full test suite (60+ minutes)".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/deepmd_property_tools/README.md` around lines 15 - 20, Update the "For
local development with tests" section in README.md to include a quick
single-test invocation example for faster iteration: add a short sentence
showing how to run an individual test file or a single test case using pytest by
specifying the test file path and optional test nodeid (using pytest's ::
notation) or by using -k to filter, and recommend using -q for concise output;
reference the repository's test pattern "**/tests/**/*.py" and place this
example immediately after the existing full-suite commands.
deepmd/deepmd_property_tools/deepmd_property_tools/data/mol.py (1)

264-281: ⚡ Quick win

Add strict=True to zip() for consistency with the length check.

The length validation on line 272 ensures matching lengths, but adding strict=True provides defense-in-depth and aligns with Ruff B905.

♻️ Proposed fix
-    for idx, (symbols, coords, target) in enumerate(zip(atoms, coordinates, targets)):
+    for idx, (symbols, coords, target) in enumerate(zip(atoms, coordinates, targets, strict=True)):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/deepmd_property_tools/deepmd_property_tools/data/mol.py` around lines
264 - 281, In records_from_direct_data, the zip over atoms, coordinates, and
targets should use strict=True to match the earlier length validation and
satisfy Ruff B905; update the enumerate(zip(atoms, coordinates, targets)) call
in the records_from_direct_data function to pass strict=True (i.e.,
enumerate(zip(atoms, coordinates, targets, strict=True))) so any length mismatch
raises immediately.
.github/workflows/property_tools_tests.yml (1)

18-18: ⚡ Quick win

Pin GitHub Actions to commit SHAs for supply-chain security.

The workflow uses unpinned action references (actions/checkout@v4, actions/setup-python@v5) which reference mutable tags. Pinning to specific commit SHAs prevents potential supply-chain attacks if tags are moved or compromised.

🔒 Pin actions to commit hashes
     steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11  # v4.1.1

       - name: Set up Python
-        uses: actions/setup-python@v5
+        uses: actions/setup-python@0a5c61591373683505ea898e09a3ea4f39ef2b9c  # v5.0.0
         with:

Also consider adding persist-credentials: false to the checkout step to prevent credential leakage:

-      - uses: actions/checkout@v4
+      - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11  # v4.1.1
+        with:
+          persist-credentials: false

Also applies to: 21-21

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/property_tools_tests.yml at line 18, Replace mutable
action tags with pinned commit SHAs and disable credential persistence: update
the checkout usage (currently referenced as actions/checkout@v4) to the
corresponding actions/checkout@<commit-sha> and update setup-python (currently
actions/setup-python@v5) to actions/setup-python@<commit-sha>; also add
persist-credentials: false to the checkout step to avoid credential leakage.
Locate the workflow steps that reference actions/checkout and
actions/setup-python and substitute the tags with the official commit SHAs from
the actions repositories, then add the persist-credentials: false key under the
checkout step.
deepmd/deepmd_property_tools/tests/test_predict.py (1)

60-65: 💤 Low value

Consider a more reliable approach for timestamp ordering.

Using time.sleep(0.01) to ensure file modification time differences can be fragile on filesystems with coarse timestamp resolution or under heavy system load.

Alternative approaches

Consider explicitly setting file modification times using os.utime() or Path.touch() with specific timestamps:

 def test_predict_directory_uses_latest_checkpoint(tmp_path: Path) -> None:
     old_checkpoint = tmp_path / "model.ckpt-1.pt"
     old_checkpoint.write_text("old", encoding="utf-8")
-    time.sleep(0.01)
     latest_checkpoint = tmp_path / "model.ckpt-2.pt"
     latest_checkpoint.write_text("new", encoding="utf-8")
+    # Ensure old_checkpoint has earlier mtime
+    import os
+    os.utime(old_checkpoint, (0, 0))
     (tmp_path / "property_tools_config.json").write_text(

Or rely on checkpoint numbering logic rather than file timestamps if the implementation supports it.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/deepmd_property_tools/tests/test_predict.py` around lines 60 - 65, The
test test_predict_directory_uses_latest_checkpoint is relying on time.sleep to
order old_checkpoint and latest_checkpoint which is fragile; replace the
sleep-based approach by explicitly setting file modification times (e.g., use
os.utime or Path.touch with a specified times argument) on old_checkpoint and
latest_checkpoint (or alternatively rely on checkpoint numbering logic if the
code under test uses filename semantics) so the test deterministically makes
model.ckpt-2.pt the latest; update the setup that writes "old"/"new" to ensure
the timestamps are set explicitly rather than waiting.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deepmd/deepmd_property_tools/DATA/mol_convert/id5.mol`:
- Around line 1-121: The file id5.mol contains the same molecule data as id4.mol
(same atoms/bonds/charges with only the title "id_5" changed); remove or replace
id5.mol with the intended unique molecule data (or update your dataset
generation/deduplication logic so files like id5.mol are not created as
duplicates of id4.mol). Search for the identifier/title "id_5" and the matching
atom/bond/charge blocks (e.g., the M  CHG block and the same bond list) to
locate the duplicate, then either swap in the correct .mol content or
deduplicate at generation time to avoid biased/duplicated examples.

In `@deepmd/deepmd_property_tools/deepmd_property_tools/data/datahub.py`:
- Around line 47-62: When is_train is True, ensure property_col is a non-None
str before calling prepare_property_data: in the DataHub constructor (or the
method where this code lives) validate that property_col is not None and raise a
clear ValueError if it is, or set a sensible default string (e.g., "Property")
before passing it to prepare_property_data; reference the symbols property_col,
is_train, and prepare_property_data to locate and fix the call site so
prepare_property_data always receives a str.

In `@deepmd/deepmd_property_tools/deepmd_property_tools/predictor.py`:
- Around line 55-67: The zip(...) usage silently truncates mismatched sequences
(e.g., atoms vs coordinates, rows vs y_pred, predict_cols vs pred) causing data
loss; before each zip, validate lengths explicitly and raise a clear ValueError
if they differ (e.g., check len(atoms) == len(coordinates), len(rows) ==
len(y_pred), len(predict_cols) == len(pred)), include identifying context in the
message, then proceed with the existing loops that populate coords/atom_types
and consume y_pred/pred so no samples/columns are silently dropped; apply these
checks around the blocks that reference atoms/coordinates and
rows/y_pred/predict_cols/pred.

In `@deepmd/deepmd_property_tools/deepmd_property_tools/tasks/trainer.py`:
- Around line 28-30: The code sets self.save_path (and other path attributes
like self.input_path) as Path(save_path) but later run() changes the working
directory, causing relative paths to resolve incorrectly; change the constructor
to immediately resolve all incoming path attributes (e.g., call
Path(save_path).resolve() and Path(input_path).resolve() for any similar fields
set around lines where save_path/input_path are assigned) so that self.save_path
and related attributes are absolute before run() mutates cwd, and ensure any
other places assigning paths (the blocks around lines 38-39, 44-60, 66-77, 88)
are updated to use resolved Paths as well.

In `@deepmd/deepmd_property_tools/predict_property_20.py`:
- Around line 13-17: The code currently hardcodes MODEL_PATH = ROOT /
"exp_property_20" / "model.ckpt-10.pt", which couples the script to one
checkpoint; change it to accept an experiment directory (or path) and let
PropertyPredict or its loader resolve the correct file (e.g., frozen_model.pth
or latest checkpoint). Modify the usage of MODEL_PATH and any instantiation of
PropertyPredict to accept a directory/path variable (e.g., exp_dir or model_dir)
instead of a specific filename, and update the existence check to verify the
directory and delegate selecting the actual checkpoint file to PropertyPredict
(or a helper like PropertyPredict.load_checkpoint) so the script works with
frozen_model.pth or the latest checkpoint automatically.

In `@deepmd/deepmd_property_tools/train_property_20.py`:
- Around line 38-40: The finetuning invocation sets finetune=PRETRAINED_MODEL
while disabling the pretrained alignment via use_pretrain_script=False, which
can produce descriptor/config mismatches; update the call that includes
finetune, PRETRAINED_MODEL and input_updates to enable the pretrained-structure
alignment by setting use_pretrain_script=True (or removing the override so the
default alignment runs) so the model descriptors/config match the pretrained
weights during load.

---

Nitpick comments:
In @.github/workflows/property_tools_tests.yml:
- Line 18: Replace mutable action tags with pinned commit SHAs and disable
credential persistence: update the checkout usage (currently referenced as
actions/checkout@v4) to the corresponding actions/checkout@<commit-sha> and
update setup-python (currently actions/setup-python@v5) to
actions/setup-python@<commit-sha>; also add persist-credentials: false to the
checkout step to avoid credential leakage. Locate the workflow steps that
reference actions/checkout and actions/setup-python and substitute the tags with
the official commit SHAs from the actions repositories, then add the
persist-credentials: false key under the checkout step.

In `@deepmd/deepmd_property_tools/deepmd_property_tools/data/mol.py`:
- Around line 264-281: In records_from_direct_data, the zip over atoms,
coordinates, and targets should use strict=True to match the earlier length
validation and satisfy Ruff B905; update the enumerate(zip(atoms, coordinates,
targets)) call in the records_from_direct_data function to pass strict=True
(i.e., enumerate(zip(atoms, coordinates, targets, strict=True))) so any length
mismatch raises immediately.

In `@deepmd/deepmd_property_tools/README.md`:
- Around line 15-20: Update the "For local development with tests" section in
README.md to include a quick single-test invocation example for faster
iteration: add a short sentence showing how to run an individual test file or a
single test case using pytest by specifying the test file path and optional test
nodeid (using pytest's :: notation) or by using -k to filter, and recommend
using -q for concise output; reference the repository's test pattern
"**/tests/**/*.py" and place this example immediately after the existing
full-suite commands.

In `@deepmd/deepmd_property_tools/tests/test_predict.py`:
- Around line 60-65: The test test_predict_directory_uses_latest_checkpoint is
relying on time.sleep to order old_checkpoint and latest_checkpoint which is
fragile; replace the sleep-based approach by explicitly setting file
modification times (e.g., use os.utime or Path.touch with a specified times
argument) on old_checkpoint and latest_checkpoint (or alternatively rely on
checkpoint numbering logic if the code under test uses filename semantics) so
the test deterministically makes model.ckpt-2.pt the latest; update the setup
that writes "old"/"new" to ensure the timestamps are set explicitly rather than
waiting.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 3b5f2302-3bde-4f08-abe0-e6307f9eed60

📥 Commits

Reviewing files that changed from the base of the PR and between a7c597d and db05969.

⛔ Files ignored due to path filters (1)
  • deepmd/deepmd_property_tools/DATA/dataset_demo.csv is excluded by !**/*.csv
📒 Files selected for processing (75)
  • .github/workflows/property_tools_tests.yml
  • deepmd/deepmd_property_tools/DATA/mol_convert/id0.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id1.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id10.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id11.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id12.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id13.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id14.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id15.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id16.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id17.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id18.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id19.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id2.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id20.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id21.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id22.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id23.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id24.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id25.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id26.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id27.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id28.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id29.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id3.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id30.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id31.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id32.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id33.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id34.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id35.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id36.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id37.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id38.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id39.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id4.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id5.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id6.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id7.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id8.mol
  • deepmd/deepmd_property_tools/DATA/mol_convert/id9.mol
  • deepmd/deepmd_property_tools/DPA3_finetune_hyperparameters.md
  • deepmd/deepmd_property_tools/MANIFEST.in
  • deepmd/deepmd_property_tools/README.md
  • deepmd/deepmd_property_tools/deepmd_property_tools/__init__.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/cli.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/config/__init__.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/config/config_handler.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/config/default.json
  • deepmd/deepmd_property_tools/deepmd_property_tools/data/__init__.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/data/converter.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/data/datahub.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/data/mol.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/models/__init__.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/models/property_model.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/predict.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/predictor.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/tasks/__init__.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/tasks/trainer.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/train.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/utils/__init__.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/utils/base_logger.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/utils/metrics.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/utils/util.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/weights/__init__.py
  • deepmd/deepmd_property_tools/deepmd_property_tools/weights/weighthub.py
  • deepmd/deepmd_property_tools/predict_property_20.py
  • deepmd/deepmd_property_tools/pyproject.toml
  • deepmd/deepmd_property_tools/tests/test_cli.py
  • deepmd/deepmd_property_tools/tests/test_config.py
  • deepmd/deepmd_property_tools/tests/test_mol.py
  • deepmd/deepmd_property_tools/tests/test_predict.py
  • deepmd/deepmd_property_tools/tests/test_train.py
  • deepmd/deepmd_property_tools/tests/test_trainer.py
  • deepmd/deepmd_property_tools/train_property_20.py

Comment on lines +1 to +121
id_5
RDKit 2D

57 56 0 0 0 0 0 0 0 0999 V2000
-2.2500 -1.2990 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-1.5000 0.0000 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.5000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.0000 0.0000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
3.7500 1.2990 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.5000 2.5981 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
6.0000 2.5981 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.5000 2.5981 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
9.0000 2.5981 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
9.7500 1.2990 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
9.7500 3.8971 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
7.5000 4.0981 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
8.7990 4.8481 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
6.3021 5.0008 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
7.5000 1.0981 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
8.7990 0.3481 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
6.3021 0.1953 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.0490 0.5490 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.0490 -0.9510 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.0490 -2.4510 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.0490 -3.9510 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3.7500 -4.7010 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
6.3481 -4.7010 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
3.5490 -2.4510 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
2.7990 -3.7500 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.6463 -1.2530 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
6.5490 -2.4510 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
7.2990 -3.7500 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
7.4518 -1.2530 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.4510 2.0490 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.4510 3.5490 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4510 5.0490 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4510 6.5490 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
1.1519 7.2990 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
3.7500 7.2990 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
3.9510 5.0490 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
4.7010 6.3481 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
4.8537 3.8511 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.9510 5.0490 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
0.2010 6.3481 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.0482 3.8511 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 1.5000 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-1.2990 2.2500 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1.1979 2.4028 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-0.0000 -1.5000 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-1.2990 -2.2500 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1.1979 -2.4028 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-2.2500 1.2990 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1.6828 -1.4888 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1.6828 1.4888 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
5.8172 1.1093 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
5.8172 4.0869 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
6.5379 -0.7681 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
3.5602 -0.7681 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
3.9398 3.3662 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
0.9621 3.3662 0.0000 H 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0
2 3 1 0
3 4 1 0
4 5 1 0
5 6 1 0
6 7 1 0
7 8 1 0
8 9 1 0
9 10 1 0
10 11 1 0
10 12 2 0
9 13 1 0
13 14 1 0
13 15 2 0
9 16 1 0
16 17 1 0
16 18 2 0
6 19 1 0
19 20 1 0
20 21 1 0
21 22 1 0
22 23 1 0
22 24 2 0
21 25 1 0
25 26 1 0
25 27 2 0
21 28 1 0
28 29 1 0
28 30 2 0
6 31 1 0
31 32 1 0
32 33 1 0
33 34 1 0
34 35 1 0
34 36 2 0
33 37 1 0
37 38 1 0
37 39 2 0
33 40 1 0
40 41 1 0
40 42 2 0
3 43 1 0
43 44 1 0
43 45 2 0
3 46 1 0
46 47 1 0
46 48 2 0
2 49 1 0
4 50 1 0
4 51 1 0
8 52 1 0
8 53 1 0
20 54 1 0
20 55 1 0
32 56 1 0
32 57 1 0
M CHG 8 2 1 10 1 11 -1 13 1 14 -1 16 1 17 -1 22 1
M CHG 8 23 -1 25 1 26 -1 28 1 29 -1 34 1 35 -1 37 1
M CHG 8 38 -1 40 1 41 -1 43 1 44 -1 46 1 47 -1 49 -1
M END
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

id5.mol duplicates id4.mol content under a different identifier.

The structure/bond/charge blocks here are effectively identical to deepmd/deepmd_property_tools/DATA/mol_convert/id4.mol (only title differs). If unintentional, this introduces duplicate examples that can bias splits and inflate evaluation confidence. Please replace this file with the intended molecule or deduplicate at dataset generation time.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/deepmd_property_tools/DATA/mol_convert/id5.mol` around lines 1 - 121,
The file id5.mol contains the same molecule data as id4.mol (same
atoms/bonds/charges with only the title "id_5" changed); remove or replace
id5.mol with the intended unique molecule data (or update your dataset
generation/deduplication logic so files like id5.mol are not created as
duplicates of id4.mol). Search for the identifier/title "id_5" and the matching
atom/bond/charge blocks (e.g., the M  CHG block and the same bond list) to
locate the duplicate, then either swap in the correct .mol content or
deduplicate at generation time to avoid biased/duplicated examples.

Comment on lines +47 to +62
if is_train:
self.result: PropertyDataResult | None = prepare_property_data(
data,
output_dir=self.save_path / "prepared_data",
input_out=self.save_path / "input.json",
property_name=property_name,
property_col=property_col,
train_ratio=train_ratio,
mol_dir=mol_dir,
mol_template=mol_template,
overlap_tol=overlap_tol,
seed=seed,
overwrite=overwrite,
numb_steps=numb_steps,
input_updates=input_updates,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Type mismatch: property_col can be None but prepare_property_data expects str.

When is_train=True, property_col is passed to prepare_property_data which has signature property_col: str = "Property". If DataHub is called directly with is_train=True and property_col=None, this would pass None where a str is expected.

Consider either:

  1. Validating that property_col is not None when is_train=True, or
  2. Changing the type annotation to str with a default value, since the inference path (context snippet from predict.py:67) explicitly passes property_col=None
🛡️ Option 1: Add validation
         if is_train:
+            if property_col is None:
+                raise ValueError("property_col is required for training")
             self.result: PropertyDataResult | None = prepare_property_data(
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if is_train:
self.result: PropertyDataResult | None = prepare_property_data(
data,
output_dir=self.save_path / "prepared_data",
input_out=self.save_path / "input.json",
property_name=property_name,
property_col=property_col,
train_ratio=train_ratio,
mol_dir=mol_dir,
mol_template=mol_template,
overlap_tol=overlap_tol,
seed=seed,
overwrite=overwrite,
numb_steps=numb_steps,
input_updates=input_updates,
)
if is_train:
if property_col is None:
raise ValueError("property_col is required for training")
self.result: PropertyDataResult | None = prepare_property_data(
data,
output_dir=self.save_path / "prepared_data",
input_out=self.save_path / "input.json",
property_name=property_name,
property_col=property_col,
train_ratio=train_ratio,
mol_dir=mol_dir,
mol_template=mol_template,
overlap_tol=overlap_tol,
seed=seed,
overwrite=overwrite,
numb_steps=numb_steps,
input_updates=input_updates,
)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/deepmd_property_tools/deepmd_property_tools/data/datahub.py` around
lines 47 - 62, When is_train is True, ensure property_col is a non-None str
before calling prepare_property_data: in the DataHub constructor (or the method
where this code lives) validate that property_col is not None and raise a clear
ValueError if it is, or set a sensible default string (e.g., "Property") before
passing it to prepare_property_data; reference the symbols property_col,
is_train, and prepare_property_data to locate and fix the call site so
prepare_property_data always receives a str.

Comment on lines +55 to +67
for frame_idx, (symbols, coord) in enumerate(zip(atoms, coordinates)):
if coord.shape != (len(symbols), 3):
raise ValueError(
f"coordinates shape mismatch at sample {frame_idx}: {coord.shape}"
)
for atom_idx, symbol in enumerate(symbols):
if symbol not in self.type_index:
raise ValueError(
f"Element {symbol!r} is not present in type_map {self.type_map}"
)
atom_types[frame_idx, atom_idx] = self.type_index[symbol]
coords[frame_idx, : len(symbols), :] = coord
return coords, atom_types
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prevent silent truncation when input lengths diverge.

These zip(...) loops can drop samples/columns without error when lengths mismatch (atoms vs coordinates, rows vs y_pred, or predict_cols vs pred). This can silently corrupt prediction outputs.

💡 Suggested fix
 def standardize(
     self, atoms: list[list[str]], coordinates: list[np.ndarray]
 ) -> tuple[np.ndarray, np.ndarray]:
     if not atoms:
         raise ValueError("No samples to predict")
+    if len(atoms) != len(coordinates):
+        raise ValueError(
+            f"atoms/coordinates length mismatch: {len(atoms)} != {len(coordinates)}"
+        )
     max_natoms = max(len(symbols) for symbols in atoms)
     coords = np.zeros((len(atoms), max_natoms, 3), dtype=np.float32)
     atom_types = np.full((len(atoms), max_natoms), -1, dtype=np.int32)
-    for frame_idx, (symbols, coord) in enumerate(zip(atoms, coordinates)):
+    for frame_idx, (symbols, coord) in enumerate(
+        zip(atoms, coordinates, strict=True)
+    ):
         if coord.shape != (len(symbols), 3):
             raise ValueError(
                 f"coordinates shape mismatch at sample {frame_idx}: {coord.shape}"
             )
@@
     with out_path.open("w", encoding="utf-8", newline="") as fp:
         writer = csv.DictWriter(fp, fieldnames=fieldnames)
         writer.writeheader()
-        for row, pred in zip(rows, y_pred):
+        if len(rows) != len(y_pred):
+            raise ValueError(
+                f"rows/prediction length mismatch: {len(rows)} != {len(y_pred)}"
+            )
+        for row, pred in zip(rows, y_pred, strict=True):
             out_row = dict(row)
-            for col, value in zip(predict_cols, pred):
+            for col, value in zip(predict_cols, pred, strict=True):
                 out_row[col] = float(value)
             writer.writerow(out_row)

Also applies to: 95-99

🧰 Tools
🪛 Ruff (0.15.14)

[warning] 55-55: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/deepmd_property_tools/deepmd_property_tools/predictor.py` around lines
55 - 67, The zip(...) usage silently truncates mismatched sequences (e.g., atoms
vs coordinates, rows vs y_pred, predict_cols vs pred) causing data loss; before
each zip, validate lengths explicitly and raise a clear ValueError if they
differ (e.g., check len(atoms) == len(coordinates), len(rows) == len(y_pred),
len(predict_cols) == len(pred)), include identifying context in the message,
then proceed with the existing loops that populate coords/atom_types and consume
y_pred/pred so no samples/columns are silently dropped; apply these checks
around the blocks that reference atoms/coordinates and
rows/y_pred/predict_cols/pred.

Comment on lines +28 to +30
self.save_path = Path(save_path)
self.finetune = finetune
self.nproc_per_node = nproc_per_node
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Resolve paths before changing working directory.

save_path/input_path can be relative, but run() changes cwd and still passes paths prefixed with self.save_path. This can mis-target files (e.g., nested .../exp_property/exp_property/out.json) and break default relative-path runs.

💡 Proposed fix
 class Trainer:
@@
-        self.save_path = Path(save_path)
+        self.save_path = Path(save_path).resolve()
@@
     def run(self, input_path: str | Path) -> None:
-        input_path = Path(input_path)
+        input_path = Path(input_path).resolve()
@@
                 train(
                     input_file=str(input_path),
@@
                     output=str(self.save_path / "out.json"),
                 )

Also applies to: 38-39, 44-60, 66-77, 88-88

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/deepmd_property_tools/deepmd_property_tools/tasks/trainer.py` around
lines 28 - 30, The code sets self.save_path (and other path attributes like
self.input_path) as Path(save_path) but later run() changes the working
directory, causing relative paths to resolve incorrectly; change the constructor
to immediately resolve all incoming path attributes (e.g., call
Path(save_path).resolve() and Path(input_path).resolve() for any similar fields
set around lines where save_path/input_path are assigned) so that self.save_path
and related attributes are absolute before run() mutates cwd, and ensure any
other places assigning paths (the blocks around lines 38-39, 44-60, 66-77, 88)
are updated to use resolved Paths as well.

Comment on lines +13 to +17
MODEL_PATH = ROOT / "exp_property_20" / "model.ckpt-10.pt"

if not MODEL_PATH.exists():
raise FileNotFoundError(f"Train first; checkpoint not found: {MODEL_PATH}")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid hardcoding a specific checkpoint filename.

model.ckpt-10.pt tightly couples this script to one training run. Passing the experiment directory is more robust and lets PropertyPredict pick frozen_model.pth or the latest checkpoint automatically.

Suggested patch
-MODEL_PATH = ROOT / "exp_property_20" / "model.ckpt-10.pt"
-
-if not MODEL_PATH.exists():
-    raise FileNotFoundError(f"Train first; checkpoint not found: {MODEL_PATH}")
+MODEL_PATH = ROOT / "exp_property_20"
+
+if not MODEL_PATH.exists():
+    raise FileNotFoundError(f"Train first; model directory not found: {MODEL_PATH}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
MODEL_PATH = ROOT / "exp_property_20" / "model.ckpt-10.pt"
if not MODEL_PATH.exists():
raise FileNotFoundError(f"Train first; checkpoint not found: {MODEL_PATH}")
MODEL_PATH = ROOT / "exp_property_20"
if not MODEL_PATH.exists():
raise FileNotFoundError(f"Train first; model directory not found: {MODEL_PATH}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/deepmd_property_tools/predict_property_20.py` around lines 13 - 17,
The code currently hardcodes MODEL_PATH = ROOT / "exp_property_20" /
"model.ckpt-10.pt", which couples the script to one checkpoint; change it to
accept an experiment directory (or path) and let PropertyPredict or its loader
resolve the correct file (e.g., frozen_model.pth or latest checkpoint). Modify
the usage of MODEL_PATH and any instantiation of PropertyPredict to accept a
directory/path variable (e.g., exp_dir or model_dir) instead of a specific
filename, and update the existence check to verify the directory and delegate
selecting the actual checkpoint file to PropertyPredict (or a helper like
PropertyPredict.load_checkpoint) so the script works with frozen_model.pth or
the latest checkpoint automatically.

Comment on lines +38 to +40
finetune=PRETRAINED_MODEL,
use_pretrain_script=False,
input_updates={
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enable pretrained-structure alignment when finetuning.

With finetune=PRETRAINED_MODEL, setting use_pretrain_script=False can cause descriptor/config mismatch against pretrained weights and make this demo fail on load.

Suggested patch
-    use_pretrain_script=False,
+    use_pretrain_script=True,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
finetune=PRETRAINED_MODEL,
use_pretrain_script=False,
input_updates={
finetune=PRETRAINED_MODEL,
use_pretrain_script=True,
input_updates={
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/deepmd_property_tools/train_property_20.py` around lines 38 - 40, The
finetuning invocation sets finetune=PRETRAINED_MODEL while disabling the
pretrained alignment via use_pretrain_script=False, which can produce
descriptor/config mismatches; update the call that includes finetune,
PRETRAINED_MODEL and input_updates to enable the pretrained-structure alignment
by setting use_pretrain_script=True (or removing the override so the default
alignment runs) so the model descriptors/config match the pretrained weights
during load.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant