Skip to content

feat(pt): full validation support lmdb format#5419

Merged
njzjz merged 2 commits intodeepmodeling:masterfrom
OutisLi:pr/val
Apr 26, 2026
Merged

feat(pt): full validation support lmdb format#5419
njzjz merged 2 commits intodeepmodeling:masterfrom
OutisLi:pr/val

Conversation

@OutisLi
Copy link
Copy Markdown
Collaborator

@OutisLi OutisLi commented Apr 25, 2026

Summary by CodeRabbit

  • New Features

    • Full validation now supports LMDB-backed validation datasets with snapshot caching and group-by-atom-count evaluation.
    • Datasets can register and expose data-requirement metadata for downstream tooling.
    • Added a view adapter to evaluate mixed atom-count validation frames group-by-group.
  • Refactor

    • Consolidated per-atom-count test-data wrapper into a shared utility.
  • Tests

    • Added unit and integration tests for mixed-atom-count views, data-requirement registration, and validation iteration.

Copilot AI review requested due to automatic review settings April 25, 2026 02:25
@dosubot dosubot Bot added the new feature label Apr 25, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds full-validation support for LMDB-formatted datasets in the PyTorch training pipeline, aligning LMDB validation behavior with existing dp-test mixed-nloc handling.

Changes:

  • Extend FullValidator to iterate validation “systems” from either loader-backed datasets or LMDB datasets (evaluating LMDB per nloc group).
  • Introduce reusable LmdbTestDataNlocView and a LmdbTestData.add_data_requirement() helper to mirror training-time requirements in full validation.
  • Replace the local LMDB nloc-view wrapper in dp test entrypoint with the shared implementation and export it in dpmodel.utils.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
deepmd/pt/train/validation.py Adds LMDB-aware iteration + cached LMDB test snapshot for full validation.
deepmd/entrypoints/test.py Switches dp test to use shared LmdbTestDataNlocView instead of a local wrapper.
deepmd/dpmodel/utils/lmdb_data.py Adds add_data_requirement() to LmdbTestData and introduces LmdbTestDataNlocView.
deepmd/dpmodel/utils/__init__.py Re-exports LmdbTestDataNlocView as part of the public utils API.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread deepmd/pt/train/validation.py
Comment thread deepmd/dpmodel/utils/lmdb_data.py
Comment thread deepmd/dpmodel/utils/lmdb_data.py
Comment thread deepmd/pt/train/validation.py Outdated
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 25, 2026

📝 Walkthrough

Walkthrough

Exports LmdbTestDataNlocView; adds LmdbDataReader.data_requirements property and LmdbTestData.add_data_requirement(); exposes reader properties on LmdbDataset; refactors FullValidator to cache a LmdbTestData snapshot and iterate mixed-nloc LMDB test-data via per-nloc LmdbTestDataNlocView; entrypoint uses the shared view; tests added for new APIs and mixed-nloc validation.

Changes

Cohort / File(s) Summary
LMDB utilities & exports
deepmd/dpmodel/utils/__init__.py, deepmd/dpmodel/utils/lmdb_data.py
Adds LmdbTestDataNlocView class; adds LmdbDataReader.data_requirements property; adds LmdbTestData.add_data_requirement(); exports LmdbTestDataNlocView.
Dataset surface area
deepmd/pt/utils/lmdb_dataset.py
Adds read-only LmdbDataset.type_map and LmdbDataset.data_requirements properties delegating to the underlying reader.
Validation flow
deepmd/pt/train/validation.py
Refactors FullValidator to detect LmdbDataset, build/cache a LmdbTestData snapshot, and yield LmdbTestDataNlocView instances per nloc (sorted) so full validation runs group-by-nloc.
Entrypoint cleanup
deepmd/entrypoints/test.py
Removes local _LmdbTestDataNlocView and uses the exported LmdbTestDataNlocView from utils.
Tests
source/tests/common/dpmodel/test_lmdb_data.py, source/tests/pt/test_validation.py
Adds unit tests for LmdbTestDataNlocView delegation and add_data_requirement parity; adds an integration test that builds a synthetic mixed-natoms LMDB and asserts per-nloc iteration and aggregated metrics.

Sequence Diagram(s)

sequenceDiagram
    participant FV as FullValidator
    participant VS as validation_data
    participant LMDB as LmdbDataset
    participant Reader as LmdbDataReader / LmdbTestData
    participant View as LmdbTestDataNlocView

    FV->>VS: iterate systems
    alt non-LMDB dataset
        VS-->>FV: yield dataset.data_system
        FV->>FV: _evaluate_system(data_system)
    else LMDB dataset
        VS-->>FV: detect LmdbDataset
        FV->>Reader: request or build cached LmdbTestData snapshot
        Reader-->>FV: return snapshot
        FV->>View: instantiate LmdbTestDataNlocView(snapshot, nloc) per nloc group
        View-->>FV: yield fixed-nloc test-system
        FV->>FV: _evaluate_system(View)
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • iProzd
  • wanghan-iapcm
  • njzjz
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding full validation support for LMDB format in the pt module.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
deepmd/pt/train/validation.py (1)

406-426: Avoid reaching into LmdbDataReader's private attributes.

reader._type_map and reader._data_requirements are implementation details of LmdbDataReader. type_map is already exposed as a property on the reader (see LmdbDataReader.type_map in deepmd/dpmodel/utils/lmdb_data.py), so the snapshot can use that directly. For _data_requirements, consider either adding a small accessor on LmdbDataReader (e.g. data_requirements property returning the values) or plumbing the original DataRequirementItem list through the caller, so this module doesn't rely on a name-mangled-ish private dict that may change shape (today it stores DataRequirementItem objects but LmdbDataReader._resolve_dtype also accepts plain dicts).

♻️ Suggested change (use the public property and add a small accessor)
-        reader = lmdb_dataset._reader
-        self._lmdb_test_data = LmdbTestData(
-            lmdb_dataset.lmdb_path,
-            type_map=list(reader._type_map),
-            shuffle_test=False,
-        )
-        data_requirements = list(reader._data_requirements.values())
+        reader = lmdb_dataset._reader
+        self._lmdb_test_data = LmdbTestData(
+            lmdb_dataset.lmdb_path,
+            type_map=list(reader.type_map),
+            shuffle_test=False,
+        )
+        data_requirements = list(reader._data_requirements.values())
         if data_requirements:
             self._lmdb_test_data.add_data_requirement(data_requirements)
         return self._lmdb_test_data

Plus, in deepmd/dpmodel/utils/lmdb_data.py, expose a read-only accessor for the data requirements:

     `@property`
     def nloc_groups(self) -> dict[int, list[int]]:
         """Nloc → list of frame indices."""
         return self._nloc_groups
+
+    `@property`
+    def data_requirements(self) -> list[DataRequirementItem]:
+        """Registered data requirements (order-preserving view)."""
+        return list(self._data_requirements.values())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt/train/validation.py` around lines 406 - 426, The method
_get_lmdb_test_data_snapshot reaches into LmdbDataset._reader private attrs
(reader._type_map and reader._data_requirements); change it to use the public
API: use reader.type_map (the existing property) and a public accessor for
requirements instead of _data_requirements — either call a new
LmdbDataReader.data_requirements property that returns the values or accept the
DataRequirementItem list from the caller and pass that into
LmdbTestData.add_data_requirement; update references in
_get_lmdb_test_data_snapshot to use reader.type_map and reader.data_requirements
(or the passed-in list) and remove direct access to
_type_map/_data_requirements.
deepmd/dpmodel/utils/lmdb_data.py (1)

1466-1486: Consider proxying nloc_groups to avoid leaking the other groups through the view.

Because __getattr__ forwards everything to _inner, view.nloc_groups returns the full mapping for every group in the underlying LmdbTestData, not just the one this view represents. Consumers that want to iterate a "single-system" view (e.g., for summary/logging) may be misled. If that shape ever matters for downstream code, override nloc_groups (and optionally get_natoms) to expose only self._nloc. Non-blocking nit.

♻️ Optional refinement
     def get_test(self) -> dict[str, Any]:
         return self._inner.get_test(nloc=self._nloc)
+
+    `@property`
+    def nloc_groups(self) -> dict[int, list[int]]:
+        return {self._nloc: self._inner.nloc_groups.get(self._nloc, [])}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/dpmodel/utils/lmdb_data.py` around lines 1466 - 1486, The view
currently forwards all attributes via __getattr__ so LmdbTestDataNlocView
exposes the full _inner.nloc_groups; override and implement a property
nloc_groups (and optionally get_natoms) on LmdbTestDataNlocView to return only
the mapping/values for self._nloc (e.g., a single-key mapping or the count for
that group) instead of delegating to _inner, leaving __getattr__ for everything
else and keeping get_test(nloc=self._nloc) as-is.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@deepmd/dpmodel/utils/lmdb_data.py`:
- Around line 1466-1486: The view currently forwards all attributes via
__getattr__ so LmdbTestDataNlocView exposes the full _inner.nloc_groups;
override and implement a property nloc_groups (and optionally get_natoms) on
LmdbTestDataNlocView to return only the mapping/values for self._nloc (e.g., a
single-key mapping or the count for that group) instead of delegating to _inner,
leaving __getattr__ for everything else and keeping get_test(nloc=self._nloc)
as-is.

In `@deepmd/pt/train/validation.py`:
- Around line 406-426: The method _get_lmdb_test_data_snapshot reaches into
LmdbDataset._reader private attrs (reader._type_map and
reader._data_requirements); change it to use the public API: use reader.type_map
(the existing property) and a public accessor for requirements instead of
_data_requirements — either call a new LmdbDataReader.data_requirements property
that returns the values or accept the DataRequirementItem list from the caller
and pass that into LmdbTestData.add_data_requirement; update references in
_get_lmdb_test_data_snapshot to use reader.type_map and reader.data_requirements
(or the passed-in list) and remove direct access to
_type_map/_data_requirements.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 437999f9-dabc-4cd4-b2fb-0e4664606fa5

📥 Commits

Reviewing files that changed from the base of the PR and between 5c22e17 and 184ebb9.

📒 Files selected for processing (4)
  • deepmd/dpmodel/utils/__init__.py
  • deepmd/dpmodel/utils/lmdb_data.py
  • deepmd/entrypoints/test.py
  • deepmd/pt/train/validation.py

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
deepmd/pt/train/validation.py (1)

406-425: Snapshot is cached only on first call — add_data_requirement updates after that are silently ignored.

_get_lmdb_test_data_snapshot reads lmdb_dataset.data_requirements once and caches the resulting LmdbTestData. If any caller registers additional DataRequirementItems on the LmdbDataset after the first full validation runs (e.g. a downstream task or a follow-up loss adds aparam/atom_pref), those new fields will not appear in subsequent get_test() calls, leading to silently missing metrics. Since today the only registration site is the trainer setup (which runs once before validation begins), this is currently latent — but it's a footgun worth either documenting or guarding against.

Minimal hardening options:

  • Snapshot the requirement keys at cache time and re-build (or assert) if they diverge on later calls.
  • Or document in the docstring that any add_data_requirement calls must happen before the first full validation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt/train/validation.py` around lines 406 - 425, The cached LMDB test
snapshot created in _get_lmdb_test_data_snapshot currently ignores any
DataRequirementItem additions made to lmdb_dataset after the first call; update
_get_lmdb_test_data_snapshot to capture the set of
lmdb_dataset.data_requirements keys when creating self._lmdb_test_data and on
each subsequent call compare the current lmdb_dataset.data_requirements keys to
the cached set—if they differ, rebuild the LmdbTestData (re-instantiate
LmdbTestData and re-apply add_data_requirement) or raise an assertion/clear
error so callers must register requirements before validation; reference
LmdbTestData, add_data_requirement, and lmdb_dataset.data_requirements (and
ensure get_test sees the updated snapshot).
source/tests/pt/test_validation.py (1)

57-133: Test fixture helpers duplicate source/tests/common/dpmodel/test_lmdb_data.py.

_make_lmdb_frame and _create_mixed_nloc_lmdb are essentially the same as _make_frame and _create_mixed_nloc_lmdb in source/tests/common/dpmodel/test_lmdb_data.py (only the inclusion of type_map in metadata and minor naming differ). If you expect more tests under source/tests/pt/ to consume LMDB fixtures, consider extracting these into a small shared helper (e.g. under source/tests/common/dpmodel/_lmdb_fixtures.py) to avoid drift between the two copies.

Not blocking — local copies are fine for a single test.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/tests/pt/test_validation.py` around lines 57 - 133, The test helpers
_make_lmdb_frame and _create_mixed_nloc_lmdb duplicate fixtures in
source/tests/common/dpmodel/test_lmdb_data.py; extract them into a shared helper
(e.g. source/tests/common/dpmodel/_lmdb_fixtures.py) and replace the local
copies by importing the shared _make_frame/_create_mixed_nloc_lmdb (or rename to
match existing symbols) so both test modules use the single implementation and
share the metadata/type_map handling; update the pt test to import the shared
helpers and remove the duplicated functions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@deepmd/pt/train/validation.py`:
- Around line 406-425: The cached LMDB test snapshot created in
_get_lmdb_test_data_snapshot currently ignores any DataRequirementItem additions
made to lmdb_dataset after the first call; update _get_lmdb_test_data_snapshot
to capture the set of lmdb_dataset.data_requirements keys when creating
self._lmdb_test_data and on each subsequent call compare the current
lmdb_dataset.data_requirements keys to the cached set—if they differ, rebuild
the LmdbTestData (re-instantiate LmdbTestData and re-apply add_data_requirement)
or raise an assertion/clear error so callers must register requirements before
validation; reference LmdbTestData, add_data_requirement, and
lmdb_dataset.data_requirements (and ensure get_test sees the updated snapshot).

In `@source/tests/pt/test_validation.py`:
- Around line 57-133: The test helpers _make_lmdb_frame and
_create_mixed_nloc_lmdb duplicate fixtures in
source/tests/common/dpmodel/test_lmdb_data.py; extract them into a shared helper
(e.g. source/tests/common/dpmodel/_lmdb_fixtures.py) and replace the local
copies by importing the shared _make_frame/_create_mixed_nloc_lmdb (or rename to
match existing symbols) so both test modules use the single implementation and
share the metadata/type_map handling; update the pt test to import the shared
helpers and remove the duplicated functions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e67b3d13-8e46-4bb9-bfcb-5954eabe3bab

📥 Commits

Reviewing files that changed from the base of the PR and between 184ebb9 and b6e3816.

📒 Files selected for processing (5)
  • deepmd/dpmodel/utils/lmdb_data.py
  • deepmd/pt/train/validation.py
  • deepmd/pt/utils/lmdb_dataset.py
  • source/tests/common/dpmodel/test_lmdb_data.py
  • source/tests/pt/test_validation.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/dpmodel/utils/lmdb_data.py

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 25, 2026

Codecov Report

❌ Patch coverage is 84.09091% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.38%. Comparing base (d14233e) to head (a27d0ae).
⚠️ Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
deepmd/pt/train/validation.py 73.91% 6 Missing ⚠️
deepmd/entrypoints/test.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5419      +/-   ##
==========================================
+ Coverage   82.36%   82.38%   +0.02%     
==========================================
  Files         824      824              
  Lines       87109    87138      +29     
  Branches     4197     4197              
==========================================
+ Hits        71743    71788      +45     
+ Misses      14091    14075      -16     
  Partials     1275     1275              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
source/tests/pt/test_validation.py (1)

57-133: Optional: Consider deduplicating LMDB frame helpers.

_make_lmdb_frame and _create_mixed_nloc_lmdb here are near-identical to _make_frame / _create_mixed_nloc_lmdb in source/tests/common/dpmodel/test_lmdb_data.py. If this duplication is expected to grow, consider extracting them into a shared test utility (e.g., source/tests/common/dpmodel/lmdb_test_utils.py or similar) imported by both suites. Not blocking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/tests/pt/test_validation.py` around lines 57 - 133, The tests
duplicate LMDB helper logic across _make_lmdb_frame and _create_mixed_nloc_lmdb
(in this file) and _make_frame/_create_mixed_nloc_lmdb in the other test module;
extract the common helpers into a shared test utility module (e.g.,
lmdb_test_utils) and update both test files to import and call the shared
functions (keep the same function names or provide thin adapters) so the
generation of synthetic frames and LMDB datasets is centralized and avoids
drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@source/tests/pt/test_validation.py`:
- Around line 57-133: The tests duplicate LMDB helper logic across
_make_lmdb_frame and _create_mixed_nloc_lmdb (in this file) and
_make_frame/_create_mixed_nloc_lmdb in the other test module; extract the common
helpers into a shared test utility module (e.g., lmdb_test_utils) and update
both test files to import and call the shared functions (keep the same function
names or provide thin adapters) so the generation of synthetic frames and LMDB
datasets is centralized and avoids drift.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 1c165618-a0e6-44c0-85ce-e96859b766dd

📥 Commits

Reviewing files that changed from the base of the PR and between b6e3816 and a27d0ae.

📒 Files selected for processing (7)
  • deepmd/dpmodel/utils/__init__.py
  • deepmd/dpmodel/utils/lmdb_data.py
  • deepmd/entrypoints/test.py
  • deepmd/pt/train/validation.py
  • deepmd/pt/utils/lmdb_dataset.py
  • source/tests/common/dpmodel/test_lmdb_data.py
  • source/tests/pt/test_validation.py
✅ Files skipped from review due to trivial changes (2)
  • deepmd/pt/utils/lmdb_dataset.py
  • deepmd/dpmodel/utils/init.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • deepmd/entrypoints/test.py
  • deepmd/dpmodel/utils/lmdb_data.py
  • deepmd/pt/train/validation.py

@njzjz njzjz added this pull request to the merge queue Apr 26, 2026
Merged via the queue into deepmodeling:master with commit a3f548d Apr 26, 2026
70 checks passed
@OutisLi OutisLi deleted the pr/val branch April 27, 2026 05:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants