feat(pt): full validation support lmdb format #5419
njzjz merged 2 commits into deepmodeling:master from
Conversation
Pull request overview
Adds full-validation support for LMDB-formatted datasets in the PyTorch training pipeline, aligning LMDB validation behavior with existing dp-test mixed-nloc handling.
Changes:
- Extend FullValidator to iterate validation "systems" from either loader-backed datasets or LMDB datasets (evaluating LMDB per nloc group).
- Introduce a reusable LmdbTestDataNlocView and an LmdbTestData.add_data_requirement() helper to mirror training-time requirements in full validation.
- Replace the local LMDB nloc-view wrapper in the dp test entrypoint with the shared implementation and export it in dpmodel.utils.
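To make the requirement-mirroring idea concrete, here is a minimal sketch of what an `add_data_requirement()`-style helper can look like. All names here (the `DataRequirementItem` fields, `TestDataStub`) are illustrative stand-ins for the real deepmd classes, not the actual implementation from this PR.

```python
from dataclasses import dataclass


@dataclass
class DataRequirementItem:
    """Minimal stand-in for deepmd's DataRequirementItem (fields are assumed)."""
    key: str
    ndof: int = 1
    atomic: bool = False
    must: bool = False


class TestDataStub:
    """Illustrative sketch of an LmdbTestData-like holder with an
    add_data_requirement() helper as described in the PR overview."""

    def __init__(self) -> None:
        # Keyed by name so re-registering a key overwrites instead of duplicating.
        self._data_requirements: dict[str, DataRequirementItem] = {}

    def add_data_requirement(self, items: list[DataRequirementItem]) -> None:
        """Mirror training-time requirements onto this validation snapshot."""
        for item in items:
            self._data_requirements[item.key] = item


td = TestDataStub()
td.add_data_requirement([
    DataRequirementItem("energy"),
    DataRequirementItem("force", ndof=3, atomic=True),
])
```

With this shape, full validation can register the same requirement list the trainer used, so `get_test()` later returns the same fields.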
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| deepmd/pt/train/validation.py | Adds LMDB-aware iteration + cached LMDB test snapshot for full validation. |
| deepmd/entrypoints/test.py | Switches dp test to use the shared LmdbTestDataNlocView instead of a local wrapper. |
| deepmd/dpmodel/utils/lmdb_data.py | Adds add_data_requirement() to LmdbTestData and introduces LmdbTestDataNlocView. |
| deepmd/dpmodel/utils/__init__.py | Re-exports LmdbTestDataNlocView as part of the public utils API. |
📝 Walkthrough

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant FV as FullValidator
    participant VS as validation_data
    participant LMDB as LmdbDataset
    participant Reader as LmdbDataReader / LmdbTestData
    participant View as LmdbTestDataNlocView
    FV->>VS: iterate systems
    alt non-LMDB dataset
        VS-->>FV: yield dataset.data_system
        FV->>FV: _evaluate_system(data_system)
    else LMDB dataset
        VS-->>FV: detect LmdbDataset
        FV->>Reader: request or build cached LmdbTestData snapshot
        Reader-->>FV: return snapshot
        FV->>View: instantiate LmdbTestDataNlocView(snapshot, nloc) per nloc group
        View-->>FV: yield fixed-nloc test-system
        FV->>FV: _evaluate_system(View)
    end
```
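The branch structure in the diagram can be sketched in a few lines. `LmdbDatasetStub`, `NlocViewStub`, and `iter_validation_systems` are hypothetical stand-ins that mirror the names in the diagram, not the PR's actual code:

```python
class LmdbDatasetStub:
    """Stand-in for an LMDB-backed validation dataset (hypothetical)."""

    def __init__(self, nloc_groups: dict[int, list[int]]):
        self.nloc_groups = nloc_groups  # nloc -> list of frame indices


class NlocViewStub:
    """Stand-in for LmdbTestDataNlocView: a fixed-nloc slice of a snapshot."""

    def __init__(self, inner, nloc: int):
        self._inner = inner
        self.nloc = nloc


def iter_validation_systems(datasets):
    """Yield evaluatable systems; an LMDB dataset expands to one view per nloc group."""
    for ds in datasets:
        if isinstance(ds, LmdbDatasetStub):
            for nloc in sorted(ds.nloc_groups):
                yield NlocViewStub(ds, nloc)
        else:
            yield ds  # loader-backed dataset evaluated as-is


systems = list(iter_validation_systems([
    LmdbDatasetStub({8: [0, 1], 16: [2]}),
    "loader_backed_system",
]))
```

Each yielded item can then be passed to a single `_evaluate_system`-style function, which is what lets LMDB and loader-backed datasets share the same downstream path.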
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
🧹 Nitpick comments (2)
deepmd/pt/train/validation.py (1)
406-426: Avoid reaching into LmdbDataReader's private attributes.

reader._type_map and reader._data_requirements are implementation details of LmdbDataReader. type_map is already exposed as a property on the reader (see LmdbDataReader.type_map in deepmd/dpmodel/utils/lmdb_data.py), so the snapshot can use that directly. For _data_requirements, consider either adding a small accessor on LmdbDataReader (e.g. a data_requirements property returning the values) or plumbing the original DataRequirementItem list through the caller, so this module doesn't rely on a name-mangled-ish private dict that may change shape (today it stores DataRequirementItem objects, but LmdbDataReader._resolve_dtype also accepts plain dicts).

♻️ Suggested change (use the public property and add a small accessor)
```diff
-        reader = lmdb_dataset._reader
-        self._lmdb_test_data = LmdbTestData(
-            lmdb_dataset.lmdb_path,
-            type_map=list(reader._type_map),
-            shuffle_test=False,
-        )
-        data_requirements = list(reader._data_requirements.values())
+        reader = lmdb_dataset._reader
+        self._lmdb_test_data = LmdbTestData(
+            lmdb_dataset.lmdb_path,
+            type_map=list(reader.type_map),
+            shuffle_test=False,
+        )
+        data_requirements = list(reader._data_requirements.values())
         if data_requirements:
             self._lmdb_test_data.add_data_requirement(data_requirements)
         return self._lmdb_test_data
```

Plus, in deepmd/dpmodel/utils/lmdb_data.py, expose a read-only accessor for the data requirements:

```diff
     @property
     def nloc_groups(self) -> dict[int, list[int]]:
         """Nloc → list of frame indices."""
         return self._nloc_groups
+
+    @property
+    def data_requirements(self) -> list[DataRequirementItem]:
+        """Registered data requirements (order-preserving view)."""
+        return list(self._data_requirements.values())
```
deepmd/dpmodel/utils/lmdb_data.py (1)
1466-1486: Consider proxying nloc_groups to avoid leaking the other groups through the view.

Because __getattr__ forwards everything to _inner, view.nloc_groups returns the full mapping for every group in the underlying LmdbTestData, not just the one this view represents. Consumers that want to iterate a "single-system" view (e.g., for summary/logging) may be misled. If that shape ever matters for downstream code, override nloc_groups (and optionally get_natoms) to expose only self._nloc. Non-blocking nit.

♻️ Optional refinement
```diff
     def get_test(self) -> dict[str, Any]:
         return self._inner.get_test(nloc=self._nloc)
+
+    @property
+    def nloc_groups(self) -> dict[int, list[int]]:
+        return {self._nloc: self._inner.nloc_groups.get(self._nloc, [])}
```
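The pattern behind this refinement, `__getattr__` delegation with a narrowing property override, can be demonstrated in isolation. The class names below are illustrative, not the real deepmd code:

```python
class InnerData:
    """Stand-in for the wrapped LmdbTestData-like object (hypothetical)."""

    def __init__(self):
        self.nloc_groups = {8: [0, 1], 16: [2]}
        self.type_map = ["O", "H"]

    def get_test(self, nloc=None):
        return {"nloc": nloc}


class NlocView:
    """Fixed-nloc view of an inner dataset.

    __getattr__ is only consulted for attributes *missing* on the view itself,
    so the explicit nloc_groups property wins and hides the other groups,
    while everything else (e.g. type_map) still delegates to the inner object.
    """

    def __init__(self, inner, nloc):
        self._inner = inner
        self._nloc = nloc

    def __getattr__(self, name):
        return getattr(self._inner, name)

    @property
    def nloc_groups(self):
        return {self._nloc: self._inner.nloc_groups.get(self._nloc, [])}

    def get_test(self):
        return self._inner.get_test(nloc=self._nloc)


view = NlocView(InnerData(), 8)
```

Because Python resolves class-level descriptors before falling back to `__getattr__`, only the attributes you explicitly override change shape; the rest of the wrapped API passes through untouched.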
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 437999f9-dabc-4cd4-b2fb-0e4664606fa5
📒 Files selected for processing (4)
- deepmd/dpmodel/utils/__init__.py
- deepmd/dpmodel/utils/lmdb_data.py
- deepmd/entrypoints/test.py
- deepmd/pt/train/validation.py
🧹 Nitpick comments (2)
deepmd/pt/train/validation.py (1)
406-425: Snapshot is cached only on first call — add_data_requirement updates after that are silently ignored.

_get_lmdb_test_data_snapshot reads lmdb_dataset.data_requirements once and caches the resulting LmdbTestData. If any caller registers additional DataRequirementItems on the LmdbDataset after the first full validation run (e.g. a downstream task or a follow-up loss adds aparam/atom_pref), those new fields will not appear in subsequent get_test() calls, leading to silently missing metrics. Since today the only registration site is the trainer setup (which runs once before validation begins), this is currently latent — but it's a footgun worth either documenting or guarding against.

Minimal hardening options:
- Snapshot the requirement keys at cache time and re-build (or assert) if they diverge on later calls.
- Or document in the docstring that any add_data_requirement calls must happen before the first full validation.
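The first option can be sketched as a key-aware cache. `build_snapshot` and the dict-shaped requirement items below are assumptions for illustration, not the PR's API:

```python
class SnapshotCache:
    """Rebuild the cached snapshot when the requirement keys diverge
    (sketch of the hardening suggested above; names are hypothetical)."""

    def __init__(self, build_snapshot):
        self._build = build_snapshot
        self._snapshot = None
        self._cached_keys = None

    def get(self, requirements):
        keys = frozenset(r["key"] for r in requirements)
        if self._snapshot is None or keys != self._cached_keys:
            # Requirements changed since the snapshot was built: rebuild it
            # so get_test() never silently misses newly registered fields.
            self._snapshot = self._build(requirements)
            self._cached_keys = keys
        return self._snapshot


cache = SnapshotCache(lambda reqs: {"keys": sorted(r["key"] for r in reqs)})
first = cache.get([{"key": "energy"}])
second = cache.get([{"key": "energy"}])                    # same keys: cache hit
third = cache.get([{"key": "energy"}, {"key": "force"}])   # new key: rebuilt
```

Comparing a frozenset of keys is cheap relative to rebuilding an LMDB snapshot, so the guard costs essentially nothing on the common (unchanged) path.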
source/tests/pt/test_validation.py (1)
57-133: Test fixture helpers duplicate source/tests/common/dpmodel/test_lmdb_data.py.

_make_lmdb_frame and _create_mixed_nloc_lmdb are essentially the same as _make_frame and _create_mixed_nloc_lmdb in source/tests/common/dpmodel/test_lmdb_data.py (only the inclusion of type_map in metadata and minor naming differ). If you expect more tests under source/tests/pt/ to consume LMDB fixtures, consider extracting these into a small shared helper (e.g. under source/tests/common/dpmodel/_lmdb_fixtures.py) to avoid drift between the two copies.

Not blocking — local copies are fine for a single test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: e67b3d13-8e46-4bb9-bfcb-5954eabe3bab
📒 Files selected for processing (5)
- deepmd/dpmodel/utils/lmdb_data.py
- deepmd/pt/train/validation.py
- deepmd/pt/utils/lmdb_dataset.py
- source/tests/common/dpmodel/test_lmdb_data.py
- source/tests/pt/test_validation.py
🚧 Files skipped from review as they are similar to previous changes (1)
- deepmd/dpmodel/utils/lmdb_data.py
Codecov Report

❌ Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##           master    #5419      +/-   ##
==========================================
+ Coverage   82.36%   82.38%   +0.02%
==========================================
  Files         824      824
  Lines       87109    87138      +29
  Branches     4197     4197
==========================================
+ Hits        71743    71788      +45
+ Misses      14091    14075      -16
  Partials     1275     1275
```

☔ View full report in Codecov by Sentry.
🧹 Nitpick comments (1)
source/tests/pt/test_validation.py (1)
57-133: Optional: Consider deduplicating LMDB frame helpers.

_make_lmdb_frame and _create_mixed_nloc_lmdb here are near-identical to _make_frame/_create_mixed_nloc_lmdb in source/tests/common/dpmodel/test_lmdb_data.py. If this duplication is expected to grow, consider extracting them into a shared test utility (e.g., source/tests/common/dpmodel/lmdb_test_utils.py or similar) imported by both suites. Not blocking.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 1c165618-a0e6-44c0-85ce-e96859b766dd
📒 Files selected for processing (7)
- deepmd/dpmodel/utils/__init__.py
- deepmd/dpmodel/utils/lmdb_data.py
- deepmd/entrypoints/test.py
- deepmd/pt/train/validation.py
- deepmd/pt/utils/lmdb_dataset.py
- source/tests/common/dpmodel/test_lmdb_data.py
- source/tests/pt/test_validation.py
✅ Files skipped from review due to trivial changes (2)
- deepmd/pt/utils/lmdb_dataset.py
- deepmd/dpmodel/utils/__init__.py
🚧 Files skipped from review as they are similar to previous changes (3)
- deepmd/entrypoints/test.py
- deepmd/dpmodel/utils/lmdb_data.py
- deepmd/pt/train/validation.py