feat(pt/dpmodel): add max and filter mode for lmdb (#5413)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
  * batch_size accepts "max:N" and "filter:N" in addition to
    "auto"/"auto:N"; batch-size calculation honors per-frame atom counts.
  * print_summary explicitly reports the active batch-size rule.
* **Bug Fixes**
  * Dataset length, indexing, and returned frame IDs consistently reflect
    filtering; filtering preserves original system numbering.
  * Empty probability blocks are removed and weights renormalized so
    sampling remains valid even when systems/frames are fully dropped.
  * "filter:N" usage is disallowed with mixed-batch mode.
* **Documentation**
  * Updated batch_size docs and validation help to describe "max:N" and
    "filter:N" semantics.
* **Tests**
  * Added tests covering max/filter behaviors, filtering effects on
    indexing and sampling, error cases for invalid batch_size strings, and
    handling of fully filtered systems.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
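The batch-size rules described in the summary can be illustrated with a minimal sketch. It is not the code from this commit: `parse_rule` and `frame_batch_size` are hypothetical helper names, and the logic only follows the documented semantics ("auto" defaults to 32, "max:N" clamps to 1, "filter:N" drops data with more than N atoms).

```python
from typing import Optional, Tuple


def parse_rule(batch_size: str) -> Tuple[str, int]:
    """Split a rule string such as "max:256" into (mode, N). Hypothetical helper."""
    mode, sep, value = batch_size.partition(":")
    if mode == "auto" and not sep:
        return "auto", 32  # plain "auto" is documented as N = 32
    if mode not in ("auto", "max", "filter") or not value.isdigit():
        raise ValueError(f"unsupported batch_size rule: {batch_size!r}")
    return mode, int(value)


def frame_batch_size(mode: str, n: int, natoms: int) -> Optional[int]:
    """Batch size for data with `natoms` atoms, or None if the data is dropped."""
    if mode == "auto":
        # Smallest batch size with batch_size * natoms >= n (integer ceiling).
        return max(1, -(-n // natoms))
    if mode == "filter" and natoms > n:
        return None  # "filter:N" removes data whose atom count exceeds N
    # "max:N" and surviving "filter:N" data: largest batch size with
    # batch_size * natoms <= n, clamped to 1 when natoms alone exceeds n.
    return max(1, n // natoms)
```

For example, under these assumptions `frame_batch_size(*parse_rule("max:256"), natoms=300)` clamps to 1 (so that batch exceeds 256 atoms), while the same frame under `"filter:256"` is dropped.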
@@ -3718,8 +3718,8 @@
  - string "auto": automatically determines the batch size so that the batch_size times the number of atoms in the system is no less than 32.\n\n\
  - string "auto:N": automatically determines the batch size so that the batch_size times the number of atoms in the system is no less than N.\n\n\
  - string "mixed:N": the batch data will be sampled from all systems and merged into a mixed system with the batch size N. Only support the se_atten descriptor for TensorFlow backend.\n\n\
- - string "max:N": automatically determines the batch size so that the batch_size times the number of atoms in the system is no more than N.\n\n\
- - string "filter:N": the same as `"max:N"` but removes the systems with the number of atoms larger than `N` from the data set.\n\n\
+ - string "max:N": automatically determines the batch size so that `batch_size * natoms` is at most `N`. `natoms` is the per-system atom count for npy data and the per-frame nloc for LMDB data. When a single system/frame already has more than `N` atoms, the batch size clamps to 1 and that batch will exceed `N`.\n\n\
+ - string "filter:N": the same as `"max:N"` but additionally drops data whose atom count exceeds `N`. For npy data this removes whole systems with natoms > `N`; for LMDB data this removes individual frames with nloc > `N`.\n\n\
  If MPI is used, the value should be considered as the batch size per task.'
  doc_auto_prob_style='Determine the probability of systems automatically. The method is assigned by this key and can be\n\n\
  - "prob_uniform" : the probability all the systems are equal, namely 1.0/self.get_nsystems()\n\n\
@@ -3798,7 +3798,9 @@
  - list: the length of which is the same as the {link_sys}. The batch size of each system is given by the elements of the list.\n\n\
  - int: all {link_sys} use the same batch size.\n\n\
  - string "auto": automatically determines the batch size so that the batch_size times the number of atoms in the system is no less than 32.\n\n\
- - string "auto:N": automatically determines the batch size so that the batch_size times the number of atoms in the system is no less than N.'
+ - string "auto:N": automatically determines the batch size so that the batch_size times the number of atoms in the system is no less than N.\n\n\
+ - string "max:N": automatically determines the batch size so that `batch_size * natoms` is at most `N`. `natoms` is the per-system atom count for npy data and the per-frame nloc for LMDB data. When a single system/frame already has more than `N` atoms, the batch size clamps to 1 and that batch will exceed `N`.\n\n\
+ - string "filter:N": the same as `"max:N"` but additionally drops data whose atom count exceeds `N`. For npy data this removes whole systems with natoms > `N`; for LMDB data this removes individual frames with nloc > `N`.'
  doc_auto_prob_style='Determine the probability of systems automatically. The method is assigned by this key and can be\n\n\
  - "prob_uniform" : the probability all the systems are equal, namely 1.0/self.get_nsystems()\n\n\
  - "prob_sys_size" : the probability of a system is proportional to the number of batches in the system\n\n\
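The interaction between "filter:N" and system probabilities noted in the commit summary (empty blocks removed, weights renormalized, original system numbering preserved) might look roughly like the sketch below. This is an illustration only: `filter_and_weight` is a hypothetical function, and weighting each system by its count of surviving frames is an assumed stand-in for the "number of batches" used by "prob_sys_size".

```python
from typing import Dict, List, Tuple


def filter_and_weight(
    nloc_per_system: Dict[int, List[int]], n: int
) -> Tuple[Dict[int, List[int]], Dict[int, float]]:
    """Drop frames with nloc > n, then renormalize per-system sampling weights.

    `nloc_per_system` maps an original system ID to its per-frame atom counts.
    Returns (kept frame indices per system, sampling probability per system).
    """
    kept: Dict[int, List[int]] = {}
    for sid, nlocs in nloc_per_system.items():
        frames = [i for i, nloc in enumerate(nlocs) if nloc <= n]
        if frames:  # systems left with no frames are removed entirely
            kept[sid] = frames  # key stays the original system ID

    total = sum(len(frames) for frames in kept.values())
    if total == 0:
        raise ValueError("every frame was filtered out; nothing left to sample")

    # Assumption: weight is proportional to the number of surviving frames.
    probs = {sid: len(frames) / total for sid, frames in kept.items()}
    return kept, probs
```

With `{0: [10, 300], 1: [400, 500], 2: [20, 30, 40]}` and `n = 256`, system 1 disappears, systems 0 and 2 keep their original IDs, and the remaining weights sum to 1.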