How to use dpgen without PBS or SLURM #5338
Our laboratory has a GPU cluster (one master node and six compute nodes), but neither PBS nor SLURM is installed, and I don't have the root password. The dpgen program runs on the master node; passwordless SSH login and an NFS filesystem are enabled. Thanks for your help!
Hi @whisper-to! I'm Dosu and I'm helping the deepmd-kit team. For your setup (master + 6 compute nodes, passwordless SSH, NFS shared filesystem, no scheduler), don't use DistributedShell: it is specifically for YARN/Hadoop clusters, not SSH-based clusters. Instead, use one of the following.

Option 1: LazyLocalContext (recommended for NFS)

Since you have NFS, this avoids all file copying; jobs run in place:

```json
{
  "train": [{
    "command": "dp",
    "machine": {
      "batch_type": "Shell",
      "context_type": "LazyLocalContext",
      "local_root": "./",
      "remote_root": "./"
    },
    "resources": {
      "number_node": 1,
      "cpu_per_node": 8,
      "gpu_per_node": 1,
      "group_size": 0
    }
  }]
}
```

Option 2: SSHContext (for remote execution on compute nodes)

```json
{
  "train": [{
    "command": "dp",
    "machine": {
      "batch_type": "Shell",
      "context_type": "SSHContext",
      "local_root": "./",
      "remote_root": "/shared/nfs/dpgen_work",
      "remote_profile": {
        "hostname": "compute-node-1",
        "username": "your_username",
        "key_filename": "~/.ssh/id_rsa"
      }
    },
    "resources": {
      "gpu_per_node": 1,
      "group_size": 0
    }
  }]
}
```

Important notes:

For more details, see the batch systems documentation and context documentation. To reply, just mention @dosu.
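Before starting a run, it can help to sanity-check that the machine file parses as valid JSON. A minimal sketch, assuming the Option 1 configuration is saved as `machine.json` (the file names `machine.json` and `param.json` are illustrative; use whatever your setup actually calls them):

```shell
# Write the Option 1 (LazyLocalContext) configuration to machine.json.
cat > machine.json <<'EOF'
{
  "train": [{
    "command": "dp",
    "machine": {
      "batch_type": "Shell",
      "context_type": "LazyLocalContext",
      "local_root": "./",
      "remote_root": "./"
    },
    "resources": {
      "number_node": 1,
      "cpu_per_node": 8,
      "gpu_per_node": 1,
      "group_size": 0
    }
  }]
}
EOF

# Validate the JSON syntax before handing it to dpgen.
python -m json.tool machine.json > /dev/null && echo "machine.json OK"

# Then launch the workflow from the master node, e.g.:
#   dpgen run param.json machine.json
```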
Beta Was this translation helpful? Give feedback.
Thanks for the clarification! The problem is that when FP tasks crash with a non-zero exit code, dpdispatcher treats that as a job failure, and after exhausting `retry_count` it raises an exception that stops everything. Neither `ratio_unfinished` nor `ratio_failed` will help at this stage, because the error happens at the dpdispatcher level before dpgen's `ratio_failed` logic ever kicks in. Here's what you need to do; both steps are needed:
Step 1: Force the FP command to always return exit code 0
In your `machine.json`, append `|| true` to your FP command so that crashes don't produce a non-zero exit code:

```json
{
  "fp": {
    "command": "vasp_std || true",
    ...
  }
}
```

This way, even if the DFT co…
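The effect of `|| true` can be seen in a quick shell check (the `sh -c 'exit 2'` call stands in for a crashing DFT run; the exit code 2 is arbitrary):

```shell
# Without `|| true`, the crash's exit status (2 here) propagates to the caller;
# with it, the overall status is that of `true`, i.e. 0, so the dispatcher
# treats the job as finished.
sh -c 'exit 2' || status_plain=$?   # simulated crash, status captured
sh -c 'exit 2' || true              # same crash, masked
status_masked=$?
echo "plain exit status:  ${status_plain}"
echo "masked exit status: ${status_masked}"
```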