How to use dpgen without PBS or SLURM #5338
Our laboratory has a GPU cluster (one master node and six compute nodes), but neither PBS nor SLURM is installed, and I don't have the root password. The dpgen program runs on the master node; passwordless SSH login and an NFS filesystem are enabled. Thanks for your help!
Hi @whisper-to! I'm Dosu and I'm helping the deepmd-kit team. For your setup (master + 6 compute nodes, passwordless SSH, NFS shared filesystem, no scheduler), don't use DistributedShell: it is specifically for YARN/Hadoop clusters, not SSH-based clusters. Instead, use one of the following.

Option 1: LazyLocalContext (recommended for NFS)

Since you have NFS, this avoids all file copying; jobs run in place:

```json
{
  "train": [{
    "command": "dp",
    "machine": {
      "batch_type": "Shell",
      "context_type": "LazyLocalContext",
      "local_root": "./",
      "remote_root": "./"
    },
    "resources": {
      "number_node": 1,
      "cpu_per_node": 8,
      "gpu_per_node": 1,
      "group_size": 0
    }
  }]
}
```

Option 2: SSHContext (for remote execution on compute nodes)

```json
{
  "train": [{
    "command": "dp",
    "machine": {
      "batch_type": "Shell",
      "context_type": "SSHContext",
      "local_root": "./",
      "remote_root": "/shared/nfs/dpgen_work",
      "remote_profile": {
        "hostname": "compute-node-1",
        "username": "your_username",
        "key_filename": "~/.ssh/id_rsa"
      }
    },
    "resources": {
      "gpu_per_node": 1,
      "group_size": 0
    }
  }]
}
```

Important notes:

For more details, see the batch systems documentation and context documentation. To reply, just mention @dosu.
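Before starting a run, it can help to sanity-check that the machine file parses as valid JSON. A minimal sketch, assuming the Option 1 configuration is saved as `machine.json` (the file names `machine.json` and `param.json` are illustrative; use whatever your setup actually calls them):

```shell
# Write the Option 1 (LazyLocalContext) configuration to machine.json.
cat > machine.json <<'EOF'
{
  "train": [{
    "command": "dp",
    "machine": {
      "batch_type": "Shell",
      "context_type": "LazyLocalContext",
      "local_root": "./",
      "remote_root": "./"
    },
    "resources": {
      "number_node": 1,
      "cpu_per_node": 8,
      "gpu_per_node": 1,
      "group_size": 0
    }
  }]
}
EOF

# Validate the JSON syntax before handing it to dpgen.
python -m json.tool machine.json > /dev/null && echo "machine.json OK"

# Then launch the workflow from the master node, e.g.:
#   dpgen run param.json machine.json
```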
Beta Was this translation helpful? Give feedback.
Thanks for the clarification! The problem is that when FP tasks crash with a non-zero exit code, dpdispatcher treats that as a job failure, and after exhausting `retry_count` it raises an exception that stops everything. Neither `ratio_unfinished` nor `ratio_failed` will help at this stage, because the error happens at the dpdispatcher level before dpgen's `ratio_failed` logic ever kicks in. Here's what you need to do; both steps are needed:
Step 1: Force the FP command to always return exit code 0
In your `machine.json`, append `|| true` to your FP command so that crashes don't produce a non-zero exit code:

```json
{
  "fp": {
    "command": "vasp_std || true",
    ...
  }
}
```

This way, even if the DFT co…
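The effect of `|| true` can be seen in a quick shell check (the `sh -c 'exit 2'` call stands in for a crashing DFT run; the exit code 2 is arbitrary):

```shell
# Without `|| true`, the crash's exit status (2 here) propagates to the caller;
# with it, the overall status is that of `true`, i.e. 0, so the dispatcher
# treats the job as finished.
sh -c 'exit 2' || status_plain=$?   # simulated crash, status captured
sh -c 'exit 2' || true              # same crash, masked
status_masked=$?
echo "plain exit status:  ${status_plain}"
echo "masked exit status: ${status_masked}"
```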