
[BUG] : converting to JAX results in segmentation fault #5389

@JelleLagerweij

Description


Bug summary

Converting a dpa2.pth model to dpa2.savedmodel crashes with a segmentation fault before any real processing happens.

DeePMD-kit Version

3.1.3_cuda129

Backend and its version

JAX

How did you download the software?

docker

Input Files, Running Commands, Error Log, etc.

The output below shows that it attempts to register cuDNN and cuBLAS multiple times. I believe this is quite normal when multiple backends are available in the same install (as in the docker image). However, I had no such problems when running freeze or train.

2026-04-10 15:49:01.569013: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0410 15:49:01.589544 3571892 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0410 15:49:01.601176 3571892 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0410 15:49:01.813364 3571892 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0410 15:49:01.813440 3571892 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0410 15:49:01.813447 3571892 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0410 15:49:01.813451 3571892 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
/var/spool/slurm/slurmd/job21734795/slurm_script: line 23: 3571868 Segmentation fault apptainer exec --nv --bind "$LOCATION" --pwd "$LOCATION" "$SIF" dp convert-backend dpa2.pth dpa2.savedmodel

Steps to Reproduce

I installed the docker image and converted it to an Apptainer image. On my cluster there are a few problems with how the local path gets used in the Apptainer image, but:

install:
apptainer pull docker://deepmodeling/deepmd-kit:3.1.3_cuda129

Run the newly created .sif file (for example to freeze a model):
LOCATION=$(pwd) apptainer exec --nv --bind "$LOCATION" --pwd "$LOCATION" path/to/apptainer/deepmd-kit_3.1.3_cuda129.sif dp --pt freeze -c dpa2.pt -o dpa2.pth
which gives perfectly fine output:
[2026-04-10 15:55:38,011] DEEPMD INFO DeePMD version: 3.1.3
[2026-04-10 15:55:39,403] DEEPMD INFO Saved frozen model to dpa2.pth

but:
apptainer exec --nv --bind "$LOCATION" --pwd "$LOCATION" path/to/apptainer/deepmd-kit_3.1.3_cuda129.sif dp convert-backend dpa2.pth dpa2.savedmodel

2026-04-10 15:55:50.564630: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0410 15:55:50.585698 1045729 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0410 15:55:50.597595 1045729 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0410 15:55:50.809564 1045729 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0410 15:55:50.809627 1045729 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0410 15:55:50.809633 1045729 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0410 15:55:50.809637 1045729 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
/var/spool/slurm/slurmd/job21734890/slurm_script: line 27: 1045702 Segmentation fault apptainer exec --nv --bind "$LOCATION" --pwd "$LOCATION" path/to/apptainer/deepmd-kit_3.1.3_cuda129.sif dp convert-backend dpa2.pth dpa2.savedmodel
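One way to get at least a Python-level stack before the crash (a debugging sketch on my side, not DeePMD-kit functionality) is CPython's built-in faulthandler, e.g. by passing --env PYTHONFAULTHANDLER=1 to apptainer exec (if the installed Apptainer supports --env), or programmatically:

```python
import faulthandler

# faulthandler.enable() makes CPython dump the Python-level traceback of all
# threads when a fatal signal such as SIGSEGV arrives, before the process dies.
# Equivalent to setting PYTHONFAULTHANDLER=1 before starting the interpreter.
faulthandler.enable()
print(faulthandler.is_enabled())  # → True
```

This would at least show whether the segfault happens during the TF/JAX imports or inside the actual conversion.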

It is hard for me to debug this further, as it is a hard segmentation fault without any usable errors. If I run a training (see the dpa2 example below), no issues occur, while I use the same Apptainer image. Inference in LAMMPS and with i-PI also works exactly as I want (for LAMMPS I had some issues getting CPU parallelization to work in the Apptainer together with GPUs and PIMD simulations, but all of that was solved). Just as an example, here is a training setup:

================ GPU Training ================
[2026-04-10 09:46:22,532] DEEPMD INFO DeePMD version: 3.1.3
[2026-04-10 09:46:22,532] DEEPMD INFO Configuration path: input.json
[2026-04-10 09:46:24,434] DEEPMD INFO _____ _____ __ __ _____ _ _ _
[2026-04-10 09:46:24,434] DEEPMD INFO | __ \ | __ \ | \/ || __ \ | | (_)| |
[2026-04-10 09:46:24,434] DEEPMD INFO | | | | ___ ___ | |__) || \ / || | | | ______ | | __ _ | |_
[2026-04-10 09:46:24,434] DEEPMD INFO | | | | / _ \ / _ \| ___/ | |\/| || | | ||______|| |/ /| || __|
[2026-04-10 09:46:24,434] DEEPMD INFO | |__| || __/| __/| | | | | || |__| | | < | || |_
[2026-04-10 09:46:24,434] DEEPMD INFO |_____/ \___| \___||_| |_| |_||_____/ |_|\_\|_| \__|
[2026-04-10 09:46:24,434] DEEPMD INFO Please read and cite:
[2026-04-10 09:46:24,434] DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[2026-04-10 09:46:24,435] DEEPMD INFO Zeng et al, J. Chem. Phys., 159, 054801 (2023)
[2026-04-10 09:46:24,435] DEEPMD INFO Zeng et al, J. Chem. Theory Comput., 21, 4375-4385 (2025)
[2026-04-10 09:46:24,435] DEEPMD INFO See https://deepmd.rtfd.io/credits/ for details.
[2026-04-10 09:46:24,435] DEEPMD INFO -------------------------------------------------------------------------------------------------------------------
[2026-04-10 09:46:24,435] DEEPMD INFO Installed to: /opt/deepmd-kit/lib/python3.12/site-packages/deepmd
[2026-04-10 09:46:24,435] DEEPMD INFO Source:
[2026-04-10 09:46:24,435] DEEPMD INFO Source Branch: main
[2026-04-10 09:46:24,435] DEEPMD INFO Source Commit: a170962
[2026-04-10 09:46:24,435] DEEPMD INFO Source Commit at: 2026-03-19 22:57:02 +0800
[2026-04-10 09:46:24,436] DEEPMD INFO Float Precision: Double
[2026-04-10 09:46:24,436] DEEPMD INFO Build Variant: CUDA
[2026-04-10 09:46:24,436] DEEPMD INFO Backend: PyTorch
[2026-04-10 09:46:24,436] DEEPMD INFO PT Ver: v2.10.0-gUnknown
[2026-04-10 09:46:24,436] DEEPMD INFO Custom OP Enabled: True
[2026-04-10 09:46:24,436] DEEPMD INFO Built with PT Ver: 2.10.0
[2026-04-10 09:46:24,436] DEEPMD INFO Built with PT Inc: /opt/deepmd-kit/lib/python3.12/site-packages/torch/include
[2026-04-10 09:46:24,436] DEEPMD INFO /opt/deepmd-kit/lib/python3.12/site-packages/torch/../../../../include/torch/csrc/api/include
[2026-04-10 09:46:24,436] DEEPMD INFO Built with PT Lib: /opt/deepmd-kit/lib/python3.12/site-packages/torch/lib
[2026-04-10 09:46:24,436] DEEPMD INFO Running on: ***clustername***
[2026-04-10 09:46:24,436] DEEPMD INFO Computing Device: CUDA:0
[2026-04-10 09:46:24,436] DEEPMD INFO Device Name: NVIDIA A100-SXM4-40GB
[2026-04-10 09:46:24,436] DEEPMD INFO CUDA_VISIBLE_DEVICES: 0
[2026-04-10 09:46:24,436] DEEPMD INFO Visible GPU Count: 1
[2026-04-10 09:46:24,436] DEEPMD INFO Num Intra Threads: 16
[2026-04-10 09:46:24,436] DEEPMD INFO Num Inter Threads: 2
[2026-04-10 09:46:24,436] DEEPMD INFO -------------------------------------------------------------------------------------------------------------------
[2026-04-10 09:46:24,494] DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[2026-04-10 09:46:32,546] DEEPMD INFO Neighbor statistics: training data with minimal neighbor distance: 0.708425
[2026-04-10 09:46:32,546] DEEPMD INFO Neighbor statistics: training data with maximum neighbor size: [123] (cutoff radius: 6.000000)
[2026-04-10 09:46:36,607] DEEPMD INFO Neighbor statistics: training data with minimal neighbor distance: 0.708425
[2026-04-10 09:46:36,607] DEEPMD INFO Neighbor statistics: training data with maximum neighbor size: [44] (cutoff radius: 4.000000)
[2026-04-10 09:46:40,670] DEEPMD INFO Neighbor statistics: training data with minimal neighbor distance: 0.708425
[2026-04-10 09:46:40,670] DEEPMD INFO Neighbor statistics: training data with maximum neighbor size: [44] (cutoff radius: 4.000000)
[2026-04-10 09:46:40,690] DEEPMD INFO Constructing DataLoaders from 35 systems
[2026-04-10 09:46:40,826] DEEPMD INFO Packing data for statistics from 35 systems
[2026-04-10 09:46:43,931] DEEPMD INFO RMSE of energy per atom after linear regression is: 0.04050035005999527 in the unit of energy.
[2026-04-10 09:46:46,344] DEEPMD INFO ---Summary of DataSystem: Training -----------------------------------------------
[2026-04-10 09:46:46,344] DEEPMD INFO Found 35 System(s):
[2026-04-10 09:46:46,345] DEEPMD INFO system natoms bch_sz n_bch prob pbc
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.init/init_data/55-1-1 172 1 30 3.705e-03 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.init/init_data/55-3-2 182 1 30 3.705e-03 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.init/init_data/55-4-4 193 1 30 3.705e-03 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.init/init_data/55-5-3 192 1 30 3.705e-03 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.init/init_data/100-2-6 330 1 29 3.581e-03 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.init/init_data/100-3-3 321 1 10 1.235e-03 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.init/init_data/100-4-2 320 1 10 1.235e-03 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.init/init_data/100-8-8 356 1 9 1.111e-03 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.init/init_data/110-2-2 374 1 74 9.138e-03 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.init/init_data/110-4-8 374 1 74 9.138e-03 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.init/init_data/110-6-3 374 1 74 9.138e-03 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.init/init_data/110-12-2 374 1 74 9.138e-03 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000000/02.fp/data.000 206 1 144 1.778e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000000/02.fp/data.001 209 1 256 3.161e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000000/02.fp/data.002 212 1 80 9.879e-03 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000000/02.fp/data.003 233 1 13 1.605e-03 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000001/02.fp/data.003 233 1 400 4.939e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000001/02.fp/data.004 245 1 394 4.865e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000001/02.fp/data.005 259 1 172 2.124e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000002/02.fp/data.005 259 1 165 2.038e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000002/02.fp/data.007 323 1 400 4.939e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000003/02.fp/data.008 417 1 400 4.939e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000003/02.fp/data.009 441 1 400 4.939e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000004/02.fp/data.000 206 1 400 4.939e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000004/02.fp/data.001 209 1 400 4.939e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000004/02.fp/data.002 212 1 400 4.939e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000004/02.fp/data.003 233 1 400 4.939e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000005/02.fp/data.004 245 1 400 4.939e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000005/02.fp/data.005 259 1 400 4.939e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000005/02.fp/data.006 259 1 400 4.939e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000005/02.fp/data.010 338 1 400 4.939e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000006/02.fp/data.007 323 1 400 4.939e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000006/02.fp/data.008 417 1 400 4.939e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000006/02.fp/data.009 441 1 400 4.939e-02 T
[2026-04-10 09:46:46,345] DEEPMD INFO ../data.iters/iter.000006/02.fp/data.010 338 1 400 4.939e-02 T
[2026-04-10 09:46:46,346] DEEPMD INFO --------------------------------------------------------------------------------------
[2026-04-10 09:46:46,350] DEEPMD INFO Model Params: 1.748 M (Trainable: 1.748 M)
[2026-04-10 09:46:46,350] DEEPMD INFO Start to train 1000000 steps.
[2026-04-10 09:46:47,969] DEEPMD INFO Batch 1: trn: rmse = 7.75e+01, rmse_e = 1.85e+00, rmse_f = 1.73e+00, lr = 1.00e-03
[2026-04-10 09:46:47,970] DEEPMD INFO Batch 1: total wall time = 1.62 s, eta = 18 days, 17:35:04
[2026-04-10 09:51:01,968] DEEPMD INFO Batch 1000: trn: rmse = 1.02e+01, rmse_e = 1.44e-02, rmse_f = 2.28e-01, lr = 9.90e-04
[2026-04-10 09:51:01,969] DEEPMD INFO Batch 1000: total wall time = 254.00 s, eta = 2 days, 22:29:04
[2026-04-10 09:55:15,884] DEEPMD INFO Batch 2000: trn: rmse = 8.77e+00, rmse_e = 1.45e-02, rmse_f = 1.98e-01, lr = 9.80e-04
[2026-04-10 09:55:15,886] DEEPMD INFO Batch 2000: total wall time = 253.92 s, eta = 2 days, 22:23:29
[2026-04-10 09:59:32,311] DEEPMD INFO Batch 3000: trn: rmse = 6.74e+00, rmse_e = 4.01e-02, rmse_f = 1.53e-01, lr = 9.70e-04
[2026-04-10 09:59:32,312] DEEPMD INFO Batch 3000: total wall time = 256.43 s, eta = 2 days, 23:00:56

Just for clarity, I also tried this using the CPU docker image. This is not advised, since CPU-compiled JAX models are device specific (see section 1.1.3, JAX, of the documentation). That does not result in a segfault, but it still crashes because of an XLA serialization version incompatibility.
Running the CPU version:
apptainer exec --nv --bind "$LOCATION" --pwd "$LOCATION" path/to/apptainer/deepmd-kit_3.1.3_cpu.sif dp convert-backend dpa2.pth dpa2.savedmodel

which, as expected, does not report any CUDA binding problems (it is the CPU version), but still crashes:
WARNING:2026-04-10 16:19:05,508:jax._src.xla_bridge:876: An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.
WARNING:jax._src.xla_bridge:An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.
Traceback (most recent call last):
File "/opt/deepmd-kit/bin/dp", line 8, in <module>
sys.exit(main())
^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/deepmd/main.py", line 1052, in main
deepmd_main(args)
File "/opt/deepmd-kit/lib/python3.12/site-packages/deepmd/entrypoints/main.py", line 100, in main
convert_backend(**dict_args)
File "/opt/deepmd-kit/lib/python3.12/site-packages/deepmd/entrypoints/convert_backend.py", line 31, in convert_backend
out_hook(OUTPUT, data)
File "/opt/deepmd-kit/lib/python3.12/site-packages/deepmd/jax/utils/serialization.py", line 145, in deserialize_to_file
return deserialize_to_savedmodel(model_file, data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/deepmd/jax/jax2tf/serialization.py", line 339, in deserialize_to_file
tf.saved_model.save(
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/saved_model/save.py", line 1432, in save
save_and_return_nodes(obj, export_dir, signatures, options)
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/saved_model/save.py", line 1467, in save_and_return_nodes
build_meta_graph(obj, signatures, options, meta_graph_def))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/saved_model/save.py", line 1682, in build_meta_graph
return build_meta_graph_impl(obj, signatures, options, meta_graph_def)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/saved_model/save.py", line 1592, in build_meta_graph_impl
signatures = signature_serialization.find_function_to_export(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/saved_model/signature_serialization.py", line 109, in find_function_to_export
for name, child in children:
^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/saved_model/save.py", line 190, in list_children
for name, child in super(AugmentedGraphView, self).list_children(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/checkpoint/graph_view.py", line 75, in list_children
for name, ref in super(ObjectGraphView,
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/checkpoint/trackable_view.py", line 84, in children
for name, ref in obj.trackable_children(save_type, **kwargs).items():
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/trackable/autotrackable.py", line 115, in trackable_children
fn._list_all_concrete_functions_for_serialization() # pylint: disable=protected-access
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 1191, in _list_all_concrete_functions_for_serialization
concrete_functions.append(self.get_concrete_function(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 1256, in get_concrete_function
concrete = self.get_concrete_function_garbage_collected(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 1226, in get_concrete_function_garbage_collected
self.initialize(args, kwargs, add_initializers_to=initializers)
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 696, in initialize
self.concrete_variable_creation_fn = tracing_compilation.trace_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 178, in trace_function
concrete_function = maybe_define_function(
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 283, in maybe_define_function
concrete_function = create_concrete_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 310, in create_concrete_function
traced_func_graph = func_graph_module.func_graph_from_py_func(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/framework/func_graph.py", line 1060, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 599, in wrapped_fn
out = weak_wrapped_fn().__wrapped__(*args, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/eager/polymorphic_function/autograph_util.py", line 52, in autograph_handler
raise e.ag_error_metadata.to_exception(e)
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/eager/polymorphic_function/autograph_util.py", line 41, in autograph_handler
return api.converted_call(
^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/autograph/impl/api.py", line 439, in converted_call
result = converted_f(*effective_args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch-local/jellel.21735848/autograph_generated_file8idoo8vu.py", line 13, in tf__call_without_atomic_virial
retval_ = ag__.converted_call(ag__.converted_call(ag__.ld(make_call_whether_do_atomic_virial), (), dict(do_atomic_virial=False), fscope), (ag__.ld(coord), ag__.ld(atype), ag__.ld(box), ag__.ld(fparam), ag__.ld(aparam)), None, fscope)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/autograph/impl/api.py", line 339, in converted_call
return _call_unconverted(f, args, kwargs, options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/autograph/impl/api.py", line 460, in _call_unconverted
return f(*args)
^^^^^^^^
File "/scratch-local/jellel.21735848/autograph_generated_fileq_ztxa8d.py", line 61, in call
retval__1 = ag__.converted_call(ag__.ld(model_call_from_call_lower), (), dict(call_lower=ag__.ld(call_lower), rcut=ag__.converted_call(ag__.ld(model).get_rcut, (), None, fscope_1), sel=ag__.converted_call(ag__.ld(model).get_sel, (), None, fscope_1), mixed_types=ag__.converted_call(ag__.ld(model).mixed_types, (), None, fscope_1), model_output_def=ag__.converted_call(ag__.ld(model).model_output_def, (), None, fscope_1), coord=ag__.ld(coord), atype=ag__.ld(atype), box=ag__.ld(box), fparam=ag__.ld(fparam), aparam=ag__.ld(aparam), do_atomic_virial=ag__.ld(do_atomic_virial)), fscope_1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/autograph/impl/api.py", line 439, in converted_call
result = converted_f(*effective_args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch-local/jellel.21735848/autograph_generated_file677g64fj.py", line 62, in tf__model_call_from_call_lower
model_predict_lower = ag__.converted_call(ag__.ld(call_lower), (ag__.ld(extended_coord), ag__.ld(extended_atype), ag__.ld(nlist), ag__.ld(mapping)), dict(fparam=ag__.ld(fp), aparam=ag__.ld(ap)), fscope)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/autograph/impl/api.py", line 377, in converted_call
return _call_unconverted(f, args, kwargs, options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/opt/deepmd-kit/lib/python3.12/site-packages/deepmd/jax/jax2tf/serialization.py", line 104, in call_lower_without_atomic_virial
return tf.cond(
^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/deepmd/jax/jax2tf/serialization.py", line 106, in <lambda>
lambda: exported_whether_do_atomic_virial(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/jax/experimental/jax2tf/jax2tf.py", line 310, in converted_fun_tf
outs_flat_tf = converted_fun_flat_with_custom_gradient_tf(*args_flat_tf)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/jax/experimental/jax2tf/jax2tf.py", line 299, in converted_fun_flat_with_custom_gradient_tf
outs_tf, outs_avals, outs_tree = impl.run_fun_tf(args_flat_tf)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/jax/experimental/jax2tf/jax2tf.py", line 376, in run_fun_tf
results = _run_exported_as_tf(args_flat_tf, self.exported)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/deepmd-kit/lib/python3.12/site-packages/jax/experimental/jax2tf/jax2tf.py", line 618, in _run_exported_as_tf
raise NotImplementedError(
NotImplementedError: in user code:

File "/opt/deepmd-kit/lib/python3.12/site-packages/deepmd/jax/jax2tf/serialization.py", line 241, in call_without_atomic_virial  *
    coord, atype, box, fparam, aparam
File "/opt/deepmd-kit/lib/python3.12/site-packages/deepmd/jax/jax2tf/make_model.py", line 98, in model_call_from_call_lower  *
    model_predict_lower = call_lower(
File "/opt/deepmd-kit/lib/python3.12/site-packages/deepmd/jax/jax2tf/serialization.py", line 104, in call_lower_without_atomic_virial
    return tf.cond(
File "/opt/deepmd-kit/lib/python3.12/site-packages/deepmd/jax/jax2tf/serialization.py", line 106, in <lambda>
    lambda: exported_whether_do_atomic_virial(
File "/opt/deepmd-kit/lib/python3.12/site-packages/jax/experimental/jax2tf/jax2tf.py", line 321, in converted_fun_tf
    impl.after_conversion()
File "/opt/deepmd-kit/lib/python3.12/site-packages/jax/experimental/jax2tf/jax2tf.py", line 299, in converted_fun_flat_with_custom_gradient_tf
    outs_tf, outs_avals, outs_tree = impl.run_fun_tf(args_flat_tf)
File "/opt/deepmd-kit/lib/python3.12/site-packages/jax/experimental/jax2tf/jax2tf.py", line 376, in run_fun_tf
    results = _run_exported_as_tf(args_flat_tf, self.exported)
File "/opt/deepmd-kit/lib/python3.12/site-packages/jax/experimental/jax2tf/jax2tf.py", line 618, in _run_exported_as_tf
    raise NotImplementedError(

NotImplementedError: XlaCallModule from your TensorFlow installation supports up to serialization version 9 but the serialized module needs version 10. You should upgrade TensorFlow, e.g., to tf_nightly.


The last line is especially telling: the bundled TensorFlow supports a different XlaCallModule serialization version than the one the serialized JAX module was written with. I am not sure whether the GPU version has the same issue (just hidden behind a segfault) or something entirely different. I hope you have some good ideas on how to resolve this.
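To check which package versions are actually paired inside an image, here is a small stdlib-only sketch (the package names are the usual PyPI distributions; `jaxlib` may be absent in some builds). A mismatch between the jax/jaxlib release and the bundled TensorFlow is one plausible source of the XlaCallModule serialization-version gap:

```python
from importlib import metadata

# Report installed versions without importing the (heavy) packages themselves;
# PackageNotFoundError simply means the package is absent from this environment.
for pkg in ("tensorflow", "jax", "jaxlib"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```

Running this inside both the GPU and CPU containers would show whether the two images ship different TF/JAX pairings.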

Further Information, Files, and Links

No response


    Labels

    bug, reproduced (This bug has been reproduced by developers), upstream
