cuDNN version mismatch - resnet50 tensorflow - mlc #215

@anandhu-eng

Output log:

./run_local.sh tf resnet50 gpu --scenario Offline    --threads 2 --user_conf '/root/MLC/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/008b42b487e843888434313954e77347.conf' --use_preprocessed_dataset --cache_dir /root/MLC/repos/local/cache/get-preprocessed-dataset-imagenet_f2fa0fec --dataset-list /root/MLC/repos/local/cache/extract-file_49f3fae9/val.txt 2>&1 | tee '/mlc-mount/home/anandhu/test_results/93e5e028e03c-reference-gpu-tf-v2.18.0-cu124/resnet50/offline/performance/run_1/console.out'; echo ${PIPESTATUS[0]} > exitstatus
python3 python/main.py --profile resnet50-tf --model "/root/MLC/repos/local/cache/download-file_a5ea13cc/resnet50_v1.pb" --dataset-path /root/MLC/repos/local/cache/get-preprocessed-dataset-imagenet_f2fa0fec --output "/mlc-mount/home/anandhu/test_results/93e5e028e03c-reference-gpu-tf-v2.18.0-cu124/resnet50/offline/performance/run_1" --scenario Offline --threads 2 --user_conf /root/MLC/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/008b42b487e843888434313954e77347.conf --use_preprocessed_dataset --cache_dir /root/MLC/repos/local/cache/get-preprocessed-dataset-imagenet_f2fa0fec --dataset-list /root/MLC/repos/local/cache/extract-file_49f3fae9/val.txt
INFO:main:Namespace(dataset='imagenet', dataset_path='/root/MLC/repos/local/cache/get-preprocessed-dataset-imagenet_f2fa0fec', dataset_list='/root/MLC/repos/local/cache/extract-file_49f3fae9/val.txt', data_format=None, profile='resnet50-tf', scenario='Offline', max_batchsize=32, model='/root/MLC/repos/local/cache/download-file_a5ea13cc/resnet50_v1.pb', output='/mlc-mount/home/anandhu/test_results/93e5e028e03c-reference-gpu-tf-v2.18.0-cu124/resnet50/offline/performance/run_1', inputs=['input_tensor:0'], outputs=['ArgMax:0'], backend='tensorflow', device=None, model_name='resnet50', threads=2, qps=None, cache=0, cache_dir='/root/MLC/repos/local/cache/get-preprocessed-dataset-imagenet_f2fa0fec', preprocessed_dir=None, use_preprocessed_dataset=True, accuracy=False, find_peak_performance=False, debug=False, user_conf='/root/MLC/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/008b42b487e843888434313954e77347.conf', audit_conf='audit.config', time=None, count=None, performance_sample_count=None, max_latency=None, samples_per_query=8)
2025-02-12 10:30:16.828618: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-02-12 10:30:16.853479: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1739356216.880204    3058 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1739356216.887595    3058 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-12 10:30:16.915071: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX512_FP16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:matplotlib.font_manager:generated new fontManager
INFO:imagenet:Loading 50000 preprocessed images using 2 threads
INFO:imagenet:loaded 50000 images, cache=0, already_preprocessed=True, took=0.9sec
WARNING:tensorflow:From /root/MLC/repos/local/cache/get-git-repo_c7f3aa29/inference/vision/classification_and_detection/python/backend_tf.py:55: FastGFile.__init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.gfile.GFile.
WARNING:tensorflow:From /root/venv/mlc/lib/python3.10/site-packages/tensorflow/python/tools/strip_unused_lib.py:84: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
WARNING:tensorflow:From /root/venv/mlc/lib/python3.10/site-packages/tensorflow/python/tools/optimize_for_inference_lib.py:138: remove_training_nodes (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
I0000 00:00:1739356257.281273    3058 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 78665 MB memory:  -> device: 0, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:18:00.0, compute capability: 9.0
I0000 00:00:1739356257.287068    3058 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 78665 MB memory:  -> device: 1, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:2a:00.0, compute capability: 9.0
I0000 00:00:1739356257.290797    3058 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 78665 MB memory:  -> device: 2, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:3a:00.0, compute capability: 9.0
I0000 00:00:1739356257.294197    3058 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 78665 MB memory:  -> device: 3, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:5d:00.0, compute capability: 9.0
I0000 00:00:1739356257.298001    3058 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:4 with 78665 MB memory:  -> device: 4, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:9a:00.0, compute capability: 9.0
I0000 00:00:1739356257.308591    3058 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:5 with 78665 MB memory:  -> device: 5, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:ab:00.0, compute capability: 9.0
I0000 00:00:1739356257.312613    3058 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:6 with 78665 MB memory:  -> device: 6, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:ba:00.0, compute capability: 9.0
I0000 00:00:1739356257.315976    3058 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:7 with 78665 MB memory:  -> device: 7, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:db:00.0, compute capability: 9.0
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1739356257.774200    3058 mlir_graph_optimization_pass.cc:401] MLIR V1 optimization pass is not enabled
E0000 00:00:1739356261.281341    3599 cuda_dnn.cc:522] Loaded runtime CuDNN library: 9.0.0 but source was compiled with: 9.3.0.  CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2025-02-12 10:31:01.283436: W tensorflow/core/framework/op_kernel.cc:1841] OP_REQUIRES failed at conv_ops_fused_impl.h:625 : INVALID_ARGUMENT: No DNN in stream executor.
2025-02-12 10:31:01.283474: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: INVALID_ARGUMENT: No DNN in stream executor.
         [[{{node resnet_model/Relu}}]]
2025-02-12 10:31:01.283486: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: INVALID_ARGUMENT: No DNN in stream executor.
         [[{{node resnet_model/Relu}}]]
         [[ArgMax/_3]]
2025-02-12 10:31:01.283510: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 5866837555468538586
Traceback (most recent call last):
  File "/root/venv/mlc/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1407, in _do_call
    return fn(*args)
  File "/root/venv/mlc/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1390, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/root/venv/mlc/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1483, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) INVALID_ARGUMENT: No DNN in stream executor.
         [[{{node resnet_model/Relu}}]]
         [[ArgMax/_3]]
  (1) INVALID_ARGUMENT: No DNN in stream executor.
         [[{{node resnet_model/Relu}}]]
0 successful operations.
0 derived errors ignored.
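
The `cuda_dnn.cc:522` message in the log states the rule TensorFlow applies: the cuDNN loaded at runtime must have the same major version as the one TensorFlow was compiled against, and an equal or higher minor version. Here the container loads 9.0.0 while the binary expects 9.3.0, so the check fails and every convolution reports "No DNN in stream executor". Below is a minimal Python sketch of that rule, for illustration only. The helper names `parse_cudnn_versions` and `cudnn_runtime_is_compatible` are hypothetical and not part of TensorFlow or mlperf-automations; the regular expression assumes the exact message wording seen in this log.

```python
import re

# Matches the wording of the cuda_dnn.cc message shown in the log above.
# Assumption: other TensorFlow versions may phrase this message differently.
_MISMATCH_RE = re.compile(
    r"Loaded runtime CuDNN library: (\d+)\.(\d+)\.(\d+) "
    r"but source was compiled with: (\d+)\.(\d+)\.(\d+)"
)


def parse_cudnn_versions(log_line):
    """Return ((loaded), (compiled)) version tuples, or None if no match."""
    match = _MISMATCH_RE.search(log_line)
    if match is None:
        return None
    numbers = tuple(int(group) for group in match.groups())
    return numbers[:3], numbers[3:]


def cudnn_runtime_is_compatible(loaded, compiled):
    """Apply the rule quoted in the log: same major, equal or higher minor."""
    return loaded[0] == compiled[0] and loaded[1] >= compiled[1]


if __name__ == "__main__":
    line = (
        "Loaded runtime CuDNN library: 9.0.0 but source was compiled "
        "with: 9.3.0.  CuDNN library needs to have matching major version "
        "and equal or higher minor version."
    )
    loaded, compiled = parse_cudnn_versions(line)
    print("loaded:", loaded, "compiled:", compiled)
    print("compatible:", cudnn_runtime_is_compatible(loaded, compiled))
```

Applied to the versions in this report, the check fails because the loaded minor version (0) is lower than the compiled one (3). By the rule quoted in the log, a runtime cuDNN of 9.3 or later within the 9.x series would pass it.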

Run command:

mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev \
   --model=resnet50 \
   --implementation=reference \
   --framework=tensorflow \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --docker --quiet \
   --test_query_count=5000
