diff --git a/docs/user-tutorial/benchmarks/micro-benchmarks.md b/docs/user-tutorial/benchmarks/micro-benchmarks.md index aa3aa965b..7a259eb2b 100644 --- a/docs/user-tutorial/benchmarks/micro-benchmarks.md +++ b/docs/user-tutorial/benchmarks/micro-benchmarks.md @@ -267,20 +267,22 @@ For measurements of peer-to-peer communication performance between AMD GPUs, GPU #### Introduction -Measure the memory bandwidth of GPU using the STREAM benchmark. The benchmark tests various memory operations including copy, scale, add, and triad for double datatype. +Measure the memory bandwidth of GPU using the STREAM benchmark. The benchmark tests various memory operations including copy, scale, add, and triad for double and float datatypes. + +__Note__: When `--check_data` is enabled, each process allocates 2× `--size` bytes of host memory for validation buffers (e.g. 8 GiB with the default 4 GiB `--size`). Under `default_local_mode` with 8 GPUs this totals ~64 GiB of host RAM. Recommend using a small `--size` such as `1048576` (1 MiB) when `--check_data` is enabled. #### Metrics | Metric Name | Unit | Description | |------------------------------------------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------| -| STREAM\_COPY\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the copy operation with specified buffer size and block size. | -| STREAM\_SCALE\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the scale operation with specified buffer size and block size. | -| STREAM\_ADD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the add operation with specified buffer size and block size. | -| STREAM\_TRIAD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the triad operation with specified buffer size and block size. | -| STREAM\_COPY\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the copy operation with specified buffer size and block size. | -| STREAM\_SCALE\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the scale operation with specified buffer size and block size. | -| STREAM\_ADD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the add operation with specified buffer size and block size. | -| STREAM\_TRIAD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the triad operation with specified buffer size and block size. | +| STREAM\_COPY\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The memory bandwidth of the GPU for the copy operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. | +| STREAM\_SCALE\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The memory bandwidth of the GPU for the scale operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. | +| STREAM\_ADD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The memory bandwidth of the GPU for the add operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. | +| STREAM\_TRIAD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The memory bandwidth of the GPU for the triad operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. | +| STREAM\_COPY\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The memory bandwidth efficiency of the GPU for the copy operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. | +| STREAM\_SCALE\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The memory bandwidth efficiency of the GPU for the scale operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. | +| STREAM\_ADD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The memory bandwidth efficiency of the GPU for the add operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. | +| STREAM\_TRIAD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The memory bandwidth efficiency of the GPU for the triad operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size. | ### `ib-loopback` diff --git a/examples/benchmarks/gpu_stream.py b/examples/benchmarks/gpu_stream.py index 88c789efb..1aa67b15d 100644 --- a/examples/benchmarks/gpu_stream.py +++ b/examples/benchmarks/gpu_stream.py @@ -12,7 +12,7 @@ if __name__ == '__main__': context = BenchmarkRegistry.create_benchmark_context( - 'gpu-stream', platform=Platform.CUDA, parameters='--num_warm_up 1 --num_loops 10' + 'gpu-stream', platform=Platform.CUDA, parameters='--num_warm_up 1 --num_loops 10 --data_type double' ) # For ROCm environment, please specify the benchmark name and the platform as the following. # context = BenchmarkRegistry.create_benchmark_context( diff --git a/superbench/benchmarks/micro_benchmarks/gpu_stream.py b/superbench/benchmarks/micro_benchmarks/gpu_stream.py index 2e82262f3..2a3d38421 100644 --- a/superbench/benchmarks/micro_benchmarks/gpu_stream.py +++ b/superbench/benchmarks/micro_benchmarks/gpu_stream.py @@ -51,10 +51,21 @@ def add_parser_arguments(self): help='Number of data buffer copies performed.', ) + self._parser.add_argument( + '--data_type', + type=str, + default='double', + choices=['float', 'double'], + required=False, + help='Data type of the buffer elements.', + ) + self._parser.add_argument( '--check_data', action='store_true', - help='Enable data checking', + help='Enable data checking. Note: allocates 2x --size bytes of host memory per process ' + 'for validation buffers (e.g. 8 GiB with default 4 GiB --size). ' + 'Recommend using a small --size such as 1048576 (1 MiB) when this flag is enabled.', ) def _preprocess(self): @@ -68,8 +79,8 @@ def _preprocess(self): self.__bin_path = os.path.join(self._args.bin_dir, self._bin_name) - args = '--size %d --num_warm_up %d --num_loops %d ' % ( - self._args.size, self._args.num_warm_up, self._args.num_loops + args = '--size %d --num_warm_up %d --num_loops %d --data_type %s' % ( + self._args.size, self._args.num_warm_up, self._args.num_loops, self._args.data_type ) if self._args.check_data: diff --git a/superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt b/superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt index ba3d2750b..342c11623 100644 --- a/superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt +++ b/superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt @@ -29,7 +29,7 @@ message(STATUS "Found CUDA: " ${CUDAToolkit_VERSION}) # Source files set(SOURCES - gpu_stream_test.cpp + gpu_stream_main.cpp gpu_stream_utils.cpp gpu_stream.cu gpu_stream_kernels.cu @@ -38,6 +38,7 @@ set(SOURCES include(../cuda_common.cmake) add_executable(gpu_stream ${SOURCES}) set_property(TARGET gpu_stream PROPERTY CUDA_ARCHITECTURES ${NVCC_ARCHS_SUPPORTED}) +target_compile_definitions(gpu_stream PRIVATE _GNU_SOURCE) target_include_directories(gpu_stream PRIVATE ${CUDAToolkit_INCLUDE_DIRS}) target_link_libraries(gpu_stream numa ${NVML_LIBRARY}) diff --git a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu index 617b8338a..a8791cc7b 100644 --- a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu +++ b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu @@ -10,6 +10,7 @@ #include #include #include +#include /** * @brief Destroys the CUDA events used for benchmarking. @@ -235,15 +236,25 @@ template int GpuStream::PrepareBufAndStream(std::unique_ptrcheck_data) { - // Generate data to copy - args->sub.data_buf = static_cast(numa_alloc_onnode(args->size * sizeof(T), args->numa_id)); + // Generate data to copy - use local NUMA node for best CPU access + args->sub.data_buf = static_cast(numa_alloc_local(args->size)); + if (args->sub.data_buf == nullptr) { + std::cerr << "PrepareBufAndStream::numa_alloc_local data_buf failed" << std::endl; + return -1; + } - for (int j = 0; j < args->size / sizeof(T); j++) { + for (uint64_t j = 0; j < args->size / sizeof(T); j++) { args->sub.data_buf[j] = static_cast(j % kUInt8Mod); } - // Allocate check buffer - args->sub.check_buf = static_cast(numa_alloc_onnode(args->size * sizeof(T), args->numa_id)); + // Allocate check buffer on local NUMA node + args->sub.check_buf = static_cast(numa_alloc_local(args->size)); + if (args->sub.check_buf == nullptr) { + std::cerr << "PrepareBufAndStream::numa_alloc_local check_buf failed" << std::endl; + numa_free(args->sub.data_buf, args->size); + args->sub.data_buf = nullptr; + return -1; + } } // Allocate buffers @@ -257,7 +268,7 @@ template int GpuStream::PrepareBufAndStream(std::unique_ptrsub.gpu_buf_ptrs) { T *raw_ptr = nullptr; - cuda_err = GpuMallocDataBuf(&raw_ptr, args->size * sizeof(T)); + cuda_err = GpuMallocDataBuf(&raw_ptr, args->size); if (cuda_err != cudaSuccess) { std::cerr << "PrepareBufAndStream::cudaMalloc error: " << cuda_err << std::endl; return -1; @@ -350,7 +361,12 @@ template int GpuStream::CheckBuf(std::unique_ptr> &arg return -1; } - // Validate result by comparing the data buffer and check buffer + // Validate result by comparing the data buffer and check buffer. + // NOTE: memcmp is exact (byte-for-byte). This works because the current test values + // (j % 256, scalar = 11.0) are exactly representable in both float and double IEEE-754. + // If kUInt8Mod or scalar are changed to values that cause rounding differences between + // host (two separate ops) and GPU (FMA), this check will need a tolerance-based comparison + // for T = float. memcmp_result = memcmp(args->sub.validation_buf_ptrs[kernel_idx].data(), args->sub.check_buf, args->size); if (memcmp_result) { std::cerr << "CheckBuf::Memory check failed for kernel index " << kernel_idx << std::endl; @@ -419,11 +435,19 @@ int GpuStream::RunStreamKernel(std::unique_ptr> &args, Kernel kerne uint64_t num_thread_blocks; int size_factor = 2; + if (num_threads_per_block == 0) { + std::cerr << "RunStreamKernel::num_threads_per_block must be > 0" << std::endl; + return -1; + } + // Validate data size - uint64_t num_elements_in_thread_block = kNumLoopUnroll * num_threads_per_block; - uint64_t num_bytes_in_thread_block = num_elements_in_thread_block * sizeof(T); + // Each thread processes one VecT element (128 bits / 16 bytes) for optimal memory bandwidth. + // Derived from VecT so any vector type change is caught at compile time. + constexpr uint64_t kBytesPerThread = sizeof(VecT); + static_assert(kBytesPerThread == 16, "Vector type must be 128-bit aligned for current PTX"); + uint64_t num_bytes_in_thread_block = num_threads_per_block * kBytesPerThread; if (args->size % num_bytes_in_thread_block) { - std::cerr << "RunCopy: Data size should be multiple of " << num_bytes_in_thread_block << std::endl; + std::cerr << "RunStreamKernel: Data size should be multiple of " << num_bytes_in_thread_block << std::endl; return -1; } num_thread_blocks = args->size / num_bytes_in_thread_block; @@ -435,7 +459,7 @@ int GpuStream::RunStreamKernel(std::unique_ptr> &args, Kernel kerne } // Launch jobs and collect running time - for (int i = 0; i < args->num_loops + args->num_warm_up; i++) { + for (uint64_t i = 0; i < args->num_loops + args->num_warm_up; i++) { // Record start event once warm up iterations are done if (i == args->num_warm_up) { @@ -448,30 +472,30 @@ int GpuStream::RunStreamKernel(std::unique_ptr> &args, Kernel kerne switch (kernel) { case Kernel::kCopy: - CopyKernel<<sub.stream>>>( - reinterpret_cast(args->sub.gpu_buf_ptrs[2].get()), - reinterpret_cast(args->sub.gpu_buf_ptrs[0].get())); + CopyKernel<<sub.stream>>>( + reinterpret_cast *>(args->sub.gpu_buf_ptrs[2].get()), + reinterpret_cast *>(args->sub.gpu_buf_ptrs[0].get())); args->sub.kernel_name = "COPY"; break; case Kernel::kScale: - ScaleKernel<<sub.stream>>>( - reinterpret_cast(args->sub.gpu_buf_ptrs[2].get()), - reinterpret_cast(args->sub.gpu_buf_ptrs[0].get()), static_cast(scalar)); + ScaleKernel<<sub.stream>>>( + reinterpret_cast *>(args->sub.gpu_buf_ptrs[2].get()), + reinterpret_cast *>(args->sub.gpu_buf_ptrs[0].get()), static_cast(scalar)); args->sub.kernel_name = "SCALE"; break; case Kernel::kAdd: - AddKernel<<sub.stream>>>( - reinterpret_cast(args->sub.gpu_buf_ptrs[2].get()), - reinterpret_cast(args->sub.gpu_buf_ptrs[0].get()), - reinterpret_cast(args->sub.gpu_buf_ptrs[1].get())); + AddKernel<<sub.stream>>>( + reinterpret_cast *>(args->sub.gpu_buf_ptrs[2].get()), + reinterpret_cast *>(args->sub.gpu_buf_ptrs[0].get()), + reinterpret_cast *>(args->sub.gpu_buf_ptrs[1].get())); size_factor = 3; args->sub.kernel_name = "ADD"; break; case Kernel::kTriad: - TriadKernel<<sub.stream>>>( - reinterpret_cast(args->sub.gpu_buf_ptrs[2].get()), - reinterpret_cast(args->sub.gpu_buf_ptrs[0].get()), - reinterpret_cast(args->sub.gpu_buf_ptrs[1].get()), static_cast(scalar)); + TriadKernel<<sub.stream>>>( + reinterpret_cast *>(args->sub.gpu_buf_ptrs[2].get()), + reinterpret_cast *>(args->sub.gpu_buf_ptrs[0].get()), + reinterpret_cast *>(args->sub.gpu_buf_ptrs[1].get()), static_cast(scalar)); size_factor = 3; args->sub.kernel_name = "TRIAD"; break; @@ -583,11 +607,10 @@ int GpuStream::RunStream(std::unique_ptr> &args, const std::string // output formatted results to stdout // Tags are of format: - // STREAM__datatype_gpu__buffer__block_ - for (int i = 0; i < args->sub.times_in_ms.size(); i++) { - std::string tag = "STREAM_" + KernelToString(i) + "_" + data_type + "_gpu_" + std::to_string(args->gpu_id) + - "_buffer_" + std::to_string(args->size); - for (int j = 0; j < args->sub.times_in_ms[i].size(); j++) { + // STREAM__datatype_buffer__block_ + for (size_t i = 0; i < args->sub.times_in_ms.size(); i++) { + std::string tag = "STREAM_" + KernelToString(i) + "_" + data_type + "_buffer_" + std::to_string(args->size); + for (size_t j = 0; j < args->sub.times_in_ms[i].size(); j++) { // Calculate and display bandwidth double bw = args->size * args->num_loops / args->sub.times_in_ms[i][j] / 1e6; std::cout << tag << "_block_" << kThreadsPerBlock[j] << "\t" << bw << "\t"; @@ -605,12 +628,33 @@ int GpuStream::RunStream(std::unique_ptr> &args, const std::string return ret; } +/** + * @brief Creates and initializes a BenchArgs for the given type and adds it to bench_args_. + * + * @tparam T The data type (float or double) for the benchmark arguments. + */ +template int GpuStream::CreateBenchArgs() { + auto args = std::make_unique>(); + args->gpu_id = 0; + cudaError_t cuda_err = cudaGetDeviceProperties(&args->gpu_device_prop, 0); + if (cuda_err != cudaSuccess) { + std::cerr << "CreateBenchArgs::cudaGetDeviceProperties error: " << cuda_err << std::endl; + return -1; + } + args->num_warm_up = opts_.num_warm_up; + args->num_loops = opts_.num_loops; + args->size = opts_.size; + args->check_data = opts_.check_data; + bench_args_ = std::move(args); + return 0; +} + /** * @brief Runs the Stream benchmark. * - * @details This function processes the input args, validates and composes the BenchArgs structure for the - availavble - * GPUs, and runs the benchmark. + * @details This function processes the input args, validates and composes the BenchArgs structure for + * the first visible GPU (CUDA device 0). When running under Superbench's default_local_mode, + * CUDA_VISIBLE_DEVICES is set per process, so device 0 maps to the assigned physical GPU. * * @return int The status code indicating success or failure of the benchmark execution. * */ @@ -631,71 +675,90 @@ int GpuStream::Run() { return ret; } - // find all GPUs and compose the Benchmarking data structure - for (int j = 0; j < gpu_count; j++) { - auto args = std::make_unique>(); - args->numa_id = 0; - args->gpu_id = j; - cudaGetDeviceProperties(&args->gpu_device_prop, j); - - args->num_warm_up = opts_.num_warm_up; - args->num_loops = opts_.num_loops; - args->size = opts_.size; - args->check_data = opts_.check_data; - args->numa_id = 0; - args->gpu_id = j; - - // add data to vector - bench_args_.emplace_back(std::move(args)); - } - - bool has_error = false; - // Run the benchmark for all the configured data - for (auto &variant_args : bench_args_) { - std::visit( - [&](auto &curr_args) { - // Get memory clock rate once for both bandwidth computation and display - float memory_clock_mhz = GetMemoryClockRate(curr_args->gpu_id, curr_args->gpu_device_prop); - - // Compute theoretical bandwidth using the memory clock rate - float peak_bw = -1.0f; - if (memory_clock_mhz > 0.0f) { - // Calculate theoretical bandwidth: memory_clock_mhz * bus_width_bytes * 2 (DDR) / 1000 (convert to - // GB/s) - peak_bw = memory_clock_mhz * (curr_args->gpu_device_prop.memoryBusWidth / 8) * 2 / 1000.0; - } - - // Print device info with both the memory clock and peak bandwidth - PrintCudaDeviceInfo(curr_args->gpu_id, curr_args->gpu_device_prop, memory_clock_mhz, peak_bw); - - // Set the NUMA node - ret = numa_run_on_node(curr_args->numa_id); - if (ret != 0) { - std::cerr << "Run::numa_run_on_node error: " << errno << std::endl; - has_error = true; - return; - } - - // Run the stream benchmark for the configured data, passing the peak bandwidth - if constexpr (std::is_same_v, BenchArgs>) { - ret = RunStream(curr_args, "float", peak_bw); - } else if constexpr (std::is_same_v, BenchArgs>) { - ret = RunStream(curr_args, "double", peak_bw); - } else { - std::cerr << "Run::Unknown type error" << std::endl; - has_error = true; - return; - } - - if (ret != 0) { - std::cerr << "Run::RunStream error: " << errno << std::endl; - has_error = true; - } - }, - variant_args); - } - if (has_error) { + if (gpu_count < 1) { + std::cerr << "Run::No GPU available" << std::endl; + return -1; + } + + // Run on CUDA device 0 (the visible GPU assigned by CUDA_VISIBLE_DEVICES). + if (opts_.data_type == "float") { + ret = CreateBenchArgs(); + } else if (opts_.data_type == "double") { + ret = CreateBenchArgs(); + } else { + std::cerr << "Run::Invalid data_type: " << opts_.data_type << std::endl; + return -1; + } + if (ret != 0) { + return ret; + } + + // Pin the thread to the GPU's NUMA node for optimal host↔device bandwidth. + // Query GPU 0's preferred CPU NUMA node via NVML; fall back to the process's + // current node if the NVML query fails (e.g. NUMA disabled, older driver). + int target_node = -1; + { + nvmlDevice_t nvml_dev; + unsigned int gpu_numa_node = 0; + if (nvmlInit() == NVML_SUCCESS) { + if (nvmlDeviceGetHandleByIndex(0, &nvml_dev) == NVML_SUCCESS && + nvmlDeviceGetNumaNodeId(nvml_dev, &gpu_numa_node) == NVML_SUCCESS) { + target_node = static_cast(gpu_numa_node); + } + nvmlShutdown(); + } + } + if (target_node < 0) { + // Fallback: use the node where this process is currently scheduled + int cpu = sched_getcpu(); + if (cpu < 0) { + std::cerr << "Run::sched_getcpu failed" << std::endl; + return -1; + } + target_node = numa_node_of_cpu(cpu); + if (target_node < 0) { + std::cerr << "Run::numa_node_of_cpu failed for cpu " << cpu << std::endl; + return -1; + } + } + if (numa_run_on_node(target_node) != 0) { + std::cerr << "Run::numa_run_on_node failed for node " << target_node << std::endl; return -1; } + + // Run the benchmark for the configured data + std::visit( + [&](auto &curr_args) { + // Get memory clock rate once for both bandwidth computation and display + float memory_clock_mhz = GetMemoryClockRate(curr_args->gpu_id, curr_args->gpu_device_prop); + + // Compute theoretical bandwidth using the memory clock rate + float peak_bw = -1.0f; + if (memory_clock_mhz > 0.0f) { + // Calculate theoretical bandwidth: memory_clock_mhz * bus_width_bytes * 2 (DDR) / 1000 (convert to + // GB/s) + peak_bw = memory_clock_mhz * (curr_args->gpu_device_prop.memoryBusWidth / 8) * 2 / 1000.0; + } + + // Print device info with both the memory clock and peak bandwidth + PrintCudaDeviceInfo(curr_args->gpu_id, curr_args->gpu_device_prop, memory_clock_mhz, peak_bw); + + // Run the stream benchmark for the configured data, passing the peak bandwidth + if constexpr (std::is_same_v, BenchArgs>) { + ret = RunStream(curr_args, "float", peak_bw); + } else if constexpr (std::is_same_v, BenchArgs>) { + ret = RunStream(curr_args, "double", peak_bw); + } else { + std::cerr << "Run::Unknown type error" << std::endl; + ret = -1; + return; + } + + if (ret != 0) { + std::cerr << "Run::RunStream error: " << errno << std::endl; + } + }, + bench_args_); + return ret; } \ No newline at end of file diff --git a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.hpp b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.hpp index 473a78839..65502a243 100644 --- a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.hpp +++ b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.hpp @@ -34,8 +34,8 @@ class GpuStream { int Run(); private: - using BenchArgsVariant = std::variant>>; - std::vector bench_args_; + using BenchArgsVariant = std::variant>, std::unique_ptr>>; + BenchArgsVariant bench_args_; Opts opts_; // Memory management functions @@ -56,6 +56,7 @@ class GpuStream { template int RunStream(std::unique_ptr> &, const std::string &data_type, float peak_bw); // Helper functions + template int CreateBenchArgs(); int GetGpuCount(int *); int SetGpu(int gpu_id); float GetMemoryClockRate(int device_id, const cudaDeviceProp &prop); diff --git a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu index 548fc8ba3..e40237b83 100644 --- a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu +++ b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu @@ -1,155 +1,33 @@ // Copyright (c) Microsoft Corporation. // Licensed under the MIT License. -#include "gpu_stream_kernels.hpp" - -/** - * @brief Fetches a value from source memory and writes it to a register. - * - * @details This inline device function fetches a value from the specified source memory - * location and writes it to the provided register. The implementation references the following: - * 1) NCCL: - * https://github.com/NVIDIA/nccl/blob/7e515921295adaab72adf56ea71a0fafb0ecb5f3/src/collectives/device/common_kernel.h#L483 - * 2) RCCL: - * https://github.com/ROCmSoftwarePlatform/rccl/blob/5c8380ff5b5925cae4bce00b1879a5f930226e8d/src/collectives/device/common_kernel.h#L268 - * - * @tparam T The type of the value to fetch. - * @param[out] v The register to write the fetched value to. - * @param[in] p The source memory location to fetch the value from. - */ -template inline __device__ void Fetch(T &v, const T *p) { -#if defined(__HIP_PLATFORM_HCC__) || defined(__HCC__) || defined(__HIPCC__) - v = *p; -#else - if constexpr (std::is_same::value) { - asm volatile("ld.volatile.global.f32 %0, [%1];" : "=f"(v) : "l"(p) : "memory"); - } else if constexpr (std::is_same::value) { - asm volatile("ld.volatile.global.f64 %0, [%1];" : "=d"(v) : "l"(p) : "memory"); - } -#endif -} - /** - * @brief Stores a value from register and writes it to target memory. + * @file gpu_stream_kernels.cu + * @brief CUDA kernel compilation unit for GPU stream benchmark. * - * @details This inline device function stores a value from the provided register - * and writes it to the specified target memory location. The implementation references the following: - * 1) NCCL: - * https://github.com/NVIDIA/nccl/blob/7e515921295adaab72adf56ea71a0fafb0ecb5f3/src/collectives/device/common_kernel.h#L486 - * 2) RCCL: - * https://github.com/ROCmSoftwarePlatform/rccl/blob/5c8380ff5b5925cae4bce00b1879a5f930226e8d/src/collectives/device/common_kernel.h#L276 + * All template kernel implementations (CopyKernel, ScaleKernel, AddKernel, TriadKernel) + * are defined in gpu_stream_kernels.hpp rather than here. This is required because: * - * @tparam T The type of the value to store. - * @param[out] p The target memory location to write the value to. - * @param[in] v The register containing the value to be stored. - */ -template inline __device__ void Store(T *p, const T &v) { -#if defined(__HIP_PLATFORM_HCC__) || defined(__HCC__) || defined(__HIPCC__) - *p = v; -#else - if constexpr (std::is_same::value) { - asm volatile("st.volatile.global.f32 [%0], %1;" ::"l"(p), "f"(v) : "memory"); - } else if constexpr (std::is_same::value) { - asm volatile("st.volatile.global.f64 [%0], %1;" ::"l"(p), "d"(v) : "memory"); - } -#endif -} - -/** - * @brief Performs COPY, a simple copy operation from source to target. b = a + * 1. **C++ Template Instantiation Model**: Templates are not compiled until they are + * instantiated with concrete types. The compiler needs to see the full template + * definition (not just declaration) at the point of instantiation. * - * @details This CUDA kernel performs a simple copy operation, copying data from the source array - * to the target array. This is used to measure transfer rates without any arithmetic operations. + * 2. **Separate Compilation Units**: When gpu_stream.cu calls `CopyKernel<<<...>>>`, + * nvcc needs the full kernel implementation visible in that translation unit. + * If implementations were only in this .cu file, gpu_stream.cu would only see + * declarations, causing "undefined reference" linker errors. * - * @param[out] tgt The target array where data will be copied to. - * @param[in] src The source array from which data will be copied. - */ -__global__ void CopyKernel(double *tgt, const double *src) { - uint64_t index = blockIdx.x * blockDim.x * kNumLoopUnrollAlias + threadIdx.x; - double val[kNumLoopUnrollAlias]; -#pragma unroll - for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) - Fetch(val[i], src + index + i * blockDim.x); -#pragma unroll - for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) - Store(tgt + index + i * blockDim.x, val[i]); -} - -/** - * @brief Performs SCALE, a scaling operation on the source data. b = x * a - * - * @details This CUDA kernel performs a simple arithmetic operation by scaling the source data - * with a given scalar value and storing the result in the target array. - * - * @param[out] tgt The target array where the scaled data will be stored. - * @param[in] src The source array containing the data to be scaled. - * @param[in] scalar The scalar value used to scale the source data. - */ -__global__ void ScaleKernel(double *tgt, const double *src, const double scalar) { - uint64_t index = blockIdx.x * blockDim.x * kNumLoopUnrollAlias + threadIdx.x; - double val[kNumLoopUnrollAlias]; -#pragma unroll - for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) - Fetch(val[i], src + index + i * blockDim.x); -#pragma unroll - for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) { - val[i] *= scalar; - Store(tgt + index + i * blockDim.x, val[i]); - } -} - -/** - * @brief Performs ADD, an addition operation on two source arrays. c = a + b - * - * @details This CUDA kernel adds corresponding elements from two source arrays and stores the result - * in the target array. This operation is used to measure transfer rates with a simple arithmetic addition. - * - * @param[out] tgt The target array where the result of the addition will be stored. - * @param[in] src_a The first source array containing the first set of operands. - * @param[in] src_b The second source array containing the second set of operands. - */ -__global__ void AddKernel(double *tgt, const double *src_a, const double *src_b) { - uint64_t index = blockIdx.x * blockDim.x * kNumLoopUnrollAlias + threadIdx.x; - double val_a[kNumLoopUnrollAlias]; - double val_b[kNumLoopUnrollAlias]; - -#pragma unroll - for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) { - Fetch(val_a[i], src_a + index + i * blockDim.x); - Fetch(val_b[i], src_b + index + i * blockDim.x); - } -#pragma unroll - for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) { - val_a[i] += val_b[i]; - Store(tgt + index + i * blockDim.x, val_a[i]); - } -} - -/** - * @brief Performs TRIAD, fused multiply/add operations on source arrays. a = b + x * c + * 3. **CUDA-Specific Consideration**: Unlike regular C++ where explicit template + * instantiation in a .cpp file can work, CUDA kernel launches require the kernel + * code to be visible to nvcc when compiling the caller. This is because nvcc + * generates device code at compile time, not link time. * - * @details This CUDA kernel performs a fused multiply/add operation by multiplying elements from - * the second source array with a scalar value, adding the result to corresponding elements from - * the first source array, and storing the result in the target array. + * 4. **Header Guards for Mixed Compilation**: The header uses `#ifdef __CUDACC__` to + * protect CUDA-specific code (blockIdx, threadIdx, __global__, etc.) from g++ + * when the header is indirectly included by .cpp files (e.g., via gpu_stream.hpp). * - * @param[out] tgt The target array where the result of the fused multiply/add operation will be stored. - * @param[in] src_a The first source array containing the first set of operands. - * @param[in] src_b The second source array containing the second set of operands to be multiplied by the scalar. - * @param[in] scalar The scalar value used in the multiply/add operation. + * This file remains as the compilation unit that ensures the header is processed + * by nvcc, and can host any future non-template helper functions if needed. */ -__global__ void TriadKernel(double *tgt, const double *src_a, const double *src_b, const double scalar) { - uint64_t index = blockIdx.x * blockDim.x * kNumLoopUnrollAlias + threadIdx.x; - double val_a[kNumLoopUnrollAlias]; - double val_b[kNumLoopUnrollAlias]; -#pragma unroll - for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) { - Fetch(val_a[i], src_a + index + i * blockDim.x); - Fetch(val_b[i], src_b + index + i * blockDim.x); - } -#pragma unroll - for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) { - val_b[i] += (val_a[i] * scalar); - Store(tgt + index + i * blockDim.x, val_b[i]); - } -} \ No newline at end of file +#include "gpu_stream_kernels.hpp" \ No newline at end of file diff --git a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp index cfe9f2052..8702b1421 100644 --- a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp +++ b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp @@ -5,15 +5,190 @@ #include #include +#include #include "gpu_stream_utils.hpp" -constexpr auto kNumLoopUnrollAlias = stream_config::kNumLoopUnroll; -// Function declarations -template inline __device__ void Fetch(T &v, const T *p); -template inline __device__ void Store(T *p, const T &v); +/** + * @brief Type trait mapping scalar types to their 128-bit aligned vector types. + * + * @details For optimal memory bandwidth, we use 128-bit (16 byte) vector loads/stores: + * - double -> double2 (2 x 64-bit = 128-bit) + * - float -> float4 (4 x 32-bit = 128-bit) + */ +template struct VectorType; +template <> struct VectorType { using type = double2; }; +template <> struct VectorType { using type = float4; }; -__global__ void CopyKernel(double *, const double *); -__global__ void ScaleKernel(double *, const double *, const double); -__global__ void AddKernel(double *, const double *, const double *); -__global__ void TriadKernel(double *, const double *, const double *, const double); \ No newline at end of file +template using VecT = typename VectorType::type; + +// Kernel declarations (visible to all compilers for function pointer usage) +template __global__ void CopyKernel(VecT *tgt, const VecT *src); +template __global__ void ScaleKernel(VecT *tgt, const VecT *src, const T scalar); +template __global__ void AddKernel(VecT *tgt, const VecT *src_a, const VecT *src_b); +template +__global__ void TriadKernel(VecT *tgt, const VecT *src_a, const VecT *src_b, const T scalar); + +// Implementation section - only compiled by nvcc +#ifdef __CUDACC__ + +/** + * @brief Fetches a value from source memory and writes it to a register. + * + * @details This inline device function fetches a value from the specified source memory + * location and writes it to the provided register. The implementation references the following: + * 1) NCCL: + * https://github.com/NVIDIA/nccl/blob/7e515921295adaab72adf56ea71a0fafb0ecb5f3/src/collectives/device/common_kernel.h#L483 + * 2) RCCL: + * https://github.com/ROCmSoftwarePlatform/rccl/blob/5c8380ff5b5925cae4bce00b1879a5f930226e8d/src/collectives/device/common_kernel.h#L268 + * + * @tparam T The type of the value to fetch. + * @param[out] v The register to write the fetched value to. + * @param[in] p The source memory location to fetch the value from. + */ +template inline __device__ void Fetch(T &v, const T *p) { +#if defined(__HIP_PLATFORM_HCC__) || defined(__HCC__) || defined(__HIPCC__) + v = *p; +#else + if constexpr (std::is_same::value) { + asm volatile("ld.volatile.global.v2.f64 {%0,%1}, [%2];" : "=d"(v.x), "=d"(v.y) : "l"(p) : "memory"); + } else if constexpr (std::is_same::value) { + asm volatile("ld.volatile.global.v4.f32 {%0,%1,%2,%3}, [%4];" + : "=f"(v.x), "=f"(v.y), "=f"(v.z), "=f"(v.w) + : "l"(p) + : "memory"); + } +#endif +} + +/** + * @brief Stores a value from register and writes it to target memory. + * + * @details This inline device function stores a value from the provided register + * and writes it to the specified target memory location. The implementation references the following: + * 1) NCCL: + * https://github.com/NVIDIA/nccl/blob/7e515921295adaab72adf56ea71a0fafb0ecb5f3/src/collectives/device/common_kernel.h#L486 + * 2) RCCL: + * https://github.com/ROCmSoftwarePlatform/rccl/blob/5c8380ff5b5925cae4bce00b1879a5f930226e8d/src/collectives/device/common_kernel.h#L276 + * + * @tparam T The type of the value to store. + * @param[out] p The target memory location to write the value to. + * @param[in] v The register containing the value to be stored. + */ +template inline __device__ void Store(T *p, const T &v) { +#if defined(__HIP_PLATFORM_HCC__) || defined(__HCC__) || defined(__HIPCC__) + *p = v; +#else + if constexpr (std::is_same::value) { + asm volatile("st.volatile.global.v2.f64 [%0], {%1,%2};" ::"l"(p), "d"(v.x), "d"(v.y) : "memory"); + } else if constexpr (std::is_same::value) { + asm volatile("st.volatile.global.v4.f32 [%0], {%1,%2,%3,%4};" ::"l"(p), "f"(v.x), "f"(v.y), "f"(v.z), "f"(v.w) + : "memory"); + } +#endif +} + +/** + * @brief Performs COPY, a simple copy operation from source to target. b = a + * + * @details This CUDA kernel performs a simple copy operation, copying data from the source array + * to the target array. This is used to measure transfer rates without any arithmetic operations. + * + * @param[out] tgt The target array where data will be copied to (128-bit aligned). + * @param[in] src The source array from which data will be copied (128-bit aligned). + */ +template __global__ void CopyKernel(VecT *tgt, const VecT *src) { + uint64_t index = blockIdx.x * blockDim.x + threadIdx.x; + VecT val; + Fetch(val, src + index); + Store(tgt + index, val); +} + +/** + * @brief Performs SCALE, a scaling operation on the source data. b = x * a + * + * @details This CUDA kernel performs a simple arithmetic operation by scaling the source data + * with a given scalar value and storing the result in the target array. + * + * @param[out] tgt The target array where the scaled data will be stored (128-bit aligned). + * @param[in] src The source array containing the data to be scaled (128-bit aligned). + * @param[in] scalar The scalar value used to scale the source data. + */ +template __global__ void ScaleKernel(VecT *tgt, const VecT *src, const T scalar) { + uint64_t index = blockIdx.x * blockDim.x + threadIdx.x; + VecT val; + Fetch(val, src + index); + if constexpr (std::is_same::value) { + val.x *= scalar; + val.y *= scalar; + } else if constexpr (std::is_same::value) { + val.x *= scalar; + val.y *= scalar; + val.z *= scalar; + val.w *= scalar; + } + Store(tgt + index, val); +} + +/** + * @brief Performs ADD, an addition operation on two source arrays. c = a + b + * + * @details This CUDA kernel adds corresponding elements from two source arrays and stores the result + * in the target array. This operation is used to measure transfer rates with a simple arithmetic addition. + * + * @param[out] tgt The target array where the result of the addition will be stored (128-bit aligned). + * @param[in] src_a The first source array containing the first set of operands (128-bit aligned). + * @param[in] src_b The second source array containing the second set of operands (128-bit aligned). + */ +template __global__ void AddKernel(VecT *tgt, const VecT *src_a, const VecT *src_b) { + uint64_t index = blockIdx.x * blockDim.x + threadIdx.x; + VecT val_a; + VecT val_b; + Fetch(val_a, src_a + index); + Fetch(val_b, src_b + index); + if constexpr (std::is_same::value) { + val_a.x += val_b.x; + val_a.y += val_b.y; + } else if constexpr (std::is_same::value) { + val_a.x += val_b.x; + val_a.y += val_b.y; + val_a.z += val_b.z; + val_a.w += val_b.w; + } + Store(tgt + index, val_a); +} + +/** + * @brief Performs TRIAD, fused multiply/add operations on source arrays. c = b + x * a + * + * @details This CUDA kernel performs a fused multiply/add operation by multiplying elements from + * the first source array with a scalar value, adding the result to corresponding elements from + * the second source array, and storing the result in the target array. + * + * @param[out] tgt The target array where the result of the fused multiply/add operation will be stored (128-bit + * aligned). + * @param[in] src_a The first source array containing the first set of operands to be multiplied by the scalar + * (128-bit aligned). + * @param[in] src_b The second source array containing the second set of operands (128-bit aligned). + * @param[in] scalar The scalar value used in the multiply/add operation. + */ +template +__global__ void TriadKernel(VecT *tgt, const VecT *src_a, const VecT *src_b, const T scalar) { + uint64_t index = blockIdx.x * blockDim.x + threadIdx.x; + VecT val_a; + VecT val_b; + Fetch(val_a, src_a + index); + Fetch(val_b, src_b + index); + if constexpr (std::is_same::value) { + val_b.x += (val_a.x * scalar); + val_b.y += (val_a.y * scalar); + } else if constexpr (std::is_same::value) { + val_b.x += (val_a.x * scalar); + val_b.y += (val_a.y * scalar); + val_b.z += (val_a.z * scalar); + val_b.w += (val_a.w * scalar); + } + Store(tgt + index, val_b); +} + +#endif // __CUDACC__ \ No newline at end of file diff --git a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_test.cpp b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_main.cpp similarity index 100% rename from superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_test.cpp rename to superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_main.cpp diff --git a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp index 6ced0fdd5..fd0dfb913 100644 --- a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp +++ b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp @@ -43,6 +43,7 @@ void PrintUsage() { << "--size " << "--num_warm_up " << "--num_loops " + << "[--data_type ] " << "[--check_data]" << std::endl; } @@ -60,6 +61,7 @@ void PrintInputInfo(Opts &opts) { std::cout << "Buffer size(bytes): " << opts.size << std::endl; std::cout << "Number of warm up runs: " << opts.num_warm_up << std::endl; std::cout << "Number of loops: " << opts.num_loops << std::endl; + std::cout << "Data type: " << opts.data_type << std::endl; std::cout << "Check data: " << (opts.check_data ? "Yes" : "No") << std::endl; } @@ -75,11 +77,12 @@ void PrintInputInfo(Opts &opts) { * @return int The status code. * */ int ParseOpts(int argc, char **argv, Opts *opts) { - enum class OptIdx { kSize, kNumWarmUp, kNumLoops, kEnableCheckData }; + enum class OptIdx { kSize, kNumWarmUp, kNumLoops, kEnableCheckData, kDataType }; const struct option options[] = {{"size", required_argument, nullptr, static_cast(OptIdx::kSize)}, {"num_warm_up", required_argument, nullptr, static_cast(OptIdx::kNumWarmUp)}, {"num_loops", required_argument, nullptr, static_cast(OptIdx::kNumLoops)}, - {"check_data", no_argument, nullptr, static_cast(OptIdx::kEnableCheckData)}}; + {"check_data", no_argument, nullptr, static_cast(OptIdx::kEnableCheckData)}, + {"data_type", required_argument, nullptr, static_cast(OptIdx::kDataType)}}; int getopt_ret = 0; int opt_idx = 0; bool size_specified = true; @@ -126,6 +129,13 @@ int ParseOpts(int argc, char **argv, Opts *opts) { case static_cast(OptIdx::kEnableCheckData): opts->check_data = true; break; + case static_cast(OptIdx::kDataType): + opts->data_type = optarg; + if (opts->data_type != "float" && opts->data_type != "double") { + std::cerr << "Invalid data_type: " << optarg << ". Must be 'float' or 'double'." << std::endl; + parse_err = true; + } + break; default: parse_err = true; } diff --git a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.hpp b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.hpp index 0c648514b..907d05ef2 100644 --- a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.hpp +++ b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.hpp @@ -31,7 +31,6 @@ unsigned long long getCurrentTimestampInMicroseconds(); namespace stream_config { constexpr std::array kThreadsPerBlock = {128, 256, 512, 1024}; // Threads per block constexpr uint64_t kDefaultBufferSizeInBytes = 4294967296; // Default buffer size 4GB -constexpr int kNumLoopUnroll = 2; // Unroll depth in SM copy kernel constexpr int kNumBuffers = 3; // Number of buffers for triad, add kernel constexpr int kNumValidationBuffers = 4; // Number of validation buffers, one for each kernel constexpr int kUInt8Mod = 256; // Modulo for unsigned long data type @@ -83,10 +82,7 @@ template struct SubBenchArgs { // Arguments for each benchmark run. template struct BenchArgs { - // NUMA node under which the benchmark is done. - uint64_t numa_id = 0; - - // GPU ID for device. + // GPU ID for device (always 0 - actual GPU determined by CUDA_VISIBLE_DEVICES). int gpu_id = 0; // GPU device info @@ -121,6 +117,9 @@ struct Opts { // Whether check data after copy. bool check_data = false; + + // Data type for the benchmark ("float" or "double"). + std::string data_type = "double"; }; std::string KernelToString(int); // Function to convert enum to string diff --git a/tests/benchmarks/micro_benchmarks/test_gpu_stream.py b/tests/benchmarks/micro_benchmarks/test_gpu_stream.py index dc8048054..bf37074e3 100644 --- a/tests/benchmarks/micro_benchmarks/test_gpu_stream.py +++ b/tests/benchmarks/micro_benchmarks/test_gpu_stream.py @@ -20,7 +20,7 @@ def setUpClass(cls): cls.createMockEnvs(cls) cls.createMockFiles(cls, ['bin/gpu_stream']) - def _test_gpu_stream_command_generation(self, platform): + def _test_gpu_stream_command_generation(self, platform, data_type='double'): """Test gpu-stream benchmark command generation.""" benchmark_name = 'gpu-stream' (benchmark_class, @@ -31,9 +31,9 @@ def _test_gpu_stream_command_generation(self, platform): num_loops = 10 size = 25769803776 - parameters = '--num_warm_up %d --num_loops %d --size %d ' \ + parameters = '--num_warm_up %d --num_loops %d --size %d --data_type %s ' \ '--check_data' % \ - (num_warm_up, num_loops, size) + (num_warm_up, num_loops, size, data_type) benchmark = benchmark_class(benchmark_name, parameters=parameters) # Check basic information @@ -49,6 +49,7 @@ def _test_gpu_stream_command_generation(self, platform): assert (benchmark._args.num_warm_up == num_warm_up) assert (benchmark._args.num_loops == num_loops) assert (benchmark._args.check_data) + assert (benchmark._args.data_type == data_type) # Check command assert (1 == len(benchmark._commands)) @@ -56,6 +57,7 @@ def _test_gpu_stream_command_generation(self, platform): assert ('--size %d' % size in benchmark._commands[0]) assert ('--num_warm_up %d' % num_warm_up in benchmark._commands[0]) assert ('--num_loops %d' % num_loops in benchmark._commands[0]) + assert ('--data_type %s' % data_type in benchmark._commands[0]) assert ('--check_data' in benchmark._commands[0]) @decorator.cuda_test @@ -63,6 +65,11 @@ def test_gpu_stream_command_generation_cuda(self): """Test gpu-stream benchmark command generation, CUDA case.""" self._test_gpu_stream_command_generation(Platform.CUDA) + @decorator.cuda_test + def test_gpu_stream_command_generation_cuda_float(self): + """Test gpu-stream benchmark command generation with float, CUDA case.""" + self._test_gpu_stream_command_generation(Platform.CUDA, data_type='float') + @decorator.load_data('tests/data/gpu_stream.log') def _test_gpu_stream_result_parsing(self, platform, test_raw_output): """Test gpu-stream benchmark result parsing.""" @@ -110,3 +117,44 @@ def _test_gpu_stream_result_parsing(self, platform, test_raw_output): def test_gpu_stream_result_parsing_cuda(self): """Test gpu-stream benchmark result parsing, CUDA case.""" self._test_gpu_stream_result_parsing(Platform.CUDA) + + @decorator.load_data('tests/data/gpu_stream_float.log') + def _test_gpu_stream_result_parsing_float(self, platform, test_raw_output): + """Test gpu-stream benchmark result parsing for float data type.""" + benchmark_name = 'gpu-stream' + (benchmark_class, + predefine_params) = BenchmarkRegistry._BenchmarkRegistry__select_benchmark(benchmark_name, platform) + assert (benchmark_class) + benchmark = benchmark_class(benchmark_name, parameters='--data_type float') + assert (benchmark) + ret = benchmark._preprocess() + assert (ret is True) + assert (benchmark.return_code == ReturnCode.SUCCESS) + + # Positive case - valid raw output with float tags. + assert (benchmark._process_raw_result(0, test_raw_output)) + assert (benchmark.return_code == ReturnCode.SUCCESS) + + assert (1 == len(benchmark.raw_data)) + test_raw_output_dict = { + x.split()[0]: [float(x.split()[1]), float(x.split()[2])] + for x in test_raw_output.strip().splitlines() if x.startswith('STREAM_') + } + assert (len(test_raw_output_dict) * 2 + benchmark.default_metric_count == len(benchmark.result)) + for output_key in benchmark.result: + if output_key == 'return_code': + assert (benchmark.result[output_key] == [0]) + else: + assert (len(benchmark.result[output_key]) == 1) + assert (isinstance(benchmark.result[output_key][0], numbers.Number)) + if output_key.endswith('_bw'): + assert (output_key.strip('_bw') in test_raw_output_dict) + assert (test_raw_output_dict[output_key.strip('_bw')][0] == benchmark.result[output_key][0]) + else: + assert (output_key.strip('_ratio') in test_raw_output_dict) + assert (test_raw_output_dict[output_key.strip('_ratio')][1] == benchmark.result[output_key][0]) + + @decorator.cuda_test + def test_gpu_stream_result_parsing_cuda_float(self): + """Test gpu-stream benchmark result parsing for float, CUDA case.""" + self._test_gpu_stream_result_parsing_float(Platform.CUDA) diff --git a/tests/data/gpu_stream.log b/tests/data/gpu_stream.log index c3d6f2390..a3dcf2b01 100644 --- a/tests/data/gpu_stream.log +++ b/tests/data/gpu_stream.log @@ -2,40 +2,23 @@ STREAM Benchmark Buffer size(bytes): 4294967296 Number of warm up runs: 10 Number of loops: 40 +Data type: double Check data: No Device 0: "NVIDIA Graphics Device" 152 SMs(10.0) Memory: 4000MHz x 8192-bit = 8192 GB/s PEAK ECC is ON -STREAM_COPY_double_gpu_0_buffer_4294967296_block_128 6711.67 81.93 -STREAM_COPY_double_gpu_0_buffer_4294967296_block_256 6549.50 79.95 -STREAM_COPY_double_gpu_0_buffer_4294967296_block_512 6195.43 75.63 -STREAM_COPY_double_gpu_0_buffer_4294967296_block_1024 5721.52 69.84 -STREAM_SCALE_double_gpu_0_buffer_4294967296_block_128 6680.42 81.55 -STREAM_SCALE_double_gpu_0_buffer_4294967296_block_256 6515.51 79.54 -STREAM_SCALE_double_gpu_0_buffer_4294967296_block_512 6106.69 74.54 -STREAM_SCALE_double_gpu_0_buffer_4294967296_block_1024 5626.68 68.69 -STREAM_ADD_double_gpu_0_buffer_4294967296_block_128 7379.25 90.08 -STREAM_ADD_double_gpu_0_buffer_4294967296_block_256 7407.27 90.42 -STREAM_ADD_double_gpu_0_buffer_4294967296_block_512 7309.59 89.23 -STREAM_ADD_double_gpu_0_buffer_4294967296_block_1024 6788.64 82.87 -STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_128 7378.19 90.07 -STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_256 7414.01 90.50 -STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_512 7295.50 89.06 -STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_1024 6730.42 82.16 - -Device 1: "NVIDIA Graphics Device" 152 SMs(10.0) Memory: 4000.00MHz x 8192-bit = 8192.00 GB/s PEAK ECC is ON -STREAM_COPY_double_gpu_1_buffer_4294967296_block_128 6708.74 81.89 -STREAM_COPY_double_gpu_1_buffer_4294967296_block_256 6549.47 79.95 -STREAM_COPY_double_gpu_1_buffer_4294967296_block_512 6195.39 75.63 -STREAM_COPY_double_gpu_1_buffer_4294967296_block_1024 5725.07 69.89 -STREAM_SCALE_double_gpu_1_buffer_4294967296_block_128 6678.56 81.53 -STREAM_SCALE_double_gpu_1_buffer_4294967296_block_256 6514.05 79.52 -STREAM_SCALE_double_gpu_1_buffer_4294967296_block_512 6103.80 74.51 -STREAM_SCALE_double_gpu_1_buffer_4294967296_block_1024 5630.41 68.73 -STREAM_ADD_double_gpu_1_buffer_4294967296_block_128 7377.74 90.06 -STREAM_ADD_double_gpu_1_buffer_4294967296_block_256 7410.97 90.47 -STREAM_ADD_double_gpu_1_buffer_4294967296_block_512 7310.80 89.24 -STREAM_ADD_double_gpu_1_buffer_4294967296_block_1024 6789.91 82.88 -STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_128 7379.03 90.08 -STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_256 7414.04 90.50 -STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_512 7298.26 89.09 -STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_1024 6732.15 82.18 \ No newline at end of file +STREAM_COPY_double_buffer_4294967296_block_128 6711.67 81.93 +STREAM_COPY_double_buffer_4294967296_block_256 6549.50 79.95 +STREAM_COPY_double_buffer_4294967296_block_512 6195.43 75.63 +STREAM_COPY_double_buffer_4294967296_block_1024 5721.52 69.84 +STREAM_SCALE_double_buffer_4294967296_block_128 6680.42 81.55 +STREAM_SCALE_double_buffer_4294967296_block_256 6515.51 79.54 +STREAM_SCALE_double_buffer_4294967296_block_512 6106.69 74.54 +STREAM_SCALE_double_buffer_4294967296_block_1024 5626.68 68.69 +STREAM_ADD_double_buffer_4294967296_block_128 7379.25 90.08 +STREAM_ADD_double_buffer_4294967296_block_256 7407.27 90.42 +STREAM_ADD_double_buffer_4294967296_block_512 7309.59 89.23 +STREAM_ADD_double_buffer_4294967296_block_1024 6788.64 82.87 +STREAM_TRIAD_double_buffer_4294967296_block_128 7378.19 90.07 +STREAM_TRIAD_double_buffer_4294967296_block_256 7414.01 90.50 +STREAM_TRIAD_double_buffer_4294967296_block_512 7295.50 89.06 +STREAM_TRIAD_double_buffer_4294967296_block_1024 6730.42 82.16 \ No newline at end of file diff --git a/tests/data/gpu_stream_float.log b/tests/data/gpu_stream_float.log new file mode 100644 index 000000000..d1ab3ad1e --- /dev/null +++ b/tests/data/gpu_stream_float.log @@ -0,0 +1,24 @@ +STREAM Benchmark +Buffer size(bytes): 4294967296 +Number of warm up runs: 10 +Number of loops: 40 +Data type: float +Check data: No + +Device 0: "NVIDIA Graphics Device" 152 SMs(10.0) Memory: 4000MHz x 8192-bit = 8192 GB/s PEAK ECC is ON +STREAM_COPY_float_buffer_4294967296_block_128 6823.45 83.30 +STREAM_COPY_float_buffer_4294967296_block_256 6650.12 81.18 +STREAM_COPY_float_buffer_4294967296_block_512 6301.88 76.93 +STREAM_COPY_float_buffer_4294967296_block_1024 5812.34 70.95 +STREAM_SCALE_float_buffer_4294967296_block_128 6790.11 82.89 +STREAM_SCALE_float_buffer_4294967296_block_256 6620.33 80.81 +STREAM_SCALE_float_buffer_4294967296_block_512 6210.45 75.81 +STREAM_SCALE_float_buffer_4294967296_block_1024 5718.90 69.81 +STREAM_ADD_float_buffer_4294967296_block_128 7490.22 91.43 +STREAM_ADD_float_buffer_4294967296_block_256 7512.10 91.70 +STREAM_ADD_float_buffer_4294967296_block_512 7405.67 90.40 +STREAM_ADD_float_buffer_4294967296_block_1024 6890.33 84.11 +STREAM_TRIAD_float_buffer_4294967296_block_128 7485.55 91.38 +STREAM_TRIAD_float_buffer_4294967296_block_256 7520.88 91.81 +STREAM_TRIAD_float_buffer_4294967296_block_512 7390.12 90.21 +STREAM_TRIAD_float_buffer_4294967296_block_1024 6825.11 83.32