diff --git a/docs/user-tutorial/benchmarks/micro-benchmarks.md b/docs/user-tutorial/benchmarks/micro-benchmarks.md
index aa3aa965b..7a259eb2b 100644
--- a/docs/user-tutorial/benchmarks/micro-benchmarks.md
+++ b/docs/user-tutorial/benchmarks/micro-benchmarks.md
@@ -267,20 +267,22 @@ For measurements of peer-to-peer communication performance between AMD GPUs, GPU
 
 #### Introduction
 
-Measure the memory bandwidth of GPU using the STREAM benchmark. The benchmark tests various memory operations including copy, scale, add, and triad for double datatype.
+Measure the memory bandwidth of GPU using the STREAM benchmark. The benchmark tests various memory operations including copy, scale, add, and triad for double and float datatypes.
+
+__Note__: When `--check_data` is enabled, each process allocates 2× `--size` bytes of host memory for validation buffers (e.g. 8 GiB with the default 4 GiB `--size`). Under `default_local_mode` with 8 GPUs this totals ~64 GiB of host RAM. Recommend using a small `--size` such as `1048576` (1 MiB) when `--check_data` is enabled.
 
 #### Metrics
 
 | Metric Name                                                | Unit             | Description                                                                                                                             |
 |------------------------------------------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
-| STREAM\_COPY\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw  | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the copy operation with specified buffer size and block size.                         |
-| STREAM\_SCALE\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the scale operation with specified buffer size and block size.                         |
-| STREAM\_ADD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw   | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the add operation with specified buffer size and block size.                         |
-| STREAM\_TRIAD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The fp64 memory bandwidth of the GPU for the triad operation with specified buffer size and block size.                         |
-| STREAM\_COPY\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio  | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the copy operation with specified buffer size and block size.                         |
-| STREAM\_SCALE\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the scale operation with specified buffer size and block size.                         |
-| STREAM\_ADD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio   | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the add operation with specified buffer size and block size.                         |
-| STREAM\_TRIAD\_double\_gpu\_[0-9]\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The fp64 memory bandwidth efficiency of the GPU for the triad operation with specified buffer size and block size.                         |
+| STREAM\_COPY\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw  | bandwidth (GB/s) | The memory bandwidth of the GPU for the copy operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size.                         |
+| STREAM\_SCALE\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The memory bandwidth of the GPU for the scale operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size.                         |
+| STREAM\_ADD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw   | bandwidth (GB/s) | The memory bandwidth of the GPU for the add operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size.                         |
+| STREAM\_TRIAD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_bw | bandwidth (GB/s) | The memory bandwidth of the GPU for the triad operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size.                         |
+| STREAM\_COPY\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio  | Efficiency (%) | The memory bandwidth efficiency of the GPU for the copy operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size.                         |
+| STREAM\_SCALE\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The memory bandwidth efficiency of the GPU for the scale operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size.                         |
+| STREAM\_ADD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio   | Efficiency (%) | The memory bandwidth efficiency of the GPU for the add operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size.                         |
+| STREAM\_TRIAD\_(double\|float)\_buffer\_[0-9]+\_block\_[0-9]+\_ratio | Efficiency (%) | The memory bandwidth efficiency of the GPU for the triad operation with the selected data type (double for fp64, float for fp32), for the specified buffer size and block size.                         |
 
 ### `ib-loopback`
 
diff --git a/examples/benchmarks/gpu_stream.py b/examples/benchmarks/gpu_stream.py
index 88c789efb..1aa67b15d 100644
--- a/examples/benchmarks/gpu_stream.py
+++ b/examples/benchmarks/gpu_stream.py
@@ -12,7 +12,7 @@
 
 if __name__ == '__main__':
     context = BenchmarkRegistry.create_benchmark_context(
-        'gpu-stream', platform=Platform.CUDA, parameters='--num_warm_up 1 --num_loops 10'
+        'gpu-stream', platform=Platform.CUDA, parameters='--num_warm_up 1 --num_loops 10 --data_type double'
     )
     # For ROCm environment, please specify the benchmark name and the platform as the following.
     # context = BenchmarkRegistry.create_benchmark_context(
diff --git a/superbench/benchmarks/micro_benchmarks/gpu_stream.py b/superbench/benchmarks/micro_benchmarks/gpu_stream.py
index 2e82262f3..2a3d38421 100644
--- a/superbench/benchmarks/micro_benchmarks/gpu_stream.py
+++ b/superbench/benchmarks/micro_benchmarks/gpu_stream.py
@@ -51,10 +51,21 @@ def add_parser_arguments(self):
             help='Number of data buffer copies performed.',
         )
 
+        self._parser.add_argument(
+            '--data_type',
+            type=str,
+            default='double',
+            choices=['float', 'double'],
+            required=False,
+            help='Data type of the buffer elements.',
+        )
+
         self._parser.add_argument(
             '--check_data',
             action='store_true',
-            help='Enable data checking',
+            help='Enable data checking. Note: allocates 2x --size bytes of host memory per process '
+            'for validation buffers (e.g. 8 GiB with default 4 GiB --size). '
+            'Recommend using a small --size such as 1048576 (1 MiB) when this flag is enabled.',
         )
 
     def _preprocess(self):
@@ -68,8 +79,8 @@ def _preprocess(self):
 
         self.__bin_path = os.path.join(self._args.bin_dir, self._bin_name)
 
-        args = '--size %d --num_warm_up %d --num_loops %d ' % (
-            self._args.size, self._args.num_warm_up, self._args.num_loops
+        args = '--size %d --num_warm_up %d --num_loops %d --data_type %s' % (
+            self._args.size, self._args.num_warm_up, self._args.num_loops, self._args.data_type
         )
 
         if self._args.check_data:
diff --git a/superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt b/superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt
index ba3d2750b..342c11623 100644
--- a/superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt
+++ b/superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt
@@ -29,7 +29,7 @@ message(STATUS "Found CUDA: " ${CUDAToolkit_VERSION})
 
 # Source files
 set(SOURCES
-    gpu_stream_test.cpp
+    gpu_stream_main.cpp
     gpu_stream_utils.cpp
     gpu_stream.cu
     gpu_stream_kernels.cu
@@ -38,6 +38,7 @@ set(SOURCES
 include(../cuda_common.cmake)
 add_executable(gpu_stream ${SOURCES})
 set_property(TARGET gpu_stream PROPERTY CUDA_ARCHITECTURES ${NVCC_ARCHS_SUPPORTED})
+target_compile_definitions(gpu_stream PRIVATE _GNU_SOURCE)
 target_include_directories(gpu_stream PRIVATE ${CUDAToolkit_INCLUDE_DIRS})
 target_link_libraries(gpu_stream numa ${NVML_LIBRARY})
 
diff --git a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu
index 617b8338a..a8791cc7b 100644
--- a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu
+++ b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu
@@ -10,6 +10,7 @@
 #include <cassert>
 #include <iostream>
 #include <nvml.h>
+#include <sched.h>
 
 /**
  * @brief Destroys the CUDA events used for benchmarking.
@@ -235,15 +236,25 @@ template <typename T> int GpuStream::PrepareBufAndStream(std::unique_ptr<BenchAr
     cudaError_t cuda_err = cudaSuccess;
 
     if (args->check_data) {
-        // Generate data to copy
-        args->sub.data_buf = static_cast<T *>(numa_alloc_onnode(args->size * sizeof(T), args->numa_id));
+        // Generate data to copy - use local NUMA node for best CPU access
+        args->sub.data_buf = static_cast<T *>(numa_alloc_local(args->size));
+        if (args->sub.data_buf == nullptr) {
+            std::cerr << "PrepareBufAndStream::numa_alloc_local data_buf failed" << std::endl;
+            return -1;
+        }
 
-        for (int j = 0; j < args->size / sizeof(T); j++) {
+        for (uint64_t j = 0; j < args->size / sizeof(T); j++) {
             args->sub.data_buf[j] = static_cast<T>(j % kUInt8Mod);
         }
 
-        // Allocate check buffer
-        args->sub.check_buf = static_cast<T *>(numa_alloc_onnode(args->size * sizeof(T), args->numa_id));
+        // Allocate check buffer on local NUMA node
+        args->sub.check_buf = static_cast<T *>(numa_alloc_local(args->size));
+        if (args->sub.check_buf == nullptr) {
+            std::cerr << "PrepareBufAndStream::numa_alloc_local check_buf failed" << std::endl;
+            numa_free(args->sub.data_buf, args->size);
+            args->sub.data_buf = nullptr;
+            return -1;
+        }
     }
 
     // Allocate buffers
@@ -257,7 +268,7 @@ template <typename T> int GpuStream::PrepareBufAndStream(std::unique_ptr<BenchAr
     // Allocate buffers
     for (auto &buf_ptr : args->sub.gpu_buf_ptrs) {
         T *raw_ptr = nullptr;
-        cuda_err = GpuMallocDataBuf(&raw_ptr, args->size * sizeof(T));
+        cuda_err = GpuMallocDataBuf(&raw_ptr, args->size);
         if (cuda_err != cudaSuccess) {
             std::cerr << "PrepareBufAndStream::cudaMalloc error: " << cuda_err << std::endl;
             return -1;
@@ -350,7 +361,12 @@ template <typename T> int GpuStream::CheckBuf(std::unique_ptr<BenchArgs<T>> &arg
         return -1;
     }
 
-    // Validate result by comparing the data buffer and check buffer
+    // Validate result by comparing the data buffer and check buffer.
+    // NOTE: memcmp is exact (byte-for-byte). This works because the current test values
+    // (j % 256, scalar = 11.0) are exactly representable in both float and double IEEE-754.
+    // If kUInt8Mod or scalar are changed to values that cause rounding differences between
+    // host (two separate ops) and GPU (FMA), this check will need a tolerance-based comparison
+    // for T = float.
     memcmp_result = memcmp(args->sub.validation_buf_ptrs[kernel_idx].data(), args->sub.check_buf, args->size);
     if (memcmp_result) {
         std::cerr << "CheckBuf::Memory check failed for kernel index " << kernel_idx << std::endl;
@@ -419,11 +435,19 @@ int GpuStream::RunStreamKernel(std::unique_ptr<BenchArgs<T>> &args, Kernel kerne
     uint64_t num_thread_blocks;
     int size_factor = 2;
 
+    if (num_threads_per_block == 0) {
+        std::cerr << "RunStreamKernel::num_threads_per_block must be > 0" << std::endl;
+        return -1;
+    }
+
     // Validate data size
-    uint64_t num_elements_in_thread_block = kNumLoopUnroll * num_threads_per_block;
-    uint64_t num_bytes_in_thread_block = num_elements_in_thread_block * sizeof(T);
+    // Each thread processes one VecT<T> element (128 bits / 16 bytes) for optimal memory bandwidth.
+    // Derived from VecT<T> so any vector type change is caught at compile time.
+    constexpr uint64_t kBytesPerThread = sizeof(VecT<T>);
+    static_assert(kBytesPerThread == 16, "Vector type must be 128-bit aligned for current PTX");
+    uint64_t num_bytes_in_thread_block = num_threads_per_block * kBytesPerThread;
     if (args->size % num_bytes_in_thread_block) {
-        std::cerr << "RunCopy: Data size should be multiple of " << num_bytes_in_thread_block << std::endl;
+        std::cerr << "RunStreamKernel: Data size should be multiple of " << num_bytes_in_thread_block << std::endl;
         return -1;
     }
     num_thread_blocks = args->size / num_bytes_in_thread_block;
@@ -435,7 +459,7 @@ int GpuStream::RunStreamKernel(std::unique_ptr<BenchArgs<T>> &args, Kernel kerne
     }
 
     // Launch jobs and collect running time
-    for (int i = 0; i < args->num_loops + args->num_warm_up; i++) {
+    for (uint64_t i = 0; i < args->num_loops + args->num_warm_up; i++) {
 
         // Record start event once warm up iterations are done
         if (i == args->num_warm_up) {
@@ -448,30 +472,30 @@ int GpuStream::RunStreamKernel(std::unique_ptr<BenchArgs<T>> &args, Kernel kerne
 
         switch (kernel) {
         case Kernel::kCopy:
-            CopyKernel<<<num_thread_blocks, num_threads_per_block, 0, args->sub.stream>>>(
-                reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[2].get()),
-                reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[0].get()));
+            CopyKernel<T><<<num_thread_blocks, num_threads_per_block, 0, args->sub.stream>>>(
+                reinterpret_cast<VecT<T> *>(args->sub.gpu_buf_ptrs[2].get()),
+                reinterpret_cast<const VecT<T> *>(args->sub.gpu_buf_ptrs[0].get()));
             args->sub.kernel_name = "COPY";
             break;
         case Kernel::kScale:
-            ScaleKernel<<<num_thread_blocks, num_threads_per_block, 0, args->sub.stream>>>(
-                reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[2].get()),
-                reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[0].get()), static_cast<T>(scalar));
+            ScaleKernel<T><<<num_thread_blocks, num_threads_per_block, 0, args->sub.stream>>>(
+                reinterpret_cast<VecT<T> *>(args->sub.gpu_buf_ptrs[2].get()),
+                reinterpret_cast<const VecT<T> *>(args->sub.gpu_buf_ptrs[0].get()), static_cast<T>(scalar));
             args->sub.kernel_name = "SCALE";
             break;
         case Kernel::kAdd:
-            AddKernel<<<num_thread_blocks, num_threads_per_block, 0, args->sub.stream>>>(
-                reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[2].get()),
-                reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[0].get()),
-                reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[1].get()));
+            AddKernel<T><<<num_thread_blocks, num_threads_per_block, 0, args->sub.stream>>>(
+                reinterpret_cast<VecT<T> *>(args->sub.gpu_buf_ptrs[2].get()),
+                reinterpret_cast<const VecT<T> *>(args->sub.gpu_buf_ptrs[0].get()),
+                reinterpret_cast<const VecT<T> *>(args->sub.gpu_buf_ptrs[1].get()));
             size_factor = 3;
             args->sub.kernel_name = "ADD";
             break;
         case Kernel::kTriad:
-            TriadKernel<<<num_thread_blocks, num_threads_per_block, 0, args->sub.stream>>>(
-                reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[2].get()),
-                reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[0].get()),
-                reinterpret_cast<T *>(args->sub.gpu_buf_ptrs[1].get()), static_cast<T>(scalar));
+            TriadKernel<T><<<num_thread_blocks, num_threads_per_block, 0, args->sub.stream>>>(
+                reinterpret_cast<VecT<T> *>(args->sub.gpu_buf_ptrs[2].get()),
+                reinterpret_cast<const VecT<T> *>(args->sub.gpu_buf_ptrs[0].get()),
+                reinterpret_cast<const VecT<T> *>(args->sub.gpu_buf_ptrs[1].get()), static_cast<T>(scalar));
             size_factor = 3;
             args->sub.kernel_name = "TRIAD";
             break;
@@ -583,11 +607,10 @@ int GpuStream::RunStream(std::unique_ptr<BenchArgs<T>> &args, const std::string
 
     // output formatted results to stdout
     // Tags are of format:
-    // STREAM_<Kernelname>_datatype_gpu_<gpu_id>_buffer_<buffer_size>_block_<block_size>
-    for (int i = 0; i < args->sub.times_in_ms.size(); i++) {
-        std::string tag = "STREAM_" + KernelToString(i) + "_" + data_type + "_gpu_" + std::to_string(args->gpu_id) +
-                          "_buffer_" + std::to_string(args->size);
-        for (int j = 0; j < args->sub.times_in_ms[i].size(); j++) {
+    // STREAM_<Kernelname>_datatype_buffer_<buffer_size>_block_<block_size>
+    for (size_t i = 0; i < args->sub.times_in_ms.size(); i++) {
+        std::string tag = "STREAM_" + KernelToString(i) + "_" + data_type + "_buffer_" + std::to_string(args->size);
+        for (size_t j = 0; j < args->sub.times_in_ms[i].size(); j++) {
             // Calculate and display bandwidth
             double bw = args->size * args->num_loops / args->sub.times_in_ms[i][j] / 1e6;
             std::cout << tag << "_block_" << kThreadsPerBlock[j] << "\t" << bw << "\t";
@@ -605,12 +628,33 @@ int GpuStream::RunStream(std::unique_ptr<BenchArgs<T>> &args, const std::string
     return ret;
 }
 
+/**
+ * @brief Creates and initializes a BenchArgs for the given type and adds it to bench_args_.
+ *
+ * @tparam T The data type (float or double) for the benchmark arguments.
+ */
+template <typename T> int GpuStream::CreateBenchArgs() {
+    auto args = std::make_unique<BenchArgs<T>>();
+    args->gpu_id = 0;
+    cudaError_t cuda_err = cudaGetDeviceProperties(&args->gpu_device_prop, 0);
+    if (cuda_err != cudaSuccess) {
+        std::cerr << "CreateBenchArgs::cudaGetDeviceProperties error: " << cuda_err << std::endl;
+        return -1;
+    }
+    args->num_warm_up = opts_.num_warm_up;
+    args->num_loops = opts_.num_loops;
+    args->size = opts_.size;
+    args->check_data = opts_.check_data;
+    bench_args_ = std::move(args);
+    return 0;
+}
+
 /**
  * @brief Runs the Stream benchmark.
  *
- * @details This function processes the input args, validates and composes the BenchArgs structure for the
- availavble
- * GPUs, and runs the benchmark.
+ * @details This function processes the input args, validates and composes the BenchArgs structure for
+ * the first visible GPU (CUDA device 0). When running under Superbench's default_local_mode,
+ * CUDA_VISIBLE_DEVICES is set per process, so device 0 maps to the assigned physical GPU.
  *
  * @return int The status code indicating success or failure of the benchmark execution.
  * */
@@ -631,71 +675,90 @@ int GpuStream::Run() {
         return ret;
     }
 
-    // find all GPUs and compose the Benchmarking data structure
-    for (int j = 0; j < gpu_count; j++) {
-        auto args = std::make_unique<BenchArgs<double>>();
-        args->numa_id = 0;
-        args->gpu_id = j;
-        cudaGetDeviceProperties(&args->gpu_device_prop, j);
-
-        args->num_warm_up = opts_.num_warm_up;
-        args->num_loops = opts_.num_loops;
-        args->size = opts_.size;
-        args->check_data = opts_.check_data;
-        args->numa_id = 0;
-        args->gpu_id = j;
-
-        // add data to vector
-        bench_args_.emplace_back(std::move(args));
-    }
-
-    bool has_error = false;
-    // Run the benchmark for all the configured data
-    for (auto &variant_args : bench_args_) {
-        std::visit(
-            [&](auto &curr_args) {
-                // Get memory clock rate once for both bandwidth computation and display
-                float memory_clock_mhz = GetMemoryClockRate(curr_args->gpu_id, curr_args->gpu_device_prop);
-
-                // Compute theoretical bandwidth using the memory clock rate
-                float peak_bw = -1.0f;
-                if (memory_clock_mhz > 0.0f) {
-                    // Calculate theoretical bandwidth: memory_clock_mhz * bus_width_bytes * 2 (DDR) / 1000 (convert to
-                    // GB/s)
-                    peak_bw = memory_clock_mhz * (curr_args->gpu_device_prop.memoryBusWidth / 8) * 2 / 1000.0;
-                }
-
-                // Print device info with both the memory clock and peak bandwidth
-                PrintCudaDeviceInfo(curr_args->gpu_id, curr_args->gpu_device_prop, memory_clock_mhz, peak_bw);
-
-                // Set the NUMA node
-                ret = numa_run_on_node(curr_args->numa_id);
-                if (ret != 0) {
-                    std::cerr << "Run::numa_run_on_node error: " << errno << std::endl;
-                    has_error = true;
-                    return;
-                }
-
-                // Run the stream benchmark for the configured data, passing the peak bandwidth
-                if constexpr (std::is_same_v<std::decay_t<decltype(*curr_args)>, BenchArgs<float>>) {
-                    ret = RunStream<float>(curr_args, "float", peak_bw);
-                } else if constexpr (std::is_same_v<std::decay_t<decltype(*curr_args)>, BenchArgs<double>>) {
-                    ret = RunStream<double>(curr_args, "double", peak_bw);
-                } else {
-                    std::cerr << "Run::Unknown type error" << std::endl;
-                    has_error = true;
-                    return;
-                }
-
-                if (ret != 0) {
-                    std::cerr << "Run::RunStream error: " << errno << std::endl;
-                    has_error = true;
-                }
-            },
-            variant_args);
-    }
-    if (has_error) {
+    if (gpu_count < 1) {
+        std::cerr << "Run::No GPU available" << std::endl;
+        return -1;
+    }
+
+    // Run on CUDA device 0 (the visible GPU assigned by CUDA_VISIBLE_DEVICES).
+    if (opts_.data_type == "float") {
+        ret = CreateBenchArgs<float>();
+    } else if (opts_.data_type == "double") {
+        ret = CreateBenchArgs<double>();
+    } else {
+        std::cerr << "Run::Invalid data_type: " << opts_.data_type << std::endl;
+        return -1;
+    }
+    if (ret != 0) {
+        return ret;
+    }
+
+    // Pin the thread to the GPU's NUMA node for optimal host↔device bandwidth.
+    // Query GPU 0's preferred CPU NUMA node via NVML; fall back to the process's
+    // current node if the NVML query fails (e.g. NUMA disabled, older driver).
+    int target_node = -1;
+    {
+        nvmlDevice_t nvml_dev;
+        unsigned int gpu_numa_node = 0;
+        if (nvmlInit() == NVML_SUCCESS) {
+            if (nvmlDeviceGetHandleByIndex(0, &nvml_dev) == NVML_SUCCESS &&
+                nvmlDeviceGetNumaNodeId(nvml_dev, &gpu_numa_node) == NVML_SUCCESS) {
+                target_node = static_cast<int>(gpu_numa_node);
+            }
+            nvmlShutdown();
+        }
+    }
+    if (target_node < 0) {
+        // Fallback: use the node where this process is currently scheduled
+        int cpu = sched_getcpu();
+        if (cpu < 0) {
+            std::cerr << "Run::sched_getcpu failed" << std::endl;
+            return -1;
+        }
+        target_node = numa_node_of_cpu(cpu);
+        if (target_node < 0) {
+            std::cerr << "Run::numa_node_of_cpu failed for cpu " << cpu << std::endl;
+            return -1;
+        }
+    }
+    if (numa_run_on_node(target_node) != 0) {
+        std::cerr << "Run::numa_run_on_node failed for node " << target_node << std::endl;
         return -1;
     }
+
+    // Run the benchmark for the configured data
+    std::visit(
+        [&](auto &curr_args) {
+            // Get memory clock rate once for both bandwidth computation and display
+            float memory_clock_mhz = GetMemoryClockRate(curr_args->gpu_id, curr_args->gpu_device_prop);
+
+            // Compute theoretical bandwidth using the memory clock rate
+            float peak_bw = -1.0f;
+            if (memory_clock_mhz > 0.0f) {
+                // Calculate theoretical bandwidth: memory_clock_mhz * bus_width_bytes * 2 (DDR) / 1000 (convert to
+                // GB/s)
+                peak_bw = memory_clock_mhz * (curr_args->gpu_device_prop.memoryBusWidth / 8) * 2 / 1000.0;
+            }
+
+            // Print device info with both the memory clock and peak bandwidth
+            PrintCudaDeviceInfo(curr_args->gpu_id, curr_args->gpu_device_prop, memory_clock_mhz, peak_bw);
+
+            // Run the stream benchmark for the configured data, passing the peak bandwidth
+            if constexpr (std::is_same_v<std::decay_t<decltype(*curr_args)>, BenchArgs<float>>) {
+                ret = RunStream<float>(curr_args, "float", peak_bw);
+            } else if constexpr (std::is_same_v<std::decay_t<decltype(*curr_args)>, BenchArgs<double>>) {
+                ret = RunStream<double>(curr_args, "double", peak_bw);
+            } else {
+                std::cerr << "Run::Unknown type error" << std::endl;
+                ret = -1;
+                return;
+            }
+
+            if (ret != 0) {
+                std::cerr << "Run::RunStream error: " << errno << std::endl;
+            }
+        },
+        bench_args_);
+
     return ret;
 }
\ No newline at end of file
diff --git a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.hpp b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.hpp
index 473a78839..65502a243 100644
--- a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.hpp
+++ b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.hpp
@@ -34,8 +34,8 @@ class GpuStream {
     int Run();
 
   private:
-    using BenchArgsVariant = std::variant<std::unique_ptr<BenchArgs<double>>>;
-    std::vector<BenchArgsVariant> bench_args_;
+    using BenchArgsVariant = std::variant<std::unique_ptr<BenchArgs<float>>, std::unique_ptr<BenchArgs<double>>>;
+    BenchArgsVariant bench_args_;
     Opts opts_;
 
     // Memory management functions
@@ -56,6 +56,7 @@ class GpuStream {
     template <typename T> int RunStream(std::unique_ptr<BenchArgs<T>> &, const std::string &data_type, float peak_bw);
 
     // Helper functions
+    template <typename T> int CreateBenchArgs();
     int GetGpuCount(int *);
     int SetGpu(int gpu_id);
     float GetMemoryClockRate(int device_id, const cudaDeviceProp &prop);
diff --git a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu
index 548fc8ba3..e40237b83 100644
--- a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu
+++ b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu
@@ -1,155 +1,33 @@
 // Copyright (c) Microsoft Corporation.
 // Licensed under the MIT License.
 
-#include "gpu_stream_kernels.hpp"
-
-/**
- * @brief Fetches a value from source memory and writes it to a register.
- *
- * @details This inline device function fetches a value from the specified source memory
- * location and writes it to the provided register. The implementation references the following:
- * 1) NCCL:
- * https://github.com/NVIDIA/nccl/blob/7e515921295adaab72adf56ea71a0fafb0ecb5f3/src/collectives/device/common_kernel.h#L483
- * 2) RCCL:
- * https://github.com/ROCmSoftwarePlatform/rccl/blob/5c8380ff5b5925cae4bce00b1879a5f930226e8d/src/collectives/device/common_kernel.h#L268
- *
- * @tparam T The type of the value to fetch.
- * @param[out] v The register to write the fetched value to.
- * @param[in] p The source memory location to fetch the value from.
- */
-template <typename T> inline __device__ void Fetch(T &v, const T *p) {
-#if defined(__HIP_PLATFORM_HCC__) || defined(__HCC__) || defined(__HIPCC__)
-    v = *p;
-#else
-    if constexpr (std::is_same<T, float>::value) {
-        asm volatile("ld.volatile.global.f32 %0, [%1];" : "=f"(v) : "l"(p) : "memory");
-    } else if constexpr (std::is_same<T, double>::value) {
-        asm volatile("ld.volatile.global.f64 %0, [%1];" : "=d"(v) : "l"(p) : "memory");
-    }
-#endif
-}
-
 /**
- * @brief Stores a value from register and writes it to target memory.
+ * @file gpu_stream_kernels.cu
+ * @brief CUDA kernel compilation unit for GPU stream benchmark.
  *
- * @details This inline device function stores a value from the provided register
- * and writes it to the specified target memory location. The implementation references the following:
- * 1) NCCL:
- * https://github.com/NVIDIA/nccl/blob/7e515921295adaab72adf56ea71a0fafb0ecb5f3/src/collectives/device/common_kernel.h#L486
- * 2) RCCL:
- * https://github.com/ROCmSoftwarePlatform/rccl/blob/5c8380ff5b5925cae4bce00b1879a5f930226e8d/src/collectives/device/common_kernel.h#L276
+ * All template kernel implementations (CopyKernel, ScaleKernel, AddKernel, TriadKernel)
+ * are defined in gpu_stream_kernels.hpp rather than here. This is required because:
  *
- * @tparam T The type of the value to store.
- * @param[out] p The target memory location to write the value to.
- * @param[in] v The register containing the value to be stored.
- */
-template <typename T> inline __device__ void Store(T *p, const T &v) {
-#if defined(__HIP_PLATFORM_HCC__) || defined(__HCC__) || defined(__HIPCC__)
-    *p = v;
-#else
-    if constexpr (std::is_same<T, float>::value) {
-        asm volatile("st.volatile.global.f32 [%0], %1;" ::"l"(p), "f"(v) : "memory");
-    } else if constexpr (std::is_same<T, double>::value) {
-        asm volatile("st.volatile.global.f64 [%0], %1;" ::"l"(p), "d"(v) : "memory");
-    }
-#endif
-}
-
-/**
- * @brief Performs COPY, a simple copy operation from source to target. b = a
+ * 1. **C++ Template Instantiation Model**: Templates are not compiled until they are
+ *    instantiated with concrete types. The compiler needs to see the full template
+ *    definition (not just declaration) at the point of instantiation.
  *
- * @details This CUDA kernel performs a simple copy operation, copying data from the source array
- * to the target array. This is used to measure transfer rates without any arithmetic operations.
+ * 2. **Separate Compilation Units**: When gpu_stream.cu calls `CopyKernel<double><<<...>>>`,
+ *    nvcc needs the full kernel implementation visible in that translation unit.
+ *    If implementations were only in this .cu file, gpu_stream.cu would only see
+ *    declarations, causing "undefined reference" linker errors.
  *
- * @param[out] tgt The target array where data will be copied to.
- * @param[in] src The source array from which data will be copied.
- */
-__global__ void CopyKernel(double *tgt, const double *src) {
-    uint64_t index = blockIdx.x * blockDim.x * kNumLoopUnrollAlias + threadIdx.x;
-    double val[kNumLoopUnrollAlias];
-#pragma unroll
-    for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++)
-        Fetch(val[i], src + index + i * blockDim.x);
-#pragma unroll
-    for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++)
-        Store(tgt + index + i * blockDim.x, val[i]);
-}
-
-/**
- * @brief Performs SCALE, a scaling operation on the source data. b = x * a
- *
- * @details This CUDA kernel performs a simple arithmetic operation by scaling the source data
- * with a given scalar value and storing the result in the target array.
- *
- * @param[out] tgt The target array where the scaled data will be stored.
- * @param[in] src The source array containing the data to be scaled.
- * @param[in] scalar The scalar value used to scale the source data.
- */
-__global__ void ScaleKernel(double *tgt, const double *src, const double scalar) {
-    uint64_t index = blockIdx.x * blockDim.x * kNumLoopUnrollAlias + threadIdx.x;
-    double val[kNumLoopUnrollAlias];
-#pragma unroll
-    for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++)
-        Fetch(val[i], src + index + i * blockDim.x);
-#pragma unroll
-    for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
-        val[i] *= scalar;
-        Store(tgt + index + i * blockDim.x, val[i]);
-    }
-}
-
-/**
- * @brief Performs ADD, an addition operation on two source arrays. c = a + b
- *
- * @details This CUDA kernel adds corresponding elements from two source arrays and stores the result
- * in the target array. This operation is used to measure transfer rates with a simple arithmetic addition.
- *
- * @param[out] tgt The target array where the result of the addition will be stored.
- * @param[in] src_a The first source array containing the first set of operands.
- * @param[in] src_b The second source array containing the second set of operands.
- */
-__global__ void AddKernel(double *tgt, const double *src_a, const double *src_b) {
-    uint64_t index = blockIdx.x * blockDim.x * kNumLoopUnrollAlias + threadIdx.x;
-    double val_a[kNumLoopUnrollAlias];
-    double val_b[kNumLoopUnrollAlias];
-
-#pragma unroll
-    for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
-        Fetch(val_a[i], src_a + index + i * blockDim.x);
-        Fetch(val_b[i], src_b + index + i * blockDim.x);
-    }
-#pragma unroll
-    for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
-        val_a[i] += val_b[i];
-        Store(tgt + index + i * blockDim.x, val_a[i]);
-    }
-}
-
-/**
- * @brief Performs TRIAD, fused multiply/add operations on source arrays. a = b + x * c
+ * 3. **CUDA-Specific Consideration**: Unlike regular C++ where explicit template
+ *    instantiation in a .cpp file can work, CUDA kernel launches require the kernel
+ *    code to be visible to nvcc when compiling the caller. This is because nvcc
+ *    generates device code at compile time, not link time.
  *
- * @details This CUDA kernel performs a fused multiply/add operation by multiplying elements from
- * the second source array with a scalar value, adding the result to corresponding elements from
- * the first source array, and storing the result in the target array.
+ * 4. **Header Guards for Mixed Compilation**: The header uses `#ifdef __CUDACC__` to
+ *    protect CUDA-specific code (blockIdx, threadIdx, __global__, etc.) from g++
+ *    when the header is indirectly included by .cpp files (e.g., via gpu_stream.hpp).
  *
- * @param[out] tgt The target array where the result of the fused multiply/add operation will be stored.
- * @param[in] src_a The first source array containing the first set of operands.
- * @param[in] src_b The second source array containing the second set of operands to be multiplied by the scalar.
- * @param[in] scalar The scalar value used in the multiply/add operation.
+ * This file remains as the compilation unit that ensures the header is processed
+ * by nvcc, and can host any future non-template helper functions if needed.
  */
-__global__ void TriadKernel(double *tgt, const double *src_a, const double *src_b, const double scalar) {
-    uint64_t index = blockIdx.x * blockDim.x * kNumLoopUnrollAlias + threadIdx.x;
-    double val_a[kNumLoopUnrollAlias];
-    double val_b[kNumLoopUnrollAlias];
 
-#pragma unroll
-    for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
-        Fetch(val_a[i], src_a + index + i * blockDim.x);
-        Fetch(val_b[i], src_b + index + i * blockDim.x);
-    }
-#pragma unroll
-    for (uint64_t i = 0; i < kNumLoopUnrollAlias; i++) {
-        val_b[i] += (val_a[i] * scalar);
-        Store(tgt + index + i * blockDim.x, val_b[i]);
-    }
-}
\ No newline at end of file
+#include "gpu_stream_kernels.hpp"
\ No newline at end of file
diff --git a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp
index cfe9f2052..8702b1421 100644
--- a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp
+++ b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp
@@ -5,15 +5,190 @@
 
 #include <cuda.h>
 #include <cuda_runtime.h>
+#include <type_traits>
 
 #include "gpu_stream_utils.hpp"
-constexpr auto kNumLoopUnrollAlias = stream_config::kNumLoopUnroll;
 
-// Function declarations
-template <typename T> inline __device__ void Fetch(T &v, const T *p);
-template <typename T> inline __device__ void Store(T *p, const T &v);
+/**
+ * @brief Type trait mapping scalar types to their 128-bit aligned vector types.
+ *
+ * @details For optimal memory bandwidth, we use 128-bit (16 byte) vector loads/stores:
+ * - double -> double2 (2 x 64-bit = 128-bit)
+ * - float  -> float4  (4 x 32-bit = 128-bit)
+ */
+template <typename T> struct VectorType;
+template <> struct VectorType<double> { using type = double2; };
+template <> struct VectorType<float> { using type = float4; };
 
-__global__ void CopyKernel(double *, const double *);
-__global__ void ScaleKernel(double *, const double *, const double);
-__global__ void AddKernel(double *, const double *, const double *);
-__global__ void TriadKernel(double *, const double *, const double *, const double);
\ No newline at end of file
+template <typename T> using VecT = typename VectorType<T>::type;
+
+// Kernel declarations (visible to all compilers for function pointer usage)
+template <typename T> __global__ void CopyKernel(VecT<T> *tgt, const VecT<T> *src);
+template <typename T> __global__ void ScaleKernel(VecT<T> *tgt, const VecT<T> *src, const T scalar);
+template <typename T> __global__ void AddKernel(VecT<T> *tgt, const VecT<T> *src_a, const VecT<T> *src_b);
+template <typename T>
+__global__ void TriadKernel(VecT<T> *tgt, const VecT<T> *src_a, const VecT<T> *src_b, const T scalar);
+
+// Implementation section - only compiled by nvcc
+#ifdef __CUDACC__
+
+/**
+ * @brief Fetches a value from source memory and writes it to a register.
+ *
+ * @details This inline device function fetches a value from the specified source memory
+ * location and writes it to the provided register. The implementation references the following:
+ * 1) NCCL:
+ * https://github.com/NVIDIA/nccl/blob/7e515921295adaab72adf56ea71a0fafb0ecb5f3/src/collectives/device/common_kernel.h#L483
+ * 2) RCCL:
+ * https://github.com/ROCmSoftwarePlatform/rccl/blob/5c8380ff5b5925cae4bce00b1879a5f930226e8d/src/collectives/device/common_kernel.h#L268
+ *
+ * @tparam T The type of the value to fetch.
+ * @param[out] v The register to write the fetched value to.
+ * @param[in] p The source memory location to fetch the value from.
+ */
+template <typename T> inline __device__ void Fetch(T &v, const T *p) {
+#if defined(__HIP_PLATFORM_HCC__) || defined(__HCC__) || defined(__HIPCC__)
+    v = *p;
+#else
+    if constexpr (std::is_same<T, double2>::value) {
+        asm volatile("ld.volatile.global.v2.f64 {%0,%1}, [%2];" : "=d"(v.x), "=d"(v.y) : "l"(p) : "memory");
+    } else if constexpr (std::is_same<T, float4>::value) {
+        asm volatile("ld.volatile.global.v4.f32 {%0,%1,%2,%3}, [%4];"
+                     : "=f"(v.x), "=f"(v.y), "=f"(v.z), "=f"(v.w)
+                     : "l"(p)
+                     : "memory");
+    }
+#endif
+}
+
+/**
+ * @brief Stores a value from register and writes it to target memory.
+ *
+ * @details This inline device function stores a value from the provided register
+ * and writes it to the specified target memory location. The implementation references the following:
+ * 1) NCCL:
+ * https://github.com/NVIDIA/nccl/blob/7e515921295adaab72adf56ea71a0fafb0ecb5f3/src/collectives/device/common_kernel.h#L486
+ * 2) RCCL:
+ * https://github.com/ROCmSoftwarePlatform/rccl/blob/5c8380ff5b5925cae4bce00b1879a5f930226e8d/src/collectives/device/common_kernel.h#L276
+ *
+ * @tparam T The type of the value to store.
+ * @param[out] p The target memory location to write the value to.
+ * @param[in] v The register containing the value to be stored.
+ */
+template <typename T> inline __device__ void Store(T *p, const T &v) {
+#if defined(__HIP_PLATFORM_HCC__) || defined(__HCC__) || defined(__HIPCC__)
+    *p = v;
+#else
+    if constexpr (std::is_same<T, double2>::value) {
+        asm volatile("st.volatile.global.v2.f64 [%0], {%1,%2};" ::"l"(p), "d"(v.x), "d"(v.y) : "memory");
+    } else if constexpr (std::is_same<T, float4>::value) {
+        asm volatile("st.volatile.global.v4.f32 [%0], {%1,%2,%3,%4};" ::"l"(p), "f"(v.x), "f"(v.y), "f"(v.z), "f"(v.w)
+                     : "memory");
+    }
+#endif
+}
+
+/**
+ * @brief Performs COPY, a simple copy operation from source to target. b = a
+ *
+ * @details This CUDA kernel performs a simple copy operation, copying data from the source array
+ * to the target array. This is used to measure transfer rates without any arithmetic operations.
+ *
+ * @param[out] tgt The target array where data will be copied to (128-bit aligned).
+ * @param[in] src The source array from which data will be copied (128-bit aligned).
+ */
+template <typename T> __global__ void CopyKernel(VecT<T> *tgt, const VecT<T> *src) {
+    uint64_t index = blockIdx.x * blockDim.x + threadIdx.x;
+    VecT<T> val;
+    Fetch(val, src + index);
+    Store(tgt + index, val);
+}
+
+/**
+ * @brief Performs SCALE, a scaling operation on the source data. b = x * a
+ *
+ * @details This CUDA kernel performs a simple arithmetic operation by scaling the source data
+ * with a given scalar value and storing the result in the target array.
+ *
+ * @param[out] tgt The target array where the scaled data will be stored (128-bit aligned).
+ * @param[in] src The source array containing the data to be scaled (128-bit aligned).
+ * @param[in] scalar The scalar value used to scale the source data.
+ */
+template <typename T> __global__ void ScaleKernel(VecT<T> *tgt, const VecT<T> *src, const T scalar) {
+    uint64_t index = blockIdx.x * blockDim.x + threadIdx.x;
+    VecT<T> val;
+    Fetch(val, src + index);
+    if constexpr (std::is_same<T, double>::value) {
+        val.x *= scalar;
+        val.y *= scalar;
+    } else if constexpr (std::is_same<T, float>::value) {
+        val.x *= scalar;
+        val.y *= scalar;
+        val.z *= scalar;
+        val.w *= scalar;
+    }
+    Store(tgt + index, val);
+}
+
+/**
+ * @brief Performs ADD, an addition operation on two source arrays. c = a + b
+ *
+ * @details This CUDA kernel adds corresponding elements from two source arrays and stores the result
+ * in the target array. This operation is used to measure transfer rates with a simple arithmetic addition.
+ *
+ * @param[out] tgt The target array where the result of the addition will be stored (128-bit aligned).
+ * @param[in] src_a The first source array containing the first set of operands (128-bit aligned).
+ * @param[in] src_b The second source array containing the second set of operands (128-bit aligned).
+ */
+template <typename T> __global__ void AddKernel(VecT<T> *tgt, const VecT<T> *src_a, const VecT<T> *src_b) {
+    uint64_t index = blockIdx.x * blockDim.x + threadIdx.x;
+    VecT<T> val_a;
+    VecT<T> val_b;
+    Fetch(val_a, src_a + index);
+    Fetch(val_b, src_b + index);
+    if constexpr (std::is_same<T, double>::value) {
+        val_a.x += val_b.x;
+        val_a.y += val_b.y;
+    } else if constexpr (std::is_same<T, float>::value) {
+        val_a.x += val_b.x;
+        val_a.y += val_b.y;
+        val_a.z += val_b.z;
+        val_a.w += val_b.w;
+    }
+    Store(tgt + index, val_a);
+}
+
+/**
+ * @brief Performs TRIAD, fused multiply/add operations on source arrays. c = b + x * a
+ *
+ * @details This CUDA kernel performs a fused multiply/add operation by multiplying elements from
+ * the first source array with a scalar value, adding the result to corresponding elements from
+ * the second source array, and storing the result in the target array.
+ *
+ * @param[out] tgt The target array where the result of the fused multiply/add operation will be stored (128-bit
+ * aligned).
+ * @param[in] src_a The first source array containing the first set of operands to be multiplied by the scalar
+ * (128-bit aligned).
+ * @param[in] src_b The second source array containing the second set of operands (128-bit aligned).
+ * @param[in] scalar The scalar value used in the multiply/add operation.
+ */
+template <typename T>
+__global__ void TriadKernel(VecT<T> *tgt, const VecT<T> *src_a, const VecT<T> *src_b, const T scalar) {
+    uint64_t index = blockIdx.x * blockDim.x + threadIdx.x;
+    VecT<T> val_a;
+    VecT<T> val_b;
+    Fetch(val_a, src_a + index);
+    Fetch(val_b, src_b + index);
+    if constexpr (std::is_same<T, double>::value) {
+        val_b.x += (val_a.x * scalar);
+        val_b.y += (val_a.y * scalar);
+    } else if constexpr (std::is_same<T, float>::value) {
+        val_b.x += (val_a.x * scalar);
+        val_b.y += (val_a.y * scalar);
+        val_b.z += (val_a.z * scalar);
+        val_b.w += (val_a.w * scalar);
+    }
+    Store(tgt + index, val_b);
+}
+
+#endif // __CUDACC__
\ No newline at end of file
diff --git a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_test.cpp b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_main.cpp
similarity index 100%
rename from superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_test.cpp
rename to superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_main.cpp
diff --git a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp
index 6ced0fdd5..fd0dfb913 100644
--- a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp
+++ b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp
@@ -43,6 +43,7 @@ void PrintUsage() {
               << "--size <size in bytes> "
               << "--num_warm_up <num_warm_up> "
               << "--num_loops <num_loops> "
+              << "[--data_type <float|double>] "
               << "[--check_data]" << std::endl;
 }
 
@@ -60,6 +61,7 @@ void PrintInputInfo(Opts &opts) {
     std::cout << "Buffer size(bytes): " << opts.size << std::endl;
     std::cout << "Number of warm up runs: " << opts.num_warm_up << std::endl;
     std::cout << "Number of loops: " << opts.num_loops << std::endl;
+    std::cout << "Data type: " << opts.data_type << std::endl;
     std::cout << "Check data: " << (opts.check_data ? "Yes" : "No") << std::endl;
 }
 
@@ -75,11 +77,12 @@ void PrintInputInfo(Opts &opts) {
  * @return int The status code.
  * */
 int ParseOpts(int argc, char **argv, Opts *opts) {
-    enum class OptIdx { kSize, kNumWarmUp, kNumLoops, kEnableCheckData };
+    enum class OptIdx { kSize, kNumWarmUp, kNumLoops, kEnableCheckData, kDataType };
     const struct option options[] = {{"size", required_argument, nullptr, static_cast<int>(OptIdx::kSize)},
                                      {"num_warm_up", required_argument, nullptr, static_cast<int>(OptIdx::kNumWarmUp)},
                                      {"num_loops", required_argument, nullptr, static_cast<int>(OptIdx::kNumLoops)},
-                                     {"check_data", no_argument, nullptr, static_cast<int>(OptIdx::kEnableCheckData)}};
+                                     {"check_data", no_argument, nullptr, static_cast<int>(OptIdx::kEnableCheckData)},
+                                     {"data_type", required_argument, nullptr, static_cast<int>(OptIdx::kDataType)}};
     int getopt_ret = 0;
     int opt_idx = 0;
     bool size_specified = true;
@@ -126,6 +129,13 @@ int ParseOpts(int argc, char **argv, Opts *opts) {
         case static_cast<int>(OptIdx::kEnableCheckData):
             opts->check_data = true;
             break;
+        case static_cast<int>(OptIdx::kDataType):
+            opts->data_type = optarg;
+            if (opts->data_type != "float" && opts->data_type != "double") {
+                std::cerr << "Invalid data_type: " << optarg << ". Must be 'float' or 'double'." << std::endl;
+                parse_err = true;
+            }
+            break;
         default:
             parse_err = true;
         }
diff --git a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.hpp b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.hpp
index 0c648514b..907d05ef2 100644
--- a/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.hpp
+++ b/superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.hpp
@@ -31,7 +31,6 @@ unsigned long long getCurrentTimestampInMicroseconds();
 namespace stream_config {
 constexpr std::array<int, 4> kThreadsPerBlock = {128, 256, 512, 1024}; // Threads per block
 constexpr uint64_t kDefaultBufferSizeInBytes = 4294967296;             // Default buffer size 4GB
-constexpr int kNumLoopUnroll = 2;                                      // Unroll depth in SM copy kernel
 constexpr int kNumBuffers = 3;                                         // Number of buffers for triad, add kernel
 constexpr int kNumValidationBuffers = 4;                          // Number of validation buffers, one for each kernel
 constexpr int kUInt8Mod = 256;                                    // Modulo for unsigned long data type
@@ -83,10 +82,7 @@ template <typename T> struct SubBenchArgs {
 // Arguments for each benchmark run.
 template <typename T> struct BenchArgs {
 
-    // NUMA node under which the benchmark is done.
-    uint64_t numa_id = 0;
-
-    // GPU ID for device.
+    // GPU ID for device (always 0 - actual GPU determined by CUDA_VISIBLE_DEVICES).
     int gpu_id = 0;
 
     // GPU device info
@@ -121,6 +117,9 @@ struct Opts {
 
     // Whether check data after copy.
     bool check_data = false;
+
+    // Data type for the benchmark ("float" or "double").
+    std::string data_type = "double";
 };
 
 std::string KernelToString(int); // Function to convert enum to string
diff --git a/tests/benchmarks/micro_benchmarks/test_gpu_stream.py b/tests/benchmarks/micro_benchmarks/test_gpu_stream.py
index dc8048054..bf37074e3 100644
--- a/tests/benchmarks/micro_benchmarks/test_gpu_stream.py
+++ b/tests/benchmarks/micro_benchmarks/test_gpu_stream.py
@@ -20,7 +20,7 @@ def setUpClass(cls):
         cls.createMockEnvs(cls)
         cls.createMockFiles(cls, ['bin/gpu_stream'])
 
-    def _test_gpu_stream_command_generation(self, platform):
+    def _test_gpu_stream_command_generation(self, platform, data_type='double'):
         """Test gpu-stream benchmark command generation."""
         benchmark_name = 'gpu-stream'
         (benchmark_class,
@@ -31,9 +31,9 @@ def _test_gpu_stream_command_generation(self, platform):
         num_loops = 10
         size = 25769803776
 
-        parameters = '--num_warm_up %d --num_loops %d --size %d ' \
+        parameters = '--num_warm_up %d --num_loops %d --size %d --data_type %s ' \
             '--check_data' % \
-            (num_warm_up, num_loops, size)
+            (num_warm_up, num_loops, size, data_type)
         benchmark = benchmark_class(benchmark_name, parameters=parameters)
 
         # Check basic information
@@ -49,6 +49,7 @@ def _test_gpu_stream_command_generation(self, platform):
         assert (benchmark._args.num_warm_up == num_warm_up)
         assert (benchmark._args.num_loops == num_loops)
         assert (benchmark._args.check_data)
+        assert (benchmark._args.data_type == data_type)
 
         # Check command
         assert (1 == len(benchmark._commands))
@@ -56,6 +57,7 @@ def _test_gpu_stream_command_generation(self, platform):
         assert ('--size %d' % size in benchmark._commands[0])
         assert ('--num_warm_up %d' % num_warm_up in benchmark._commands[0])
         assert ('--num_loops %d' % num_loops in benchmark._commands[0])
+        assert ('--data_type %s' % data_type in benchmark._commands[0])
         assert ('--check_data' in benchmark._commands[0])
 
     @decorator.cuda_test
@@ -63,6 +65,11 @@ def test_gpu_stream_command_generation_cuda(self):
         """Test gpu-stream benchmark command generation, CUDA case."""
         self._test_gpu_stream_command_generation(Platform.CUDA)
 
+    @decorator.cuda_test
+    def test_gpu_stream_command_generation_cuda_float(self):
+        """Test gpu-stream benchmark command generation with float, CUDA case."""
+        self._test_gpu_stream_command_generation(Platform.CUDA, data_type='float')
+
     @decorator.load_data('tests/data/gpu_stream.log')
     def _test_gpu_stream_result_parsing(self, platform, test_raw_output):
         """Test gpu-stream benchmark result parsing."""
@@ -110,3 +117,44 @@ def _test_gpu_stream_result_parsing(self, platform, test_raw_output):
     def test_gpu_stream_result_parsing_cuda(self):
         """Test gpu-stream benchmark result parsing, CUDA case."""
         self._test_gpu_stream_result_parsing(Platform.CUDA)
+
+    @decorator.load_data('tests/data/gpu_stream_float.log')
+    def _test_gpu_stream_result_parsing_float(self, platform, test_raw_output):
+        """Test gpu-stream benchmark result parsing for float data type."""
+        benchmark_name = 'gpu-stream'
+        (benchmark_class,
+         predefine_params) = BenchmarkRegistry._BenchmarkRegistry__select_benchmark(benchmark_name, platform)
+        assert (benchmark_class)
+        benchmark = benchmark_class(benchmark_name, parameters='--data_type float')
+        assert (benchmark)
+        ret = benchmark._preprocess()
+        assert (ret is True)
+        assert (benchmark.return_code == ReturnCode.SUCCESS)
+
+        # Positive case - valid raw output with float tags.
+        assert (benchmark._process_raw_result(0, test_raw_output))
+        assert (benchmark.return_code == ReturnCode.SUCCESS)
+
+        assert (1 == len(benchmark.raw_data))
+        test_raw_output_dict = {
+            x.split()[0]: [float(x.split()[1]), float(x.split()[2])]
+            for x in test_raw_output.strip().splitlines() if x.startswith('STREAM_')
+        }
+        assert (len(test_raw_output_dict) * 2 + benchmark.default_metric_count == len(benchmark.result))
+        for output_key in benchmark.result:
+            if output_key == 'return_code':
+                assert (benchmark.result[output_key] == [0])
+            else:
+                assert (len(benchmark.result[output_key]) == 1)
+                assert (isinstance(benchmark.result[output_key][0], numbers.Number))
+                if output_key.endswith('_bw'):
+                    assert (output_key.strip('_bw') in test_raw_output_dict)
+                    assert (test_raw_output_dict[output_key.strip('_bw')][0] == benchmark.result[output_key][0])
+                else:
+                    assert (output_key.strip('_ratio') in test_raw_output_dict)
+                    assert (test_raw_output_dict[output_key.strip('_ratio')][1] == benchmark.result[output_key][0])
+
+    @decorator.cuda_test
+    def test_gpu_stream_result_parsing_cuda_float(self):
+        """Test gpu-stream benchmark result parsing for float, CUDA case."""
+        self._test_gpu_stream_result_parsing_float(Platform.CUDA)
diff --git a/tests/data/gpu_stream.log b/tests/data/gpu_stream.log
index c3d6f2390..a3dcf2b01 100644
--- a/tests/data/gpu_stream.log
+++ b/tests/data/gpu_stream.log
@@ -2,40 +2,23 @@ STREAM Benchmark
 Buffer size(bytes): 4294967296
 Number of warm up runs: 10
 Number of loops: 40
+Data type: double
 Check data: No
 
 Device 0: "NVIDIA Graphics Device"  152 SMs(10.0)  Memory: 4000MHz x 8192-bit = 8192 GB/s PEAK   ECC is ON
-STREAM_COPY_double_gpu_0_buffer_4294967296_block_128    6711.67 81.93
-STREAM_COPY_double_gpu_0_buffer_4294967296_block_256    6549.50 79.95
-STREAM_COPY_double_gpu_0_buffer_4294967296_block_512    6195.43 75.63
-STREAM_COPY_double_gpu_0_buffer_4294967296_block_1024   5721.52 69.84
-STREAM_SCALE_double_gpu_0_buffer_4294967296_block_128   6680.42 81.55
-STREAM_SCALE_double_gpu_0_buffer_4294967296_block_256   6515.51 79.54
-STREAM_SCALE_double_gpu_0_buffer_4294967296_block_512   6106.69 74.54
-STREAM_SCALE_double_gpu_0_buffer_4294967296_block_1024  5626.68 68.69
-STREAM_ADD_double_gpu_0_buffer_4294967296_block_128     7379.25 90.08
-STREAM_ADD_double_gpu_0_buffer_4294967296_block_256     7407.27 90.42
-STREAM_ADD_double_gpu_0_buffer_4294967296_block_512     7309.59 89.23
-STREAM_ADD_double_gpu_0_buffer_4294967296_block_1024    6788.64 82.87
-STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_128   7378.19 90.07
-STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_256   7414.01 90.50
-STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_512   7295.50 89.06
-STREAM_TRIAD_double_gpu_0_buffer_4294967296_block_1024  6730.42 82.16
-
-Device 1: "NVIDIA Graphics Device"  152 SMs(10.0)  Memory: 4000.00MHz x 8192-bit = 8192.00 GB/s PEAK   ECC is ON
-STREAM_COPY_double_gpu_1_buffer_4294967296_block_128    6708.74 81.89
-STREAM_COPY_double_gpu_1_buffer_4294967296_block_256    6549.47 79.95
-STREAM_COPY_double_gpu_1_buffer_4294967296_block_512    6195.39 75.63
-STREAM_COPY_double_gpu_1_buffer_4294967296_block_1024   5725.07 69.89
-STREAM_SCALE_double_gpu_1_buffer_4294967296_block_128   6678.56 81.53
-STREAM_SCALE_double_gpu_1_buffer_4294967296_block_256   6514.05 79.52
-STREAM_SCALE_double_gpu_1_buffer_4294967296_block_512   6103.80 74.51
-STREAM_SCALE_double_gpu_1_buffer_4294967296_block_1024  5630.41 68.73
-STREAM_ADD_double_gpu_1_buffer_4294967296_block_128     7377.74 90.06
-STREAM_ADD_double_gpu_1_buffer_4294967296_block_256     7410.97 90.47
-STREAM_ADD_double_gpu_1_buffer_4294967296_block_512     7310.80 89.24
-STREAM_ADD_double_gpu_1_buffer_4294967296_block_1024    6789.91 82.88
-STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_128   7379.03 90.08
-STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_256   7414.04 90.50
-STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_512   7298.26 89.09
-STREAM_TRIAD_double_gpu_1_buffer_4294967296_block_1024  6732.15 82.18
\ No newline at end of file
+STREAM_COPY_double_buffer_4294967296_block_128	6711.67	81.93
+STREAM_COPY_double_buffer_4294967296_block_256	6549.50	79.95
+STREAM_COPY_double_buffer_4294967296_block_512	6195.43	75.63
+STREAM_COPY_double_buffer_4294967296_block_1024	5721.52	69.84
+STREAM_SCALE_double_buffer_4294967296_block_128	6680.42	81.55
+STREAM_SCALE_double_buffer_4294967296_block_256	6515.51	79.54
+STREAM_SCALE_double_buffer_4294967296_block_512	6106.69	74.54
+STREAM_SCALE_double_buffer_4294967296_block_1024	5626.68	68.69
+STREAM_ADD_double_buffer_4294967296_block_128	7379.25	90.08
+STREAM_ADD_double_buffer_4294967296_block_256	7407.27	90.42
+STREAM_ADD_double_buffer_4294967296_block_512	7309.59	89.23
+STREAM_ADD_double_buffer_4294967296_block_1024	6788.64	82.87
+STREAM_TRIAD_double_buffer_4294967296_block_128	7378.19	90.07
+STREAM_TRIAD_double_buffer_4294967296_block_256	7414.01	90.50
+STREAM_TRIAD_double_buffer_4294967296_block_512	7295.50	89.06
+STREAM_TRIAD_double_buffer_4294967296_block_1024	6730.42	82.16
\ No newline at end of file
diff --git a/tests/data/gpu_stream_float.log b/tests/data/gpu_stream_float.log
new file mode 100644
index 000000000..d1ab3ad1e
--- /dev/null
+++ b/tests/data/gpu_stream_float.log
@@ -0,0 +1,24 @@
+STREAM Benchmark
+Buffer size(bytes): 4294967296
+Number of warm up runs: 10
+Number of loops: 40
+Data type: float
+Check data: No
+
+Device 0: "NVIDIA Graphics Device"  152 SMs(10.0)  Memory: 4000MHz x 8192-bit = 8192 GB/s PEAK   ECC is ON
+STREAM_COPY_float_buffer_4294967296_block_128	6823.45	83.30
+STREAM_COPY_float_buffer_4294967296_block_256	6650.12	81.18
+STREAM_COPY_float_buffer_4294967296_block_512	6301.88	76.93
+STREAM_COPY_float_buffer_4294967296_block_1024	5812.34	70.95
+STREAM_SCALE_float_buffer_4294967296_block_128	6790.11	82.89
+STREAM_SCALE_float_buffer_4294967296_block_256	6620.33	80.81
+STREAM_SCALE_float_buffer_4294967296_block_512	6210.45	75.81
+STREAM_SCALE_float_buffer_4294967296_block_1024	5718.90	69.81
+STREAM_ADD_float_buffer_4294967296_block_128	7490.22	91.43
+STREAM_ADD_float_buffer_4294967296_block_256	7512.10	91.70
+STREAM_ADD_float_buffer_4294967296_block_512	7405.67	90.40
+STREAM_ADD_float_buffer_4294967296_block_1024	6890.33	84.11
+STREAM_TRIAD_float_buffer_4294967296_block_128	7485.55	91.38
+STREAM_TRIAD_float_buffer_4294967296_block_256	7520.88	91.81
+STREAM_TRIAD_float_buffer_4294967296_block_512	7390.12	90.21
+STREAM_TRIAD_float_buffer_4294967296_block_1024	6825.11	83.32