
Commit 5f15f19

Speed up recall calculation in cuVS Bench for large top-K (#1816)
Currently, recall calculation in cuVS Bench essentially runs an outer loop over the `k` ground-truth vector IDs and an inner loop over the `k` ANN result vector IDs, incrementing a counter whenever a result matches a ground-truth ID. This works well when `k` is small, but the complexity is `O(k^2)`. When benchmarking use cases with large `k` values, the recall calculation becomes a bottleneck, especially since a large `k` does not necessarily lead to much slower search times: the recall calculation is performed about as many times as it would be for a small `k`, leading to unacceptable (or at least humanly unbearable) run times.

This update speeds up the recall calculation in the following ways:

1. Eager hashing of vector IDs
   - During construction of the dataset, we populate for each query a `std::unordered_map` of {vector_id, neighbor_rank}. This step has complexity `O(k)` and the hash maps are cached for all benchmark cases.
   - During search, we look up each search result in the ground-truth map to determine whether it is a true result. This step has complexity `O(k)`.
2. Parallelizing hash map build and lookup
   - We use basic threading to parallelize recall calculation at the query level (for ease of implementation and cache locality).
   - Care is taken to avoid oversubscribing the CPU when benchmarking is run on multiple threads, e.g. in throughput mode.
3. Capping the total number of queries for which recall is calculated at about 10,000
   - This avoids unbounded recall calculations when using large sets of queries and ground truths while performing many iterations of the benchmark case.
   - The underlying assumption is that the sample of queries used for recall calculation is representative of the recall performance of the benchmark case tested.
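The hash-based matching in point 1 can be sketched in isolation. This is a minimal illustration of the idea, not the PR's actual API: `build_gt_map` and `count_matches` here are simplified stand-ins for the `ground_truth_map` struct the commit adds.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Build a {vector_id -> neighbor_rank} map for one query's ground truth: O(k).
std::unordered_map<std::int32_t, std::int32_t> build_gt_map(
  const std::vector<std::int32_t>& gt_ids)
{
  std::unordered_map<std::int32_t, std::int32_t> m;
  for (std::int32_t rank = 0; rank < static_cast<std::int32_t>(gt_ids.size()); ++rank) {
    m[gt_ids[rank]] = rank;
  }
  return m;
}

// Count how many ANN results appear among the top-k ground-truth neighbors.
// Each lookup is O(1) expected, so the whole pass is O(k) instead of the
// O(k^2) nested-loop comparison.
std::size_t count_matches(const std::unordered_map<std::int32_t, std::int32_t>& gt_map,
                          const std::vector<std::int32_t>& candidates,
                          std::size_t k)
{
  std::size_t matching = 0;
  for (std::size_t i = 0; i < k && i < candidates.size(); ++i) {
    auto it = gt_map.find(candidates[i]);
    // Only ground-truth neighbors with rank < k count toward recall@k.
    if (it != gt_map.end() && static_cast<std::size_t>(it->second) < k) { ++matching; }
  }
  return matching;
}
```

Recall for a query is then `matching / double(k)`; since the map is built once per query and cached, the per-iteration cost of recall evaluation drops from quadratic to linear in `k`.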
Testing at k=15000, batch-size=500, iterations=20, cpu=AMD EPYC 7413 (24 cores / 48 threads):

- baseline wall time: 285 s
- improved wall time: 3.7 s
- Note that the wall time includes loading and running the benchmarks, which takes over 1 s at these settings.
- Also note that if the number of iterations is not specified, the benchmark would run for over 100 iterations, which would make the baseline runtime much slower, as recall calculation would be performed for far more than 10,000 queries.
- At k=10, wall times are 1.369 s (PR) vs. 1.362 s (baseline).

Authors:
- James Xia (https://github.com/jamxia155)
- Anupam (https://github.com/aamijar)

Approvers:
- Artem M. Chirkin (https://github.com/achirkin)

URL: #1816
1 parent 74681b5 commit 5f15f19

2 files changed

Lines changed: 158 additions & 52 deletions


cpp/bench/ann/src/common/benchmark.hpp

Lines changed: 43 additions & 39 deletions
@@ -1,5 +1,5 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2023-2025, NVIDIA CORPORATION.
+ * SPDX-FileCopyrightText: Copyright (c) 2023-2026, NVIDIA CORPORATION.
  * SPDX-License-Identifier: Apache-2.0
  */
 #pragma once
@@ -351,15 +351,9 @@ void bench_search(::benchmark::State& state,

     // Each thread calculates recall on their partition of queries.
     // evaluate recall
-    if (dataset->max_k() >= k) {
-      const std::int32_t* gt = dataset->gt_set();
-      const std::uint32_t* filter_bitset = dataset->filter_bitset(MemoryType::kHostMmap);
-      auto filter = [filter_bitset](std::int32_t i) -> bool {
-        if (filter_bitset == nullptr) { return true; }
-        auto word = filter_bitset[i >> 5];
-        return word & (1 << (i & 31));
-      };
-      const std::uint32_t max_k = dataset->max_k();
+    if (dataset->max_k() >= k && dataset->gt_maps().has_value()) {
+      // gt_maps[i] is a hash map of {id, neighbor_rank} for query i
+      const auto& gt_maps = dataset->gt_maps();
       result_buf.transfer_data(MemoryType::kHost, current_algo_props->query_memory_type);
       auto* neighbors_host = reinterpret_cast<index_type*>(result_buf.data(MemoryType::kHost));
       std::size_t rows = std::min(queries_processed, query_set_size);
@@ -369,39 +363,49 @@ void bench_search(::benchmark::State& state,
       // We go through the groundtruth with same stride as the benchmark loop.
       size_t out_offset = 0;
       size_t batch_offset = (state.thread_index() * n_queries) % query_set_size;
+      // Avoid CPU oversubscription when parallelizing recall calculation loop
+      int num_recall_calculation_worker_threads =
+        std::thread::hardware_concurrency() / benchmark_n_threads - 1;  // -1 for the main thread
+      // ensure non-negative number of workers (possible if hardware_concurrency()
+      // does not return an expected value) by clamping to 0
+      if (num_recall_calculation_worker_threads < 0) { num_recall_calculation_worker_threads = 0; }
       while (out_offset < rows) {
-        for (std::size_t i = 0; i < n_queries; i++) {
-          size_t i_orig_idx = batch_offset + i;
-          size_t i_out_idx  = out_offset + i;
-          if (i_out_idx < rows) {
-            /* NOTE: recall correctness & filtering
-
-               In the loop below, we filter the ground truth values on-the-fly.
-               We need enough ground truth values to compute recall correctly though.
-               But the ground truth file only contains `max_k` values per row; if there are less valid
-               values than k among them, we overestimate the recall. Essentially, we compare the first
-               `filter_pass_count` values of the algorithm output, and this counter can be less than `k`.
-               In the extreme case of very high filtering rate, we may be bypassing entire rows of
-               results. However, this is still better than no recall estimate at all.
-
-               TODO: consider generating the filtered ground truth on-the-fly
-            */
-            uint32_t filter_pass_count = 0;
-            for (std::uint32_t l = 0; l < max_k && filter_pass_count < k; l++) {
-              auto exp_idx = gt[i_orig_idx * max_k + l];
-              if (!filter(exp_idx)) { continue; }
-              filter_pass_count++;
-              for (std::uint32_t j = 0; j < k; j++) {
-                auto act_idx = static_cast<std::int32_t>(neighbors_host[i_out_idx * k + j]);
-                if (act_idx == exp_idx) {
-                  match_count++;
-                  break;
-                }
-              }
+        std::vector<std::thread> recall_calculation_workers;
+        recall_calculation_workers.reserve(num_recall_calculation_worker_threads);
+        std::vector<std::size_t> local_match_count(num_recall_calculation_worker_threads + 1);
+        std::vector<std::size_t> local_total_count(num_recall_calculation_worker_threads + 1);
+        int chunk_size =
+          n_queries / (num_recall_calculation_worker_threads + 1);  // +1 for the main thread
+        int remainder = n_queries % (num_recall_calculation_worker_threads + 1);
+        auto recall_calculation = [&](int start, int end, int tid) -> void {
+          for (int i = start; i < end; ++i) {
+            size_t i_orig_idx = batch_offset + i;
+            size_t i_out_idx  = out_offset + i;
+            if (i_out_idx < rows) {
+              auto* candidates = neighbors_host + i_out_idx * k;
+              auto [matching, total] = gt_maps->count_matches(i_orig_idx, candidates, k);
+              local_match_count[tid] += matching;
+              local_total_count[tid] += total;
             }
-            total_count += filter_pass_count;
           }
+        };
+        // launch worker threads
+        int start = 0;
+        for (int tid = 0; tid < num_recall_calculation_worker_threads; tid++) {
+          int end = start + chunk_size;
+          if (tid < remainder) { ++end; }
+          recall_calculation_workers.emplace_back(recall_calculation, start, end, tid);
+          start = end;
         }
+        // main thread works on last chunk
+        recall_calculation(start, n_queries, num_recall_calculation_worker_threads);
+        // join all worker threads
+        for (auto& worker : recall_calculation_workers) {
+          worker.join();
+        }
+        match_count += std::accumulate(local_match_count.begin(), local_match_count.end(), 0);
+        total_count += std::accumulate(local_total_count.begin(), local_total_count.end(), 0);
+
         out_offset += n_queries;
         batch_offset = (batch_offset + queries_stride) % query_set_size;
       }
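The chunk-size/remainder partitioning used above (and again in the map-building code in dataset.hpp) spreads `n_queries` as evenly as possible across workers plus the main thread. A standalone sketch of the scheme, with illustrative names (`run_partitioned` is not part of the PR):

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Split n items across (num_workers + 1) participants (the +1 is the main
// thread), giving the first `remainder` participants one extra item, and run
// `fn(start, end, tid)` on each contiguous [start, end) chunk.
template <typename Fn>
void run_partitioned(int n, int num_workers, Fn fn)
{
  int chunk_size = n / (num_workers + 1);
  int remainder  = n % (num_workers + 1);
  std::vector<std::thread> workers;
  workers.reserve(num_workers);
  int start = 0;
  for (int tid = 0; tid < num_workers; ++tid) {
    int end = start + chunk_size + (tid < remainder ? 1 : 0);
    workers.emplace_back(fn, start, end, tid);
    start = end;
  }
  fn(start, n, num_workers);  // main thread takes the last chunk
  for (auto& w : workers) { w.join(); }
}
```

Each thread writes only its own `local_match_count[tid]` / `local_total_count[tid]` slot, so no mutex or atomic is needed; the partial counts are summed with `std::accumulate` only after all workers have joined.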

cpp/bench/ann/src/common/dataset.hpp

Lines changed: 115 additions & 13 deletions
@@ -1,5 +1,5 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2023-2025, NVIDIA CORPORATION.
+ * SPDX-FileCopyrightText: Copyright (c) 2023-2026, NVIDIA CORPORATION.
  * SPDX-License-Identifier: Apache-2.0
  */
 #pragma once
@@ -14,6 +14,7 @@
 #include <optional>
 #include <random>
 #include <string>
+#include <thread>

 namespace cuvs::bench {

@@ -33,19 +34,121 @@ void generate_bernoulli(CarrierT* data, size_t words, double p)
   }
 };

+template <typename T>
+struct ground_truth_map {
+  using bitset_carrier_type = uint32_t;
+  static constexpr uint32_t kMaxQueriesForRecall = 10'000;
+
+  explicit ground_truth_map(std::string file_name,
+                            uint32_t n_queries,
+                            std::optional<blob<bitset_carrier_type>>& filter_bitset)
+    : gt_maps_(n_queries)
+  {
+    // Eagerly iterate over and optionally filter the ground truth set to build gt_maps_ for up to
+    // kMaxQueriesForRecall queries
+    /* NOTE: recall correctness & filtering
+
+       We generate the filtered ground truth values and build unordered_maps with them to
+       enable O(1) lookup. We need enough ground truth values to compute recall correctly
+       though. But the ground truth file only contains `max_k_` values per row; if there are
+       less valid values than k among them, we overestimate the recall. Essentially, we compare
+       the first `gt_maps_[query_idx].size()` values of the algorithm output, and this value can be
+       less than `k`. In the extreme case of very high filtering rate, we may be bypassing
+       entire rows of results. However, this is still better than no recall estimate at all.
+    */
+    auto ground_truth_set = blob<T>(file_name);
+    max_k_ = ground_truth_set.n_cols();
+    auto filter = [&](T i) -> bool {
+      if (!filter_bitset.has_value()) { return true; }
+      // bitset is `32 = bitset_carrier_type * 8` times more dense than the data
+      // use bitwise arithmetic to get the `row_id` and correct bit pos in the `word`
+      auto word = filter_bitset->data(MemoryType::kHostMmap)[i >> 5];
+      return word & (1 << (i & 31));
+    };
+    // Avoid CPU oversubscription when parallelizing recall calculation loop
+    int num_map_building_worker_threads =
+      std::thread::hardware_concurrency() - 1;  // -1 for the main thread
+    // ensure non-negative number of workers (possible if hardware_concurrency()
+    // does not return an expected value) by clamping to 0
+    if (num_map_building_worker_threads < 0) { num_map_building_worker_threads = 0; }
+    std::vector<std::thread> gt_map_building_workers;
+    gt_map_building_workers.reserve(num_map_building_worker_threads);
+    int chunk_size = n_queries / (num_map_building_worker_threads + 1);
+    int remainder  = n_queries % (num_map_building_worker_threads + 1);
+    int stride     = (n_queries - 1) / kMaxQueriesForRecall + 1;  // round-up division
+    auto build_gt_map = [&](int start, int end, int tid) -> void {
+      for (int query_idx = start; query_idx < end; ++query_idx) {
+        if (query_idx % stride) continue;
+        for (std::uint32_t neighbor_rank = 0; neighbor_rank < max_k_; ++neighbor_rank) {
+          auto id = ground_truth_set.data()[query_idx * max_k_ + neighbor_rank];
+          if (!filter(id)) { continue; }
+          if (gt_maps_[query_idx].count(id)) {
+            throw std::invalid_argument(
+              "Duplicate neighbor id found in ground truth set for query " +
+              std::to_string(query_idx));
+          }
+          gt_maps_[query_idx][id] = neighbor_rank;
+        }
+      }
+    };
+    // launch worker threads
+    int start = 0;
+    for (int tid = 0; tid < num_map_building_worker_threads; tid++) {
+      int end = start + chunk_size;
+      if (tid < remainder) { ++end; }
+      gt_map_building_workers.emplace_back(build_gt_map, start, end, tid);
+      start = end;
+    }
+    // main thread works on last chunk
+    build_gt_map(start, n_queries, num_map_building_worker_threads);
+    // join all worker threads
+    for (auto& worker : gt_map_building_workers) {
+      worker.join();
+    }
+  }
+
+  [[nodiscard]] auto max_k() const -> uint32_t { return max_k_; }
+
+  template <typename index_type>
+  [[nodiscard]] auto count_matches(size_t query_idx, const index_type* candidates, uint32_t k) const
+    -> std::pair<size_t, size_t>
+  {
+    if (query_idx >= gt_maps_.size() || gt_maps_[query_idx].empty()) return {0, 0};
+
+    size_t matching = 0;
+    for (uint32_t i = 0; i < k; ++i) {
+      auto act_idx = candidates[i];
+      if (gt_maps_[query_idx].count(act_idx) &&
+          static_cast<uint32_t>(gt_maps_[query_idx].at(act_idx)) < k) {
+        ++matching;
+      }
+    }
+    size_t total = std::min(gt_maps_[query_idx].size(), static_cast<size_t>(k));
+    return {matching, total};
+  }
+
+ private:
+  // Hash maps of {id, neighbor_rank} for up to kMaxQueriesForRecall queries in the ground truth set
+  // e.g. gt_maps_[i][j] = k means that for the i-th query in the ground truth set, the neighbor
+  // with idx j is the k-th nearest. Note that the nearest neighbor rank starts from 0.
+  std::vector<std::unordered_map<T, T>> gt_maps_;
+  uint32_t max_k_ = 0;  // number of nearest neighbors in the ground truth
+};
+
 template <typename DataT, typename IdxT = int32_t>
 struct dataset {
  public:
-  using bitset_carrier_type = uint32_t;
+  using bitset_carrier_type = typename ground_truth_map<IdxT>::bitset_carrier_type;
   static inline constexpr size_t kBitsPerCarrierValue = sizeof(bitset_carrier_type) * 8;

  private:
   std::string name_;
   std::string distance_;
   blob<DataT> base_set_;
   blob<DataT> query_set_;
-  std::optional<blob<IdxT>> ground_truth_set_;
   std::optional<blob<bitset_carrier_type>> filter_bitset_;
+  std::optional<ground_truth_map<IdxT>> ground_truth_map_;

   // Protects the lazy mutations of the blobs accessed by multiple threads
   mutable std::mutex mutex_;
@@ -73,10 +176,7 @@ struct dataset {
     : name_{std::move(name)},
       distance_{std::move(distance)},
       base_set_{base_file, subset_first_row, subset_size},
-      query_set_{query_file},
-      ground_truth_set_{groundtruth_neighbors_file.has_value()
-                          ? std::make_optional<blob<IdxT>>(groundtruth_neighbors_file.value())
-                          : std::nullopt}
+      query_set_{query_file}
   {
     if (filtering_rate.has_value()) {
       // Generate a random bitset for filtering
@@ -94,6 +194,11 @@ struct dataset {
                          1.0 - filtering_rate.value());
       filter_bitset_.emplace(std::move(bitset_blob));
     }
+
+    if (groundtruth_neighbors_file.has_value()) {
+      ground_truth_map_.emplace(ground_truth_map<IdxT>{
+        groundtruth_neighbors_file.value(), query_set_.n_rows(), filter_bitset_});
+    }
   }

   [[nodiscard]] auto name() const -> std::string { return name_; }
@@ -118,8 +223,7 @@ struct dataset {
   }
   [[nodiscard]] auto max_k() const -> uint32_t
   {
-    std::lock_guard<std::mutex> lock(mutex_);
-    if (ground_truth_set_.has_value()) { return ground_truth_set_->n_cols(); }
+    if (ground_truth_map_.has_value()) { return ground_truth_map_->max_k(); }
     return 0;
   }
   [[nodiscard]] auto base_set_size() const -> size_t
@@ -137,11 +241,9 @@ struct dataset {
     return r;
   }

-  [[nodiscard]] auto gt_set() const -> const IdxT*
+  [[nodiscard]] auto gt_maps() const -> const std::optional<ground_truth_map<IdxT>>&
   {
-    std::lock_guard<std::mutex> lock(mutex_);
-    if (ground_truth_set_.has_value()) { return ground_truth_set_->data(); }
-    return nullptr;
+    return ground_truth_map_;
   }

   [[nodiscard]] auto query_set() const -> const DataT*
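The stride in the diff above, `(n_queries - 1) / kMaxQueriesForRecall + 1`, is round-up division: queries whose index is not a multiple of the stride are skipped, so at most about 10,000 maps are built no matter how large the query set is. A quick sketch of the arithmetic (helper names are illustrative, not part of the PR):

```cpp
#include <cassert>
#include <cstdint>

// Round-up division: the smallest stride such that sampling every stride-th
// query index keeps the number of sampled queries within the cap.
constexpr std::uint32_t recall_stride(std::uint32_t n_queries, std::uint32_t max_queries)
{
  return (n_queries - 1) / max_queries + 1;
}

// Number of indices in [0, n_queries) divisible by the stride, i.e. how many
// queries actually get a ground-truth map built for them.
constexpr std::uint32_t sampled_queries(std::uint32_t n_queries, std::uint32_t max_queries)
{
  return (n_queries - 1) / recall_stride(n_queries, max_queries) + 1;
}
```

For query sets at or below the cap the stride is 1 (every query is kept), so the change only affects very large query sets, consistent with the assumption stated in the commit message that the sampled queries remain representative.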
