Fix RNG seeding bug in multi-GPU KMeans #2029
vinaydes wants to merge 24 commits into rapidsai:main from
Conversation
Thanks for looking into this @vinaydes!
Each worker should indeed be seeded differently, and it looks like the PR solves the issue. The reproducer now runs a roughly similar number of iterations with different random number generators (though many more on my machine).
One important but solvable issue, though: the recluster in `initKMeansPlusPlus` when `potentialCentroids.extent(0) > n_clusters` absolutely must be deterministic. Otherwise the final centroids to be used for KMeans would diverge across ranks.
```cpp
#define KMEANS_COMM_ROOT 0

static cuvs::cluster::kmeans::params default_params;
```
Looks like `default_params` is now dead code and could be removed.
```cpp
// generate `n_random_clusters` centroids
cuvs::cluster::kmeans::params rand_params = params;
rand_params.rng_state = default_params.rng_state;
rand_params.rng_state.seed = gen_64();
```
Could be misleading to set this `rand_params.rng_state.seed` field here, as `initRandom` does not use it anymore but uses the `gen_64` argument instead.
```diff
  // Note - reclustering step is duplicated across all ranks and with the same
  // seed they should generate the same potentialCentroids
  auto const_centroids = raft::make_device_matrix_view<const DataT, IndexT>(
    potentialCentroids.data_handle(), potentialCentroids.extent(0), potentialCentroids.extent(1));
+ auto params_copy            = params;
+ params_copy.rng_state.seed  = gen_64();
  cuvs::cluster::kmeans::init_plus_plus(
-   handle, params, const_centroids, centroidsRawData, workspace);
+   handle, params_copy, const_centroids, centroidsRawData, workspace);
```
This is an issue. When initKMeansPlusPlus oversamples (potentialCentroids.extent(0) > n_clusters), a mini single-GPU KMeans is run locally on each rank to reduce the candidates down to n_clusters. Since potentialCentroids and weight are already identical across ranks (via prior allgatherv/allreduce), and no communication happens after this reduction, all ranks must use the same RNG seed to produce identical results. Currently, init_plus_plus is given a per-rank-divergent seed from gen_64(), which breaks this invariant.
Could we generate a seed at rank 0 and broadcast it to the other ranks? Would that be an acceptable solution? What implications would a sync here have on load balancing?
> What implications would a sync here have on load balancing?
At this step the workers just completed the allreduce operation. Broadcasting a seed should be relatively cheap to do.
> We can generate a seed at rank 0 and broadcast it to other ranks? Would that be an acceptable solution?
I think that this solution could work. That said since earlier steps already introduced some randomness, we could theoretically perform this recluster step with a constant seed.
Also, it would be great if we could double-check that everything works as expected before we merge the PR: in particular, that the KMeans algorithm actually starts with similar centroid initialization on all workers, whichever init mode is chosen and whether or not there is a recluster step.
/ok to test 5fb247c
This fixes issue #1829.
Root cause
In the multi-GPU (`kmeans_mg`) KMeans implementation, multiple distinct random sampling operations were all initialized with the same seed (`params.rng_state.seed`), and all threads were also initialized with that same seed. Each operation therefore generated identical random sequences, leading to biased or degenerate centroid initialization, particularly visible during the KMeans++ and random init phases where independent draws are expected. A secondary seeding issue existed in the single-GPU `kmeansPlusPlus`: the CPU and GPU RNGs were both seeded with the raw `params.rng_state.seed`, which is incorrect in my opinion. The user-specified generator type was also ignored in one place, which this PR fixes as well.
Fix
- Introduce a `std::mt19937_64` generator seeded with `params.rng_state.seed + (uint64_t(my_rank) << 32)`, ensuring per-rank uniqueness while remaining deterministic. This generator is passed through the init call chain (`initRandom`, `initKMeansPlusPlus`) and is used to derive independent seeds for each GPU RNG operation (`RngState`, `shuffle_and_gather`, `fit_main`, fallback random init), so no two phases share the same seed.
- Replace `std::mt19937` with `std::mt19937_64` for consistency, and derive the GPU RNG seed from it rather than reusing the raw seed directly.
- Keep the `params` argument read-only (and thus `params.rng_state`). Maybe we should separate `rng_state` from `params`?

The API changes suggested in the fix are a bit clunky; a better way to do this would be welcome.