
Fix RNG seeding bug in multi-GPU KMeans#2029

Open
vinaydes wants to merge 24 commits into rapidsai:main from vinaydes:fix-mg-kmeans-seeding

Conversation

@vinaydes
Contributor

This fixes issue #1829.

Root cause

In the multi-GPU (kmeans_mg) KMeans implementation, multiple distinct random sampling operations were all initialized with the same seed (params.rng_state.seed). All threads were also initialized with that same seed. As a result, each operation generated an identical random sequence, leading to biased or degenerate centroid initialization, particularly visible during the KMeans++ and random init phases, where independent draws are expected. A secondary seeding issue existed in the single-GPU kmeansPlusPlus: the CPU and GPU RNGs were both seeded with the raw params.rng_state.seed, which is incorrect in my opinion.

The user-specified generator type was also ignored in one place; this PR fixes that as well.

Fix

  • Multi-GPU (kmeans_mg.cuh): Each rank now initializes a single std::mt19937_64 generator seeded with params.rng_state.seed + (uint64_t(my_rank) << 32), ensuring per-rank uniqueness while remaining deterministic. This generator is passed through the init call chain (initRandom, initKMeansPlusPlus) and is used to derive independent seeds for each GPU RNG operation (RngState, shuffle_and_gather, fit_main, fallback random init), so no two phases share the same seed.
  • Single-GPU (kmeans.cuh): Replaced std::mt19937 with std::mt19937_64 for consistency, and derive the GPU RNG seed from it rather than reusing the raw seed directly.
  • The repeated instantiation of the CPU generator is needed because we have kept the params argument (and thus params.rng_state) read-only. Maybe we should separate rng_state from params?

The API changes suggested in the fix are a bit clunky. Suggestions for a better way to do this would be welcome.

@copy-pr-bot

copy-pr-bot Bot commented Apr 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

@viclafargue viclafargue left a comment


Thanks for looking into this @vinaydes!

Each worker should indeed be seeded differently, and it looks like the PR solves the issue. The reproducer now takes a roughly similar number of iterations with different random number generators (though many more on my machine).

One important but solvable issue, though: the recluster in initKMeansPlusPlus when potentialCentroids.extent(0) > n_clusters absolutely must be deterministic. Otherwise the final centroids used for KMeans would diverge across ranks.


#define KMEANS_COMM_ROOT 0

static cuvs::cluster::kmeans::params default_params;
Contributor


Looks like default_params is now dead code and could be removed.

// generate `n_random_clusters` centroids
cuvs::cluster::kmeans::params rand_params = params;
rand_params.rng_state = default_params.rng_state;
rand_params.rng_state.seed = gen_64();
Contributor


It could be misleading to set the rand_params.rng_state.seed field here, as initRandom no longer uses it and relies on the gen_64 argument instead.

Comment on lines 406 to +413
// Note - reclustering step is duplicated across all ranks and with the same
// seed they should generate the same potentialCentroids
auto const_centroids = raft::make_device_matrix_view<const DataT, IndexT>(
potentialCentroids.data_handle(), potentialCentroids.extent(0), potentialCentroids.extent(1));
auto params_copy = params;
params_copy.rng_state.seed = gen_64();
cuvs::cluster::kmeans::init_plus_plus(
handle, params, const_centroids, centroidsRawData, workspace);
handle, params_copy, const_centroids, centroidsRawData, workspace);
Contributor


This is an issue. When initKMeansPlusPlus oversamples (potentialCentroids.extent(0) > n_clusters), a mini single-GPU KMeans is run locally on each rank to reduce the candidates down to n_clusters. Since potentialCentroids and weight are already identical across ranks (via prior allgatherv/allreduce), and no communication happens after this reduction, all ranks must use the same RNG seed to produce identical results. Currently, init_plus_plus is given a per-rank-divergent seed from gen_64(), which breaks this invariant.

Contributor Author


Could we generate a seed at rank 0 and broadcast it to the other ranks? Would that be an acceptable solution? What implications would a sync here have on load balancing?

Contributor

@viclafargue viclafargue Apr 20, 2026


What implications would a sync here have on load balancing?

At this step the workers just completed the allreduce operation. Broadcasting a seed should be relatively cheap to do.

Could we generate a seed at rank 0 and broadcast it to the other ranks? Would that be an acceptable solution?

I think this solution could work. That said, since earlier steps already introduced some randomness, we could theoretically perform this recluster step with a constant seed.

Also, it would be great if we could double-check that everything works as expected before merging the PR, especially that the KMeans algorithm actually starts with the same centroid initialization on all workers, whichever init mode is chosen and whether or not there is a recluster step.

@aamijar aamijar added the non-breaking Introduces a non-breaking change label Apr 21, 2026
@aamijar aamijar added the bug Something isn't working label Apr 21, 2026
@aamijar aamijar moved this to In Progress in Unstructured Data Processing Apr 21, 2026
@aamijar
Member

aamijar commented Apr 21, 2026

/ok to test 5fb247c


Labels

bug Something isn't working non-breaking Introduces a non-breaking change

Projects

Status: In Progress


3 participants