Fix RNG seeding bug in multi-GPU KMeans #2029
vinaydes wants to merge 24 commits into rapidsai:main from
Conversation
Thanks for looking into this @vinaydes!
Each worker should indeed be seeded differently, and it looks like the PR solves the issue. The reproducer now runs a roughly similar number of iterations with different random number generators (though many more on my machine).
One important but solvable issue, though: the recluster in `initKMeansPlusPlus` when `potentialCentroids.extent(0) > n_clusters` absolutely must be deterministic. Otherwise the final centroids to be used for KMeans would diverge across ranks.
```cpp
#define KMEANS_COMM_ROOT 0

static cuvs::cluster::kmeans::params default_params;
```
Looks like `default_params` is now dead code and could be removed.
```cpp
// generate `n_random_clusters` centroids
cuvs::cluster::kmeans::params rand_params = params;
rand_params.rng_state = default_params.rng_state;
rand_params.rng_state.seed = gen_64();
```
Could be misleading to set this `rand_params.rng_state.seed` field here, as `initRandom` does not use it anymore but uses the `gen_64` argument instead.
```diff
  // Note - reclustering step is duplicated across all ranks and with the same
  // seed they should generate the same potentialCentroids
  auto const_centroids = raft::make_device_matrix_view<const DataT, IndexT>(
    potentialCentroids.data_handle(), potentialCentroids.extent(0), potentialCentroids.extent(1));
+ auto params_copy            = params;
+ params_copy.rng_state.seed  = gen_64();
  cuvs::cluster::kmeans::init_plus_plus(
-   handle, params, const_centroids, centroidsRawData, workspace);
+   handle, params_copy, const_centroids, centroidsRawData, workspace);
```
This is an issue. When initKMeansPlusPlus oversamples (potentialCentroids.extent(0) > n_clusters), a mini single-GPU KMeans is run locally on each rank to reduce the candidates down to n_clusters. Since potentialCentroids and weight are already identical across ranks (via prior allgatherv/allreduce), and no communication happens after this reduction, all ranks must use the same RNG seed to produce identical results. Currently, init_plus_plus is given a per-rank-divergent seed from gen_64(), which breaks this invariant.
Could we generate a seed at rank 0 and broadcast it to the other ranks? Would that be an acceptable solution? What implications would a sync here have on load balancing?
> What implications would a sync here have on load balancing?
At this step the workers just completed the allreduce operation. Broadcasting a seed should be relatively cheap to do.
> We can generate a seed at rank 0 and broadcast it to other ranks? Would that be an acceptable solution?
I think that this solution could work. That said since earlier steps already introduced some randomness, we could theoretically perform this recluster step with a constant seed.
Also, it would be great if we could double-check that everything works as expected before we merge the PR: in particular, that the KMeans algorithm actually starts with similar centroid initialization on all workers, whichever init mode is chosen and whether or not there is a recluster step.
/ok to test 5fb247c
This fixes issue #1829.
Root cause
In the multi-GPU (`kmeans_mg`) KMeans implementation, multiple distinct random sampling operations were all initialized with the same seed (`params.rng_state.seed`), and all threads were also initialized with that same seed. Each operation therefore generated identical random sequences, leading to biased or degenerate centroid initialization, particularly visible during the KMeans++ and random init phases where independent draws are expected. A secondary seeding issue existed in the single-GPU `kmeansPlusPlus`: the CPU and GPU RNGs were both seeded with the raw `params.rng_state.seed`, which is incorrect in my opinion. The user-specified generator type was also ignored in one place, which this PR fixes as well.
Fix
- Introduce a `std::mt19937_64` generator seeded with `params.rng_state.seed + (uint64_t(my_rank) << 32)`, ensuring per-rank uniqueness while remaining deterministic. This generator is passed through the init call chain (`initRandom`, `initKMeansPlusPlus`) and is used to derive independent seeds for each GPU RNG operation (`RngState`, `shuffle_and_gather`, `fit_main`, fallback random init), so no two phases share the same seed.
- Replace `std::mt19937` with `std::mt19937_64` for consistency, and derive the GPU RNG seed from it rather than reusing the raw seed directly.
- Keep the `params` argument read-only (and thus `params.rng_state`). Maybe we should separate `rng_state` from `params`?

The API changes suggested in the fix are a bit clunky; a better way to do this would be welcome.