Feature: HDBSCAN algorithm #3593
Conversation
Force-pushed e008486 to f5880a5
…GPU tests

- Split monolithic single_task extract_clusters into 5 GPU kernels:
  - K1 (single_task): Kruskal dendrogram via union-find
  - K2 (parallel_for): Init node arrays from dendrogram
  - K3 (single_task): Condensed tree + EOM stability selection
  - K4 (parallel_for): Build dendro_parent from edges
  - K5 (parallel_for): Label each point independently
- Add Method template parameter to DAAL kernel (follows standard pattern)
- Add hdbscan_types.h with Method enum (defaultDense)
- Add CPU vs GPU comparison tests (permutation-invariant partition check)
- Remove unused engines dependency from DAAL BUILD
- Add scripts/ to .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reference: scikit-learn-intelex PR uxlfoundation/scikit-learn-intelex#3124
Add push trigger and remove `repository == 'uxlfoundation/oneDAL'` guards from both Linux and Windows jobs so the workflow runs on every commit in fork branches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed 3e76bac to 1ce1be2
- Rename defaultDense to bruteForceDense in DAAL Method enum (keep alias)
- Simplify hdbscan_cluster_utils.h: extract subfunctions from the 500-line method (sortMstEdges, buildKruskalDendrogram, buildCondensedTree, selectClusters, labelPoints)
- Move convert_metric to shared compute_kernel_common.hpp header
- Replace row_accessor with BlockDescriptor in CPU kernels where data is already a DAAL table
- Remove _on_host postfix from CPU-only functions; replace std::vector with dal::array
- Split GPU build_condensed_tree_eom_kernel into two kernels for clarity

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```cpp
/*
 * HDBSCAN implementation using a ball tree with Boruvka's MST algorithm.
```
Can you please add a reference to a human-readable version of the algorithm? For example:

```cpp
 * HDBSCAN implementation using a ball tree with Boruvka's Minimum Spanning Tree (MST) algorithm: https://arxiv.org/html/2412.07789v1
```
```cpp
 * <a name="DAAL-ENUM-ALGORITHMS__HDBSCAN__METRIC"></a>
 * Available distance metrics for the HDBSCAN algorithm
 */
enum Metric
```
In KNN, the distance types are not exposed to the legacy DAAL API; `enum class PairwiseDistanceType` is used to configure the distance types instead:
https://github.com/uxlfoundation/oneDAL/blob/main/cpp/daal/src/algorithms/service_kernel_math.h#L101
Can you please use a similar approach in HDBSCAN?
Also, there is not much purpose in having a Method enum in the DAAL API for HDBSCAN.
Let's move the Method enum into the src folder and into the daal::algorithms::hdbscan::internal namespace. That will make more sense, as only the oneDAL API is available for the algorithm.
And "daal.h", "daal_win.h" should be left untouched.
```cpp
                                    NumericTable * ntNClusters, size_t minClusterSize, size_t minSamples,
                                    int metric, double degree, int clusterSelection, bool allowSingleCluster,
                                    double clusterSelectionEpsilon, size_t maxClusterSize, double alpha,
                                    size_t leafSize)
```
It would be better to have descriptions of the functions' arguments. The numeric table arguments also require the dimensions.

Suggested change:

```cpp
/**
 * Computes HDBSCAN ball-tree based clustering for the specified input data and parameters
 *
 * \param[in]  ntData        Input numeric table of size N x P that contains the data to cluster.
 * \param[out] ntAssignments Output numeric table of size N x 1 that contains the cluster assignment for each data point.
 *                           The cluster assignment is an integer value where -1 indicates that the data point is considered as noise,
 *                           and non-negative integers indicate the cluster index (0 to C-1, where C is the number of clusters found)
 *                           to which the data point belongs.
 * \param[out] ntNClusters   Output numeric table of size 1 x 1 that contains the number of clusters found.
 * ...
 */
template <typename algorithmFPType, Method method, CpuType cpu>
services::Status HDBSCANBatchKernel<algorithmFPType, method, cpu>::compute(const NumericTable * ntData, NumericTable * ntAssignments,
                                                                           NumericTable * ntNClusters, size_t minClusterSize, size_t minSamples,
                                                                           int metric, double degree, int clusterSelection, bool allowSingleCluster,
                                                                           double clusterSelectionEpsilon, size_t maxClusterSize, double alpha,
                                                                           size_t leafSize)
```
```cpp
// Build tree and run core dist + MST, dispatched by metric
// We need to build the tree with the same distance functor used for queries
#define DISPATCH_BALL_TREE(DIST_FUNC) \
```
Is it possible to replace this macro with a template function with a 'DistanceType' template parameter? Macros make code less debugging-friendly =]
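A sketch of the macro-free alternative: each metric becomes a functor type, the whole build-and-query pipeline is a template over that type, and a single runtime switch maps the metric id to an instantiation. All names here (`runBallTreePipeline`, `dispatchByMetric`, the functors) are hypothetical stand-ins, not the PR's actual code:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Illustrative distance functors standing in for the real ones.
struct EuclideanDist
{
    double pointDist(const double * a, const double * b, std::size_t p) const
    {
        double s = 0.0;
        for (std::size_t i = 0; i < p; ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
        return std::sqrt(s);
    }
};

struct ManhattanDist
{
    double pointDist(const double * a, const double * b, std::size_t p) const
    {
        double s = 0.0;
        for (std::size_t i = 0; i < p; ++i) s += std::fabs(a[i] - b[i]);
        return s;
    }
};

// Template replacement for DISPATCH_BALL_TREE: the pipeline is instantiated
// once per DistanceType, so every instantiation is visible to the debugger.
template <typename DistanceType>
double runBallTreePipeline(const double * a, const double * b, std::size_t p)
{
    DistanceType distFunc;
    // ... build tree, compute core distances and MST with distFunc ...
    return distFunc.pointDist(a, b, p); // placeholder for the pipeline result
}

// A single runtime switch converts the metric id into a concrete type.
double dispatchByMetric(int metric, const double * a, const double * b, std::size_t p)
{
    switch (metric)
    {
    case 0: return runBallTreePipeline<EuclideanDist>(a, b, p);
    case 1: return runBallTreePipeline<ManhattanDist>(a, b, p);
    default: return runBallTreePipeline<EuclideanDist>(a, b, p);
    }
}
```

The switch is the only place the metric id appears, so adding a metric means adding one functor and one case.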
```cpp
int pivot2 = pivot1;
for (int i = begin; i < end; i++)
{
    const algorithmFPType d = distFunc.pointDist(data + pivot1 * nCols, data + pointIndices[i] * nCols, nCols);
```
- The code for `pivot2` and `pivot3` looks like a copy-paste and can be moved into a function.
- This part of the code appears multiple times. Wouldn't it be faster to copy the data points `pointIndices[begin]`, ..., `pointIndices[end-1]` into a buffer and compute blocks of distances to `pivot1`, `pivot2`, etc. using that block of points?
```cpp
for (int i = begin; i < end; i++)
{
    const algorithmFPType d = distFunc.pointDist(data + some_index * nCols, data + pointIndices[i] * nCols, nCols);
    ...
}
```
In the case of bf16, some conversion time can also be saved with this approach.
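A rough sketch of the buffering idea, assuming Euclidean distance for concreteness; `gatherBlock` and `distsToPivot` are hypothetical helpers, not existing oneDAL functions. Gathering once also pays any bf16-to-fp32 conversion cost once per node instead of once per pivot:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Gather the points of one tree node into a contiguous buffer once.
void gatherBlock(const double * data, const int * pointIndices, int begin, int end,
                 std::size_t nCols, std::vector<double> & buf)
{
    buf.resize(static_cast<std::size_t>(end - begin) * nCols);
    for (int i = begin; i < end; ++i)
        for (std::size_t c = 0; c < nCols; ++c)
            buf[static_cast<std::size_t>(i - begin) * nCols + c] = data[static_cast<std::size_t>(pointIndices[i]) * nCols + c];
}

// One routine serves pivot1, pivot2, pivot3 instead of three copy-pasted loops.
void distsToPivot(const double * pivot, const std::vector<double> & buf,
                  std::size_t nCols, std::vector<double> & out)
{
    const std::size_t n = buf.size() / nCols;
    out.resize(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        double s = 0.0;
        for (std::size_t c = 0; c < nCols; ++c)
        {
            const double d = pivot[c] - buf[i * nCols + c];
            s += d * d;
        }
        out[i] = std::sqrt(s);
    }
}
```

The contiguous block also makes the inner loop straightforwardly vectorizable, which the indirect `pointIndices[i]` access defeats.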
```cpp
node.left = buildBallTree(data, pointIndices, begin, mid, nCols, nodes, nextNode, maxLeafSize, distFunc);
node.right = buildBallTree(data, pointIndices, mid, end, nCols, nodes, nextNode, maxLeafSize, distFunc);
```
Might be worth parallelizing this to some depth, i.e. running these two calls as parallel tasks. nextNode needs some synchronization in that case (an atomic or similar).
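A minimal sketch of the idea using std::thread and std::atomic; in oneDAL this would more likely use the internal threading layer (TBB-backed tasks), and `claimNode`/`buildSubtree` are illustrative stand-ins for the real node allocation and buildBallTree recursion:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Node slots are claimed through an atomic counter, so two subtree builds
// can run concurrently without ever handing out the same slot twice.
std::atomic<int> nextNode { 0 };

int claimNode()
{
    return nextNode.fetch_add(1); // unique slot index per call, thread-safe
}

// Stand-in for buildBallTree: claims a slot for this node, then builds both
// children, the left one as a parallel task and the right one inline.
void buildSubtree(int depth, int maxDepth)
{
    (void)claimNode(); // this node's slot in the nodes array
    if (depth >= maxDepth) return;
    std::thread left(buildSubtree, depth + 1, maxDepth);
    buildSubtree(depth + 1, maxDepth);
    left.join();
}
```

Note that with an atomic counter the slot order is no longer deterministic, so any later code that assumes depth-first node numbering would need adjusting.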
```cpp
FPType * dists;
int * indices;
```

Suggested change:

```cpp
FPType * dists;  // distances from the points in the heap to the queryPoint (should the queryPoint itself or its index be saved in the heap to make it more self-contained?)
int * indices;   // indices of the points in the heap
```

```cpp
int capacity;
int size;
```
```cpp
void init(FPType * d, int * idx, int cap)
```
It would be better to make this a RAII data structure: replace the init(...) method with a normal constructor, and replace the dists and indices pointers with TArray/TArrayScalable or similar types. In that case you won't need to define daal::TlsMem storages in the code that uses this KnnHeap structure, which will reduce thread synchronization costs.
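A sketch of the RAII shape, using std::vector where the real code would presumably use TArray/TArrayScalable; this is a cut-down illustration of the KnnHeap idea, not the PR's actual class:

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Owning version of KnnHeap: storage lives in the object, so callers no
// longer need to supply per-thread scratch buffers (daal::TlsMem) for it.
template <typename FPType>
class KnnHeap
{
public:
    explicit KnnHeap(int cap) : dists_(cap), indices_(cap), capacity_(cap), size_(0) {}

    // Worst (largest) distance currently kept; "infinity" while not full,
    // so every candidate is accepted until capacity is reached.
    FPType maxDist() const
    {
        return (size_ > 0) ? dists_[0] : std::numeric_limits<FPType>::max();
    }

    int size() const { return size_; }
    int capacity() const { return capacity_; }

private:
    std::vector<FPType> dists_;  // distances to the query point
    std::vector<int> indices_;   // indices of the heap's points
    int capacity_;
    int size_;
};
```

Destruction is automatic, and the type becomes safely movable, so one heap per thread-local query falls out naturally.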
```cpp
FPType maxDist() const { return (size > 0) ? dists[0] : daal::services::internal::MaxVal<FPType>::get(); }
```
```cpp
void push(FPType dist, int idx)
```
Some comments regarding the order that is maintained among the points in the KNN heap are needed here.
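For reference, the invariant such a bounded KNN heap typically documents: it is a max-heap keyed on distance, so the root `dists[0]` is the worst of the k best distances found so far (which is what `maxDist()` returns), and a full heap only accepts a candidate that beats the root. A self-contained sketch of that behavior, not the PR's actual code:

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct BoundedKnnHeap
{
    std::vector<double> dists;  // max-heap on distance: dists[0] is the worst kept
    std::vector<int> indices;   // indices move in lockstep with dists
    int capacity;

    explicit BoundedKnnHeap(int cap) : capacity(cap) {}

    void push(double dist, int idx)
    {
        if ((int)dists.size() < capacity)
        {
            // Not full yet: append and sift up to restore the max-heap.
            dists.push_back(dist);
            indices.push_back(idx);
            int i = (int)dists.size() - 1;
            while (i > 0 && dists[(i - 1) / 2] < dists[i])
            {
                std::swap(dists[i], dists[(i - 1) / 2]);
                std::swap(indices[i], indices[(i - 1) / 2]);
                i = (i - 1) / 2;
            }
        }
        else if (dist < dists[0])
        {
            // Full and the candidate beats the current worst: replace the
            // root (discarding the worst neighbor) and sift down.
            dists[0] = dist;
            indices[0] = idx;
            int i = 0;
            for (;;)
            {
                const int l = 2 * i + 1, r = 2 * i + 2;
                int m = i;
                if (l < (int)dists.size() && dists[l] > dists[m]) m = l;
                if (r < (int)dists.size() && dists[r] > dists[m]) m = r;
                if (m == i) break;
                std::swap(dists[i], dists[m]);
                std::swap(indices[i], indices[m]);
                i = m;
            }
        }
    }
};
```

With this invariant spelled out, `maxDist()` reading `dists[0]` and the ball-tree pruning test ("skip a node whose lower bound exceeds `maxDist()`") both follow directly.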
Force-pushed f0ced09 to 5c42f62
Description

Checklist:
- Completeness and readability
- Testing
- Performance