
Feature: HDBSCAN algorithm #3593

Draft
Alexandr-Solovev wants to merge 44 commits into uxlfoundation:main from Alexandr-Solovev:dev/asolovev_hdbscan_ai

Conversation

Contributor

@Alexandr-Solovev Alexandr-Solovev commented Apr 8, 2026

Description

Checklist:

Completeness and readability

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes or created a separate PR with updates and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended the testing suite if new functionality was introduced in this PR.

Performance

  • I have measured performance for affected algorithms using scikit-learn_bench and provided at least a summary table with measured data, if performance change is expected.
  • I have provided justification why performance and/or quality metrics have changed or why changes are not expected.
  • I have extended the benchmarking suite and provided a corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

@Alexandr-Solovev Alexandr-Solovev added dpc++ Issue/PR related to DPC++ functionality new algorithm New algorithm or method in oneDAL labels Apr 8, 2026
@Alexandr-Solovev Alexandr-Solovev force-pushed the dev/asolovev_hdbscan_ai branch from e008486 to f5880a5 Compare April 20, 2026 17:29
@Alexandr-Solovev Alexandr-Solovev changed the title SYCL: hdbscan implementation Feature: HDBSCAN algorithm Apr 21, 2026
Alexandr-Solovev and others added 11 commits April 21, 2026 02:05
…GPU tests

- Split monolithic single_task extract_clusters into 5 GPU kernels:
  K1 (single_task): Kruskal dendrogram via union-find
  K2 (parallel_for): Init node arrays from dendrogram
  K3 (single_task): Condensed tree + EOM stability selection
  K4 (parallel_for): Build dendro_parent from edges
  K5 (parallel_for): Label each point independently
- Add Method template parameter to DAAL kernel (follows standard pattern)
- Add hdbscan_types.h with Method enum (defaultDense)
- Add CPU vs GPU comparison tests (permutation-invariant partition check)
- Remove unused engines dependency from DAAL BUILD
- Add scripts/ to .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
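The K1 step above (Kruskal dendrogram via union-find) can be sketched on the host like this — a minimal, self-contained illustration, not the oneDAL GPU kernel; the names `UnionFind`, `kruskal_dendrogram`, and the struct layouts are invented for the example:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Minimal union-find with path halving; Kruskal processes MST edges in
// ascending weight order and records each successful merge as a dendrogram node.
struct UnionFind {
    std::vector<std::int32_t> parent;
    explicit UnionFind(std::int32_t n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0);
    }
    std::int32_t find(std::int32_t x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving
            x = parent[x];
        }
        return x;
    }
    bool unite(std::int32_t a, std::int32_t b) {
        a = find(a);
        b = find(b);
        if (a == b) return false;
        parent[b] = a;
        return true;
    }
};

struct Edge { std::int32_t u, v; double w; };
struct DendroNode { std::int32_t left, right; double dist; };

// Returns the n-1 merge records (single-linkage dendrogram) for n points.
std::vector<DendroNode> kruskal_dendrogram(std::int32_t n, std::vector<Edge> edges) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.w < b.w; });
    UnionFind uf(n);
    std::vector<DendroNode> dendro;
    for (const Edge& e : edges) {
        if (uf.unite(e.u, e.v)) dendro.push_back({ e.u, e.v, e.w });
        if ((std::int32_t)dendro.size() == n - 1) break;
    }
    return dendro;
}
```

In the actual kernel this runs as a `single_task`, since union-find merges are inherently sequential; the parallel work lives in K2/K4/K5.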
@Alexandr-Solovev
Contributor Author

Reference intelex PR: uxlfoundation/scikit-learn-intelex#3124

Alexandr-Solovev and others added 6 commits April 23, 2026 23:28
Add push trigger and remove repository == 'uxlfoundation/oneDAL'
guards from both Linux and Windows jobs so the workflow runs on
every commit in fork branches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Alexandr-Solovev Alexandr-Solovev force-pushed the dev/asolovev_hdbscan_ai branch from 3e76bac to 1ce1be2 Compare April 28, 2026 08:17
Alexandr-Solovev and others added 17 commits May 4, 2026 10:25
- Rename defaultDense to bruteForceDense in DAAL Method enum (keep alias)
- Simplify hdbscan_cluster_utils.h: extract subfunctions from 500-line method
  (sortMstEdges, buildKruskalDendrogram, buildCondensedTree, selectClusters, labelPoints)
- Move convert_metric to shared compute_kernel_common.hpp header
- Replace row_accessor with BlockDescriptor in CPU kernels where data is already DAAL table
- Remove _on_host postfix from CPU-only functions, replace std::vector with dal::array
- Split GPU build_condensed_tree_eom_kernel into two kernels for clarity

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

/*
* HDBSCAN implementation using a ball tree with Boruvka's MST algorithm.
Contributor

Can you please add a reference to a human-readable version of the algorithm? For example:

Suggested change
* HDBSCAN implementation using a ball tree with Boruvka's MST algorithm.
* HDBSCAN implementation using a ball tree with Boruvka's Minimum Spanning Tree (MST) algorithm: https://arxiv.org/html/2412.07789v1

* <a name="DAAL-ENUM-ALGORITHMS__HDBSCAN__METRIC"></a>
* Available distance metrics for the HDBSCAN algorithm
*/
enum Metric
Contributor

In KNN, the distance types are not exposed in the legacy DAAL API; enum class PairwiseDistanceType is used to configure the distance types instead:
https://github.com/uxlfoundation/oneDAL/blob/main/cpp/daal/src/algorithms/service_kernel_math.h#L101

Can you please use a similar approach in HDBSCAN?
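The shape of that approach, as a minimal sketch (not oneDAL code — the enum values and the `point_dist` dispatcher are invented for illustration; the real `PairwiseDistanceType` lives in service_kernel_math.h):

```cpp
#include <cmath>
#include <cstddef>

// An internal enum class keeps the distance choice out of the public DAAL
// headers; the kernel dispatches on it internally.
enum class PairwiseDistanceType { euclidean, manhattan };

double point_dist(PairwiseDistanceType t, const double* a, const double* b, std::size_t p) {
    double acc = 0.0;
    for (std::size_t j = 0; j < p; ++j) {
        const double diff = a[j] - b[j];
        acc += (t == PairwiseDistanceType::euclidean) ? diff * diff : std::fabs(diff);
    }
    return (t == PairwiseDistanceType::euclidean) ? std::sqrt(acc) : acc;
}
```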

Contributor

Also, there is not much purpose in having a Method enum in the DAAL API for HDBSCAN.

Let's move the Method enum into the src folder, into the daal::algorithms::hdbscan::internal namespace. That makes more sense, as only the oneDAL API is available for the algorithm.

And "daal.h", "daal_win.h" should be left untouched.
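A sketch of the suggested placement (the namespace nesting follows the comment above; the enum values mirror the bruteForceDense/defaultDense alias mentioned in the commit message, but the exact header layout is an assumption):

```cpp
// Internal-only Method enum, kept under src/ and out of the public
// DAAL headers (daal.h / daal_win.h stay untouched).
namespace daal { namespace algorithms { namespace hdbscan { namespace internal {

enum Method {
    bruteForceDense = 0,
    defaultDense    = bruteForceDense // legacy alias
};

}}}} // namespace daal::algorithms::hdbscan::internal
```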

NumericTable * ntNClusters, size_t minClusterSize, size_t minSamples,
int metric, double degree, int clusterSelection, bool allowSingleCluster,
double clusterSelectionEpsilon, size_t maxClusterSize, double alpha,
size_t leafSize)
Contributor

It would be better to have descriptions of the function's arguments. The numeric table arguments also need their dimensions documented.

Suggested change
size_t leafSize)
/**
* Computes HDBSCAN ball-tree based clustering for the specified input data and parameters
*
* \param[in] ntData Input numeric table of size N x P that contains the data to cluster.
* \param[out] ntAssignments Output numeric table of size N x 1 that contains the cluster assignment for each data point.
* The cluster assignment is an integer value where -1 indicates that the data point is considered as noise,
* and non-negative integers indicate the cluster index (0 to C-1, where C is the number of clusters found)
* to which the data point belongs.
* \param[out] ntNClusters Output numeric table of size 1 x 1 that contains the number of clusters found.
* ...
*/
template <typename algorithmFPType, Method method, CpuType cpu>
services::Status HDBSCANBatchKernel<algorithmFPType, method, cpu>::compute(const NumericTable * ntData, NumericTable * ntAssignments,
NumericTable * ntNClusters, size_t minClusterSize, size_t minSamples,
int metric, double degree, int clusterSelection, bool allowSingleCluster,
double clusterSelectionEpsilon, size_t maxClusterSize, double alpha,
size_t leafSize)


// Build tree and run core dist + MST, dispatched by metric
// We need to build the tree with the same distance functor used for queries
#define DISPATCH_BALL_TREE(DIST_FUNC) \
Contributor

Is it possible to replace this macro with a template function with a 'DistanceType' template parameter? Macros make code less debugging-friendly =]
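What that could look like, as a minimal sketch (the functor names, `buildTreeAndRun`, and `dispatchByMetric` are placeholders, and the body is reduced to a single distance call so only the dispatch shape is shown — the real function would build the tree and run core-dist + MST):

```cpp
#include <cmath>
#include <cstddef>

// Distance functors; the template below is instantiated once per metric,
// so each instantiation is a normal, step-debuggable function.
struct EuclideanDist {
    double operator()(const double* a, const double* b, std::size_t p) const {
        double s = 0.0;
        for (std::size_t j = 0; j < p; ++j) { const double d = a[j] - b[j]; s += d * d; }
        return std::sqrt(s);
    }
};
struct ManhattanDist {
    double operator()(const double* a, const double* b, std::size_t p) const {
        double s = 0.0;
        for (std::size_t j = 0; j < p; ++j) s += std::fabs(a[j] - b[j]);
        return s;
    }
};

// Stand-in for "build tree and run core dist + MST" with the same functor
// used for both build and queries.
template <typename DistFunc>
double buildTreeAndRun(const double* a, const double* b, std::size_t p, DistFunc dist) {
    return dist(a, b, p);
}

// One switch at the boundary replaces the DISPATCH_BALL_TREE macro.
double dispatchByMetric(int metric, const double* a, const double* b, std::size_t p) {
    switch (metric) {
        case 0: return buildTreeAndRun(a, b, p, EuclideanDist{});
        default: return buildTreeAndRun(a, b, p, ManhattanDist{});
    }
}
```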

int pivot2 = pivot1;
for (int i = begin; i < end; i++)
{
const algorithmFPType d = distFunc.pointDist(data + pivot1 * nCols, data + pointIndices[i] * nCols, nCols);
Contributor

  1. The code for pivot2 and pivot3 looks like a copy-paste and can be moved into a function.
  2. This part of the code happens multiple times. Wouldn't it be faster to copy the data points pointIndices[begin], ... , pointIndices[end-1] into a buffer and compute blocks of distances to pivot1, pivot2, etc. using that block of points?
    for (int i = begin; i < end; i++)
    {
        const algorithmFPType d = distFunc.pointDist(data + some_index * nCols, data + pointIndices[i] * nCols, nCols);
    ...
    }
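The extraction suggested in point 1 could look like this — a hedged sketch with invented names (`farthestFrom`, `SqEuclidean`); the real code would pass the same `distFunc` used elsewhere in the ball-tree build:

```cpp
#include <cstddef>

// Example distance functor (squared Euclidean) for the sketch below.
struct SqEuclidean {
    double operator()(const double* a, const double* b, std::size_t p) const {
        double s = 0.0;
        for (std::size_t j = 0; j < p; ++j) { const double d = a[j] - b[j]; s += d * d; }
        return s;
    }
};

// Shared helper for the pivot1/pivot2/pivot3 scans: returns the position i in
// [begin, end) whose point pointIndices[i] is farthest from the reference
// point data[refIdx].
template <typename FPType, typename DistFunc>
int farthestFrom(const FPType* data, const int* pointIndices, int begin, int end,
                 std::size_t nCols, int refIdx, const DistFunc& distFunc) {
    int best = begin;
    FPType bestDist = FPType(-1);
    for (int i = begin; i < end; ++i) {
        const FPType d = distFunc(data + refIdx * nCols, data + pointIndices[i] * nCols, nCols);
        if (d > bestDist) {
            bestDist = d;
            best = i;
        }
    }
    return best;
}
```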

Contributor

In the case of bf16, some conversion time can also be saved with this approach.

Comment on lines +163 to +164
node.left = buildBallTree(data, pointIndices, begin, mid, nCols, nodes, nextNode, maxLeafSize, distFunc);
node.right = buildBallTree(data, pointIndices, mid, end, nCols, nodes, nextNode, maxLeafSize, distFunc);
Contributor

It might be worth parallelizing this to some depth, i.e. running these two calls as parallel tasks. nextNode would need synchronization in that case (an atomic or similar).
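A minimal sketch of the synchronization idea (illustrative only — `allocNode` and `buildSubtrees` are invented names, and `std::thread` stands in for whatever tasking layer oneDAL would actually use): node slots are allocated from a shared counter with `fetch_add`, so the left/right builds can run concurrently without racing on `nextNode`.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Shared node counter; fetch_add hands out unique slot indices.
std::atomic<int> nextNode{0};

int allocNode() {
    return nextNode.fetch_add(1, std::memory_order_relaxed);
}

// Stand-in for the recursive left/right subtree builds: each "subtree"
// just allocates `depth` nodes, concurrently with its sibling.
void buildSubtrees(int depth, std::vector<int>& leftIds, std::vector<int>& rightIds) {
    std::thread lhs([&] { for (int i = 0; i < depth; ++i) leftIds.push_back(allocNode()); });
    std::thread rhs([&] { for (int i = 0; i < depth; ++i) rightIds.push_back(allocNode()); });
    lhs.join();
    rhs.join();
}
```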

Comment on lines +197 to +198
FPType * dists;
int * indices;
Contributor

Suggested change
FPType * dists;
int * indices;
FPType * dists; // distances from the points in the heap to the queryPoint (should the queryPoint itself or its index be saved in the heap to make it more self-contained?)
int * indices; // indices of the points in the heap

int capacity;
int size;

void init(FPType * d, int * idx, int cap)
Contributor

It would be better to make it a RAII data structure: replace this init(...) method with a normal constructor, and replace the dists and indices pointers with TArray/TArrayScalable or similar types. In that case you won't need to define daal::TlsMem storages in the code that uses this KnnHeap structure, which will reduce thread synchronization costs.
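The RAII shape could look roughly like this (a sketch: `std::vector` stands in for TArray/TArrayScalable, and only the members quoted in the diff are reproduced):

```cpp
#include <limits>
#include <vector>

// RAII KnnHeap: storage is owned by the object, so construction/destruction
// replace init()/manual cleanup and no external TlsMem bookkeeping is needed.
template <typename FPType>
class KnnHeap {
public:
    explicit KnnHeap(int capacity)
        : dists_(capacity), indices_(capacity), capacity_(capacity), size_(0) {}

    // Worst (largest) kept distance, or +inf sentinel while the heap is empty.
    FPType maxDist() const {
        return (size_ > 0) ? dists_[0] : std::numeric_limits<FPType>::max();
    }
    int size() const { return size_; }

private:
    std::vector<FPType> dists_; // distances from the points in the heap to the query point
    std::vector<int> indices_;  // indices of the points in the heap
    int capacity_;
    int size_;
};
```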


FPType maxDist() const { return (size > 0) ? dists[0] : daal::services::internal::MaxVal<FPType>::get(); }

void push(FPType dist, int idx)
Contributor

Some comments are needed here on the ordering maintained among the points in the KNN heap.
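For reference, the invariant such a push() typically maintains — a standalone sketch of a k-NN max-heap (invented `MaxHeap` type, not the PR's code): the heap is ordered by distance with the largest at the root, so `dists[0]` is the current k-th nearest distance and, at full capacity, a closer candidate replaces the root and sifts down.

```cpp
#include <utility>
#include <vector>

// Max-heap on distance: dists[0] is always the worst (largest) kept distance.
struct MaxHeap {
    std::vector<double> dists;
    std::vector<int> indices;
    int capacity;

    explicit MaxHeap(int cap) : capacity(cap) {}

    void push(double dist, int idx) {
        if ((int)dists.size() < capacity) {
            // Not full yet: append and sift the new leaf up.
            dists.push_back(dist);
            indices.push_back(idx);
            for (int i = (int)dists.size() - 1; i > 0;) {
                const int p = (i - 1) / 2;
                if (dists[p] >= dists[i]) break;
                std::swap(dists[p], dists[i]);
                std::swap(indices[p], indices[i]);
                i = p;
            }
        } else if (dist < dists[0]) {
            // Full: replace the worst kept point and sift the root down.
            dists[0] = dist;
            indices[0] = idx;
            const int n = (int)dists.size();
            for (int i = 0;;) {
                const int l = 2 * i + 1, r = l + 1;
                int m = i;
                if (l < n && dists[l] > dists[m]) m = l;
                if (r < n && dists[r] > dists[m]) m = r;
                if (m == i) break;
                std::swap(dists[m], dists[i]);
                std::swap(indices[m], indices[i]);
                i = m;
            }
        }
    }
};
```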

@Alexandr-Solovev Alexandr-Solovev force-pushed the dev/asolovev_hdbscan_ai branch from f0ced09 to 5c42f62 Compare May 13, 2026 07:55

Labels

dpc++ Issue/PR related to DPC++ functionality new algorithm New algorithm or method in oneDAL

2 participants