Feature: HDBSCAN algorithm #3593
Conversation
Force-pushed e008486 to f5880a5
…GPU tests

- Split monolithic single_task extract_clusters into 5 GPU kernels:
  - K1 (single_task): Kruskal dendrogram via union-find
  - K2 (parallel_for): Init node arrays from dendrogram
  - K3 (single_task): Condensed tree + EOM stability selection
  - K4 (parallel_for): Build dendro_parent from edges
  - K5 (parallel_for): Label each point independently
- Add Method template parameter to DAAL kernel (follows standard pattern)
- Add hdbscan_types.h with Method enum (defaultDense)
- Add CPU vs GPU comparison tests (permutation-invariant partition check)
- Remove unused engines dependency from DAAL BUILD
- Add scripts/ to .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reference: scikit-learn-intelex PR uxlfoundation/scikit-learn-intelex#3124
Add push trigger and remove `repository == 'uxlfoundation/oneDAL'` guards from both Linux and Windows jobs so the workflow runs on every commit in fork branches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed 3e76bac to 1ce1be2
- Rename defaultDense to bruteForceDense in DAAL Method enum (keep alias)
- Simplify hdbscan_cluster_utils.h: extract subfunctions from the 500-line method (sortMstEdges, buildKruskalDendrogram, buildCondensedTree, selectClusters, labelPoints)
- Move convert_metric to shared compute_kernel_common.hpp header
- Replace row_accessor with BlockDescriptor in CPU kernels where data is already a DAAL table
- Remove _on_host postfix from CPU-only functions; replace std::vector with dal::array
- Split GPU build_condensed_tree_eom_kernel into two kernels for clarity

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```cpp
/*
 * HDBSCAN implementation using a ball tree with Boruvka's MST algorithm.
```
Can you please add a reference to a human-readable version of the algorithm? For example:

```cpp
 * HDBSCAN implementation using a ball tree with Boruvka's Minimum Spanning Tree (MST) algorithm: https://arxiv.org/html/2412.07789v1
```
```cpp
 * <a name="DAAL-ENUM-ALGORITHMS__HDBSCAN__METRIC"></a>
 * Available distance metrics for the HDBSCAN algorithm
 */
enum Metric
```
In KNN, the distance types are not exposed to the legacy DAAL API; `enum class PairwiseDistanceType` is used to configure the distance types instead:
https://github.com/uxlfoundation/oneDAL/blob/main/cpp/daal/src/algorithms/service_kernel_math.h#L101
Can you please use a similar approach in HDBSCAN?
Also, there is not much purpose in having a Method enum in the DAAL API for HDBSCAN.
Let's move the Method enum into the src folder and into the daal::algorithms::hdbscan::internal namespace. That will make more sense, as only the oneDAL API is available for the algorithm.
And "daal.h", "daal_win.h" should be left untouched.
```cpp
                                    NumericTable * ntNClusters, size_t minClusterSize, size_t minSamples,
                                    int metric, double degree, int clusterSelection, bool allowSingleCluster,
                                    double clusterSelectionEpsilon, size_t maxClusterSize, double alpha,
                                    size_t leafSize)
```
It would be better to have descriptions of the functions' arguments. The numeric table arguments also require the dimensions.

Suggested change:

```cpp
/**
 * Computes HDBSCAN ball-tree based clustering for the specified input data and parameters
 *
 * \param[in]  ntData        Input numeric table of size N x P that contains the data to cluster.
 * \param[out] ntAssignments Output numeric table of size N x 1 that contains the cluster assignment for each data point.
 *                           The cluster assignment is an integer value where -1 indicates that the data point is considered as noise,
 *                           and non-negative integers indicate the cluster index (0 to C-1, where C is the number of clusters found)
 *                           to which the data point belongs.
 * \param[out] ntNClusters   Output numeric table of size 1 x 1 that contains the number of clusters found.
 * ...
 */
template <typename algorithmFPType, Method method, CpuType cpu>
services::Status HDBSCANBatchKernel<algorithmFPType, method, cpu>::compute(const NumericTable * ntData, NumericTable * ntAssignments,
                                                                           NumericTable * ntNClusters, size_t minClusterSize, size_t minSamples,
                                                                           int metric, double degree, int clusterSelection, bool allowSingleCluster,
                                                                           double clusterSelectionEpsilon, size_t maxClusterSize, double alpha,
                                                                           size_t leafSize)
```
```cpp
// Build tree and run core dist + MST, dispatched by metric
// We need to build the tree with the same distance functor used for queries
#define DISPATCH_BALL_TREE(DIST_FUNC) \
```
Is it possible to replace this macro with a template function with a 'DistanceType' template parameter? Macros make code less debugging-friendly =]
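A sketch of the macro-free alternative: each metric becomes a functor type, the whole build-and-query pipeline is a template over that type, and a single runtime switch maps the metric id to an instantiation. All names here (`runBallTreePipeline`, `dispatchByMetric`, the functors) are hypothetical stand-ins, not the PR's actual code:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Illustrative distance functors standing in for the real ones.
struct EuclideanDist
{
    double pointDist(const double * a, const double * b, std::size_t p) const
    {
        double s = 0.0;
        for (std::size_t i = 0; i < p; ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
        return std::sqrt(s);
    }
};

struct ManhattanDist
{
    double pointDist(const double * a, const double * b, std::size_t p) const
    {
        double s = 0.0;
        for (std::size_t i = 0; i < p; ++i) s += std::fabs(a[i] - b[i]);
        return s;
    }
};

// Template replacement for DISPATCH_BALL_TREE: the pipeline is instantiated
// once per DistanceType, so every instantiation is visible to the debugger.
template <typename DistanceType>
double runBallTreePipeline(const double * a, const double * b, std::size_t p)
{
    DistanceType distFunc;
    // ... build tree, compute core distances and MST with distFunc ...
    return distFunc.pointDist(a, b, p); // placeholder for the pipeline result
}

// A single runtime switch converts the metric id into a concrete type.
double dispatchByMetric(int metric, const double * a, const double * b, std::size_t p)
{
    switch (metric)
    {
    case 0: return runBallTreePipeline<EuclideanDist>(a, b, p);
    case 1: return runBallTreePipeline<ManhattanDist>(a, b, p);
    default: return runBallTreePipeline<EuclideanDist>(a, b, p);
    }
}
```

The switch is the only place the metric id appears, so adding a metric means adding one functor and one case.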
```cpp
int pivot2 = pivot1;
for (int i = begin; i < end; i++)
{
    const algorithmFPType d = distFunc.pointDist(data + pivot1 * nCols, data + pointIndices[i] * nCols, nCols);
```
- The code for `pivot2` and `pivot3` looks like a copy-paste and can be moved into a function.
- This part of the code appears multiple times. Wouldn't it be faster to copy the data points `pointIndices[begin]`, ..., `pointIndices[end-1]` into a buffer and compute blocks of distances to `pivot1`, `pivot2`, etc. using that block of points?
```cpp
for (int i = begin; i < end; i++)
{
    const algorithmFPType d = distFunc.pointDist(data + some_index * nCols, data + pointIndices[i] * nCols, nCols);
    ...
}
```
In the case of bf16, some conversion time can also be saved with this approach.
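A rough sketch of the buffering idea, assuming Euclidean distance for concreteness; `gatherBlock` and `distsToPivot` are hypothetical helpers, not existing oneDAL functions. Gathering once also pays any bf16-to-fp32 conversion cost once per node instead of once per pivot:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Gather the points of one tree node into a contiguous buffer once.
void gatherBlock(const double * data, const int * pointIndices, int begin, int end,
                 std::size_t nCols, std::vector<double> & buf)
{
    buf.resize(static_cast<std::size_t>(end - begin) * nCols);
    for (int i = begin; i < end; ++i)
        for (std::size_t c = 0; c < nCols; ++c)
            buf[static_cast<std::size_t>(i - begin) * nCols + c] = data[static_cast<std::size_t>(pointIndices[i]) * nCols + c];
}

// One routine serves pivot1, pivot2, pivot3 instead of three copy-pasted loops.
void distsToPivot(const double * pivot, const std::vector<double> & buf,
                  std::size_t nCols, std::vector<double> & out)
{
    const std::size_t n = buf.size() / nCols;
    out.resize(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        double s = 0.0;
        for (std::size_t c = 0; c < nCols; ++c)
        {
            const double d = pivot[c] - buf[i * nCols + c];
            s += d * d;
        }
        out[i] = std::sqrt(s);
    }
}
```

The contiguous block also makes the inner loop straightforwardly vectorizable, which the indirect `pointIndices[i]` access defeats.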
```cpp
node.left = buildBallTree(data, pointIndices, begin, mid, nCols, nodes, nextNode, maxLeafSize, distFunc);
node.right = buildBallTree(data, pointIndices, mid, end, nCols, nodes, nextNode, maxLeafSize, distFunc);
```
Might be worth parallelizing this to some depth, i.e. running these two calls as parallel tasks. nextNode needs some synchronization in that case (an atomic or similar).
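A minimal sketch of the idea using std::thread and std::atomic; in oneDAL this would more likely use the internal threading layer (TBB-backed tasks), and `claimNode`/`buildSubtree` are illustrative stand-ins for the real node allocation and buildBallTree recursion:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Node slots are claimed through an atomic counter, so two subtree builds
// can run concurrently without ever handing out the same slot twice.
std::atomic<int> nextNode { 0 };

int claimNode()
{
    return nextNode.fetch_add(1); // unique slot index per call, thread-safe
}

// Stand-in for buildBallTree: claims a slot for this node, then builds both
// children, the left one as a parallel task and the right one inline.
void buildSubtree(int depth, int maxDepth)
{
    (void)claimNode(); // this node's slot in the nodes array
    if (depth >= maxDepth) return;
    std::thread left(buildSubtree, depth + 1, maxDepth);
    buildSubtree(depth + 1, maxDepth);
    left.join();
}
```

Note that with an atomic counter the slot order is no longer deterministic, so any later code that assumes depth-first node numbering would need adjusting.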
```cpp
FPType * dists;
int * indices;
```

Suggested change:

```cpp
FPType * dists;  // distances from the points in the heap to the queryPoint (should the queryPoint itself or its index be saved in the heap to make it more self-contained?)
int * indices;   // indices of the points in the heap
```

```cpp
int capacity;
int size;
```
```cpp
void init(FPType * d, int * idx, int cap)
```
It would be better to make this a RAII data structure: replace the init(...) method with a normal constructor, and replace the dists and indices pointers with TArray/TArrayScalable or similar types. In that case you won't need to define daal::TlsMem storages in the code that uses this KnnHeap structure, which will reduce thread synchronization costs.
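A sketch of the RAII shape, using std::vector where the real code would presumably use TArray/TArrayScalable; this is a cut-down illustration of the KnnHeap idea, not the PR's actual class:

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Owning version of KnnHeap: storage lives in the object, so callers no
// longer need to supply per-thread scratch buffers (daal::TlsMem) for it.
template <typename FPType>
class KnnHeap
{
public:
    explicit KnnHeap(int cap) : dists_(cap), indices_(cap), capacity_(cap), size_(0) {}

    // Worst (largest) distance currently kept; "infinity" while not full,
    // so every candidate is accepted until capacity is reached.
    FPType maxDist() const
    {
        return (size_ > 0) ? dists_[0] : std::numeric_limits<FPType>::max();
    }

    int size() const { return size_; }
    int capacity() const { return capacity_; }

private:
    std::vector<FPType> dists_;  // distances to the query point
    std::vector<int> indices_;   // indices of the heap's points
    int capacity_;
    int size_;
};
```

Destruction is automatic, and the type becomes safely movable, so one heap per thread-local query falls out naturally.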
```cpp
FPType maxDist() const { return (size > 0) ? dists[0] : daal::services::internal::MaxVal<FPType>::get(); }
```
```cpp
void push(FPType dist, int idx)
```
Some comments regarding the order that is maintained among the points in the KNN heap are needed here.
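For reference, the invariant such a bounded KNN heap typically documents: it is a max-heap keyed on distance, so the root `dists[0]` is the worst of the k best distances found so far (which is what `maxDist()` returns), and a full heap only accepts a candidate that beats the root. A self-contained sketch of that behavior, not the PR's actual code:

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct BoundedKnnHeap
{
    std::vector<double> dists;  // max-heap on distance: dists[0] is the worst kept
    std::vector<int> indices;   // indices move in lockstep with dists
    int capacity;

    explicit BoundedKnnHeap(int cap) : capacity(cap) {}

    void push(double dist, int idx)
    {
        if ((int)dists.size() < capacity)
        {
            // Not full yet: append and sift up to restore the max-heap.
            dists.push_back(dist);
            indices.push_back(idx);
            int i = (int)dists.size() - 1;
            while (i > 0 && dists[(i - 1) / 2] < dists[i])
            {
                std::swap(dists[i], dists[(i - 1) / 2]);
                std::swap(indices[i], indices[(i - 1) / 2]);
                i = (i - 1) / 2;
            }
        }
        else if (dist < dists[0])
        {
            // Full and the candidate beats the current worst: replace the
            // root (discarding the worst neighbor) and sift down.
            dists[0] = dist;
            indices[0] = idx;
            int i = 0;
            for (;;)
            {
                const int l = 2 * i + 1, r = 2 * i + 2;
                int m = i;
                if (l < (int)dists.size() && dists[l] > dists[m]) m = l;
                if (r < (int)dists.size() && dists[r] > dists[m]) m = r;
                if (m == i) break;
                std::swap(dists[i], dists[m]);
                std::swap(indices[i], indices[m]);
                i = m;
            }
        }
    }
};
```

With this invariant spelled out, `maxDist()` reading `dists[0]` and the ball-tree pruning test ("skip a node whose lower bound exceeds `maxDist()`") both follow directly.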
Force-pushed f0ced09 to 5c42f62
Description

Checklist:
- Completeness and readability
- Testing
- Performance