Incremental clustering system based on embedding similarity and approximate nearest neighbor (HNSW) search. Supports dynamic cluster creation, assignment, updating, and optional merging.
Stores statistical information about a cluster.
- prototypeSize (Int): number of items in the cluster
- meanSimilarity (Float): running mean similarity to the prototype
- stdSimilarity (Float): standard deviation of the similarity
- label (String?): optional label
Represents a cluster.
- prototypeId (Long): unique cluster identifier
- embedding (FloatArray): centroid/prototype vector
- metadata (ClusterMetadata): cluster statistics
Output of clustering.
- clusters (Map<ClusterId, Cluster>): final clusters
- assignments (Assignments): item → cluster mapping
- merges (ClusterMerges?): optional merge relationships
- ItemId = Long
- ClusterId = Long
- Assignments = MutableMap<ItemId, ClusterId>
- MergeId = Long
- TargetClusters = List<ClusterId>
- ClusterMerges = MutableMap<MergeId, TargetClusters>
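The types described above could be declared roughly as follows. This is a minimal sketch, not the library's actual source: the name `ClusteringResult` is hypothetical (the name of the output type is not given here), and the `null` defaults for `label` and `merges` are assumptions based on their being described as optional.

```kotlin
// Type aliases as listed above.
typealias ItemId = Long
typealias ClusterId = Long
typealias Assignments = MutableMap<ItemId, ClusterId>
typealias MergeId = Long
typealias TargetClusters = List<ClusterId>
typealias ClusterMerges = MutableMap<MergeId, TargetClusters>

// Statistical information about a cluster.
data class ClusterMetadata(
    val prototypeSize: Int,
    val meanSimilarity: Float,
    val stdSimilarity: Float,
    val label: String? = null // assumption: optional, so defaulted to null
)

// A plain class rather than a data class, because generated equals()
// would compare the FloatArray by reference, which is rarely intended.
class Cluster(
    val prototypeId: Long,
    val embedding: FloatArray,
    val metadata: ClusterMetadata
)

// Hypothetical name for the clustering output type.
data class ClusteringResult(
    val clusters: Map<ClusterId, Cluster>,
    val assignments: Assignments,
    val merges: ClusterMerges? = null // assumption: optional, so defaulted to null
)
```

With these declarations, a result holding one cluster and one assigned item can be built as `ClusteringResult(mapOf(1L to cluster), mutableMapOf(42L to 1L))`.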
Online clustering using cosine similarity, adaptive thresholds, HNSW approximate nearest neighbor search, and optional voting-based fallback.
IncrementalClusterer(
    existingClusters: Map<ClusterId, Cluster>? = null,
    defaultThreshold: Float = 0.3f,
    minClusterSize: Int = 2,
    topK: Int = 5
)
- existingClusters: optional initial clusters
- defaultThreshold: baseline similarity threshold
- minClusterSize: threshold for adaptive behavior
- topK: neighbors used for voting fallback
Processes embeddings incrementally and returns clustering results.
Behavior:
- builds ANN index (HNSW)
- assigns each item to best matching cluster
- applies adaptive thresholding
- falls back to voting when needed
- creates new clusters if required
- updates cluster prototypes incrementally
- optionally performs cluster merging
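The core of the loop above (assign to the best match, otherwise create a cluster, then update the prototype) can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual implementation: a linear scan stands in for the HNSW index, the threshold is fixed rather than adaptive, and voting and merging are omitted. `SketchClusterer`, `cosine`, and all member names are hypothetical.

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two vectors of equal length.
fun cosine(a: FloatArray, b: FloatArray): Float {
    var dot = 0f
    var na = 0f
    var nb = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    val denom = sqrt(na) * sqrt(nb)
    return if (denom == 0f) 0f else dot / denom
}

// Hypothetical stand-in: brute-force search instead of an ANN index.
class SketchClusterer(private val threshold: Float = 0.3f) {
    val centroids = mutableMapOf<Long, FloatArray>()
    val sizes = mutableMapOf<Long, Int>()
    val assignments = mutableMapOf<Long, Long>()
    private var nextClusterId = 0L

    fun process(items: Map<Long, FloatArray>) {
        for ((itemId, embedding) in items) {
            // Best-matching cluster by cosine similarity to its centroid.
            val best = centroids.entries
                .map { it.key to cosine(embedding, it.value) }
                .maxByOrNull { it.second }

            val clusterId = if (best != null && best.second >= threshold) {
                // Fold the new item into the centroid as a running mean.
                val n = sizes.getValue(best.first)
                val c = centroids.getValue(best.first)
                for (i in c.indices) c[i] += (embedding[i] - c[i]) / (n + 1)
                sizes[best.first] = n + 1
                best.first
            } else {
                // No cluster is similar enough: start a new one.
                val id = nextClusterId++
                centroids[id] = embedding.copyOf()
                sizes[id] = 1
                id
            }
            assignments[itemId] = clusterId
        }
    }
}
```

The linear scan costs O(number of clusters) per item, which is the cost an ANN index such as HNSW is meant to avoid at scale.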
Resets all clusters, assignments, and ANN state.
- Direct assignment: applied when similarity exceeds the adaptive threshold
- Voting fallback: uses the top-K nearest neighbors and a majority cluster vote
- New cluster creation: used when no existing cluster matches the criteria
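One way the three strategies might fit together is sketched below. The exact voting rule is not specified here, so this sketch assumes a strict majority among the top-K neighbors and adds a hypothetical `voteFloor` parameter so that very dissimilar neighbors do not vote. `Decision`, `decide`, and `thresholdFor` are illustrative names only.

```kotlin
sealed class Decision {
    data class Direct(val clusterId: Long) : Decision()
    data class Voted(val clusterId: Long) : Decision()
    object NewCluster : Decision()
}

// neighbors: (clusterId, similarity) pairs returned by the nearest-neighbor search.
// thresholdFor: yields the adaptive threshold of a given cluster.
fun decide(
    neighbors: List<Pair<Long, Float>>,
    thresholdFor: (Long) -> Float,
    topK: Int = 5,
    voteFloor: Float = 0f // hypothetical: minimum similarity for a neighbor to vote
): Decision {
    val best = neighbors.maxByOrNull { it.second } ?: return Decision.NewCluster

    // 1. Direct assignment: the best match clears its cluster's threshold.
    if (best.second >= thresholdFor(best.first)) return Decision.Direct(best.first)

    // 2. Voting fallback: count cluster votes among the top-K neighbors.
    val top = neighbors.sortedByDescending { it.second }.take(topK)
    val votes = top.filter { it.second >= voteFloor }
        .groupingBy { it.first }
        .eachCount()
    val winner = votes.maxByOrNull { it.value }

    // 3. New cluster: no cluster holds a strict majority of the top-K.
    return if (winner != null && winner.value * 2 > top.size) {
        Decision.Voted(winner.key)
    } else {
        Decision.NewCluster
    }
}
```

Requiring a strict majority of the whole top-K list, rather than of the eligible voters only, keeps a single weak neighbor from deciding the assignment.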
The adaptive threshold is derived from:
- cluster size
- mean similarity
- similarity variance
- global cohesion statistics
It blends a baseline threshold with statistical adjustments to stabilize clustering across varying densities.
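The exact formula is not given here, so the following is only one plausible shape for such a blend, with every constant and weighting chosen for illustration. It assumes that clusters smaller than `minClusterSize` use the baseline unchanged, that the statistical estimate is the average of "one standard deviation below the cluster's mean similarity" and a global mean similarity, and that the weight of that estimate grows with cluster size. `adaptiveThreshold` and `globalMeanSim` are hypothetical names.

```kotlin
fun adaptiveThreshold(
    size: Int,
    meanSim: Float,
    stdSim: Float,
    globalMeanSim: Float,          // hypothetical global cohesion statistic
    defaultThreshold: Float = 0.3f,
    minClusterSize: Int = 2
): Float {
    // Assumption: too few items for reliable statistics, so use the baseline.
    if (size < minClusterSize) return defaultThreshold

    // Assumed statistical estimate: a tight cluster (high mean, low spread)
    // demands higher similarity; the global mean pulls outliers toward typical values.
    val local = meanSim - stdSim
    val statistical = 0.5f * (local + globalMeanSim)

    // Assumed weighting: trust the statistics more as the cluster grows.
    val w = size.toFloat() / (size + minClusterSize)

    return ((1 - w) * defaultThreshold + w * statistical).coerceIn(0f, 1f)
}
```

For example, with `size = 2`, `meanSim = 0.8`, `stdSim = 0.1`, and `globalMeanSim = 0.6`, the weight is 0.5 and the blended threshold is 0.5 · 0.3 + 0.5 · 0.65 = 0.475.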
Each assignment updates:
- centroid embedding (incremental mean)
- mean similarity
- similarity variance
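These three updates can all be done in constant memory per cluster. The sketch below uses the standard running-mean update for the centroid and Welford's online algorithm for the similarity mean and variance; whether the actual implementation uses Welford's method, or population rather than sample variance, is an assumption. `RunningCluster` and its members are hypothetical names.

```kotlin
import kotlin.math.sqrt

class RunningCluster(dim: Int) {
    val centroid = FloatArray(dim)
    var count = 0
        private set
    var meanSimilarity = 0f
        private set
    private var m2 = 0f // running sum of squared deviations (Welford)

    // Population standard deviation of the similarities seen so far (assumed convention).
    val stdSimilarity: Float
        get() = if (count > 0) sqrt(m2 / count) else 0f

    fun add(embedding: FloatArray, similarity: Float) {
        count += 1

        // Incremental mean: c_new = c_old + (x - c_old) / n
        for (i in centroid.indices) {
            centroid[i] += (embedding[i] - centroid[i]) / count
        }

        // Welford update for mean and variance of similarity.
        val delta = similarity - meanSimilarity
        meanSimilarity += delta / count
        m2 += delta * (similarity - meanSimilarity)
    }
}
```

Adding `(1, 0)` with similarity 1.0 and then `(0, 1)` with similarity 0.5 yields centroid `(0.5, 0.5)`, mean similarity 0.75, and standard deviation 0.25.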
Optional post-processing step that merges clusters based on global similarity cohesion criteria.
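The merge criteria are not detailed here. As an illustration only, the sketch below groups clusters whose centroids have cosine similarity at or above a hypothetical `mergeThreshold`, using a small union-find so that chains of similar clusters end up in one group. It returns a map shaped like `ClusterMerges`, assuming each merge ID maps to the list of cluster IDs merged together; `findMerges` and the sequential merge IDs are assumptions.

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two vectors of equal length.
fun cosine(a: FloatArray, b: FloatArray): Float {
    var dot = 0f
    var na = 0f
    var nb = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    val denom = sqrt(na) * sqrt(nb)
    return if (denom == 0f) 0f else dot / denom
}

fun findMerges(
    centroids: Map<Long, FloatArray>,
    mergeThreshold: Float // hypothetical parameter
): MutableMap<Long, List<Long>> {
    val ids = centroids.keys.sorted()
    val parent = ids.associateWith { it }.toMutableMap()

    // Union-find root lookup (no path compression, for brevity).
    fun find(x: Long): Long {
        var r = x
        while (parent.getValue(r) != r) r = parent.getValue(r)
        return r
    }

    // Link every pair of clusters whose centroids are similar enough.
    for (i in ids.indices) {
        for (j in i + 1 until ids.size) {
            val sim = cosine(centroids.getValue(ids[i]), centroids.getValue(ids[j]))
            if (sim >= mergeThreshold) {
                val ra = find(ids[i])
                val rb = find(ids[j])
                if (ra != rb) parent[maxOf(ra, rb)] = minOf(ra, rb)
            }
        }
    }

    // Each group of two or more clusters becomes one merge entry.
    val merges = mutableMapOf<Long, List<Long>>()
    var nextMergeId = 0L
    ids.groupBy { find(it) }.values
        .filter { it.size > 1 }
        .forEach { merges[nextMergeId++] = it }
    return merges
}
```

The pairwise comparison is quadratic in the number of clusters, which is usually acceptable for a post-processing step because clusters are far fewer than items.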