Commit 6416fa9: Clarify Clostera scope in README
1 parent fd4a19c

1 file changed: README.md (52 additions & 0 deletions)
@@ -134,6 +134,58 @@ CLOSTERA_SIMD=avx512
CLOSTERA_SIMD=neon
```

## Out of Scope

Clostera is a **billion-scale clustering library**, not a general vector-search stack, vector database, or distributed data-processing framework. Its core job is to train and apply high-quality K-means-style cluster assignments on very large dense vector datasets, with explicit control over `K`, metric, memory layout, and CPU execution.
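
The train-then-assign loop described above is ordinary Lloyd iteration at heart. The following NumPy sketch is purely illustrative (it is not Clostera's API; `lloyd_kmeans` is a hypothetical helper name):

```python
import numpy as np

def lloyd_kmeans(x, k, iters=10, seed=0):
    """Minimal Lloyd's K-means: train centroids, then assign points.

    Toy NumPy version for illustration only -- not Clostera's kernels,
    SIMD dispatch, or memmap flows.
    """
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared-L2 distance from every point to every centroid.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = x[labels == j].mean(axis=0)
    return centroids, labels

rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0, 0.1, (50, 4)), rng.normal(100, 0.1, (50, 4))])
centroids, labels = lloyd_kmeans(x, k=2)
```

At billion-vector scale, the `d2` matrix above is the part a library like Clostera must compute blockwise with SIMD kernels rather than materializing in full.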

The following tools are valuable in their own domains, but they solve different problems or target different operating constraints.

### Scikit-Learn `MiniBatchKMeans`

Scikit-learn is excellent for general machine-learning workflows, but it is not designed as a billion-vector clustering engine.

- **Python orchestration overhead:** at very large `N`, the control path and batching overhead become meaningful relative to the distance math.
- **Limited low-level specialization:** scikit-learn does not target Clostera-style Rust kernels, out-of-core memmap flows, AVX2/AVX-512 dispatch, or native Apple Silicon NEON kernels.
- **Different scale target:** `MiniBatchKMeans` is useful for approximate clustering on moderately sized data, but Clostera is built for single-machine workloads of 100M-1B vectors.
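
To make the first point concrete: the per-batch update mini-batch K-means performs is itself tiny. The sketch below (a Sculley-style streaming update, not scikit-learn's actual code) shows the step that the Python-level control loop must drive millions of times at billion-vector `N`:

```python
import numpy as np

def minibatch_step(centroids, counts, batch):
    """One mini-batch K-means step (streaming centroid update).

    Algorithm sketch, not scikit-learn's implementation: each point
    nudges its nearest centroid by a per-centroid learning rate
    1/count, so older assignments decay as counts grow.
    """
    d2 = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    for x, j in zip(batch, labels):
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]
    return labels

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
counts = np.zeros(2, dtype=np.int64)
batch = np.array([[1.0, 1.0], [9.0, 9.0], [0.5, 0.0]])
labels = minibatch_step(centroids, counts, batch)
```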

### ScaNN, HNSWlib, Annoy, and Similar ANN Libraries

Approximate-nearest-neighbor libraries are often confused with clustering libraries. They are not the same thing.

- **Retrieval vs. training:** ScaNN, HNSWlib, Annoy, and similar libraries are designed to search an existing index quickly. Clostera is designed to train centroids and assign points to clusters.
- **Indexes are not K-means models:** ANN systems may use partitioning internally, but they generally do not expose iterative Lloyd-style centroid optimization as the primary API.
- **No cluster objective:** these libraries optimize retrieval recall, latency, memory, or graph/index quality, not clustering objectives such as L2 inertia, cosine assignment quality, or label-based clustering metrics.
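
The L2 inertia mentioned in the last point is a concrete, checkable number. A minimal sketch (`l2_inertia` is an illustrative helper, not an API from any of these libraries):

```python
import numpy as np

def l2_inertia(x, centroids, labels):
    """Sum of squared L2 distances from each point to its assigned
    centroid: the quantity K-means minimizes. ANN libraries report
    recall and latency instead; an index can be fast to search while
    this objective is arbitrarily bad."""
    return float(((x - centroids[labels]) ** 2).sum())

x = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
centroids = np.array([[0.5, 0.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
# 0.25 + 0.25 + 0.0 = 0.5
```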

### Vector Databases

Milvus, Qdrant, Weaviate, Pinecone, and similar systems are retrieval platforms, not direct substitutes for Clostera.

- **Serving layer vs. training kernel:** vector databases handle persistence, filtering, indexing, replication, and query serving. Clostera handles compute-heavy clustering.
- **Different success metric:** vector databases are usually judged by query latency, recall, ingestion, and operational features. Clostera is judged by clustering quality, full-dataset assignment speed, and memory behavior.

### Traditional Distributed Frameworks

General distributed frameworks such as Spark MLlib fall outside Clostera's design target.

At 1B vectors with `D=256` and `float32`, the raw vector matrix is about **1 TB**. Algorithms that shuffle large vector blocks across a network every iteration pay a cost that can dominate the clustering computation itself.
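
The 1 TB figure is simple arithmetic:

```python
n, d = 1_000_000_000, 256  # vectors, dimensions
bytes_per_value = 4        # float32
total_bytes = n * d * bytes_per_value
# 1_024_000_000_000 bytes: about 1.02 TB (decimal), about 0.93 TiB
```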

Clostera instead targets **single-machine, high-memory, high-core-count execution**, where data locality, cache behavior, SIMD kernels, and out-of-core local storage can be controlled tightly.

### GPU-First Clustering Stacks

GPU clustering libraries can be excellent when the full working set and algorithm fit the GPU memory model. Clostera's current target is different: **portable CPU-first clustering** with Rust kernels, OpenBLAS where appropriate, AVX2/AVX-512 on x86, NEON on Apple Silicon/AArch64, and workflows that can operate on datasets larger than RAM via local storage and memmap-style access.
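
A minimal sketch of what memmap-style, larger-than-RAM assignment looks like (illustrative NumPy only; `assign_chunked` and the flat-file layout are assumptions for the example, not Clostera's kernels):

```python
import numpy as np
import os
import tempfile

def assign_chunked(path, n, d, centroids, chunk=1024):
    """Stream fixed-size chunks of an on-disk float32 matrix through an
    in-RAM nearest-centroid computation, so only one chunk is resident
    at a time."""
    data = np.memmap(path, dtype=np.float32, mode="r", shape=(n, d))
    labels = np.empty(n, dtype=np.int64)
    for start in range(0, n, chunk):
        block = np.asarray(data[start:start + chunk])  # one chunk in RAM
        d2 = ((block[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels[start:start + chunk] = d2.argmin(axis=1)
    return labels

# Tiny on-disk demo: 4 vectors, 2 centroids, chunks of 2.
path = os.path.join(tempfile.mkdtemp(), "vecs.f32")
arr = np.memmap(path, dtype=np.float32, mode="w+", shape=(4, 3))
arr[:] = [[0, 0, 0], [1, 0, 0], [9, 9, 9], [8, 9, 9]]
arr.flush()
centroids = np.array([[0.0, 0.0, 0.0], [9.0, 9.0, 9.0]], dtype=np.float32)
labels = assign_chunked(path, n=4, d=3, centroids=centroids, chunk=2)
```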

### What Clostera Is

Clostera is for users who have:

- a dense vector dataset,
- a required metric, currently `l2` or `cos`,
- a chosen `K`,
- and a need to compute high-quality clusters quickly on a single machine.
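
A sketch of how the two supported metrics differ at assignment time (illustrative NumPy, not Clostera's implementation): `l2` compares squared distances, while `cos` normalizes rows and maximizes the dot product, so vector magnitude stops mattering.

```python
import numpy as np

def assign(x, centroids, metric="l2"):
    """Nearest-centroid assignment under the two metrics named above.
    Hypothetical helper for illustration only."""
    if metric == "cos":
        # Normalize both sides; the largest dot product is the nearest
        # centroid by cosine similarity.
        xn = x / np.linalg.norm(x, axis=1, keepdims=True)
        cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
        return (xn @ cn.T).argmax(axis=1)
    # Squared L2 distance; the square root is monotone, so it can be skipped.
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

x = np.array([[10.0, 0.1], [0.1, 10.0]])
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
# Under `cos`, each row matches the centroid it points along,
# regardless of its length.
```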

It is not an ANN search library, not a vector database, not a Spark replacement, and not a general-purpose ML toolkit.
## What Auto Does

The current selector is intentionally simple and auditable. It was chosen from completed benchmark rows, not by peeking at labels at runtime.
