README.md (52 additions, 0 deletions)
@@ -134,6 +134,58 @@ CLOSTERA_SIMD=avx512
CLOSTERA_SIMD=neon
```

## Out of Scope

Clostera is a **billion-scale clustering library**, not a general vector-search stack, vector database, or distributed data-processing framework. Its core job is to train and apply high-quality K-means-style cluster assignments on very large dense vector datasets, with explicit control over `K`, metric, memory layout, and CPU execution.

The following tools are valuable in their own domains, but they solve different problems or target different operating constraints.

### Scikit-Learn `MiniBatchKMeans`

Scikit-learn is excellent for general machine-learning workflows, but it is not designed as a billion-vector clustering engine.

- **Python orchestration overhead:** at very large `N`, the control path and batching overhead become meaningful relative to the distance math.
- **Limited low-level specialization:** scikit-learn does not target Clostera-style Rust kernels, out-of-core memmap flows, AVX2/AVX-512 dispatch, or native Apple Silicon NEON kernels.
- **Different scale target:** `MiniBatchKMeans` is useful for approximate clustering on moderately sized data, whereas Clostera is built around single-machine workloads of 100M to 1B vectors.

### ScaNN, HNSWlib, Annoy, and Similar ANN Libraries

Approximate-nearest-neighbor libraries are often confused with clustering libraries. They are not the same thing.

- **Retrieval vs. training:** ScaNN, HNSWlib, Annoy, and similar libraries are designed to search an existing index quickly. Clostera is designed to train centroids and assign points to clusters.
- **Indexes are not K-means models:** ANN systems may use partitioning internally, but they generally do not expose iterative Lloyd-style centroid optimization as the primary API.
- **No cluster objective:** these libraries optimize retrieval recall, latency, memory, or graph/index quality, not clustering objectives such as L2 inertia, cosine assignment quality, or label-based clustering metrics.
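
To make the distinction concrete, here is a minimal pure-Python sketch of a Lloyd-style loop and the L2 inertia it drives down. It is illustrative only and does not represent Clostera's API or kernels; the helper names (`sq_l2`, `assign`, `update`, `inertia`, `lloyd`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points: Sequence[Vector], centroids: Sequence[Vector]) -> List[int]:
    """Assignment step: index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda k: sq_l2(p, centroids[k]))
        for p in points
    ]


def update(points, labels, centroids) -> List[List[float]]:
    """Update step: move each centroid to the mean of its members.

    A cluster with no members keeps its previous centroid.
    """
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p, k in zip(points, labels):
        counts[k] += 1
        for j, x in enumerate(p):
            sums[k][j] += x
    return [
        [s / counts[k] for s in sums[k]] if counts[k] else list(centroids[k])
        for k in range(len(centroids))
    ]


def inertia(points, labels, centroids) -> float:
    """L2 inertia: sum of squared distances to the assigned centroid."""
    return sum(sq_l2(p, centroids[k]) for p, k in zip(points, labels))


def lloyd(points, centroids, iters: int = 10) -> Tuple[list, List[int], float]:
    """Alternate assignment and update steps for a fixed number of iterations."""
    for _ in range(iters):
        labels = assign(points, centroids)
        centroids = update(points, labels, centroids)
    labels = assign(points, centroids)
    return centroids, labels, inertia(points, labels, centroids)


points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centroids, labels, score = lloyd(points, [[0.0, 0.0], [10.0, 0.0]], iters=5)
print(centroids, labels, score)
```

An ANN index answers "which stored vectors are closest to this query?"; the loop above instead answers "which `K` centroids minimize the total squared distance over the whole dataset?", which is the kind of objective the bullets refer to.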

### Vector Databases

Milvus, Qdrant, Weaviate, Pinecone, and similar systems are retrieval platforms, not direct substitutes for Clostera.

- **Serving layer vs. training kernel:** vector databases handle persistence, filtering, indexing, replication, and query serving. Clostera handles compute-heavy clustering.
- **Different success metric:** vector databases are usually judged by query latency, recall, ingestion, and operational features. Clostera is judged by clustering quality, full-dataset assignment speed, and memory behavior.

### Traditional Distributed Frameworks

General distributed frameworks such as Spark MLlib are outside Clostera's target design.

At 1B vectors with `D=256` and `float32`, the raw vector matrix is about **1 TB** (10^9 × 256 × 4 bytes ≈ 1.02 × 10^12 bytes). Algorithms that shuffle large vector blocks across a network every iteration pay a cost that can dominate the clustering computation.

Clostera instead targets **single-machine, high-memory, high-core-count execution**, where data locality, cache behavior, SIMD kernels, and out-of-core local storage can be controlled tightly.
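
The 1 TB figure above is plain arithmetic, and a short sketch makes it easy to redo for other shapes. The function names and the 10 Gbit/s link speed below are assumptions chosen for illustration; they do not describe any particular cluster or any Clostera API.

```python
def raw_matrix_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Size of a dense row-major matrix; float32 uses 4 bytes per value."""
    return n_vectors * dim * bytes_per_value


def transfer_seconds(n_bytes: int, gbit_per_s: float) -> float:
    """Idealized time to move n_bytes once over a link at full line rate."""
    return n_bytes * 8 / (gbit_per_s * 1e9)


size = raw_matrix_bytes(1_000_000_000, 256)
print(f"{size / 1e12:.3f} TB ({size / 2**40:.3f} TiB)")

# Hypothetical 10 Gbit/s link, ignoring protocol overhead and contention.
print(f"{transfer_seconds(size, 10.0):.1f} s to move the matrix once")
```

Under these assumptions the matrix is about 1.024 TB (0.931 TiB), and moving it across the link a single time takes roughly 819 seconds, before any distance computation happens.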

### GPU-First Clustering Stacks

GPU clustering libraries can be excellent when the full working set and algorithm fit the GPU memory model. Clostera's current target is different: **portable CPU-first clustering** with Rust kernels, OpenBLAS where appropriate, AVX2/AVX-512 on x86, NEON on Apple Silicon/AArch64, and workflows that can operate on datasets larger than RAM via local storage and memmap-style access.
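
As a rough illustration of what "memmap-style access" means in general, the sketch below maps a file of vectors into memory and reads individual rows without loading the whole file. It uses only the Python standard library; the file layout (row-major, little-endian `float32`), the file name, and the helper names are assumptions for this example and are not Clostera's documented storage format.

```python
import mmap
import os
import struct
import tempfile

DIM = 4                      # values per vector (hypothetical)
ROW_BYTES = DIM * 4          # float32 is 4 bytes per value
ROW_FORMAT = f"<{DIM}f"      # little-endian float32 row


def write_vectors(path: str, rows) -> None:
    """Write rows back to back as raw little-endian float32 values."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(struct.pack(ROW_FORMAT, *row))


def read_row(mm: mmap.mmap, i: int) -> tuple:
    """Decode row i straight from the mapping; only touched pages are read."""
    return struct.unpack_from(ROW_FORMAT, mm, i * ROW_BYTES)


def demo():
    rows = [[float(i + j) for j in range(DIM)] for i in range(1000)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "vectors.f32")
        write_vectors(path, rows)
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                n_rows = len(mm) // ROW_BYTES
                return n_rows, read_row(mm, 0), read_row(mm, n_rows - 1)


print(demo())
```

The operating system pages data in on demand, so the same access pattern works when the file is larger than available RAM, at the cost of storage latency for pages that are not cached.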

### What Clostera Is

Clostera is for users who have:

- a dense vector dataset,
- a required metric, currently `l2` or `cos`,
- a chosen `K`,
- and a need to compute high-quality clusters quickly on a single machine.

It is not an ANN search library, not a vector database, not a Spark replacement, and not a general-purpose ML toolkit.

## What Auto Does

The current selector is intentionally simple and auditable. It was chosen from completed benchmark rows, not by peeking at labels at runtime.