Commit 6416fa9: Clarify Clostera scope in README
1 parent fd4a19c

1 file changed: README.md (52 additions & 0 deletions)
@@ -134,6 +134,58 @@ CLOSTERA_SIMD=avx512
CLOSTERA_SIMD=neon
```

## Out of Scope

Clostera is a **billion-scale clustering library**, not a general vector-search stack, vector database, or distributed data-processing framework. Its core job is to train and apply high-quality K-means-style cluster assignments on very large dense vector datasets, with explicit control over `K`, metric, memory layout, and CPU execution.
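
The train-then-assign loop described above is ordinary Lloyd iteration at heart. The following NumPy sketch is purely illustrative (it is not Clostera's API; `lloyd_kmeans` is a hypothetical helper name):

```python
import numpy as np

def lloyd_kmeans(x, k, iters=10, seed=0):
    """Minimal Lloyd's K-means: train centroids, then assign points.

    Toy NumPy version for illustration only -- not Clostera's kernels,
    SIMD dispatch, or memmap flows.
    """
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared-L2 distance from every point to every centroid.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = x[labels == j].mean(axis=0)
    return centroids, labels

rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0, 0.1, (50, 4)), rng.normal(100, 0.1, (50, 4))])
centroids, labels = lloyd_kmeans(x, k=2)
```

At billion-vector scale, the `d2` matrix above is the part a library like Clostera must compute blockwise with SIMD kernels rather than materializing in full.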

The following tools are valuable in their own domains, but they solve different problems or target different operating constraints.

### Scikit-Learn `MiniBatchKMeans`

Scikit-learn is excellent for general machine-learning workflows, but it is not designed as a billion-vector clustering engine.

- **Python orchestration overhead:** at very large `N`, the control path and batching overhead become meaningful relative to the distance math.
- **Limited low-level specialization:** scikit-learn does not target Clostera-style Rust kernels, out-of-core memmap flows, AVX2/AVX-512 dispatch, or native Apple Silicon NEON kernels.
- **Different scale target:** `MiniBatchKMeans` is useful for approximate clustering on moderately sized data, but Clostera is built for single-machine workloads of 100M-1B vectors.
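
To make the first point concrete: the per-batch update mini-batch K-means performs is itself tiny. The sketch below (a Sculley-style streaming update, not scikit-learn's actual code) shows the step that the Python-level control loop must drive millions of times at billion-vector `N`:

```python
import numpy as np

def minibatch_step(centroids, counts, batch):
    """One mini-batch K-means step (streaming centroid update).

    Algorithm sketch, not scikit-learn's implementation: each point
    nudges its nearest centroid by a per-centroid learning rate
    1/count, so older assignments decay as counts grow.
    """
    d2 = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    for x, j in zip(batch, labels):
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]
    return labels

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
counts = np.zeros(2, dtype=np.int64)
batch = np.array([[1.0, 1.0], [9.0, 9.0], [0.5, 0.0]])
labels = minibatch_step(centroids, counts, batch)
```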

### ScaNN, HNSWlib, Annoy, and Similar ANN Libraries

Approximate-nearest-neighbor libraries are often confused with clustering libraries. They are not the same thing.

- **Retrieval vs. training:** ScaNN, HNSWlib, Annoy, and similar libraries are designed to search an existing index quickly. Clostera is designed to train centroids and assign points to clusters.
- **Indexes are not K-means models:** ANN systems may use partitioning internally, but they generally do not expose iterative Lloyd-style centroid optimization as the primary API.
- **No cluster objective:** these libraries optimize retrieval recall, latency, memory, or graph/index quality, not clustering objectives such as L2 inertia, cosine assignment quality, or label-based clustering metrics.
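
The L2 inertia mentioned in the last point is a concrete, checkable number. A minimal sketch (`l2_inertia` is an illustrative helper, not an API from any of these libraries):

```python
import numpy as np

def l2_inertia(x, centroids, labels):
    """Sum of squared L2 distances from each point to its assigned
    centroid: the quantity K-means minimizes. ANN libraries report
    recall and latency instead; an index can be fast to search while
    this objective is arbitrarily bad."""
    return float(((x - centroids[labels]) ** 2).sum())

x = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
centroids = np.array([[0.5, 0.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
# 0.25 + 0.25 + 0.0 = 0.5
```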

### Vector Databases

Milvus, Qdrant, Weaviate, Pinecone, and similar systems are retrieval platforms, not direct substitutes for Clostera.

- **Serving layer vs. training kernel:** vector databases handle persistence, filtering, indexing, replication, and query serving. Clostera handles compute-heavy clustering.
- **Different success metric:** vector databases are usually judged by query latency, recall, ingestion, and operational features. Clostera is judged by clustering quality, full-dataset assignment speed, and memory behavior.

### Traditional Distributed Frameworks

General distributed frameworks such as Spark MLlib fall outside Clostera's design target.

At 1B vectors with `D=256` and `float32`, the raw vector matrix is about **1 TB**. Algorithms that shuffle large vector blocks across a network every iteration pay a cost that can dominate the clustering computation itself.
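
The 1 TB figure is simple arithmetic:

```python
n, d = 1_000_000_000, 256  # vectors, dimensions
bytes_per_value = 4        # float32
total_bytes = n * d * bytes_per_value
# 1_024_000_000_000 bytes: about 1.02 TB (decimal), about 0.93 TiB
```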

Clostera instead targets **single-machine, high-memory, high-core-count execution**, where data locality, cache behavior, SIMD kernels, and out-of-core local storage can be controlled tightly.

### GPU-First Clustering Stacks

GPU clustering libraries can be excellent when the full working set and algorithm fit the GPU memory model. Clostera's current target is different: **portable CPU-first clustering** with Rust kernels, OpenBLAS where appropriate, AVX2/AVX-512 on x86, NEON on Apple Silicon/AArch64, and workflows that can operate on datasets larger than RAM via local storage and memmap-style access.
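
A minimal sketch of what memmap-style, larger-than-RAM assignment looks like (illustrative NumPy only; `assign_chunked` and the flat-file layout are assumptions for the example, not Clostera's kernels):

```python
import numpy as np
import os
import tempfile

def assign_chunked(path, n, d, centroids, chunk=1024):
    """Stream fixed-size chunks of an on-disk float32 matrix through an
    in-RAM nearest-centroid computation, so only one chunk is resident
    at a time."""
    data = np.memmap(path, dtype=np.float32, mode="r", shape=(n, d))
    labels = np.empty(n, dtype=np.int64)
    for start in range(0, n, chunk):
        block = np.asarray(data[start:start + chunk])  # one chunk in RAM
        d2 = ((block[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels[start:start + chunk] = d2.argmin(axis=1)
    return labels

# Tiny on-disk demo: 4 vectors, 2 centroids, chunks of 2.
path = os.path.join(tempfile.mkdtemp(), "vecs.f32")
arr = np.memmap(path, dtype=np.float32, mode="w+", shape=(4, 3))
arr[:] = [[0, 0, 0], [1, 0, 0], [9, 9, 9], [8, 9, 9]]
arr.flush()
centroids = np.array([[0.0, 0.0, 0.0], [9.0, 9.0, 9.0]], dtype=np.float32)
labels = assign_chunked(path, n=4, d=3, centroids=centroids, chunk=2)
```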

### What Clostera Is

Clostera is for users who have:

- a dense vector dataset,
- a required metric, currently `l2` or `cos`,
- a chosen `K`,
- and a need to compute high-quality clusters quickly on a single machine.
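
A sketch of how the two supported metrics differ at assignment time (illustrative NumPy, not Clostera's implementation): `l2` compares squared distances, while `cos` normalizes rows and maximizes the dot product, so vector magnitude stops mattering.

```python
import numpy as np

def assign(x, centroids, metric="l2"):
    """Nearest-centroid assignment under the two metrics named above.
    Hypothetical helper for illustration only."""
    if metric == "cos":
        # Normalize both sides; the largest dot product is the nearest
        # centroid by cosine similarity.
        xn = x / np.linalg.norm(x, axis=1, keepdims=True)
        cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
        return (xn @ cn.T).argmax(axis=1)
    # Squared L2 distance; the square root is monotone, so it can be skipped.
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

x = np.array([[10.0, 0.1], [0.1, 10.0]])
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
# Under `cos`, each row matches the centroid it points along,
# regardless of its length.
```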

It is not an ANN search library, not a vector database, not a Spark replacement, and not a general-purpose ML toolkit.
## What Auto Does

The current selector is intentionally simple and auditable. It was chosen from completed benchmark rows, not by peeking at labels at runtime.
