README.md
6 additions & 6 deletions
@@ -42,7 +42,7 @@ clusterer = clostera.Clusterer(k=256, fastest=True) # K = number of clusters
 labels = clusterer.fit_transform(vectors)
 ```

-`fastest=True` turns off OPQ and uses the plain PQ path. That is the right choice when end-to-end throughput matters more than reconstruction quality. The main speed win is in encoder training and encoding; the final PQk-means assignment stage itself is already fast in both modes.
+`fastest=True` turns off OPQ and uses the plain PQ path. That is the right choice when end-to-end throughput matters more than reconstruction quality. The main speed win is in encoder training and encoding; the final compressed assignment stage itself is already fast in both modes.

 ### Out-of-core from parquet

@@ -59,7 +59,7 @@ The original repository proved a powerful idea: by clustering in PQ code space i

 `clostera` asks the obvious follow-up question:

-> what happens if you rebuild PQk-means properly for modern hardware and modern Python workflows?
+> what happens if you rebuild the original `pqkmeans` project properly for modern hardware and modern Python workflows?

 On the committed deterministic `10M x 2048` checkpoint, the answer is not subtle.

@@ -524,7 +524,7 @@ The classes below expose the encoder/clusterer split directly. Reach for them wh
 | --- | --- | --- | --- |
 |`encoder`|`PQEncoder`|`required`| Trained encoder that defines the codebooks. |
 |`k`|`int \| None`|`None`| Number of target clusters. Here `K` means the number of clusters. `None` enables Rust-side automatic number-of-clusters selection over candidate values in PQ code space. |
-|`iterations`|`int`|`20`| Number of PQk-means update rounds. |
+|`iterations`|`int`|`20`| Number of clustering update rounds. |
 |`seed`|`int`|`0`| Deterministic seed for cluster-center initialization. |
 |`verbose`|`bool`|`False`| Emit inertia diagnostics during fitting. |
@@ -545,10 +545,10 @@ The classes below expose the encoder/clusterer split directly. Reach for them wh
 |`num_subquantizers`|`int \| None`|`None`| Optional encoder-side PQ subspace count when `encoder` is omitted. |
 |`codebook_size`|`int`|`256`| Optional encoder-side codebook size when `encoder` is omitted. |
 |`encoder_iterations`|`int`|`20`| Encoder training iterations used when `encoder` is omitted. |
-|`seed`|`int`|`0`| Deterministic seed shared by the implicit encoder and the PQk-means clusterer. |
+|`seed`|`int`|`0`| Deterministic seed shared by the implicit encoder and the clusterer. |
 |`opq_iterations`|`int`|`3`| OPQ refinement steps used by the implicit encoder. |
 |`k`|`int \| None`|`None`| Number of target clusters. Here `K` means the number of clusters. `None` enables Rust-side automatic number-of-clusters selection over candidate values in PQ code space. |
-|`iterations`|`int`|`20`| Number of PQk-means update rounds. |
+|`iterations`|`int`|`20`| Number of clustering update rounds. |
 |`verbose`|`bool`|`False`| Emit inertia diagnostics during fitting. |
scripts/generate_demo_notebook.py
2 additions & 2 deletions
@@ -57,7 +57,7 @@ def build_notebook() -> dict:
 markdown_cell(
 """# clostera Tutorial

-This notebook is a **hands-on tutorial** for using `clostera`, the Rust implementation of PQk-means. It focuses on the public API and the workflows you are most likely to use in practice:
+This notebook is a **hands-on tutorial** for using `clostera`, the Rust rewrite of the original `pqkmeans` project. It focuses on the public API and the workflows you are most likely to use in practice:

 1. Use the high-level `Clusterer` API
 2. Cluster with a known number of clusters (`K`)
@@ -185,7 +185,7 @@ def build_notebook() -> dict:
 markdown_cell(
 """## 4. Need maximum throughput? Use `fastest=True`

-`fastest=True` turns off OPQ and uses the plain PQ path. That usually gives the best end-to-end throughput, at the cost of somewhat worse reconstruction quality. The main speed win is in encoder training and encoding, not in the final PQk-means assignment loop itself.
+`fastest=True` turns off OPQ and uses the plain PQ path. That usually gives the best end-to-end throughput, at the cost of somewhat worse reconstruction quality. The main speed win is in encoder training and encoding, not in the final compressed assignment loop itself.