@@ -95,16 +95,17 @@ export function resolveBucketReportLimit(limit?: number): number {
 }
 
 /**
- * Estimate the true distinct row count of a bucket from a sample of its operations.
+ * Estimate the true distinct row count of a bucket from a random sample of its operations.
  *
- * Each operation is included in the sample with probability `r = sampledOps / operations`, so a row with
- * `k` operations is seen with probability `1 - (1 - r)^k`. Assuming operations are spread roughly evenly
- * across rows (so each of `R` rows has about `operations / R` of them), the expected number of distinct
- * rows in the sample is `R * (1 - (1 - r)^(operations / R))`. This is monotonic in `R`, so we binary-search
- * for the `R` that matches the observed distinct count.
+ * The signal is repetition: a sample that keeps landing on the same rows means few rows, while a sample
+ * where every operation lands on a new row means many. Formally, each operation is included in the sample
+ * with probability `r = sampledOps / operations`, so a row with `k` operations appears with probability
+ * `1 - (1 - r)^k`. Assuming operations are spread roughly evenly across `R` rows (`k = operations / R`),
+ * the expected number of distinct rows in the sample is `R * (1 - (1 - r)^(operations / R))`. That grows
+ * with `R`, so a binary search finds the `R` matching the observed distinct count.
  *
- * The naive `distinctRows / r` over-counts rows (and so under-states fragmentation) whenever the sample
- * already covered most rows - exactly the highly- fragmented buckets the report exists to surface.
+ * The naive `distinctRows / r` ignores repetition and over-counts rows (under-stating fragmentation) on
+ * exactly the highly fragmented buckets the report exists to surface.
  *
  * Pure (no I/O) so it is unit-testable; storage adapters supply the sampled counts.
  */
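
For readers of this hunk, here is a minimal sketch of the estimator the comment describes. The function body is not part of the diff, so the names `expectedDistinct` and `estimateRowsSketch`, the search bounds, and the edge-case handling are all assumptions made for illustration, not the repository's actual implementation.

```typescript
// Hypothetical helper: expected number of distinct rows seen in the sample
// if the bucket really had `rows` rows with operations spread evenly.
function expectedDistinct(rows: number, operations: number, r: number): number {
  return rows * (1 - Math.pow(1 - r, operations / rows));
}

// Hypothetical estimator: binary-search for the R whose expected distinct
// count matches the observed `distinctRows`.
function estimateRowsSketch(operations: number, sampledOps: number, distinctRows: number): number {
  if (operations <= 0 || sampledOps <= 0 || distinctRows <= 0) return 0;

  const r = Math.min(1, sampledOps / operations);
  // The whole bucket was sampled, so the observed count is exact.
  if (r >= 1) return distinctRows;

  // Assumed bounds: at least the rows actually seen, at most one row per operation.
  let lo = distinctRows;
  let hi = operations;

  // expectedDistinct grows with R, so plain bisection converges.
  for (let i = 0; i < 60 && hi - lo > 0.5; i++) {
    const mid = (lo + hi) / 2;
    if (expectedDistinct(mid, operations, r) < distinctRows) {
      lo = mid;
    } else {
      hi = mid;
    }
  }
  return Math.round((lo + hi) / 2);
}
```

As a rough illustration, with `operations = 10000`, `sampledOps = 1000` and `distinctRows = 100`, the naive `distinctRows / r` gives 1000 rows, while this sketch stays close to 100, because a 1000-operation sample that only ever lands on 100 rows is showing heavy repetition.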