
Commit 674bd85

Clean up comments for readability
1 parent 62016e9 commit 674bd85

2 files changed

Lines changed: 15 additions & 16 deletions


modules/module-mongodb-storage/src/storage/implementation/MongoSyncBucketStorage.ts

Lines changed: 6 additions & 8 deletions
@@ -634,14 +634,12 @@ export abstract class MongoSyncBucketStorage
   /**
    * How many operations to sample when estimating a bucket's row count.
    *
-   * {@link storage.estimateDistinctRows} recovers the true row count from how often the sample lands on the
-   * same row twice ("collisions"). A bucket with `R` rows produces collisions only once the sample size
-   * approaches `sqrt(R)`, and needs roughly `sqrt(100 * R)` before they carry a usable signal. `R` is unknown
-   * up front but is bounded by the operation count, so sampling `sqrt(200 * operations)` operations yields on
-   * the order of 100 expected collisions even in the worst case of one row per operation - enough to keep the
-   * estimate stable rather than swinging with sampling noise. Clamped to [MIN, MAX] to bound per-bucket cost;
-   * above the MAX-implied width the estimate degrades gracefully (only for buckets both very wide and barely
-   * fragmented, which are not the fragmented offenders the report exists to surface).
+   * {@link storage.estimateDistinctRows} infers the row count from how often the sample lands on the same
+   * row twice, so the sample must be large enough to contain such repeats. Sampling `sqrt(200 * operations)`
+   * operations yields on the order of 100 expected repeats even in the worst case of one row per operation,
+   * which keeps the estimate stable instead of swinging with sampling noise. The clamp bounds per-bucket
+   * cost; past the cap only very wide, barely fragmented buckets lose accuracy, and those are not the
+   * offenders the report exists to surface.
    */
   protected bucketRowSampleTarget(operations: number): number {
     const target = Math.ceil(Math.sqrt(200 * operations));
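
For readers following the reasoning in this comment, here is a minimal TypeScript sketch of the sizing rule. The clamp bounds `MIN_SAMPLE` and `MAX_SAMPLE` and both function names are hypothetical, since the hunk is cut off before the real constants appear. The `expectedRepeats` helper uses the birthday-style approximation `n * (n - 1) / (2 * R)`, which is one plausible reading of where the "about 100 repeats" figure comes from, not something the diff states outright.

```typescript
// Hypothetical clamp bounds: the real MIN/MAX values are not visible in this diff.
const MIN_SAMPLE = 1_000;
const MAX_SAMPLE = 100_000;

// Sample size from the comment's rule, clamped to bound per-bucket cost.
function sampleTargetSketch(
  operations: number,
  min: number = MIN_SAMPLE,
  max: number = MAX_SAMPLE
): number {
  const target = Math.ceil(Math.sqrt(200 * operations));
  return Math.min(max, Math.max(min, target));
}

// Birthday-style approximation (an assumption of this sketch): the expected
// number of repeated pairs when `sampleSize` independent draws land on
// `rows` equally likely rows.
function expectedRepeats(sampleSize: number, rows: number): number {
  return (sampleSize * (sampleSize - 1)) / (2 * rows);
}

// Worst case from the comment: one row per operation, so rows === operations.
const operations = 5_000_000;
const sampleSize = sampleTargetSketch(operations);
const repeats = expectedRepeats(sampleSize, operations);
console.log(sampleSize, repeats.toFixed(1));
```

Under that approximation the repeat count stays near 100 for any bucket whose unclamped target falls inside the bounds, because `n^2 / (2 * operations)` reduces to `200 / 2`.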

packages/service-core/src/storage/bucket-report.ts

Lines changed: 9 additions & 8 deletions
@@ -95,16 +95,17 @@ export function resolveBucketReportLimit(limit?: number): number {
 }

 /**
- * Estimate the true distinct row count of a bucket from a sample of its operations.
+ * Estimate the true distinct row count of a bucket from a random sample of its operations.
  *
- * Each operation is included in the sample with probability `r = sampledOps / operations`, so a row with
- * `k` operations is seen with probability `1 - (1 - r)^k`. Assuming operations are spread roughly evenly
- * across rows (so each of `R` rows has about `operations / R` of them), the expected number of distinct
- * rows in the sample is `R * (1 - (1 - r)^(operations / R))`. This is monotonic in `R`, so we binary-search
- * for the `R` that matches the observed distinct count.
+ * The signal is repetition: a sample that keeps landing on the same rows means few rows, while a sample
+ * where every operation lands on a new row means many. Formally, each operation is included in the sample
+ * with probability `r = sampledOps / operations`, so a row with `k` operations appears with probability
+ * `1 - (1 - r)^k`. Assuming operations are spread roughly evenly across `R` rows (`k = operations / R`),
+ * the expected number of distinct rows in the sample is `R * (1 - (1 - r)^(operations / R))`. That grows
+ * with `R`, so a binary search finds the `R` matching the observed distinct count.
  *
- * The naive `distinctRows / r` over-counts rows (and so under-states fragmentation) whenever the sample
- * already covered most rows - exactly the highly-fragmented buckets the report exists to surface.
+ * The naive `distinctRows / r` ignores repetition and over-counts rows (under-stating fragmentation) on
+ * exactly the highly fragmented buckets the report exists to surface.
  *
  * Pure (no I/O) so it is unit-testable; storage adapters supply the sampled counts.
  */
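
The function body is not part of this diff, so the following is only a sketch of the approach the comment describes, assuming a signature of `(operations, sampledOps, distinctSampled)`. The names `expectedDistinct` and `estimateDistinctRowsSketch` are hypothetical and may not match the real `estimateDistinctRows`.

```typescript
// Expected distinct rows in the sample if `rows` rows share `operations`
// evenly and each operation is sampled with probability r.
function expectedDistinct(rows: number, operations: number, sampledOps: number): number {
  const r = Math.min(1, sampledOps / operations);
  return rows * (1 - Math.pow(1 - r, operations / rows));
}

// Binary search for the row count whose expected distinct count matches
// the observed one. Relies on expectedDistinct increasing with `rows`.
function estimateDistinctRowsSketch(
  operations: number,
  sampledOps: number,
  distinctSampled: number
): number {
  if (operations <= 0 || sampledOps <= 0 || distinctSampled <= 0) return 0;
  // A full scan needs no estimation.
  if (sampledOps >= operations) return distinctSampled;

  let lo = distinctSampled; // cannot have fewer rows than were observed
  let hi = operations; // cannot have more rows than operations
  for (let i = 0; i < 64; i++) {
    const mid = (lo + hi) / 2;
    if (expectedDistinct(mid, operations, sampledOps) < distinctSampled) {
      lo = mid;
    } else {
      hi = mid;
    }
  }
  return Math.round((lo + hi) / 2);
}

// A highly fragmented bucket: 1,000,000 operations spread over 1,000 rows,
// sampled at r = 0.01. Nearly every row shows up in the sample.
const observed = expectedDistinct(1_000, 1_000_000, 10_000);
const estimate = estimateDistinctRowsSketch(1_000_000, 10_000, observed);
const naive = observed / (10_000 / 1_000_000);
console.log({ observed, estimate, naive });
```

In this example the naive `distinctRows / r` lands near 100,000 rows, while the binary search recovers a value close to the 1,000 rows used to generate the input, which is the over-counting the comment warns about.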
