Skip to content

Commit 2038f1c

Browse files
committed
store: Fix mistake in handling of histogram bounds in VidBatcher
When setting upa VidBatcher we have both accurate values for the range of vids as well as Postgres' estimate of bounds for a histogram with roughly the same number of entries in each bucket. As an example, say we have min and max of 1 and 100, and histogram bounds [5, 50, 96]. We used to then add min and max to these bounds resulting in an ogive over [1, 5, 50, 96, 100]. With that, it seems that there is a bucket [1, 5] with just as many entries as the bucket [5, 50], which is not what the Posgres staistics indicate. Using this ogive will cause e.g. pruning to increase batch size quickly as it tries to get out of the [1, 5] bucket resulting in a batch size that is way too big for the next bucket and a batch that can take a very long time. The first and last entry of the bounds are Postgres' estimate of the min and max. We now simply replace the first and last bound with our known min and max, resulting in an ogive over [1, 50, 100], which reflects the statistics much more accurately and avoids impossibly short buckets.
1 parent 900f10a commit 2038f1c

1 file changed

Lines changed: 50 additions & 6 deletions

File tree

store/postgres/src/vid_batcher.rs

Lines changed: 50 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -136,12 +136,26 @@ impl VidBatcher {
136136
) -> Result<Self, StoreError> {
137137
let start = range.min;
138138

139-
let bounds = bounds
140-
.into_iter()
141-
.filter(|bound| range.min < *bound && range.max > *bound)
142-
.chain(vec![range.min, range.max].into_iter())
143-
.collect::<Vec<_>>();
144-
139+
let bounds = {
140+
// Keep only histogram bounds that are relevent for the range
141+
let mut bounds = bounds
142+
.into_iter()
143+
.filter(|bound| range.min <= *bound && range.max >= *bound)
144+
.collect::<Vec<_>>();
145+
// The first and last entry in `bounds` are Postgres' estimates
146+
// of the min and max `vid` values in the table. We use the
147+
// actual min and max `vid` values from the `vid_range` instead
148+
let len = bounds.len();
149+
if len > 1 {
150+
bounds[0] = range.min;
151+
bounds[len - 1] = range.max;
152+
} else {
153+
// If Postgres doesn't have a histogram, just use one bucket
154+
// from min to max
155+
bounds = vec![range.min, range.max];
156+
}
157+
bounds
158+
};
145159
let mut ogive = if range.is_empty() {
146160
None
147161
} else {
@@ -363,6 +377,17 @@ mod tests {
363377
}
364378
}
365379

380+
impl std::fmt::Debug for Batcher {
381+
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
382+
f.debug_struct("Batcher")
383+
.field("start", &self.vid.start)
384+
.field("end", &self.vid.end)
385+
.field("size", &self.vid.batch_size.size)
386+
.field("duration", &self.vid.batch_size.target.as_secs())
387+
.finish()
388+
}
389+
}
390+
366391
#[test]
367392
fn simple() {
368393
let bounds = vec![10, 20, 30, 40, 49];
@@ -414,4 +439,23 @@ mod tests {
414439
batcher.at(360, 359, 80);
415440
batcher.step(360, 359, S010);
416441
}
442+
443+
#[test]
444+
fn vid_batcher_adjusts_bounds() {
445+
// The first and last entry in `bounds` are estimats of the min and
446+
// max that are slightly off compared to the actual min and max we
447+
// put in `vid_range`. Check that `VidBatcher` uses the actual min
448+
// and max from `vid_range`.
449+
let bounds = vec![639, 20_000, 40_000, 60_000, 80_000, 90_000];
450+
let vid_range = VidRange::new(1, 100_000);
451+
let batch_size = AdaptiveBatchSize {
452+
size: 1000,
453+
target: S100,
454+
};
455+
456+
let vid_batcher = VidBatcher::new(bounds, vid_range, batch_size).unwrap();
457+
let ogive = vid_batcher.ogive.as_ref().unwrap();
458+
assert_eq!(1, ogive.start());
459+
assert_eq!(100_000, ogive.end());
460+
}
417461
}

0 commit comments

Comments
 (0)