We have this code in our sampling compressor (which has existed for a long time, even before the most recent changes):
let after = scheme
.compress(compressor, &mut sample_data, sample_ctx)?
.nbytes();
let before = sample_data.array().nbytes();
let ratio = before as f64 / after as f64;
tracing::debug!("estimate_compression_ratio_with_sampling(compressor={scheme:#?}) = {ratio}",);
Ok(ratio)
If I add some checks:
let mut sample_data = ArrayAndStats::new(sample_array, scheme.stats_options());
let cascade_history = ctx.cascade_history().to_vec();
let sample_ctx = ctx.with_sampling();
let before = sample_data.array().nbytes();
let after = scheme
.compress(compressor, &mut sample_data, sample_ctx)?
.nbytes();
if after == 0 {
tracing::warn!(
scheme = %scheme.id(),
?cascade_history,
"sample compressed to 0 bytes, which should only happen for constant arrays",
);
}
we get hundreds of warnings saying that bitpacking compresses to 0 bytes. This also means that the ratio ends up being infinity, which we interpret as invalid.
Here are some examples. These lists are in order of descent, so the first in the list is the parent scheme.
WARN vortex_compressor::estimate: vortex-compressor/src/estimate.rs:129: sample compressed to 0 bytes, which should only happen for constant arrays scheme=vortex.int.bitpacking
cascade_history=[
(SchemeId { name: "vortex.string.dict" }, 0),
(SchemeId { name: "vortex.string.fsst" }, 0),
(SchemeId { name: "vortex.int.dict" }, 1)]
utf8 encoded as dict with values as fsst, fsst lengths as dict, and the codes as bitpacked
WARN vortex_compressor::estimate: vortex-compressor/src/estimate.rs:129: sample compressed to 0 bytes, which should only happen for constant arrays scheme=vortex.int.bitpacking
cascade_history=[
(SchemeId { name: "vortex.string.fsst" }, 0),
(SchemeId { name: "vortex.int.rle" }, 1)]
utf8 encoded as fsst with lengths as rle, and then rle indices as bitpacking
We have this code in our sampling compressor (which has existed for a long time, even before the most recent changes):
If I add some checks:
we get hundreds of warnings saying that bitpacking compresses to 0 bytes. This also means that the ratio ends up being infinity, which we interpret as invalid.
Here are some examples. These lists are in order of descent, so the first in the list is the parent scheme.
utf8 encoded as dict with values as fsst, fsst lengths as dict, and the codes as bitpacked
utf8 encoded as fsst with lengths as rle, and then rle indices as bitpacking