fix: return None from SUM when no values were collected#2878
Conversation
db4b308 to
f62c345
Compare
fd8250a to
b29a7be
Compare
b29a7be to
65a9e03
Compare
## What Return `None` from `IntermediateSum::finalize()` when `count == 0`. ## Why `IntermediateSum::finalize()` returned `Some(0.0)` even when all documents had missing/NULL values, unlike MIN, MAX, and AVG which all return `None` for `count == 0`. The `0.0` came from `IntermediateStats`' default sum initialization. Consumers (like ParadeDB) that map `None` to SQL `NULL` were incorrectly getting `0` for SUM on all-NULL groups. Fixes paradedb/paradedb#4621 ## How Check `count == 0` before returning the sum value, returning `None` instead of `Some(0.0)`. ## Tests - Verify SUM returns `None` when all values are missing/NULL - Verify SUM still returns correct values for non-empty groups - Verify MIN, MAX, AVG behavior unchanged Upstream PR: quickwit-oss#2878
## What Return `None` from `IntermediateSum::finalize()` when `count == 0`. ## Why `IntermediateSum::finalize()` returned `Some(0.0)` even when all documents had missing/NULL values, unlike MIN, MAX, and AVG which all return `None` for `count == 0`. The `0.0` came from `IntermediateStats`' default sum initialization. Consumers (like ParadeDB) that map `None` to SQL `NULL` were incorrectly getting `0` for SUM on all-NULL groups. Fixes paradedb/paradedb#4621 ## How Check `count == 0` before returning the sum value, returning `None` instead of `Some(0.0)`. ## Tests - Verify SUM returns `None` when all values are missing/NULL - Verify SUM still returns correct values for non-empty groups - Verify MIN, MAX, AVG behavior unchanged Upstream PR: quickwit-oss#2878
IntermediateSum::finalize() returned Some(0.0) even when count==0 (all documents had missing/NULL values). This differs from MIN, MAX, and AVG which all return None for count==0. The 0.0 came from IntermediateStats' default sum initialization. Consumers (like ParadeDB) that map None to SQL NULL were incorrectly getting 0 for SUM on all-NULL groups. Fixes paradedb/paradedb#4621
65a9e03 to
32d07c0
Compare
| /// `IntermediateMin`, `IntermediateMax`, and `IntermediateAvg`. | ||
| pub fn finalize(&self) -> Option<f64> { | ||
| Some(self.stats.finalize().sum) | ||
| let stats = self.stats.finalize(); |
There was a problem hiding this comment.
Shouldn't we change sum on the Stats struct to Option instead?
Update: actually that would make de/serialization a bit weird with serde
There was a problem hiding this comment.
I don't understand the spec. Sum of 0 elements is defined. Why do we return an Option?
What does elastic return?
There was a problem hiding this comment.
on the struct itself we have some freedom how to express null, only the serialization needs to follow elastic search spec
There was a problem hiding this comment.
I meant the sum of no number is defined. It is 0. Why are we having this discussion? Is this about reproducing a glitch in Elasticsearch?
There was a problem hiding this comment.
Confirmation of the behavior
https://onecompiler.com/postgresql/44jr6em28
Aggregation are not meant to align themselves with SQL. We want elasticsearch behavior by default.
There was a problem hiding this comment.
Yes, but we match elastic search behavior only on the serialized input and output, not on the intermediate formats.
Currently the change is a breaking change, because of
/// Sum metric result.
Sum(SingleMetricResult),
/// Single-metric aggregations use this common result structure.
///
/// Main reason to wrap it in value is to match elasticsearch output structure.
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct SingleMetricResult {
/// The value of the single value metric.
pub value: Option<f64>,
}Currently sum is weird (on main), SingleMetricResult is Option, but it's never None.
I think we should fix that one way or the other.
@mdashti Can you check how elasticsearch behaves?
There was a problem hiding this comment.
On flipping Stats.sum itself to Option, agree the serde dance isn't worth it. Kept IntermediateStats untouched and branched only inside IntermediateSum::finalize so the broader stats path stays as-is.
There was a problem hiding this comment.
Checked ES on this. Sum on an empty / all-missing bucket comes back as "value": 0, while min/max/avg return "value": null. So strict ES parity does mean Some(0.0) for sum, and you're right that this PR diverges on the wire.
The reason we leaned toward None: ParadeDB's SQL SUM over zero rows is NULL, and Some(0.0) is indistinguishable from a bucket that genuinely sums to zero, so the SQL boundary loses that signal. Hence the asymmetry feels worse than the ES divergence from where we sit.
Re the breaking-change point, fair. SingleMetricResult.value is already Option<f64> for sum but never None on main, so consumers that already match on Option won't blow up. Anything coercing null → 0 to mirror the ES wire format will see a diff though.
Pulled in 4486166 a small doc note on finalize calling out the ES divergence and the SQL motivation, so future readers see the trade-off without rediscovering this thread.
Happy to take this whichever way you want, keep None, revert to Some(0.0) and just document the Option-but-never-None quirk, or gate the None behind something opt-in.
There was a problem hiding this comment.
we could add a none_if_no_match: Option to let users opt-in into a postgres-like behavior if they want. We don't want to differ from ES, but we could add our own extensions. wdyt?
@trinity-1686a Love the opt-in framing. Pushed 4daa35d in that shape.
SumAggregationgets a newnone_if_no_match: Option<bool>(skipped on serialize when unset, so JSON stays clean for the ES default path).IntermediateSum::finalizestill returnsOption<f64>internally so the Rust side stays parallel to min/max/avg.- The ES-vs-SQL choice now lives at the boundary in
IntermediateMetricResult::into_final_metric_result:Noneis coerced toSome(0.0)by default, and the flag opts out of that coercion. AddedAggregationVariants::as_sum()to read it. - Two end-to-end tests on an empty index, one for the ES default (
"value": 0) and one fornone_if_no_match: true("value": null). Fullaggregation::suite green locally.
So out of the box the wire format is back to matching ES, and pg-like behaviour just sets the flag on its side.
## What Return `None` from `IntermediateSum::finalize()` when `count == 0`. ## Why `IntermediateSum::finalize()` returned `Some(0.0)` even when all documents had missing/NULL values, unlike MIN, MAX, and AVG which all return `None` for `count == 0`. The `0.0` came from `IntermediateStats`' default sum initialization. Consumers (like ParadeDB) that map `None` to SQL `NULL` were incorrectly getting `0` for SUM on all-NULL groups. Fixes paradedb/paradedb#4621 ## How Check `count == 0` before returning the sum value, returning `None` instead of `Some(0.0)`. ## Tests - Verify SUM returns `None` when all values are missing/NULL - Verify SUM still returns correct values for non-empty groups - Verify MIN, MAX, AVG behavior unchanged Upstream PR: quickwit-oss#2878
## What Return `None` from `IntermediateSum::finalize()` when `count == 0`. ## Why `IntermediateSum::finalize()` returned `Some(0.0)` even when all documents had missing/NULL values, unlike MIN, MAX, and AVG which all return `None` for `count == 0`. The `0.0` came from `IntermediateStats`' default sum initialization. Consumers (like ParadeDB) that map `None` to SQL `NULL` were incorrectly getting `0` for SUM on all-NULL groups. Fixes paradedb/paradedb#4621 ## How Check `count == 0` before returning the sum value, returning `None` instead of `Some(0.0)`. ## Tests - Verify SUM returns `None` when all values are missing/NULL - Verify SUM still returns correct values for non-empty groups - Verify MIN, MAX, AVG behavior unchanged Upstream PR: quickwit-oss#2878
Surface the trade-off in the doc comment so future reviewers see why this differs from ES (which returns "value": 0 for sum over empty/all-missing buckets) and what consumers (ParadeDB SQL NULL) the None variant is meant to serve.
mdashti
left a comment
There was a problem hiding this comment.
@fulmicoton @PSeitz Thanks for the comments.
| /// `IntermediateMin`, `IntermediateMax`, and `IntermediateAvg`. | ||
| pub fn finalize(&self) -> Option<f64> { | ||
| Some(self.stats.finalize().sum) | ||
| let stats = self.stats.finalize(); |
There was a problem hiding this comment.
On flipping Stats.sum itself to Option, agree the serde dance isn't worth it. Kept IntermediateStats untouched and branched only inside IntermediateSum::finalize so the broader stats path stays as-is.
| /// `IntermediateMin`, `IntermediateMax`, and `IntermediateAvg`. | ||
| pub fn finalize(&self) -> Option<f64> { | ||
| Some(self.stats.finalize().sum) | ||
| let stats = self.stats.finalize(); |
There was a problem hiding this comment.
Checked ES on this. Sum on an empty / all-missing bucket comes back as "value": 0, while min/max/avg return "value": null. So strict ES parity does mean Some(0.0) for sum, and you're right that this PR diverges on the wire.
The reason we leaned toward None: ParadeDB's SQL SUM over zero rows is NULL, and Some(0.0) is indistinguishable from a bucket that genuinely sums to zero, so the SQL boundary loses that signal. Hence the asymmetry feels worse than the ES divergence from where we sit.
Re the breaking-change point, fair. SingleMetricResult.value is already Option<f64> for sum but never None on main, so consumers that already match on Option won't blow up. Anything coercing null → 0 to mirror the ES wire format will see a diff though.
Pulled in 4486166 a small doc note on finalize calling out the ES divergence and the SQL motivation, so future readers see the trade-off without rediscovering this thread.
Happy to take this whichever way you want, keep None, revert to Some(0.0) and just document the Option-but-never-None quirk, or gate the None behind something opt-in.
|
we could add a |
Switch the default serialized output of `sum` on empty / all-missing buckets back to `"value": 0` to match Elasticsearch, and gate the SQL-style `"value": null` behavior behind a new `none_if_no_match: Option<bool>` flag on `SumAggregation`. `IntermediateSum::finalize` still returns `Option<f64>` internally so the Rust API stays parallel to min/max/avg, but the ES-vs-SQL choice is made at the boundary in `IntermediateMetricResult::into_final_metric_result`: `None` is coerced to `Some(0.0)` unless `none_if_no_match` is set on the aggregation request. Adds `AggregationVariants::as_sum()` accessor for that boundary check and two end-to-end tests covering both the default ES behavior and the opt-in null behavior on an empty index.
mdashti
left a comment
There was a problem hiding this comment.
@trinity-1686a @fulmicoton Please take another look.
| /// `IntermediateMin`, `IntermediateMax`, and `IntermediateAvg`. | ||
| pub fn finalize(&self) -> Option<f64> { | ||
| Some(self.stats.finalize().sum) | ||
| let stats = self.stats.finalize(); |
There was a problem hiding this comment.
we could add a none_if_no_match: Option to let users opt-in into a postgres-like behavior if they want. We don't want to differ from ES, but we could add our own extensions. wdyt?
@trinity-1686a Love the opt-in framing. Pushed 4daa35d in that shape.
SumAggregationgets a newnone_if_no_match: Option<bool>(skipped on serialize when unset, so JSON stays clean for the ES default path).IntermediateSum::finalizestill returnsOption<f64>internally so the Rust side stays parallel to min/max/avg.- The ES-vs-SQL choice now lives at the boundary in
IntermediateMetricResult::into_final_metric_result:Noneis coerced toSome(0.0)by default, and the flag opts out of that coercion. AddedAggregationVariants::as_sum()to read it. - Two end-to-end tests on an empty index, one for the ES default (
"value": 0) and one fornone_if_no_match: true("value": null). Fullaggregation::suite green locally.
So out of the box the wire format is back to matching ES, and pg-like behaviour just sets the flag on its side.
The flag's purpose is described well enough by "SQL-style consumers"; no need to call out a specific downstream.
What
Return
NonefromIntermediateSum::finalize()whencount == 0.Why
IntermediateSum::finalize()returnedSome(0.0)even when all documents had missing/NULL values, unlike MIN, MAX, and AVG which all returnNoneforcount == 0. The0.0came fromIntermediateStats' default sum initialization. Consumers (like ParadeDB) that mapNoneto SQLNULLwere incorrectly getting0for SUM on all-NULL groups.How
Check
count == 0before returning the sum value, returningNoneinstead ofSome(0.0).Tests
Nonewhen all values are missing/NULL