Skip to content

Commit fb85e35

Browse files
neilconwaymartin-g
andauthored
fix: Prefer numeric in type coercion for comparisons (#20426)
## Which issue does this PR close? - Closes #15161. ## Rationale for this change In a comparison between a numeric column and a string literal (e.g., `WHERE int_col < '10'`), we previously coerced the numeric column to be a string type. This resulted in doing a lexicographic comparison, which results in incorrect query results. Instead, we split type coercion into two situations: type coercion for comparisons (including `IN` lists, `BETWEEN`, and `CASE WHEN`), where we want string->numeric coercion, and type coercion for places like `UNION` or `CASE ... THEN/ELSE`, where DataFusion's traditional behavior has been to tolerate type mismatching by coercing values to strings. Here is a (not necessarily exhaustive) summary of the behavioral changes (old -> new): ``` Comparisons (=, <, >, etc.): float_col = '5' : string (wrong: '5'!='5.0') -> numeric int_col > '100' : string (wrong: '325'<'100') -> numeric int_col = 'hello' : string, always false -> cast error int_col = '' : string, always false -> cast error int_col = '99.99' : string, always false -> cast error Dict(Int) = '5' : string -> numeric REE(Int) = '5' : string -> numeric struct(int)=struct(str): int field to Utf8 -> str field to int IN lists: float_col IN ('1.0') : string (wrong: '1.0'!='1') -> numeric str_col IN ('a', 1) : coerce to Utf8 -> coerce to Int64 CASE: CASE str WHEN float : coerce to Utf8 -> coerce to Float LIKE / regex: Dict(Int) LIKE '%5%' : coerce to Utf8 -> error (matches int) REE(Int) LIKE '%5%' : coerce to Utf8 -> error (matches int) Dict(Int) ~ '5' : coerce to Utf8 -> error (matches int) REE(Int) ~ '5' : error (no REE) -> error (REE added) REE(Utf8) ~ '5' : error (no REE) -> works (REE added) ``` ## What changes are included in this PR? * Update `comparison_coercion` to coerce strings to numerics * Remove previous `comparison_coercion_numeric` function * Add a new function, `type_union_coercion`, and use it when appropriate * Add support for REE types with regexp operators (this was unsupported for no good reason I can see) * Add unit and SLT tests for new coercion behavior * Update existing SLT tests for changes in coercion behavior * Fix the ClickBench unparser tests to avoid comparing int fields with non-numeric string literals ## Are these changes tested? Yes. New tests added, existing tests pass. ## Are there any user-facing changes? Yes, see table above. In most cases the new behavior should be more sensible and less error-prone, but it will likely break some user code. --------- Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
1 parent 777ecb8 commit fb85e35

File tree

24 files changed

+1065
-393
lines changed

24 files changed

+1065
-393
lines changed

datafusion/core/src/physical_planner.rs

Lines changed: 4 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -3574,21 +3574,17 @@ mod tests {
35743574
}
35753575

35763576
#[tokio::test]
3577-
async fn in_list_types() -> Result<()> {
3578-
// expression: "a in ('a', 1)"
3577+
async fn in_list_types_mixed_string_int_error() -> Result<()> {
3578+
// expression: "c1 in ('a', 1)" where c1 is Utf8
35793579
let list = vec![lit("a"), lit(1i64)];
35803580
let logical_plan = test_csv_scan()
35813581
.await?
3582-
// filter clause needs the type coercion rule applied
35833582
.filter(col("c12").lt(lit(0.05)))?
35843583
.project(vec![col("c1").in_list(list, false)])?
35853584
.build()?;
3586-
let execution_plan = plan(&logical_plan).await?;
3587-
// verify that the plan correctly adds cast from Int64(1) to Utf8, and the const will be evaluated.
3588-
3589-
let expected = r#"expr: BinaryExpr { left: BinaryExpr { left: Column { name: "c1", index: 0 }, op: Eq, right: Literal { value: Utf8("a"), field: Field { name: "lit", data_type: Utf8 } }, fail_on_overflow: false }"#;
3585+
let e = plan(&logical_plan).await.unwrap_err().to_string();
35903586

3591-
assert_contains!(format!("{execution_plan:?}"), expected);
3587+
assert_contains!(&e, "Cannot cast string 'a' to value of Int64 type");
35923588

35933589
Ok(())
35943590
}

datafusion/core/tests/expr_api/mod.rs

Lines changed: 19 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -342,20 +342,26 @@ fn test_create_physical_expr_nvl2() {
342342

343343
#[tokio::test]
344344
async fn test_create_physical_expr_coercion() {
345-
// create_physical_expr does apply type coercion and unwrapping in cast
345+
// create_physical_expr applies type coercion (and can unwrap/fold
346+
// literal casts). Comparison coercion prefers numeric types, so
347+
// string/int comparisons cast the string side to the numeric type.
346348
//
347-
// expect the cast on the literals
348-
// compare string function to int `id = 1`
349-
create_expr_test(col("id").eq(lit(1i32)), "id@0 = CAST(1 AS Utf8)");
350-
create_expr_test(lit(1i32).eq(col("id")), "CAST(1 AS Utf8) = id@0");
351-
// compare int col to string literal `i = '202410'`
352-
// Note this casts the column (not the field)
353-
create_expr_test(col("i").eq(lit("202410")), "CAST(i@1 AS Utf8) = 202410");
354-
create_expr_test(lit("202410").eq(col("i")), "202410 = CAST(i@1 AS Utf8)");
355-
// however, when simplified the casts on i should removed
356-
// https://github.com/apache/datafusion/issues/14944
357-
create_simplified_expr_test(col("i").eq(lit("202410")), "CAST(i@1 AS Utf8) = 202410");
358-
create_simplified_expr_test(lit("202410").eq(col("i")), "CAST(i@1 AS Utf8) = 202410");
349+
// string column vs int literal: id (Utf8) is cast to Int32
350+
create_expr_test(col("id").eq(lit(1i32)), "CAST(id@0 AS Int32) = 1");
351+
create_expr_test(lit(1i32).eq(col("id")), "1 = CAST(id@0 AS Int32)");
352+
// int column vs string literal: the string literal is cast to Int64
353+
create_expr_test(col("i").eq(lit("202410")), "i@1 = CAST(202410 AS Int64)");
354+
create_expr_test(lit("202410").eq(col("i")), "CAST(202410 AS Int64) = i@1");
355+
// The simplifier operates on the logical expression before type
356+
// coercion adds the CAST, so the output is unchanged.
357+
create_simplified_expr_test(
358+
col("i").eq(lit("202410")),
359+
"i@1 = CAST(202410 AS Int64)",
360+
);
361+
create_simplified_expr_test(
362+
lit("202410").eq(col("i")),
363+
"i@1 = CAST(202410 AS Int64)",
364+
);
359365
}
360366

361367
/// Evaluates the specified expr as an aggregate and compares the result to the

datafusion/core/tests/sql/unparser.rs

Lines changed: 13 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -142,16 +142,26 @@ fn tpch_queries() -> Vec<TestQuery> {
142142
}
143143

144144
/// Create a new SessionContext for testing that has all Clickbench tables registered.
145+
///
146+
/// Registers the raw Parquet as `hits_raw`, then creates a `hits` view that
147+
/// casts `EventDate` from UInt16 (day-offset) to DATE. This mirrors the
148+
/// approach used by the benchmark runner in `benchmarks/src/clickbench.rs`.
145149
async fn clickbench_test_context() -> Result<SessionContext> {
146150
let ctx = SessionContext::new();
147151
ctx.register_parquet(
148-
"hits",
152+
"hits_raw",
149153
"tests/data/clickbench_hits_10.parquet",
150154
ParquetReadOptions::default(),
151155
)
152156
.await?;
153-
// Sanity check we found the table by querying it's schema, it should not be empty
154-
// Otherwise if the path is wrong the tests will all fail in confusing ways
157+
ctx.sql(
158+
r#"CREATE VIEW hits AS
159+
SELECT * EXCEPT ("EventDate"),
160+
CAST(CAST("EventDate" AS INTEGER) AS DATE) AS "EventDate"
161+
FROM hits_raw"#,
162+
)
163+
.await?;
164+
// Sanity check we found the table by querying its schema
155165
let df = ctx.sql("SELECT * FROM hits LIMIT 1").await?;
156166
assert!(
157167
!df.schema().fields().is_empty(),

datafusion/expr-common/src/interval_arithmetic.rs

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -22,7 +22,7 @@ use std::fmt::{self, Display, Formatter};
2222
use std::ops::{AddAssign, SubAssign};
2323

2424
use crate::operator::Operator;
25-
use crate::type_coercion::binary::{BinaryTypeCoercer, comparison_coercion_numeric};
25+
use crate::type_coercion::binary::{BinaryTypeCoercer, comparison_coercion};
2626

2727
use arrow::compute::{CastOptions, cast_with_options};
2828
use arrow::datatypes::{
@@ -730,7 +730,7 @@ impl Interval {
730730
(self.lower.clone(), self.upper.clone(), rhs.clone())
731731
} else {
732732
let maybe_common_type =
733-
comparison_coercion_numeric(&self.data_type(), &rhs.data_type());
733+
comparison_coercion(&self.data_type(), &rhs.data_type());
734734
assert_or_internal_err!(
735735
maybe_common_type.is_some(),
736736
"Data types must be compatible for containment checks, lhs:{}, rhs:{}",

datafusion/expr-common/src/signature.rs

Lines changed: 11 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -158,7 +158,7 @@ pub enum Arity {
158158
pub enum TypeSignature {
159159
/// One or more arguments of a common type out of a list of valid types.
160160
///
161-
/// For functions that take no arguments (e.g. `random()` see [`TypeSignature::Nullary`]).
161+
/// For functions that take no arguments (e.g. `random()`), see [`TypeSignature::Nullary`].
162162
///
163163
/// # Examples
164164
///
@@ -184,38 +184,39 @@ pub enum TypeSignature {
184184
Uniform(usize, Vec<DataType>),
185185
/// One or more arguments with exactly the specified types in order.
186186
///
187-
/// For functions that take no arguments (e.g. `random()`) use [`TypeSignature::Nullary`].
187+
/// For functions that take no arguments (e.g. `random()`), use [`TypeSignature::Nullary`].
188188
Exact(Vec<DataType>),
189189
/// One or more arguments belonging to the [`TypeSignatureClass`], in order.
190190
///
191191
/// [`Coercion`] contains not only the desired type but also the allowed
192192
/// casts. For example, if you expect a function has string type, but you
193193
/// also allow it to be casted from binary type.
194194
///
195-
/// For functions that take no arguments (e.g. `random()`) see [`TypeSignature::Nullary`].
195+
/// For functions that take no arguments (e.g. `random()`), see [`TypeSignature::Nullary`].
196196
Coercible(Vec<Coercion>),
197197
/// One or more arguments coercible to a single, comparable type.
198198
///
199199
/// Each argument will be coerced to a single type using the
200-
/// coercion rules described in [`comparison_coercion_numeric`].
200+
/// coercion rules described in [`comparison_coercion`].
201201
///
202202
/// # Examples
203203
///
204204
/// If the `nullif(1, 2)` function is called with `i32` and `i64` arguments
205205
/// the types will both be coerced to `i64` before the function is invoked.
206206
///
207207
/// If the `nullif('1', 2)` function is called with `Utf8` and `i64` arguments
208-
/// the types will both be coerced to `Utf8` before the function is invoked.
208+
/// the types will both be coerced to `Int64` before the function is invoked
209+
/// (numeric is preferred over string).
209210
///
210211
/// Note:
211-
/// - For functions that take no arguments (e.g. `random()` see [`TypeSignature::Nullary`]).
212+
/// - For functions that take no arguments (e.g. `random()`), see [`TypeSignature::Nullary`].
212213
/// - If all arguments have type [`DataType::Null`], they are coerced to `Utf8`
213214
///
214-
/// [`comparison_coercion_numeric`]: crate::type_coercion::binary::comparison_coercion_numeric
215+
/// [`comparison_coercion`]: crate::type_coercion::binary::comparison_coercion
215216
Comparable(usize),
216217
/// One or more arguments of arbitrary types.
217218
///
218-
/// For functions that take no arguments (e.g. `random()`) use [`TypeSignature::Nullary`].
219+
/// For functions that take no arguments (e.g. `random()`), use [`TypeSignature::Nullary`].
219220
Any(usize),
220221
/// Matches exactly one of a list of [`TypeSignature`]s.
221222
///
@@ -233,7 +234,7 @@ pub enum TypeSignature {
233234
///
234235
/// See [`NativeType::is_numeric`] to know which type is considered numeric
235236
///
236-
/// For functions that take no arguments (e.g. `random()`) use [`TypeSignature::Nullary`].
237+
/// For functions that take no arguments (e.g. `random()`), use [`TypeSignature::Nullary`].
237238
///
238239
/// [`NativeType::is_numeric`]: datafusion_common::types::NativeType::is_numeric
239240
Numeric(usize),
@@ -246,7 +247,7 @@ pub enum TypeSignature {
246247
/// For example, if a function is called with (utf8, large_utf8), all
247248
/// arguments will be coerced to `LargeUtf8`
248249
///
249-
/// For functions that take no arguments (e.g. `random()` use [`TypeSignature::Nullary`]).
250+
/// For functions that take no arguments (e.g. `random()`), use [`TypeSignature::Nullary`].
250251
String(usize),
251252
/// No arguments
252253
Nullary,

0 commit comments

Comments
 (0)