feat: add try_cast function for safe type conversion#6960
feat: add try_cast function for safe type conversion#6960XuQianJin-Stars wants to merge 2 commits into
Conversation
Greptile SummaryThis PR implements
Confidence Score: 3/5Merging as-is silently renames output columns for all unaliased cast expressions and breaks any Python expression-visitor that has not added Three independent issues affect existing behavior without any opt-in: unaliased src/daft-dsl/src/expr/mod.rs (display_name format), src/daft-dsl/src/visitor.rs (visit_try_cast dispatch), and src/daft-stats/src/table_stats.rs (try_cast stats handling) all need attention before merging. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[User calls try_cast] --> B{Source}
B -->|Python API| C[Expression.try_cast / Series.try_cast]
B -->|SQL| D[TRY_CAST / SAFE_CAST in planner.rs]
C --> E["Expr::Cast(expr, dtype, true)"]
D --> E
E --> F[RecordBatch eval_expression]
F --> G[Series::try_cast]
G --> H{Bulk cast attempt}
H -->|Success| I[Return casted Series]
H -->|Failure| J[Element-wise fallback loop]
J --> K{Per-element cast}
K -->|Ok| L[Append element]
K -->|Err| M[Append null]
L --> N[Series::concat all elements]
M --> N
N --> I
Reviews (1): Last reviewed commit: "feat: add try_cast function for safe typ..." | Re-trigger Greptile |
| Self::Cast(expr, dtype, try_cast) => { | ||
| let child_id = expr.display_name(schema)?; | ||
| Ok(format!("{child_id} to {dtype}")) | ||
| let prefix = if *try_cast { "try_cast" } else { "cast" }; | ||
| Ok(format!("{child_id} to {dtype} ({prefix})")) | ||
| } |
There was a problem hiding this comment.
display_name format change silently renames unaliased cast columns
The format for a regular cast expression changed from "a to Int64" to "a to Int64 (cast)". Since display_name feeds into the output column name when no alias is provided (see daft-local-execution/src/intermediate_ops/project.rs), this is a silent behavioral regression: any user who does df.select(col("a").cast(dtype)) without an explicit alias will see their column renamed, breaking downstream code that accesses it by name.
| Expr::Cast(col, dtype, _try_cast) => self | ||
| .eval_expression(&BoundExpr::new_unchecked(col.clone()))? | ||
| .cast(dtype), |
There was a problem hiding this comment.
Stats evaluation errors for
try_cast expressions instead of returning Missing
When _try_cast is true, a failure in ColumnRangeStatistics::cast propagates as an error instead of gracefully falling back to Missing. For a try_cast expression that cannot cast the statistical bounds (e.g., casting string bounds to int), the stats layer should return Missing rather than hard-failing, matching the semantics of try_cast at the data level.
| Expr::Cast(col, dtype, _try_cast) => self | |
| .eval_expression(&BoundExpr::new_unchecked(col.clone()))? | |
| .cast(dtype), | |
| Expr::Cast(col, dtype, try_cast) => { | |
| let stats = self.eval_expression(&BoundExpr::new_unchecked(col.clone()))?; | |
| if *try_cast { | |
| stats.cast(dtype).unwrap_or(ColumnRangeStatistics::Missing) | |
| } else { | |
| stats.cast(dtype)? | |
| } | |
| } |
| fn visit_cast( | ||
| &self, | ||
| expr: &ExprRef, | ||
| data_type: &DataType, | ||
| try_cast: bool, | ||
| ) -> PyVisitorResult<'py> { | ||
| let attr = if try_cast { | ||
| "visit_try_cast" | ||
| } else { | ||
| "visit_cast" | ||
| }; | ||
| let args = (self.to_expr(expr)?, self.to_data_type(data_type)?); | ||
| self.visitor.call_method1(attr, args) | ||
| } |
There was a problem hiding this comment.
Visitor dispatches to
visit_try_cast, breaking existing Python visitors
When a try_cast expression is visited, visit_try_cast is called on the Python visitor object. Any existing Python visitor that implements visit_cast but not visit_try_cast will receive an AttributeError at runtime as soon as it encounters a try_cast node in the expression tree. A backward-compatible approach would be to check whether the visitor has visit_try_cast and fall back to visit_cast (passing the try_cast flag as a third argument) if it does not.
| // Try bulk cast first, return directly if successful | ||
| match self.cast(datatype) { | ||
| Ok(casted) => Ok(casted), | ||
| Err(_) => { | ||
| // Bulk cast failed, try element-wise | ||
| let len = self.len(); | ||
| let mut results: Vec<Series> = Vec::with_capacity(len); | ||
|
|
||
| for i in 0..len { | ||
| let element = self.slice(i, i + 1)?; | ||
| match element.cast(datatype) { | ||
| Ok(casted) => results.push(casted), | ||
| Err(_) => { | ||
| results.push(Series::full_null(self.name(), datatype, 1)); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // Concatenate all single-element Series | ||
| if results.is_empty() { | ||
| Ok(Series::empty(self.name(), datatype)) | ||
| } else { | ||
| let ref_results: Vec<&Series> = results.iter().collect(); | ||
| Series::concat(&ref_results) | ||
| } | ||
| } |
There was a problem hiding this comment.
Element-wise fallback is O(n) slice+cast+concat and can be very slow at scale
When the bulk cast fails the implementation falls back to a loop that calls self.slice(i, i+1), element.cast(datatype), and Series::concat for every row. For a column with millions of rows that all need the fallback (e.g., mixed strings cast to int), this creates millions of single-element Series objects and concatenation calls, which will be orders of magnitude slower than a vectorized implementation. Worth at least noting in a comment, or documenting as a known limitation.
2ebd43d to
a97e312
Compare
Implements try_cast which returns null instead of raising an error when type conversion fails. This is the Spark-compatible TRY_CAST function. Changes: - Extended Expr::Cast variant with a bool flag (true = try_cast mode) - Added try_cast method to Series (element-wise fallback on failure) - Added TRY_CAST SQL syntax support via CastKind::TryCast/SafeCast - Added Python API: Expression.try_cast(), Series.try_cast(), daft.functions.try_cast() - Updated all Expr::Cast match sites across the codebase - Added comprehensive tests
1c28e46 to
f4122ca
Compare
Implements try_cast which returns null instead of raising an error when type conversion fails. This is the Spark-compatible
TRY_CASTfunction.Changes:
TRY_CASTSQL syntax support viaCastKind::TryCast/SafeCastExpression.try_cast(),Series.try_cast(),daft.functions.try_cast()Expr::Castmatch sites across the codebaseChanges Made
Rust Core (daft-dsl, daft-core):
src/daft-dsl/src/expr/mod.rs— ExtendedExpr::Cast(ExprRef, DataType, bool)with try_cast flagsrc/daft-dsl/src/expr/visitor.rs— Updated visitor pattern for new Cast signaturesrc/daft-dsl/src/python.rs— Addedtry_cast()PyO3 bindingsrc/daft-core/src/series/ops/cast.rs— ImplementedSeries::try_cast()with bulk-first, element-wise fallback strategySQL Support (daft-sql):
src/daft-sql/src/planner.rs— HandleCastKind::TryCastandCastKind::SafeCastin SQL plannerLogical/Physical Plan:
src/daft-logical-plan/src/ops/project.rs— Updated semantic ID replacement for Castsrc/daft-logical-plan/src/optimization/rules/push_down_projection.rs— Updated projection pushdownsrc/daft-physical-plan/src/translation/translate.rs— Updated physical plan translationPython API:
daft/expressions/expressions.py— AddedExpression.try_cast()methoddaft/series.py— AddedSeries.try_cast()methoddaft/functions/__init__.py— Addedtry_cast()top-level functionTests:
tests/series/test_try_cast.py— Comprehensive test coverage for various type conversion scenariosRelated Issues
Closes #6959