feat: add try_cast function for safe type conversion #6960

Open
XuQianJin-Stars wants to merge 2 commits into Eventual-Inc:main from XuQianJin-Stars:feat/try-cast

@XuQianJin-Stars
Contributor

Implements try_cast which returns null instead of raising an error when type conversion fails. This is the Spark-compatible TRY_CAST function.

Changes:

  • Extended Expr::Cast variant with a bool flag (true = try_cast mode)
  • Added try_cast method to Series (element-wise fallback on failure)
  • Added TRY_CAST SQL syntax support via CastKind::TryCast/SafeCast
  • Added Python API: Expression.try_cast(), Series.try_cast(), daft.functions.try_cast()
  • Updated all Expr::Cast match sites across the codebase
  • Added comprehensive tests
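
The flag-based design in the list above can be illustrated with a small pure-Python model. This is only a sketch and does not use Daft: `Cast`, `evaluate`, and the use of Python's `int` as the "target dtype" are hypothetical stand-ins for `Expr::Cast(expr, dtype, try_cast)` and its evaluation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Cast:
    """Hypothetical stand-in for Expr::Cast(child, dtype, try_cast)."""

    column: str
    convert: Callable[[object], object]  # plays the role of the target dtype
    try_cast: bool = False  # True selects try_cast mode


def evaluate(node: Cast, table: Dict[str, list]) -> List[Optional[object]]:
    """Evaluate a Cast node against a dict-of-lists 'table'."""
    out: List[Optional[object]] = []
    for value in table[node.column]:
        if value is None:
            out.append(None)  # nulls pass through in both modes
            continue
        try:
            out.append(node.convert(value))
        except (ValueError, TypeError):
            if not node.try_cast:
                raise  # plain cast: surface the conversion error
            out.append(None)  # try_cast: a failed conversion becomes null
    return out


table = {"a": ["1", "x", None, "3"]}
lenient = evaluate(Cast("a", int, try_cast=True), table)
print(lenient)
```

With `try_cast=False` the same call raises `ValueError` on the `"x"` element, which is the behavioral difference the boolean flag encodes.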

Changes Made

Rust Core (daft-dsl, daft-core):

  • src/daft-dsl/src/expr/mod.rs — Extended Expr::Cast(ExprRef, DataType, bool) with try_cast flag
  • src/daft-dsl/src/expr/visitor.rs — Updated visitor pattern for new Cast signature
  • src/daft-dsl/src/python.rs — Added try_cast() PyO3 binding
  • src/daft-core/src/series/ops/cast.rs — Implemented Series::try_cast() with bulk-first, element-wise fallback strategy

SQL Support (daft-sql):

  • src/daft-sql/src/planner.rs — Handle CastKind::TryCast and CastKind::SafeCast in SQL planner
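
As a rough sketch of the routing described above, the planner only needs to map the parsed cast kind to the boolean flag. The `CastKind` enum below is a hypothetical Python stand-in that reuses the variant names from the PR; it is not the SQL parser's actual type.

```python
from enum import Enum, auto


class CastKind(Enum):
    """Stand-in for the parser's cast kinds named in the PR description."""

    CAST = auto()       # CAST(x AS T)
    TRY_CAST = auto()   # TRY_CAST(x AS T)
    SAFE_CAST = auto()  # SAFE_CAST(x AS T)


def is_try_cast(kind: CastKind) -> bool:
    """Return the flag that would be stored on the Cast expression."""
    return kind in (CastKind.TRY_CAST, CastKind.SAFE_CAST)
```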

Logical/Physical Plan:

  • src/daft-logical-plan/src/ops/project.rs — Updated semantic ID replacement for Cast
  • src/daft-logical-plan/src/optimization/rules/push_down_projection.rs — Updated projection pushdown
  • src/daft-physical-plan/src/translation/translate.rs — Updated physical plan translation

Python API:

  • daft/expressions/expressions.py — Added Expression.try_cast() method
  • daft/series.py — Added Series.try_cast() method
  • daft/functions/__init__.py — Added try_cast() top-level function

Tests:

  • tests/series/test_try_cast.py — Comprehensive test coverage for various type conversion scenarios

Related Issues

Closes #6959

@XuQianJin-Stars XuQianJin-Stars requested a review from a team as a code owner May 19, 2026 03:07
@github-actions github-actions Bot added the feat label May 19, 2026
@greptile-apps
Contributor

greptile-apps Bot commented May 19, 2026

Greptile Summary

This PR implements try_cast — a safe type-conversion operation that returns null instead of raising an error on failure — across the Rust core, SQL planner, logical/physical plan layers, and Python API, following Spark's TRY_CAST semantics.

  • Core implementation: Expr::Cast gains a bool flag to distinguish cast from try_cast; Series::try_cast uses a bulk-first strategy with an O(n) element-wise fallback on failure.
  • SQL support: CastKind::TryCast and CastKind::SafeCast are routed to expr.try_cast() in the SQL planner.
  • Breaking side-effects: the display_name format for regular cast silently changes from "a to Int64" to "a to Int64 (cast)", renaming unaliased cast columns; the expression visitor dispatches try_cast nodes to visit_try_cast, which will cause AttributeError in any existing Python visitor that does not implement that method.
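
The bulk-first strategy with element-wise fallback can be modeled in plain Python. This is a minimal sketch, not the Rust implementation: lists stand in for `Series`, and `bulk_cast` and `try_cast` are hypothetical helper names.

```python
from typing import Callable, List, Optional


def bulk_cast(values: list, convert: Callable) -> List[Optional[object]]:
    """All-or-nothing cast: raises if any non-null element fails."""
    return [None if v is None else convert(v) for v in values]


def try_cast(values: list, convert: Callable) -> List[Optional[object]]:
    """Bulk cast first; on failure, retry one element at a time."""
    try:
        return bulk_cast(values, convert)  # fast path: everything converts
    except (ValueError, TypeError):
        pass

    pieces = []
    for i in range(len(values)):
        chunk = values[i : i + 1]  # single-element slice
        try:
            pieces.append(bulk_cast(chunk, convert))
        except (ValueError, TypeError):
            pieces.append([None])  # failed element becomes null
    # Concatenate the single-element pieces back into one column.
    return [v for piece in pieces for v in piece]
```

The fast path costs one pass when every value converts; the slow path is only taken when at least one element fails.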

Confidence Score: 3/5

Merging as-is silently renames output columns for all unaliased cast expressions and breaks any Python expression-visitor that has not added visit_try_cast.

Three independent issues affect existing behavior without any opt-in: unaliased cast columns are silently renamed because the display_name format was changed for the non-try path too; the expression visitor hard-dispatches to visit_try_cast, crashing existing visitor implementations that only handle visit_cast; and stats evaluation propagates errors for try_cast nodes instead of returning Missing. The core try_cast logic and SQL integration are solid, but these collateral regressions need to be resolved before merging.

src/daft-dsl/src/expr/mod.rs (display_name format), src/daft-dsl/src/visitor.rs (visit_try_cast dispatch), and src/daft-stats/src/table_stats.rs (try_cast stats handling) all need attention before merging.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/daft-dsl/src/expr/mod.rs | Extends Expr::Cast with a bool flag and updates all match sites. display_name format change silently renames unaliased cast columns from "a to Int64" to "a to Int64 (cast)", which is a behavioral regression. |
| src/daft-dsl/src/visitor.rs | Routes try_cast nodes to visit_try_cast; existing Python visitors that only implement visit_cast will receive AttributeError at runtime when encountering try_cast expressions. |
| src/daft-stats/src/table_stats.rs | Ignores the try_cast flag during statistics evaluation; a failing cast() call on stats will propagate as an error rather than returning Missing, contradicting try_cast semantics. |
| src/daft-core/src/series/ops/cast.rs | Adds try_cast with a bulk-first / element-wise fallback strategy. Correctness looks fine for small series, but the O(n) slice+concat loop is a performance hazard for large columns. |
| src/daft-sql/src/planner.rs | Adds SQL TRY_CAST / SAFE_CAST support by routing CastKind::TryCast |
| src/daft-recordbatch/src/lib.rs | Correctly dispatches to series.try_cast() or series.cast() in both async and sync expression evaluation paths. |
| tests/series/test_try_cast.py | Good coverage for core scenarios (success, failure, nulls, empty series, SQL, Expression API). Missing tests for numeric overflow/precision edge cases. |
| daft/functions/misc.py | Adds try_cast top-level function with proper docstring and example; mirrors the cast implementation correctly. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User calls try_cast] --> B{Source}
    B -->|Python API| C[Expression.try_cast / Series.try_cast]
    B -->|SQL| D[TRY_CAST / SAFE_CAST in planner.rs]
    C --> E["Expr::Cast(expr, dtype, true)"]
    D --> E
    E --> F[RecordBatch eval_expression]
    F --> G[Series::try_cast]
    G --> H{Bulk cast attempt}
    H -->|Success| I[Return casted Series]
    H -->|Failure| J[Element-wise fallback loop]
    J --> K{Per-element cast}
    K -->|Ok| L[Append element]
    K -->|Err| M[Append null]
    L --> N[Series::concat all elements]
    M --> N
    N --> I

Reviews (1): Last reviewed commit: "feat: add try_cast function for safe typ..."

Comment on lines +1751 to 1755
+ Self::Cast(expr, dtype, try_cast) => {
      let child_id = expr.display_name(schema)?;
-     Ok(format!("{child_id} to {dtype}"))
+     let prefix = if *try_cast { "try_cast" } else { "cast" };
+     Ok(format!("{child_id} to {dtype} ({prefix})"))
  }

P1 display_name format change silently renames unaliased cast columns

The format for a regular cast expression changed from "a to Int64" to "a to Int64 (cast)". Since display_name feeds into the output column name when no alias is provided (see daft-local-execution/src/intermediate_ops/project.rs), this is a silent behavioral regression: any user who does df.select(col("a").cast(dtype)) without an explicit alias will see their column renamed, breaking downstream code that accesses it by name.
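
One possible way to avoid the rename, sketched here in Python rather than Rust, is to keep the legacy name for plain casts and add a suffix only in try_cast mode. The review does not prescribe this fix; `cast_display_name` is a hypothetical helper that only models the string formatting.

```python
def cast_display_name(child: str, dtype: str, try_cast: bool) -> str:
    """Build the unaliased column name for a cast expression."""
    base = f"{child} to {dtype}"  # legacy format, unchanged for plain cast
    # Only try_cast gets a distinguishing suffix, so existing names survive.
    return f"{base} (try_cast)" if try_cast else base
```

Under this scheme, code that reads the column produced by an unaliased plain cast keeps working, while cast and try_cast of the same child still get distinct names.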

Comment thread src/daft-stats/src/table_stats.rs Outdated
Comment on lines 152 to 154
Expr::Cast(col, dtype, _try_cast) => self
    .eval_expression(&BoundExpr::new_unchecked(col.clone()))?
    .cast(dtype),

P1 Stats evaluation errors for try_cast expressions instead of returning Missing

When _try_cast is true, a failure in ColumnRangeStatistics::cast propagates as an error instead of gracefully falling back to Missing. For a try_cast expression that cannot cast the statistical bounds (e.g., casting string bounds to int), the stats layer should return Missing rather than hard-failing, matching the semantics of try_cast at the data level.

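Suggested change
- Expr::Cast(col, dtype, _try_cast) => self
-     .eval_expression(&BoundExpr::new_unchecked(col.clone()))?
-     .cast(dtype),
+ Expr::Cast(col, dtype, try_cast) => {
+     let stats = self.eval_expression(&BoundExpr::new_unchecked(col.clone()))?;
+     if *try_cast {
+         // try_cast: a failed stats cast degrades to Missing instead of erroring.
+         // Wrapped in Ok so the arm still evaluates to a Result, like the original arm.
+         Ok(stats.cast(dtype).unwrap_or(ColumnRangeStatistics::Missing))
+     } else {
+         stats.cast(dtype)
+     }
+ }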
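
The intent of this suggestion can be modeled in a few lines of Python. This is a sketch under stated assumptions: the `MISSING` sentinel stands in for `ColumnRangeStatistics::Missing`, a `(lower, upper)` tuple stands in for range statistics, and `cast_range_stats` is a hypothetical name.

```python
from typing import Callable, Tuple, Union

# Hypothetical sentinel standing in for ColumnRangeStatistics::Missing.
MISSING = "Missing"


def cast_range_stats(
    bounds: Tuple[object, object], convert: Callable, try_cast: bool
) -> Union[Tuple[object, object], str]:
    """Cast (lower, upper) bounds; in try_cast mode, failure yields MISSING."""
    lower, upper = bounds
    try:
        return (convert(lower), convert(upper))
    except (ValueError, TypeError):
        if try_cast:
            return MISSING  # no usable statistics, but not an error
        raise  # plain cast keeps the hard failure
```

Returning `MISSING` means the planner simply cannot prune on this column, which is a safe outcome for an expression that is allowed to produce nulls.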

Comment on lines +92 to 105
fn visit_cast(
    &self,
    expr: &ExprRef,
    data_type: &DataType,
    try_cast: bool,
) -> PyVisitorResult<'py> {
    let attr = if try_cast {
        "visit_try_cast"
    } else {
        "visit_cast"
    };
    let args = (self.to_expr(expr)?, self.to_data_type(data_type)?);
    self.visitor.call_method1(attr, args)
}

P1 Visitor dispatches to visit_try_cast, breaking existing Python visitors

When a try_cast expression is visited, visit_try_cast is called on the Python visitor object. Any existing Python visitor that implements visit_cast but not visit_try_cast will receive an AttributeError at runtime as soon as it encounters a try_cast node in the expression tree. A backward-compatible approach would be to check whether the visitor has visit_try_cast and fall back to visit_cast (passing the try_cast flag as a third argument) if it does not.
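
The backward-compatible dispatch described above might look like the following on the Python side. This is a sketch only: `dispatch_cast`, `LegacyVisitor`, and `NewVisitor` are hypothetical names, and the real dispatch lives in the Rust binding.

```python
def dispatch_cast(visitor, expr, dtype, try_cast: bool):
    """Prefer visit_try_cast when present; otherwise fall back to visit_cast."""
    if not try_cast:
        return visitor.visit_cast(expr, dtype)
    handler = getattr(visitor, "visit_try_cast", None)
    if handler is not None:
        return handler(expr, dtype)
    # Fallback for visitors without visit_try_cast: pass the flag as a third argument.
    return visitor.visit_cast(expr, dtype, True)


class LegacyVisitor:
    """Visitor without visit_try_cast (assumed to tolerate an optional flag)."""

    def visit_cast(self, expr, dtype, try_cast=False):
        return ("cast", expr, dtype, try_cast)


class NewVisitor(LegacyVisitor):
    """Visitor that opts in to the dedicated hook."""

    def visit_try_cast(self, expr, dtype):
        return ("try_cast", expr, dtype)
```

Note that a visitor whose `visit_cast` accepts exactly two arguments would still raise `TypeError` on the fallback path, so the third-argument approach only helps visitors that tolerate the extra parameter.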

Comment on lines +22 to +47
// Try bulk cast first, return directly if successful
match self.cast(datatype) {
    Ok(casted) => Ok(casted),
    Err(_) => {
        // Bulk cast failed, try element-wise
        let len = self.len();
        let mut results: Vec<Series> = Vec::with_capacity(len);

        for i in 0..len {
            let element = self.slice(i, i + 1)?;
            match element.cast(datatype) {
                Ok(casted) => results.push(casted),
                Err(_) => {
                    results.push(Series::full_null(self.name(), datatype, 1));
                }
            }
        }

        // Concatenate all single-element Series
        if results.is_empty() {
            Ok(Series::empty(self.name(), datatype))
        } else {
            let ref_results: Vec<&Series> = results.iter().collect();
            Series::concat(&ref_results)
        }
    }

P2 Element-wise fallback is O(n) slice+cast+concat and can be very slow at scale

When the bulk cast fails, the implementation falls back to a loop that calls self.slice(i, i + 1) and element.cast(datatype) for every row, then passes all of the resulting single-element Series to one Series::concat call. For a column with millions of rows, a single uncastable value (e.g., mixed strings cast to int) sends the whole column down this path and creates millions of single-element Series objects, which will likely be orders of magnitude slower than a vectorized implementation. Worth at least noting in a comment, or documenting as a known limitation.
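
For comparison, a fallback that avoids per-row slices can fill one preallocated output in a single pass. The Python model below is only a sketch of that idea: a list with `None` entries stands in for a values buffer plus validity mask, and `try_cast_single_pass` is a hypothetical name, not a proposed Daft API.

```python
from typing import Callable, List, Optional


def try_cast_single_pass(values: list, convert: Callable) -> List[Optional[object]]:
    """Convert element by element into one preallocated output column."""
    out: List[Optional[object]] = [None] * len(values)  # start as all-null
    for i, value in enumerate(values):
        if value is None:
            continue  # input null stays null
        try:
            out[i] = convert(value)
        except (ValueError, TypeError):
            pass  # failed conversion leaves the slot null
    return out
```

This still visits every element, but it allocates one output instead of one intermediate object per row plus a final concatenation.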

@XuQianJin-Stars XuQianJin-Stars force-pushed the feat/try-cast branch 3 times, most recently from 2ebd43d to a97e312 Compare May 19, 2026 03:24
Development

Successfully merging this pull request may close these issues.

feat: Add try_cast function for safe type conversion (Spark-compatible TRY_CAST)
