Summary
During review of PR #3004 (which adds basic to_csv support), fuzz testing revealed several edge cases that are not handled correctly. These should be addressed in follow-up work after the initial implementation is merged.
Bugs Found
1. Null value not quoted when it contains special characters
When the nullValue option contains the delimiter or other special characters (e.g., "N,A"), it's written unquoted, corrupting the CSV output.
| Expected (Spark) | Actual (Comet) |
|---|---|
| "N,A",world | N,A,world |
| hello,"N,A" | hello,N,A |
Location: native/spark-expr/src/csv_funcs/to_csv.rs:164-171
Fix: Check if null_value contains special characters and quote/escape it appropriately.
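A minimal sketch of that check, not the code in to_csv.rs. The helper names (`needs_quoting`, `quote_field`, `render_null`) are hypothetical, and it assumes that only the delimiter, the quote character, and line breaks force quoting, and that embedded quote characters are prefixed with the escape character:

```rust
/// Hypothetical helper: true when `field` cannot be written bare.
/// Assumes only the delimiter, the quote character and line breaks are special.
fn needs_quoting(field: &str, delimiter: char, quote: char) -> bool {
    field
        .chars()
        .any(|c| c == delimiter || c == quote || c == '\n' || c == '\r')
}

/// Hypothetical helper: wraps `field` in quotes and prefixes every embedded
/// quote character with `escape`.
fn quote_field(field: &str, quote: char, escape: char) -> String {
    let mut out = String::with_capacity(field.len() + 2);
    out.push(quote);
    for c in field.chars() {
        if c == quote {
            out.push(escape);
        }
        out.push(c);
    }
    out.push(quote);
    out
}

/// Renders the configured null representation the same way as any other
/// string field, so a delimiter inside it no longer splits the column.
fn render_null(null_value: &str, delimiter: char, quote: char, escape: char) -> String {
    if needs_quoting(null_value, delimiter, quote) {
        quote_field(null_value, quote, escape)
    } else {
        null_value.to_string()
    }
}
```

With a nullValue of `N,A` and a comma delimiter, `render_null` returns `"N,A"`, which matches the Spark column in the table above. A plain value such as `NULL` is returned unchanged.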
2. Whitespace trimming applied incorrectly
When ignoreLeadingWhiteSpace=false or ignoreTrailingWhiteSpace=false, strings that contain whitespace together with special characters are written incorrectly. The code trims whitespace before checking whether quoting is needed, so whitespace that should be preserved is lost.
| Expected (Spark) | Actual (Comet) |
|---|---|
| \" (preserved whitespace with escaped quote) | "" (empty) |
Location: native/spark-expr/src/csv_funcs/to_csv.rs:176-183
Fix: Review the order of operations - quoting determination should consider the original (untrimmed) value.
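One possible ordering, as a rough sketch only. `write_string_field` is a hypothetical function, not the existing one. It assumes that trimming happens only on the sides whose ignore option is true, and that Rust's `trim_start`/`trim_end` definition of whitespace is close enough to Spark's for illustration:

```rust
/// Hypothetical sketch: decide on quoting from the original value, then trim
/// only the sides that the options ask to ignore.
fn write_string_field(
    value: &str,
    delimiter: char,
    quote: char,
    escape: char,
    ignore_leading: bool,
    ignore_trailing: bool,
) -> String {
    // Quoting decision is made on the untrimmed input.
    let must_quote = value
        .chars()
        .any(|c| c == delimiter || c == quote || c == '\n' || c == '\r');

    // Whitespace is preserved unless the matching option is enabled.
    let mut text = value;
    if ignore_leading {
        text = text.trim_start();
    }
    if ignore_trailing {
        text = text.trim_end();
    }

    if !must_quote {
        return text.to_string();
    }

    // Wrap in quotes, prefixing embedded quote characters with the escape char.
    let mut out = String::with_capacity(text.len() + 2);
    out.push(quote);
    for c in text.chars() {
        if c == quote {
            out.push(escape);
        }
        out.push(c);
    }
    out.push(quote);
    out
}
```

With both options set to false, an input of a space followed by a double quote comes out as `" \""`: the space is kept and the quote is escaped, rather than the field collapsing to an empty quoted string.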
3. Decimal formatting mismatch
Spark writes some small decimal values, such as zero with scale 18, in scientific notation, while Comet writes them in fixed-point notation.
| Expected (Spark) | Actual (Comet) |
|---|---|
| 0E-18 | 0.000000000000000000 |
Fix: Align decimal-to-string casting with Spark's formatting behavior.
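The `0E-18` output is consistent with the rules documented for `java.math.BigDecimal.toString`, which switches to scientific notation when the scale is negative or the adjusted exponent is below -6. Assuming that is the behavior to match, a sketch of the formatting could look like this (`java_style_decimal_string` is a hypothetical name, and the decimal is taken as an unscaled `i128` plus a scale):

```rust
/// Hypothetical sketch: formats (unscaled value, scale) following the rules
/// documented for java.math.BigDecimal.toString.
fn java_style_decimal_string(unscaled: i128, scale: i32) -> String {
    let digits = unscaled.unsigned_abs().to_string();
    let len = digits.len() as i64;
    let scale = scale as i64;
    // Exponent of the value when written with one digit before the point.
    let adjusted = -scale + (len - 1);

    let mut out = String::new();
    if unscaled < 0 {
        out.push('-');
    }

    if scale >= 0 && adjusted >= -6 {
        // Plain notation.
        if scale == 0 {
            out.push_str(&digits);
        } else if len > scale {
            let split = (len - scale) as usize;
            out.push_str(&digits[..split]);
            out.push('.');
            out.push_str(&digits[split..]);
        } else {
            out.push_str("0.");
            out.push_str(&"0".repeat((scale - len) as usize));
            out.push_str(&digits);
        }
    } else {
        // Scientific notation: one leading digit, optional fraction, exponent.
        out.push_str(&digits[..1]);
        if len > 1 {
            out.push('.');
            out.push_str(&digits[1..]);
        }
        out.push('E');
        if adjusted >= 0 {
            out.push('+');
        }
        out.push_str(&adjusted.to_string());
    }
    out
}
```

For an unscaled value of 0 with scale 18, the adjusted exponent is -18, so the sketch produces `0E-18`. A value like 123 with scale 2 stays in plain notation as `1.23`.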
4. NPE with single-column struct (needs investigation)
A NullPointerException occurs when processing single-column structs with certain null patterns. This may be a Spark-side issue in how Comet's output is handled, but it needs investigation.
Reproduction
Fuzz tests were added in CometCsvExpressionSuite.scala that reproduce these issues:
to_csv - edge case: delimiter in null value representation
to_csv - fuzz test: comprehensive random data and options
to_csv - edge case: numeric boundary values
to_csv - edge case: single column struct
Related
to_csv implementation