Experimental: Native CSV files read by kazantsev-maksim · Pull Request #3044 · apache/datafusion-comet

kazantsev-maksim · 2026-01-06T15:36:28Z

Which issue does this PR close?

N/A

Rationale for this change

Adds an experimental implementation of native CSV file reading (currently only for the DataSourceV2 code path).

Required improvements:

Conduct more benchmark tests
Test reading files from S3/HDFS (currently only tested on local files)

Results of simple benchmark test (1 iteration): native_csv_read.txt

How are these changes tested?

Added a new unit test
Added a simple benchmark test

# Conflicts: # native/core/src/execution/planner.rs # native/proto/src/proto/operator.proto # spark/src/main/scala/org/apache/comet/rules/CometExecRule.scala

This reverts commit 768b3e9.

comphead · 2026-01-06T17:27:35Z

nice, would love to see benches )

parthchandra · 2026-01-10T02:37:54Z

Shouldn't CSV be a file format and part of ScanExec?

codecov-commenter · 2026-01-10T02:59:56Z

Codecov Report

❌ Patch coverage is 85.07463% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.01%. Comparing base (f09f8af) to head (139ecff).
⚠️ Report is 859 commits behind head on main.

Files with missing lines	Patch %	Lines
...pache/spark/sql/comet/CometCsvNativeScanExec.scala	84.12%	4 Missing and 6 partials ⚠️
...n/scala/org/apache/comet/rules/CometScanRule.scala	78.57%	2 Missing and 4 partials ⚠️
...cala/org/apache/comet/serde/operator/package.scala	92.85%	0 Missing and 2 partials ⚠️
...n/scala/org/apache/comet/rules/CometExecRule.scala	50.00%	0 Missing and 1 partial ⚠️
...n/scala/org/apache/spark/sql/comet/operators.scala	80.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #3044      +/-   ##
============================================
+ Coverage     56.12%   60.01%   +3.88%     
- Complexity      976     1424     +448     
============================================
  Files           119      170      +51     
  Lines         11743    15687    +3944     
  Branches       2251     2607     +356     
============================================
+ Hits           6591     9414    +2823     
- Misses         4012     4952     +940     
- Partials       1140     1321     +181

kazantsev-maksim · 2026-01-11T16:10:33Z

Thanks @parthchandra, you are absolutely right. In the first phase, I wanted to implement it only for DataSourceV2 to check the performance improvement. I hope to finish the benchmark tests in the coming days.
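For anyone trying this out: stock Spark routes CSV through the DataSource V1 path unless `csv` is removed from `spark.sql.sources.useV1SourceList`, so a V2-only scan presumably needs that setting changed. The sketch below only builds the config value. The default list shown is the one I recall from Spark 3.x and should be checked against your Spark version. The helper name `v1_list_without` is hypothetical, and the Comet setting that enables the native CSV scan is not shown because it is not named in this thread.

```python
# Assumed Spark 3.x default for spark.sql.sources.useV1SourceList; verify locally.
DEFAULT_V1_SOURCES = "avro,csv,json,kafka,orc,parquet,text"


def v1_list_without(source: str, current: str = DEFAULT_V1_SOURCES) -> str:
    """Return the V1 source list with `source` removed (hypothetical helper)."""
    names = (part.strip() for part in current.split(","))
    return ",".join(n for n in names if n and n.lower() != source.lower())


# Config entry that makes Spark plan CSV reads through DataSource V2.
v2_csv_conf = {"spark.sql.sources.useV1SourceList": v1_list_without("csv")}

# With PySpark installed, this could be applied when building the session, e.g.:
#   builder = SparkSession.builder
#   for key, value in v2_csv_conf.items():
#       builder = builder.config(key, value)
```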

kazantsev-maksim · 2026-01-14T17:36:02Z

I ran the TPC-H test on my local machine. The very long Spark query execution time is caused by some queries being restarted during processing.

kazantsev-maksim · 2026-01-14T17:44:38Z

    parser.add_argument("--name", required=True, help="Prefix for result file e.g. spark/comet/gluten")
    parser.add_argument("--query", required=False, type=int, help="Specific query number to run (1-based). If not specified, all queries will be run.")
    parser.add_argument("--write", required=False, help="Path to save query results to, in Parquet format.")
+    parser.add_argument("--format", required=True, default="parquet", help="Input file format (parquet, csv, json)")


It is necessary to add the ability to pass CSV reading options.

For CSV:

    tpcbench.py \
        --name spark \
        --benchmark tpch \
        --data $TPCH_DATA \
        --queries $TPCH_QUERIES \
        --output . \
        --iterations 1 \
        --format csv \
        --options '{"header": "true", "delimiter": ","}'
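A minimal sketch of how `--format` and a JSON `--options` argument could be parsed and turned into reader options. This is an illustration under assumptions, not the code in `tpcbench.py`: the function names `build_parser` and `reader_options` are hypothetical, and only the two flags discussed here are modelled.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the --format and --options flags."""
    parser = argparse.ArgumentParser(description="benchmark CLI sketch")
    parser.add_argument(
        "--format",
        default="parquet",
        choices=["parquet", "csv", "json"],
        help="Input file format",
    )
    parser.add_argument(
        "--options",
        type=json.loads,  # invalid JSON is reported by argparse as a usage error
        default=None,
        help='Reader options as a JSON object, e.g. \'{"header": "true"}\'',
    )
    return parser


def reader_options(args: argparse.Namespace) -> dict:
    """Normalise the parsed --options value into a str -> str mapping."""
    raw = args.options or {}
    if not isinstance(raw, dict):
        raise ValueError("--options must be a JSON object")
    return {str(key): str(value) for key, value in raw.items()}


args = build_parser().parse_args(
    ["--format", "csv", "--options", '{"header": "true", "delimiter": ","}']
)
opts = reader_options(args)

# With PySpark, the options could then be passed to the reader, e.g.:
#   df = spark.read.format(args.format).options(**opts).load(path)
```

In argparse, a `default` on an option declared with `required=True` is never used, which is why the sketch leaves `required` off so that omitting `--format` falls back to `parquet`.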

kazantsev-maksim · 2026-01-14T19:04:31Z

I ran a micro-benchmark test to measure the read speed for each TPC-H table.

Results: native_csv_read.txt
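For reference, a per-table read timing loop of this kind can be sketched as below. This is not the benchmark used for `native_csv_read.txt`. It is a generic harness, and `time_table_reads` and `fake_reader` are hypothetical names. The reader is passed in as a callable, so the same loop can time Spark with and without Comet.

```python
import time
from typing import Callable, Dict, Iterable

# The eight TPC-H tables.
TPCH_TABLES = (
    "customer", "lineitem", "nation", "orders",
    "part", "partsupp", "region", "supplier",
)


def time_table_reads(
    read_table: Callable[[str], object],
    tables: Iterable[str] = TPCH_TABLES,
    iterations: int = 1,
) -> Dict[str, float]:
    """Return the best wall-clock time in seconds for reading each table."""
    results: Dict[str, float] = {}
    for table in tables:
        best = float("inf")
        for _ in range(iterations):
            start = time.perf_counter()
            read_table(table)  # must force a full read of the table
            best = min(best, time.perf_counter() - start)
        results[table] = best
    return results


def fake_reader(table: str) -> int:
    """Stand-in reader so the sketch runs without Spark."""
    return len(table)


timings = time_table_reads(fake_reader)

# With PySpark, read_table could look like the following (data_dir is assumed):
#   lambda t: spark.read.options(**opts).csv(f"{data_dir}/{t}")
#                  .write.format("noop").mode("overwrite").save()
```

Writing to the `noop` sink forces every column to be read and parsed. A plain `count()` may let Spark skip columns, which would understate the read cost.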

kazantsev-maksim · 2026-01-17T17:35:26Z

@comphead @parthchandra @andygrove @mbutrovich I would like to get your feedback — does this PR make sense?

parthchandra · 2026-01-20T21:46:37Z

> I ran a micro-benchmark test to measure the read speed for each TPC-H table.
>
> Results: native_csv_read.txt

That's a nice speedup so I think that it is worth it.

parthchandra

lgtm, pending ci

The csv_scan.rs file added in apache#3044 uses datafusion_datasource but the dependency was not added to core/Cargo.toml.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* build: optimize CI cache usage and add fast lint gate

  This PR addresses cache storage approaching its 10GB limit by:

  1. Cache optimization (saves ~2+ GB):
     - Remove Java version from cargo cache key (Rust target is JDK-independent)
     - Use actions/cache/restore + actions/cache/save pattern
     - Only save cache on main branch, not on PRs
  2. Reduce Rust test matrix:
     - Consolidate from 2 jobs (Java 11 + Java 17) to 1 job (Java 17)
     - Rust code is JDK-independent, so no coverage lost
  3. Add fast lint gate (~30 seconds):
     - New lint job runs cargo fmt --check before expensive builds
     - build-native and linux-test-rust depend on lint passing
     - Fail fast on formatting errors instead of waiting 5-10 minutes
     - macOS lint runs on ubuntu-latest for cost efficiency

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: add missing datafusion-datasource dependency

  The csv_scan.rs file added in #3044 uses datafusion_datasource but the dependency was not added to core/Cargo.toml.

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* build: merge TPC-DS/TPC-H correctness tests into pr_build_linux

  These workflows verify that benchmark queries produce correct results (not actual performance benchmarks), so they can use the CI build profile and share the native library artifact from build-native.

  Changes:
  - Add verify-benchmark-results-tpch job to pr_build_linux
  - Add verify-benchmark-results-tpcds job to pr_build_linux (3 join strategies)
  - Delete standalone benchmark-tpcds.yml and benchmark-tpch.yml workflows
  - Jobs reuse native library artifact instead of rebuilding

  This eliminates 4+ redundant native builds per PR.

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: compile test classes before generating TPC data

  The GenTPCHData and GenTPCDSData classes are test classes that need to be compiled before running exec:java. Added a build step to compile the project (including test classes) before data generation.

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

Kazantsev Maksim and others added 27 commits November 26, 2025 20:14

Work

1c19b51

Merge remote-tracking branch 'origin/main' into native_csv_read

062296f

Work

0d9355f

Merge remote-tracking branch 'origin/main' into native_csv_read

4479678

Work

6c12812

work

b601956

Merge remote-tracking branch 'origin/main' into native_csv_read

2ffd9cb

# Conflicts: # native/core/src/execution/planner.rs # native/proto/src/proto/operator.proto # spark/src/main/scala/org/apache/comet/rules/CometExecRule.scala

work

c685235

impl map_from_entries

768b3e9

Revert "impl map_from_entries"

c68c342

This reverts commit 768b3e9.

Merge branch 'apache:main' into main

d887555

Merge branch 'apache:main' into main

231aa90

Merge remote-tracking branch 'origin/main' into native_csv_read

2be0069

work

7ea16ee

work

c521006

Merge branch 'apache:main' into main

9500bbb

Merge branch 'apache:main' into main

9577481

Merge remote-tracking branch 'origin/main' into native_csv_read

8796a68

WIP

0f06936

WIP

033ba8b

WIP

dafa0de

Work

1809df8

Merge branch 'apache:main' into main

3791557

Merge branch 'apache:main' into main

7c2f082

Merge branch 'apache:main' into main

609a605

Final approach

d8c7760

Fix workflows

88aeb33

kazantsev-maksim marked this pull request as draft January 6, 2026 15:38

Kazantsev Maksim added 2 commits January 6, 2026 19:46

Fix fmt

b2a0c28

Fix params

ba98e37

Kazantsev Maksim added 2 commits January 6, 2026 20:02

Fix tests

cd449c5

Fix rust fmt

a1801c1

Kazantsev Maksim and others added 2 commits January 6, 2026 21:57

Fix fmt

65251eb

Merge branch 'apache:main' into main

a151b2c

kazantsev-maksim and others added 3 commits January 10, 2026 01:06

Merge branch 'apache:main' into main

ad3e7f5

Merge remote-tracking branch 'origin/main' into native_csv_read

da73c27

Fix tests

a21ae07

Kazantsev Maksim and others added 3 commits January 14, 2026 21:34

Run tpch

bb5debe

Merge branch 'apache:main' into main

ea92e4b

Merge remote-tracking branch 'origin/main' into native_csv_read

2cc6a98

kazantsev-maksim commented Jan 14, 2026

Add spark options to tpcbench.py

139ecff

kazantsev-maksim marked this pull request as ready for review January 15, 2026 15:31

parthchandra approved these changes Jan 20, 2026

andygrove merged commit f538424 into apache:main Jan 23, 2026
131 checks passed

This was referenced Jan 23, 2026

build: add missing datafusion-datasource dependency #3252

Merged

build: optimize CI cache usage and add fast lint gate #3251

Merged

kazantsev-maksim deleted the native_csv_read branch January 23, 2026 14:37

andygrove mentioned this pull request Jan 23, 2026

Comet 0.13.0 Release (January 2026) #2845

Closed

andygrove mentioned this pull request Jan 30, 2026

Implement native parsing of CSV files #882

Closed

andygrove merged 40 commits into apache:main from kazantsev-maksim:native_csv_read

5 participants
