Experimental: Native CSV files read#3044
Conversation
# Conflicts: # native/core/src/execution/planner.rs # native/proto/src/proto/operator.proto # spark/src/main/scala/org/apache/comet/rules/CometExecRule.scala
This reverts commit 768b3e9.
|
nice, would love to see benches ) |
|
Shouldn't CSV be a file format and part of |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3044 +/- ##
============================================
+ Coverage 56.12% 60.01% +3.88%
- Complexity 976 1424 +448
============================================
Files 119 170 +51
Lines 11743 15687 +3944
Branches 2251 2607 +356
============================================
+ Hits 6591 9414 +2823
- Misses 4012 4952 +940
- Partials 1140 1321 +181 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Thanks @parthchandra, you are absolutely right. In the first phase, I wanted to implement it only for DataSourceV2 to check the performance improvement. I hope to finish the benchmark tests in the coming days. |
| parser.add_argument("--name", required=True, help="Prefix for result file e.g. spark/comet/gluten") | ||
| parser.add_argument("--query", required=False, type=int, help="Specific query number to run (1-based). If not specified, all queries will be run.") | ||
| parser.add_argument("--write", required=False, help="Path to save query results to, in Parquet format.") | ||
| parser.add_argument("--format", required=True, default="parquet", help="Input file format (parquet, csv, json)") |
There was a problem hiding this comment.
It is necessary to add the ability to pass CSV reading options.
There was a problem hiding this comment.
For CSV:
tpcbench.py
--name spark
--benchmark tpch
--data $TPCH_DATA
--queries $TPCH_QUERIES
--output .
--iterations 1
--format csv
--options '{"header": "true", "delimiter": ","}'
|
I ran a micro-benchmark test to measure the read speed for each TPC-H table. Results: native_csv_read.txt |
|
@comphead @parthchandra @andygrove @mbutrovich I would like to get your feedback — does this PR make sense? |
That's a nice speedup so I think that it is worth it. |
The csv_scan.rs file added in apache#3044 uses datafusion_datasource but the dependency was not added to core/Cargo.toml. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* build: optimize CI cache usage and add fast lint gate This PR addresses cache storage approaching its 10GB limit by: 1. Cache optimization (saves ~2+ GB): - Remove Java version from cargo cache key (Rust target is JDK-independent) - Use actions/cache/restore + actions/cache/save pattern - Only save cache on main branch, not on PRs 2. Reduce Rust test matrix: - Consolidate from 2 jobs (Java 11 + Java 17) to 1 job (Java 17) - Rust code is JDK-independent, so no coverage lost 3. Add fast lint gate (~30 seconds): - New lint job runs cargo fmt --check before expensive builds - build-native and linux-test-rust depend on lint passing - Fail fast on formatting errors instead of waiting 5-10 minutes - macOS lint runs on ubuntu-latest for cost efficiency Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix: add missing datafusion-datasource dependency The csv_scan.rs file added in #3044 uses datafusion_datasource but the dependency was not added to core/Cargo.toml. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * build: merge TPC-DS/TPC-H correctness tests into pr_build_linux These workflows verify that benchmark queries produce correct results (not actual performance benchmarks), so they can use the CI build profile and share the native library artifact from build-native. Changes: - Add verify-benchmark-results-tpch job to pr_build_linux - Add verify-benchmark-results-tpcds job to pr_build_linux (3 join strategies) - Delete standalone benchmark-tpcds.yml and benchmark-tpch.yml workflows - Jobs reuse native library artifact instead of rebuilding This eliminates 4+ redundant native builds per PR. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * fix: compile test classes before generating TPC data The GenTPCHData and GenTPCDSData classes are test classes that need to be compiled before running exec:java. Added a build step to compile the project (including test classes) before data generation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>


Which issue does this PR close?
Rationale for this change
Added an experimental implementation of native CSV file reading (currently only for DataSourceV2 version)
Required improvements:
Results of simple benchmark test (1 iteration): native_csv_read.txt
How are these changes tested?