Adding Use of arrow's has_true() / has_false()#21806
Conversation
|
run benchmarks |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing feat/21784-has-true-has-false (ca7e0a8) to 067ba4b (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing feat/21784-has-true-has-false (ca7e0a8) to 067ba4b (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch — base (merge-base)
tpch — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
Hi @adriangb , requesting you to review. Thanks |
|
run benchmarks |
|
Hi @adriangb, thanks for the request (#21806 (comment)). You already have 15 pending benchmark job(s), and this request would add 3 more — exceeding the per-user queue limit of 15. Please wait for some of your current runs to finish before submitting more. File an issue against this benchmark runner |
|
run benchmarks |
adriangb
left a comment
There was a problem hiding this comment.
Some more nits: please restore the comments and double check the usage in replace.rs.
Thanks for working on this!
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing feat/21784-has-true-has-false (8777c09) to 42cd2fa (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing feat/21784-has-true-has-false (8777c09) to 42cd2fa (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing feat/21784-has-true-has-false (8777c09) to 42cd2fa (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
It looks like it can't load the lineitem table |
|
run benchmark tpch |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing feat/21784-has-true-has-false (8777c09) to 42cd2fa (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch — base (merge-base)
tpch — branch
File an issue against this benchmark runner |
|
Benches look good. @raushanprabhakar1 if you can address my feedback we can move to merge this |
maybe we should add a sanity check to the benchmark runner / better error message (so it fails with a message that says the files aren't there rather than failing on the expected output check) 🤔 |
Yes, I’ve run into this issue before. Even with just DataFusion CLI if you pointed to an empty directory, it will give you a table with no columns instead of some sort of error that says there’s no data here. I can’t think of any use case where you’d want to infer a schema with no columns instead of en error saying there’s no data. |
|
Hi @adriangb , Thanks for your feedback, i have addressed all your feedbacks, please do review, Thanks. |
adriangb
left a comment
There was a problem hiding this comment.
Small nit. Otherwise LGTM once CI passes!
Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
|
Thanks @raushanprabhakar1 ! |
…ema (#21965) ## Which issue does this PR close? - Closes #. ## Rationale for this change When you point `CREATE EXTERNAL TABLE` at an empty directory (or one that does not exist yet) without specifying an explicit column list, DataFusion silently creates a table with **0 columns**. Any query against that table then fails with a confusing "column not found" / "no such column" error that gives no hint that the underlying issue is actually that schema inference had nothing to look at. This is the same root cause as the discussion on #21806 (comment) — that thread covered it from the angle of benchmark runners hitting it, but the confusion is not specific to benchmarks. Failing at `CREATE EXTERNAL TABLE` time with a clear, actionable message seemed like the right fix overall. ## What changes are included in this PR? `ListingOptions::infer_schema` now returns a `Plan` error when the location yields no files (after the existing 0-byte filter), telling the user to either add data files or declare an explicit schema: ``` Error during planning: No files found at file:///tmp/empty_dir/. Cannot infer schema from an empty location; either add data files or declare an explicit schema for the table. ``` Pre-declaring an empty table with an explicit schema (e.g. `CREATE EXTERNAL TABLE t(x int) STORED AS PARQUET LOCATION '...'` for later `INSERT`) still works — the inference path is only triggered when no schema is provided. ## Are these changes tested? Yes. New cases in `datafusion/sqllogictest/test_files/ddl.slt` cover: - Parquet, CSV, and JSON over an empty location without an explicit schema → all return the new `Plan` error. - An empty location with an explicit schema → still works and queries cleanly. - Schema inference still succeeds once files exist at the location, so the new check does not regress the happy path. ## Are there any user-facing changes? Yes — `CREATE EXTERNAL TABLE ... LOCATION '<empty-dir>'` without an explicit schema now errors at planning time instead of creating a 0-column table. Anyone relying on the previous behavior must add an explicit schema declaration. The error message tells them how. ## Use of AI This code was written fully by AI. @adriangb gave it a detailed plan and reviewed the code by hand once this PR was opened and CI was green. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Which issue does this PR close?
Closes #21784
Rationale for this change
Apache Arrow added
BooleanArray::has_true()andhas_false()so callers can answer “any true/false?” without a full bit count. That can short-circuit and avoid unnecessary work compared to patterns liketrue_count() == 0ortrue_count() > 0.This PR applies those APIs across DataFusion where the logic is purely existential (or equivalent via null-safe “all true” / “no true” checks), matching the audit suggested in the issue.
What changes are included in this PR?
has_true()/has_false()(andnull_count()where needed), including:array_has, list replace), Sparkarray_containsnull-semantics fast pathevaluate_selection, binary AND/OR short-circuit, CASE/IN list loopsscatterfast pathsmetadata.rs,has_any_exact_match)true_count()/false_count()where an actual count is required (row counts, metrics, selectivity,to_array(n), etc.)arrow::array::Arraywherenull_count()is used onBooleanArrayin trait-heavy pathsAre these changes tested?
Existing tests cover this behavior; the edits are semantics-preserving refactors (same conditions, cheaper primitives). No new tests were added.
Are there any user-facing changes?
No. Behavior should be unchanged; this is an internal performance/clarity improvement.