fix: raise AmbiguousReference error for duplicate column names in subquery#21236
fix: raise AmbiguousReference error for duplicate column names in subquery#21236xiedeyantu wants to merge 4 commits intoapache:mainfrom
Conversation
xudong963
left a comment
There was a problem hiding this comment.
Could you please add some slt tests? multiple joins (2, 3 etc) with duplicated columns
| let aliases = unique_field_aliases(plan.schema().fields()); | ||
| let is_projection_needed = aliases.iter().any(Option::is_some); | ||
|
|
||
| // Collect the set of unqualified field names that are ambiguous in this |
There was a problem hiding this comment.
Here marks a name as ambiguous when unique_field_aliases provides a rename for it. But unique_field_aliases renames all duplicates — so if there are 3 columns named age, the 2nd and 3rd get renamed.
The code collects field.name() for each renamed field, which means it collects "age" twice and puts it in the set. That works correctly due to HashSet dedup, but the first age (which was NOT renamed) is also ambiguous and only ends up in the set because one of the later duplicates shares its name. This is coincidentally correct but fragile — if unique_field_aliases ever changed to rename ALL duplicates (including the first), or if it renamed to something other than name:N, the logic could break. 🤔
A cleaner approach: count occurrences of each name and mark any name appearing 2+ times.
There was a problem hiding this comment.
@xudong963 Thank you very much for your review and excellent suggestions. My initial intention for this PR was to eliminate the unique naming convention—specifically, the name:N format—but it appears to play a critical role internally (it would be great if you could provide some further details on this). Consequently, I introduced an additional ambiguous_names field to track duplicate column names. To be honest, I felt this approach lacked elegance, but I couldn't come up with a better alternative at the time. Having reviewed your suggestions, however, I now believe they offer a superior solution; I will proceed to refactor this PR based on that approach. I will also add the corresponding SLT tests.
|
@xudong963 I have made the changes based on your suggestions and added tests. There is one failing item in the CI, but it appears unrelated to the current PR, as the |
xudong963
left a comment
There was a problem hiding this comment.
Looks like nested derived tables still lose the ambiguity marker after one extra SELECT * wrapper: such as SELECT age FROM ( SELECT * FROM ( SELECT * FROM t_left JOIN t_right ON t_left.id = t_right.id ) AS s1 ) AS s2;
And for join, how about the case SELECT age FROM (SELECT * FROM t_left JOIN t_right ON t_left.id = t_right.id) sub JOIN seed ON true
@xudong963 Thanks! Good catch, I have fixed this issue and added tests for nested derived tables and subquery JOIN cases. |
xudong963
left a comment
There was a problem hiding this comment.
ORDER BY still drops the ambiguity marker when it falls back from the select-list schema to the underlying FROM schema. order_by_to_sort_expr clones the projected schema and then calls DFSchema::merge, but merge still never unions other_schema.ambiguous_names
For example:
SELECT score
FROM (SELECT * FROM t_left JOIN t_right ON t_left.id = t_right.id) sub
ORDER BY age;
@xudong963 Thanks! I have fixed this issue via commit 5ab14b2. |
|
run benchmark sql_planner |
alamb
left a comment
There was a problem hiding this comment.
thanks @xiedeyantu and @xudong963 - I am worried about the implications of adding a new field to DFSchema for what appears to be a fairly narrow usecase (as in the cost will be high but the benefit relatively small)
Is there some way to improve the messages without having to pre-compute this field? For example, perhaps compute the ambiguous names (only) when producing the error message
datafusion/common/src/dfschema.rs
Outdated
| /// source (e.g. a derived-table subquery) contained multiple columns with | ||
| /// the same unqualified name. Any attempt to reference these names without | ||
| /// a qualifier should produce an [`SchemaError::AmbiguousReference`] error. | ||
| ambiguous_names: HashSet<String>, |
There was a problem hiding this comment.
DFSchema is used many places in this codebase -- I am worried that adding a new HashSet will cause non trivial slowdown in planning. I will run some benchmarks on this one
|
🤖 Criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing ambiguous-names (5ab14b2) to bc2b36c (merge-base) diff File an issue against this benchmark runner |
@alamb I haven't thought of a better approach for now. If this solution isn’t ideal, you can close this PR first, and I’ll keep thinking about whether there might be a better alternative. |
|
🤖 Criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
@alamb @xudong963 I see that the test results are now available. I'm not entirely sure how to interpret them, but based on a rough comparison with the baseline branch, it appears there is some impact. I'm not certain whether this discrepancy is acceptable. If you feel it is not, we can temporarily close this PR while I look for alternative solutions. |
Yes I agree -- my interpretation of the benchmark results show that this PR causes a small (1-3%) slowdown in SQL planning. I don't think it is acceptable to slow down all query planning to improve an error message. (It is fine to slow down planning when the error was going to be generated, but we shouldn't take all good queries too) Thank you very much for looking into this |
alamb
left a comment
There was a problem hiding this comment.
I think we need to avoid performance regressions before merging this
Of course! I’ll think about other possible ways to fix this issue. Thanks for your expert feedback. |
5ab14b2 to
76150b2
Compare
|
run benchmark sql_planner |
|
Hi @xiedeyantu, thanks for the request (#21236 (comment)). Only whitelisted users can trigger benchmarks. Allowed users: Dandandan, Fokko, Jefffrey, Omega359, adriangb, alamb, asubiotto, brunal, buraksenn, cetra3, codephage2020, comphead, erenavsarogullari, etseidl, friendlymatthew, gabotechs, geoffreyclaude, grtlr, haohuaijin, jonathanc-n, kevinjqliu, klion26, kosiew, kumarUjjawal, kunalsinghdadhwal, liamzwbao, mbutrovich, mzabaluev, neilconway, rluvaton, sdf-jkl, timsaucer, xudong963, zhuqi-lucas. File an issue against this benchmark runner |
|
@alamb @xudong963 Could you help me rerun the benchmark? |
|
run benchmark sql_planner |
|
🤖 Criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing ambiguous-names (4e8a1a5) to 5ba06ac (merge-base) diff File an issue against this benchmark runner |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
|
@alamb This seems to be related to commit |
|
run benchmark sql_planner |
|
🤖 Criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing ambiguous-names (0c8b96d) to 5ba06ac (merge-base) diff File an issue against this benchmark runner |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
Which issue does this PR close?
Rationale for this change
When two joined tables share a column with the same name (e.g.
age), aSELECT *inside a derived table subquery produces duplicate column names. Previously, referencing such a column by its unqualified name from the outer query silently succeeded instead of raising an ambiguity error, violating standard SQL semantics.What changes are included in this PR?
ambiguous_names: HashSet<String>field toDFSchemato track column names that are structurally ambiguous in a given schema context.DFSchema::with_ambiguous_names(builder) andDFSchema::ambiguous_names(accessor) methods.SubqueryAlias::try_new, afterunique_field_aliasesrenames duplicate columns to keep the Arrow schema valid, the original (pre-rename) names are collected intoambiguous_namesand attached to the output schema.DFSchema::qualified_field_with_unqualified_name, any lookup of an ambiguous name now immediately returnsSchemaError::AmbiguousReference.Column::normalize_with_schemas_and_ambiguity_check, even a single structural match is rejected when the containing schema has flagged the name as ambiguous.bad_extension_plannersnapshot test to include the newambiguous_namesfield in theDFSchemadebug output.Are these changes tested?
The existing
join_with_ambiguous_column,order_by_ambiguous_name, andgroup_by_ambiguous_nametests continue to pass. A new test case covering the reported scenario (select age from (SELECT * FROM a join b on a.aid = b.bid) as t) should be added todatafusion/sql/tests/sql_integration.rs.Are there any user-facing changes?
Yes. Queries that previously silently resolved an ambiguous column reference through a derived-table subquery will now receive a
Schema error: Ambiguous reference to unqualified field <name>error, consistent with standard SQL behavior and with how DataFusion already handles the same ambiguity at the direct JOIN level.