Commit aadae6b

Fix/support duplicate column names #6543 (#21126)
## Which issue does this PR close?

- Closes #6543.

## Rationale for this change

We're building a SQL engine on top of DataFusion and hit this while running TPC-DS benchmarks. Q39 fails during planning with:

```
Projections require unique expression names but the expression "CAST(inv1.cov AS Decimal128(30, 10))" at position 4 and "inv1.cov" at position 10 have the same name.
```

The underlying issue is that `CAST` is transparent to `schema_name()`, so both expressions resolve to `inv1.cov`. But this also affects simpler cases like `SELECT 1, 1` or `SELECT x, x FROM t`, all of which PostgreSQL, Trino, and SQLite handle without errors.

Looking at the issue discussion, @alamb suggested adding auto-aliases in the SQL planner:

> "I think that is actually a pretty neat idea -- specifically add the aliases in the SQL planner. I would be happy to review such a PR"

That's what this PR does.

### TPC-DS Q39 reproduction

The query joins two CTEs that both produce columns named `cov`, `mean`, etc. When the planner applies implicit casts during type coercion, the cast-wrapped and original expressions end up with the same schema name:

```sql
WITH inv1 AS (
  SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy, stdev, mean,
         (CASE mean WHEN 0 THEN NULL ELSE stdev/mean END) AS cov
  FROM (SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy,
               stddev_samp(inv_quantity_on_hand) AS stdev,
               avg(inv_quantity_on_hand) AS mean
        FROM inventory, item, warehouse, date_dim
        WHERE inv_item_sk = i_item_sk
          AND inv_warehouse_sk = w_warehouse_sk
          AND inv_date_sk = d_date_sk
          AND d_year = 2001
        GROUP BY w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy) foo
  WHERE CASE mean WHEN 0 THEN 0 ELSE stdev/mean END > 1
),
inv2 AS (
  SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy, stdev, mean,
         (CASE mean WHEN 0 THEN NULL ELSE stdev/mean END) AS cov
  FROM (SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy,
               stddev_samp(inv_quantity_on_hand) AS stdev,
               avg(inv_quantity_on_hand) AS mean
        FROM inventory, item, warehouse, date_dim
        WHERE inv_item_sk = i_item_sk
          AND inv_warehouse_sk = w_warehouse_sk
          AND inv_date_sk = d_date_sk
          AND d_year = 2001
          AND d_moy = 2
        GROUP BY w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy) foo
  WHERE CASE mean WHEN 0 THEN 0 ELSE stdev/mean END > 1
)
SELECT inv1.w_warehouse_sk, inv1.i_item_sk, inv1.d_moy, inv1.mean, inv1.cov,
       inv2.w_warehouse_sk, inv2.i_item_sk, inv2.d_moy, inv2.mean, inv2.cov
FROM inv1 JOIN inv2
  ON inv1.i_item_sk = inv2.i_item_sk
 AND inv1.w_warehouse_sk = inv2.w_warehouse_sk
ORDER BY inv1.w_warehouse_sk, inv1.i_item_sk, inv1.d_moy, inv1.mean, inv1.cov,
         inv2.d_moy, inv2.mean, inv2.cov;
```

## What changes are included in this PR?

A dedup pass in `SqlToRel` that runs right after `prepare_select_exprs()` and before `self.project()`. It detects duplicate `schema_name()` values and wraps the second (and subsequent) occurrences in an `Alias` with a `:{N}` suffix:

```sql
SELECT x AS c1, y AS c1 FROM t;
-- produces columns: c1, c1:1
```

The actual code is small (~45 lines of logic across 2 files):

- `datafusion/sql/src/utils.rs`: new `deduplicate_select_expr_names()` function
- `datafusion/sql/src/select.rs`: one call site between `prepare_select_exprs()` and `self.project()`

I intentionally kept this scoped to the SQL planner only:

- `validate_unique_names("Projections")` in `builder.rs` is untouched, so the Rust API (`LogicalPlanBuilder::project`) still rejects duplicates
- No changes to the optimizer, physical planner, or DFSchema
- `validate_unique_names("Windows")` is unchanged

**Known limitation:** `SELECT *, x FROM t` still errors when `x` overlaps with `*`, because wildcard expansion happens after this dedup pass (inside `project_with_validation`). Happy to address that in a follow-up if desired.

## Are these changes tested?

New sqllogictest file (`duplicate_column_alias.slt`) with 13 test cases covering:

- Basic duplicate aliases, literals, and same-column-twice
- Subquery with duplicate names
- ORDER BY resolving to first occurrence
- CTE join (TPC-DS Q39 pattern)
- Three-way duplicates
- CAST producing same schema_name as original column
- GROUP BY and aggregates with duplicates
- ORDER BY positional reference to the renamed column
- `iszero(0.0), iszero(-0.0)` (reported in the issue by @jatin510)
- UNION with duplicate column names
- Wildcard limitation documented as explicit `query error` test

Updated existing tests in `sql_integration.rs` (5 tests), `aggregate.slt`, and `unnest.slt` that previously asserted the "Projections require unique" error.

## Are there any user-facing changes?

Yes, this is a behavior change:

- SQL queries with duplicate expression names now succeed instead of erroring
- Duplicate columns get a `:{N}` suffix in the output (e.g., `cov`, `cov:1`)
- First occurrence keeps its original name, so ORDER BY / HAVING references still work
- The programmatic Rust API is unchanged
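
The rationale's claim that `CAST` is transparent to `schema_name()` can be pictured with a toy model. This is only a sketch under stated assumptions: `ToyExpr`, `col`, `cast`, and `display_name` are hypothetical names invented for illustration, not DataFusion's `Expr` API, and the only behavior modelled is that a cast is looked through when the output column name is computed.

```rust
/// Toy expression tree: a hypothetical stand-in for DataFusion's `Expr`.
#[derive(Debug, Clone)]
enum ToyExpr {
    /// A qualified column reference such as `inv1.cov`.
    Column { relation: String, name: String },
    /// `CAST(expr AS data_type)`.
    Cast { expr: Box<ToyExpr>, data_type: String },
}

impl ToyExpr {
    /// Human-readable form, similar to what an error message would print.
    fn display_name(&self) -> String {
        match self {
            ToyExpr::Column { relation, name } => format!("{relation}.{name}"),
            ToyExpr::Cast { expr, data_type } => {
                format!("CAST({} AS {})", expr.display_name(), data_type)
            }
        }
    }

    /// Output column name. Assumption of this sketch: a cast contributes
    /// nothing to the name, so the inner expression's name is returned.
    fn schema_name(&self) -> String {
        match self {
            ToyExpr::Column { .. } => self.display_name(),
            ToyExpr::Cast { expr, .. } => expr.schema_name(),
        }
    }
}

/// Hypothetical helper: build a column reference.
fn col(relation: &str, name: &str) -> ToyExpr {
    ToyExpr::Column {
        relation: relation.to_string(),
        name: name.to_string(),
    }
}

/// Hypothetical helper: wrap an expression in a cast.
fn cast(expr: ToyExpr, data_type: &str) -> ToyExpr {
    ToyExpr::Cast {
        expr: Box::new(expr),
        data_type: data_type.to_string(),
    }
}
```

In this model `cast(col("inv1", "cov"), "Decimal128(30, 10)")` and `col("inv1", "cov")` print differently but share one schema name, which is the shape of the collision reported for Q39.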
1 parent 1a0af76 commit aadae6b

6 files changed: +265 -35 lines changed

datafusion/sql/src/select.rs

Lines changed: 7 additions & 2 deletions
@@ -23,8 +23,9 @@ use crate::planner::{ContextProvider, PlannerContext, SqlToRel};
 use crate::query::to_order_by_exprs_with_select;
 use crate::utils::{
     CheckColumnsMustReferenceAggregatePurpose, CheckColumnsSatisfyExprsPurpose,
-    check_columns_satisfy_exprs, extract_aliases, rebase_expr, resolve_aliases_to_exprs,
-    resolve_columns, resolve_positions_to_exprs, rewrite_recursive_unnests_bottom_up,
+    check_columns_satisfy_exprs, deduplicate_select_expr_names, extract_aliases,
+    rebase_expr, resolve_aliases_to_exprs, resolve_columns, resolve_positions_to_exprs,
+    rewrite_recursive_unnests_bottom_up,
 };

 use datafusion_common::error::DataFusionErrorBuilder;
@@ -109,6 +110,10 @@ impl<S: ContextProvider> SqlToRel<'_, S> {
             planner_context,
         )?;

+        // Auto-suffix duplicate expression names (e.g. cov, cov → cov, cov:1)
+        // before projection so that the unique-name constraint is satisfied.
+        let select_exprs = deduplicate_select_expr_names(select_exprs);
+
         // Having and group by clause may reference aliases defined in select projection
         let projected_plan = self.project(base_plan.clone(), select_exprs)?;
         let select_exprs = projected_plan.expressions();
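
The reason the new call sits before `self.project()` is that projection enforces unique output names. A minimal sketch of such a check follows, assuming a hypothetical `check_unique_names` helper that works on plain strings; it is not DataFusion's `validate_unique_names`, and its error text is invented for the example.

```rust
use std::collections::HashMap;

/// Toy unique-name check over output column names.
/// Hypothetical helper: the real check lives in the logical plan builder
/// and operates on expressions, not strings.
fn check_unique_names(names: &[&str]) -> Result<(), String> {
    // Maps each name to the position where it was first seen.
    let mut first_seen: HashMap<&str, usize> = HashMap::new();
    for (pos, name) in names.iter().enumerate() {
        // `insert` returns the previous position if the name was already present.
        if let Some(prev) = first_seen.insert(*name, pos) {
            return Err(format!(
                "positions {prev} and {pos} share the name \"{name}\""
            ));
        }
    }
    Ok(())
}
```

With names as they look before the dedup pass (`person.age`, `person.age`) this check fails; with the suffixed alias in place (`person.age`, `age:1`) it passes, which is why the suffixing has to happen first.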

datafusion/sql/src/utils.rs

Lines changed: 33 additions & 0 deletions
@@ -33,6 +33,7 @@ use datafusion_expr::builder::get_struct_unnested_columns;
 use datafusion_expr::expr::{
     Alias, GroupingSet, Unnest, WindowFunction, WindowFunctionParams,
 };
+use datafusion_expr::select_expr::SelectExpr;
 use datafusion_expr::utils::{expr_as_column_expr, find_column_exprs};
 use datafusion_expr::{
     ColumnUnnestList, Expr, ExprSchemable, LogicalPlan, col, expr_vec_fmt,
@@ -633,6 +634,38 @@ fn push_projection_dedupl(projection: &mut Vec<Expr>, expr: Expr) {
         projection.push(expr);
     }
 }
+
+/// Auto-suffix duplicate SELECT expression names with `:{count}`.
+///
+/// The first occurrence keeps its original name so that ORDER BY / HAVING
+/// references resolve correctly. Wildcards are left untouched because they
+/// are expanded later in `project_with_validation`.
+///
+/// Duplicates are detected by the schema name of each expression, which
+/// identifies logically identical expressions before column normalization.
+pub(crate) fn deduplicate_select_expr_names(exprs: Vec<SelectExpr>) -> Vec<SelectExpr> {
+    let mut seen: HashMap<String, usize> = HashMap::new();
+    exprs
+        .into_iter()
+        .map(|select_expr| match select_expr {
+            SelectExpr::Expression(expr) => {
+                let name = expr.schema_name().to_string();
+                let count = seen.entry(name.clone()).or_insert(0);
+                let result = if *count > 0 {
+                    let (_qualifier, field_name) = expr.qualified_name();
+                    SelectExpr::Expression(expr.alias(format!("{field_name}:{count}")))
+                } else {
+                    SelectExpr::Expression(expr)
+                };
+                *count += 1;
+                result
+            }
+            // Leave wildcards alone — they are expanded later
+            other => other,
+        })
+        .collect()
+}
+
 /// The context is we want to rewrite unnest() into InnerProjection->Unnest->OuterProjection
 /// Given an expression which contains unnest expr as one of its children,
 /// Try transform depends on unnest type
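
The naming rule in `deduplicate_select_expr_names` can be tried out without DataFusion by modelling each SELECT item as a pair of strings. The sketch below is a simplified model, not the committed code: `Item`, `item`, and `dedup_names` are hypothetical, the schema name is the dedup key, and the unqualified field name is what receives the `:{count}` suffix.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one SELECT item.
#[derive(Debug, Clone, PartialEq)]
struct Item {
    /// Key used to detect duplicates (e.g. `person.age`).
    schema_name: String,
    /// Unqualified name used to build the suffixed alias (e.g. `age`).
    field_name: String,
}

/// Hypothetical helper: build an `Item` from string slices.
fn item(schema_name: &str, field_name: &str) -> Item {
    Item {
        schema_name: schema_name.to_string(),
        field_name: field_name.to_string(),
    }
}

/// Returns the output name of every item: the first occurrence of a schema
/// name is kept as is, later occurrences become `{field_name}:{count}`.
fn dedup_names(items: &[Item]) -> Vec<String> {
    // Number of times each schema name has been seen so far.
    let mut seen: HashMap<&str, usize> = HashMap::new();
    items
        .iter()
        .map(|it| {
            let count = seen.entry(it.schema_name.as_str()).or_insert(0);
            let out = if *count > 0 {
                format!("{}:{}", it.field_name, count)
            } else {
                it.schema_name.clone()
            };
            *count += 1;
            out
        })
        .collect()
}
```

For `SELECT age, age FROM person` the model yields `person.age` and `age:1`, matching the updated snapshot in `sql_integration.rs`. One thing the model makes visible: a generated name is not itself recorded in `seen`, so in this sketch a user alias that already looks like `c1:1` followed by two `c1` columns produces `c1:1` twice. Whether the real planner can reach that case is not covered by the tests in this commit.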

datafusion/sql/tests/sql_integration.rs

Lines changed: 34 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -789,11 +789,13 @@ fn select_column_does_not_exist() {
789789
#[test]
790790
fn select_repeated_column() {
791791
let sql = "SELECT age, age FROM person";
792-
let err = logical_plan(sql).expect_err("query should have failed");
793-
792+
let plan = logical_plan(sql).unwrap();
794793
assert_snapshot!(
795-
err.strip_backtrace(),
796-
@r#"Error during planning: Projections require unique expression names but the expression "person.age" at position 0 and "person.age" at position 1 have the same name. Consider aliasing ("AS") one of them."#
794+
plan,
795+
@r"
796+
Projection: person.age, person.age AS age:1
797+
TableScan: person
798+
"
797799
);
798800
}
799801

@@ -1531,11 +1533,14 @@ fn select_simple_aggregate_column_does_not_exist() {
15311533
#[test]
15321534
fn select_simple_aggregate_repeated_aggregate() {
15331535
let sql = "SELECT MIN(age), MIN(age) FROM person";
1534-
let err = logical_plan(sql).expect_err("query should have failed");
1535-
1536+
let plan = logical_plan(sql).unwrap();
15361537
assert_snapshot!(
1537-
err.strip_backtrace(),
1538-
@r#"Error during planning: Projections require unique expression names but the expression "min(person.age)" at position 0 and "min(person.age)" at position 1 have the same name. Consider aliasing ("AS") one of them."#
1538+
plan,
1539+
@r"
1540+
Projection: min(person.age), min(person.age) AS min(person.age):1
1541+
Aggregate: groupBy=[[]], aggr=[[min(person.age)]]
1542+
TableScan: person
1543+
"
15391544
);
15401545
}
15411546

@@ -1584,11 +1589,14 @@ fn select_from_typed_string_values() {
15841589
#[test]
15851590
fn select_simple_aggregate_repeated_aggregate_with_repeated_aliases() {
15861591
let sql = "SELECT MIN(age) AS a, MIN(age) AS a FROM person";
1587-
let err = logical_plan(sql).expect_err("query should have failed");
1588-
1592+
let plan = logical_plan(sql).unwrap();
15891593
assert_snapshot!(
1590-
err.strip_backtrace(),
1591-
@r#"Error during planning: Projections require unique expression names but the expression "min(person.age) AS a" at position 0 and "min(person.age) AS a" at position 1 have the same name. Consider aliasing ("AS") one of them."#
1594+
plan,
1595+
@r"
1596+
Projection: min(person.age) AS a, min(person.age) AS a AS a:1
1597+
Aggregate: groupBy=[[]], aggr=[[min(person.age)]]
1598+
TableScan: person
1599+
"
15921600
);
15931601
}
15941602

@@ -1625,11 +1633,14 @@ fn select_simple_aggregate_with_groupby_with_aliases() {
16251633
#[test]
16261634
fn select_simple_aggregate_with_groupby_with_aliases_repeated() {
16271635
let sql = "SELECT state AS a, MIN(age) AS a FROM person GROUP BY state";
1628-
let err = logical_plan(sql).expect_err("query should have failed");
1629-
1636+
let plan = logical_plan(sql).unwrap();
16301637
assert_snapshot!(
1631-
err.strip_backtrace(),
1632-
@r#"Error during planning: Projections require unique expression names but the expression "person.state AS a" at position 0 and "min(person.age) AS a" at position 1 have the same name. Consider aliasing ("AS") one of them."#
1638+
plan,
1639+
@r"
1640+
Projection: person.state AS a, min(person.age) AS a AS a:1
1641+
Aggregate: groupBy=[[person.state]], aggr=[[min(person.age)]]
1642+
TableScan: person
1643+
"
16331644
);
16341645
}
16351646

@@ -1750,11 +1761,14 @@ fn select_simple_aggregate_with_groupby_can_use_alias() {
17501761
#[test]
17511762
fn select_simple_aggregate_with_groupby_aggregate_repeated() {
17521763
let sql = "SELECT state, MIN(age), MIN(age) FROM person GROUP BY state";
1753-
let err = logical_plan(sql).expect_err("query should have failed");
1754-
1764+
let plan = logical_plan(sql).unwrap();
17551765
assert_snapshot!(
1756-
err.strip_backtrace(),
1757-
@r#"Error during planning: Projections require unique expression names but the expression "min(person.age)" at position 1 and "min(person.age)" at position 2 have the same name. Consider aliasing ("AS") one of them."#
1766+
plan,
1767+
@r"
1768+
Projection: person.state, min(person.age), min(person.age) AS min(person.age):1
1769+
Aggregate: groupBy=[[person.state]], aggr=[[min(person.age)]]
1770+
TableScan: person
1771+
"
17581772
);
17591773
}
17601774

datafusion/sqllogictest/test_files/aggregate.slt

Lines changed: 7 additions & 12 deletions
@@ -7941,21 +7941,16 @@ select count(), count() * count() from t;
 ----
 2 4

-# DataFusion error: Error during planning: Projections require unique expression names but the expression "count\(Int64\(1\)\)" at position 0 and "count\(Int64\(1\)\)" at position 1 have the same name\. Consider aliasing \("AS"\) one of them\.
-query error
+# Duplicate aggregate expressions are now auto-suffixed
+query II
 select count(1), count(1) from t;
+----
+2 2

-# DataFusion error: Error during planning: Projections require unique expression names but the expression "count\(Int64\(1\)\)" at position 0 and "count\(Int64\(1\)\)" at position 1 have the same name\. Consider aliasing \("AS"\) one of them\.
-query error
-explain select count(1), count(1) from t;
-
-# DataFusion error: Error during planning: Projections require unique expression names but the expression "count\(Int64\(1\) AS \)" at position 0 and "count\(Int64\(1\) AS \)" at position 1 have the same name\. Consider aliasing \("AS"\) one of them\.
-query error
+query II
 select count(), count() from t;
-
-# DataFusion error: Error during planning: Projections require unique expression names but the expression "count\(Int64\(1\) AS \)" at position 0 and "count\(Int64\(1\) AS \)" at position 1 have the same name\. Consider aliasing \("AS"\) one of them\.
-query error
-explain select count(), count() from t;
+----
+2 2

 query II
 select count(1), count(2) from t;

duplicate_column_alias.slt (new file)

Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Tests for duplicate column names/aliases in projections.
+# DataFusion auto-suffixes duplicates with :{count} (e.g. cov, cov:1).
+
+# Setup
+statement ok
+CREATE TABLE t(x INT, y INT) AS VALUES (1, 2), (3, 4);
+
+#
+# Basic duplicate alias
+#
+query II
+SELECT x AS c1, y AS c1 FROM t;
+----
+1 2
+3 4
+
+#
+# Duplicate literal expressions
+#
+query II
+SELECT 1, 1;
+----
+1 1
+
+#
+# Same column selected twice
+#
+query II
+SELECT x, x FROM t;
+----
+1 1
+3 3
+
+#
+# Subquery with duplicate column names
+#
+query II
+SELECT * FROM (SELECT x AS c1, y AS c1 FROM t);
+----
+1 2
+3 4
+
+#
+# ORDER BY referencing a duplicated alias resolves to first occurrence
+#
+query II
+SELECT x AS c1, y AS c1 FROM t ORDER BY c1;
+----
+1 2
+3 4
+
+#
+# CTE join producing duplicate column names (TPC-DS Q39 pattern)
+#
+statement ok
+CREATE TABLE inv(warehouse_sk INT, item_sk INT, moy INT, cov DOUBLE) AS VALUES
+(1, 10, 1, 1.5),
+(1, 10, 2, 2.0),
+(2, 20, 1, 0.8),
+(2, 20, 2, 1.2);
+
+query IIIRIIIR
+WITH inv1 AS (
+SELECT warehouse_sk, item_sk, moy, cov FROM inv WHERE moy = 1
+),
+inv2 AS (
+SELECT warehouse_sk, item_sk, moy, cov FROM inv WHERE moy = 2
+)
+SELECT inv1.warehouse_sk, inv1.item_sk, inv1.moy, inv1.cov,
+inv2.warehouse_sk, inv2.item_sk, inv2.moy, inv2.cov
+FROM inv1 JOIN inv2
+ON inv1.item_sk = inv2.item_sk AND inv1.warehouse_sk = inv2.warehouse_sk
+ORDER BY inv1.warehouse_sk, inv1.item_sk;
+----
+1 10 1 1.5 1 10 2 2
+2 20 1 0.8 2 20 2 1.2
+
+#
+# Three-way duplicate
+#
+query III
+SELECT x AS a, y AS a, x + y AS a FROM t;
+----
+1 2 3
+3 4 7
+
+#
+# CAST produces same schema_name as original column (TPC-DS Q39 pattern).
+# CAST is transparent to schema_name, so CAST(x AS DOUBLE) and x
+# both have schema_name "x" — this must be deduped.
+#
+query RI
+SELECT CAST(x AS DOUBLE), x FROM t;
+----
+1 1
+3 3
+
+#
+# GROUP BY with duplicate expressions in SELECT
+#
+query II
+SELECT x, x FROM t GROUP BY x;
+----
+1 1
+3 3
+
+#
+# Aggregate with GROUP BY producing duplicate column names
+#
+query III
+SELECT x, SUM(y) AS total, SUM(y) AS total FROM t GROUP BY x ORDER BY x;
+----
+1 2 2
+3 4 4
+
+#
+# ORDER BY referencing the second (renamed) column by position
+#
+query II
+SELECT y AS c1, x AS c1 FROM t ORDER BY 2;
+----
+2 1
+4 3
+
+#
+# Function calls that produce the same schema_name after argument
+# normalization (reported in issue #6543 for iszero).
+#
+query BB
+SELECT iszero(0.0), iszero(-0.0);
+----
+true true
+
+#
+# Duplicate expressions inside a UNION subquery
+#
+query II
+SELECT * FROM (SELECT x AS a, y AS a FROM t UNION ALL SELECT y AS a, x AS a FROM t) ORDER BY 1, 2;
+----
+1 2
+2 1
+3 4
+4 3
+
+#
+# Known limitation: wildcard expansion happens after dedup, so
+# SELECT *, col FROM t still errors when col overlaps with *.
+# This will be addressed in a follow-up PR.
+#
+query error DataFusion error: Error during planning: Projections require unique expression names but the expression "t\.x" at position 0 and "t\.x" at position 2 have the same name\. Consider aliasing \("AS"\) one of them\.
+SELECT *, x FROM t;
+
+# Cleanup
+statement ok
+DROP TABLE t;
+
+statement ok
+DROP TABLE inv;
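
The wildcard limitation exercised by the last test comes down to ordering: suffixing runs first, wildcard expansion second. The sketch below models that order under simplifying assumptions. `SelectItem`, `dedup`, `expand`, and `has_duplicates` are hypothetical names, items are plain strings, and the suffix is appended to the whole name rather than to the unqualified field name as the real pass does.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical SELECT item: either a named expression or `*`.
#[derive(Debug, Clone, PartialEq)]
enum SelectItem {
    Named(String),
    Wildcard,
}

/// Step 1: suffix repeated names; wildcards pass through untouched.
fn dedup(items: Vec<SelectItem>) -> Vec<SelectItem> {
    let mut seen: HashMap<String, usize> = HashMap::new();
    items
        .into_iter()
        .map(|it| match it {
            SelectItem::Named(name) => {
                let count = seen.entry(name.clone()).or_insert(0);
                let out = if *count > 0 {
                    format!("{name}:{count}")
                } else {
                    name
                };
                *count += 1;
                SelectItem::Named(out)
            }
            SelectItem::Wildcard => SelectItem::Wildcard,
        })
        .collect()
}

/// Step 2: replace each wildcard with the table's column names.
fn expand(items: Vec<SelectItem>, table_columns: &[&str]) -> Vec<String> {
    items
        .into_iter()
        .flat_map(|it| match it {
            SelectItem::Named(name) => vec![name],
            SelectItem::Wildcard => {
                table_columns.iter().map(|c| c.to_string()).collect()
            }
        })
        .collect()
}

/// True if any output name occurs more than once.
fn has_duplicates(names: &[String]) -> bool {
    let mut seen = HashSet::new();
    names.iter().any(|n| !seen.insert(n))
}
```

Running `SELECT *, x` through the model (a wildcard followed by `t.x`, with table columns `t.x` and `t.y`) leaves `t.x` at positions 0 and 2 after expansion, the same positions named in the expected error above. Swapping the two steps, or running a second dedup after expansion, would be one way to close the gap in this model; the PR leaves the actual fix to a follow-up.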

datafusion/sqllogictest/test_files/unnest.slt

Lines changed: 9 additions & 1 deletion
@@ -547,8 +547,16 @@ select unnest(column1) from (select * from (values([1,2,3]), ([4,5,6])) limit 1
 5
 6

-query error DataFusion error: Error during planning: Projections require unique expression names but the expression "UNNEST\(unnest_table.column1\)" at position 0 and "UNNEST\(unnest_table.column1\)" at position 1 have the same name. Consider aliasing \("AS"\) one of them.
+query II
 select unnest(column1), unnest(column1) from unnest_table;
+----
+1 1
+2 2
+3 3
+4 4
+5 5
+6 6
+12 12

 query II
 select unnest(column1), unnest(column1) u1 from unnest_table;
