Skip to content

Commit 4c04a4d

Browse files
authored
Fix transpose VARCHAR length drift in unpivot/pivot lowering (#5479)
* Fix transpose VARCHAR length drift in unpivot/pivot lowering PPL `transpose` lowers via `RelBuilder.unpivot()` + `pivot()`. The unpivot synthesizes a VALUES leaf carrying axis literals (the input field names), e.g. `VALUES('firstname'), ('age'), ('balance')`. Calcite types each RexLiteral as `CHAR(literalLen)` and types the VALUES column as `CHAR(maxAxisLiteralLen)` — the longest literal wins on column-level type inference. This bites the analytics-engine route end-to-end: 1. After unpivot the `column` axis column is `CHAR(9)` (from "firstname"). Through Calcite's TRIM (`TO_VARYING`) it becomes `VARCHAR(9)`. 2. The `_value_transpose_` value column is built from `CAST(input_field AS VARCHAR)` — unbounded VARCHAR. 3. `MAX(_value_transpose_)` aggCall is created with declared return type = unbounded VARCHAR (inferred from arg 0 at call-construction time). 4. The downstream non-prefix groupSet aggregate (`group=[{1}], MAX($0)`) splits into PARTIAL/FINAL on the analytics path. PARTIAL hoists group keys to the output prefix, so FINAL's `argList=[0]` reads the group-key slot — `VARCHAR(9)` — instead of the agg-state slot. Calcite's `Aggregate.<init>` then runs `typeMatchesInferred` and rejects the plan: declared `VARCHAR` ≠ inferred `VARCHAR(9)`. 5. Even when the aggregate validation passes, the substrait/Arrow path sees `FixedChar(maxAxisLiteralLen)` schema vs runtime arrays whose actual values are shorter (e.g. "age" with length 3) and trips `Row field type (FixedChar{length=3}) does not match schema field type (FixedChar{length=9})`. Two fixes, both in the lowering site: * Build every axis literal at the same `CHAR(maxAxisLiteralLen)` type. Calcite then space-pads the shorter literals at value-construction time, so the runtime CHAR vector and the declared schema both have the same fixed length. The downstream TRIM strips the padding. * Wrap the trimmed-axis group key in an explicit `CAST(... AS VARCHAR)` to unbounded VARCHAR. This makes the group key type match `_value_transpose_`'s unbounded VARCHAR end-to-end, so the aggregate's row-type check sees consistent types regardless of which side the analytics-engine split rule places the group key on. These have to live in sql plugin, not in the analytics-engine planner: the typing decisions are made by Calcite's `RelBuilder.unpivot()` implementation when it constructs the VALUES leaf — long before any analytics-engine rule sees the plan. By the time the plan reaches the analytics-engine route, the precision drift is already baked into the RelDataType chain. Fixing it downstream would require pattern-matching on transpose-shaped sub-trees inside the planner, which is fragile and mis-attributes the root cause. The lowering author owns the type contract for the operators it emits. Adds: - `testTransposeColumnAxisUsesUnboundedVarchar` regression assertion pinning the output `column` field's type to unbounded VARCHAR. Catches any future change that re-introduces axis-literal precision into the group key. - Updated plan-shape assertions across the existing transpose tests to reflect the padded axis literals (`'cnt '`, `'COMM '`, etc.) and the `CAST(TRIM(...) AS VARCHAR)` group key. Verified end-to-end: `CalciteTransposeCommandIT` 5/5 pass with `tests.analytics.parquet_indices=true`. Signed-off-by: Songkan Tang <songkant@amazon.com> * Update explain_transpose.yaml expected plan for new lowering shape CalciteExplainIT.testTransposeExplain regenerated against the updated transpose lowering: axis literals are now padded to a uniform CHAR(N) (N = max axis literal length, i.e. 14 for 'account_number'), and the group-key TRIM output is wrapped in a CAST(... AS VARCHAR) to unbounded VARCHAR. The plan-shape diff exactly mirrors the documented behavior change in the parent commit: * `$f20=[TRIM(...)]` → `$f20=[CAST(TRIM(...)):VARCHAR NOT NULL]` * axis literals e.g. 'firstname' → 'firstname ' (padded) * LogicalValues row type tuples are correspondingly padded Verified locally: `./gradlew :integ-test:integTestRemote --tests "*CalciteExplainIT.testTransposeExplain"` passes. Signed-off-by: Songkan Tang <songkant@amazon.com> --------- Signed-off-by: Songkan Tang <songkant@amazon.com>
1 parent 050baee commit 4c04a4d

3 files changed

Lines changed: 85 additions & 72 deletions

File tree

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java

Lines changed: 13 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -915,6 +915,9 @@ public RelNode visitTranspose(
915915
RelBuilder b = context.relBuilder;
916916
RexBuilder rx = context.rexBuilder;
917917
RelDataType varchar = rx.getTypeFactory().createSqlType(SqlTypeName.VARCHAR);
918+
int axisLiteralLength = fieldNames.stream().mapToInt(String::length).max().orElse(0);
919+
RelDataType axisLiteralType =
920+
rx.getTypeFactory().createSqlType(SqlTypeName.CHAR, axisLiteralLength);
918921

919922
// Step 1: ROW_NUMBER
920923
b.projectPlus(
@@ -932,18 +935,22 @@ public RelNode visitTranspose(
932935
.map(
933936
f ->
934937
Map.entry(
935-
ImmutableList.of(rx.makeLiteral(f)),
938+
ImmutableList.of(
939+
(RexLiteral) rx.makeLiteral(f, axisLiteralType, false, false)),
936940
ImmutableList.of((RexNode) rx.makeCast(varchar, b.field(f), true))))
937941
.collect(Collectors.toList()));
938942

939943
// Step 3: Trim spaces from columnName column before pivot
940944

941945
RexNode trimmedColumnName =
942-
context.rexBuilder.makeCall(
943-
SqlStdOperatorTable.TRIM,
944-
context.rexBuilder.makeFlag(SqlTrimFunction.Flag.BOTH),
945-
context.rexBuilder.makeLiteral(" "),
946-
b.field(columnName));
946+
context.rexBuilder.makeCast(
947+
varchar,
948+
context.rexBuilder.makeCall(
949+
SqlStdOperatorTable.TRIM,
950+
context.rexBuilder.makeFlag(SqlTrimFunction.Flag.BOTH),
951+
context.rexBuilder.makeLiteral(" "),
952+
b.field(columnName)),
953+
true);
947954

948955
// Step 4: PIVOT
949956
b.pivot(

integ-test/src/test/resources/expectedOutput/calcite/explain_transpose.yaml

Lines changed: 7 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -3,20 +3,21 @@ calcite:
33
LogicalSystemLimit(fetch=[10000], type=[QUERY_SIZE_LIMIT])
44
LogicalProject(column_names=[$0], row 1=[$1], row 2=[$2], row 3=[$3], row 4=[$4])
55
LogicalAggregate(group=[{1}], row 1_null=[MAX($0) FILTER $2], row 2_null=[MAX($0) FILTER $3], row 3_null=[MAX($0) FILTER $4], row 4_null=[MAX($0) FILTER $5])
6-
LogicalProject(_value_transpose_=[CAST($19):VARCHAR NOT NULL], $f20=[TRIM(FLAG(BOTH), ' ', $18)], $f21=[=($17, 1)], $f22=[=($17, 2)], $f23=[=($17, 3)], $f24=[=($17, 4)])
6+
LogicalProject(_value_transpose_=[CAST($19):VARCHAR NOT NULL], $f20=[CAST(TRIM(FLAG(BOTH), ' ', $18)):VARCHAR NOT NULL], $f21=[=($17, 1)], $f22=[=($17, 2)], $f23=[=($17, 3)], $f24=[=($17, 4)])
77
LogicalFilter(condition=[IS NOT NULL($19)])
8-
LogicalProject(account_number=[$0], firstname=[$1], address=[$2], balance=[$3], gender=[$4], city=[$5], employer=[$6], state=[$7], age=[$8], email=[$9], lastname=[$10], _id=[$11], _index=[$12], _score=[$13], _maxscore=[$14], _sort=[$15], _routing=[$16], _row_number_transpose_=[$17], column_names=[$18], _value_transpose_=[CASE(=($18, 'account_number'), CAST($0):VARCHAR NOT NULL, =($18, 'firstname'), CAST($1):VARCHAR NOT NULL, =($18, 'address'), CAST($2):VARCHAR NOT NULL, =($18, 'balance'), CAST($3):VARCHAR NOT NULL, =($18, 'gender'), CAST($4):VARCHAR NOT NULL, =($18, 'city'), CAST($5):VARCHAR NOT NULL, =($18, 'employer'), CAST($6):VARCHAR NOT NULL, =($18, 'state'), CAST($7):VARCHAR NOT NULL, =($18, 'age'), CAST($8):VARCHAR NOT NULL, =($18, 'email'), CAST($9):VARCHAR NOT NULL, =($18, 'lastname'), CAST($10):VARCHAR NOT NULL, null:NULL)])
8+
LogicalProject(account_number=[$0], firstname=[$1], address=[$2], balance=[$3], gender=[$4], city=[$5], employer=[$6], state=[$7], age=[$8], email=[$9], lastname=[$10], _id=[$11], _index=[$12], _score=[$13], _maxscore=[$14], _sort=[$15], _routing=[$16], _row_number_transpose_=[$17], column_names=[$18], _value_transpose_=[CASE(=($18, 'account_number'), CAST($0):VARCHAR NOT NULL, =($18, 'firstname '), CAST($1):VARCHAR NOT NULL, =($18, 'address '), CAST($2):VARCHAR NOT NULL, =($18, 'balance '), CAST($3):VARCHAR NOT NULL, =($18, 'gender '), CAST($4):VARCHAR NOT NULL, =($18, 'city '), CAST($5):VARCHAR NOT NULL, =($18, 'employer '), CAST($6):VARCHAR NOT NULL, =($18, 'state '), CAST($7):VARCHAR NOT NULL, =($18, 'age '), CAST($8):VARCHAR NOT NULL, =($18, 'email '), CAST($9):VARCHAR NOT NULL, =($18, 'lastname '), CAST($10):VARCHAR NOT NULL, null:NULL)])
99
LogicalJoin(condition=[true], joinType=[inner])
1010
LogicalProject(account_number=[$0], firstname=[$1], address=[$2], balance=[$3], gender=[$4], city=[$5], employer=[$6], state=[$7], age=[$8], email=[$9], lastname=[$10], _id=[$11], _index=[$12], _score=[$13], _maxscore=[$14], _sort=[$15], _routing=[$16], _row_number_transpose_=[ROW_NUMBER() OVER ()])
1111
LogicalSort(fetch=[5])
1212
CalciteLogicalIndexScan(table=[[OpenSearch, opensearch-sql_test_index_account]])
13-
LogicalValues(tuples=[[{ 'account_number' }, { 'firstname' }, { 'address' }, { 'balance' }, { 'gender' }, { 'city' }, { 'employer' }, { 'state' }, { 'age' }, { 'email' }, { 'lastname' }]])
13+
LogicalValues(tuples=[[{ 'account_number' }, { 'firstname ' }, { 'address ' }, { 'balance ' }, { 'gender ' }, { 'city ' }, { 'employer ' }, { 'state ' }, { 'age ' }, { 'email ' }, { 'lastname ' }]])
1414
physical: |
1515
EnumerableLimit(fetch=[10000])
1616
EnumerableAggregate(group=[{1}], row 1_null=[MAX($0) FILTER $2], row 2_null=[MAX($0) FILTER $3], row 3_null=[MAX($0) FILTER $4], row 4_null=[MAX($0) FILTER $5])
17-
EnumerableCalc(expr#0..12=[{inputs}], expr#13=['account_number'], expr#14=[=($t12, $t13)], expr#15=[CAST($t0):VARCHAR NOT NULL], expr#16=['firstname'], expr#17=[=($t12, $t16)], expr#18=[CAST($t1):VARCHAR NOT NULL], expr#19=['address'], expr#20=[=($t12, $t19)], expr#21=[CAST($t2):VARCHAR NOT NULL], expr#22=['balance'], expr#23=[=($t12, $t22)], expr#24=[CAST($t3):VARCHAR NOT NULL], expr#25=['gender'], expr#26=[=($t12, $t25)], expr#27=[CAST($t4):VARCHAR NOT NULL], expr#28=['city'], expr#29=[=($t12, $t28)], expr#30=[CAST($t5):VARCHAR NOT NULL], expr#31=['employer'], expr#32=[=($t12, $t31)], expr#33=[CAST($t6):VARCHAR NOT NULL], expr#34=['state'], expr#35=[=($t12, $t34)], expr#36=[CAST($t7):VARCHAR NOT NULL], expr#37=['age'], expr#38=[=($t12, $t37)], expr#39=[CAST($t8):VARCHAR NOT NULL], expr#40=['email'], expr#41=[=($t12, $t40)], expr#42=[CAST($t9):VARCHAR NOT NULL], expr#43=['lastname'], expr#44=[=($t12, $t43)], expr#45=[CAST($t10):VARCHAR NOT NULL], expr#46=[null:NULL], expr#47=[CASE($t14, $t15, $t17, $t18, $t20, $t21, $t23, $t24, $t26, $t27, $t29, $t30, $t32, $t33, $t35, $t36, $t38, $t39, $t41, $t42, $t44, $t45, $t46)], expr#48=[CAST($t47):VARCHAR NOT NULL], expr#49=[FLAG(BOTH)], expr#50=[' '], expr#51=[TRIM($t49, $t50, $t12)], expr#52=[1], expr#53=[=($t11, $t52)], expr#54=[2], expr#55=[=($t11, $t54)], expr#56=[3], expr#57=[=($t11, $t56)], expr#58=[4], expr#59=[=($t11, $t58)], _value_transpose_=[$t48], $f20=[$t51], $f21=[$t53], $f22=[$t55], $f23=[$t57], $f24=[$t59])
17+
EnumerableCalc(expr#0..12=[{inputs}], expr#13=['account_number'], expr#14=[=($t12, $t13)], expr#15=[CAST($t0):VARCHAR NOT NULL], expr#16=['firstname '], expr#17=[=($t12, $t16)], expr#18=[CAST($t1):VARCHAR NOT NULL], expr#19=['address '], expr#20=[=($t12, $t19)], expr#21=[CAST($t2):VARCHAR NOT NULL], expr#22=['balance '], expr#23=[=($t12, $t22)], expr#24=[CAST($t3):VARCHAR NOT NULL], expr#25=['gender '], expr#26=[=($t12, $t25)], expr#27=[CAST($t4):VARCHAR NOT NULL], expr#28=['city '], expr#29=[=($t12, $t28)], expr#30=[CAST($t5):VARCHAR NOT NULL], expr#31=['employer '], expr#32=[=($t12, $t31)], expr#33=[CAST($t6):VARCHAR NOT NULL], expr#34=['state '], expr#35=[=($t12, $t34)], expr#36=[CAST($t7):VARCHAR NOT NULL], expr#37=['age '], expr#38=[=($t12, $t37)], expr#39=[CAST($t8):VARCHAR NOT NULL], expr#40=['email '], expr#41=[=($t12, $t40)], expr#42=[CAST($t9):VARCHAR NOT NULL], expr#43=['lastname '], expr#44=[=($t12, $t43)], expr#45=[CAST($t10):VARCHAR NOT NULL], expr#46=[null:NULL], expr#47=[CASE($t14, $t15, $t17, $t18, $t20, $t21, $t23, $t24, $t26, $t27, $t29, $t30, $t32, $t33, $t35, $t36, $t38, $t39, $t41, $t42, $t44, $t45, $t46)], expr#48=[CAST($t47):VARCHAR NOT NULL], expr#49=[FLAG(BOTH)], expr#50=[' '], expr#51=[TRIM($t49, $t50, $t12)], expr#52=[CAST($t51):VARCHAR NOT NULL], expr#53=[1], expr#54=[=($t11, $t53)], expr#55=[2], expr#56=[=($t11, $t55)], expr#57=[3], expr#58=[=($t11, $t57)], expr#59=[4], expr#60=[=($t11, $t59)], _value_transpose_=[$t48], $f20=[$t52], $f21=[$t54], $f22=[$t56], $f23=[$t58], $f24=[$t60])
1818
EnumerableNestedLoopJoin(condition=[true], joinType=[inner])
1919
EnumerableWindow(window#0=[window(rows between UNBOUNDED PRECEDING and CURRENT ROW aggs [ROW_NUMBER()])])
2020
CalciteEnumerableIndexScan(table=[[OpenSearch, opensearch-sql_test_index_account]], PushDownContext=[[PROJECT->[account_number, firstname, address, balance, gender, city, employer, state, age, email, lastname], LIMIT->5], OpenSearchRequestBuilder(sourceBuilder={"from":0,"size":5,"timeout":"1m","_source":{"includes":["account_number","firstname","address","balance","gender","city","employer","state","age","email","lastname"]}}, requestedTotalSize=5, pageSize=null, startFrom=0)])
21-
EnumerableCalc(expr#0=[{inputs}], expr#1=[Sarg['account_number', 'address':CHAR(14), 'age':CHAR(14), 'balance':CHAR(14), 'city':CHAR(14), 'email':CHAR(14), 'employer':CHAR(14), 'firstname':CHAR(14), 'gender':CHAR(14), 'lastname':CHAR(14), 'state':CHAR(14)]:CHAR(14)], expr#2=[SEARCH($t0, $t1)], column_names=[$t0], $condition=[$t2])
22-
EnumerableValues(tuples=[[{ 'account_number' }, { 'firstname' }, { 'address' }, { 'balance' }, { 'gender' }, { 'city' }, { 'employer' }, { 'state' }, { 'age' }, { 'email' }, { 'lastname' }]])
21+
EnumerableCalc(expr#0=[{inputs}], expr#1=[Sarg['account_number', 'address ', 'age ', 'balance ', 'city ', 'email ', 'employer ', 'firstname ', 'gender ', 'lastname ', 'state ']:CHAR(14)], expr#2=[SEARCH($t0, $t1)], column_names=[$t0], $condition=[$t2])
22+
EnumerableValues(tuples=[[{ 'account_number' }, { 'firstname ' }, { 'address ' }, { 'balance ' }, { 'gender ' }, { 'city ' }, { 'employer ' }, { 'state ' }, { 'age ' }, { 'email ' }, { 'lastname ' }]])
23+

0 commit comments

Comments
 (0)