Skip to content

Commit 2ec0ab5

Browse files
authored
perf(logical-plan): box CreateExternalTable / CreateFunction in DdlStatement (-45% LogicalPlan size) (apache#22733)
## Which issue does this PR close? Closes apache#22732. ## Rationale for this change `LogicalPlan` is 320 bytes today, sized by the `Ddl(DdlStatement)` variant — which is itself 320 bytes because it carries `CreateExternalTable` (312 bytes) and `CreateFunction` (288 bytes). Every non-DDL variant is at most 176 bytes (`Join`). Every SELECT query pays the full 320-byte payload on every `mem::take` / `mem::swap` / `Arc<LogicalPlan>` write during planning, even though it never instantiates a DDL node. Profiling `sql_planner` (samply, macOS aarch64) showed ~13% of CPU in `libsystem_platform.dylib` (memcpy / memmove). Shrinking the most frequently-moved type directly attacks that pool. ## What changes are included in this PR? Box the two oversized DDL variants: ```rust pub enum DdlStatement { CreateExternalTable(Box<CreateExternalTable>), // … CreateFunction(Box<CreateFunction>), // … } ``` After this change: - `DdlStatement` drops from **320 → ~152 bytes** (max variant is now `CreateIndex` at 144). - `LogicalPlan` drops from **320 → 176 bytes (–45%)** — the enum discriminant fits inside `Join`'s alignment padding, so the enum is exactly the same width as `Join`. DDL plan construction takes one extra `Box::new(...)` allocation per DDL statement — negligible because DDL plans are one-shot and not on the per-query hot path. The diff is mechanical: construction sites add `Box::new(...)`, the few field-destructuring patterns are converted to a `let CreateExternalTable { … } = ce.as_ref();` shape, and the FFI/proto crates need one `*cmd` / `Box::new(cmd)` adjustment at the type boundaries. A new `test_size_of_logical_plan` unit test pins `size_of::<LogicalPlan>() == 176` and asserts `DdlStatement` stays smaller than `Join`, so future variant growth that would re-balloon the enum trips the test rather than silently regressing the planning hot path. (Same shape as the existing `test_size_of_expr` in `expr.rs`.) ## Are these changes tested? - 711 optimizer + 217 expr + 86 sql + 7 proto unit tests pass. - SLT `create_external_table` / `create_function` / `ddl` pass. - `cargo clippy --workspace --all-targets -- -D warnings` clean. ## Are there any user-facing changes? Yes (semantic API only). Code that pattern-matches `DdlStatement::CreateExternalTable(CreateExternalTable { … })` must change to bind the box and deref. Code that constructs these variants must wrap with `Box::new(...)`. No behavioral change.
1 parent e1d8d46 commit 2ec0ab5

7 files changed

Lines changed: 205 additions & 55 deletions

File tree

datafusion/core/src/execution/context/mod.rs

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -714,7 +714,7 @@ impl SessionContext {
714714
Box::pin(self.drop_schema(cmd)).await
715715
}
716716
DdlStatement::CreateFunction(cmd) => {
717-
Box::pin(self.create_function(cmd)).await
717+
Box::pin(self.create_function(*cmd)).await
718718
}
719719
DdlStatement::DropFunction(cmd) => {
720720
Box::pin(self.drop_function(cmd)).await

datafusion/expr/src/logical_plan/ddl.rs

Lines changed: 14 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -38,8 +38,10 @@ use sqlparser::ast::Ident;
3838
/// Various types of DDL (CREATE / DROP) catalog manipulation
3939
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Hash)]
4040
pub enum DdlStatement {
41-
/// Creates an external table.
42-
CreateExternalTable(CreateExternalTable),
41+
/// Creates an external table. Boxed to keep `LogicalPlan` enum size down
42+
/// — `CreateExternalTable` is ~312 bytes, dwarfing every other variant
43+
/// in the plan tree and forcing the whole enum to that width.
44+
CreateExternalTable(Box<CreateExternalTable>),
4345
/// Creates an in memory table.
4446
CreateMemoryTable(CreateMemoryTable),
4547
/// Creates a new view.
@@ -56,8 +58,9 @@ pub enum DdlStatement {
5658
DropView(DropView),
5759
/// Drops a catalog schema
5860
DropCatalogSchema(DropCatalogSchema),
59-
/// Create function statement
60-
CreateFunction(CreateFunction),
61+
/// Create function statement. Boxed for the same reason as
62+
/// [`Self::CreateExternalTable`] (~288 bytes).
63+
CreateFunction(Box<CreateFunction>),
6164
/// Drop function statement
6265
DropFunction(DropFunction),
6366
}
@@ -66,9 +69,7 @@ impl DdlStatement {
6669
/// Get a reference to the logical plan's schema
6770
pub fn schema(&self) -> &DFSchemaRef {
6871
match self {
69-
DdlStatement::CreateExternalTable(CreateExternalTable { schema, .. }) => {
70-
schema
71-
}
72+
DdlStatement::CreateExternalTable(ce) => &ce.schema,
7273
DdlStatement::CreateMemoryTable(CreateMemoryTable { input, .. })
7374
| DdlStatement::CreateView(CreateView { input, .. }) => input.schema(),
7475
DdlStatement::CreateCatalogSchema(CreateCatalogSchema { schema, .. }) => {
@@ -79,7 +80,7 @@ impl DdlStatement {
7980
DdlStatement::DropTable(DropTable { schema, .. }) => schema,
8081
DdlStatement::DropView(DropView { schema, .. }) => schema,
8182
DdlStatement::DropCatalogSchema(DropCatalogSchema { schema, .. }) => schema,
82-
DdlStatement::CreateFunction(CreateFunction { schema, .. }) => schema,
83+
DdlStatement::CreateFunction(cf) => &cf.schema,
8384
DdlStatement::DropFunction(DropFunction { schema, .. }) => schema,
8485
}
8586
}
@@ -131,11 +132,9 @@ impl DdlStatement {
131132
impl Display for Wrapper<'_> {
132133
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
133134
match self.0 {
134-
DdlStatement::CreateExternalTable(CreateExternalTable {
135-
name,
136-
constraints,
137-
..
138-
}) => {
135+
DdlStatement::CreateExternalTable(ce) => {
136+
let name = &ce.name;
137+
let constraints = &ce.constraints;
139138
if constraints.is_empty() {
140139
write!(f, "CreateExternalTable: {name:?}")
141140
} else {
@@ -191,7 +190,8 @@ impl DdlStatement {
191190
"DropCatalogSchema: {name:?} if not exist:={if_exists} cascade:={cascade}"
192191
)
193192
}
194-
DdlStatement::CreateFunction(CreateFunction { name, .. }) => {
193+
DdlStatement::CreateFunction(cf) => {
194+
let name = &cf.name;
195195
write!(f, "CreateFunction: name {name:?}")
196196
}
197197
DdlStatement::DropFunction(DropFunction { name, .. }) => {

datafusion/expr/src/logical_plan/plan.rs

Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4740,6 +4740,45 @@ mod tests {
47404740
use insta::{assert_debug_snapshot, assert_snapshot};
47414741
use std::hash::DefaultHasher;
47424742

4743+
/// `LogicalPlan` is moved/swapped on every step of the planning hot path
4744+
/// (every `mem::take` in an in-place rewriter, every `Arc<LogicalPlan>`
4745+
/// write, every owned `map_*` traversal). Its size is set by the largest
4746+
/// variant, so an oversized variant balloons cost for every other variant.
4747+
///
4748+
/// Today the size-setter should be `Join` (~176 bytes); `DdlStatement` is
4749+
/// boxed precisely so it does not dominate. If you grow a variant, please
4750+
/// box the new large fields rather than letting this number creep up —
4751+
/// see the analogous `test_size_of_expr` in `expr.rs`.
4752+
#[test]
4753+
fn test_size_of_logical_plan() {
4754+
// `LogicalPlan` enum on aarch64 / x86_64. Today this matches
4755+
// `Join`'s 176 bytes (the enum discriminant fits in `Join`'s
4756+
// alignment padding); if `Join` grows or another variant overtakes
4757+
// it, this number will move with the new size-setter.
4758+
assert_eq!(size_of::<LogicalPlan>(), 176);
4759+
// `DdlStatement` is `Ddl(DdlStatement)`'s payload; keep it below the
4760+
// `Join` ceiling so it never re-becomes the size-setter.
4761+
assert!(
4762+
size_of::<DdlStatement>() < size_of::<Join>(),
4763+
"DdlStatement ({} bytes) should stay smaller than Join ({} bytes); \
4764+
box the new large variant rather than letting it dominate `LogicalPlan`.",
4765+
size_of::<DdlStatement>(),
4766+
size_of::<Join>(),
4767+
);
4768+
// Sanity check the two boxed variants stay boxed (so the payload
4769+
// sits on the heap, not in the enum).
4770+
assert_eq!(
4771+
size_of::<Box<crate::CreateExternalTable>>(),
4772+
8,
4773+
"CreateExternalTable should be Box'd inside DdlStatement"
4774+
);
4775+
assert_eq!(
4776+
size_of::<Box<crate::CreateFunction>>(),
4777+
8,
4778+
"CreateFunction should be Box'd inside DdlStatement"
4779+
);
4780+
}
4781+
47434782
fn employee_schema() -> Schema {
47444783
Schema::new(vec![
47454784
Field::new("id", DataType::Int32, false),

datafusion/ffi/src/table_provider_factory.rs

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -152,7 +152,7 @@ impl FFI_TableProviderFactory {
152152
let plan = LogicalPlanNode::decode(cmd_serialized.as_ref())
153153
.map_err(|e| DataFusionError::Internal(format!("{e:?}")))?;
154154
match plan.try_into_logical_plan(&task_ctx, logical_codec.as_ref())? {
155-
LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd)) => Ok(cmd),
155+
LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd)) => Ok(*cmd),
156156
_ => Err(DataFusionError::Internal(
157157
"Invalid logical plan in FFI_TableProviderFactory.".to_owned(),
158158
)),
@@ -272,7 +272,7 @@ impl ForeignTableProviderFactory {
272272
let logical_codec: Arc<dyn LogicalExtensionCodec> =
273273
(&self.0.logical_codec).into();
274274

275-
let plan = LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd));
275+
let plan = LogicalPlan::Ddl(DdlStatement::CreateExternalTable(Box::new(cmd)));
276276
let plan: LogicalPlanNode =
277277
AsLogicalPlan::try_from_logical_plan(&plan, logical_codec.as_ref())?;
278278

datafusion/proto/src/logical_plan/mod.rs

Lines changed: 27 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -790,26 +790,30 @@ impl AsLogicalPlan for LogicalPlanNode {
790790
}
791791

792792
Ok(LogicalPlan::Ddl(DdlStatement::CreateExternalTable(
793-
CreateExternalTable::builder(
794-
from_table_reference(
795-
create_extern_table.name.as_ref(),
796-
"CreateExternalTable",
797-
)?,
798-
create_extern_table.location.clone(),
799-
create_extern_table.file_type.clone(),
800-
pb_schema.try_into()?,
801-
)
802-
.with_partition_cols(create_extern_table.table_partition_cols.clone())
803-
.with_order_exprs(order_exprs)
804-
.with_if_not_exists(create_extern_table.if_not_exists)
805-
.with_or_replace(create_extern_table.or_replace)
806-
.with_temporary(create_extern_table.temporary)
807-
.with_definition(definition)
808-
.with_unbounded(create_extern_table.unbounded)
809-
.with_options(create_extern_table.options.clone())
810-
.with_constraints(constraints.into())
811-
.with_column_defaults(column_defaults)
812-
.build(),
793+
Box::new(
794+
CreateExternalTable::builder(
795+
from_table_reference(
796+
create_extern_table.name.as_ref(),
797+
"CreateExternalTable",
798+
)?,
799+
create_extern_table.location.clone(),
800+
create_extern_table.file_type.clone(),
801+
pb_schema.try_into()?,
802+
)
803+
.with_partition_cols(
804+
create_extern_table.table_partition_cols.clone(),
805+
)
806+
.with_order_exprs(order_exprs)
807+
.with_if_not_exists(create_extern_table.if_not_exists)
808+
.with_or_replace(create_extern_table.or_replace)
809+
.with_temporary(create_extern_table.temporary)
810+
.with_definition(definition)
811+
.with_unbounded(create_extern_table.unbounded)
812+
.with_options(create_extern_table.options.clone())
813+
.with_constraints(constraints.into())
814+
.with_column_defaults(column_defaults)
815+
.build(),
816+
),
813817
)))
814818
}
815819
LogicalPlanType::CreateView(create_view) => {
@@ -1774,8 +1778,8 @@ impl AsLogicalPlan for LogicalPlanNode {
17741778
},
17751779
)),
17761780
}),
1777-
LogicalPlan::Ddl(DdlStatement::CreateExternalTable(
1778-
CreateExternalTable {
1781+
LogicalPlan::Ddl(DdlStatement::CreateExternalTable(ce)) => {
1782+
let CreateExternalTable {
17791783
name,
17801784
location,
17811785
file_type,
@@ -1790,8 +1794,7 @@ impl AsLogicalPlan for LogicalPlanNode {
17901794
constraints,
17911795
column_defaults,
17921796
temporary,
1793-
},
1794-
)) => {
1797+
} = ce.as_ref();
17951798
let mut converted_order_exprs: Vec<SortExprNodeCollection> = vec![];
17961799
for order in order_exprs {
17971800
let temp = SortExprNodeCollection {

datafusion/sql/src/statement.rs

Lines changed: 16 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -1468,15 +1468,15 @@ impl<S: ContextProvider> SqlToRel<'_, S> {
14681468
function_body,
14691469
};
14701470

1471-
let statement = DdlStatement::CreateFunction(CreateFunction {
1471+
let statement = DdlStatement::CreateFunction(Box::new(CreateFunction {
14721472
or_replace,
14731473
temporary,
14741474
name,
14751475
return_type: return_type.map(|f| f.data_type().clone()),
14761476
args,
14771477
params,
14781478
schema: DFSchemaRef::new(DFSchema::empty()),
1479-
});
1479+
}));
14801480

14811481
Ok(LogicalPlan::Ddl(statement))
14821482
}
@@ -1855,18 +1855,20 @@ impl<S: ContextProvider> SqlToRel<'_, S> {
18551855
let constraints =
18561856
self.new_constraint_from_table_constraints(&all_constraints, &df_schema)?;
18571857
Ok(LogicalPlan::Ddl(DdlStatement::CreateExternalTable(
1858-
PlanCreateExternalTable::builder(name, location, file_type, df_schema)
1859-
.with_partition_cols(table_partition_cols)
1860-
.with_if_not_exists(if_not_exists)
1861-
.with_or_replace(or_replace)
1862-
.with_temporary(temporary)
1863-
.with_definition(definition)
1864-
.with_order_exprs(ordered_exprs)
1865-
.with_unbounded(unbounded)
1866-
.with_options(options_map)
1867-
.with_constraints(constraints)
1868-
.with_column_defaults(column_defaults)
1869-
.build(),
1858+
Box::new(
1859+
PlanCreateExternalTable::builder(name, location, file_type, df_schema)
1860+
.with_partition_cols(table_partition_cols)
1861+
.with_if_not_exists(if_not_exists)
1862+
.with_or_replace(or_replace)
1863+
.with_temporary(temporary)
1864+
.with_definition(definition)
1865+
.with_order_exprs(ordered_exprs)
1866+
.with_unbounded(unbounded)
1867+
.with_options(options_map)
1868+
.with_constraints(constraints)
1869+
.with_column_defaults(column_defaults)
1870+
.build(),
1871+
),
18701872
)))
18711873
}
18721874

docs/source/library-user-guide/upgrading/55.0.0.md

Lines changed: 106 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -97,6 +97,112 @@ as a supertrait:
9797
+ pub trait QueryPlanner: Any + Debug
9898
```
9999

100+
### `DdlStatement::CreateExternalTable` and `CreateFunction` are now boxed
101+
102+
The two largest variants of `datafusion_expr::DdlStatement` are now
103+
`Box`ed:
104+
105+
```rust,ignore
106+
// Before
107+
pub enum DdlStatement {
108+
CreateExternalTable(CreateExternalTable),
109+
// ...
110+
CreateFunction(CreateFunction),
111+
// ...
112+
}
113+
114+
// After
115+
pub enum DdlStatement {
116+
CreateExternalTable(Box<CreateExternalTable>),
117+
// ...
118+
CreateFunction(Box<CreateFunction>),
119+
// ...
120+
}
121+
```
122+
123+
`CreateExternalTable` is 312 bytes and `CreateFunction` is 288 bytes, so
124+
without boxing they forced the entire `LogicalPlan` enum to 320 bytes
125+
even on SELECT-only query paths that never instantiate them. Boxing
126+
shrinks `LogicalPlan` from 320 → 176 bytes (−45%), making every
127+
`mem::take` / `mem::swap` / `Arc<LogicalPlan>` store on the planning
128+
hot path move a smaller payload.
129+
130+
**Who is affected:**
131+
132+
- Users who construct `DdlStatement::CreateExternalTable(...)` or
133+
`DdlStatement::CreateFunction(...)` from an owned struct.
134+
- Users who pattern-match these variants and destructure the inner
135+
struct in the same pattern (e.g.
136+
`DdlStatement::CreateExternalTable(CreateExternalTable { name, .. })`).
137+
- Code that consumes the inner struct out of these variants (e.g. to
138+
pass `CreateExternalTable` by value to another function).
139+
140+
**Migration guide:**
141+
142+
When constructing the variants, wrap the inner struct in `Box::new`:
143+
144+
```rust,ignore
145+
// Before
146+
let stmt = DdlStatement::CreateFunction(CreateFunction { name, args, .. });
147+
148+
// After
149+
let stmt = DdlStatement::CreateFunction(Box::new(CreateFunction {
150+
name,
151+
args,
152+
..
153+
}));
154+
```
155+
156+
When pattern-matching, bind the boxed value and either access fields
157+
through it (Rust auto-derefs the `Box`) or destructure via `.as_ref()`:
158+
159+
```rust,ignore
160+
// Before
161+
match ddl {
162+
DdlStatement::CreateExternalTable(CreateExternalTable {
163+
name, location, ..
164+
}) => { /* use name, location */ }
165+
}
166+
167+
// After — access fields through the box
168+
match ddl {
169+
DdlStatement::CreateExternalTable(ce) => {
170+
let name = &ce.name;
171+
let location = &ce.location;
172+
/* ... */
173+
}
174+
}
175+
176+
// After — destructure the dereferenced struct
177+
match ddl {
178+
DdlStatement::CreateExternalTable(ce) => {
179+
let CreateExternalTable { name, location, .. } = ce.as_ref();
180+
/* ... */
181+
}
182+
}
183+
```
184+
185+
When you need an owned `CreateExternalTable` / `CreateFunction` out of
186+
the variant, dereference the box with `*`:
187+
188+
```rust,ignore
189+
// Before
190+
match plan {
191+
LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd)) => Ok(cmd),
192+
_ => { /* ... */ }
193+
}
194+
195+
// After
196+
match plan {
197+
LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd)) => Ok(*cmd),
198+
_ => { /* ... */ }
199+
}
200+
```
201+
202+
See [PR #22733](https://github.com/apache/datafusion/pull/22733) for
203+
details, including the per-variant size breakdown and benchmark
204+
results.
205+
100206
### Spark map functions now reject duplicate keys by default
101207

102208
The Spark-compatibility map-construction functions (`map_from_arrays`,

0 commit comments

Comments
 (0)