Commit c253bfb
feat: Add pluggable StatisticsRegistry for operator-level statistics propagation (#21483)
## Which issue does this PR close?
- Part of #21443 (Pluggable operator-level statistics propagation)
- Part of #8227 (statistics improvements epic)
## Rationale for this change
DataFusion's built-in statistics propagation has no extension point:
downstream projects cannot inject external catalog stats, override
built-in estimation, or plug in custom strategies without forking.
This PR introduces `StatisticsRegistry`, a pluggable
chain-of-responsibility for operator-level statistics following the same
pattern as `RelationPlanner` for SQL parsing and `ExpressionAnalyzer`
(#21120) for expression-level stats. See #21443 for full motivation and
design context.
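The chain-of-responsibility idea can be sketched in plain Rust without any DataFusion types. Everything in this sketch (`ToyOperator`, `ToyResult`, `ToyProvider`, `CatalogProvider`, `PassthroughProvider`, `ToyRegistry`, `demo_registry`) is a hypothetical name chosen for illustration and is not the API added by this PR. It only shows the general shape: providers are asked in order, and the first one that does not delegate wins.

```rust
/// Hypothetical stand-in for a physical-plan operator (not a DataFusion type).
#[derive(Debug)]
enum ToyOperator {
    Scan { table: String },
    Filter,
}

/// What one provider can answer: an estimate, or "ask the next provider".
#[derive(Debug, PartialEq)]
enum ToyResult {
    Computed(u64),
    Delegate,
}

/// Hypothetical provider interface: estimate output rows for one operator.
trait ToyProvider {
    fn estimate(&self, op: &ToyOperator, child_rows: &[u64]) -> ToyResult;
}

/// Injects row counts from an external catalog for the tables it knows.
struct CatalogProvider {
    known_tables: Vec<(String, u64)>,
}

impl ToyProvider for CatalogProvider {
    fn estimate(&self, op: &ToyOperator, _child_rows: &[u64]) -> ToyResult {
        if let ToyOperator::Scan { table } = op {
            if let Some((_, rows)) = self.known_tables.iter().find(|(name, _)| name == table) {
                return ToyResult::Computed(*rows);
            }
        }
        ToyResult::Delegate
    }
}

/// Fallback: forwards the first child's row count, delegates for leaves.
struct PassthroughProvider;

impl ToyProvider for PassthroughProvider {
    fn estimate(&self, _op: &ToyOperator, child_rows: &[u64]) -> ToyResult {
        match child_rows.first() {
            Some(rows) => ToyResult::Computed(*rows),
            None => ToyResult::Delegate,
        }
    }
}

/// Ordered chain of providers; earlier entries take priority.
struct ToyRegistry {
    providers: Vec<Box<dyn ToyProvider>>,
}

impl ToyRegistry {
    /// Returns the first computed estimate, or `None` if every provider delegates.
    fn estimate(&self, op: &ToyOperator, child_rows: &[u64]) -> Option<u64> {
        self.providers.iter().find_map(|p| match p.estimate(op, child_rows) {
            ToyResult::Computed(rows) => Some(rows),
            ToyResult::Delegate => None,
        })
    }
}

/// Builds a two-provider chain: catalog first, passthrough second.
fn demo_registry() -> ToyRegistry {
    ToyRegistry {
        providers: vec![
            Box::new(CatalogProvider {
                known_tables: vec![("orders".to_string(), 1_000)],
            }),
            Box::new(PassthroughProvider),
        ],
    }
}
```

With `demo_registry()`, a scan of `orders` is answered by the catalog provider, a `Filter` over a 250-row child falls through to the passthrough provider, and a scan of an unknown table yields `None` because every provider delegates.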
## What changes are included in this PR?
1. Framework (`operator_statistics/mod.rs`): `StatisticsProvider` trait,
`StatisticsRegistry` (chain-of-responsibility), `ExtendedStatistics`
(Statistics + type-erased extension map), `DefaultStatisticsProvider`.
`PhysicalOptimizerContext` trait with `optimize_with_context` dispatch.
`SessionState` integration.
2. Built-in providers for Filter, Projection, Passthrough
(sort/repartition/etc), Aggregate, Join
(hash/sort-merge/nested-loop/cross), Limit, and Union. NDV utilities:
`num_distinct_vals`, `ndv_after_selectivity`.
3. `ClosureStatisticsProvider`: closure-based provider for test
injection and cardinality feedback.
4. JoinSelection integration: `use_statistics_registry` config flag
(default false), registry-aware `optimize_with_context`, SLT test
demonstrating plan difference on skewed data.
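The description of `ExtendedStatistics` as "Statistics + type-erased extension map" suggests a familiar Rust pattern: a map keyed by `TypeId` holding boxed `Any` values. The sketch below shows that pattern using only the standard library. The names `BaseStats`, `StatsWithExtensions`, and `Histogram` are hypothetical, and the real type's fields and methods may differ.

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;

/// Hypothetical core statistics; the real `Statistics` type carries more.
#[derive(Debug, Clone, PartialEq)]
struct BaseStats {
    num_rows: Option<u64>,
}

/// Core statistics plus extra values keyed by their Rust type.
struct StatsWithExtensions {
    base: BaseStats,
    extensions: HashMap<TypeId, Box<dyn Any + Send + Sync>>,
}

impl StatsWithExtensions {
    fn new(base: BaseStats) -> Self {
        Self {
            base,
            extensions: HashMap::new(),
        }
    }

    /// Stores one value per type; inserting the same type again replaces it.
    fn insert<T: Any + Send + Sync>(&mut self, value: T) {
        self.extensions.insert(TypeId::of::<T>(), Box::new(value));
    }

    /// Looks a value up by type; `None` if nothing of that type was attached.
    fn get<T: Any + Send + Sync>(&self) -> Option<&T> {
        self.extensions
            .get(&TypeId::of::<T>())
            .and_then(|boxed| boxed.downcast_ref::<T>())
    }
}

/// Example extension a custom provider might attach (hypothetical).
#[derive(Debug, PartialEq)]
struct Histogram {
    bucket_counts: Vec<u64>,
}
```

A provider could call `insert(Histogram { .. })` and a later consumer could call `get::<Histogram>()`, without the container knowing about `Histogram` at compile time. Keying by type keeps the core struct free of provider-specific fields.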
## Are these changes tested?
- 39 unit tests covering all providers, NDV utilities, chain priority,
and edge cases (Inexact precision, Absent propagation, Partial aggregate
delegation, GROUPING SETS delegation, join-type bounds, multi-key NDV,
exact Cartesian product, CrossJoin, GlobalLimit skip+fetch)
- 1 SLT test (`statistics_registry.slt`): three-table join on skewed
data (8:1:1 customer_id distribution) where the built-in NDV formula
estimates 33 rows (wrong; actual=66) and the registry conservatively
estimates 100, producing the correct build-side swap
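The three figures in the SLT description (33, 66, 100) can be reproduced with a small worked example. The table sizes are not stated here, so the sketch assumes two 10-row join inputs, each with three distinct `customer_id` values split 8:1:1. It also assumes the built-in estimate has the textbook form `|L| * |R| / max(ndv_L, ndv_R)` and that the conservative estimate is the Cartesian-product bound. These assumptions match the reported numbers but are not confirmed by the text above, and the function names are illustrative only.

```rust
/// Textbook equi-join estimate under a uniform-distribution assumption:
/// |L| * |R| / max(ndv_L, ndv_R), using integer division.
fn uniform_ndv_estimate(left_rows: u64, right_rows: u64, left_ndv: u64, right_ndv: u64) -> u64 {
    // `.max(1)` guards against division by zero when both NDVs are 0.
    left_rows * right_rows / left_ndv.max(right_ndv).max(1)
}

/// Exact equi-join output size from per-key frequencies
/// (position i in both slices refers to the same key).
fn actual_join_rows(left_freq: &[u64], right_freq: &[u64]) -> u64 {
    left_freq.iter().zip(right_freq).map(|(l, r)| l * r).sum()
}

/// Cartesian-product upper bound on the join output.
fn cartesian_bound(left_rows: u64, right_rows: u64) -> u64 {
    left_rows * right_rows
}
```

Under these assumptions, `uniform_ndv_estimate(10, 10, 3, 3)` gives 33, `actual_join_rows(&[8, 1, 1], &[8, 1, 1])` gives 8*8 + 1*1 + 1*1 = 66, and `cartesian_bound(10, 10)` gives 100. The uniform formula underestimates because the 8-row key dominates the output.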
## Are there any user-facing changes?
New public API (purely additive, non-breaking):
- `StatisticsProvider` trait and `StatisticsRegistry` in
`datafusion-physical-plan`
- `ExtendedStatistics`, `StatisticsResult` types; built-in provider
structs; `num_distinct_vals`, `ndv_after_selectivity` utilities
- `PhysicalOptimizerContext` trait and `ConfigOnlyContext` in
`datafusion-physical-optimizer`
- `SessionState::statistics_registry()`,
`SessionStateBuilder::with_statistics_registry()`
- Config: `datafusion.optimizer.use_statistics_registry` (default false)
Default behavior is unchanged. The registry is only consulted when the
flag is explicitly enabled.
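The opt-in behaviour can be pictured as a simple gate in front of the registry. This is a minimal sketch, not the PR's code: `ToyContext` and `rows_for_join_selection` are hypothetical names, and falling back to the built-in estimate when the registry has no answer is an assumption made for the example.

```rust
/// Hypothetical optimizer context: the config flag plus whatever the
/// registry would return for the operator being examined.
#[derive(Default)]
struct ToyContext {
    /// Mirrors `datafusion.optimizer.use_statistics_registry` (default false).
    use_statistics_registry: bool,
    /// Estimate the registry would give, if any (assumed shape).
    registry_estimate: Option<u64>,
}

/// Picks the row estimate a rule would use for its decision.
fn rows_for_join_selection(ctx: &ToyContext, builtin_estimate: u64) -> u64 {
    if ctx.use_statistics_registry {
        // Assumption: fall back to the built-in value when the registry is silent.
        ctx.registry_estimate.unwrap_or(builtin_estimate)
    } else {
        // Flag off: the registry is never consulted.
        builtin_estimate
    }
}
```

With `ToyContext::default()` the function always returns the built-in estimate, even when a registry estimate is present, which mirrors the statement that default behavior is unchanged.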
Known limitations:
- Column-level stats (NDV, min/max) at Join/Aggregate/Union/Limit
boundaries are not improved: these operators call
`partition_statistics(None)` internally, re-fetching raw child stats and
discarding registry enrichment. 4 TODO comments mark the affected call
sites; #20184 would close this gap.
- No `ExpressionAnalyzer` integration yet (#21122).
---
Disclaimer: I used AI to assist in the code generation; I have manually
reviewed the output and it matches my intention and understanding.
1 parent d68373e commit c253bfb
File tree
12 files changed, +2680 -44 lines changed
- datafusion
- common/src
- core/src
- execution
- physical-optimizer/src
- physical-plan/src
- operator_statistics
- sqllogictest/test_files
- docs/source/user-guide