
Commit ba75202

adriangb and claude committed
fix(ci): sync information_schema.slt + drop redundant rustdoc link
- `information_schema.slt`: bumps the baked-in default and doc string for `optimizer.hash_join_inlist_pushdown_max_distinct_values` to match the 150 → 20 default change (sqllogictest, extended_tests, sqlite suite, verify-benchmark-results all hit this slt).
- `partitioned_hash_eval.rs`: drop the redundant explicit-target on a `[BooleanArray]` doc link. Adding `BooleanArray` to imports for `MultiMapLookupExpr` made the existing `[`BooleanArray`](arrow::array::BooleanArray)` link redundant under `-D rustdoc::redundant_explicit_links`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent f64718f commit ba75202

2 files changed

Lines changed: 3 additions & 3 deletions


datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs

Lines changed: 1 addition & 1 deletion
@@ -203,7 +203,7 @@ impl PhysicalExpr for HashExpr {

 /// Physical expression that checks join keys in a [`Map`] (hash table or array map).
 ///
-/// Returns a [`BooleanArray`](arrow::array::BooleanArray) indicating if join keys (from `on_columns`) exist in the map.
+/// Returns a [`BooleanArray`] indicating if join keys (from `on_columns`) exist in the map.
 // TODO: rename to MapLookupExpr
 pub struct HashTableLookupExpr {
     /// Columns in the ON clause used to compute the join key for lookups
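
Note on the hunk above: the lint fires when an intra-doc link spells out a target that rustdoc would already resolve from the label alone. The following is a minimal sketch of the same situation using only the standard library; `word_counts` is a hypothetical function for illustration and is not part of DataFusion.

```rust
use std::collections::HashMap;

// `HashMap` is imported above, so the short intra-doc link in the doc comment
// below resolves by name. Spelling out the target as well, as in
// [`HashMap`](std::collections::HashMap), would point at the same item and is
// what `rustdoc::redundant_explicit_links` reports.

/// Counts whitespace-separated words and returns a [`HashMap`] keyed by word.
pub fn word_counts(text: &str) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        // Insert the word with a count of 0 if unseen, then increment.
        *counts.entry(word).or_insert(0) += 1;
    }
    counts
}
```

Building the docs with `RUSTDOCFLAGS="-D rustdoc::redundant_explicit_links"` turns the lint into a hard error, which appears to be how the CI job in this commit surfaced it.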

datafusion/sqllogictest/test_files/information_schema.slt

Lines changed: 2 additions & 2 deletions
@@ -312,7 +312,7 @@ datafusion.optimizer.enable_window_limits true
 datafusion.optimizer.enable_window_topn false
 datafusion.optimizer.expand_views_at_output false
 datafusion.optimizer.filter_null_join_keys false
-datafusion.optimizer.hash_join_inlist_pushdown_max_distinct_values 150
+datafusion.optimizer.hash_join_inlist_pushdown_max_distinct_values 20
 datafusion.optimizer.hash_join_inlist_pushdown_max_size 131072
 datafusion.optimizer.hash_join_single_partition_threshold 1048576
 datafusion.optimizer.hash_join_single_partition_threshold_rows 131072
@@ -459,7 +459,7 @@ datafusion.optimizer.enable_window_limits true When set to true, the optimizer w
 datafusion.optimizer.enable_window_topn false When set to true, the optimizer will replace Filter(rn<=K) → Window(ROW_NUMBER) → Sort patterns with a PartitionedTopKExec that maintains per-partition heaps, avoiding a full sort of the input. When the window partition key has low cardinality, enabling this optimization can improve performance. However, for high cardinality keys, it may cause regressions in both memory usage and runtime.
 datafusion.optimizer.expand_views_at_output false When set to true, if the returned type is a view type then the output will be coerced to a non-view. Coerces `Utf8View` to `LargeUtf8`, and `BinaryView` to `LargeBinary`.
 datafusion.optimizer.filter_null_join_keys false When set to true, the optimizer will insert filters before a join between a nullable and non-nullable column to filter out nulls on the nullable side. This filter can add additional overhead when the file format does not fully support predicate push down.
-datafusion.optimizer.hash_join_inlist_pushdown_max_distinct_values 150 Maximum number of distinct values (rows) in the build side of a hash join to be pushed down as an InList expression for dynamic filtering. Build sides with more rows than this will use hash table lookups instead. Set to 0 to always use hash table lookups. This provides an additional limit beyond `hash_join_inlist_pushdown_max_size` to prevent very large IN lists that might not provide much benefit over hash table lookups. This uses the deduplicated row count once the build side has been evaluated. The default is 150 values per partition. This is inspired by Trino's `max-filter-keys-per-column` setting. See: <https://trino.io/docs/current/admin/dynamic-filtering.html#dynamic-filter-collection-thresholds>
+datafusion.optimizer.hash_join_inlist_pushdown_max_distinct_values 20 Maximum number of distinct values (rows) in the build side of a hash join to be pushed down as an InList expression for dynamic filtering. Build sides with more rows than this will use hash table lookups instead. Set to 0 to always use hash table lookups. This provides an additional limit beyond `hash_join_inlist_pushdown_max_size` to prevent very large IN lists that might not provide much benefit over hash table lookups. This uses the deduplicated row count once the build side has been evaluated. In `Partitioned` hash-join mode the same threshold also gates the cross-partition merged InList: per-partition InList arrays are concatenated and deduplicated, and the merged `IN (SET)` only fires when the union has at most this many distinct values. The default is 20 distinct values, tuned so the resulting `IN (SET)` stays small enough to participate in parquet stats / bloom-filter pruning at the scan side.
 datafusion.optimizer.hash_join_inlist_pushdown_max_size 131072 Maximum size in bytes for the build side of a hash join to be pushed down as an InList expression for dynamic filtering. Build sides larger than this will use hash table lookups instead. Set to 0 to always use hash table lookups. InList pushdown can be more efficient for small build sides because it can result in better statistics pruning as well as use any bloom filters present on the scan side. InList expressions are also more transparent and easier to serialize over the network in distributed uses of DataFusion. On the other hand InList pushdown requires making a copy of the data and thus adds some overhead to the build side and uses more memory. This setting is per-partition, so we may end up using `hash_join_inlist_pushdown_max_size` * `target_partitions` memory. The default is 128kB per partition. This should allow point lookup joins (e.g. joining on a unique primary key) to use InList pushdown in most cases but avoids excessive memory usage or overhead for larger joins.
 datafusion.optimizer.hash_join_single_partition_threshold 1048576 The maximum estimated size in bytes for one input side of a HashJoin will be collected into a single partition
 datafusion.optimizer.hash_join_single_partition_threshold_rows 131072 The maximum estimated size in rows for one input side of a HashJoin will be collected into a single partition
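
Note on the updated doc string: the gating rule it describes (concatenate per-partition values, deduplicate, push down an `IN (SET)` only if the union has at most the configured number of distinct values, with 0 meaning "never") can be sketched roughly as below. This is an illustrative model only, not the DataFusion implementation: `PushdownChoice` and `choose_pushdown` are hypothetical names, plain `i64` keys stand in for Arrow arrays, and the early exit is an assumption added for the sketch.

```rust
use std::collections::HashSet;

/// Hypothetical outcome of the pushdown decision; not a DataFusion type.
#[derive(Debug, PartialEq)]
enum PushdownChoice {
    /// Push the deduplicated build-side values down as an `IN (SET)` filter.
    InList(Vec<i64>),
    /// Fall back to hash table lookups.
    HashTableLookup,
}

/// Merges per-partition build-side values, deduplicates them, and compares
/// the distinct count against `max_distinct_values`.
fn choose_pushdown(partitions: &[Vec<i64>], max_distinct_values: usize) -> PushdownChoice {
    // A limit of 0 means "always use hash table lookups".
    if max_distinct_values == 0 {
        return PushdownChoice::HashTableLookup;
    }
    let mut seen = HashSet::new();
    let mut merged = Vec::new();
    for partition in partitions {
        for &value in partition {
            // `insert` returns true only the first time a value is seen.
            if seen.insert(value) {
                merged.push(value);
                // Stop early once the union exceeds the threshold
                // (an assumption of this sketch, not a documented behavior).
                if merged.len() > max_distinct_values {
                    return PushdownChoice::HashTableLookup;
                }
            }
        }
    }
    PushdownChoice::InList(merged)
}
```

For example, with partitions `[1, 2]` and `[2, 3]` and the new default of 20, the sketch yields an in-list of `1, 2, 3`; a single partition holding 21 distinct keys falls back to hash table lookups.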
