Skip to content

Commit bec6714

Browse files
authored
docs: Document the TableProvider evaluation order for filter, limit and projection (#21091)
## Which issue does this PR close? - Follow on to #21057 ## Rationale for this change As mentioned by @hareshkh on #21057 (comment): It is not clear from the existing documentation that the (logical) evaluation order for push down operations is 'filter -> limit -> projection' this is the actual order implemented by the built in providers, but it wasn't documented anywhere explicitly ## What changes are included in this PR? 1. Explicitly document the evaluation order on TableProvider 2. Some drive by cleanups of the documentation ## Are these changes tested? By CI ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
1 parent dfc8bb7 commit bec6714

File tree

1 file changed

+51
-29
lines changed

1 file changed

+51
-29
lines changed

datafusion/catalog/src/table.rs

Lines changed: 51 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -84,10 +84,10 @@ pub trait TableProvider: Debug + Sync + Send {
8484
None
8585
}
8686

87-
/// Create an [`ExecutionPlan`] for scanning the table with optionally
88-
/// specified `projection`, `filter` and `limit`, described below.
87+
/// Create an [`ExecutionPlan`] for scanning the table with optional
88+
/// `projection`, `filter`, and `limit`, described below.
8989
///
90-
/// The `ExecutionPlan` is responsible scanning the datasource's
90+
/// The returned `ExecutionPlan` is responsible for scanning the datasource's
9191
/// partitions in a streaming, parallelized fashion.
9292
///
9393
/// # Projection
@@ -96,33 +96,30 @@ pub trait TableProvider: Debug + Sync + Send {
9696
/// specified. The projection is a set of indexes of the fields in
9797
/// [`Self::schema`].
9898
///
99-
/// DataFusion provides the projection to scan only the columns actually
100-
/// used in the query to improve performance, an optimization called
101-
/// "Projection Pushdown". Some datasources, such as Parquet, can use this
102-
/// information to go significantly faster when only a subset of columns is
103-
/// required.
99+
/// DataFusion provides the projection so the scan reads only the columns
100+
/// actually used in the query, an optimization called "Projection
101+
/// Pushdown". Some datasources, such as Parquet, can use this information
102+
/// to go significantly faster when only a subset of columns is required.
104103
///
105104
/// # Filters
106105
///
107106
/// A list of boolean filter [`Expr`]s to evaluate *during* the scan, in the
108107
/// manner specified by [`Self::supports_filters_pushdown`]. Only rows for
109-
/// which *all* of the `Expr`s evaluate to `true` must be returned (aka the
110-
/// expressions are `AND`ed together).
108+
/// which *all* of the `Expr`s evaluate to `true` must be returned (that is,
109+
/// the expressions are `AND`ed together).
111110
///
112-
/// To enable filter pushdown you must override
113-
/// [`Self::supports_filters_pushdown`] as the default implementation does
114-
/// not and `filters` will be empty.
111+
/// To enable filter pushdown, override
112+
/// [`Self::supports_filters_pushdown`]. The default implementation does not
113+
/// push down filters, and `filters` will be empty.
115114
///
116-
/// DataFusion pushes filtering into the scans whenever possible
117-
/// ("Filter Pushdown"), and depending on the format and the
118-
/// implementation of the format, evaluating the predicate during the scan
119-
/// can increase performance significantly.
115+
/// DataFusion pushes filters into scans whenever possible ("Filter
116+
/// Pushdown"). Depending on the data format and implementation, evaluating
117+
/// predicates during the scan can significantly improve performance.
120118
///
121119
/// ## Note: Some columns may appear *only* in Filters
122120
///
123-
/// In certain cases, a query may only use a certain column in a Filter that
124-
/// has been completely pushed down to the scan. In this case, the
125-
/// projection will not contain all the columns found in the filter
121+
/// In some cases, a query may use a column only in a filter and the
122+
/// projection will not contain all columns referenced by the filter
126123
/// expressions.
127124
///
128125
/// For example, given the query `SELECT t.a FROM t WHERE t.b > 5`,
@@ -154,15 +151,40 @@ pub trait TableProvider: Debug + Sync + Send {
154151
///
155152
/// # Limit
156153
///
157-
/// If `limit` is specified, must only produce *at least* this many rows,
158-
/// (though it may return more). Like Projection Pushdown and Filter
159-
/// Pushdown, DataFusion pushes `LIMIT`s as far down in the plan as
160-
/// possible, called "Limit Pushdown" as some sources can use this
161-
/// information to improve their performance. Note that if there are any
162-
/// Inexact filters pushed down, the LIMIT cannot be pushed down. This is
163-
/// because inexact filters do not guarantee that every filtered row is
164-
/// removed, so applying the limit could lead to too few rows being available
165-
/// to return as a final result.
154+
/// If `limit` is specified, the scan must produce *at least* this many
155+
/// rows, though it may return more. Like Projection Pushdown and Filter
156+
/// Pushdown, DataFusion pushes `LIMIT`s as far down in the plan as
157+
/// possible. This is called "Limit Pushdown", and some sources can use the
158+
/// information to improve performance.
159+
///
160+
/// Note: If any pushed-down filters are `Inexact`, the `LIMIT` cannot be
161+
/// pushed down. Inexact filters do not guarantee that every filtered row is
162+
/// removed, so applying the limit could leave too few rows to return in the
163+
/// final result.
164+
///
165+
/// # Evaluation Order
166+
///
167+
/// The logical evaluation order is `filters`, then `limit`, then
168+
/// `projection`.
169+
///
170+
/// Note that `limit` applies to the filtered result, not to the unfiltered
171+
/// input, and `projection` affects only which columns are returned, not
172+
/// which rows qualify.
173+
///
174+
/// For example, if a scan receives:
175+
///
176+
/// - `projection = [a]`
177+
/// - `filters = [b > 5]`
178+
/// - `limit = Some(3)`
179+
///
180+
/// It must logically produce results equivalent to:
181+
///
182+
/// ```text
183+
/// PROJECTION a (LIMIT 3 (SCAN WHERE b > 5))
184+
/// ```
185+
///
186+
/// As noted above, columns referenced only by pushed-down filters may be
187+
/// absent from `projection`.
166188
async fn scan(
167189
&self,
168190
state: &dyn Session,

0 commit comments

Comments
 (0)