RFC: Scalable PPL command-combination + pushdown verification testing
Status: Draft for discussion · Author: @RyanL1997 · Date: 2026-06-25
Full design: docs/dev/ppl-combination-pushdown-test-framework.md.
Proof-of-concept branches (on this fork): poc/ppl-combination-pushdown-tests (framework), fix/sort-pushdown-text-keyword-guard (a real bug the framework surfaced).
1. Problem
PPL pushdown tests are overwhelmingly single-command in intent. Explain coverage is ~338
hand-written golden files, almost all one command. The bugs this misses are adjacency bugs — an
operator that pushes down fine alone breaks when a neighbour is present. The two motivating
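fixes are exactly this class:
opensearch-project/sql#5488 (where … | dedup …): a filter-merge defeated PPLSimplifyDedupRule; dedup silently fell back to an in-memory window. Rows stayed correct but pushdown was lost (a perf cliff).
range AND IN/Sarg on a timestamp: a Sarg fold re-typed the literal; the emitted DSL lost ISO-8601 normalization, giving wrong rows / HTTP 500.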
These are two distinct failure modes (lost pushdown vs wrong DSL→wrong rows), and single-command
goldens structurally cannot catch either. There is no systematic combination coverage, no mechanism
to ensure a new command gets combination/pushdown coverage, and no record/regenerate mode — every
golden is hand-typed.
2. Proposal
A framework that exercises reasonable multi-command pipelines and verifies each command pushes
down as expected, structured so a new/changed command is caught automatically. Four pieces:
Shape oracle: parse the physical explain into the set of pushed PushDownType tokens and verify it bidirectionally. A missing token is a pushdown loss (the class fixed by opensearch-project/sql#5488, "[BugFix] Restore dedup pushdown when combined with WHERE clause (#5482)"); an extra token is an undeclared gain. Expectations are computed from a command→token map and the field types (field-type-aware), never recorded, so a behavior change always turns the suite red.
Differential oracle: run a pipeline with pushdown on vs off and assert identical results (schema-checked, order-insensitive multiset, per-cell ULP tolerance), with a documented exclusion list for legitimate divergences. This is the academic NoREC oracle, applied intra-engine.
Reasonable-combination generator: a field-availability-aware validity model that emits only valid pipelines (never referencing a dropped field; no redundant adjacency), not a cartesian product.
Coverage gate: reflect the active grammar's OpenSearchPPLParser.ruleNames; a new/renamed command fails the build until it is declared, forcing combination + pushdown coverage.
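The bidirectional shape check can be sketched as follows. This is a minimal illustration, not the POC's PushdownShapeOracle: the class name, the method names, and the assumed explain fragment (tokens written as `TOKEN->detail` inside a `PushDownContext=[...]` list) are all assumptions made for the example.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a token-based shape oracle; not the POC implementation. */
final class ShapeOracleSketch {
  // Assumption: each pushed operation appears in the explain text as "TOKEN->detail".
  private static final Pattern TOKEN = Pattern.compile("\\b([A-Z_]+)->");

  /** Collects the distinct upper-case tokens that precede "->" in the explain text. */
  static Set<String> pushedTokens(String explain) {
    Set<String> tokens = new TreeSet<>();
    Matcher m = TOKEN.matcher(explain);
    while (m.find()) {
      tokens.add(m.group(1));
    }
    return tokens;
  }

  /** Returns "" when the shapes match, otherwise a message naming both directions. */
  static String diff(Set<String> expected, Set<String> actual) {
    Set<String> missing = new TreeSet<>(expected);
    missing.removeAll(actual); // expected but not pushed: a pushdown loss
    Set<String> extra = new TreeSet<>(actual);
    extra.removeAll(expected); // pushed but not declared: an undeclared gain
    if (missing.isEmpty() && extra.isEmpty()) {
      return "";
    }
    return "missing=" + missing + " extra=" + extra;
  }
}
```

Because the check compares token sets rather than whole explain strings, plan-digest or ordinal churn would not fail it, while a lost or gained token would.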
Why this design (prior art)
Trino BaseConnectorTest (isFullyPushedDown() / isNotFullyPushedDown(NodeClass) / skipResultsCorrectnessCheckForPushdown()) is our shape oracle, including the shape/result decoupling.
SQLancer NoREC (optimizer on-vs-off result differential, 51 optimization bugs found) is our differential oracle — and being intra-engine it sidesteps the cross-dialect objection PQS
raises against differential testing.
Elastic ES|QL (our closest sibling) validates the corpus + random pipeline generator; we borrow
its cheap error-classification oracle (a valid pipeline must not throw). ES|QL verifies pushdown
shape only via separate optimizer-rule unit tests — our integrated token oracle on the live explain
unifies all three.
3. What's already built (proof of concept)
A working POC on poc/ppl-combination-pushdown-tests, production-shaped and green:
Framework classes: PushdownShapeOracle, DifferentialComparator, CombinationModel (field-type eligibility), PipelineGenerator, QueryResults, PushdownDifferentialTestCase.
Tests (all passing): CommandCoverageGateTest (ppl); PushdownShapeOracleIT (parser validated on 367 real goldens + the command→token map on 90 benchmark queries); DifferentialComparatorIT; CombinationModelIT; CalcitePplCombinationShapeIT (shape oracle on a live cluster); CalcitePplDifferentialIT (differential on a live cluster, incl. AVG ULP); PipelineGeneratorIT; and CalcitePplGeneratedDifferentialIT, which runs the generator's 100 pipelines (20 two-command + 80 three-command), each verified pushdown-invariant on a live cluster with zero per-pipeline test code.
Adding a command template or an index field expands coverage automatically.
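To make the "reasonable combinations, not a cartesian product" idea concrete, here is a small sketch of a field-availability-aware generator for two-command pipelines. The Template class, its fields, and the rule that "redundant adjacency" means the same template twice in a row are assumptions for illustration; the POC's PipelineGenerator and CombinationModel may model this differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a validity-aware pipeline generator; not the POC implementation. */
final class PipelineGeneratorSketch {
  /** A command template: its PPL text, the fields it reads, and the fields visible afterwards. */
  static final class Template {
    final String name;
    final String ppl;
    final Set<String> reads;
    final Set<String> keeps; // null means the command leaves the visible fields unchanged

    Template(String name, String ppl, Set<String> reads, Set<String> keeps) {
      this.name = name;
      this.ppl = ppl;
      this.reads = reads;
      this.keeps = keeps;
    }
  }

  /** Emits only pipelines whose second command reads fields the first one left visible. */
  static List<String> twoCommandPipelines(
      String index, Set<String> indexFields, List<Template> templates) {
    List<String> pipelines = new ArrayList<>();
    for (Template first : templates) {
      if (!indexFields.containsAll(first.reads)) {
        continue; // the index does not have the fields this template needs
      }
      Set<String> available = first.keeps == null ? indexFields : first.keeps;
      for (Template second : templates) {
        if (second.name.equals(first.name)) {
          continue; // assumed notion of redundant adjacency
        }
        if (!available.containsAll(second.reads)) {
          continue; // would reference a field dropped by the first command
        }
        pipelines.add("source=" + index + " | " + first.ppl + " | " + second.ppl);
      }
    }
    return pipelines;
  }
}
```

With three templates (a where on age, a fields projection keeping only name, and a sort on age) the full cross product has nine pairs, but this sketch emits four: the projection drops age, so nothing that reads age may follow it.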
4. The framework already found (and we fixed) real bugs
Pointing the framework's lens at current main surfaced a family of latent pushdown bugs sharing one
root cause: the keyword-subfield guard that the dedup path already has was missing from the sort,
sort-expr, and aggregate-terms paths. Fixed on fix/sort-pushdown-text-keyword-guard:
sort <text-expr-key> sorted on the raw analyzed field, producing wrong order or an HTTP error (the genuine bug); it now sorts on .keyword or declines cleanly.
sort/stats by on a text field without a .keyword subfield relied on a swallowed exception; it now declines explicitly.
Verified with no regressions: CalciteSortCommandIT (30), CalcitePPLAggregationIT (100), CalciteExplainIT (258, 0 failures, no golden moved). This is the ROI: the framework catches this
class automatically and continuously.
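The guard described above can be illustrated with a short sketch. The flat name-to-type map and the method name are hypothetical stand-ins for the real field-type model; the only behavior taken from the text is "sort on .keyword when it exists, otherwise decline explicitly".

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of a keyword-subfield guard; not the code on the fix branch. */
final class KeywordGuardSketch {
  /**
   * Returns the field name that is safe to push a sort onto, or empty to decline pushdown.
   * Assumption: the mapping is a flat map from field name to type, and a keyword subfield
   * shows up as "<field>.keyword" with type "keyword".
   */
  static Optional<String> pushableSortField(String field, Map<String, String> mapping) {
    String type = mapping.get(field);
    if (type == null) {
      return Optional.empty(); // unknown field: decline rather than guess
    }
    if (!"text".equals(type)) {
      return Optional.of(field); // non-text fields can be sorted on directly
    }
    String subfield = field + ".keyword";
    return "keyword".equals(mapping.get(subfield))
        ? Optional.of(subfield) // sort on the keyword subfield, not the analyzed text
        : Optional.empty(); // text without .keyword: decline explicitly
  }
}
```

Returning an empty Optional makes the decline an explicit result the caller must handle, in contrast to relying on an exception being swallowed somewhere downstream.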
5. Project tenets (what every review asks)
Adequacy is measured (branch + interlock + field-type coverage; mutation kill-rate;
historical-bug replay), not a query count.
Expectations are declared, never recorded — or detection is destroyed.
Detection is bidirectional (a pushdown gain is as loud as a loss).
You maintain intent, not artifacts — one declarative line, not hundreds of goldens.
Combinations must be reasonable (field-availability + position), never cartesian.
Two oracles, two failure modes (shape for loss/gain; differential for wrong rows).
Robust to churn, sensitive to behavior (token presence, not digests/ordinals).
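For the differential side of the two-oracle split, the comparison of pushdown-on and pushdown-off results (order-insensitive multiset, per-cell ULP tolerance) might look like the sketch below. It is an assumption-laden illustration rather than the POC's DifferentialComparator: rows are plain lists, the schema check is omitted, and the greedy matching is only adequate when the tolerance is small relative to the gaps between distinct values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical sketch of an order-insensitive, ULP-tolerant result comparison. */
final class DifferentialComparatorSketch {
  /** Doubles match within maxUlps units in the last place; everything else uses equals. */
  static boolean cellEquals(Object a, Object b, int maxUlps) {
    if (a instanceof Double && b instanceof Double) {
      double x = (Double) a;
      double y = (Double) b;
      if (Double.compare(x, y) == 0) {
        return true; // identical values, including NaN with NaN
      }
      if (!Double.isFinite(x) || !Double.isFinite(y)) {
        return false; // never treat an infinity or NaN as "close" to anything else
      }
      double scale = Math.ulp(Math.max(Math.abs(x), Math.abs(y)));
      return Math.abs(x - y) <= maxUlps * scale;
    }
    return Objects.equals(a, b);
  }

  static boolean rowEquals(List<Object> a, List<Object> b, int maxUlps) {
    if (a.size() != b.size()) {
      return false;
    }
    for (int i = 0; i < a.size(); i++) {
      if (!cellEquals(a.get(i), b.get(i), maxUlps)) {
        return false;
      }
    }
    return true;
  }

  /** Greedy multiset match: each pushed row must consume one distinct unpushed row. */
  static boolean sameMultiset(
      List<List<Object>> pushed, List<List<Object>> unpushed, int maxUlps) {
    if (pushed.size() != unpushed.size()) {
      return false;
    }
    List<List<Object>> remaining = new ArrayList<>(unpushed);
    for (List<Object> row : pushed) {
      int hit = -1;
      for (int i = 0; i < remaining.size(); i++) {
        if (rowEquals(row, remaining.get(i), maxUlps)) {
          hit = i;
          break;
        }
      }
      if (hit < 0) {
        return false;
      }
      remaining.remove(hit); // removes by index, so duplicates are counted correctly
    }
    return true;
  }
}
```

A tolerance of a few ULPs absorbs differences such as an average computed in a different summation order, while still failing on genuinely different rows.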
6. Execution model
Tiered, not one CI job: cheap forcing-functions gate every PR (ppl coverage gate; cluster-free
oracle/map checks; a small live smoke subset); the full generated sweep + mutation/coverage run
nightly. The generator produces cases at runtime from the manifest, so adding coverage adds no
files.
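The per-PR coverage gate is cheap because it needs no cluster: it only compares the grammar's rule names against a declared set. The sketch below assumes that command rules can be recognized by a "Command" suffix and that declarations live in a plain set; both are illustrative assumptions, and the real gate reads OpenSearchPPLParser.ruleNames rather than a hand-built array.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of a grammar-reflecting coverage gate; not the POC implementation. */
final class CoverageGateSketch {
  /** Command rules present in the grammar but absent from the declared set. */
  static Set<String> undeclaredCommands(String[] ruleNames, Set<String> declared) {
    Set<String> undeclared = new TreeSet<>();
    for (String rule : ruleNames) {
      // Assumption: command rules are identifiable by their name suffix.
      if (rule.endsWith("Command") && !declared.contains(rule)) {
        undeclared.add(rule);
      }
    }
    return undeclared;
  }

  /** Declared entries that no longer exist in the grammar, e.g. after a rename. */
  static Set<String> staleDeclarations(String[] ruleNames, Set<String> declared) {
    Set<String> stale = new TreeSet<>(declared);
    for (String rule : ruleNames) {
      stale.remove(rule);
    }
    return stale;
  }
}
```

A test would fail the build whenever either set is non-empty, which covers both a newly added command and a renamed one.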
7. Open questions / next steps
Generator depth and breadth (more index profiles; order-sensitive comparison for total-order sorts).
Adopt the ES|QL error-classification oracle as a first-line generator check (partially in the
generated-differential IT already).
Wire the nightly lane + mutation (PIT) on the pushdown rule classes for the adequacy numbers.
Land the bug fix (fix/sort-pushdown-text-keyword-guard) as its own PR upstream.
Feedback welcome on: the two-oracle split, the coverage-gate-as-forcing-function approach, and how
aggressively to grow the generator vs curate a corpus.
Branches (on this fork)
poc/ppl-combination-pushdown-tests: 4 commits, all green (open PR)
fix/sort-pushdown-text-keyword-guard: 2 commits, verified (open PR)
Design doc: docs/dev/ppl-combination-pushdown-test-framework.md · RFC source: docs/dev/rfc-ppl-combination-pushdown-testing.md (both on the framework branch).