Skip to content

Route analytics queries by index setting#5432

Merged
ahkcs merged 1 commit into
opensearch-project:feature/mustang-ppl-integrationfrom
bowenlan-amzn:experiment/routing-by-dataformat-setting
May 11, 2026
Merged

Route analytics queries by index setting#5432
ahkcs merged 1 commit into
opensearch-project:feature/mustang-ppl-integrationfrom
bowenlan-amzn:experiment/routing-by-dataformat-setting

Conversation

@bowenlan-amzn
Copy link
Copy Markdown
Member

Summary

  • Route queries to the analytics engine based on index settings instead of table-name prefix convention
  • Check both index.pluggable.dataformat.enabled=true AND index.pluggable.dataformat=composite before routing to the analytics path
  • If pluggable dataformat is enabled but the value is not composite (e.g. lucene), fall through to the standard Calcite→DSL path
  • Add pluginSettings to RestUnifiedQueryAction to support cluster-level setting overrides (rex max_match limit)

Test plan

  • Unit tests for routing logic (RestUnifiedQueryActionTest)
    • Composite format routes to analytics
    • Lucene format routes to old path
    • Missing index routes to old path
    • Null/empty queries route to old path

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 11, 2026

PR Reviewer Guide 🔍

(Review updated until commit 28ac22a)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Misleading test name

The test pluggableDataformatIndexRoutesToAnalytics registers an index named parquet_logs but now sets the dataformat to composite. This creates confusion between the index name and its actual format, potentially misleading future maintainers about what the test validates.

public void pluggableDataformatIndexRoutesToAnalytics() {
  registerIndex(
      "parquet_logs",
      Settings.builder()
          .put(IndexSettings.PLUGGABLE_DATAFORMAT_ENABLED_SETTING.getKey(), true)
          .put(IndexSettings.PLUGGABLE_DATAFORMAT_VALUE_SETTING.getKey(), "composite")
          .build());

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 11, 2026

PR Code Suggestions ✨

Latest suggestions up to 28ac22a
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Update test index name for clarity

The test index name parquet_logs is misleading since the dataformat value has been
changed from "parquet" to "composite". Consider renaming the index to reflect the
actual dataformat being tested, such as composite_logs, to maintain consistency and
avoid confusion.

plugin/src/test/java/org/opensearch/sql/plugin/rest/RestUnifiedQueryActionTest.java [56-61]

 registerIndex(
-    "parquet_logs",
+    "composite_logs",
     Settings.builder()
         .put(IndexSettings.PLUGGABLE_DATAFORMAT_ENABLED_SETTING.getKey(), true)
         .put(IndexSettings.PLUGGABLE_DATAFORMAT_VALUE_SETTING.getKey(), "composite")
         .build());
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that the test index name parquet_logs is now misleading after changing the dataformat value from "parquet" to "composite". Renaming to composite_logs would improve code clarity and maintainability, though this is not a critical functional issue.

Medium

Previous suggestions

Suggestions up to commit d504f77
CategorySuggestion                                                                                                                                    Impact
General
Replace hardcoded string with constant

The hardcoded string "composite" should be replaced with a constant to prevent typos
and improve maintainability. Consider defining a constant like COMPOSITE_FORMAT or
using an existing constant if available in IndexSettings.

plugin/src/main/java/org/opensearch/sql/plugin/rest/RestUnifiedQueryAction.java [106-114]

+private static final String COMPOSITE_FORMAT = "composite";
+
 private boolean isPluggableDataformatIndex(String indexName) {
   var indexMetadata = clusterService.state().metadata().index(indexName);
   if (indexMetadata == null) {
     return false;
   }
   var settings = indexMetadata.getSettings();
   return IndexSettings.PLUGGABLE_DATAFORMAT_ENABLED_SETTING.get(settings)
-      && "composite".equals(IndexSettings.PLUGGABLE_DATAFORMAT_VALUE_SETTING.get(settings));
+      && COMPOSITE_FORMAT.equals(IndexSettings.PLUGGABLE_DATAFORMAT_VALUE_SETTING.get(settings));
 }
Suggestion importance[1-10]: 5

__

Why: While replacing the hardcoded "composite" string with a constant improves maintainability and reduces typo risk, this is a minor code quality improvement. The suggestion is valid and follows best practices, but has moderate impact.

Low

Today `RestUnifiedQueryAction.isAnalyticsIndex` dispatches to the analytics
engine when the source index name starts with `parquet_`. That's brittle —
it conflates naming convention with storage type. An index created without
the prefix but with pluggable dataformat enabled is silently sent to the
Lucene path; an index named `parquet_foo` without the setting is
mis-dispatched to analytics.

Use the authoritative signal instead: the `index.pluggable.dataformat.enabled`
setting on cluster-state metadata. This is the same setting integration tests
(`CoordinatorReduceIT`, `CompositeCommitDeletionIT`, etc.) already use to
create analytics-backed indices, and it's what `FieldStorageResolver` reads
to decide field-level storage.

Behavior:
- `index.pluggable.dataformat.enabled=true`  → analytics engine (DataFusion)
- flag absent / false / index missing         → Calcite→OpenSearch DSL path

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 28ac22a

@ahkcs ahkcs merged commit 32d7358 into opensearch-project:feature/mustang-ppl-integration May 11, 2026
56 of 59 checks passed
ahkcs pushed a commit to ahkcs/sql that referenced this pull request May 11, 2026
…search-project#5432)

Today `RestUnifiedQueryAction.isAnalyticsIndex` dispatches to the analytics
engine when the source index name starts with `parquet_`. That's brittle —
it conflates naming convention with storage type. An index created without
the prefix but with pluggable dataformat enabled is silently sent to the
Lucene path; an index named `parquet_foo` without the setting is
mis-dispatched to analytics.

Use the authoritative signal instead: the `index.pluggable.dataformat.enabled`
setting on cluster-state metadata. This is the same setting integration tests
(`CoordinatorReduceIT`, `CompositeCommitDeletionIT`, etc.) already use to
create analytics-backed indices, and it's what `FieldStorageResolver` reads
to decide field-level storage.

Behavior:
- `index.pluggable.dataformat.enabled=true`  → analytics engine (DataFusion)
- flag absent / false / index missing         → Calcite→OpenSearch DSL path

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
ahkcs added a commit to ahkcs/sql that referenced this pull request May 11, 2026
Single squashed delivery of the long-running
feature/mustang-ppl-integration branch into main, consolidating 22
feature-branch PRs plus the conflict-resolved merge of current main.
Squashed because the feature branch's history includes commits with
missing or mismatched Signed-off-by trailers that block DCO at this
scope — the equivalent issue documented for the catch-up squashes
(opensearch-project#5397).

The feature branch f006e29 is retained for individual-commit lineage.

### What this delivers

Analytics-engine PPL integration — a new execution path that routes
Parquet-backed (non-Lucene) indices through an analytics engine while
keeping Lucene-backed indices on the existing v2 / Calcite paths.

Headline pieces:
- Query routing (opensearch-project#5267) — PPL queries against Parquet-backed indices
  hand off to the analytics-engine execution path; Lucene-backed indices
  continue through the legacy path
- Explain support (opensearch-project#5275) — EXPLAIN covers the analytics-engine path
- Profiling + UnifiedQueryParser (opensearch-project#5285) — migrates PPL parsing to the
  unified parser and wires profiling metrics through the analytics path
- extendedPlugins wiring (opensearch-project#5302) — analytics-engine attaches as an
  OpenSearch extension via SPI
- SQL REST endpoint integration (opensearch-project#5317) — same analytics-route fork
  applied to the SQL transport, plus delegateToV2Engine extraction in
  RestSqlAction
- Async QueryPlanExecutor (opensearch-project#5396) — async execution for analytics-engine
  plans + version bump to OpenSearch 3.7
- Optional dependency (opensearch-project#5403) — analytics-engine becomes an optional
  runtime dep so the SQL bundle is shippable without it
- Index-setting-based routing (opensearch-project#5429, opensearch-project#5432) — replaces the earlier
  table-name-prefix heuristic with an authoritative index-setting check

Supporting infrastructure:
- Gradle wrapper bump to 9.4.1 (opensearch-project#5406)
- Jar-hell exclusions for arrow-flight-rpc / httpcore5-h2 /
  httpcore5-reactive / httpclient5 (opensearch-project#5400, opensearch-project#5409)
- IT plumbing: CalciteEvalCommandIT / CalciteFieldFormatCommandIT
  carried through the helper-managed index path (opensearch-project#5407, opensearch-project#5417);
  CalciteReplaceCommandIT column-order-agnostic (opensearch-project#5415); @ignore'd
  Calcite ITs dropped from CalciteNoPushdownIT (opensearch-project#5416)
- plugins.calcite.enabled=true defaulted on the unified query path
  (opensearch-project#5413)
- PPL_REX_MAX_MATCH_LIMIT bridged into UnifiedQueryContext (opensearch-project#5418)
- Calcite tolerance fixes: array() default type (opensearch-project#5421),
  containsNestedAggregator flat-leaf schemas (opensearch-project#5423)
- Sandbox deps switched to analytics-api JDK 21 surface (opensearch-project#5426)

### Feature-branch commits squashed (22)

opensearch-project#5432, opensearch-project#5429, opensearch-project#5426, opensearch-project#5423, opensearch-project#5421, opensearch-project#5418, opensearch-project#5403, opensearch-project#5417, opensearch-project#5415, opensearch-project#5416, opensearch-project#5413,
opensearch-project#5407, opensearch-project#5409, opensearch-project#5406, opensearch-project#5400, opensearch-project#5396, opensearch-project#5317, opensearch-project#5302, opensearch-project#5285, opensearch-project#5275, opensearch-project#5267,
opensearch-project#5397, opensearch-project#5286

### Main commits absorbed via the merge (54)

Brings the branch up to current upstream/main (54 commits since the
last catch-up at opensearch-project#5397, divergence point 513e1b2). Highlights: opensearch-project#5419,
opensearch-project#5408, opensearch-project#5414, opensearch-project#5399, opensearch-project#5394, opensearch-project#5361, opensearch-project#5360, opensearch-project#5240, opensearch-project#5266, opensearch-project#5278, plus 44
others (bugfixes, doc updates, infra).

### Conflict resolutions (7)

Resolved during the merge of main into the feature branch. Resolution
kept the feature branch's analytics-engine-path semantics where main's
changes would have regressed them.

- api/.../UnifiedQueryContext.java
  Blank-line-only conflict; took main's tighter formatting.

- core/.../executor/QueryService.java
  Kept feature's CalciteClassLoaderHelper.withCalciteClassLoader(...)
  wrapping (required for analytics-engine classloader isolation) and
  the matching import.

- integ-test/build.gradle
  Kept feature's detailed root-cause comment on the Gradle 9.4.1
  TestEventReporterAsListener workaround; kept ASCII ordering of
  JSONRequestIT / JoinIT and SQLFunctionsIT / ShowIT / SourceFieldIT
  entries.

- integ-test/.../CalciteEvalCommandIT.java
  Kept feature's if (!TestUtils.isIndexExist(...)) idempotency guards
  on test_eval and test_eval_agent setup (needed for the helper-managed
  index analytics-engine compatibility run).

- legacy/.../RestSqlAction.java
  Kept feature's delegateToV2Engine(...) (extracted from the
  analytics-engine routing path). Both sides added handleException /
  getRestStatus / getRawErrorCode; removed the duplicate set git
  produced.

- plugin/.../SQLPlugin.java
  Took the union of imports: ExecutionEngine +
  ExecutionEngine.ExplainResponse + QueryType.

- plugin/.../transport/TransportPPLQueryAction.java
  Combined main's OpenSearchPluginModule(extensionsHolder.engines()) and
  feature's local pluginSettings / pluginSettingsRef wiring.

EngineExtensionsHolder.java is a new file from main (opensearch-project#5298) preserved
as-is.

### Compatibility / opt-in

The analytics-engine path is gated by the extendedPlugins extension
being installed (opensearch-project#5403 makes the dep optional). Clusters without
analytics-engine installed see no behavior change. Clusters with
analytics-engine installed route only Parquet-backed indices through
the new path (opensearch-project#5429 — by index setting).

### Verification

- ./gradlew :api:compileJava :core:compileJava :legacy:compileJava
  :opensearch-sql-plugin:compileJava :integ-test:compileTestJava passes
  locally

Signed-off-by: Kai Huang <ahkcs@amazon.com>
Co-authored-by: bowenlan-amzn <bowenlan23@gmail.com>
@bowenlan-amzn bowenlan-amzn deleted the experiment/routing-by-dataformat-setting branch May 11, 2026 18:25
ahkcs added a commit that referenced this pull request May 11, 2026
* Land analytics-engine PPL integration into main

Single squashed delivery of the long-running
feature/mustang-ppl-integration branch into main, consolidating 22
feature-branch PRs plus the conflict-resolved merge of current main.
Squashed because the feature branch's history includes commits with
missing or mismatched Signed-off-by trailers that block DCO at this
scope — the equivalent issue documented for the catch-up squashes
(#5397).

The feature branch f006e29 is retained for individual-commit lineage.

### What this delivers

Analytics-engine PPL integration — a new execution path that routes
Parquet-backed (non-Lucene) indices through an analytics engine while
keeping Lucene-backed indices on the existing v2 / Calcite paths.

Headline pieces:
- Query routing (#5267) — PPL queries against Parquet-backed indices
  hand off to the analytics-engine execution path; Lucene-backed indices
  continue through the legacy path
- Explain support (#5275) — EXPLAIN covers the analytics-engine path
- Profiling + UnifiedQueryParser (#5285) — migrates PPL parsing to the
  unified parser and wires profiling metrics through the analytics path
- extendedPlugins wiring (#5302) — analytics-engine attaches as an
  OpenSearch extension via SPI
- SQL REST endpoint integration (#5317) — same analytics-route fork
  applied to the SQL transport, plus delegateToV2Engine extraction in
  RestSqlAction
- Async QueryPlanExecutor (#5396) — async execution for analytics-engine
  plans + version bump to OpenSearch 3.7
- Optional dependency (#5403) — analytics-engine becomes an optional
  runtime dep so the SQL bundle is shippable without it
- Index-setting-based routing (#5429, #5432) — replaces the earlier
  table-name-prefix heuristic with an authoritative index-setting check

Supporting infrastructure:
- Gradle wrapper bump to 9.4.1 (#5406)
- Jar-hell exclusions for arrow-flight-rpc / httpcore5-h2 /
  httpcore5-reactive / httpclient5 (#5400, #5409)
- IT plumbing: CalciteEvalCommandIT / CalciteFieldFormatCommandIT
  carried through the helper-managed index path (#5407, #5417);
  CalciteReplaceCommandIT column-order-agnostic (#5415); @ignore'd
  Calcite ITs dropped from CalciteNoPushdownIT (#5416)
- plugins.calcite.enabled=true defaulted on the unified query path
  (#5413)
- PPL_REX_MAX_MATCH_LIMIT bridged into UnifiedQueryContext (#5418)
- Calcite tolerance fixes: array() default type (#5421),
  containsNestedAggregator flat-leaf schemas (#5423)
- Sandbox deps switched to analytics-api JDK 21 surface (#5426)

### Feature-branch commits squashed (22)

#5432, #5429, #5426, #5423, #5421, #5418, #5403, #5417, #5415, #5416, #5413,
#5407, #5409, #5406, #5400, #5396, #5317, #5302, #5285, #5275, #5267,
#5397, #5286

### Main commits absorbed via the merge (54)

Brings the branch up to current upstream/main (54 commits since the
last catch-up at #5397, divergence point 513e1b2). Highlights: #5419,
#5408, #5414, #5399, #5394, #5361, #5360, #5240, #5266, #5278, plus 44
others (bugfixes, doc updates, infra).

### Conflict resolutions (7)

Resolved during the merge of main into the feature branch. Resolution
kept the feature branch's analytics-engine-path semantics where main's
changes would have regressed them.

- api/.../UnifiedQueryContext.java
  Blank-line-only conflict; took main's tighter formatting.

- core/.../executor/QueryService.java
  Kept feature's CalciteClassLoaderHelper.withCalciteClassLoader(...)
  wrapping (required for analytics-engine classloader isolation) and
  the matching import.

- integ-test/build.gradle
  Kept feature's detailed root-cause comment on the Gradle 9.4.1
  TestEventReporterAsListener workaround; kept ASCII ordering of
  JSONRequestIT / JoinIT and SQLFunctionsIT / ShowIT / SourceFieldIT
  entries.

- integ-test/.../CalciteEvalCommandIT.java
  Kept feature's if (!TestUtils.isIndexExist(...)) idempotency guards
  on test_eval and test_eval_agent setup (needed for the helper-managed
  index analytics-engine compatibility run).

- legacy/.../RestSqlAction.java
  Kept feature's delegateToV2Engine(...) (extracted from the
  analytics-engine routing path). Both sides added handleException /
  getRestStatus / getRawErrorCode; removed the duplicate set git
  produced.

- plugin/.../SQLPlugin.java
  Took the union of imports: ExecutionEngine +
  ExecutionEngine.ExplainResponse + QueryType.

- plugin/.../transport/TransportPPLQueryAction.java
  Combined main's OpenSearchPluginModule(extensionsHolder.engines()) and
  feature's local pluginSettings / pluginSettingsRef wiring.

EngineExtensionsHolder.java is a new file from main (#5298) preserved
as-is.

### Compatibility / opt-in

The analytics-engine path is gated by the extendedPlugins extension
being installed (#5403 makes the dep optional). Clusters without
analytics-engine installed see no behavior change. Clusters with
analytics-engine installed route only Parquet-backed indices through
the new path (#5429 — by index setting).

### Verification

- ./gradlew :api:compileJava :core:compileJava :legacy:compileJava
  :opensearch-sql-plugin:compileJava :integ-test:compileTestJava passes
  locally

Signed-off-by: Kai Huang <ahkcs@amazon.com>
Co-authored-by: bowenlan-amzn <bowenlan23@gmail.com>

* Address @penghuo: revert stray blank line in doctest/build.gradle

After 'apply plugin: opensearch.testclusters', one blank line is enough
— restoring the single-blank spacing to match upstream/main.

Signed-off-by: Kai Huang <ahkcs@amazon.com>

---------

Signed-off-by: Kai Huang <ahkcs@amazon.com>
Co-authored-by: bowenlan-amzn <bowenlan23@gmail.com>
ahkcs added a commit to ahkcs/sql that referenced this pull request May 12, 2026
Adds a small, opt-in test infrastructure slice so the PPL integration
test suite can run end-to-end against the analytics-engine backend
without per-test rewiring.

`-Dtests.analytics.parquet_indices=true` makes `TestUtils.createIndexByRestClient`
back every test-created index with single-shard composite/parquet storage:

    index.pluggable.dataformat.enabled = true
    index.pluggable.dataformat = "composite"
    index.composite.primary_data_format = "parquet"

`RestUnifiedQueryAction.isAnalyticsIndex` (post-opensearch-project#5432) reads these settings
and routes any query against such indices to the analytics-engine planner
(DataFusion). No additional cluster setting or routing override required —
the production routing logic is the single source of truth.

Also adds `PPLIntegTestCase.isAnalyticsParquetIndicesEnabled()` as a
per-test predicate so individual tests can branch their assertions on
engine semantics (DataFusion follows different ordering and null-bucket
semantics than the legacy V2 and Calcite-DSL paths).

Bulk loads on parquet-backed indices use `refresh=true` because
`analytics-backend-lucene`'s `LuceneCommitter.getSafeCommitInfo` is a
`TODO:: with index deleter` stub that hangs `refresh=wait_for` until the
test framework request timeout (~60s).

`integ-test/build.gradle` forwards the property to `:integTestRemote` so
the gradle command line is the single knob.

Default behavior is unchanged — with the flag unset, every test-created
index is Lucene-backed and every IT runs through the existing V2 /
Calcite path exactly as before.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants