Skip to content

Parquet-back raw-PUT test indices on the analytics-engine route#5529

Merged
ahkcs merged 1 commit into
opensearch-project:mainfrom
ahkcs:fix/parquet-back-analytics-route-test-indices
Jun 9, 2026
Merged

Parquet-back raw-PUT test indices on the analytics-engine route#5529
ahkcs merged 1 commit into
opensearch-project:mainfrom
ahkcs:fix/parquet-back-analytics-route-test-indices

Conversation

@ahkcs

@ahkcs ahkcs commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Problem

Running the Calcite IT suite through the analytics-engine route (-Dtests.analytics.parquet_indices=true) produces a large bucket of StreamException[errorCode=INTERNAL] "Failed to start streaming fragment on [<index>][<shard>]" failures, whose cluster-side cause is UnsupportedOperationException: acquireReader is not supported in EngineBackedIndexer.

The root cause is a routing-vs-storage inconsistency:

Decision Keys on
Route to the analytics engine (RestUnifiedQueryAction.isAnalyticsIndex) pluggable.dataformat value == composite
Store as DataFormatAwareEngine (IndicesService.getIndexerFactoryIndexSettings.isPluggableDataFormatEnabled) pluggable.dataformat.enabled=true (+ feature flag)

Test indices created by a raw document PUT (e.g. PUT /test/_doc/1 in a test's init()) bypass AnalyticsIndexConfig.applyIndexCreationSettings (which only stamps indices created via createIndexByRestClient/loadIndex). They inherit the composite value from the cluster opt-in — so they route to the analytics engine — but not the enabled flag, so they are stored as a plain-Lucene EngineBackedIndexer whose acquireReader() is an unimplemented stub. The query then fails with the StreamException above.

Fix

Add AnalyticsIndexConfig.applyClusterSettings(), which sets the cluster-level composite defaults (cluster.pluggable.dataformat.enabled=true, cluster.pluggable.dataformat=composite, parquet primary, lucene secondary) when tests.analytics.parquet_indices=true, and call it from SQLIntegTestCase.setUpIndices() before init(). Every index — including raw-PUT ones — is then stored as a parquet-backed DataFormatAwareEngine that the analytics engine can actually scan. It is a no-op unless tests.analytics.parquet_indices=true, so normal CI is unchanged.

Notes:

  • The lucene secondary format is required, or a parquet-only shard fails recovery (MapperParsingException: Field [_field_names] requires [FULL_TEXT_SEARCH]).
  • cluster.pluggable.dataformat.restrict.allowlist is a static (non-dynamic) setting and is intentionally not set here; it belongs in the cluster node config if a run needs to exempt extra index prefixes.

Verification (analytics route, -Dtests.analytics.parquet_indices=true)

acquireReader StreamExceptions across the affected IT classes: 59 → 0.

Class Before After
CalciteNotLikeNullIT 0/3 (acquireReader) 3/3
CalcitePPLLookupIT 1 acquireReader fail 17/17
CalcitePPLParseIT 1 acquireReader fail 6/6
CalcitePPLPluginIT 1 acquireReader fail 5/5

Indices now execute on the real analytics + DataFusion path and return correct rows. Failures that remain on the heavier classes (aggregation, datetime, mixed-field) are pre-existing, separately-tracked analytics-engine gaps that the acquireReader stub previously masked — percentile value scaling, unsupported datetime functions (TO_DAYS, ADDTIME, …), distinct_count_approx, SPAN/timestamp type divergence, etc. They are not regressions and are out of scope for this test-infra change; once the indices are scannable they simply surface their true cause instead of the generic streaming-fragment wrapper.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

(Review updated until commit 47e7c4e)

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Missing Error Handling

The applyClusterSettings method calls performRequest without handling potential failures. If the cluster settings update fails (e.g., due to network issues, permission errors, or invalid settings), the test setup silently continues with inconsistent configuration. This leads to cryptic test failures later when indices route to analytics engine but aren't parquet-backed. The method should either propagate exceptions or validate that settings were applied successfully.

public static void applyClusterSettings(RestClient client) {
  if (!isEnabled()) {
    return;
  }
  JSONObject persistent =
      new JSONObject()
          .put("cluster.pluggable.dataformat.enabled", true)
          .put("cluster.pluggable.dataformat", DATAFORMAT_COMPOSITE)
          .put("cluster.composite.primary_data_format", PRIMARY_FORMAT_PARQUET)
          .put(
              "cluster.composite.secondary_data_formats",
              new JSONArray().put(SECONDARY_FORMAT_LUCENE));
  Request request = new Request("PUT", "/_cluster/settings");
  request.setJsonEntity(new JSONObject().put("persistent", persistent).toString());
  performRequest(client, request);
}

@ahkcs ahkcs force-pushed the fix/parquet-back-analytics-route-test-indices branch from 03663d0 to f153b99 Compare June 9, 2026 18:54
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

PR Code Suggestions ✨

Latest suggestions up to 769dd82

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Add exception handling for cluster settings

The method does not handle potential exceptions from performRequest, which could
cause cluster settings to fail silently. Add error handling to catch and log any
IOException or other failures when applying cluster settings, ensuring test setup
issues are visible.

integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [120-135]

 public static void applyClusterSettings(RestClient client) {
   if (!isEnabled()) {
     return;
   }
   JSONObject persistent =
       new JSONObject()
           .put("cluster.pluggable.dataformat.enabled", true)
           .put("cluster.pluggable.dataformat", DATAFORMAT_COMPOSITE)
           .put("cluster.composite.primary_data_format", PRIMARY_FORMAT_PARQUET)
           .put(
               "cluster.composite.secondary_data_formats",
               new JSONArray().put(SECONDARY_FORMAT_LUCENE));
   Request request = new Request("PUT", "/_cluster/settings");
   request.setJsonEntity(new JSONObject().put("persistent", persistent).toString());
-  performRequest(client, request);
+  try {
+    performRequest(client, request);
+  } catch (Exception e) {
+    throw new RuntimeException("Failed to apply cluster settings for analytics config", e);
+  }
 }
Suggestion importance[1-10]: 7

__

Why: Adding exception handling for performRequest is a valid improvement that would make test setup failures more visible. However, this is a moderate enhancement rather than a critical fix, as the method is called during test setup where failures would likely be noticed anyway.

Medium

Previous suggestions

Suggestions up to commit f153b99
CategorySuggestion                                                                                                                                    Impact
General
Use JSON library for request body

The JSON string concatenation is error-prone and lacks proper escaping. Consider
using a JSON library (like JSONObject) to construct the request body
programmatically, ensuring proper formatting and reducing the risk of syntax errors.

integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [118-124]

-request.setJsonEntity(
-    "{\"persistent\":{"
-        + "\"cluster.pluggable.dataformat.enabled\":true,"
-        + "\"cluster.pluggable.dataformat\":\"composite\","
-        + "\"cluster.composite.primary_data_format\":\"parquet\","
-        + "\"cluster.composite.secondary_data_formats\":[\"lucene\"]"
-        + "}}");
+JSONObject persistent = new JSONObject();
+persistent.put("cluster.pluggable.dataformat.enabled", true);
+persistent.put("cluster.pluggable.dataformat", "composite");
+persistent.put("cluster.composite.primary_data_format", "parquet");
+persistent.put("cluster.composite.secondary_data_formats", new org.json.JSONArray().put("lucene"));
+JSONObject settings = new JSONObject();
+settings.put("persistent", persistent);
+request.setJsonEntity(settings.toString());
Suggestion importance[1-10]: 7

__

Why: Using JSONObject instead of string concatenation improves code maintainability and reduces the risk of syntax errors. The suggestion correctly identifies the code location and provides a valid alternative approach that is already used elsewhere in the same file (lines 84-89).

Medium
Add error handling for cluster settings

The performRequest call lacks error handling for potential failures when applying
cluster settings. If the request fails, tests may proceed with incorrect cluster
configuration, leading to misleading test results or failures.

integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [125]

-performRequest(client, request);
+try {
+  performRequest(client, request);
+} catch (Exception e) {
+  throw new RuntimeException("Failed to apply cluster settings for parquet indices", e);
+}
Suggestion importance[1-10]: 6

__

Why: Adding error handling would make failures more explicit and prevent tests from running with incorrect configuration. However, the score is moderate because performRequest may already handle exceptions appropriately, and the suggestion doesn't verify the existing error handling behavior in the codebase.

Low
Suggestions up to commit 03663d0
CategorySuggestion                                                                                                                                    Impact
General
Use JSON library for settings

The hardcoded JSON string is error-prone and difficult to maintain. Consider using a
JSON library (like org.json.JSONObject) to construct the settings object
programmatically, ensuring proper escaping and reducing the risk of syntax errors.

integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [118-124]

-request.setJsonEntity(
-    "{\"persistent\":{"
-        + "\"cluster.pluggable.dataformat.enabled\":true,"
-        + "\"cluster.pluggable.dataformat\":\"composite\","
-        + "\"cluster.composite.primary_data_format\":\"parquet\","
-        + "\"cluster.composite.secondary_data_formats\":[\"lucene\"]"
-        + "}}");
+JSONObject persistent = new JSONObject();
+persistent.put("cluster.pluggable.dataformat.enabled", true);
+persistent.put("cluster.pluggable.dataformat", "composite");
+persistent.put("cluster.composite.primary_data_format", "parquet");
+persistent.put("cluster.composite.secondary_data_formats", new org.json.JSONArray().put("lucene"));
+JSONObject settings = new JSONObject();
+settings.put("persistent", persistent);
+request.setJsonEntity(settings.toString());
Suggestion importance[1-10]: 7

__

Why: Using JSONObject instead of hardcoded strings improves maintainability and reduces syntax errors. This is a valid improvement that enhances code quality, though the current implementation is functional.

Medium
Add error handling for settings

The performRequest call lacks error handling for potential failures when applying
cluster settings. Add try-catch or verify the response status to ensure settings
were successfully applied before tests proceed.

integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [125]

-performRequest(client, request);
+Response response = performRequest(client, request);
+if (response.getStatusLine().getStatusCode() != 200) {
+  throw new RuntimeException("Failed to apply cluster settings: " + response.getStatusLine());
+}
Suggestion importance[1-10]: 6

__

Why: Adding error handling for performRequest would improve robustness by catching failures when applying cluster settings. However, this is a moderate improvement as the existing code may have error handling elsewhere or the method may already throw exceptions.

Low

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit f153b99

@ahkcs ahkcs added the enhancement New feature or request label Jun 9, 2026
@ahkcs ahkcs force-pushed the fix/parquet-back-analytics-route-test-indices branch from f153b99 to 769dd82 Compare June 9, 2026 19:06
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 769dd82

With -Dtests.analytics.parquet_indices=true, indices created by a raw
document PUT (e.g. `PUT /test/_doc/1` in a test's init()) bypass
AnalyticsIndexConfig.applyIndexCreationSettings, so they inherit the
composite *value* — and are therefore routed to the analytics engine by
RestUnifiedQueryAction.isAnalyticsIndex — but not the
`pluggable.dataformat.enabled` flag. They are then stored as a
plain-Lucene EngineBackedIndexer whose acquireReader() is unimplemented,
and the query fails with
`StreamException[INTERNAL] Failed to start streaming fragment`.

Apply the cluster-level composite defaults in setUpIndices() so every
index — including raw-PUT ones — is stored as a parquet-backed
DataFormatAwareEngine that is actually scannable by the analytics engine
it routes to. No-op unless tests.analytics.parquet_indices=true, so
normal CI is unchanged.

Signed-off-by: Kai Huang <ahkcs@amazon.com>
@ahkcs ahkcs force-pushed the fix/parquet-back-analytics-route-test-indices branch from 769dd82 to 47e7c4e Compare June 9, 2026 19:09
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 47e7c4e

// auto-creates via a raw document PUT, which bypasses createIndexByRestClient) parquet-backed
// composite, so it is stored as a DataFormatAwareEngine and is actually scannable by the
// analytics engine it routes to. Must run before init() creates any index.
TestUtils.AnalyticsIndexConfig.applyClusterSettings(client());

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Any cleanup required in tear down logic?
  2. I assume this applied to all SQL tests too if any raw doc put?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked into the issues:

  1. No extra cleanup needed — same lifecycle as enableCalcite(). applyClusterSettings() is called from setUpIndices() (@before) so it's re-applied before every test method, and all of these settings are wiped by the existing @afterclass cleanUpIndices() → wipeAllClusterSettings() (nulls persistent./transient. after each class).
  2. Correct — but only when -Dtests.analytics.parquet_indices=true, which is set only for the analytics-route runs

@ahkcs ahkcs merged commit 0bff609 into opensearch-project:main Jun 9, 2026
40 checks passed
@ahkcs ahkcs deleted the fix/parquet-back-analytics-route-test-indices branch June 9, 2026 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants