Parquet-back raw-PUT test indices on the analytics-engine route by ahkcs · Pull Request #5529 · opensearch-project/sql

ahkcs · 2026-06-09T18:52:54Z

Problem

Running the Calcite IT suite through the analytics-engine route (-Dtests.analytics.parquet_indices=true) produces a large bucket of StreamException[errorCode=INTERNAL] "Failed to start streaming fragment on [<index>][<shard>]" failures, whose cluster-side cause is UnsupportedOperationException: acquireReader is not supported in EngineBackedIndexer.

The root cause is a routing-vs-storage inconsistency:

Decision	Keys on
Route to the analytics engine (`RestUnifiedQueryAction.isAnalyticsIndex`)	`pluggable.dataformat` value == `composite`
Store as `DataFormatAwareEngine` (`IndicesService.getIndexerFactory` → `IndexSettings.isPluggableDataFormatEnabled`)	`pluggable.dataformat.``enabled``=true` (+ feature flag)

Test indices created by a raw document PUT (e.g. PUT /test/_doc/1 in a test's init()) bypass AnalyticsIndexConfig.applyIndexCreationSettings (which only stamps indices created via createIndexByRestClient/loadIndex). They inherit the composite value from the cluster opt-in — so they route to the analytics engine — but not the enabled flag, so they are stored as a plain-Lucene EngineBackedIndexer whose acquireReader() is an unimplemented stub. The query then fails with the StreamException above.

Fix

Add AnalyticsIndexConfig.applyClusterSettings(), which sets the cluster-level composite defaults (cluster.pluggable.dataformat.enabled=true, cluster.pluggable.dataformat=composite, parquet primary, lucene secondary) when tests.analytics.parquet_indices=true, and call it from SQLIntegTestCase.setUpIndices() before init(). Every index — including raw-PUT ones — is then stored as a parquet-backed DataFormatAwareEngine that the analytics engine can actually scan. It is a no-op unless tests.analytics.parquet_indices=true, so normal CI is unchanged.

Notes:

The lucene secondary format is required, or a parquet-only shard fails recovery (MapperParsingException: Field [_field_names] requires [FULL_TEXT_SEARCH]).
cluster.pluggable.dataformat.restrict.allowlist is a static (non-dynamic) setting and is intentionally not set here; it belongs in the cluster node config if a run needs to exempt extra index prefixes.

Verification (analytics route, `-Dtests.analytics.parquet_indices=true`)

acquireReader StreamExceptions across the affected IT classes: 59 → 0.

Class	Before	After
`CalciteNotLikeNullIT`	0/3 (acquireReader)	3/3
`CalcitePPLLookupIT`	1 acquireReader fail	17/17
`CalcitePPLParseIT`	1 acquireReader fail	6/6
`CalcitePPLPluginIT`	1 acquireReader fail	5/5

Indices now execute on the real analytics + DataFusion path and return correct rows. Failures that remain on the heavier classes (aggregation, datetime, mixed-field) are pre-existing, separately-tracked analytics-engine gaps that the acquireReader stub previously masked — percentile value scaling, unsupported datetime functions (TO_DAYS, ADDTIME, …), distinct_count_approx, SPAN/timestamp type divergence, etc. They are not regressions and are out of scope for this test-infra change; once the indices are scannable they simply surface their true cause instead of the generic streaming-fragment wrapper.

github-actions · 2026-06-09T18:54:07Z

PR Reviewer Guide 🔍

(Review updated until commit `47e7c4e`)

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review Missing Error Handling The `applyClusterSettings` method calls `performRequest` without handling potential failures. If the cluster settings update fails (e.g., due to network issues, permission errors, or invalid settings), the test setup silently continues with inconsistent configuration. This leads to cryptic test failures later when indices route to analytics engine but aren't parquet-backed. The method should either propagate exceptions or validate that settings were applied successfully. public static void applyClusterSettings(RestClient client) { if (!isEnabled()) { return; } JSONObject persistent = new JSONObject() .put("cluster.pluggable.dataformat.enabled", true) .put("cluster.pluggable.dataformat", DATAFORMAT_COMPOSITE) .put("cluster.composite.primary_data_format", PRIMARY_FORMAT_PARQUET) .put( "cluster.composite.secondary_data_formats", new JSONArray().put(SECONDARY_FORMAT_LUCENE)); Request request = new Request("PUT", "/_cluster/settings"); request.setJsonEntity(new JSONObject().put("persistent", persistent).toString()); performRequest(client, request); }

github-actions · 2026-06-09T18:54:24Z

PR Code Suggestions ✨

Latest suggestions up to 769dd82

Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Add exception handling for cluster settings The method does not handle potential exceptions from `performRequest`, which could cause cluster settings to fail silently. Add error handling to catch and log any IOException or other failures when applying cluster settings, ensuring test setup issues are visible. integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [120-135] public static void applyClusterSettings(RestClient client) { if (!isEnabled()) { return; } JSONObject persistent = new JSONObject() .put("cluster.pluggable.dataformat.enabled", true) .put("cluster.pluggable.dataformat", DATAFORMAT_COMPOSITE) .put("cluster.composite.primary_data_format", PRIMARY_FORMAT_PARQUET) .put( "cluster.composite.secondary_data_formats", new JSONArray().put(SECONDARY_FORMAT_LUCENE)); Request request = new Request("PUT", "/_cluster/settings"); request.setJsonEntity(new JSONObject().put("persistent", persistent).toString()); - performRequest(client, request); + try { + performRequest(client, request); + } catch (Exception e) { + throw new RuntimeException("Failed to apply cluster settings for analytics config", e); + } } Suggestion importance[1-10]: 7 __ Why: Adding exception handling for `performRequest` is a valid improvement that would make test setup failures more visible. However, this is a moderate enhancement rather than a critical fix, as the method is called during test setup where failures would likely be noticed anyway.	Medium

Previous suggestions

Suggestions up to commit f153b99

Category	Suggestion	Impact
General	Use JSON library for request body The JSON string concatenation is error-prone and lacks proper escaping. Consider using a JSON library (like JSONObject) to construct the request body programmatically, ensuring proper formatting and reducing the risk of syntax errors. integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [118-124] -request.setJsonEntity( - "{\"persistent\":{" - + "\"cluster.pluggable.dataformat.enabled\":true," - + "\"cluster.pluggable.dataformat\":\"composite\"," - + "\"cluster.composite.primary_data_format\":\"parquet\"," - + "\"cluster.composite.secondary_data_formats\":[\"lucene\"]" - + "}}"); +JSONObject persistent = new JSONObject(); +persistent.put("cluster.pluggable.dataformat.enabled", true); +persistent.put("cluster.pluggable.dataformat", "composite"); +persistent.put("cluster.composite.primary_data_format", "parquet"); +persistent.put("cluster.composite.secondary_data_formats", new org.json.JSONArray().put("lucene")); +JSONObject settings = new JSONObject(); +settings.put("persistent", persistent); +request.setJsonEntity(settings.toString()); Suggestion importance[1-10]: 7 __ Why: Using `JSONObject` instead of string concatenation improves code maintainability and reduces the risk of syntax errors. The suggestion correctly identifies the code location and provides a valid alternative approach that is already used elsewhere in the same file (lines 84-89).	Medium
General	Add error handling for cluster settings The `performRequest` call lacks error handling for potential failures when applying cluster settings. If the request fails, tests may proceed with incorrect cluster configuration, leading to misleading test results or failures. integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [125] -performRequest(client, request); +try { + performRequest(client, request); +} catch (Exception e) { + throw new RuntimeException("Failed to apply cluster settings for parquet indices", e); +} Suggestion importance[1-10]: 6 __ Why: Adding error handling would make failures more explicit and prevent tests from running with incorrect configuration. However, the score is moderate because `performRequest` may already handle exceptions appropriately, and the suggestion doesn't verify the existing error handling behavior in the codebase.	Low

Suggestions up to commit 03663d0

Category	Suggestion	Impact
General	Use JSON library for settings The hardcoded JSON string is error-prone and difficult to maintain. Consider using a JSON library (like `org.json.JSONObject`) to construct the settings object programmatically, ensuring proper escaping and reducing the risk of syntax errors. integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [118-124] -request.setJsonEntity( - "{\"persistent\":{" - + "\"cluster.pluggable.dataformat.enabled\":true," - + "\"cluster.pluggable.dataformat\":\"composite\"," - + "\"cluster.composite.primary_data_format\":\"parquet\"," - + "\"cluster.composite.secondary_data_formats\":[\"lucene\"]" - + "}}"); +JSONObject persistent = new JSONObject(); +persistent.put("cluster.pluggable.dataformat.enabled", true); +persistent.put("cluster.pluggable.dataformat", "composite"); +persistent.put("cluster.composite.primary_data_format", "parquet"); +persistent.put("cluster.composite.secondary_data_formats", new org.json.JSONArray().put("lucene")); +JSONObject settings = new JSONObject(); +settings.put("persistent", persistent); +request.setJsonEntity(settings.toString()); Suggestion importance[1-10]: 7 __ Why: Using `JSONObject` instead of hardcoded strings improves maintainability and reduces syntax errors. This is a valid improvement that enhances code quality, though the current implementation is functional.	Medium
General	Add error handling for settings The `performRequest` call lacks error handling for potential failures when applying cluster settings. Add try-catch or verify the response status to ensure settings were successfully applied before tests proceed. integ-test/src/test/java/org/opensearch/sql/legacy/TestUtils.java [125] -performRequest(client, request); +Response response = performRequest(client, request); +if (response.getStatusLine().getStatusCode() != 200) { + throw new RuntimeException("Failed to apply cluster settings: " + response.getStatusLine()); +} Suggestion importance[1-10]: 6 __ Why: Adding error handling for `performRequest` would improve robustness by catching failures when applying cluster settings. However, this is a moderate improvement as the existing code may have error handling elsewhere or the method may already throw exceptions.	Low

github-actions · 2026-06-09T18:55:32Z

Persistent review updated to latest commit f153b99

github-actions · 2026-06-09T19:08:22Z

Persistent review updated to latest commit 769dd82

With -Dtests.analytics.parquet_indices=true, indices created by a raw document PUT (e.g. `PUT /test/_doc/1` in a test's init()) bypass AnalyticsIndexConfig.applyIndexCreationSettings, so they inherit the composite *value* — and are therefore routed to the analytics engine by RestUnifiedQueryAction.isAnalyticsIndex — but not the `pluggable.dataformat.enabled` flag. They are then stored as a plain-Lucene EngineBackedIndexer whose acquireReader() is unimplemented, and the query fails with `StreamException[INTERNAL] Failed to start streaming fragment`. Apply the cluster-level composite defaults in setUpIndices() so every index — including raw-PUT ones — is stored as a parquet-backed DataFormatAwareEngine that is actually scannable by the analytics engine it routes to. No-op unless tests.analytics.parquet_indices=true, so normal CI is unchanged. Signed-off-by: Kai Huang <ahkcs@amazon.com>

github-actions · 2026-06-09T19:10:45Z

Persistent review updated to latest commit 47e7c4e

dai-chen · 2026-06-09T19:57:11Z

+    // auto-creates via a raw document PUT, which bypasses createIndexByRestClient) parquet-backed
+    // composite, so it is stored as a DataFormatAwareEngine and is actually scannable by the
+    // analytics engine it routes to. Must run before init() creates any index.
+    TestUtils.AnalyticsIndexConfig.applyClusterSettings(client());


Any cleanup required in tear down logic?

I assume this applied to all SQL tests too if any raw doc put?

Looked into the issues:

No extra cleanup needed — same lifecycle as enableCalcite(). applyClusterSettings() is called from setUpIndices() (@before) so it's re-applied before every test method, and all of these settings are wiped by the existing @afterclass cleanUpIndices() → wipeAllClusterSettings() (nulls persistent./transient. after each class).

Correct — but only when -Dtests.analytics.parquet_indices=true, which is set only for the analytics-route runs

ahkcs requested review from LantaoJin, RyanL1997, Swiddis, acarbonetto, anirudha, dai-chen, joshuali925, mengweieric, noCharger, penghuo, ps48, qianheng-aws, songkant-aws, vamsimanohar, ykmr1224 and yuancu as code owners June 9, 2026 18:52

ahkcs force-pushed the fix/parquet-back-analytics-route-test-indices branch from 03663d0 to f153b99 Compare June 9, 2026 18:54

ahkcs added the enhancement New feature or request label Jun 9, 2026

ahkcs force-pushed the fix/parquet-back-analytics-route-test-indices branch from f153b99 to 769dd82 Compare June 9, 2026 19:06

ahkcs force-pushed the fix/parquet-back-analytics-route-test-indices branch from 769dd82 to 47e7c4e Compare June 9, 2026 19:09

dai-chen approved these changes Jun 9, 2026

View reviewed changes

ahkcs merged commit 0bff609 into opensearch-project:main Jun 9, 2026
40 checks passed

ahkcs deleted the fix/parquet-back-analytics-route-test-indices branch June 9, 2026 20:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parquet-back raw-PUT test indices on the analytics-engine route#5529

Parquet-back raw-PUT test indices on the analytics-engine route#5529
ahkcs merged 1 commit into
opensearch-project:mainfrom
ahkcs:fix/parquet-back-analytics-route-test-indices

ahkcs commented Jun 9, 2026

Uh oh!

github-actions Bot commented Jun 9, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 9, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 9, 2026

Uh oh!

github-actions Bot commented Jun 9, 2026

Uh oh!

github-actions Bot commented Jun 9, 2026

Uh oh!

dai-chen Jun 9, 2026

Uh oh!

ahkcs Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ahkcs commented Jun 9, 2026

Problem

Fix

Verification (analytics route, -Dtests.analytics.parquet_indices=true)

Uh oh!

github-actions Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Reviewer Guide 🔍

(Review updated until commit 47e7c4e)

Uh oh!

github-actions Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Code Suggestions ✨

Previous suggestions

Uh oh!

github-actions Bot commented Jun 9, 2026

Uh oh!

github-actions Bot commented Jun 9, 2026

Uh oh!

github-actions Bot commented Jun 9, 2026

Uh oh!

dai-chen Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

ahkcs Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Verification (analytics route, `-Dtests.analytics.parquet_indices=true`)

github-actions Bot commented Jun 9, 2026 •

edited

Loading

(Review updated until commit `47e7c4e`)

github-actions Bot commented Jun 9, 2026 •

edited

Loading