Skip to content

Publish failure metrics for SQL analytics engine execution path#5487

Merged
dai-chen merged 1 commit into
opensearch-project:mainfrom
dai-chen:fix/sql-ae-failure-metrics
May 29, 2026
Merged

Publish failure metrics for SQL analytics engine execution path#5487
dai-chen merged 1 commit into
opensearch-project:mainfrom
dai-chen:fix/sql-ae-failure-metrics

Conversation

@dai-chen

@dai-chen dai-chen commented May 28, 2026

Copy link
Copy Markdown
Collaborator

Description

This PR fixes the SQL analytics engine (AE) execution path so failures now publish failure metrics and are logged, consistent with the SQL V2 path. The PPL AE path needs no change — failures already propagate to the top-level listener callback in RestPPLQueryAction.onFailure().

Example

Use the unsupported SEC_TO_TIME function to trigger a backend failure on both endpoints, followed by the stats. Both failed_request_count_syserr (SQL) and ppl_failed_request_count_syserr (PPL) now increment correctly on AE failures, with no double-counting:

$ curl -s -X POST "localhost:9200/_plugins/_sql" -H 'Content-Type: application/json' \
   -d '{"query":"SELECT SEC_TO_TIME(3661) FROM test_basic"}'

$ curl -s -X POST "localhost:9200/_plugins/_ppl" -H 'Content-Type: application/json' \
   -d '{"query":"source=test_basic | eval t = SEC_TO_TIME(3661)"}'
 {
   "error": {
     "reason": "There was internal problem at backend",
     "details": "No backend supports scalar function [SEC_TO_TIME] among [datafusion]",
     "type": "IllegalStateException"
   },
   "status": 500
 }

$ curl -s -X GET "localhost:9200/_plugins/_sql/stats?pretty"
 {
   "request_total": 3,
   "request_count": 3,
   "failed_request_count_cuserr": 0,
   "failed_request_count_syserr": 3,
   ...
   "ppl_request_total": 4,
   "ppl_request_count": 4,
   "ppl_failed_request_count_cuserr": 0,
   "ppl_failed_request_count_syserr": 4
 }

Related Issues

Part of #5246

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@dai-chen dai-chen self-assigned this May 28, 2026
@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

(Review updated until commit 61f3bc5)

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

PR Code Suggestions ✨

Latest suggestions up to 61f3bc5
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Ensure static method dependencies are resolved

Making handleException static may break if logAndPublishMetrics or getRestStatus are
instance methods that depend on instance state. Verify these methods are also static
or accessible in a static context to prevent runtime errors.

legacy/src/main/java/org/opensearch/sql/legacy/plugin/RestSqlAction.java [207-210]

 public static void handleException(RestChannel restChannel, Exception exception) {
-  RestStatus status = getRestStatus(exception);
-  logAndPublishMetrics(status, exception);
+  RestStatus status = RestSqlAction.getRestStatus(exception);
+  RestSqlAction.logAndPublishMetrics(status, exception);
   reportError(restChannel, exception, status);
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion asks to verify that getRestStatus and logAndPublishMetrics are static or accessible in static context, which is a valid concern. However, it only warrants verification rather than code changes, and the 'improved_code' unnecessarily adds explicit class qualifiers that aren't required.

Low

Previous suggestions

Suggestions up to commit f2c5013
CategorySuggestion                                                                                                                                    Impact
General
Extract duplicated failure handling logic

The failure handling logic is duplicated in both explain and execute paths. Extract
this into a reusable wrapper method to reduce code duplication and ensure consistent
error handling across both code paths.

plugin/src/main/java/org/opensearch/sql/plugin/transport/TransportPPLQueryAction.java [163-192]

-new ActionListener<>() {
-  @Override
-  public void onResponse(TransportPPLQueryResponse response) {
-    clearingListener.onResponse(response);
-  }
+createMetricsTrackingListener(clearingListener)
 
-  @Override
-  public void onFailure(Exception e) {
-    RestPPLQueryAction.loggedErrorCode(e);
-    clearingListener.onFailure(e);
-  }
-}
-
Suggestion importance[1-10]: 6

__

Why: Valid suggestion to reduce code duplication between the explain and execute paths. The duplicated ActionListener implementations with identical failure handling could be extracted into a helper method for better maintainability.

Low
Possible issue
Ensure failure metrics are published

The loggedErrorCode method only returns a status code but doesn't log metrics or
publish them. This may result in incomplete failure tracking. Ensure metrics are
properly logged and published for analytics engine failures.

plugin/src/main/java/org/opensearch/sql/plugin/transport/TransportPPLQueryAction.java [170-174]

 @Override
 public void onFailure(Exception e) {
-  RestPPLQueryAction.loggedErrorCode(e);
+  RestStatus status = RestPPLQueryAction.loggedErrorCode(e);
+  Metrics.getInstance().getNumericalMetric(MetricName.PPL_FAILED_REQ_COUNT_SYS).increment();
   clearingListener.onFailure(e);
 }
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies that loggedErrorCode doesn't publish metrics, but assumes metrics should be tracked here without verifying if they're already tracked elsewhere in the call chain or if this is the appropriate location for metric tracking.

Low

The SQL analytics router in SQLPlugin handled errors by sending a raw
BytesRestResponse without incrementing FAILED_REQ_COUNT_CUS or
FAILED_REQ_COUNT_SYS, so SQL+AE failures were invisible in stats. Unlike
the PPL path, the router owns the channel and never routes failures back
to RestSqlAction's terminal error handler.

Make RestSqlAction.handleException public static and call it from the
analytics router onFailure callbacks so failure metrics and proper error
formatting are applied consistently. The PPL+AE path already publishes
metrics via RestPPLQueryAction's listener, so no change is needed there.

Signed-off-by: Chen Dai <daichen@amazon.com>
@dai-chen dai-chen force-pushed the fix/sql-ae-failure-metrics branch from f2c5013 to 61f3bc5 Compare May 29, 2026 15:53
@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 61f3bc5

@dai-chen dai-chen changed the title Fix missing failure metrics for SQL/PPL with analytics engine Publish failure metrics for SQL analytics engine execution path May 29, 2026
@dai-chen dai-chen marked this pull request as ready for review May 29, 2026 17:36
@dai-chen dai-chen merged commit acd4437 into opensearch-project:main May 29, 2026
42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants