Skip to content

[Backport 3.2] Monitoring: Accurately capture predict failures (#4525)#4838

Open
pyek-bot wants to merge 1 commit into
opensearch-project:3.2from
pyek-bot:backport/backport-4555-to-3.2
Open

[Backport 3.2] Monitoring: Accurately capture predict failures (#4525)#4838
pyek-bot wants to merge 1 commit into
opensearch-project:3.2from
pyek-bot:backport/backport-4555-to-3.2

Conversation

@pyek-bot
Copy link
Copy Markdown
Collaborator

@pyek-bot pyek-bot commented May 27, 2026

Summary

  • Backport of [Backport 3.3] Monitoring: Accurately capture predict failures #4555 to 3.2
  • Introduces shouldTrackRemoteFailure(Exception e) method that skips tracking for user/configuration issues (4xx HTTP errors, IllegalArgumentException) while still tracking infrastructure/service issues (5xx errors, other exceptions)
  • Fixes two call sites in runPredict that previously hard-coded false (never tracking remote failures)

Test plan

  • Unit test testShouldTrackRemoteFailure validates all scenarios

) (opensearch-project#4555)

* fix: correctly capture ml failure stats on predict failures

* refactor: remove redundant braces

* refactor: dont track any 400 errors and add positive test case

* fix: apply spotless

---------

(cherry picked from commit a28d344)

Signed-off-by: Pavan Yekbote <pybot@amazon.com>
Co-authored-by: Pavan Yekbote <pybot@amazon.com>
Signed-off-by: Pavan Yekbote <pybot@amazon.com>
@pyek-bot pyek-bot force-pushed the backport/backport-4555-to-3.2 branch from 7db5be8 to 1c2784d Compare May 27, 2026 22:29
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 27, 2026

PR Reviewer Guide 🔍

(Review updated until commit 1c2784d)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 27, 2026

PR Code Suggestions ✨

Latest suggestions up to 1c2784d
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Check exception causes for proper classification

The method should check for nested causes of exceptions to properly classify
failures. Many exceptions wrap the actual cause, so checking only the top-level
exception type may miss user configuration errors that are wrapped in other
exception types.

plugin/src/main/java/org/opensearch/ml/task/MLPredictTaskRunner.java [657-673]

 boolean shouldTrackRemoteFailure(Exception e) {
-    // Don't track failures for user configuration issues
-    if (e instanceof IllegalArgumentException) {
-        return false;
-    }
-
-    // Don't track any 4xx client errors (user/configuration issues)
-    if (e instanceof OpenSearchStatusException) {
-        RestStatus status = ((OpenSearchStatusException) e).status();
-        if (status.getStatus() >= 400 && status.getStatus() < 500) {
+    // Check the exception and its causes
+    Throwable current = e;
+    while (current != null) {
+        // Don't track failures for user configuration issues
+        if (current instanceof IllegalArgumentException) {
             return false;
         }
+
+        // Don't track any 4xx client errors (user/configuration issues)
+        if (current instanceof OpenSearchStatusException) {
+            RestStatus status = ((OpenSearchStatusException) current).status();
+            if (status.getStatus() >= 400 && status.getStatus() < 500) {
+                return false;
+            }
+        }
+        
+        current = current.getCause();
     }
 
     // Track failures for infrastructure/service issues
     return true;
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that checking nested exception causes would make the classification more robust. However, the current implementation may be intentionally simple for performance reasons, and the improvement is not critical since most exceptions in this context are likely not deeply wrapped.

Medium

Previous suggestions

Suggestions up to commit 1c2784d
CategorySuggestion                                                                                                                                    Impact
General
Check exception cause chain

The method doesn't handle wrapped exceptions where the root cause might be an
IllegalArgumentException or OpenSearchStatusException. Consider checking the
exception's cause chain to accurately classify failures that are wrapped in other
exception types.

plugin/src/main/java/org/opensearch/ml/task/MLPredictTaskRunner.java [657-673]

 boolean shouldTrackRemoteFailure(Exception e) {
-    // Don't track failures for user configuration issues
-    if (e instanceof IllegalArgumentException) {
-        return false;
-    }
-
-    // Don't track any 4xx client errors (user/configuration issues)
-    if (e instanceof OpenSearchStatusException) {
-        RestStatus status = ((OpenSearchStatusException) e).status();
-        if (status.getStatus() >= 400 && status.getStatus() < 500) {
+    // Check the exception and its cause chain
+    Throwable current = e;
+    while (current != null) {
+        // Don't track failures for user configuration issues
+        if (current instanceof IllegalArgumentException) {
             return false;
         }
+
+        // Don't track any 4xx client errors (user/configuration issues)
+        if (current instanceof OpenSearchStatusException) {
+            RestStatus status = ((OpenSearchStatusException) current).status();
+            if (status.getStatus() >= 400 && status.getStatus() < 500) {
+                return false;
+            }
+        }
+        current = current.getCause();
     }
 
     // Track failures for infrastructure/service issues
     return true;
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that wrapped exceptions might not be properly classified. Checking the cause chain would improve the robustness of the failure tracking logic, though the current implementation may be sufficient for most common cases where exceptions are not wrapped.

Medium

@github-actions
Copy link
Copy Markdown

Persistent review updated to latest commit 1c2784d

@pyek-bot
Copy link
Copy Markdown
Collaborator Author

Spotless failing due to:

A problem occurred configuring project ':opensearch-ml-plugin'.
> java.io.IOException: Failed to load eclipse jdt formatter: dev.equo.solstice.p2.P2Client$CouldNotFindException: Attempted to find resource at

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant