[Backport 3.2] Monitoring: Accurately capture predict failures (#4525) by pyek-bot · Pull Request #4838 · opensearch-project/ml-commons

pyek-bot · 2026-05-27T22:28:47Z

Summary

Backport of [Backport 3.3] Monitoring: Accurately capture predict failures #4555 to 3.2
Introduces shouldTrackRemoteFailure(Exception e) method that skips tracking for user/configuration issues (4xx HTTP errors, IllegalArgumentException) while still tracking infrastructure/service issues (5xx errors, other exceptions)
Fixes two call sites in runPredict that previously hard-coded false (never tracking remote failures)

Test plan

Unit test testShouldTrackRemoteFailure validates all scenarios

) (opensearch-project#4555) * fix: correctly capture ml failure stats on predict failures * refactor: remove redundant braces * refactor: dont track any 400 errors and add positive test case * fix: apply spotless --------- (cherry picked from commit a28d344) Signed-off-by: Pavan Yekbote <pybot@amazon.com> Co-authored-by: Pavan Yekbote <pybot@amazon.com> Signed-off-by: Pavan Yekbote <pybot@amazon.com>

github-actions · 2026-05-27T22:29:42Z

PR Reviewer Guide 🔍

(Review updated until commit `1c2784d`)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

github-actions · 2026-05-27T22:29:55Z

PR Code Suggestions ✨

Latest suggestions up to 1c2784d
Explore these optional code suggestions:

Category

Suggestion

Impact

General

Check exception causes for proper classification

The method should check for nested causes of exceptions to properly classify
failures. Many exceptions wrap the actual cause, so checking only the top-level
exception type may miss user configuration errors that are wrapped in other
exception types.

plugin/src/main/java/org/opensearch/ml/task/MLPredictTaskRunner.java [657-673]

 boolean shouldTrackRemoteFailure(Exception e) {
-    // Don't track failures for user configuration issues
-    if (e instanceof IllegalArgumentException) {
-        return false;
-    }
-
-    // Don't track any 4xx client errors (user/configuration issues)
-    if (e instanceof OpenSearchStatusException) {
-        RestStatus status = ((OpenSearchStatusException) e).status();
-        if (status.getStatus() >= 400 && status.getStatus() < 500) {
+    // Check the exception and its causes
+    Throwable current = e;
+    while (current != null) {
+        // Don't track failures for user configuration issues
+        if (current instanceof IllegalArgumentException) {
             return false;
         }
+
+        // Don't track any 4xx client errors (user/configuration issues)
+        if (current instanceof OpenSearchStatusException) {
+            RestStatus status = ((OpenSearchStatusException) current).status();
+            if (status.getStatus() >= 400 && status.getStatus() < 500) {
+                return false;
+            }
+        }
+        
+        current = current.getCause();
     }
 
     // Track failures for infrastructure/service issues
     return true;
 }

Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that checking nested exception causes would make the classification more robust. However, the current implementation may be intentionally simple for performance reasons, and the improvement is not critical since most exceptions in this context are likely not deeply wrapped.

Medium

Previous suggestions

Suggestions up to commit 1c2784d

Category Suggestion Impact

General

Check exception cause chain

The method doesn't handle wrapped exceptions where the root cause might be an
IllegalArgumentException or OpenSearchStatusException. Consider checking the
exception's cause chain to accurately classify failures that are wrapped in other
exception types.

plugin/src/main/java/org/opensearch/ml/task/MLPredictTaskRunner.java [657-673]

 boolean shouldTrackRemoteFailure(Exception e) {
-    // Don't track failures for user configuration issues
-    if (e instanceof IllegalArgumentException) {
-        return false;
-    }
-
-    // Don't track any 4xx client errors (user/configuration issues)
-    if (e instanceof OpenSearchStatusException) {
-        RestStatus status = ((OpenSearchStatusException) e).status();
-        if (status.getStatus() >= 400 && status.getStatus() < 500) {
+    // Check the exception and its cause chain
+    Throwable current = e;
+    while (current != null) {
+        // Don't track failures for user configuration issues
+        if (current instanceof IllegalArgumentException) {
             return false;
         }
+
+        // Don't track any 4xx client errors (user/configuration issues)
+        if (current instanceof OpenSearchStatusException) {
+            RestStatus status = ((OpenSearchStatusException) current).status();
+            if (status.getStatus() >= 400 && status.getStatus() < 500) {
+                return false;
+            }
+        }
+        current = current.getCause();
     }
 
     // Track failures for infrastructure/service issues
     return true;
 }

Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that wrapped exceptions might not be properly classified. Checking the cause chain would improve the robustness of the failure tracking logic, though the current implementation may be sufficient for most common cases where exceptions are not wrapped.

Medium

github-actions · 2026-05-27T22:30:35Z

Persistent review updated to latest commit 1c2784d

pyek-bot · 2026-05-27T22:42:55Z

Spotless failing due to:

A problem occurred configuring project ':opensearch-ml-plugin'.
> java.io.IOException: Failed to load eclipse jdt formatter: dev.equo.solstice.p2.P2Client$CouldNotFindException: Attempted to find resource at

pyek-bot requested review from HenryL27, Zhangxunmt, austintlee, b4sjoo, dhrubo-os, jngz-es, mingshl, model-collapse, rbhavna, xinyual, ylwu-amzn and zane-neo as code owners May 27, 2026 22:28

pyek-bot force-pushed the backport/backport-4555-to-3.2 branch from 7db5be8 to 1c2784d Compare May 27, 2026 22:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Backport 3.2] Monitoring: Accurately capture predict failures (#4525)#4838

[Backport 3.2] Monitoring: Accurately capture predict failures (#4525)#4838
pyek-bot wants to merge 1 commit into
opensearch-project:3.2from
pyek-bot:backport/backport-4555-to-3.2

pyek-bot commented May 27, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 27, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 27, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

pyek-bot commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pyek-bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

github-actions Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Reviewer Guide 🔍

(Review updated until commit 1c2784d)

Uh oh!

github-actions Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Code Suggestions ✨

Previous suggestions

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

pyek-bot commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

pyek-bot commented May 27, 2026 •

edited

Loading

github-actions Bot commented May 27, 2026 •

edited

Loading

(Review updated until commit `1c2784d`)

github-actions Bot commented May 27, 2026 •

edited

Loading