[Backport 2.19] Monitoring: Accurately capture predict failures (#4525) by pyek-bot · Pull Request #4841 · opensearch-project/ml-commons

pyek-bot · 2026-05-27T22:28:54Z

Summary

Backport of [Backport 3.3] Monitoring: Accurately capture predict failures #4555 to 2.19
Introduces shouldTrackRemoteFailure(Exception e) method that skips tracking for user/configuration issues (4xx HTTP errors, IllegalArgumentException) while still tracking infrastructure/service issues (5xx errors, other exceptions)
Fixes two call sites in runPredict that previously hard-coded false (never tracking remote failures)

Test plan

Unit test testShouldTrackRemoteFailure validates all scenarios

) (opensearch-project#4555) * fix: correctly capture ml failure stats on predict failures * refactor: remove redundant braces * refactor: dont track any 400 errors and add positive test case * fix: apply spotless --------- (cherry picked from commit a28d344) Signed-off-by: Pavan Yekbote <pybot@amazon.com> Co-authored-by: Pavan Yekbote <pybot@amazon.com> Signed-off-by: Pavan Yekbote <pybot@amazon.com>

github-actions · 2026-05-27T22:29:50Z

PR Reviewer Guide 🔍

(Review updated until commit `69e8df5`)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

github-actions · 2026-05-27T22:30:03Z

PR Code Suggestions ✨

Latest suggestions up to 69e8df5
Explore these optional code suggestions:

Category Suggestion Impact

General

Check exception cause chain

The method doesn't handle wrapped exceptions where the root cause might be an
IllegalArgumentException or OpenSearchStatusException. Consider checking the
exception's cause chain to ensure user configuration errors are properly identified
even when wrapped in other exception types.

plugin/src/main/java/org/opensearch/ml/task/MLPredictTaskRunner.java [608-624]

 boolean shouldTrackRemoteFailure(Exception e) {
     // Don't track failures for user configuration issues
-    if (e instanceof IllegalArgumentException) {
-        return false;
-    }
-
-    // Don't track any 4xx client errors (user/configuration issues)
-    if (e instanceof OpenSearchStatusException) {
-        RestStatus status = ((OpenSearchStatusException) e).status();
-        if (status.getStatus() >= 400 && status.getStatus() < 500) {
+    Throwable current = e;
+    while (current != null) {
+        if (current instanceof IllegalArgumentException) {
             return false;
         }
+
+        // Don't track any 4xx client errors (user/configuration issues)
+        if (current instanceof OpenSearchStatusException) {
+            RestStatus status = ((OpenSearchStatusException) current).status();
+            if (status.getStatus() >= 400 && status.getStatus() < 500) {
+                return false;
+            }
+        }
+        current = current.getCause();
     }
 
     // Track failures for infrastructure/service issues
     return true;
 }

Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that wrapped exceptions might not be properly handled. Checking the cause chain ensures that user configuration errors (like IllegalArgumentException or 4xx OpenSearchStatusException) are detected even when wrapped, preventing incorrect failure tracking. This is a valid improvement for robustness.

Medium

Previous suggestions

Suggestions up to commit 69e8df5

Category Suggestion Impact

General

Check exception cause chain

The method should handle nested exceptions by checking the cause chain. If the root
cause is an IllegalArgumentException or a 4xx OpenSearchStatusException, it should
not be tracked even if wrapped in another exception type.

plugin/src/main/java/org/opensearch/ml/task/MLPredictTaskRunner.java [608-624]

 boolean shouldTrackRemoteFailure(Exception e) {
-    // Don't track failures for user configuration issues
-    if (e instanceof IllegalArgumentException) {
-        return false;
-    }
-
-    // Don't track any 4xx client errors (user/configuration issues)
-    if (e instanceof OpenSearchStatusException) {
-        RestStatus status = ((OpenSearchStatusException) e).status();
-        if (status.getStatus() >= 400 && status.getStatus() < 500) {
+    // Check the exception and its cause chain
+    Throwable current = e;
+    while (current != null) {
+        // Don't track failures for user configuration issues
+        if (current instanceof IllegalArgumentException) {
             return false;
         }
+
+        // Don't track any 4xx client errors (user/configuration issues)
+        if (current instanceof OpenSearchStatusException) {
+            RestStatus status = ((OpenSearchStatusException) current).status();
+            if (status.getStatus() >= 400 && status.getStatus() < 500) {
+                return false;
+            }
+        }
+        current = current.getCause();
     }
 
     // Track failures for infrastructure/service issues
     return true;
 }

Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that checking the exception cause chain would make the error classification more robust. However, the current implementation may be intentionally simple, and the added complexity should be evaluated against actual use cases where exceptions are wrapped.

Medium

github-actions · 2026-05-27T22:30:25Z

Persistent review updated to latest commit 69e8df5

pyek-bot · 2026-05-27T22:40:22Z

Spotless failing due to:

* What went wrong:
A problem occurred evaluating project ':opensearch-ml-client'.
> Could not find method shadowJar() for arguments [build_56550d4946yr8dezgj7g3z4xk$_run_closure5@6eb61855] on project ':opensearch-ml-client' of type org.gradle.api.Project.

pyek-bot requested review from HenryL27, Zhangxunmt, austintlee, b4sjoo, dhrubo-os, jngz-es, mingshl, model-collapse, rbhavna, xinyual, ylwu-amzn and zane-neo as code owners May 27, 2026 22:28

pyek-bot force-pushed the backport/backport-4555-to-2.19 branch from 56e412f to 69e8df5 Compare May 27, 2026 22:29

ylwu-amzn approved these changes May 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Backport 2.19] Monitoring: Accurately capture predict failures (#4525)#4841

[Backport 2.19] Monitoring: Accurately capture predict failures (#4525)#4841
pyek-bot wants to merge 1 commit into
opensearch-project:2.19from
pyek-bot:backport/backport-4555-to-2.19

pyek-bot commented May 27, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 27, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 27, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

pyek-bot commented May 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pyek-bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

github-actions Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Reviewer Guide 🔍

(Review updated until commit 69e8df5)

Uh oh!

github-actions Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Code Suggestions ✨

Previous suggestions

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

pyek-bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pyek-bot commented May 27, 2026 •

edited

Loading

github-actions Bot commented May 27, 2026 •

edited

Loading

(Review updated until commit `69e8df5`)

github-actions Bot commented May 27, 2026 •

edited

Loading

pyek-bot commented May 27, 2026 •

edited

Loading