Skip to content

Fix stream error propagation to surface root cause message#4768

Open
ohltyler wants to merge 3 commits into
opensearch-project:mainfrom
ohltyler:fix/stream-error-propagation
Open

Fix stream error propagation to surface root cause message#4768
ohltyler wants to merge 3 commits into
opensearch-project:mainfrom
ohltyler:fix/stream-error-propagation

Conversation

@ohltyler
Copy link
Copy Markdown
Member

@ohltyler ohltyler commented Apr 1, 2026

Description

Fixes #4749.

When stream errors occur (e.g., agent cannot access memory service), the error message sent to the user only contains the transport-level node address:

data: {"error": "[ip-10-0-105-16.us-west-2.compute.internal][10.0.105.16:9300][cluster:admin/opensearch/ml/execute/stream]"}

This is because TransportException.getMessage() returns the transport wrapper message, not the actual root cause. The fix uses the existing MLExceptionUtils.getRootCauseMessage() to unwrap the exception chain and surface the real error.

After this fix, users will see the actual error:

data: {"error": "Error processing request: Unable to access memory service"}

Existing pattern

This follows the same pattern used throughout the codebase for surfacing errors — unwrapping with getRootCauseMessage() rather than using getMessage() directly:

The streaming error handler in RestMLExecuteStreamAction was the one place not following this pattern.

Changes

  • RestMLExecuteStreamAction.java:
    • Extracted error message building from the inline onErrorResume lambda into a @VisibleForTesting method buildStreamErrorMessage(Throwable)
    • Uses MLExceptionUtils.getRootCauseMessage() instead of ex.getMessage() to unwrap the exception chain
    • Added null-safe fallback to exception class name when root cause message is null
  • RestMLExecuteStreamActionTests.java: Added 4 tests that directly validate buildStreamErrorMessage:
    • RemoteTransportException wrapping a real cause → root cause surfaced, node address excluded
    • IOException → correct "Failed to parse request" prefix
    • Deeply nested TransportException → multi-level unwrapping verified (asserts intermediate wrapper text is excluded)
    • Null message exception → falls back to class name instead of "null"
  • MLExceptionUtilsTests.java: Added 2 tests for getRootCauseMessage with RemoteTransportException

Check List

  • New functionality includes testing
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 1, 2026

PR Reviewer Guide 🔍

(Review updated until commit f22e766)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Unwrap TransportException cause before forwarding to channel

Relevant files:

  • plugin/src/main/java/org/opensearch/ml/action/execute/TransportExecuteStreamTaskAction.java
  • plugin/src/test/java/org/opensearch/ml/action/execute/TransportExecuteStreamTaskActionTests.java

Sub-PR theme: Use getRootCauseMessage for stream error SSE responses

Relevant files:

  • plugin/src/main/java/org/opensearch/ml/rest/RestMLExecuteStreamAction.java
  • plugin/src/test/java/org/opensearch/ml/rest/RestMLExecuteStreamActionTests.java
  • plugin/src/test/java/org/opensearch/ml/utils/MLExceptionUtilsTests.java

⚡ Recommended focus areas for review

Incomplete Unwrapping

The new unwrapping logic only goes one level deep: it checks exp.getCause() but does not recursively unwrap nested causes. If the TransportException wraps a RemoteTransportException which itself wraps the real cause, the channel will still receive an intermediate wrapper rather than the true root cause. Consider using MLExceptionUtils.getRootCauseMessage() here as well, or applying the same recursive unwrapping used in buildStreamErrorMessage.

Throwable cause = exp.getCause() != null ? exp.getCause() : exp;
Exception toSend = cause instanceof Exception ? (Exception) cause : exp;
channel.sendResponse(toSend);
Potential XSS / Injection

The error message is embedded directly into an SSE JSON string using simple replace("\"", "\\\""). This does not escape other JSON-special characters (backslash, control characters, newlines, etc.), which could corrupt the SSE frame or allow injection. Consider using a proper JSON serializer to build the error payload.

HttpChunk errorChunk = createHttpChunk("data: {\"error\": \"" + errorMessage.replace("\"", "\\\"") + "\"}\n\n", true);
channel.sendChunk(errorChunk);

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 1, 2026

PR Code Suggestions ✨

Latest suggestions up to f22e766

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Improve JSON escaping for error messages

The error message is only escaped for double quotes, but other special characters
(e.g., backslashes, newlines, control characters) in the root cause message could
break the JSON structure or cause parsing errors on the client side. Consider using
a proper JSON serialization approach instead of manual string replacement.

plugin/src/main/java/org/opensearch/ml/rest/RestMLExecuteStreamAction.java [285-287]

 String errorMessage = buildStreamErrorMessage(ex);
-HttpChunk errorChunk = createHttpChunk("data: {\"error\": \"" + errorMessage.replace("\"", "\\\"") + "\"}\n\n", true);
+String escapedMessage = errorMessage.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t");
+HttpChunk errorChunk = createHttpChunk("data: {\"error\": \"" + escapedMessage + "\"}\n\n", true);
Suggestion importance[1-10]: 5

__

Why: The suggestion is valid - only escaping double quotes in the error message could result in malformed JSON if the message contains backslashes, newlines, or other special characters. However, this is a minor robustness improvement and the risk is relatively low in practice.

Low
Possible issue
Handle non-Exception throwable causes correctly

When exp.getCause() is not an Exception (e.g., it's an Error like OutOfMemoryError),
the code silently falls back to sending the original exp (the TransportException),
which defeats the purpose of unwrapping. In such cases, the Error should either be
rethrown or wrapped in a meaningful exception rather than silently swallowed.

plugin/src/main/java/org/opensearch/ml/action/execute/TransportExecuteStreamTaskAction.java [94-96]

 Throwable cause = exp.getCause() != null ? exp.getCause() : exp;
+if (cause instanceof Error) {
+    throw (Error) cause;
+}
 Exception toSend = cause instanceof Exception ? (Exception) cause : exp;
 channel.sendResponse(toSend);
Suggestion importance[1-10]: 4

__

Why: The suggestion correctly identifies an edge case where exp.getCause() could be an Error (e.g., OutOfMemoryError), but this scenario is extremely rare in practice. Rethrowing an Error from within a transport handler could also have unintended consequences on the node's stability, making this a debatable improvement.

Low

Previous suggestions

Suggestions up to commit 843d41e
CategorySuggestion                                                                                                                                    Impact
Possible issue
Unwrap full cause chain, not just one level

The current logic only unwraps one level of the cause chain. If the
TransportException wraps multiple layers (e.g., TransportException
RuntimeExceptionOpenSearchStatusException), the caller still receives an
intermediate wrapper rather than the true root cause. Consider walking the full
cause chain to find the deepest cause, consistent with how buildStreamErrorMessage
uses MLExceptionUtils.getRootCauseMessage.

plugin/src/main/java/org/opensearch/ml/action/execute/TransportExecuteStreamTaskAction.java [98-100]

-Throwable cause = exp.getCause() != null ? exp.getCause() : exp;
+Throwable cause = exp;
+while (cause.getCause() != null) {
+    cause = cause.getCause();
+}
 Exception toSend = cause instanceof Exception ? (Exception) cause : exp;
 channel.sendResponse(toSend);
Suggestion importance[1-10]: 5

__

Why: The suggestion is logically valid — only unwrapping one level may leave intermediate wrappers. However, the test testMessageReceivedHandlerUnwrapsWrappedCause only tests a single-level wrap, and the PR's intent seems to be a minimal fix. The suggestion is consistent with how MLExceptionUtils.getRootCauseMessage works, but the improved code is a reasonable enhancement rather than a critical bug fix.

Low
General
Properly escape special characters in JSON error response

The error message is manually escaped only for double quotes, but other special
characters (e.g., backslashes, newlines, control characters) in the root cause
message could break the JSON structure or cause parsing errors on the client side.
Use a proper JSON serialization method to safely encode the error message string.

plugin/src/main/java/org/opensearch/ml/rest/RestMLExecuteStreamAction.java [286]

-HttpChunk errorChunk = createHttpChunk("data: {\"error\": \"" + errorMessage.replace("\"", "\\\"") + "\"}\n\n", true);
+String safeErrorMessage = org.opensearch.common.xcontent.XContentHelper.convertToJson(
+    org.opensearch.core.xcontent.XContentBuilder.builder(org.opensearch.common.xcontent.XContentType.JSON.xContent())
+        .startObject().field("error", errorMessage).endObject().bytes(), false);
+// Or use a simple but more complete escaping utility:
+String escapedMessage = errorMessage.replace("\\", "\\\\").replace("\"", "\\\"")
+    .replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t");
+HttpChunk errorChunk = createHttpChunk("data: {\"error\": \"" + escapedMessage + "\"}\n\n", true);
Suggestion importance[1-10]: 4

__

Why: The concern about incomplete JSON escaping is valid — characters like backslashes or newlines in error messages could break the JSON structure. However, the improved_code is overly complex and mixes two different approaches, making it hard to apply cleanly. The simpler escaping extension is a reasonable improvement but not critical.

Low
Suggestions up to commit 54d57ad
CategorySuggestion                                                                                                                                    Impact
Possible issue
Improve JSON escaping for error messages

The current JSON escaping only handles double quotes, but the error message could
contain other characters that break JSON (e.g., backslashes, newlines, control
characters). Consider using a proper JSON serialization approach to escape the error
message safely.

plugin/src/main/java/org/opensearch/ml/rest/RestMLExecuteStreamAction.java [286]

-HttpChunk errorChunk = createHttpChunk("data: {\"error\": \"" + errorMessage.replace("\"", "\\\"") + "\"}\n\n", true);
+String escapedMessage = errorMessage.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t");
+HttpChunk errorChunk = createHttpChunk("data: {\"error\": \"" + escapedMessage + "\"}\n\n", true);
Suggestion importance[1-10]: 5

__

Why: The suggestion is valid - error messages could contain backslashes, newlines, or other special characters that would break JSON. However, this is a pre-existing issue not introduced by this PR, and the improvement is minor since a proper JSON library would be more robust than manual escaping.

Low
General
Check root cause type for error categorization

The instanceof IOException check is performed on the original exception ex, but for
wrapped exceptions (e.g., RemoteTransportException wrapping an IOException), the
root cause may be an IOException while ex itself is not. Consider checking the root
cause's type as well to correctly categorize the error prefix.

plugin/src/main/java/org/opensearch/ml/rest/RestMLExecuteStreamAction.java [628-634]

 String buildStreamErrorMessage(Throwable ex) {
     String rootCause = MLExceptionUtils.getRootCauseMessage(ex);
     if (rootCause == null || rootCause.isEmpty()) {
         rootCause = ex.getClass().getSimpleName();
     }
-    return ex instanceof IOException ? "Failed to parse request: " + rootCause : "Error processing request: " + rootCause;
+    Throwable cause = ex;
+    while (cause.getCause() != null) {
+        cause = cause.getCause();
+    }
+    return (ex instanceof IOException || cause instanceof IOException)
+        ? "Failed to parse request: " + rootCause
+        : "Error processing request: " + rootCause;
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion raises a valid edge case where a wrapped IOException would be categorized as "Error processing request" instead of "Failed to parse request". However, this is a minor semantic issue and the current behavior is consistent with the original code's intent of checking the top-level exception type.

Low
Suggestions up to commit e779067
CategorySuggestion                                                                                                                                    Impact
General
Strengthen assertion to verify full error message

The test asserts message.contains("Too many requests") but the full root cause
message is "Error from remote service: Too many requests". The assertion should
verify the complete root cause message to ensure the full error string is
propagated, not just a substring.

plugin/src/test/java/org/opensearch/ml/rest/RestMLExecuteStreamActionTests.java [792-806]

 public void testBuildStreamErrorMessage_nestedTransportException() {
     RuntimeException realCause = new RuntimeException("Error from remote service: Too many requests");
     RuntimeException wrapper = new RuntimeException("wrapped", realCause);
-    ...
+    org.opensearch.transport.RemoteTransportException transportException = new org.opensearch.transport.RemoteTransportException(
+        "node1",
+        null,
+        "cluster:admin/opensearch/ml/execute/stream",
+        wrapper
+    );
+
     String message = restAction.buildStreamErrorMessage(transportException);
 
-    assertTrue(message.contains("Too many requests"));
+    assertTrue(message.contains("Error from remote service: Too many requests"));
     assertFalse(message.contains("node1"));
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion to assert the full error message "Error from remote service: Too many requests" instead of just "Too many requests" is a valid improvement for test completeness. The improved_code accurately reflects the change and the existing code in the PR.

Low
Possible issue
Guard against null error message

The errorMessage returned by buildStreamErrorMessage could potentially be null if
MLExceptionUtils.getRootCauseMessage returns null, which would cause a
NullPointerException when calling .replace(). Add a null-safe fallback before using
the message.

plugin/src/main/java/org/opensearch/ml/rest/RestMLExecuteStreamAction.java [285-286]

 String errorMessage = buildStreamErrorMessage(ex);
+if (errorMessage == null) errorMessage = "Unknown error";
 HttpChunk errorChunk = createHttpChunk("data: {\"error\": \"" + errorMessage.replace("\"", "\\\"") + "\"}\n\n", true);
Suggestion importance[1-10]: 3

__

Why: While null-safety is a valid concern, buildStreamErrorMessage constructs the string with a prefix ("Failed to parse request: " or "Error processing request: ") concatenated with the root cause message, so even if getRootCauseMessage returns null, the result would be a non-null string like "Error processing request: null". The NullPointerException risk is minimal in practice.

Low
Suggestions up to commit bc0de1d
CategorySuggestion                                                                                                                                    Impact
Possible issue
Escape all JSON special characters in error message

The rootCause string may itself contain backslashes or other JSON special characters
(e.g., \n, \t, \) beyond just double quotes, which could produce malformed JSON.
Consider using a proper JSON serialization approach or escaping all JSON special
characters, not just double quotes.

plugin/src/main/java/org/opensearch/ml/rest/RestMLExecuteStreamAction.java [286-289]

 String errorMessage = ex instanceof IOException
     ? "Failed to parse request: " + rootCause
     : "Error processing request: " + rootCause;
-HttpChunk errorChunk = createHttpChunk("data: {\"error\": \"" + errorMessage.replace("\"", "\\\"") + "\"}\n\n", true);
+String escapedMessage = errorMessage.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t");
+HttpChunk errorChunk = createHttpChunk("data: {\"error\": \"" + escapedMessage + "\"}\n\n", true);
Suggestion importance[1-10]: 5

__

Why: The suggestion is valid - only double quotes are currently escaped, while other JSON special characters like \n, \r, \t, and backslashes could produce malformed JSON. However, this is a pre-existing issue not introduced by this PR, and the impact is moderate.

Low
General
Add content assertion to AGUI error test

The AGUI error response test does not verify that the error message content ("Remote
service unavailable") is actually present in the chunk output. This makes the test
incomplete and unable to catch regressions in error propagation for the AGUI path.

plugin/src/test/java/org/opensearch/ml/rest/RestMLExecuteStreamActionTests.java [786-790]

 HttpChunk chunk = invokeConvertToHttpChunk(response, true, "thread_1", "run_1");
 
 assertNotNull(chunk);
 assertTrue(chunk.isLast());
+String content = new String(BytesReference.toBytes(chunk.content()));
+assertTrue(content.contains("Remote service unavailable"));
Suggestion importance[1-10]: 5

__

Why: The test testConvertToHttpChunk_AGUI_errorResponse lacks a content assertion for "Remote service unavailable", making it weaker than the non-AGUI counterpart. Adding the assertion would improve test coverage for the AGUI error propagation path.

Low
Suggestions up to commit c88cc9a
CategorySuggestion                                                                                                                                    Impact
Possible issue
Escape all JSON special characters in error message

The rootCause string may contain backslashes or other JSON special characters (e.g.,
\n, \t, \) beyond just double quotes, which could produce malformed JSON in the
error chunk. Consider using a proper JSON serialization method to escape the error
message, or at minimum also escape backslashes before escaping double quotes.

plugin/src/main/java/org/opensearch/ml/rest/RestMLExecuteStreamAction.java [286-289]

 String errorMessage = ex instanceof IOException
     ? "Failed to parse request: " + rootCause
     : "Error processing request: " + rootCause;
-HttpChunk errorChunk = createHttpChunk("data: {\"error\": \"" + errorMessage.replace("\"", "\\\"") + "\"}\n\n", true);
+String escapedMessage = errorMessage.replace("\\", "\\\\").replace("\"", "\\\"");
+HttpChunk errorChunk = createHttpChunk("data: {\"error\": \"" + escapedMessage + "\"}\n\n", true);
Suggestion importance[1-10]: 5

__

Why: The suggestion is valid - the current code only escapes double quotes but not backslashes or other JSON special characters in rootCause, which could produce malformed JSON. However, this is a minor edge case and the impact is limited.

Low
General
Add content assertion to AGUI error test

The AGUI error response test does not verify that the chunk content actually
contains the expected error message "Remote service unavailable". This makes the
test incomplete and unable to catch regressions in error propagation for the AGUI
code path.

plugin/src/test/java/org/opensearch/ml/rest/RestMLExecuteStreamActionTests.java [794-798]

 HttpChunk chunk = invokeConvertToHttpChunk(response, true, "thread_1", "run_1");
 
 assertNotNull(chunk);
 assertTrue(chunk.isLast());
+String content = new String(BytesReference.toBytes(chunk.content()));
+assertTrue(content.contains("Remote service unavailable"));
Suggestion importance[1-10]: 5

__

Why: The test for the AGUI error response path is incomplete - it only checks isLast() but doesn't verify the content contains the expected error message "Remote service unavailable", making it unable to catch regressions in error propagation for the AGUI code path.

Low

@ohltyler ohltyler force-pushed the fix/stream-error-propagation branch from c88cc9a to bc0de1d Compare April 1, 2026 22:56
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 1, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit bc0de1d.

PathLineSeverityDescription
plugin/src/main/java/org/opensearch/ml/rest/RestMLExecuteStreamAction.java285lowRoot cause messages from internal exceptions (including RemoteTransportException) are now forwarded directly to HTTP streaming responses. While not malicious, this may expose internal node names, service topology, or infrastructure details to API consumers if exception messages contain such information. The change is consistent with the stated purpose of improving error propagation (issue #4749).

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 0 | Low: 1


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 1, 2026

Persistent review updated to latest commit bc0de1d

@ohltyler ohltyler had a problem deploying to ml-commons-cicd-env-require-approval April 1, 2026 22:57 — with GitHub Actions Failure
@ohltyler ohltyler had a problem deploying to ml-commons-cicd-env-require-approval April 1, 2026 22:57 — with GitHub Actions Failure
@ohltyler ohltyler had a problem deploying to ml-commons-cicd-env-require-approval April 1, 2026 22:57 — with GitHub Actions Failure
@ohltyler ohltyler had a problem deploying to ml-commons-cicd-env-require-approval April 1, 2026 22:57 — with GitHub Actions Error
@ohltyler ohltyler force-pushed the fix/stream-error-propagation branch from bc0de1d to e779067 Compare April 1, 2026 22:58
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 1, 2026

Persistent review updated to latest commit e779067

@ohltyler ohltyler had a problem deploying to ml-commons-cicd-env-require-approval April 1, 2026 23:00 — with GitHub Actions Failure
@ohltyler ohltyler had a problem deploying to ml-commons-cicd-env-require-approval April 1, 2026 23:00 — with GitHub Actions Failure
@ohltyler ohltyler had a problem deploying to ml-commons-cicd-env-require-approval April 1, 2026 23:00 — with GitHub Actions Failure
@ohltyler ohltyler had a problem deploying to ml-commons-cicd-env-require-approval April 1, 2026 23:00 — with GitHub Actions Failure
@ohltyler ohltyler temporarily deployed to ml-commons-cicd-env-require-approval April 1, 2026 23:02 — with GitHub Actions Inactive
@ohltyler ohltyler temporarily deployed to ml-commons-cicd-env-require-approval April 1, 2026 23:02 — with GitHub Actions Inactive
@ohltyler
Copy link
Copy Markdown
Member Author

ohltyler commented Apr 1, 2026

@jiapingzeng could you verify if this resolves your reported issue in your env?

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 1, 2026

Codecov Report

❌ Patch coverage is 60.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.41%. Comparing base (7f744d1) to head (54d57ad).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
.../opensearch/ml/rest/RestMLExecuteStreamAction.java 60.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4768      +/-   ##
============================================
+ Coverage     77.40%   77.41%   +0.01%     
- Complexity    11902    11906       +4     
============================================
  Files           963      963              
  Lines         53310    53312       +2     
  Branches       6500     6501       +1     
============================================
+ Hits          41263    41271       +8     
+ Misses         9288     9282       -6     
  Partials       2759     2759              
Flag Coverage Δ
ml-commons 77.41% <60.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ohltyler ohltyler temporarily deployed to ml-commons-cicd-env-require-approval April 1, 2026 23:40 — with GitHub Actions Inactive
@ohltyler ohltyler temporarily deployed to ml-commons-cicd-env-require-approval April 1, 2026 23:40 — with GitHub Actions Inactive
Copy link
Copy Markdown
Collaborator

@rithinpullela rithinpullela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the PR @ohltyler

dhrubo-os
dhrubo-os previously approved these changes Apr 7, 2026
@jiapingzeng
Copy link
Copy Markdown
Contributor

Hi Tyler, sorry only saw this just now. Thanks for making the PR! Unfortunately I am still getting the same issue.

I created an agent with agentic memory, but manually specified "xxx" for its memory container ID which does not exist, and when I execute the agent I get:

data: {"error": "Error processing request: [integTest-0][127.0.0.1:9300][cluster:admin/opensearch/ml/execute/stream]"}

Here's the full stack trace, seems that the error gets lost once it reaches FlightClientChannel in OS core https://github.com/opensearch-project/OpenSearch/blob/main/plugins/arrow-flight-rpc/src/main/java/org/opensearch/arrow/flight/transport/FlightClientChannel.java#L307. Not sure if we are able to fix it in ml-commons or if we need to fix in core.

[2026-04-09T17:44:03,490][DEBUG][o.o.m.e.a.a.MLAgentExecutor] [integTest-0] MLAgentExecutor creating new memory, params: {tenant_id=null, memory_id=f02c00f9-d5cf-416d-ab7f-1e6676f0a435, app_type=null, memory_container_id=xxx, memory_name=list indices
}
[2026-04-09T17:44:03,490][DEBUG][o.o.m.u.RestActionUtils  ] [integTest-0] Current user is null
[2026-04-09T17:44:03,490][DEBUG][o.o.m.h.MemoryContainerHelper] [integTest-0] Fetching memory container with ID: xxx for tenant: null
[2026-04-09T17:44:03,490][ERROR][o.o.m.e.m.AgenticConversationMemory] [integTest-0] Failed to ensure session exists via TransportCreateSessionAction
org.opensearch.OpenSearchStatusException: Memory container not found
        at org.opensearch.ml.helper.MemoryContainerHelper.lambda$getMemoryContainer$0(MemoryContainerHelper.java:145)
        at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
        at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
        at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)
        at org.opensearch.remote.metadata.client.impl.LocalClusterIndicesClient.lambda$innerGetDataObjectAsync$4(LocalClusterIndicesClient.java:205)
        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
        at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:115)
        at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:109)
        at org.opensearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleResponse(TransportSingleShardAction.java:320)
        at org.opensearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleResponse(TransportSingleShardAction.java:306)
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1587)
        at org.opensearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1680)
        at org.opensearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1660)
        at org.opensearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:72)
        at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:62)
        at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:45)
        at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74)
        at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:977)
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
[2026-04-09T17:44:03,491][WARN ][o.o.m.e.m.AgenticConversationMemory] [integTest-0] Failed to ensure session exists for id f02c00f9-d5cf-416d-ab7f-1e6676f0a435, proceeding without session document
org.opensearch.OpenSearchStatusException: Memory container not found
        at org.opensearch.ml.helper.MemoryContainerHelper.lambda$getMemoryContainer$0(MemoryContainerHelper.java:145)
        at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
        at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
        at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)
        at org.opensearch.remote.metadata.client.impl.LocalClusterIndicesClient.lambda$innerGetDataObjectAsync$4(LocalClusterIndicesClient.java:205)
        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
        at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:115)
        at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:109)
        at org.opensearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleResponse(TransportSingleShardAction.java:320)
        at org.opensearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleResponse(TransportSingleShardAction.java:306)
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1587)
        at org.opensearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1680)
        at org.opensearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1660)
        at org.opensearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:72)
        at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:62)
        at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:45)
        at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74)
        at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:977)
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
[2026-04-09T17:44:03,492][DEBUG][o.o.m.h.MemoryContainerHelper] [integTest-0] Fetching memory container with ID: xxx for tenant: null
[2026-04-09T17:44:03,493][ERROR][o.o.m.e.m.AgenticConversationMemory] [integTest-0] Failed to retrieve structured messages
org.opensearch.OpenSearchStatusException: Memory container not found
        at org.opensearch.ml.helper.MemoryContainerHelper.lambda$getMemoryContainer$0(MemoryContainerHelper.java:145)
        at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
        at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
        at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)
        at org.opensearch.remote.metadata.client.impl.LocalClusterIndicesClient.lambda$innerGetDataObjectAsync$4(LocalClusterIndicesClient.java:205)
        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
        at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:115)
        at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:109)
        at org.opensearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleResponse(TransportSingleShardAction.java:320)
        at org.opensearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleResponse(TransportSingleShardAction.java:306)
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1587)
        at org.opensearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1680)
        at org.opensearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1660)
        at org.opensearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:72)
        at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:62)
        at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:45)
        at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74)
        at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:977)
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
[2026-04-09T17:44:03,493][ERROR][o.o.m.e.a.a.MLAgentExecutor] [integTest-0] Failed to get history. agentId=J6vXdJ0BqNDcsfjpEB0g, tenantId=null
org.opensearch.OpenSearchStatusException: Memory container not found
        at org.opensearch.ml.helper.MemoryContainerHelper.lambda$getMemoryContainer$0(MemoryContainerHelper.java:145)
        at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
        at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
        at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)
        at org.opensearch.remote.metadata.client.impl.LocalClusterIndicesClient.lambda$innerGetDataObjectAsync$4(LocalClusterIndicesClient.java:205)
        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
        at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:115)
        at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:109)
        at org.opensearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleResponse(TransportSingleShardAction.java:320)
        at org.opensearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleResponse(TransportSingleShardAction.java:306)
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1587)
        at org.opensearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1680)
        at org.opensearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1660)
        at org.opensearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:72)
        at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:62)
        at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:45)
        at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74)
        at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:977)
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
[2026-04-09T17:44:03,494][ERROR][o.o.m.e.a.a.MLAgentExecutor] [integTest-0] Agent execution failed. agentType=AG_UI, agentId=J6vXdJ0BqNDcsfjpEB0g, tenantId=null, latencyMs=2, statusCode=404, error=Memory container not found
org.opensearch.OpenSearchStatusException: Memory container not found
        at org.opensearch.ml.helper.MemoryContainerHelper.lambda$getMemoryContainer$0(MemoryContainerHelper.java:145)
        at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
        at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
        at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2179)
        at org.opensearch.remote.metadata.client.impl.LocalClusterIndicesClient.lambda$innerGetDataObjectAsync$4(LocalClusterIndicesClient.java:205)
        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
        at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:115)
        at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:109)
        at org.opensearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleResponse(TransportSingleShardAction.java:320)
        at org.opensearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleResponse(TransportSingleShardAction.java:306)
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1587)
        at org.opensearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1680)
        at org.opensearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1660)
        at org.opensearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:72)
        at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:62)
        at org.opensearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:45)
        at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74)
        at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:977)
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
[2026-04-09T17:44:03,496][ERROR][o.o.a.f.t.FlightClientChannel] [integTest-0] Exception while handling stream response
org.opensearch.transport.stream.StreamException: [integTest-0][127.0.0.1:9300][cluster:admin/opensearch/ml/execute/stream]
        at org.opensearch.arrow.flight.transport.FlightErrorMapper.fromFlightException(FlightErrorMapper.java:65)
        at org.opensearch.arrow.flight.transport.FlightTransportResponse.lambda$openAndPrefetchAsync$0(FlightTransportResponse.java:102)
        at java.base/java.lang.VirtualThread.run(VirtualThread.java:329)
[2026-04-09T17:44:03,497][ERROR][o.o.m.r.RestMLExecuteStreamAction] [integTest-0] Error occurred
org.opensearch.transport.stream.StreamException: [integTest-0][127.0.0.1:9300][cluster:admin/opensearch/ml/execute/stream]
        at org.opensearch.arrow.flight.transport.FlightErrorMapper.fromFlightException(FlightErrorMapper.java:65)
        at org.opensearch.arrow.flight.transport.FlightTransportResponse.lambda$openAndPrefetchAsync$0(FlightTransportResponse.java:102)
        at java.base/java.lang.VirtualThread.run(VirtualThread.java:329)

@jiapingzeng
Copy link
Copy Markdown
Contributor

@rishabhmaurya would you have any suggestions on how we can fix this error propagation issue?

@jiapingzeng
Copy link
Copy Markdown
Contributor

Okay, I think I understand the issue now:

here we are currently doing

channel.sendResponse(exp);

where exp is a TransportException which implements OpenSearchWrapperException. So we need to unwrap the cause to get the actual exception:

channel.sendResponse((Exception) exp.unwrapCause());

Tried it on my end and it seems to fix the issue and return the proper error.

@rishabhmaurya
Copy link
Copy Markdown

@jiapingzeng this makes sense. Also, in order to log these exception, one can enable trace logs on org.opensearch.arrow.flight.transport

Stream errors from TransportException were only showing the node
address instead of the actual error. Use MLExceptionUtils.getRootCauseMessage()
to unwrap the exception chain and surface the real error to the user.

Signed-off-by: Tyler Ohlsen <ohltyler@amazon.com>
The stream transport handler forwarded the wrapper TransportException
directly to the channel, so callers received the transport-level node
address string instead of the real root cause. Unwrap the cause before
sending so the real exception reaches the client.

TransportException/StreamException do not implement OpenSearchWrapperException,
so exp.unwrapCause() is a no-op on them. Fall back to exp.getCause() to
walk past the transport wrapper manually.

Signed-off-by: Tyler Ohlsen <ohltyler@amazon.com>
@ohltyler
Copy link
Copy Markdown
Member Author

ohltyler commented Apr 17, 2026

Thanks @jiapingzeng — pushed 843d41e with the unwrap + test.

@github-actions
Copy link
Copy Markdown

Persistent review updated to latest commit 843d41e

@ohltyler ohltyler had a problem deploying to ml-commons-cicd-env-require-approval April 17, 2026 01:45 — with GitHub Actions Failure
@ohltyler ohltyler had a problem deploying to ml-commons-cicd-env-require-approval April 17, 2026 01:45 — with GitHub Actions Failure
@ohltyler ohltyler had a problem deploying to ml-commons-cicd-env-require-approval April 17, 2026 01:45 — with GitHub Actions Failure
@ohltyler ohltyler had a problem deploying to ml-commons-cicd-env-require-approval April 17, 2026 01:45 — with GitHub Actions Failure
Signed-off-by: Tyler Ohlsen <ohltyler@amazon.com>
@github-actions
Copy link
Copy Markdown

Persistent review updated to latest commit f22e766

@ohltyler ohltyler had a problem deploying to ml-commons-cicd-env-require-approval April 17, 2026 01:53 — with GitHub Actions Error
@ohltyler ohltyler had a problem deploying to ml-commons-cicd-env-require-approval April 17, 2026 01:53 — with GitHub Actions Failure
@ohltyler ohltyler had a problem deploying to ml-commons-cicd-env-require-approval April 17, 2026 01:53 — with GitHub Actions Failure
@ohltyler ohltyler had a problem deploying to ml-commons-cicd-env-require-approval April 17, 2026 01:53 — with GitHub Actions Error
@jiapingzeng
Copy link
Copy Markdown
Contributor

Hi @ohltyler, I ended up including the unwrap change in this PR as I needed to add some other changes for AGUI agent. I think your change is also good to have. Could you rebase?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] stream errors not propagated properly

5 participants