Fix stream error propagation to surface root cause message#4768
Fix stream error propagation to surface root cause message#4768ohltyler wants to merge 3 commits into
Conversation
PR Reviewer Guide 🔍(Review updated until commit f22e766)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Latest suggestions up to f22e766 Explore these optional code suggestions:
Previous suggestionsSuggestions up to commit 843d41e
Suggestions up to commit 54d57ad
Suggestions up to commit e779067
Suggestions up to commit bc0de1d
Suggestions up to commit c88cc9a
|
c88cc9a to
bc0de1d
Compare
PR Code Analyzer ❗AI-powered 'Code-Diff-Analyzer' found issues on commit bc0de1d.
The table above displays the top 10 most important findings. Pull Requests Author(s): Please update your Pull Request according to the report above. Repository Maintainer(s): You can Thanks. |
|
Persistent review updated to latest commit bc0de1d |
bc0de1d to
e779067
Compare
|
Persistent review updated to latest commit e779067 |
|
@jiapingzeng could you verify if this resolves your reported issue in your env? |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4768 +/- ##
============================================
+ Coverage 77.40% 77.41% +0.01%
- Complexity 11902 11906 +4
============================================
Files 963 963
Lines 53310 53312 +2
Branches 6500 6501 +1
============================================
+ Hits 41263 41271 +8
+ Misses 9288 9282 -6
Partials 2759 2759
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
rithinpullela
left a comment
There was a problem hiding this comment.
LGTM! Thanks for the PR @ohltyler
|
Hi Tyler, sorry only saw this just now. Thanks for making the PR! Unfortunately I am still getting the same issue. I created an agent with agentic memory, but manually specified "xxx" for its memory container ID which does not exist, and when I execute the agent I get: Here's the full stack trace, seems that the error gets lost once it reaches |
|
@rishabhmaurya would you have any suggestions on how we can fix this error propagation issue? |
|
Okay, I think I understand the issue now: here we are currently doing where Tried it on my end and it seems to fix the issue and return the proper error. |
|
@jiapingzeng this makes sense. Also, in order to log these exception, one can enable trace logs on |
Stream errors from TransportException were only showing the node address instead of the actual error. Use MLExceptionUtils.getRootCauseMessage() to unwrap the exception chain and surface the real error to the user. Signed-off-by: Tyler Ohlsen <ohltyler@amazon.com>
The stream transport handler forwarded the wrapper TransportException directly to the channel, so callers received the transport-level node address string instead of the real root cause. Unwrap the cause before sending so the real exception reaches the client. TransportException/StreamException do not implement OpenSearchWrapperException, so exp.unwrapCause() is a no-op on them. Fall back to exp.getCause() to walk past the transport wrapper manually. Signed-off-by: Tyler Ohlsen <ohltyler@amazon.com>
54d57ad to
843d41e
Compare
|
Thanks @jiapingzeng — pushed 843d41e with the unwrap + test. |
|
Persistent review updated to latest commit 843d41e |
Signed-off-by: Tyler Ohlsen <ohltyler@amazon.com>
|
Persistent review updated to latest commit f22e766 |
Description
Fixes #4749.
When stream errors occur (e.g., agent cannot access memory service), the error message sent to the user only contains the transport-level node address:
This is because
TransportException.getMessage()returns the transport wrapper message, not the actual root cause. The fix uses the existingMLExceptionUtils.getRootCauseMessage()to unwrap the exception chain and surface the real error.After this fix, users will see the actual error:
Existing pattern
This follows the same pattern used throughout the codebase for surfacing errors — unwrapping with
getRootCauseMessage()rather than usinggetMessage()directly:TransportDeployModelAction.java#L325— deploy failureTransportDeployModelOnNodeAction.java#L199— per-node deploy failureTransportBatchIngestionAction.java#L173— batch ingestion failureTransportRegisterModelAction.java#L443— model registration failureMLModelManager.java#L1158— model loading failureMemoryOperationsService.java#L266— memory operation failureThe streaming error handler in
RestMLExecuteStreamActionwas the one place not following this pattern.Changes
RestMLExecuteStreamAction.java:onErrorResumelambda into a@VisibleForTestingmethodbuildStreamErrorMessage(Throwable)MLExceptionUtils.getRootCauseMessage()instead ofex.getMessage()to unwrap the exception chainRestMLExecuteStreamActionTests.java: Added 4 tests that directly validatebuildStreamErrorMessage:RemoteTransportExceptionwrapping a real cause → root cause surfaced, node address excludedIOException→ correct "Failed to parse request" prefixTransportException→ multi-level unwrapping verified (asserts intermediate wrapper text is excluded)MLExceptionUtilsTests.java: Added 2 tests forgetRootCauseMessagewithRemoteTransportExceptionCheck List
--signoffBy submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.