HDDS-14212. Make maxFailovers in GrpcOmTransport apply on a per-request basis.#9546
Conversation
Gargi-jais11
left a comment
There was a problem hiding this comment.
Thanks @greenwich for working on this patch.
Left some comments.
|
@greenwich your added test cases looks great, but we can have test for below case as well in
Why we need this? Test Scenario:
Without this test: We rely on inference that local variables reset, but don't explicitly verify the fix works for sequential requests. |
@Gargi-jais11, is it possible you missed |
@greenwich You're absolutely right! I missed that |
Gargi-jais11
left a comment
There was a problem hiding this comment.
Thanks @greenwich for updating the patch. Overall LGTM. Just some minor comments.
|
@ChenSammi and @ivandika3 please take a look on the patch. |
ChenSammi
left a comment
There was a problem hiding this comment.
I have to admit that I'm not familiar with GrpcOmTransport at all before this review.
After spent a few time, the new change looks good to me.
Thanks @greenwich for the contribution, and @Gargi-jais11 for the review.
What changes were proposed in this pull request?
This PR makes
maxFailoversandfailoverCountin S3g apply on a per-request basis.Rationale:
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/GrpcOmTransport.javaline 93: we haveprivate int failoverCount = 0;- All threads share this counter; it never resets.GrpcOmTransport.shouldRetry(258) we runaction = retryPolicy.shouldRetry((Exception)ex, 0, failoverCount++, true);which is also shared between requests.OMFailoverProxyProviderBase.getRetryPolicy.getRetryAction, we still use that globalfailoverCountcheckingif (failovers < maxFailovers)(258), which always returnsreturn RetryAction.FAIL;(263) once we reached themaxFailoversmaxFailoversfrom above is defined byOZONE_CLIENT_FAILOVER_MAX_ATTEMPTS_KEYI propose to change the value of
failoverCountper request, rather than making it a global flag. So basically, it makesOZONE_CLIENT_FAILOVER_MAX_ATTEMPTS_KEYconfiguration to serve on a per-request basis.Detailed explanation and discussion are here: #9477
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14212
How was this patch tested?
Branch pipeline: https://github.com/greenwich/ozone/actions/runs/20430980506
failoverCountis set for each requestom0toom2. It reproduces the issue, then I used it to test the fix.Concurrent test results before the fix - demonstrating the bug -> failover didn't happen to om2
Concurrent test results after the fix - demonstrating the correct behaviour -> failover to om2