Add retry in connecting manager in MultiProcessShared #38456
Conversation
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses flaky tests in the MultiProcessShared utility by introducing a retry mechanism when connecting to the manager. By adding exponential backoff, the system can now gracefully recover from transient ConnectionError or EOFError exceptions, improving the reliability of multi-process shared resource management.
Code Review
This pull request introduces a retry mechanism with exponential backoff for manager connections in multi_process_shared.py, specifically handling ConnectionError and EOFError to improve resilience against transient failures. Corresponding tests were added to verify this behavior. The review suggests extracting the duplicated list of retryable exceptions into a single variable to improve maintainability and keep the retry logic and tests consistent.
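A minimal sketch of the pattern the review describes: the retryable exceptions live in one module-level constant so the retry loop (and any tests asserting on it) stay consistent. The names `_RETRYABLE_ERRORS` and `_connect_with_retries`, along with the attempt counts and delays, are illustrative assumptions, not the PR's actual identifiers.

```python
import time

# Single source of truth for which errors are considered transient.
_RETRYABLE_ERRORS = (ConnectionError, EOFError)


def _connect_with_retries(connect_fn, max_attempts=5, initial_delay=0.1):
  """Calls connect_fn, retrying transient failures with exponential backoff."""
  delay = initial_delay
  for attempt in range(max_attempts):
    try:
      return connect_fn()
    except _RETRYABLE_ERRORS:
      if attempt == max_attempts - 1:
        raise  # Out of attempts; surface the last transient error.
      time.sleep(delay)
      delay *= 2  # Double the wait between attempts.
```

Keeping the exception tuple in one place means adding a new retryable error type (or removing one) only touches a single line.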
Assigning reviewers: R: @tvalentyn for label python. Note: If you would like to opt out of this review, comment with one of the available commands. The PR bot will only process comments in the main thread (not review comments).
Got another flaky test below (https://github.com/apache/beam/actions/runs/25705764624/job/75475412939 and https://github.com/apache/beam/actions/runs/25708527415/job/75483693419?pr=38455)
This could happen when it fails to connect to the manager (possibly a remote proxy) at `counter2 = shared2.acquire()` due to a transient network issue. The proposed fix adds retries when making the connection.
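A hedged sketch of the failing scenario from the linked test runs: two MultiProcessShared handles for the same tag, where the second acquire() must connect to the already-running manager process and can hit a transient ConnectionError or EOFError. The `Counter` class and the `tag` keyword are assumptions based on the utility's public API and may differ in detail from the actual test.

```python
from apache_beam.utils.multi_process_shared import MultiProcessShared


class Counter:
  def __init__(self):
    self.value = 0

  def increment(self):
    self.value += 1
    return self.value


shared1 = MultiProcessShared(Counter, tag='my_counter')
counter1 = shared1.acquire()  # First acquire starts the manager process.

shared2 = MultiProcessShared(Counter, tag='my_counter')
counter2 = shared2.acquire()  # Connects to the existing manager; this is
                              # where transient failures surfaced pre-fix.
```

With the retry in place, a momentary connection failure on the second acquire() is retried instead of failing the test outright.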