
outlier-detection: Unconditionally eject always failing endpoints #12537

Closed
incubos wants to merge 4 commits into grpc:master from incubos:outlier-detection-success-rate-eject-always-failing

Conversation


@incubos incubos commented Nov 25, 2025

If the standard deviation of the success rates and/or stdevFactor is large enough, SuccessRateOutlierEjectionAlgorithm computes a negative requiredSuccessRate threshold. Since a success rate is never negative, no endpoint can fall below such a threshold, so the algorithm does not eject (or, even worse, eventually stops ejecting) always-failing endpoints, i.e. those with a zero successCount metric.
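To make the arithmetic concrete, here is a minimal sketch of the threshold computation. It is not the grpc-java code: the class and method names are hypothetical, success rates are assumed to be fractions in [0, 1], a population standard deviation (dividing by n) is assumed, and a factor of 1900 is chosen purely for illustration.

```java
final class ThresholdSketch {
  /** Arithmetic mean of the per-endpoint success rates (fractions in [0, 1]). */
  static double mean(double[] rates) {
    double sum = 0;
    for (double r : rates) {
      sum += r;
    }
    return sum / rates.length;
  }

  /** Population standard deviation; dividing by n is an assumption of this sketch. */
  static double stdev(double[] rates, double mean) {
    double squaredDiffs = 0;
    for (double r : rates) {
      squaredDiffs += (r - mean) * (r - mean);
    }
    return Math.sqrt(squaredDiffs / rates.length);
  }

  /** Computes mean - stdev * (stdevFactor / 1000), the threshold discussed above. */
  static double requiredSuccessRate(double[] rates, int stdevFactor) {
    double m = mean(rates);
    return m - stdev(rates, m) * (stdevFactor / 1000.0);
  }

  public static void main(String[] args) {
    // One of five endpoints always fails: the threshold stays slightly positive.
    System.out.println(requiredSuccessRate(new double[] {1, 1, 1, 1, 0}, 1900));
    // Two of five always fail: the threshold is negative, so 0.0 is not below it.
    System.out.println(requiredSuccessRate(new double[] {1, 1, 1, 0, 0}, 1900));
  }
}
```

With these illustrative inputs, adding a second always-failing endpoint is enough to push the threshold below zero, which is the "eventually stops ejecting" case.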

We fix the issue by unconditionally ejecting endpoints with a zero successCount, ignoring the stdevFactor-based threshold. A separate unit test is added.
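The idea of the fix can be sketched as follows. This is not the actual patch: `Counters`, `ejectionCandidates`, and the field names are hypothetical, and the endpoints are assumed to have already passed the request-volume filter, so each has at least one request.

```java
import java.util.ArrayList;
import java.util.List;

final class ZeroSuccessEjectionSketch {
  /** Hypothetical per-endpoint counters for one detection interval. */
  static final class Counters {
    final String name;
    final long successCount;
    final long failureCount;

    Counters(String name, long successCount, long failureCount) {
      this.name = name;
      this.successCount = successCount;
      this.failureCount = failureCount;
    }

    double successRate() {
      return (double) successCount / (successCount + failureCount);
    }
  }

  /** Returns names of endpoints that are always failing or below the threshold. */
  static List<String> ejectionCandidates(List<Counters> endpoints, double requiredSuccessRate) {
    List<String> candidates = new ArrayList<>();
    for (Counters c : endpoints) {
      // Zero successes qualifies regardless of the (possibly negative) threshold.
      boolean alwaysFailing = c.successCount == 0;
      if (alwaysFailing || c.successRate() < requiredSuccessRate) {
        candidates.add(c.name);
      }
    }
    return candidates;
  }
}
```

Even with a negative `requiredSuccessRate`, an endpoint with no successes is still selected, while healthy endpoints are unaffected.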

This change in behaviour might affect production installations with a high standard deviation of success rates, because completely unhealthy endpoints will now be ejected, but this is expected to be an improvement.


linux-foundation-easycla Bot commented Nov 25, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

Member

ejona86 commented Nov 25, 2025

Is the grpc-java implementation not following gRFC A50? If it is a bug in the design, then the gRFC needs updating, as the other languages would need fixing too.

Author

incubos commented Nov 25, 2025

Unfortunately, it looks like a bug in the design of the gRFC A50 Success Rate Algorithm. I would suggest replacing list item 3-iii:

If the address's success rate is less than (mean - stdev * (success_rate_ejection.stdev_factor / 1000)), then choose a random integer in [0, 100). If that number is less than success_rate_ejection.enforcement_percentage, eject that address.

with

If the address's success rate is zero or is less than (mean - stdev * (success_rate_ejection.stdev_factor / 1000)), then choose a random integer in [0, 100). If that number is less than success_rate_ejection.enforcement_percentage, eject that address.
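A minimal sketch of the proposed wording, for illustration only: the class and method names are hypothetical, success rates are assumed to be fractions, and the `Random` is passed in so the behaviour can be tested. Unlike a fully unconditional ejection, this wording still applies the enforcement percentage to the zero-success case.

```java
import java.util.Random;

final class ProposedRuleSketch {
  /**
   * Applies the proposed rule: a zero success rate, or one below
   * mean - stdev * (stdevFactor / 1000), makes the address an outlier;
   * an outlier is ejected only if a random integer in [0, 100) is less
   * than enforcementPercentage.
   */
  static boolean shouldEject(
      double successRate,
      double mean,
      double stdev,
      int stdevFactor,
      int enforcementPercentage,
      Random random) {
    double threshold = mean - stdev * (stdevFactor / 1000.0);
    boolean outlier = successRate == 0.0 || successRate < threshold;
    return outlier && random.nextInt(100) < enforcementPercentage;
  }
}
```

With an enforcement percentage of 100, a zero-success address is always ejected under this rule, even when the computed threshold is negative; with 0, it is never ejected.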

Could you please tell me how a gRFC update should be initiated?

Member

ejona86 commented Nov 25, 2025

@incubos, it'd be a PR to the proposal repository. Prefix the PR title with "A50 update:"

@murgatroid99, I assume you'd be the one to take a look.

Author

incubos commented Nov 26, 2025

Thanks a lot!
@murgatroid99, could you please take a look at the tiny grpc/proposal#523?

Author

incubos commented Dec 4, 2025

Could not convince the maintainer of the A50 design to deviate from the Envoy implementation, which has the same issue.

@incubos incubos closed this Dec 4, 2025
@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Mar 5, 2026
