Added RAY_SERVE_CONTROLLER_METRICS_INCLUDE_HIGH_CARDINALITY_TAGS to c…#63642
Added RAY_SERVE_CONTROLLER_METRICS_INCLUDE_HIGH_CARDINALITY_TAGS to c…#63642johntaylor-cell wants to merge 5 commits into
Conversation
…ontrol inclusion of high-cardinality tags from controller metrics Signed-off-by: john.taylor <john.taylor@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a feature flag RAY_SERVE_CONTROLLER_METRICS_INCLUDE_HIGH_CARDINALITY_TAGS to conditionally drop high-cardinality tags (such as replica and handle IDs) from controller-emitted metrics, helping to reduce metric volume in large deployments. The reviewer suggests optimizing the helper function _get_autoscaling_metrics_delay_tags by passing the high-cardinality key and value as separate arguments instead of a dictionary. This avoids unnecessary dictionary allocations on the hot path when high-cardinality tags are disabled. Corresponding updates are also suggested for the call sites and unit tests.
Signed-off-by: john.taylor <john.taylor@anyscale.com>
|
please link the GH issue this fixes in the description |
abrarsheikh
left a comment
There was a problem hiding this comment.
there are other metrics that controller emits that includes replica. Let's apply the change to those as well
https://docs.ray.io/en/latest/serve/monitoring.html#replica-lifecycle-metrics
https://docs.ray.io/en/latest/serve/monitoring.html#autoscaling-metrics
https://docs.ray.io/en/latest/serve/monitoring.html#autoscaling-metrics
Signed-off-by: john.taylor <john.taylor@anyscale.com>
Signed-off-by: john.taylor <john.taylor@anyscale.com>
Signed-off-by: john.taylor <john.taylor@anyscale.com>
|
Added ability to disable replica metrics in |
This PR addresses his issue: #63114
API/Schema Changes
Added env var RAY_SERVE_CONTROLLER_METRICS_INCLUDE_HIGH_CARDINALITY_TAGS, default 1.
When set to 0, Serve controller autoscaling delay metrics drop high-cardinality source tags:
handle from ray_serve_autoscaling_handle_metrics_delay_ms and replica from ray_serve_autoscaling_replica_metrics_delay_ms.
Internal Call Sites
Updated ServeController metric gauge construction and emission in python/ray/serve/_private/controller.py.
Added helpers to keep gauge tag_keys and emitted tags consistent based on the env var.
Tests Added
Added unit coverage in python/ray/serve/tests/unit/test_controller.py for both modes:
default includes handle / replica
opt-out keeps only deployment / application
Tests Run
git diff --check
python -m py_compile changed Python files
python -m pytest python/ray/serve/tests/unit/test_controller.py -q -k autoscaling_metrics_delay_tags
Focused pytest result:
2 passed, 70 deselected in 0.18s
Backwards Compatibility
Fully backwards compatible by default. Existing metric labels remain unchanged unless users explicitly set RAY_SERVE_CONTROLLER_METRICS_INCLUDE_HIGH_CARDINALITY_TAGS=0.