feat: add pool_name label to replica #860
Signed-off-by: Maheswara Reddy Chennuru <cmr.mahesh@gmail.com>
I went with a cache-miss approach here —
/// Lists all replicas to build a replica name → pool_name mapping.
pub(crate) async fn list_replicas(
Suggested change:
- pub(crate) async fn list_replicas(
+ pub(crate) async fn fetch_replica_pool_mapping(
/// Fetches replica list and stores replica name → pool_name mapping in cache.
/// Only refreshes when new replicas appear that aren't in the existing map.
async fn store_replica_pool_map(client: &GrpcClient) {
    let needs_refresh = {
What if replicas get deleted from io-engine? Should we update the map then as well?
When a replica is deleted, it disappears from the stats response, so its map entry is never looked up again; the entry is stale but harmless. The map is fully replaced on the next refresh, which is triggered when a new replica appears (`*cache.replica_pool_map_mut() = new_map`), and that replacement cleans up stale entries. I could add explicit pruning, but it adds complexity for no extra benefit. Let me know your thoughts.
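To make the trade-off concrete, here is a minimal sketch of the refresh-on-miss behaviour described above. It is not the code in this PR: `ReplicaPoolCache`, `needs_refresh`, `replace`, `pool_name`, and `len` are hypothetical names, the gRPC call is replaced by a plain listing passed in by the caller, and only the replica name → pool name map is modelled.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the exporter cache (assumption, not the PR's type).
/// Only the replica name -> pool name map is modelled.
#[derive(Default)]
pub struct ReplicaPoolCache {
    replica_pool_map: HashMap<String, String>,
}

impl ReplicaPoolCache {
    /// Returns true when the stats response names a replica that is not cached yet.
    /// A deleted replica never shows up in `stats_replicas`, so it cannot trigger this.
    pub fn needs_refresh(&self, stats_replicas: &[&str]) -> bool {
        stats_replicas
            .iter()
            .any(|name| !self.replica_pool_map.contains_key(*name))
    }

    /// Replaces the whole map with a fresh listing (standing in for the list call).
    /// Entries for replicas missing from the listing are dropped at this point.
    pub fn replace(&mut self, listing: Vec<(String, String)>) {
        self.replica_pool_map = listing.into_iter().collect();
    }

    /// Looks up the pool name used as the label value for a replica.
    pub fn pool_name(&self, replica: &str) -> Option<&str> {
        self.replica_pool_map.get(replica).map(String::as_str)
    }

    /// Number of cached entries, including any stale ones.
    pub fn len(&self) -> usize {
        self.replica_pool_map.len()
    }
}
```

In this sketch a deleted replica's entry stays in the map until some other replica causes a miss, at which point the full replacement drops it. That matches the behaviour described in the reply, under the assumption that the listing only contains live replicas.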
Signed-off-by: Maheswara Reddy Chennuru <cmr.mahesh@gmail.com>
How about adding the pool_name to the replica stat itself?
Description
Add a `pool_name` label to all replica I/O metrics (`replica_bytes_read`, `replica_num_read_ops`, `replica_bytes_written`, `replica_num_write_ops`, `replica_read_latency_us`, `replica_write_latency_us`).
The metrics-exporter now calls the `ListReplicas` gRPC to build a replica name → pool_name mapping and includes it as a label when emitting replica metrics. The map is only refreshed when a new replica appears that isn't already cached, so there's zero additional gRPC overhead during steady-state operation.
Motivation and Context
Described at #1990
Replica I/O metrics currently have no `pool_name` label. The only shared dimension between replica metrics and disk pool metrics (`diskpool_*`) is `node`, which means correlating a specific replica's I/O to its backing pool requires visual inference ("same node = same pool"); this breaks on nodes with multiple disk pools.
With this change, users can directly query:
or join with pool metrics:
replica_num_read_ops * on(pool_name, node) group_left() diskpool_total_size_bytes
Regression
No
How Has This Been Tested?
Types of changes
Checklist: