Skip to content

feat: add pool_name label to replica#860

Open
cmrmahesh wants to merge 2 commits into
openebs:developfrom
cmrmahesh:cmr/replica_label
Open

feat: add pool_name label to replica#860
cmrmahesh wants to merge 2 commits into
openebs:developfrom
cmrmahesh:cmr/replica_label

Conversation

@cmrmahesh
Copy link
Copy Markdown

@cmrmahesh cmrmahesh commented May 28, 2026

Description

Add pool_name label to all replica I/O metrics (replica_bytes_read, replica_num_read_ops, replica_bytes_written, replica_num_write_ops, replica_read_latency_us, replica_write_latency_us).

The metrics-exporter now calls ListReplicas gRPC to build a replica name → pool_name mapping and includes it as a label when emitting replica metrics. The map is only refreshed when a new replica appears that isn't already cached, so there's zero additional gRPC overhead during steady-state operation.

Motivation and Context

Described at #1990

Replica I/O metrics currently have no pool_name label. The only shared dimension between replica metrics and disk pool metrics (diskpool_*) is node, which means correlating a specific replica's I/O to its backing pool requires visual inference ("same node = same pool") — this breaks on nodes with multiple disk pools.

With this change, users can directly query:

rate(replica_num_read_ops{pool_name="pool-1"}[5m])

or join with pool metrics:

replica_num_read_ops * on(pool_name, node) group_left() diskpool_total_size_bytes

Regression
No

How Has This Been Tested?

  • Verified that the Replica gRPC message already exposes poolname (field 8 in replica.proto)
  • Confirmed ListReplicaOptions::default() returns all replicas on the node
  • Validated that the cache-miss-only refresh logic avoids unnecessary gRPC calls in steady state
  • Existing metric labels (node, name, pv_name) remain unchanged; pool_name is additive
  • Deploy the changes and validate on an actual cluster

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added unit tests to cover my changes.

Signed-off-by: Maheswara Reddy Chennuru <cmr.mahesh@gmail.com>
@cmrmahesh cmrmahesh requested a review from a team as a code owner May 28, 2026 15:47
@cmrmahesh
Copy link
Copy Markdown
Author

I went with a cache-miss approach here —
the map only refreshes when we see a replica in the stats that isn't already mapped. A replica doesn't migrate between pools; if one fails, it's destroyed and a new replica (with a new identity) is created on another pool. That new replica naturally triggers a cache miss and a refresh. In steady state this means zero additional gRPC calls.

}

/// Lists all replicas to build a replica name → pool_name mapping.
pub(crate) async fn list_replicas(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub(crate) async fn list_replicas(
pub(crate) async fn fetch_replica_pool_mapping(

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done.

/// Fetches replica list and stores replica name → pool_name mapping in cache.
/// Only refreshes when new replicas appear that aren't in the existing map.
async fn store_replica_pool_map(client: &GrpcClient) {
let needs_refresh = {
Copy link
Copy Markdown
Member

@abhilashshetty04 abhilashshetty04 May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if replicas get deleted from io-engine? Should we update map then as well?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a replica is deleted, it disappears from the stats response, so its map entry is never looked up — it's a stale and harmless. The map gets fully replaced on the next refresh (triggered when a new replica appears *cache.replica_pool_map_mut() = new_map), which cleans up stale entries. I could add explicit pruning but it adds complexity for no extra benefits. Let me know your thoughts.

Signed-off-by: Maheswara Reddy Chennuru <cmr.mahesh@gmail.com>
@tiagolobocastro
Copy link
Copy Markdown
Member

How about adding the pool_name to the replica stat itself?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants