feat: more metrics #9906
Conversation
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Pierugo Pace <pierugo.pace@dfinity.org>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Pierugo Pace <pierugo.pace@dfinity.org>
There was a problem hiding this comment.
Pull request overview
This PR extends canister metrics to include canister-to-canister connectivity (message counts by sender), persists those metrics in checkpoints, and updates state-tool / subnet-splitting tooling to export and consume the new connectivity CSV.
Changes:
- Add per-canister connection metrics (LRU-capped map keyed by sender canister) and increment them during local-subnet request execution.
- Persist connection metrics in checkpoint protobufs and restore them on checkpoint load.
- Update
state-tool canister_metricsto emit two CSVs (load metrics + connectivity metrics) and adjust subnet-splitting integration + data loading.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| rs/state_tool/src/main.rs | Extends CLI to accept two output paths for canister metrics export. |
| rs/state_tool/src/commands/canister_metrics.rs | Writes separate CSVs for load metrics and connectivity metrics. |
| rs/state_manager/src/tip.rs | Serializes connection metrics into CanisterStateBits when checkpointing. |
| rs/state_manager/src/checkpoint.rs | Restores connection metrics when loading canister state from checkpoints. |
| rs/state_layout/src/state_layout.rs | Adds connection_metrics field to CanisterStateBits. |
| rs/state_layout/src/state_layout/proto.rs | Adds protobuf (de)serialization for connection_metrics. |
| rs/state_layout/src/state_layout/tests.rs | Updates test fixture to include the new connection_metrics field. |
| rs/replicated_state/src/canister_state/system_state.rs | Introduces ConnectionMetrics + LRUConnectionMetrics and plumbs into CanisterMetrics. |
| rs/execution_environment/src/execution_environment.rs | Increments connectivity metrics when executing local-subnet requests. |
| rs/recovery/subnet_splitting/tests/integration_tests.rs | Updates e2e test to use real connectivity metrics output instead of fake CSV. |
| rs/recovery/subnet_splitting/split_finder/data_io.py | Adjusts communication/load data ingestion (currently with commented-out filtering). |
| rs/protobuf/def/state/canister_state_bits/v1/canister_state_bits.proto | Adds connection_metrics field and CanisterToCanisterMetrics message. |
| rs/protobuf/src/gen/state/state.canister_state_bits.v1.rs | Regenerated protobuf bindings for the new message/field. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| instructions_executed, | ||
| load_metrics, | ||
| connection_metrics: LRUConnectionMetrics { | ||
| metrics_per_canister: connection_metrics, | ||
| }, |
There was a problem hiding this comment.
MAX_CAPACITY is only enforced on increment(). When metrics are constructed from checkpoint data (CanisterMetrics::new takes a full BTreeMap), a malformed/large checkpoint could load an arbitrarily large map and bypass the intended cap. Consider trimming/evicting down to MAX_CAPACITY during construction (and/or in deserialization) to make the bound invariant hold.
| local_subnet_messages_executed: 0, | ||
| http_outcalls_executed: 0, | ||
| heartbeats_and_global_timers_executed: 0, | ||
| connection_metrics: BTreeMap::new(), | ||
| } |
There was a problem hiding this comment.
A new connection_metrics field was added to CanisterStateBits, but the encode/decode tests in this module don’t cover non-empty connection_metrics. Adding a roundtrip test with at least one entry would help prevent regressions in the new protobuf mapping.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// Canister connections output path. | ||
| #[clap(long)] | ||
| canister_connections_output: PathBuf, |
There was a problem hiding this comment.
The CLI flag/field name uses "connections" (canister_connections_output) while the implementation and test artifacts use "connectivity". Consider standardizing on one term across the CLI, variable names, and canister_metrics::get parameter naming to reduce user confusion and make it easier to discover the right output file.
| /// Canister connections output path. | |
| #[clap(long)] | |
| canister_connections_output: PathBuf, | |
| /// Canister connectivity output path. | |
| #[clap(long = "canister-connectivity-output", alias = "canister-connections-output")] | |
| canister_connectivity_output: PathBuf, |
| local_subnet_messages_executed: 0, | ||
| http_outcalls_executed: 0, | ||
| heartbeats_and_global_timers_executed: 0, | ||
| connection_metrics: BTreeMap::new(), | ||
| } |
There was a problem hiding this comment.
A new connection_metrics field was added to CanisterStateBits and its proto (de)serialization, but there is no encode/decode test coverage for it in this test module (only the default is updated). Please add a round-trip test that populates connection_metrics with at least one entry and asserts it survives pb encode/decode, to guard against future proto/schema regressions.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// Canister load output path. | ||
| #[clap(long)] | ||
| output: PathBuf, | ||
| canister_load_output: PathBuf, | ||
|
|
||
| /// Canister connections output path. | ||
| #[clap(long)] | ||
| canister_connections_output: PathBuf, |
There was a problem hiding this comment.
The canister_metrics subcommand description still says metrics are printed “to the specified file”, but the command now requires two output paths (load + connectivity). Please update the subcommand-level doc/help text to clearly describe both outputs and their corresponding flags so users don’t assume a single CSV is produced.
| def load_subnet_data( | ||
| load_path: Path, | ||
| load_baseline_path: Path, | ||
| load_type: str, | ||
| communication_data_path: Path, | ||
| communication_baseline_data_path: Path, | ||
| ) -> Dict: |
There was a problem hiding this comment.
load_subnet_data gained a new communication_baseline_data_path parameter, but the function docstring still only documents communication_data_path. Please update the docstring to describe the baseline communication CSV and how it’s used (e.g., subtraction to compute deltas).
| parser.add_argument( | ||
| "--communication-baseline-data-path", | ||
| type=Path, | ||
| help="Path to canister-to-canister communication data", |
There was a problem hiding this comment.
The help text for --communication-baseline-data-path is identical to the non-baseline flag (“Path to canister-to-canister communication data”), which makes it hard to understand what the baseline file represents. Please update the help string to explicitly say it’s the baseline/earlier snapshot used to compute relative (delta) communication counts.
| help="Path to canister-to-canister communication data", | |
| help="Path to baseline canister-to-canister communication data. " | |
| "It should represent the earlier snapshot taken before the data under " | |
| "`communication-data-path` so that relative (delta) communication counts can be computed", |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| connection_metrics: item | ||
| .connection_metrics | ||
| .into_iter() | ||
| .map( | ||
| |( |
There was a problem hiding this comment.
New connection_metrics encoding/decoding logic is added here, but there’s no test exercising a non-empty map roundtrip (including timestamp + count). Since this module already has encode/decode tests, please add a small test case that serializes a CanisterStateBits with at least one connection_metrics entry and verifies it decodes back correctly.
| #[clap(long = "checkpoint")] | ||
| path: PathBuf, | ||
|
|
||
| /// Output path. | ||
| /// Canister load output path. | ||
| #[clap(long)] |
There was a problem hiding this comment.
The canister_metrics subcommand’s top-level doc comment still says it prints metrics to a single “specified file”, but this variant now requires two output paths (load + connections). Please update that description so --help accurately reflects that two CSV files are produced.
No description provided.