Skip to content

feat: more metrics #9906

Draft
kpop-dfinity wants to merge 46 commits intomasterfrom
kpop/ss_tool_follow_up
Draft

feat: more metrics #9906
kpop-dfinity wants to merge 46 commits intomasterfrom
kpop/ss_tool_follow_up

Conversation

@kpop-dfinity
Copy link
Copy Markdown
Contributor

No description provided.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends canister metrics to include canister-to-canister connectivity (message counts by sender), persists those metrics in checkpoints, and updates state-tool / subnet-splitting tooling to export and consume the new connectivity CSV.

Changes:

  • Add per-canister connection metrics (LRU-capped map keyed by sender canister) and increment them during local-subnet request execution.
  • Persist connection metrics in checkpoint protobufs and restore them on checkpoint load.
  • Update state-tool canister_metrics to emit two CSVs (load metrics + connectivity metrics) and adjust subnet-splitting integration + data loading.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
rs/state_tool/src/main.rs Extends CLI to accept two output paths for canister metrics export.
rs/state_tool/src/commands/canister_metrics.rs Writes separate CSVs for load metrics and connectivity metrics.
rs/state_manager/src/tip.rs Serializes connection metrics into CanisterStateBits when checkpointing.
rs/state_manager/src/checkpoint.rs Restores connection metrics when loading canister state from checkpoints.
rs/state_layout/src/state_layout.rs Adds connection_metrics field to CanisterStateBits.
rs/state_layout/src/state_layout/proto.rs Adds protobuf (de)serialization for connection_metrics.
rs/state_layout/src/state_layout/tests.rs Updates test fixture to include the new connection_metrics field.
rs/replicated_state/src/canister_state/system_state.rs Introduces ConnectionMetrics + LRUConnectionMetrics and plumbs into CanisterMetrics.
rs/execution_environment/src/execution_environment.rs Increments connectivity metrics when executing local-subnet requests.
rs/recovery/subnet_splitting/tests/integration_tests.rs Updates e2e test to use real connectivity metrics output instead of fake CSV.
rs/recovery/subnet_splitting/split_finder/data_io.py Adjusts communication/load data ingestion (currently with commented-out filtering).
rs/protobuf/def/state/canister_state_bits/v1/canister_state_bits.proto Adds connection_metrics field and CanisterToCanisterMetrics message.
rs/protobuf/src/gen/state/state.canister_state_bits.v1.rs Regenerated protobuf bindings for the new message/field.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 223 to +227
instructions_executed,
load_metrics,
connection_metrics: LRUConnectionMetrics {
metrics_per_canister: connection_metrics,
},
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAX_CAPACITY is only enforced on increment(). When metrics are constructed from checkpoint data (CanisterMetrics::new takes a full BTreeMap), a malformed/large checkpoint could load an arbitrarily large map and bypass the intended cap. Consider trimming/evicting down to MAX_CAPACITY during construction (and/or in deserialization) to make the bound invariant hold.

Copilot uses AI. Check for mistakes.
Comment thread rs/recovery/subnet_splitting/split_finder/data_io.py Outdated
Comment on lines 73 to 77
local_subnet_messages_executed: 0,
http_outcalls_executed: 0,
heartbeats_and_global_timers_executed: 0,
connection_metrics: BTreeMap::new(),
}
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new connection_metrics field was added to CanisterStateBits, but the encode/decode tests in this module don’t cover non-empty connection_metrics. Adding a roundtrip test with at least one entry would help prevent regressions in the new protobuf mapping.

Copilot uses AI. Check for mistakes.
Comment thread rs/state_tool/src/main.rs Outdated
Comment thread rs/state_tool/src/main.rs
Comment thread rs/state_tool/src/commands/canister_metrics.rs Outdated
Comment thread rs/state_layout/src/state_layout/proto.rs Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rs/state_tool/src/commands/canister_metrics.rs
Comment thread rs/state_tool/src/commands/canister_metrics.rs Outdated
Comment thread rs/state_tool/src/main.rs
Comment on lines +188 to +190
/// Canister connections output path.
#[clap(long)]
canister_connections_output: PathBuf,
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CLI flag/field name uses "connections" (canister_connections_output) while the implementation and test artifacts use "connectivity". Consider standardizing on one term across the CLI, variable names, and canister_metrics::get parameter naming to reduce user confusion and make it easier to discover the right output file.

Suggested change
/// Canister connections output path.
#[clap(long)]
canister_connections_output: PathBuf,
/// Canister connectivity output path.
#[clap(long = "canister-connectivity-output", alias = "canister-connections-output")]
canister_connectivity_output: PathBuf,

Copilot uses AI. Check for mistakes.
Comment on lines 73 to 77
local_subnet_messages_executed: 0,
http_outcalls_executed: 0,
heartbeats_and_global_timers_executed: 0,
connection_metrics: BTreeMap::new(),
}
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new connection_metrics field was added to CanisterStateBits and its proto (de)serialization, but there is no encode/decode test coverage for it in this test module (only the default is updated). Please add a round-trip test that populates connection_metrics with at least one entry and asserts it survives pb encode/decode, to guard against future proto/schema regressions.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rs/state_tool/src/main.rs
Comment on lines +184 to +190
/// Canister load output path.
#[clap(long)]
output: PathBuf,
canister_load_output: PathBuf,

/// Canister connections output path.
#[clap(long)]
canister_connections_output: PathBuf,
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The canister_metrics subcommand description still says metrics are printed “to the specified file”, but the command now requires two output paths (load + connectivity). Please update the subcommand-level doc/help text to clearly describe both outputs and their corresponding flags so users don’t assume a single CSV is produced.

Copilot uses AI. Check for mistakes.
Comment thread rs/state_tool/src/commands/canister_metrics.rs
Comment on lines +7 to +13
def load_subnet_data(
load_path: Path,
load_baseline_path: Path,
load_type: str,
communication_data_path: Path,
communication_baseline_data_path: Path,
) -> Dict:
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

load_subnet_data gained a new communication_baseline_data_path parameter, but the function docstring still only documents communication_data_path. Please update the docstring to describe the baseline communication CSV and how it’s used (e.g., subtraction to compute deltas).

Copilot uses AI. Check for mistakes.
Comment thread rs/recovery/subnet_splitting/split_finder/data_io.py
parser.add_argument(
"--communication-baseline-data-path",
type=Path,
help="Path to canister-to-canister communication data",
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The help text for --communication-baseline-data-path is identical to the non-baseline flag (“Path to canister-to-canister communication data”), which makes it hard to understand what the baseline file represents. Please update the help string to explicitly say it’s the baseline/earlier snapshot used to compute relative (delta) communication counts.

Suggested change
help="Path to canister-to-canister communication data",
help="Path to baseline canister-to-canister communication data. "
"It should represent the earlier snapshot taken before the data under "
"`communication-data-path` so that relative (delta) communication counts can be computed",

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +84 to +88
connection_metrics: item
.connection_metrics
.into_iter()
.map(
|(
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New connection_metrics encoding/decoding logic is added here, but there’s no test exercising a non-empty map roundtrip (including timestamp + count). Since this module already has encode/decode tests, please add a small test case that serializes a CanisterStateBits with at least one connection_metrics entry and verifies it decodes back correctly.

Copilot uses AI. Check for mistakes.
Comment thread rs/state_tool/src/main.rs
Comment on lines 181 to 185
#[clap(long = "checkpoint")]
path: PathBuf,

/// Output path.
/// Canister load output path.
#[clap(long)]
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The canister_metrics subcommand’s top-level doc comment still says it prints metrics to a single “specified file”, but this variant now requires two output paths (load + connections). Please update that description so --help accurately reflects that two CSV files are produced.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants