Skip to content

feat(core): additional tracker modules#455

Merged
mskrzypkows merged 25 commits into
mainfrom
tracker_full_module
Jun 2, 2026
Merged

feat(core): additional tracker modules#455
mskrzypkows merged 25 commits into
mainfrom
tracker_full_module

Conversation

@mskrzypkows
Copy link
Copy Markdown
Contributor

No description provided.

  Ports analyseDutyFailed, analyseParticipation, extractParSigs, and
  all supporting helpers from charon/core/tracker/tracker.go. Replaces
  the TODO in TrackerService::run with the full analyse-on-deadline
  loop.

  Known deviation: fetch errors always map to REASON_BUG_FETCH_ERROR
  because StepError is opaque (Arc<dyn Error>); REASON_FETCH_BN_ERROR
  requires a typed wrapper to detect beacon-node HTTP errors.
eth2api: declare indexmap dependency explicitly

  types.rs references indexmap::IndexSet directly but the crate was not
  listed in Cargo.toml. Nix masked this via cached build artifacts;
  clean builds outside nix failed at the pluto-eth2api compile step.
@mskrzypkows mskrzypkows changed the title Tracker full module feat: additional tracker modules May 29, 2026
@mskrzypkows
Copy link
Copy Markdown
Contributor Author

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 29, 2026

Claude finished @mskrzypkows's task in 6m 8s —— View job


Code Review — feat: additional tracker modules

I read the new modules (analysis.rs, reporters.rs, metrics.rs), the updated tracker/mod.rs and step.rs, and cross-referenced them with the Charon Go source at charon/core/tracker/{tracker.go,inclusion.go}. Overall this is a faithful, well-tested port — the failure-classification machinery and participation accounting line up step-for-step with analyseDutyFailed, analyseFetcherFailed*, and analyseParticipation. The progressive test for analyse_duty_failed and the table tests for analyse_fetcher_failed are particularly nice direct ports.

A handful of issues to consider before merging.


Bugs / Correctness

1. Dead-code shim hiding an unused-import warningcrates/core/src/tracker/analysis.rs:237

let _ = REASON_BUG_PAR_SIG_DB_EXTERNAL; // silence unused-import lint on this branch

REASON_BUG_PAR_SIG_DB_EXTERNAL is imported at line 15, but the Step::ParSigDBExternal branch on line 192 uses the fully-qualified path crate::tracker::reason::REASON_BUG_PAR_SIG_DB_EXTERNAL instead. The let _ = … line then exists only to silence the resulting unused-import lint. Fix by either using the imported identifier directly on line 192 or removing it from the import on line 15.

Fix this →

2. Producers block indefinitely with no shutdown bail-outtracker/mod.rs:206

async fn send_event(&self, event: Event) {
    if let Err(e) = self.input_tx.send(event).await {
        tracing::warn!();
    }
}

Go uses a select between t.input <- event and <-t.quit, so producers (fetcher, consensus, parsigex, etc.) can bail out when the tracker is shutting down. The Rust port awaits send(event) unconditionally. If the tracker loop stalls or the channel fills (EVENT_BUFFER = 1024) right around shutdown, callers will block until the receiver is finally dropped. Consider adding a tokio::select! with cancel.cancelled() in send_event, or document why this is acceptable for the Pluto pipeline.

3. EVENT_BUFFER = 1024 may be undersizedtracker/mod.rs:165

The comment says "epoch's worth of events". In a typical Holesky-scale cluster a single fetcher fan-out emits one event per validator pubkey, which can be hundreds to thousands per duty. Combined with many duty types per slot × 32 slots, an epoch easily exceeds 1024. Consider bumping the default or making it configurable from the start.

4. incl_supported() re-allocates a HashSet and takes a global RwLock on every duty analysisanalysis.rs:46-56

pub(crate) fn incl_supported() -> HashSet<DutyType> {
    let mut set = HashSet::new();
    set.insert(DutyType::Proposer);
    let state = GLOBAL_STATE.read().expect("featureset poisoned");
    if state.enabled(Feature::AttestationInclusion) {
        set.insert(DutyType::Attester);
        set.insert(DutyType::Aggregator);
    }
    set
}

Called once per last_step() (so once per analyser tick). Cheap, but the lock + allocation is avoidable. Since the feature set is effectively static after startup, snapshot it at TrackerService::start and pass it into the loop, or memoise behind a OnceLock. Minor perf, not a correctness bug.

5. Stale participation gauge when only unexpected events arrivereporters.rs:252-264

if part > 0 {
    TRACKER_METRICS.participation[&labels].set(1);
} else if unexp > 0 {
    // increments unexpected_events_total but does NOT touch the gauge
} else {
    absent.push(peer.name.clone());
    TRACKER_METRICS.participation[&labels].set(0);
}

If a peer sent only an unexpected event this duty, the participation gauge keeps whatever value it had last duty — which could be 1 from a previous successful duty. Go has the exact same shape, so this is a faithful port, but worth flagging since it can mislead dashboards. Setting the gauge to 0 in the unexp > 0 branch (or always setting at top of loop) would be more honest.


Tests

6. fan_out_sends_one_event_per_pubkey doesn't actually assert the fan-outtracker/mod.rs:1061

The test name promises it verifies one event per pubkey, but the body just sends two duties and then cancels — no assertion on the number of events or per-pubkey shape. Either rename it to fan_out_does_not_panic or wire it through RecordingFailureReporter/RecordingParticipationReporter and assert the resulting per-pubkey counts.

7. from_slot_filters_old_events doesn't actually verify filteringtracker/mod.rs:750

Same shape — sends a below-from_slot event and an above-from_slot event, yields, cancels. Nothing asserts that the slot-5 event was discarded vs. the slot-15 one accepted. Consider adding a recording deadliner or a counter and asserting the analyser only ever heard about slot 15.


Code quality / nits

8. String clones per peer per duty in MetricsParticipationReporter::reportreporters.rs:246

(dt_str.clone(), peer.name.clone()) allocates twice per peer per call. If your LabeledFamily index accepts (&str, &str) you could likely sidestep this; if not, cache dt_str once outside the loop (you already do — good) and consider switching peer names to Arc<str> or &'static str if possible.

9. saturating_add(1) / saturating_sub are over-defensiveanalysis.rs:481-486 and reporters.rs:250

A usize counter can't realistically overflow in practice; Go uses plain ++. The saturating variants hide rather than highlight any real bug. Plain += 1 would match the Go more closely and document the intent.

10. analyse calls events.get(duty) twicetracker/mod.rs:428-433

The inner analyse_duty_failed also re-fetches all_events.get(duty). Not a bug, just a minor duplicate lookup; ignore unless you're tightening the hot path.

11. Step::Zero matched but never populated

Step::Zero is documented as "no step reached". The match arm at analysis.rs:226 injects a string_error("no events for duty") — but duty_failed_step only returns Step::Zero when events.is_empty(). The last_for_duty comparison branch (last.step == last_for_duty && last.step_err.is_none()) can never produce Step::Zero either. So this arm is reachable only from the empty-events path, where reason = REASON_UNKNOWN and the error string is the only signal. Consider asserting / documenting this invariant; today it's load-bearing but easy to break in a refactor.

12. MetricsFailedDutyReporter::default() performs side effects (counter registration)

Default::default() invoking new() means any test or caller using Default quietly registers Prometheus counters. This is consistent with Go's "init counters at startup" pattern, but in Rust a Default impl that touches global state is surprising. Consider dropping the Default impl and forcing ::new() at the call sites, which makes the side effect explicit.

13. Cargo workspace indexmap dependency

The PR description is empty but the diff adds indexmap = { version = "2", features = ["serde"] } to workspace Cargo.toml and to crates/eth2api/Cargo.toml (with a cargo-machete ignore). Looks unrelated to the tracker work — picked up from an earlier merge (fd87328 indexmap fix eth2api). Either confirm in the description that this rides along intentionally, or split it out.


Things I checked and liked

  • The STEPS array in duty_failed_step covers every step from ChainInclusion down to Fetcher, matching Go's for step := sentinel - 1; step > zero; step-- exactly (analysis.rs:108-120).
  • extract_par_sigs dedup-by-(pubkey, share_idx) matches Go's dedupKey; the test extract_par_sigs_groups_by_msg_root_per_pubkey is a clean port of TestAnalyseParSigs.
  • is_par_sig_event_expected cleanly mirrors the Go branching (Randao → Proposer/BuilderProposer, PrepareAggregator → Attester, PrepareSyncContribution/SyncMessage → SyncContribution).
  • The newtype wrappers AnalyserRx / DeleterRx are a nice Rust-specific safety improvement over Go (prevents accidentally swapping the two channels at the call site).
  • The Tracker trait uses impl Future<Output = ()> returns rather than async-trait boxing — good for the hot path.
  • unsupported_ignorer_state_machine is thorough, including the "supported → failures no longer ignored" transition.

Todo

  • Gather PR context and read changed files
  • Cross-reference Go (Charon) implementation for parity
  • Review analysis.rs for correctness, idioms, and bugs
  • Review reporters.rs for correctness and concurrency safety
  • Review metrics.rs and step.rs changes
  • Review tracker/mod.rs integration and tests
  • Post review
    | Branch: tracker_full_module

@mskrzypkows mskrzypkows changed the title feat: additional tracker modules feat(core): additional tracker modules May 29, 2026
Copy link
Copy Markdown
Contributor Author

@mskrzypkows mskrzypkows left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR implements the tracker analysis pipeline — analyse_duty_failed, analyse_participation, extract_par_sigs, reporters, and metrics — a substantial and well-structured port of Go's Charon tracker. Test coverage for the analysis functions is thorough.

Issues to address before merge:

  1. incl_supported() acquires a global RwLock and allocates a HashSet on every call — called 2–4× per duty analysis. Should be computed once and passed down.
  2. The events HashMap in run() has no per-duty event cap — Vecs can grow without bound if a peer floods par-sig events before the deleter fires.
  3. analyse_fetcher_failed is missing Go's context.Canceled special-case — parity gap in error classification.

Several minor style and naming issues noted inline.

Comment thread crates/core/src/tracker/analysis.rs
Comment thread crates/core/src/tracker/analysis.rs
Comment thread crates/core/src/tracker/analysis.rs Outdated
Comment thread crates/core/src/tracker/analysis.rs
Comment thread crates/core/src/tracker/reporters.rs Outdated
Comment thread crates/core/src/tracker/reporters.rs
Comment thread crates/core/src/tracker/reporters.rs
@mskrzypkows
Copy link
Copy Markdown
Contributor Author

@claude review my changes, concentrate on major issues

@mskrzypkows mskrzypkows marked this pull request as ready for review June 1, 2026 09:44
@mskrzypkows mskrzypkows requested a review from iamquang95 June 1, 2026 09:44
@claude
Copy link
Copy Markdown

claude Bot commented Jun 1, 2026

Claude finished @mskrzypkows's task in 5m 25s —— View job


Code Review — Major Issues

I focused on correctness / parity bugs that change observable behavior versus Charon Go. Two of them look like real regressions, plus a couple of secondary issues worth addressing.


Todo

  • Read analysis.rs, reporters.rs, metrics.rs, mod.rs, step.rs
  • Cross-reference against Charon Go (charon/core/tracker/*.go)
  • Identify major correctness / parity / concurrency issues
  • Post review

1. BUG: absent-peer logging condition is invertedreporters.rs:269

if self.prev_absent.get(&duty.duty_type) == Some(&absent) {
    if absent.is_empty() {
        tracing::info!(duty = %duty, "All peers participated in duty");
    } else if absent.len() == self.peers.len() {
        tracing::info!(duty = %duty, "No peers participated in duty");
    } else {
        tracing::info!(duty = %duty, absent = ?absent, "Not all peers participated in duty");
    }
}

Go's equivalent at tracker.go:669 uses != (state has changed):

if fmt.Sprint(prevAbsent[duty.Type]) != fmt.Sprint(absentPeers) {
    // log
}

The comment immediately above the Rust block correctly states the intent:

Only log when the absent set changes from the previous duty of this type, to avoid log spam every slot.

…but the code does the opposite: it logs only when the set is unchanged, and suppresses logs when it changes. Concrete consequences:

Scenario Go behavior Rust behavior
First duty of a type (prev_absent = None) logs does NOT log
State unchanged across duties suppresses (anti-spam) spams every slot
State changes (peer flaps in/out) logs (the useful signal) suppresses

Flip the condition to != (or != Some(&absent) accounting for the None first-time case — Go's fmt.Sprint(nil) != fmt.Sprint([p1 p2]) always fires on first call).

Fix this →


2. BUG: empty-aggregator / empty-sync-contribution slots inflate the success countermod.rs:417-424 + reporters.rs:69

For Aggregator / SyncContribution duties, Go's analyseFetcherFailedAggregator / analyseFetcherFailedSyncContribution returns (false, fetcher, reason{}, nil) when fetchErr == nil — i.e. "no aggregators selected this slot". failedDutyReporter then short-circuits via:

if !failed {
    if step == fetcher {
        return // no metric updates
    }
    // increment expect + success counters
}

Avoiding double-counting these no-op slots, which happen on most slots in production.

The Rust port loses the step information here. analyse_fetcher_failed_aggregator returns None via fetch_err.as_ref()?, which propagates as Option<DutyFailure> = None up to TrackerService::analyse. Then:

// mod.rs:417
let (step, reason, err) = outcome
    .as_ref()
    .map_or((Step::Zero, REASON_UNKNOWN, None), |f| (f.step, f.reason, f.err.as_ref()));

Step is collapsed to Step::Zero, so MetricsDutyReporter::report (reporters.rs:69) misses the short-circuit:

if !failed {
    if step == Step::Fetcher {   // never true — we lost the Fetcher signal
        return;
    }
    TRACKER_METRICS.expect_duties_total[&dt].inc();   // wrongly incremented
    TRACKER_METRICS.success_duties_total[&dt].inc();  // wrongly incremented
    return;
}

Net effect: every slot without an aggregator (the common case) increments success_duties_total{duty="aggregator"} and expect_duties_total{duty="aggregator"} — same for sync_contribution. Dashboards / SLOs computed from these metrics will be wrong.

The Rust table test analyse_fetcher_failed_table correctly asserts analyse_duty_failed returns None for these cases, but it doesn't catch the downstream metric mishandling because the test stops at the analyser.

Suggested fix: either (a) change analyse_duty_failed's return to carry the step even on "not failed" — e.g. (bool, Step, Option<DutyFailure>) like Go's signature — and use it when calling the reporter, or (b) detect this case in TrackerService::analyse by recomputing duty_failed_step before reporting. Option (a) is closer to Go.

Fix this →


3. OnceLock cache for incl_supported() ignores runtime feature flag changes — analysis.rs:45-57

pub(crate) fn incl_supported() -> &'static HashSet<DutyType> {
    static CACHE: OnceLock<HashSet<DutyType>> = OnceLock::new();
    CACHE.get_or_init(|| {
        // reads GLOBAL_STATE.enabled(Feature::AttestationInclusion) once
    })
}

The earlier RwLock+allocation criticism is addressed. But this introduces a subtle behavioral divergence: Go reads inclSupported() per analysis, so any runtime flip of Feature::AttestationInclusion takes effect immediately. The Rust port locks the value to whatever was set at the time of the first call (which may be very early in startup). In Charon, feature flags are typically configured at startup and never mutated, so this is likely safe in practice — but the assumption deserves an explicit comment, and if GLOBAL_STATE mutability is even theoretically supported, this becomes a real divergence. Worth a comment, or use a Once-style invariant that's documented at the GLOBAL_STATE site.


4. validators_per_duty is fed straight into expected_per_peermod.rs:432 + reporters.rs:248

Cross-references fine with Go, but worth flagging that expected_per_peer == validators_per_duty == distinct pubkeys with any event. For an aggregator slot where only some validators were selected (say 2 of 10 attestation duties are aggregator candidates), expected_per_peer reflects only those 2 — not the cluster's full validator count. That matches Go (pubkeyMapLen), so this is parity, not a bug; just worth documenting since the name "expected per peer" reads as "expected validators per peer over time" but actually means "validators that recorded any event for this specific duty".


5. Producer paths block on send_event even during shutdown — mod.rs:206-215

This was noted in the earlier review and I largely agree it's not a correctness bug (the receiver drop on cancel makes send() return an error). Worth a small confirming comment in send_event so future readers know shutdown is handled via channel-closing rather than explicit cancel-select.


Things I checked and look correct

  • analyse_duty_failed's step-by-step branching (analysis.rs:156-226) matches Go's analyseDutyFailed arm-for-arm including the early return on ParSigDBExternal with err, the DutyDB → ValidatorAPI step override, and the Bcast / ChainInclusion "missing event" fallbacks.
  • extract_par_sigs dedup-by-(pubkey, share_idx) and per-pubkey grouping matches Go's extractParSigs.
  • is_par_sig_event_expected correctly routes Randao→Proposer/BuilderProposer, PrepareAggregator→Attester, PrepareSyncContribution/SyncMessage→SyncContribution.
  • unsupported_ignorer_state_machine test is thorough including the "supported → failures no longer ignored" transition.
  • STEPS array in duty_failed_step covers every step from ChainInclusion down to Fetcher, matching Go's for step := sentinel - 1; step > zero; step--.
  • The report_par_sigs per-pubkey root-count check at reporters.rs:316 (if by_root.len() <= 1) is the correct fix over Go's len(parsigMsgs) <= 1 outer-map check (Go bug). Keep the comment.
  • Default impls touching global metric state are surprising in Rust but the previous review's note about that still stands.

Summary

Two genuine correctness regressions against Charon — the absent-peer log inversion (#1) is a pure logic flip and easy to fix; the empty-aggregator success-counter inflation (#2) needs a small return-type refactor to plumb the step through. Both will produce visibly wrong output: noisy/missing logs and inflated success rates on dashboards.
| Branch: tracker_full_module

/// configured once at startup and never mutated afterward — matching Go,
/// which reads the flag on every call but relies on the same invariant.
pub(crate) fn incl_supported() -> &'static HashSet<DutyType> {
static CACHE: OnceLock<HashSet<DutyType>> = OnceLock::new();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure should we cache this or keep this same as charon, decision is up to you.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not modified later I would keep the cache, not to recreate a hashmap few times.

@mskrzypkows mskrzypkows merged commit c896771 into main Jun 2, 2026
15 checks passed
@mskrzypkows mskrzypkows deleted the tracker_full_module branch June 2, 2026 11:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants