feat(core): additional tracker modules by mskrzypkows · Pull Request #455 · NethermindEth/pluto

mskrzypkows · 2026-05-29T09:06:31Z

No description provided.

architecture

…dule_skeleton

Ports analyseDutyFailed, analyseParticipation, extractParSigs, and all supporting helpers from charon/core/tracker/tracker.go. Replaces the TODO in TrackerService::run with the full analyse-on-deadline loop. Known deviation: fetch errors always map to REASON_BUG_FETCH_ERROR because StepError is opaque (Arc<dyn Error>); REASON_FETCH_BN_ERROR requires a typed wrapper to detect beacon-node HTTP errors.

eth2api: declare indexmap dependency explicitly types.rs references indexmap::IndexSet directly but the crate was not listed in Cargo.toml. Nix masked this via cached build artifacts; clean builds outside nix failed at the pluto-eth2api compile step.

…ll_module

mskrzypkows · 2026-05-29T09:12:14Z

@claude review

claude · 2026-05-29T09:12:35Z

Claude finished @mskrzypkows's task in 6m 8s —— View job

Code Review — `feat: additional tracker modules`

I read the new modules (analysis.rs, reporters.rs, metrics.rs), the updated tracker/mod.rs and step.rs, and cross-referenced them with the Charon Go source at charon/core/tracker/{tracker.go,inclusion.go}. Overall this is a faithful, well-tested port — the failure-classification machinery and participation accounting line up step-for-step with analyseDutyFailed, analyseFetcherFailed*, and analyseParticipation. The progressive test for analyse_duty_failed and the table tests for analyse_fetcher_failed are particularly nice direct ports.

A handful of issues to consider before merging.

Bugs / Correctness

1. Dead-code shim hiding an unused-import warning — crates/core/src/tracker/analysis.rs:237

let _ = REASON_BUG_PAR_SIG_DB_EXTERNAL; // silence unused-import lint on this branch

REASON_BUG_PAR_SIG_DB_EXTERNAL is imported at line 15, but the Step::ParSigDBExternal branch on line 192 uses the fully-qualified path crate::tracker::reason::REASON_BUG_PAR_SIG_DB_EXTERNAL instead. The let _ = … line then exists only to silence the resulting unused-import lint. Fix by either using the imported identifier directly on line 192 or removing it from the import on line 15.

Fix this →

2. Producers block indefinitely with no shutdown bail-out — tracker/mod.rs:206

async fn send_event(&self, event: Event) {
    if let Err(e) = self.input_tx.send(event).await {
        tracing::warn!(…);
    }
}

Go uses a select between t.input <- event and <-t.quit, so producers (fetcher, consensus, parsigex, etc.) can bail out when the tracker is shutting down. The Rust port awaits send(event) unconditionally. If the tracker loop stalls or the channel fills (EVENT_BUFFER = 1024) right around shutdown, callers will block until the receiver is finally dropped. Consider adding a tokio::select! with cancel.cancelled() in send_event, or document why this is acceptable for the Pluto pipeline.

3. EVENT_BUFFER = 1024 may be undersized — tracker/mod.rs:165

The comment says "epoch's worth of events". In a typical Holesky-scale cluster a single fetcher fan-out emits one event per validator pubkey, which can be hundreds to thousands per duty. Combined with many duty types per slot × 32 slots, an epoch easily exceeds 1024. Consider bumping the default or making it configurable from the start.

4. incl_supported() re-allocates a HashSet and takes a global RwLock on every duty analysis — analysis.rs:46-56

pub(crate) fn incl_supported() -> HashSet<DutyType> {
    let mut set = HashSet::new();
    set.insert(DutyType::Proposer);
    let state = GLOBAL_STATE.read().expect("featureset poisoned");
    if state.enabled(Feature::AttestationInclusion) {
        set.insert(DutyType::Attester);
        set.insert(DutyType::Aggregator);
    }
    set
}

Called once per last_step() (so once per analyser tick). Cheap, but the lock + allocation is avoidable. Since the feature set is effectively static after startup, snapshot it at TrackerService::start and pass it into the loop, or memoise behind a OnceLock. Minor perf, not a correctness bug.

5. Stale participation gauge when only unexpected events arrive — reporters.rs:252-264

if part > 0 {
    TRACKER_METRICS.participation[&labels].set(1);
} else if unexp > 0 {
    // increments unexpected_events_total but does NOT touch the gauge
} else {
    absent.push(peer.name.clone());
    TRACKER_METRICS.participation[&labels].set(0);
}

If a peer sent only an unexpected event this duty, the participation gauge keeps whatever value it had last duty — which could be 1 from a previous successful duty. Go has the exact same shape, so this is a faithful port, but worth flagging since it can mislead dashboards. Setting the gauge to 0 in the unexp > 0 branch (or always setting at top of loop) would be more honest.

Tests

6. fan_out_sends_one_event_per_pubkey doesn't actually assert the fan-out — tracker/mod.rs:1061

The test name promises it verifies one event per pubkey, but the body just sends two duties and then cancels — no assertion on the number of events or per-pubkey shape. Either rename it to fan_out_does_not_panic or wire it through RecordingFailureReporter/RecordingParticipationReporter and assert the resulting per-pubkey counts.

7. from_slot_filters_old_events doesn't actually verify filtering — tracker/mod.rs:750

Same shape — sends a below-from_slot event and an above-from_slot event, yields, cancels. Nothing asserts that the slot-5 event was discarded vs. the slot-15 one accepted. Consider adding a recording deadliner or a counter and asserting the analyser only ever heard about slot 15.

Code quality / nits

8. String clones per peer per duty in MetricsParticipationReporter::report — reporters.rs:246

(dt_str.clone(), peer.name.clone()) allocates twice per peer per call. If your LabeledFamily index accepts (&str, &str) you could likely sidestep this; if not, cache dt_str once outside the loop (you already do — good) and consider switching peer names to Arc<str> or &'static str if possible.

9. saturating_add(1) / saturating_sub are over-defensive — analysis.rs:481-486 and reporters.rs:250

A usize counter can't realistically overflow in practice; Go uses plain ++. The saturating variants hide rather than highlight any real bug. Plain += 1 would match the Go more closely and document the intent.

10. analyse calls events.get(duty) twice — tracker/mod.rs:428-433

The inner analyse_duty_failed also re-fetches all_events.get(duty). Not a bug, just a minor duplicate lookup; ignore unless you're tightening the hot path.

11. Step::Zero matched but never populated

Step::Zero is documented as "no step reached". The match arm at analysis.rs:226 injects a string_error("no events for duty") — but duty_failed_step only returns Step::Zero when events.is_empty(). The last_for_duty comparison branch (last.step == last_for_duty && last.step_err.is_none()) can never produce Step::Zero either. So this arm is reachable only from the empty-events path, where reason = REASON_UNKNOWN and the error string is the only signal. Consider asserting / documenting this invariant; today it's load-bearing but easy to break in a refactor.

12. MetricsFailedDutyReporter::default() performs side effects (counter registration)

Default::default() invoking new() means any test or caller using Default quietly registers Prometheus counters. This is consistent with Go's "init counters at startup" pattern, but in Rust a Default impl that touches global state is surprising. Consider dropping the Default impl and forcing ::new() at the call sites, which makes the side effect explicit.

13. Cargo workspace indexmap dependency

The PR description is empty but the diff adds indexmap = { version = "2", features = ["serde"] } to workspace Cargo.toml and to crates/eth2api/Cargo.toml (with a cargo-machete ignore). Looks unrelated to the tracker work — picked up from an earlier merge (fd87328 indexmap fix eth2api). Either confirm in the description that this rides along intentionally, or split it out.

Things I checked and liked

The STEPS array in duty_failed_step covers every step from ChainInclusion down to Fetcher, matching Go's for step := sentinel - 1; step > zero; step-- exactly (analysis.rs:108-120).
extract_par_sigs dedup-by-(pubkey, share_idx) matches Go's dedupKey; the test extract_par_sigs_groups_by_msg_root_per_pubkey is a clean port of TestAnalyseParSigs.
is_par_sig_event_expected cleanly mirrors the Go branching (Randao → Proposer/BuilderProposer, PrepareAggregator → Attester, PrepareSyncContribution/SyncMessage → SyncContribution).
The newtype wrappers AnalyserRx / DeleterRx are a nice Rust-specific safety improvement over Go (prevents accidentally swapping the two channels at the call site).
The Tracker trait uses impl Future<Output = ()> returns rather than async-trait boxing — good for the hot path.
unsupported_ignorer_state_machine is thorough, including the "supported → failures no longer ignored" transition.

Todo

Gather PR context and read changed files
Cross-reference Go (Charon) implementation for parity
Review analysis.rs for correctness, idioms, and bugs
Review reporters.rs for correctness and concurrency safety
Review metrics.rs and step.rs changes
Review tracker/mod.rs integration and tests
Post review
| Branch: tracker_full_module

mskrzypkows

This PR implements the tracker analysis pipeline — analyse_duty_failed, analyse_participation, extract_par_sigs, reporters, and metrics — a substantial and well-structured port of Go's Charon tracker. Test coverage for the analysis functions is thorough.

Issues to address before merge:

incl_supported() acquires a global RwLock and allocates a HashSet on every call — called 2–4× per duty analysis. Should be computed once and passed down.
The events HashMap in run() has no per-duty event cap — Vecs can grow without bound if a peer floods par-sig events before the deleter fires.
analyse_fetcher_failed is missing Go's context.Canceled special-case — parity gap in error classification.

Several minor style and naming issues noted inline.

mskrzypkows · 2026-06-01T09:44:08Z

@claude review my changes, concentrate on major issues

claude · 2026-06-01T09:44:25Z

Claude finished @mskrzypkows's task in 5m 25s —— View job

Code Review — Major Issues

I focused on correctness / parity bugs that change observable behavior versus Charon Go. Two of them look like real regressions, plus a couple of secondary issues worth addressing.

Todo

Read analysis.rs, reporters.rs, metrics.rs, mod.rs, step.rs
Cross-reference against Charon Go (charon/core/tracker/*.go)
Identify major correctness / parity / concurrency issues
Post review

1. BUG: absent-peer logging condition is inverted — `reporters.rs:269`

if self.prev_absent.get(&duty.duty_type) == Some(&absent) {
    if absent.is_empty() {
        tracing::info!(duty = %duty, "All peers participated in duty");
    } else if absent.len() == self.peers.len() {
        tracing::info!(duty = %duty, "No peers participated in duty");
    } else {
        tracing::info!(duty = %duty, absent = ?absent, "Not all peers participated in duty");
    }
}

Go's equivalent at tracker.go:669 uses != (state has changed):

if fmt.Sprint(prevAbsent[duty.Type]) != fmt.Sprint(absentPeers) {
    // log
}

The comment immediately above the Rust block correctly states the intent:

Only log when the absent set changes from the previous duty of this type, to avoid log spam every slot.

…but the code does the opposite: it logs only when the set is unchanged, and suppresses logs when it changes. Concrete consequences:

Scenario	Go behavior	Rust behavior
First duty of a type (prev_absent = None)	logs	does NOT log
State unchanged across duties	suppresses (anti-spam)	spams every slot
State changes (peer flaps in/out)	logs (the useful signal)	suppresses

Flip the condition to != (or != Some(&absent) accounting for the None first-time case — Go's fmt.Sprint(nil) != fmt.Sprint([p1 p2]) always fires on first call).

Fix this →

2. BUG: empty-aggregator / empty-sync-contribution slots inflate the success counter — `mod.rs:417-424` + `reporters.rs:69`

For Aggregator / SyncContribution duties, Go's analyseFetcherFailedAggregator / analyseFetcherFailedSyncContribution returns (false, fetcher, reason{}, nil) when fetchErr == nil — i.e. "no aggregators selected this slot". failedDutyReporter then short-circuits via:

if !failed {
    if step == fetcher {
        return // no metric updates
    }
    // increment expect + success counters
}

Avoiding double-counting these no-op slots, which happen on most slots in production.

The Rust port loses the step information here. analyse_fetcher_failed_aggregator returns None via fetch_err.as_ref()?, which propagates as Option<DutyFailure> = None up to TrackerService::analyse. Then:

// mod.rs:417
let (step, reason, err) = outcome
    .as_ref()
    .map_or((Step::Zero, REASON_UNKNOWN, None), |f| (f.step, f.reason, f.err.as_ref()));

Step is collapsed to Step::Zero, so MetricsDutyReporter::report (reporters.rs:69) misses the short-circuit:

if !failed {
    if step == Step::Fetcher {   // never true — we lost the Fetcher signal
        return;
    }
    TRACKER_METRICS.expect_duties_total[&dt].inc();   // wrongly incremented
    TRACKER_METRICS.success_duties_total[&dt].inc();  // wrongly incremented
    return;
}

Net effect: every slot without an aggregator (the common case) increments success_duties_total{duty="aggregator"} and expect_duties_total{duty="aggregator"} — same for sync_contribution. Dashboards / SLOs computed from these metrics will be wrong.

The Rust table test analyse_fetcher_failed_table correctly asserts analyse_duty_failed returns None for these cases, but it doesn't catch the downstream metric mishandling because the test stops at the analyser.

Suggested fix: either (a) change analyse_duty_failed's return to carry the step even on "not failed" — e.g. (bool, Step, Option<DutyFailure>) like Go's signature — and use it when calling the reporter, or (b) detect this case in TrackerService::analyse by recomputing duty_failed_step before reporting. Option (a) is closer to Go.

Fix this →

3. `OnceLock` cache for `incl_supported()` ignores runtime feature flag changes — `analysis.rs:45-57`

pub(crate) fn incl_supported() -> &'static HashSet<DutyType> {
    static CACHE: OnceLock<HashSet<DutyType>> = OnceLock::new();
    CACHE.get_or_init(|| {
        // reads GLOBAL_STATE.enabled(Feature::AttestationInclusion) once
    })
}

The earlier RwLock+allocation criticism is addressed. But this introduces a subtle behavioral divergence: Go reads inclSupported() per analysis, so any runtime flip of Feature::AttestationInclusion takes effect immediately. The Rust port locks the value to whatever was set at the time of the first call (which may be very early in startup). In Charon, feature flags are typically configured at startup and never mutated, so this is likely safe in practice — but the assumption deserves an explicit comment, and if GLOBAL_STATE mutability is even theoretically supported, this becomes a real divergence. Worth a comment, or use a Once-style invariant that's documented at the GLOBAL_STATE site.

4. `validators_per_duty` is fed straight into `expected_per_peer` — `mod.rs:432` + `reporters.rs:248`

Cross-references fine with Go, but worth flagging that expected_per_peer == validators_per_duty == distinct pubkeys with any event. For an aggregator slot where only some validators were selected (say 2 of 10 attestation duties are aggregator candidates), expected_per_peer reflects only those 2 — not the cluster's full validator count. That matches Go (pubkeyMapLen), so this is parity, not a bug; just worth documenting since the name "expected per peer" reads as "expected validators per peer over time" but actually means "validators that recorded any event for this specific duty".

5. Producer paths block on `send_event` even during shutdown — `mod.rs:206-215`

This was noted in the earlier review and I largely agree it's not a correctness bug (the receiver drop on cancel makes send() return an error). Worth a small confirming comment in send_event so future readers know shutdown is handled via channel-closing rather than explicit cancel-select.

Things I checked and look correct

analyse_duty_failed's step-by-step branching (analysis.rs:156-226) matches Go's analyseDutyFailed arm-for-arm including the early return on ParSigDBExternal with err, the DutyDB → ValidatorAPI step override, and the Bcast / ChainInclusion "missing event" fallbacks.
extract_par_sigs dedup-by-(pubkey, share_idx) and per-pubkey grouping matches Go's extractParSigs.
is_par_sig_event_expected correctly routes Randao→Proposer/BuilderProposer, PrepareAggregator→Attester, PrepareSyncContribution/SyncMessage→SyncContribution.
unsupported_ignorer_state_machine test is thorough including the "supported → failures no longer ignored" transition.
STEPS array in duty_failed_step covers every step from ChainInclusion down to Fetcher, matching Go's for step := sentinel - 1; step > zero; step--.
The report_par_sigs per-pubkey root-count check at reporters.rs:316 (if by_root.len() <= 1) is the correct fix over Go's len(parsigMsgs) <= 1 outer-map check (Go bug). Keep the comment.
Default impls touching global metric state are surprising in Rust but the previous review's note about that still stands.

Summary

Two genuine correctness regressions against Charon — the absent-peer log inversion (#1) is a pure logic flip and easy to fix; the empty-aggregator success-counter inflation (#2) needs a small return-type refactor to plumb the step through. Both will produce visibly wrong output: noisy/missing logs and inflated success rates on dashboards.
| Branch: tracker_full_module

iamquang95 · 2026-06-02T08:37:09Z

+/// configured once at startup and never mutated afterward — matching Go,
+/// which reads the flag on every call but relies on the same invariant.
+pub(crate) fn incl_supported() -> &'static HashSet<DutyType> {
+    static CACHE: OnceLock<HashSet<DutyType>> = OnceLock::new();


Not sure should we cache this or keep this same as charon, decision is up to you.

If it's not modified later I would keep the cache, not to recreate a hashmap few times.

mskrzypkows added 13 commits May 22, 2026 15:26

Corrected DutyDefinitionSet to match Charon implementation

4412e8e

tracker module and reason.rs with errors

52ea465

Merge branch 'main' of github.com:NethermindEth/pluto into step_2

a6c9fc9

Step enum and Tracker trait

97a9791

step to separate file, tracker handlera similar to deadliner

7fc29c1

architecture

Merge branch 'main' of github.com:NethermindEth/pluto into tracker_mo…

a0a1e45

…dule_skeleton

Major issue fixed, async functions in the Tracker trait

47cb887

review corrections

f6a46fb

indexmap fix

fd87328

eth2api: declare indexmap dependency explicitly types.rs references indexmap::IndexSet directly but the crate was not listed in Cargo.toml. Nix masked this via cached build artifacts; clean builds outside nix failed at the pluto-eth2api compile step.

Merge branch 'main' of github.com:NethermindEth/pluto into tracker_fu…

4f1b2f2

…ll_module

Analysis, metrics and reporters tracker modules.

6f0b9a8

additional tests for reporters

e7dd034

mskrzypkows changed the title ~~Tracker full module~~ feat: additional tracker modules May 29, 2026

Merge branch 'main' of github.com:NethermindEth/pluto into tracker_fu…

d47b69b

…ll_module

mskrzypkows changed the title ~~feat: additional tracker modules~~ feat(core): additional tracker modules May 29, 2026

mskrzypkows added 2 commits May 29, 2026 12:51

unused variable

b43c82f

removed bool flag from DutyFailure

8c3d3f4

mskrzypkows commented May 29, 2026

View reviewed changes

mskrzypkows added 6 commits May 29, 2026 15:52

fixed import

052fa51

skipped recreating the hashSet in incl_supported

2ab5763

todo

8678232

Corrected review comments

04a4270

unneeded dependency

f53773e

improved tests

6f2e1a3

mskrzypkows marked this pull request as ready for review June 1, 2026 09:44

mskrzypkows requested a review from iamquang95 June 1, 2026 09:44

mskrzypkows requested review from emlautarom1 and varex83 June 1, 2026 09:45

mskrzypkows added 3 commits June 1, 2026 11:47

unneeded

bb26c5a

Fixed first two bugs found by claude.

0ee9ef6

additional comments

17841be

iamquang95 approved these changes Jun 2, 2026

View reviewed changes

mskrzypkows merged commit c896771 into main Jun 2, 2026
15 checks passed

mskrzypkows deleted the tracker_full_module branch June 2, 2026 11:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(core): additional tracker modules#455

feat(core): additional tracker modules#455
mskrzypkows merged 25 commits into
mainfrom
tracker_full_module

mskrzypkows commented May 29, 2026

Uh oh!

mskrzypkows commented May 29, 2026

Uh oh!

claude Bot commented May 29, 2026 •

edited

Loading

Uh oh!

mskrzypkows left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mskrzypkows commented Jun 1, 2026

Uh oh!

claude Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

iamquang95 Jun 2, 2026

Uh oh!

mskrzypkows Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mskrzypkows commented May 29, 2026

Uh oh!

mskrzypkows commented May 29, 2026

Uh oh!

claude Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — feat: additional tracker modules

Bugs / Correctness

Tests

Code quality / nits

Things I checked and liked

Todo

Uh oh!

mskrzypkows left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mskrzypkows commented Jun 1, 2026

Uh oh!

claude Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review — Major Issues

Todo

1. BUG: absent-peer logging condition is inverted — reporters.rs:269

2. BUG: empty-aggregator / empty-sync-contribution slots inflate the success counter — mod.rs:417-424 + reporters.rs:69

3. OnceLock cache for incl_supported() ignores runtime feature flag changes — analysis.rs:45-57

4. validators_per_duty is fed straight into expected_per_peer — mod.rs:432 + reporters.rs:248

5. Producer paths block on send_event even during shutdown — mod.rs:206-215

Things I checked and look correct

Summary

Uh oh!

iamquang95 Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

mskrzypkows Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude Bot commented May 29, 2026 •

edited

Loading

Code Review — `feat: additional tracker modules`

claude Bot commented Jun 1, 2026 •

edited

Loading

1. BUG: absent-peer logging condition is inverted — `reporters.rs:269`

2. BUG: empty-aggregator / empty-sync-contribution slots inflate the success counter — `mod.rs:417-424` + `reporters.rs:69`

3. `OnceLock` cache for `incl_supported()` ignores runtime feature flag changes — `analysis.rs:45-57`

4. `validators_per_duty` is fed straight into `expected_per_peer` — `mod.rs:432` + `reporters.rs:248`

5. Producer paths block on `send_event` even during shutdown — `mod.rs:206-215`