feat(core): additional tracker modules#455
Conversation
Ports analyseDutyFailed, analyseParticipation, extractParSigs, and all supporting helpers from charon/core/tracker/tracker.go. Replaces the TODO in TrackerService::run with the full analyse-on-deadline loop. Known deviation: fetch errors always map to REASON_BUG_FETCH_ERROR because StepError is opaque (Arc<dyn Error>); REASON_FETCH_BN_ERROR requires a typed wrapper to detect beacon-node HTTP errors.
eth2api: declare indexmap dependency explicitly types.rs references indexmap::IndexSet directly but the crate was not listed in Cargo.toml. Nix masked this via cached build artifacts; clean builds outside nix failed at the pluto-eth2api compile step.
|
@claude review |
|
Claude finished @mskrzypkows's task in 6m 8s —— View job Code Review —
|
mskrzypkows
left a comment
There was a problem hiding this comment.
This PR implements the tracker analysis pipeline — analyse_duty_failed, analyse_participation, extract_par_sigs, reporters, and metrics — a substantial and well-structured port of Go's Charon tracker. Test coverage for the analysis functions is thorough.
Issues to address before merge:
incl_supported()acquires a globalRwLockand allocates aHashSeton every call — called 2–4× per duty analysis. Should be computed once and passed down.- The
eventsHashMap inrun()has no per-duty event cap — Vecs can grow without bound if a peer floods par-sig events before the deleter fires. analyse_fetcher_failedis missing Go'scontext.Canceledspecial-case — parity gap in error classification.
Several minor style and naming issues noted inline.
|
@claude review my changes, concentrate on major issues |
|
Claude finished @mskrzypkows's task in 5m 25s —— View job Code Review — Major IssuesI focused on correctness / parity bugs that change observable behavior versus Charon Go. Two of them look like real regressions, plus a couple of secondary issues worth addressing. Todo
1. BUG: absent-peer logging condition is inverted —
|
| Scenario | Go behavior | Rust behavior |
|---|---|---|
| First duty of a type (prev_absent = None) | logs | does NOT log |
| State unchanged across duties | suppresses (anti-spam) | spams every slot |
| State changes (peer flaps in/out) | logs (the useful signal) | suppresses |
Flip the condition to != (or != Some(&absent) accounting for the None first-time case — Go's fmt.Sprint(nil) != fmt.Sprint([p1 p2]) always fires on first call).
2. BUG: empty-aggregator / empty-sync-contribution slots inflate the success counter — mod.rs:417-424 + reporters.rs:69
For Aggregator / SyncContribution duties, Go's analyseFetcherFailedAggregator / analyseFetcherFailedSyncContribution returns (false, fetcher, reason{}, nil) when fetchErr == nil — i.e. "no aggregators selected this slot". failedDutyReporter then short-circuits via:
if !failed {
if step == fetcher {
return // no metric updates
}
// increment expect + success counters
}Avoiding double-counting these no-op slots, which happen on most slots in production.
The Rust port loses the step information here. analyse_fetcher_failed_aggregator returns None via fetch_err.as_ref()?, which propagates as Option<DutyFailure> = None up to TrackerService::analyse. Then:
// mod.rs:417
let (step, reason, err) = outcome
.as_ref()
.map_or((Step::Zero, REASON_UNKNOWN, None), |f| (f.step, f.reason, f.err.as_ref()));Step is collapsed to Step::Zero, so MetricsDutyReporter::report (reporters.rs:69) misses the short-circuit:
if !failed {
if step == Step::Fetcher { // never true — we lost the Fetcher signal
return;
}
TRACKER_METRICS.expect_duties_total[&dt].inc(); // wrongly incremented
TRACKER_METRICS.success_duties_total[&dt].inc(); // wrongly incremented
return;
}Net effect: every slot without an aggregator (the common case) increments success_duties_total{duty="aggregator"} and expect_duties_total{duty="aggregator"} — same for sync_contribution. Dashboards / SLOs computed from these metrics will be wrong.
The Rust table test analyse_fetcher_failed_table correctly asserts analyse_duty_failed returns None for these cases, but it doesn't catch the downstream metric mishandling because the test stops at the analyser.
Suggested fix: either (a) change analyse_duty_failed's return to carry the step even on "not failed" — e.g. (bool, Step, Option<DutyFailure>) like Go's signature — and use it when calling the reporter, or (b) detect this case in TrackerService::analyse by recomputing duty_failed_step before reporting. Option (a) is closer to Go.
3. OnceLock cache for incl_supported() ignores runtime feature flag changes — analysis.rs:45-57
pub(crate) fn incl_supported() -> &'static HashSet<DutyType> {
static CACHE: OnceLock<HashSet<DutyType>> = OnceLock::new();
CACHE.get_or_init(|| {
// reads GLOBAL_STATE.enabled(Feature::AttestationInclusion) once
})
}The earlier RwLock+allocation criticism is addressed. But this introduces a subtle behavioral divergence: Go reads inclSupported() per analysis, so any runtime flip of Feature::AttestationInclusion takes effect immediately. The Rust port locks the value to whatever was set at the time of the first call (which may be very early in startup). In Charon, feature flags are typically configured at startup and never mutated, so this is likely safe in practice — but the assumption deserves an explicit comment, and if GLOBAL_STATE mutability is even theoretically supported, this becomes a real divergence. Worth a comment, or use a Once-style invariant that's documented at the GLOBAL_STATE site.
4. validators_per_duty is fed straight into expected_per_peer — mod.rs:432 + reporters.rs:248
Cross-references fine with Go, but worth flagging that expected_per_peer == validators_per_duty == distinct pubkeys with any event. For an aggregator slot where only some validators were selected (say 2 of 10 attestation duties are aggregator candidates), expected_per_peer reflects only those 2 — not the cluster's full validator count. That matches Go (pubkeyMapLen), so this is parity, not a bug; just worth documenting since the name "expected per peer" reads as "expected validators per peer over time" but actually means "validators that recorded any event for this specific duty".
5. Producer paths block on send_event even during shutdown — mod.rs:206-215
This was noted in the earlier review and I largely agree it's not a correctness bug (the receiver drop on cancel makes send() return an error). Worth a small confirming comment in send_event so future readers know shutdown is handled via channel-closing rather than explicit cancel-select.
Things I checked and look correct
analyse_duty_failed's step-by-step branching (analysis.rs:156-226) matches Go'sanalyseDutyFailedarm-for-arm including the early return onParSigDBExternalwith err, theDutyDB → ValidatorAPIstep override, and theBcast/ChainInclusion"missing event" fallbacks.extract_par_sigsdedup-by-(pubkey, share_idx)and per-pubkey grouping matches Go'sextractParSigs.is_par_sig_event_expectedcorrectly routes Randao→Proposer/BuilderProposer, PrepareAggregator→Attester, PrepareSyncContribution/SyncMessage→SyncContribution.unsupported_ignorer_state_machinetest is thorough including the "supported → failures no longer ignored" transition.STEPSarray induty_failed_stepcovers every step fromChainInclusiondown toFetcher, matching Go'sfor step := sentinel - 1; step > zero; step--.- The
report_par_sigsper-pubkey root-count check atreporters.rs:316(if by_root.len() <= 1) is the correct fix over Go'slen(parsigMsgs) <= 1outer-map check (Go bug). Keep the comment. Defaultimpls touching global metric state are surprising in Rust but the previous review's note about that still stands.
Summary
Two genuine correctness regressions against Charon — the absent-peer log inversion (#1) is a pure logic flip and easy to fix; the empty-aggregator success-counter inflation (#2) needs a small return-type refactor to plumb the step through. Both will produce visibly wrong output: noisy/missing logs and inflated success rates on dashboards.
| Branch: tracker_full_module
| /// configured once at startup and never mutated afterward — matching Go, | ||
| /// which reads the flag on every call but relies on the same invariant. | ||
| pub(crate) fn incl_supported() -> &'static HashSet<DutyType> { | ||
| static CACHE: OnceLock<HashSet<DutyType>> = OnceLock::new(); |
There was a problem hiding this comment.
Not sure should we cache this or keep this same as charon, decision is up to you.
There was a problem hiding this comment.
If it's not modified later I would keep the cache, not to recreate a hashmap few times.
No description provided.