Skip to content

feat: relay server missing features#354

Merged
mskrzypkows merged 19 commits intomainfrom
claude/issue-129-20260313-1533
Apr 28, 2026
Merged

feat: relay server missing features#354
mskrzypkows merged 19 commits intomainfrom
claude/issue-129-20260313-1533

Conversation

@mskrzypkows
Copy link
Copy Markdown
Contributor

@mskrzypkows mskrzypkows commented Apr 24, 2026

github-actions Bot and others added 6 commits March 13, 2026 15:51
- Start Prometheus monitoring server on configured monitoring_addr
- Handle relay swarm events: ReservationReqAccepted/Denied, ReservationTimedOut,
  CircuitReqAccepted/Denied, CircuitClosed — with structured logging
- Track connection metrics (connection_total, active_connections) via RELAY_METRICS
  on SwarmEvent::ConnectionEstablished and ConnectionClosed
- Log version, git hash, and build time on startup using pluto_core::version
- Make RelayMetrics fields public; add PeerWithPeerClusterLabels::new() constructor
- Remove #[allow(missing_docs)] from p2p.rs; add module doc comment
- Add vise-exporter to regular (non-dev) dependencies
- Update serve_addr_metrics test: remove #[ignore], test monitoring endpoint,
  check for prometheus output

Closes #129

Co-authored-by: Bohdan Ohorodnii <varex83@users.noreply.github.com>
- Dereference LazyLock for VERSION in tracing macro: %*VERSION
- Use inc_by(1)/dec_by(1) instead of inc()/dec() on vise::Gauge
- Add .. to ReservationReqDenied and CircuitReqDenied patterns for missing status field
- Merge pluto_p2p imports and reorder metrics imports alphabetically
- Reformat ListenerClosed, ReservationReqAccepted, ReservationReqDenied match arms per rustfmt

Co-authored-by: Bohdan Ohorodnii <varex83@users.noreply.github.com>
Manually merges changes from main branch commit e41e6d9:
- Add crates/p2p/src/bootnode.rs and crates/p2p/src/relay.rs
- Add crates/p2p/examples/bootnode.rs
- Update p2p peer.rs: add Default for MutablePeer, addr_infos_from_p2p_addrs
- Update p2p gater.rs: remove Arc wrapper from relay peers
- Update p2p utils.rs: add multi_addrs_via_relay helper
- Update p2p lib.rs: export bootnode and relay modules
- Update p2p Cargo.toml: add backon, reqwest, url deps
- Fix crypto share indexing: use 1-indexed keys throughout
- Remove MathError enum, replace with direct DivisionByZero variant
- Fix cluster test_cluster.rs: use 1-indexed share lookup
- Fix app obolapi: add FailedToConvertShareIndexToU8/MathOverflow errors
- Add app/src/utils.rs with archive helpers
- Add core/src/parasigdb/memory.rs
- Update Cargo.toml and Cargo.lock with flate2, tar deps
- Update relay_server example: simplify TCP addr, add docblock

Co-authored-by: Bohdan Ohorodnii <varex83@users.noreply.github.com>
Manual merge of 4e68ef9 into this branch (shallow clone prevents git merge).

- Add crates/core/src/signeddata.rs and signeddata testdata
- Add pub mod signeddata to crates/core/src/lib.rs
- Update crates/core/Cargo.toml with signeddata deps
- Update crates/crypto/Cargo.toml: remove pluto-core dep (avoids cycle)
- Update crates/crypto/src/tblsconv.rs: remove core conversions (avoids cycle)
- Add/update crates/eth2api/ spec types for signeddata
- Add/update crates/eth2util/src/types.rs
- Update Cargo.lock

Co-authored-by: Lautaro Emanuel <emlautarom1@users.noreply.github.com>
@mskrzypkows
Copy link
Copy Markdown
Contributor Author

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Apr 24, 2026

Claude finished @mskrzypkows's task in 4m 39s —— View job


PR Review

  • Read changed files and gather context
  • Review crates/core/src/parasigdb/memory.rs (new file)
  • Review crates/eth2api/src/spec/ssz_types.rs (new file)
  • Review crates/relay-server/src/p2p.rs
  • Review crates/relay-server/src/web.rs
  • Review crates/relay-server/src/metrics.rs
  • Review crates/cli/src/commands/relay.rs
  • Review crates/p2p/src/utils.rs
  • Review crates/relay-server/Cargo.toml
  • Post final review

Summary

This PR adds a Prometheus monitoring server for the relay, refactors swarm event handling, makes relay metrics accessible, adds SSZ type wrappers for the eth2 API, and adds a new parasigdb/memory.rs file. The relay-server improvements are solid overall, but there are some critical structural issues with the parasigdb module addition.


Critical Issues

crates/core/src/parasigdb/memory.rs — Orphaned file, will not compile

The new file lives in crates/core/src/parasigdb/memory.rs (parasigdb with an extra a), but the existing module is parsigdb (no a). Several problems:

  1. Not declared as a module anywhere. There is no mod.rs in the parasigdb/ directory, no mod parasigdb in any lib.rs, and no pub mod memory anywhere that references this file. The file is unreachable dead code.

  2. References a non-existent submodule. Line 8 imports:

    use crate::parasigdb::metrics::PARASIG_DB_METRICS;

    But crates/core/src/parasigdb/ only contains memory.rs — there is no metrics.rs. The existing metrics.rs lives in parsigdb/ (no a) and exports PARSIG_DB_METRICS (not PARASIG_DB_METRICS).

  3. References a non-existent test file. At the bottom:

    #[path = "memory_internal_test.rs"]
    mod memory_internal_test;

    memory_internal_test.rs exists in crates/core/src/parsigdb/, not in parasigdb/.

  4. MemDBError drops SignedDataError. The existing parsigdb/memory.rs has SignedDataError(#[from] SignedDataError) in MemDBError and calls sig.signed_data.message_root().map_err(...). The new file calls sig.signed_data.message_root() without handling a potential error — if message_root() can return an Err, this will fail to compile or silently drop errors depending on the API.

This file appears to be an in-progress replacement that was committed prematurely. It needs either a proper parasigdb/mod.rs + metrics.rs + integration into core/lib.rs, or to be removed until ready. Fix this →


Moderate Issues

crates/relay-server/src/web.rs:251-252 — Magic dummy port numbers in ENR handler

(Some((ip, tcp_p)), None) => (ip, tcp_p, 9999), // Dummy UDP port
(None, Some((ip, udp_p))) => (ip, 9999, udp_p), // Dummy TCP port

Port 9999 is used as a dummy when only one transport is available, with no explanation and no validation. The ENR will be published with a port that doesn't actually accept connections. Consider returning an error instead (consistent with the (None, None) case below), or documenting why a dummy value is acceptable here. Fix this →

crates/core/src/parasigdb/memory.rs:291unwrap() after is_none() check (less idiomatic)

if deadliner_rx.is_none() {
    warn!("Deadliner channel is not available");
    return;
}
let mut deadliner_rx = deadliner_rx.unwrap();

The existing parsigdb/memory.rs correctly uses let Some(mut deadliner_rx) = self.deadliner.c() else { ... }. The new file reverts to a less idiomatic pattern that would trigger a Clippy warning. Fix this →

crates/relay-server/src/p2p.rsnetwork_sent_bytes_total / network_received_bytes_total never populated

The RelayMetrics struct now has network_sent_bytes_total and network_received_bytes_total fields (public), but handle_swarm_event only updates connection_total and active_connections. The byte-count metrics are never incremented and will always read zero. Either wire them up from SwarmEvent::OutboundProbeOk/similar events, or remove them from the struct until they are implemented. Fix this →


Minor Issues

crates/core/src/parasigdb/memory.rs:343-345keys_by_duty accumulates duplicate keys

inner
    .keys_by_duty
    .entry(k.duty.clone())
    .or_insert_with(Vec::new)
    .push(k.clone());   // always pushes, even if k is already present

Every new signature for a given (duty, pubkey) pushes the same key into the keys_by_duty Vec again. With N validators signing the same duty+pubkey, the key appears N times, so trim() calls entries.remove(&key) N times (all but the first are no-ops). This wastes memory and is shared with the existing parsigdb/memory.rs. A HashSet<Key> or a pre-existence check would be cleaner.

crates/relay-server/src/p2p.rs:185ListenerClosed has no log entry

NewListenAddr and ExpiredListenAddr both log at debug!, but ListenerClosed produces no log. This makes it hard to diagnose listener lifecycle in production. Fix this →

crates/eth2api/src/spec/ssz_types.rs — Missing From<SszVector<T, SIZE>> for Vec<T>

SszList has both From<Vec<T>> and From<SszList<T,MAX>> for Vec<T>, but SszVector only has From<Vec<T>> with no inverse conversion. For consistency, add:

impl<T, const SIZE: usize> From<SszVector<T, SIZE>> for Vec<T> {
    fn from(value: SszVector<T, SIZE>) -> Self { value.0 }
}

crates/cli/src/commands/relay.rs:402-415serve_addr_enr_ext_host test may be timing-dependent

The test asserts enr.ip().unwrap().is_loopback(), relying on the external host resolver not having run yet when the request is made. Since www.google.com resolves quickly and the resolver interval could tick before the test request, this could be flaky in slow CI or if the resolver's first tick runs immediately. Consider using a hostname guaranteed not to resolve (e.g. invalid.invalid) or controlling the resolver interval in tests.


Positive Aspects

  • Clean AddrUpdate enum in p2p.rs decouples address management from the event loop — good pattern.
  • Monitoring server is properly integrated with graceful shutdown and matches the ENR server pattern.
  • Version/git-hash logging on startup is a valuable operational improvement.
  • PeerWithPeerClusterLabels::new() constructor is a clean addition.
  • vise-exporter promoted from dev-deps to deps — correct since it's now used in production code.
  • SSZ type wrappers (SszList, SszVector, BitList, BitVector) are well-implemented with correct TreeHash semantics and good test coverage.
  • Good integration tests in relay.rs covering the relay lifecycle, ENR, multiaddrs, external IP/host, and metrics endpoints.

@mskrzypkows mskrzypkows changed the title Claude/issue 129 20260313 1533 feat: relay server missing features Apr 24, 2026
@mskrzypkows
Copy link
Copy Markdown
Contributor Author

@claude review again

@claude
Copy link
Copy Markdown

claude Bot commented Apr 28, 2026

Claude finished @mskrzypkows's task in 5m 43s —— View job


PR Review (re-review)

  • Read changed files via git diff
  • Review crates/p2p/src/bandwidth.rs (new file)
  • Review crates/p2p/src/p2p.rs
  • Review crates/relay-server/src/p2p.rs
  • Review crates/relay-server/src/metrics.rs
  • Review crates/relay-server/src/web.rs
  • Review crates/cli/src/commands/relay.rs

Summary

This re-review reflects the current state of the branch. The PR is substantially improved from before — the previous issues around parasigdb and SSZ types are gone. The current changes focus on bandwidth tracking (crates/p2p/src/bandwidth.rs), relay swarm event handling, monitoring server, and granular P2PError variants. There is one critical bug and a few moderate/minor issues.


Critical Issues

crates/relay-server/src/p2p.rs:55-63 — Double listen on TCP addresses

Node::new_server (line 32) calls apply_config internally, which already calls swarm.listen_on for all configured TCP addresses (see crates/p2p/src/p2p.rs:364). Then run_relay_p2p_node immediately calls listen_on again for the same TCP addresses:

for tcp_addr in config.p2p_config.tcp_multiaddrs()? {
    node.listen_on(tcp_addr)?;   // ← already done inside new_server via apply_config
}

This creates a second listener on the same port. Depending on timing and OS behavior:

  • The second bind fails (EADDRINUSE) → SwarmEvent::ListenerErrorSwarmEvent::ListenerClosed fires → the address is removed from listeners → ENR has no addresses.
  • Or it creates a duplicate listener → two NewListenAddr events → duplicate entries in the listeners Vec → duplicate addresses in ENR/multiaddr responses.

The UDP loop on lines 60-63 is correct and new — apply_config skips UDP for NodeType::TCP nodes. Only the TCP loop is the problem. Fix: remove the TCP listen_on loop (lines 55-58); keep the UDP one. Fix this →


Moderate Issues

crates/relay-server/src/web.rs:152-161monitoring_server silently swallows parse errors

pub async fn monitoring_server(monitoring_addr: String, ct: CancellationToken) {
    let Ok(bind_addr) = monitoring_addr.parse::<SocketAddr>() else {
        warn!("Failed to parse monitoring address: {monitoring_addr}");
        return;   // ← returns (), caller cannot detect failure
    };

The function returns () and callers (run_relay_p2p_node) cannot distinguish a successfully-running monitoring server from one that silently failed to start due to a bad address. Consider returning Result<()> and propagating the error upward so misconfiguration is caught at startup. Fix this →

crates/p2p/src/bandwidth.rs:56-65peer_cluster always empty; diverges from Charon

pub fn for_peer(peer_id: &PeerId) -> Self {
    let label = PeerLabel {
        peer: peer_id.to_string(),
        peer_cluster: String::new(),   // ← always empty
    };

In Charon, peer_cluster is populated by monitorConnections via peerinfo protocol lookups. Pluto always emits peer_cluster="", so dashboards/alerts that group by cluster will never work. The doc comment in p2p.rs acknowledges this, but it's worth flagging explicitly: the metric schema is in place, but the cluster dimension is permanently dark until peerinfo is implemented.


Minor Issues

crates/relay-server/src/p2p.rs:270SwarmEvent::ListenerError silently dropped

The _ => AddrUpdate::None wildcard catches SwarmEvent::ListenerError, which will be the observable symptom of the double-listen bug above (and any future listener failures). Consider explicitly matching and logging it:

SwarmEvent::ListenerError { listener_id, error } => {
    warn!(?listener_id, ?error, "listener error");
    AddrUpdate::None
}

Fix this →

crates/cli/src/commands/relay.rs — Metrics test only verifies name presence, not counter values

assert!(body.contains("relay_p2p_network_sent_bytes"));
assert!(body.contains("relay_p2p_network_receive_bytes"));

These substring checks pass as long as vise emits # HELP/# TYPE lines for the registered metric families, even when all counters are zero (no peers connected during the test). The test doesn't validate that the transport layer actually increments the counters. This is fine as a smoke test, but worth noting that functional correctness of byte counting is untested.


Positive Aspects

  • AddrUpdate enum cleanly decouples swarm event handling from address-list management — nice pattern.
  • build_server_swarm correctly constructs a manual transport stack (bypassing SwarmBuilder's convenience API) to insert PeerBandwidthTransport — the design approach is right.
  • PeerBandwidthTransport correctly wraps both inbound (poll_inbound) and outbound (poll_outbound) muxer substreams; try_from(...).unwrap_or(u64::MAX) for the byte count cast is safe.
  • Relay circuit/reservation events (ReservationReqAccepted, CircuitAccepted, CircuitClosed, etc.) are now properly logged and handled — significant operational improvement.
  • Granular P2PError variants (FailedToConfigureNoise, FailedToConfigureTcp, etc.) are far more useful than the single FailedToBuildSwarm they replace.
  • Monitoring server integrates cleanly with CancellationToken and matches the ENR server shutdown pattern.
  • PeerWithPeerClusterLabels::new() constructor and making fields pub are clean ergonomic improvements.
  • vise-exporter promoted to deps is correct since it's now used in production code.

Comment thread crates/p2p/src/bandwidth.rs Outdated
@mskrzypkows mskrzypkows merged commit ae0f152 into main Apr 28, 2026
8 checks passed
@mskrzypkows mskrzypkows deleted the claude/issue-129-20260313-1533 branch April 28, 2026 11:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants