Unsafe dictionary iteration in async context causes RuntimeError during Bitswap file sharing (Universal Connectivity DApp)
#1238
sumanjeet0012 started this conversation in General
Replies: 2 comments
Bitswap.file.sharing.mp4 (attached video)
@seetadev We need to implement certain optimizations in the codebase for file sharing using the Universal Connectivity DApp.
Summary
Using the Bitswap protocol for file sharing (via `MerkleDag.fetch_file()` → `BitswapClient.get_block()`) causes a `RuntimeError: dictionary changed size during iteration` crash. The error occurs because multiple methods across `pubsub.py`, `bitswap/client.py`, `swarm.py`, and `gossipsub.py` iterate over shared dictionaries using live views (`.values()`, `.items()`, `.keys()`, or the dict itself) while concurrent trio tasks, or synchronous callbacks within the same loop body, mutate those same dictionaries.

This is not a Bitswap-specific bug. The unsafe iteration patterns exist in core pubsub and swarm code and can be triggered by any workload that combines concurrent peer churn with message broadcasting; Bitswap file transfers just happen to expose it reliably.
Environment
- py-libp2p `main` (as of Feb 2026)

How to reproduce
1. Share a file from one node (build a `MerkleDag`, store the blocks, publish the CID via pubsub).
2. Fetch the file from another node (`MerkleDag.fetch_file()` → sequential `get_block()` calls).

The crash is non-deterministic but highly reproducible under any peer churn (peers connecting/disconnecting while pubsub messages are being broadcast).
Root cause
In Python, iterating over a dictionary view (`.keys()`, `.values()`, `.items()`, or the dict directly) while another piece of code adds or removes entries raises `RuntimeError`.

With trio's cooperative multitasking, every `await` inside a `for` loop is a yield point: other nursery tasks can run at that exact moment and mutate shared state. Additionally, synchronous callbacks (like exception handlers) can mutate the dict being iterated within the same call frame.
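As a standalone illustration (the names below are placeholders, not the actual py-libp2p code), this trio script reproduces the failure mode: one task iterates a shared dict with an `await` in the loop body, another task removes an entry at that yield point, and the iterator raises `RuntimeError` when it resumes.

```python
import trio

# Shared state standing in for something like Pubsub.peers.
peers = {f"peer{i}": f"stream{i}" for i in range(3)}

async def broadcaster() -> None:
    # Iterates a live view; every await below is a yield point where
    # other tasks may mutate `peers`.
    for stream in peers.values():
        await trio.sleep(0)  # stands in for `await stream.write(...)`

async def churn() -> None:
    await trio.sleep(0)
    del peers["peer1"]  # stands in for dead-peer cleanup

async def main() -> None:
    async with trio.open_nursery() as nursery:
        nursery.start_soon(broadcaster)
        nursery.start_soon(churn)

# Exits with RuntimeError: dictionary changed size during iteration
# (possibly wrapped in an ExceptionGroup, depending on the trio version).
trio.run(main)
```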
Primary crash site: `Pubsub.message_all_peers()`

Two problems here:

- `_handle_dead_peer()` does `del self.peers[peer_id]` synchronously inside the loop body's `except` handler, which directly mutates the dict the `for` loop is iterating over.
- `await stream.write(...)` is a yield point where other concurrent tasks (dead peer handler, new peer handler) can add or remove entries from `self.peers`.

All affected locations
I've identified 13 unsafe iteration patterns across 4 files:
`libp2p/pubsub/pubsub.py` (3 issues)

- `message_all_peers`: `for stream in self.peers.values()`; `_handle_dead_peer()` deletes from `self.peers` mid-iteration; `await` yield points
- `_handle_dead_peer`: `for topic in self.peer_topics` while mutating `self.peer_topics`
- `_teardown_if_connected`: `for _topic, peerset in self.peer_topics.items()` while mutating `self.peer_topics`

`libp2p/bitswap/client.py` (7 issues)

- `_broadcast_wantlist`: `peers = self.host.get_network().connections.keys()`
- `_broadcast_cancel`: `peers = self.host.get_network().connections.keys()`; `await new_stream()` is a yield point
- `_request_block` finally: `await self.cancel_want(cid)` + `del self._pending_requests[cid]`
- `_read_responses_from_stream` finally: `for i, cid in enumerate(self._expected_blocks[peer_id])` + `del self._expected_blocks[peer_id]`
- `_process_blocks_v100`: `for pid in self._expected_blocks`
- `_process_blocks_v110`: `for peer_id in self._expected_blocks`
- `_notify_peers_about_block`: `for peer_id, wantlist in self._peer_wantlists.items()`; `await` yield points in the loop body

`libp2p/network/swarm.py` (2 issues)

- `get_connections`: `for conns in self.connections.values()`
- `connections_legacy`: `for peer_id, conns in self.connections.items()`

`libp2p/pubsub/gossipsub.py` (1 issue)

- `leave`: `for peer in self.mesh[topic]` with `await self.emit_prune(...)`; `await` yield point inside loop

Proposed fix
The fix is straightforward and mechanical: snapshot all shared collections before iterating them using `list()`, and defer mutations until after iteration completes.

Pattern 1: Wrap dict/set views with `list()`
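For illustration, a minimal sketch of this pattern applied to the `message_all_peers()` loop described above (the exact signature is assumed, not copied from the source):

```python
async def message_all_peers(self, msg: bytes) -> None:
    # Unsafe: `for stream in self.peers.values()` iterates a live view.
    # Safe: list() copies the current values before the first await, so
    # concurrent adds/removes of peers cannot invalidate the iterator.
    for stream in list(self.peers.values()):
        await stream.write(msg)
```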
Pattern 2: Defer mutations until after iteration
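A sketch of the same loop with dead-peer handling deferred until the iteration has finished (the exception type and exact signatures here are placeholders):

```python
async def message_all_peers(self, msg: bytes) -> None:
    dead_peers = []
    for peer_id, stream in list(self.peers.items()):
        try:
            await stream.write(msg)
        except Exception:
            # Do not mutate self.peers here; just record the peer.
            dead_peers.append(peer_id)
    # Mutate shared state only after the loop has completed.
    for peer_id in dead_peers:
        self._handle_dead_peer(peer_id)
```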
Pattern 3: Use `pop()` instead of `del` for cleanup
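A sketch of the cleanup pattern in a `finally` block like the one in `_request_block()` (the helper called in the `try` body is a placeholder for the real request logic):

```python
async def _request_block(self, cid: str) -> bytes:
    try:
        return await self._wait_for_block(cid)  # placeholder for the real request path
    finally:
        # `del self._pending_requests[cid]` raises KeyError if another task
        # already removed the entry at an earlier yield point; pop() with a
        # default tolerates concurrent removal.
        self._pending_requests.pop(cid, None)
```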
Pattern 4: Add existence guards after snapshot
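A sketch of the guard applied to a loop like `_process_blocks_v110()` (the signature and the per-peer processing step are assumptions):

```python
async def _process_blocks_v110(self, blocks: list) -> None:
    # Snapshot the keys, then re-check each one: another task may have
    # removed the entry between the snapshot and this iteration step.
    for peer_id in list(self._expected_blocks):
        if peer_id not in self._expected_blocks:
            continue  # removed concurrently; nothing left to do
        expected = self._expected_blocks[peer_id]
        await self._handle_expected_blocks(peer_id, expected, blocks)  # placeholder
```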
Complete list of changes
`libp2p/pubsub/pubsub.py`

- `message_all_peers()`: snapshot `self.peers.values()` and defer dead-peer handling
- `_handle_dead_peer()`: snapshot `self.peer_topics`
- `_teardown_if_connected()`: snapshot `self.peer_topics.items()`

`libp2p/bitswap/client.py`

- `_broadcast_wantlist()` and `_broadcast_cancel()`: snapshot connections
- `_request_block()` finally: use `pop()` instead of `del`, skip broadcast during cleanup
- `_read_responses_from_stream()` finally: snapshot set, use `pop()`
- `_process_blocks_v100()` and `_process_blocks_v110()`: snapshot dict + guard
- `_notify_peers_about_block()`: snapshot + copy

`libp2p/network/swarm.py`

- `get_connections()` and `connections_legacy`: snapshot

`libp2p/pubsub/gossipsub.py`

- `leave()`: snapshot set

Why this matters beyond Bitswap
These patterns affect any concurrent workload on py-libp2p:
- pubsub broadcasting during peer churn → `message_all_peers` crash
- peer disconnects during pubsub cleanup → `_handle_dead_peer` crash
- connection queries during peer churn → `get_connections` / `connections_legacy` crash
- `host.new_stream()` in a loop over connected peers → connection dict mutation

Bitswap file sharing is just the most reliable trigger because it combines sustained pubsub broadcasting with constant peer churn and per-request bookkeeping on the same shared dictionaries.
Additional notes
- The `list()` snapshot pattern is the standard Python approach and has negligible overhead for typical peer counts (tens to low hundreds).
- Because trio is single-threaded and cooperative, `list()` creates the snapshot before any yield point, so no additional locking is needed.

I'm happy to submit a PR with these fixes if the maintainers agree with the approach.