p2p/discover/topicindex: fix search deadlock and early close when empty buckets#60
Draft
srene wants to merge 5 commits into
Draft
p2p/discover/topicindex: fix search deadlock and early close when empty buckets#60srene wants to merge 5 commits into
srene wants to merge 5 commits into
Conversation
…drains IsDone gated on queriesWithoutNewNodes >= 4 after verifying the unasked set was empty. That counter only advances when AddNodes runs without finding novel nodes, but once every bucket's `new` set is empty QueryTarget returns nil, so no query is ever dispatched, no response arrives, AddNodes is never called, and the counter cannot reach 4. The search loop then sits with only s.quit progressable, leaking the iterator until Close. Treat "no unasked nodes and no buffered results" as the terminating condition. Drop queriesWithoutNewNodes — nothing else read it. Closes #27.
This was referenced May 12, 2026
Open
The previous QueryTarget implementation tracked a warm-up frontier and randomized the next pick across buckets that had received at least one query. When the farthest bucket was empty and numRequests==0, the frontier could not advance and the search stalled even when closer buckets held unasked nodes (issue #65). Replace with a deterministic linear walk: scan buckets far -> close and return the first unasked node. Each call re-walks from the farthest bucket so AddNodes responses that land in earlier buckets are picked up before the search descends to closer ones. Empty buckets are skipped by construction. Drop the numRequests counter (no longer read) and the math/rand import. Closes #65.
The previous commit's strict far→close drain removed cross-bucket query spreading. Once a closer bucket became active, farther buckets that refilled via AddNodes would take over completely, pausing all closer work until they drained again. In a session where aux responses keep seeding far buckets faster than they drain, the close-to-topic buckets (where actual registrants live) could be starved indefinitely. Restore the cross-bucket random selection across visited buckets, with the warm-up frontier (numRequests gate) intact. Apply the issue #65 fix narrowly: move the break inside the `if len(new) > 0` block, so the frontier only halts on a populated unqueried bucket. Empty unqueried buckets are now invisible to the gate and the scan continues past them. Net change vs pre-PR upstream is one line moved (the break) and one test added (TestSearchQueryTargetWidensFrontierAfterResponse). Closes #65.
The test was skipped under #30 (in-flight RPC cancellation latency), which is fixed by #58. The remaining failure was a data race in the test itself: it seeded node1's topic table by calling topicTable.Add directly from the test goroutine, racing with node1's dispatch loop reading the table via NextExpiryTime. Seed the table through onDispatchCh instead, so the writes run on the goroutine that owns the table. The search now converges given the IsDone/QueryTarget fixes on this branch; passes 10x under -race.
# Conflicts: # p2p/discover/topicindex/search_test.go
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fix two related stalls in
Search:IsDonedeadlock (closes topicindex/search: IsDone can miss termination — searcher hangs until Close #27): treat no unasked nodes AND no buffered results as the terminating condition. Drop thequeriesWithoutNewNodes >= 4gate and the counter that backed it.QueryTargetempty-bucket stall (closes topicSearch: QueryTarget stalls when farthest bucket lacks unasked nodes and numRequests==0 #65): move thenumRequests == 0break inside thelen(new) > 0block so the warm-up frontier only halts on a populated unqueried bucket. Empty unqueried buckets are now invisible to the gate.Supersedes #45.
The IsDone bug
IsDonereturned true only ifqueriesWithoutNewNodes >= 4, after verifying the result buffer and every bucket'snewset were empty. The counter only advances whenAddNodesruns without finding novel nodes.But once every bucket's
newset is empty:QueryTargetreturns nil.AddNodesis never called.queriesWithoutNewNodescan never reach 4.The search loop in
v5_topic.go: topicSearch.run()then sat with only<-s.quitprogressable:iter.Next()blocks indefinitely waiting on the (now silent)resultCh.The QueryTarget bug
QueryTargetwalks the search buckets far → close, collecting buckets with unasked nodes into awithnewpool. After each bucket is added, the original code applied a "warm-up frontier" gate:if numRequests == 0 { break }halted the scan so closer-to-topic buckets weren't probed until the current bucket had received at least one response. This avoids spamming nodes close to the topic before farther rings have been sampled.The bug was that this gate fired regardless of whether the bucket had any candidates. An empty unqueried farthest bucket — natural state when the routing-table feed is slow to deliver maximally-distant peers — broke the scan at
i=0withwithnew = [], returning nil and stalling the search even though closer buckets held unasked candidates.Fix: move the
breakinside thelen(new) > 0block. The frontier still halts on populated unqueried buckets (so warm-up semantics and load-spreading intent are preserved), but empty buckets are now invisible to the gate and the scan continues past them.Change
p2p/discover/topicindex/search.goIsDone: returns true as soon asresultBufferis empty and allb.neware empty.QueryTarget: warm-up frontier gate (breakonnumRequests == 0) is now nested inside thelen(new) > 0check. Cross-bucket randomization,numRequestsfield, andmath/randusage are unchanged from upstream.p2p/discover/topicindex/search_test.goTestSearchIsDoneNoUnaskedNodes,TestSearchIsDoneBufferedResults: cover the IsDone fix.TestSearchQueryTargetSkipsEmptyFarBucket: empty bucket 0 + populated bucket 5 → picks from bucket 5.TestSearchQueryTargetPrefersFarthest: bucket 0 with unqueried nodes locks the pool to itself; closer buckets only join after bucket 0 drains.TestSearchQueryTargetWidensFrontierAfterResponse: once bucket 0 has been queried, refilled bucket 0 + populated bucket 5 both appear in picks across many calls.TestSearchQueryTargetRestartsAfterRefill: bucket 0 refill takes priority over remaining bucket 5 candidate.Test plan
go test ./p2p/discover/topicindex/ -run TestSearchpassesgo test -run TestTopic ./p2p/discover/passesgo vet ./p2p/discover/...clean