Skip to content

p2p/discover/topicindex: fix search deadlock and early close when empty buckets#60

Draft
srene wants to merge 5 commits into
topdiscfrom
fix/search-isdone-termination
Draft

p2p/discover/topicindex: fix search deadlock and early close when empty buckets#60
srene wants to merge 5 commits into
topdiscfrom
fix/search-isdone-termination

Conversation

@srene
Copy link
Copy Markdown
Member

@srene srene commented May 11, 2026

Summary

Fix two related stalls in Search:

  1. IsDone deadlock (closes topicindex/search: IsDone can miss termination — searcher hangs until Close #27): treat no unasked nodes AND no buffered results as the terminating condition. Drop the queriesWithoutNewNodes >= 4 gate and the counter that backed it.

  2. QueryTarget empty-bucket stall (closes topicSearch: QueryTarget stalls when farthest bucket lacks unasked nodes and numRequests==0 #65): move the numRequests == 0 break inside the len(new) > 0 block so the warm-up frontier only halts on a populated unqueried bucket. Empty unqueried buckets are now invisible to the gate.

Supersedes #45.

The IsDone bug

IsDone returned true only if queriesWithoutNewNodes >= 4, after verifying the result buffer and every bucket's new set were empty. The counter only advances when AddNodes runs without finding novel nodes.

But once every bucket's new set is empty:

  • QueryTarget returns nil.
  • No query is dispatched.
  • No response arrives.
  • AddNodes is never called.
  • queriesWithoutNewNodes can never reach 4.

The search loop in v5_topic.go: topicSearch.run() then sat with only <-s.quit progressable:

select {
case <-s.quit:                  // only live case
case queryCh <- nextQuery:      // queryCh nil (no target)
case resp := <-s.queryRespCh:   // no response coming
case resultCh <- result:        // no result to emit
}

iter.Next() blocks indefinitely waiting on the (now silent) resultCh.

The QueryTarget bug

QueryTarget walks the search buckets far → close, collecting buckets with unasked nodes into a withnew pool. After each bucket is added, the original code applied a "warm-up frontier" gate: if numRequests == 0 { break } halted the scan so closer-to-topic buckets weren't probed until the current bucket had received at least one response. This avoids spamming nodes close to the topic before farther rings have been sampled.

The bug was that this gate fired regardless of whether the bucket had any candidates. An empty unqueried farthest bucket — natural state when the routing-table feed is slow to deliver maximally-distant peers — broke the scan at i=0 with withnew = [], returning nil and stalling the search even though closer buckets held unasked candidates.

Fix: move the break inside the len(new) > 0 block. The frontier still halts on populated unqueried buckets (so warm-up semantics and load-spreading intent are preserved), but empty buckets are now invisible to the gate and the scan continues past them.

Change

  • p2p/discover/topicindex/search.go
    • IsDone: returns true as soon as resultBuffer is empty and all b.new are empty.
    • QueryTarget: warm-up frontier gate (break on numRequests == 0) is now nested inside the len(new) > 0 check. Cross-bucket randomization, numRequests field, and math/rand usage are unchanged from upstream.
  • p2p/discover/topicindex/search_test.go
    • TestSearchIsDoneNoUnaskedNodes, TestSearchIsDoneBufferedResults: cover the IsDone fix.
    • TestSearchQueryTargetSkipsEmptyFarBucket: empty bucket 0 + populated bucket 5 → picks from bucket 5.
    • TestSearchQueryTargetPrefersFarthest: bucket 0 with unqueried nodes locks the pool to itself; closer buckets only join after bucket 0 drains.
    • TestSearchQueryTargetWidensFrontierAfterResponse: once bucket 0 has been queried, refilled bucket 0 + populated bucket 5 both appear in picks across many calls.
    • TestSearchQueryTargetRestartsAfterRefill: bucket 0 refill takes priority over remaining bucket 5 candidate.

Test plan

  • go test ./p2p/discover/topicindex/ -run TestSearch passes
  • go test -run TestTopic ./p2p/discover/ passes
  • go vet ./p2p/discover/... clean

…drains

IsDone gated on queriesWithoutNewNodes >= 4 after verifying the unasked
set was empty. That counter only advances when AddNodes runs without
finding novel nodes, but once every bucket's `new` set is empty
QueryTarget returns nil, so no query is ever dispatched, no response
arrives, AddNodes is never called, and the counter cannot reach 4. The
search loop then sits with only s.quit progressable, leaking the
iterator until Close.

Treat "no unasked nodes and no buffered results" as the terminating
condition. Drop queriesWithoutNewNodes — nothing else read it.

Closes #27.
srene added 2 commits May 12, 2026 15:17
The previous QueryTarget implementation tracked a warm-up frontier and
randomized the next pick across buckets that had received at least one
query. When the farthest bucket was empty and numRequests==0, the
frontier could not advance and the search stalled even when closer
buckets held unasked nodes (issue #65).

Replace with a deterministic linear walk: scan buckets far -> close
and return the first unasked node. Each call re-walks from the
farthest bucket so AddNodes responses that land in earlier buckets
are picked up before the search descends to closer ones. Empty
buckets are skipped by construction.

Drop the numRequests counter (no longer read) and the math/rand
import.

Closes #65.
The previous commit's strict far→close drain removed cross-bucket query
spreading. Once a closer bucket became active, farther buckets that
refilled via AddNodes would take over completely, pausing all closer
work until they drained again. In a session where aux responses keep
seeding far buckets faster than they drain, the close-to-topic buckets
(where actual registrants live) could be starved indefinitely.

Restore the cross-bucket random selection across visited buckets, with
the warm-up frontier (numRequests gate) intact. Apply the issue #65 fix
narrowly: move the break inside the `if len(new) > 0` block, so the
frontier only halts on a populated unqueried bucket. Empty unqueried
buckets are now invisible to the gate and the scan continues past them.

Net change vs pre-PR upstream is one line moved (the break) and one
test added (TestSearchQueryTargetWidensFrontierAfterResponse).

Closes #65.
@srene srene changed the title p2p/discover/topicindex: fix Search.IsDone deadlock when unasked set drains p2p/discover/topicindex: fix search deadlock and early close when empty buckets May 12, 2026
srene added 2 commits June 1, 2026 17:37
The test was skipped under #30 (in-flight RPC
cancellation latency), which is fixed by #58. The remaining failure was
a data race in the test itself: it seeded node1's topic table by calling
topicTable.Add directly from the test goroutine, racing with node1's
dispatch loop reading the table via NextExpiryTime.

Seed the table through onDispatchCh instead, so the writes run on the
goroutine that owns the table. The search now converges given the
IsDone/QueryTarget fixes on this branch; passes 10x under -race.
# Conflicts:
#	p2p/discover/topicindex/search_test.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant