Add Kahn's algorithm as parallel BFS traversal option in visitParallel#351
Add Kahn's algorithm as parallel BFS traversal option in visitParallel#351openroad-ci wants to merge 19 commits into
Conversation
… level. version 1 Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
|
Document to be deleted- currently kept only to track the changes |
Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements Kahn's algorithm for topological traversal as an alternative to the existing level-based parallel BFS in OpenSTA. The goal is to eliminate per-level barriers and improve thread utilization during timing analysis. The implementation includes a functional specification, Tcl plumbing for a new design-level toggle (sta_use_kahns_bfs), and the core logic within the BfsIterator class. Feedback focuses on potential performance bottlenecks: the fine-grained scheduling of individual vertices into the dispatch queue may introduce excessive overhead, and the current visitor implementation performs redundant enqueuing into the level-based queue that is unnecessary when Kahn's algorithm is active.
| dispatch_queue_->dispatch([&process, succ](size_t t) { | ||
| process(succ, t); | ||
| }); |
There was a problem hiding this comment.
Dispatching every single ready vertex as a separate task into the DispatchQueue can lead to significant overhead and lock contention, especially for large designs with many vertices having low in-degrees. This fine-grained scheduling might negate the performance benefits of eliminating per-level barriers. Consider processing a chain of ready vertices in the current thread or batching ready vertices before dispatching to amortize the task management cost.
| process = [&, bfs_index, pred, in_deg_size](Vertex *vertex, | ||
| size_t tid) { | ||
| vertex->setBfsInQueue(bfs_index, false); | ||
| visitors[tid]->visit(vertex); | ||
| total_visited.fetch_add(1, std::memory_order_relaxed); | ||
| kahnForEachSuccessor(vertex, pred, [&](Vertex *succ) { | ||
| VertexId sid = graph_->id(succ); | ||
| if (sid < in_deg_size && in_deg[sid] >= 0) { | ||
| int prev = in_degree[sid] | ||
| .fetch_sub(1, std::memory_order_acq_rel); | ||
| if (prev == 1) { | ||
| // Successor is now ready -- dispatch immediately. | ||
| dispatch_queue_->dispatch([&process, succ](size_t t) { | ||
| process(succ, t); | ||
| }); | ||
| } | ||
| } | ||
| }); | ||
| }; |
There was a problem hiding this comment.
The ArrivalVisitor::visit method (called via visitors[tid]->visit(vertex) at line 391) unconditionally calls search_->arrivalIterator()->enqueueAdjacentVertices(...). When Kahn's algorithm is active, this enqueues successors into the level-based queue_, which is redundant because Kahn's Stage 3 already handles successor propagation via kahnForEachSuccessor. These redundant enqueue calls involve mutex locking and vector operations, adding unnecessary overhead. Furthermore, the queue_ is cleared at the end of visitParallel (line 427), meaning this work is entirely wasted. Consider gating the enqueueAdjacentVertices call in the visitor when Kahn's is enabled.
There was a problem hiding this comment.
The enqueueAdjacentVertices call in ArrivalVisitor::visit is not unconditional — it is guarded by always_to_endpoints_ || arrivals_changed and is skipped entirely when arrivals do not change. The call also does not generally acquire a mutex: BfsIterator::enqueue runs an unlocked !vertex->bfsInQueue(bfs_index_) check up front, and in Kahn's mode every active vertex has that flag set during Phase 1 discovery, so for active successors enqueue returns on the flag read without acquiring queue_lock_. And the cleanup in visitParallel only clears levels up to to_level -- entries beyond that are deliberately preserved for subsequent calls, so enqueues onto those levels are not wasted.
That said, you are right that some redundant work is paid on the Kahn's path. For each visited vertex, enqueueAdjacentVertices iterates out-edges and evaluates adj_pred_->searchThru/searchTo on each;
Kahn's Stage 3 then iterates the same out-edges again via kahnForEachSuccessor. The successor enqueue calls are fast-path no-ops for active successors, but the edge iteration and predicate evaluation cost
is paid twice. For a ~100K-vertex graph with out-degree ~3, that is on the order of tens of milliseconds per visitParallel. We are not applying the gate today because (a) it has to be designed not to drop beyond-to_level enqueues, which are correct behavior and required for incremental correctness; (b) the cleanest version is a scoped in_kahns_traversal_ flag plus a check on the successor's bfsInQueue state, which adds complexity in exchange for the edge-iteration and predicate-evaluation savings only; and (c) I am yet to profile to show this as a hot spot. Keeping the visitor oblivious to which BFS algorithm is above it preserves a useful abstraction boundary -- the visitor is shared with the non-Kahn's path.
If profiling on larger designs surfaces this as material wall-time overhead, the targeted fix is well-scoped and we should revisit. This is consistent with the roadmap items already documented in the spec for reducing per-vertex overhead.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 907d5f8c64
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| check_crpr_(new CheckCrpr(this)) | ||
| { | ||
| initVars(); | ||
| arrival_iter_->setKahnPred(search_adj_); |
There was a problem hiding this comment.
Preserve loop-aware predicate for Kahn arrival discovery
Use a loop-aware predicate when enabling Kahn traversal for arrivals: arrival_iter_->setKahnPred(search_adj_) binds a SearchAdj that was constructed with tag_bldr_ == nullptr, but SearchAdj::searchThru() only re-enables disabled loop edges via hasPendingLoopPaths() when a tag builder is present. In multi-threaded runs where sta_use_kahns_bfs is enabled (default) and sta_dynamic_loop_breaking is on, Kahn discovery/propagation will skip loop-qualified edges that the original ArrivalVisitor path (adj_pred_) can traverse, which can leave parts of the valid arrival cone unvisited and produce incorrect arrival/slack updates.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Hmm this is a pretty narrow scoped issue, but a bu nonetheless. We cannot use Kahn's BFS when sta_dynamic_loop_breaking is on. This feature relies on arrival tags that only emerge during propagation to decide whether an otherwise-disabled loop edge can be traversed. Kahn's needs the active subgraph and in-degrees known before propagation begins, so it cannot consult those tags. To avoid silently missing vertices, visitParallel should guard the Kahn's path with an explicit check on variables_->dynamicLoopBreaking() and falls back to the original level-based BFS whenever dynamic loop breaking is active.
Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
|
@codex review |
|
To use Codex here, create a Codex account and connect to github. |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces Kahn's algorithm for topological traversal as an alternative to the existing level-based parallel BFS in OpenSTA. The goal is to eliminate per-level synchronization barriers, allowing threads to process vertices as soon as their dependencies are met. The implementation includes a functional specification, persistent state management to minimize allocation overhead, and Tcl integration for runtime toggling. Feedback suggests further optimizing performance by moving the active_vertices vector into the persistent KahnState to avoid repeated allocations during frequent incremental updates.
| kahn_state_->resetPrevious(); | ||
|
|
||
| std::vector<int> &in_deg = kahn_state_->in_degree_init; | ||
| std::vector<Vertex*> active_vertices; |
There was a problem hiding this comment.
Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2daf7a9bb7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| level = saved_first; | ||
| while (levelLessOrEqual(level, saved_last) | ||
| && levelLessOrEqual(level, to_level)) { | ||
| queue_[level].clear(); | ||
| incrLevel(level); |
There was a problem hiding this comment.
Preserve visitor-enqueued vertices when clearing Kahn levels
After Kahn traversal finishes, this loop clears every queued level in the original [saved_first, saved_last] window without checking whether new vertices were enqueued during visitor->visit(). ArrivalVisitor::enqueueRefPinInputDelays() can enqueue vertices that are not part of the precomputed Kahn active set, so they are removed here before being visited; because queue_[level].clear() does not reset bfsInQueue, those vertices can also remain permanently marked as in-queue and be skipped by future enqueue() calls. This can drop required arrival propagation in designs using reference-pin input delays.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
I was able to create a regression to catch this issue too. Implemented a fix now
There was a problem hiding this comment.
Post-Kahn cleanup was queue_[level].clear(), which dropped pointers enqueued during Stage 2 by ArrivalVisitor::enqueueRefPinInputDelays without clearing their bfsInQueue flag - leaving it stuck at true and silently no-op'ing every subsequent enqueue() for those vertices. Fix: replaced clear() with a filter that keeps entries whose bfsInQueue is still true (extracted into helper BfsIterator::dropProcessedEntries), so ref-pin-enqueued vertices survive to the next call as seeds.
| kahnForEachSuccessor(vertex, kahn_pred_, | ||
| [&](Vertex *succ) { | ||
| if (!levelLessOrEqual(succ->level(), to_level)) | ||
| return; | ||
| VertexId sid = graph_->id(succ); |
There was a problem hiding this comment.
Honor clks_only frontier when discovering Kahn active vertices
The Kahn discovery phase unconditionally follows graph successors, but findClkArrivals() relies on ArrivalVisitor with clks_only=true to stop propagation at register clock endpoints (it postpones fanout instead of enqueueing it). Because discovery runs before visitor logic and ignores that stop condition, fanout beyond reg clock pins is still added to active_vertices and visited in this call, which can turn a clock-only update into broad/full arrival propagation and regress runtime/state behavior for clock-latency flows.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Level-BFS stops at register CK via postponeClkFanouts, but Kahn's Stage 1 walked successors unconditionally, discovering and visiting the full data fanout past every reg CK on clock-only passes (findClkArrivals).
Fix: added virtual VertexVisitor::stopDiscoveryAtRegClk() returning false by default, overridden in ArrivalVisitor to return clks_only_. When true, Stage 1 discovery continues past vertex->isRegClk() (and Stage 2 worker mirrors the skip for efficiency), narrowing Kahn's active set to match level-BFS exactly - Kahn's itself still runs, only the reg-CK fanout is excluded.
…stead of immediate dispatch-when-ready Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
…l].clear() which dropped vertices that ArrivalVisitor::enqueueRefPinInputDelays had enqueued during Stage 2 without clearing their bfsInQueue flag leaving it stuck at true and permanently suppressing all future enqueue() calls for those vertices. Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
…marked as in-queue and be skipped by future enqueue() Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
There was a problem hiding this comment.
How about using .md format for this document?
There was a problem hiding this comment.
I had this initially but moved to txt as I was using Microsoft Copilot to create the func spec from that. But now that I am using Claude Cowork to make the spec, I will make it into md again.
|
@dsengupta0628 Looks like a big changes on STA core algorithm. |
It is just a different type of BFS, but sits on the same BFS iterator- so not too big a difference in architecture. Tom is also getting feedback from James regarding this. If he agrees, we can push this upstream too. If not, we just need to merge carefully in the future. |
Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
…it_skip Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7979ce01d9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| pred_changed[vid].store(in_deg[vid] == 0 ? 1 : 0, | ||
| std::memory_order_relaxed); |
There was a problem hiding this comment.
Preserve seed invalidations when visit-skip is enabled
The visit-skip initialization currently keys only on in_deg == 0, so a vertex that was explicitly seeded (queued as invalid) but has an active-set predecessor starts with pred_changed=0 and can be skipped entirely if that predecessor reports no change. In incremental/batched edits that invalidate both a fanin and its fanout, this drops the mandatory recomputation of the fanout and can leave stale timing state. Seed membership needs to be tracked independently of in-degree so all original queued seeds are forced to run visit().
Useful? React with 👍 / 👎.
Replaces the per-level barrier model in BfsIterator::visitParallel with
Kahn's topological traversal for both forward (arrival) and backward
(required-time) propagation. Enabled by default via use_kahns_ flag;
original level-based BFS retained as a fallback.
The original visitParallel processes vertices one level at a time with a
finishTasks() barrier between levels. Uneven level populations leave
threads idle at barriers. Kahn's removes that: a vertex becomes eligible
for processing as soon as all its predecessors are done, so vertices at
different levels execute concurrently.
Key changes:
include/sta/Bfs.hh: Add kahnForEachSuccessor virtual (forward out-edges,
backward in-edges), persistent KahnState, use_kahns_ toggle,
kahn_pred_ (separate from search_pred_), resetLevelBounds helper.
search/Bfs.cc: Two-stage Kahn's implementation -- single-threaded
discovery with flat in-degree array indexed by vertex ID, then
recursive-dispatch parallel traversal (each ready vertex dispatched
as its own DispatchQueue task; single finishTasks at end, no batch
barriers). KahnState arrays persist across calls to avoid
re-allocation.
search/Search.cc: Wire search_adj_ as kahn_pred_ on arrival and
required iterators in the Search constructor.
Bugs caught by regressions and fixed:
Vertex IDs can exceed vertexCount() after cell deletions
(ObjectTable block-based IDs). Fix: dynamic array growth during
discovery + bounds checks. Caught by rmp.gcd_restructure.
Arrival iterator is constructed with search_pred_ = nullptr -- the
original BFS never uses it, the visitor provides its own filter.
Kahn's discovery needed the filter upfront. Fix: separate kahn_pred_
wired in Search constructor; falls back to original BFS if null.
Caught by rmp.gcd_restructure.