Skip to content

TL/UCP: Update topo aware ring algorithm #1288

Open
Juee14Desai wants to merge 2 commits into
openucx:masterfrom
Juee14Desai:ucc-ring-algo
Open

TL/UCP: Update topo aware ring algorithm #1288
Juee14Desai wants to merge 2 commits into
openucx:masterfrom
Juee14Desai:ucc-ring-algo

Conversation

@Juee14Desai
Copy link
Copy Markdown
Collaborator

What

Add topology aware multi ring algorithms for allgather, reduce_scatter, and allreduce in TL/UCP. The ring algorithms use team->cuda_ring to route data along NVLink optimal paths with up to 8 parallel rings, instead of the default single ring.

Why ?

The default ring algorithms use a flat rank ordering that does not account for the underlying GPU interconnect topology. On multi GPU systems with NVLink, this results in suboptimal data routing transfers may traverse slower paths instead of direct NVLink links.

How ?

Allgather:

  • Rewritten to derive ring rank, peer, and block indices from the cuda_ring topology pattern instead of flat team rank.
  • Each of the up to 8 rings transfers its own sub block slice concurrently.
  • Algorithm auto selected for CUDA memory >4KB when cuda_ring is available via dynamic score string in allgather.c.
  • Service allgather decoupled into dedicated service_allgather_ring_start/progress functions in tl_ucp_service_coll.c so internal collectives continue using the flat rank ring.
Count Size (bytes) UCC master Time avg (us) UCC master BW avg (GB/s) UCC this PR Time avg (us) UCC this PR BW avg (GB/s)
1048576 4194304 1318.36 47.72 1107.44 56.81
2097152 8388608 2527.03 49.79 1355.50 92.83
4194304 16777216 4946.76 50.87 1738.13 144.79
8388608 33554432 9760.98 51.56 2051.20 245.38
16777216 67108864 19394.07 51.90 3189.02 315.66
33554432 134217728 38630.25 52.12 5770.31 348.90
67108864 268435456 77089.72 52.23 10886.80 369.85

Reduce_scatter:

  • Rewritten to use cuda_ring for multi ring topology aware transfers.
  • Each ring handles its own sub block slice with per ring GPU reductions via the executor before forwarding to the next peer.
  • Scratch buffer management simplified to a single ucc_mc_alloc/free per task lifetime.
Count Size (bytes) UCC master Time avg (us) UCC master BW avg (GB/s) UCC this PR Time avg (us) UCC this PR BW avg (GB/s)
16777216 67108864 4500.15 13.98 9472.58 6.64
33554432 134217728 4724.07 26.64 6875.58 18.30
67108864 268435456 6985.83 36.02 7527.26 33.43
134217728 536870912 11992.22 41.97 8311.25 60.56
268435456 1073741824 22032.55 45.69 10223.47 98.46
536870912 2147483648 42097.26 47.82 14125.28 142.53
1073741824 4294967296 82387.74 48.87 23308.91 172.75

Allreduce:

  • Monolithic implementation that fuses reduce_scatter and allgather into a single task/progress function, avoiding schedule overhead.
  • Phase 0 receives into scratch, reduces with the local dst block via GPU executor, then forwards the accumulated result. Phase 1 performs an in-place ring allgather.
  • Tagged send/recv counters are reset at the phase transition.
  • Auto selected for CUDA memory >4KB via dynamic score string in allreduce.c.
Count Size (bytes) UCC master Time avg (us) UCC master BW avg (GB/s) UCC this PR Time avg (us) UCC this PR BW avg (GB/s)
1048576 4194304 N/A N/A 999.99 7.86
2097152 8388608 N/A N/A 1095.41 14.36
4194304 16777216 N/A N/A 1290.22 24.38
8388608 33554432 N/A N/A 4255.08 14.79
16777216 67108864 N/A N/A 921.83 136.50
33554432 134217728 N/A N/A 1796.86 140.05
67108864 268435456 N/A N/A 3513.99 143.23

@Juee14Desai
Copy link
Copy Markdown
Collaborator Author

/build

@wfaderhold21
Copy link
Copy Markdown
Collaborator

@Juee14Desai @janjust I assume this will replace PR #1258 ?

@janjust
Copy link
Copy Markdown
Collaborator

janjust commented Mar 24, 2026

We talked about this, it shouldn't need to. If they are both good to go then let's have both ring and topo-aware ring and we can phase one out as needed.

@Juee14Desai Juee14Desai force-pushed the ucc-ring-algo branch 3 times, most recently from b80e124 to 9472c37 Compare March 31, 2026 05:48
@Juee14Desai Juee14Desai force-pushed the ucc-ring-algo branch 2 times, most recently from bd96bad to e827813 Compare April 1, 2026 20:51
@Juee14Desai
Copy link
Copy Markdown
Collaborator Author

/build

@janjust janjust requested a review from MamziB April 2, 2026 15:43
@Sergei-Lebedev Sergei-Lebedev added the ai-review start ai-review label Apr 2, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 2, 2026

Greptile Summary

This PR rewrites the allgather, reduce_scatter, and allreduce ring algorithms to use team->cuda_ring for topology-aware multi-ring transfers over NVLink, and decouples the service allgather into its own flat-ring path. The CUDA sysinfo fallback gains a primary-context scan when cuCtxGetCurrent fails.

  • P1 — algorithm regression for non-CUDA teams: the new guard if (algo_num == RING && !cuda_ring) algo_num = KNOMIAL silently changes the default for all odd-size non-CUDA teams from the flat ring to knomial, which is not a transparent replacement in terms of performance or semantics.
  • P1 — missing count-divisibility check in allgather: allgather_ring_init_common lacks the count % tsize != 0 → UCC_ERR_NOT_SUPPORTED guard that reduce_scatter_ring_init_common correctly adds; uneven counts will silently truncate the transfer.

Confidence Score: 3/5

Not safe to merge as-is: the allgather count-divisibility gap can silently produce incorrect results, and the default-algorithm change regresses all odd-size non-CUDA teams.

Two P1 issues in the core algorithm paths — a missing validation that silently corrupts data and an unintentional behavioral change that affects non-GPU workloads — prevent a clean merge despite the overall correctness of the topo-aware ring logic for the targeted CUDA+NVLink case.

allgather/allgather.c (algo fallback logic) and allgather/allgather_ring.c (missing divisibility guard).

Important Files Changed

Filename Overview
src/components/tl/ucp/allgather/allgather.c Score string generation rewritten to prefer the topo-aware ring for CUDA; introduces an unintended algorithm change (RING → KNOMIAL) for odd-size non-CUDA teams when cuda_ring is NULL.
src/components/tl/ucp/allgather/allgather_ring.c Fully rewritten to use cuda_ring topology pattern and multiple parallel rings; missing count-divisibility guard (present in the parallel reduce_scatter path) and the counting sentinel is undocumented.
src/components/tl/ucp/reduce_scatter/reduce_scatter_ring.c Old bidirectional-schedule implementation replaced by a simpler single-task topo-aware ring; non-persistent non-in-place reduction writes back into src buffer (flagged in prior thread); missing equality assertion between send/recv posted counters.
src/components/tl/ucp/reduce_scatter/reduce_scatter.c New score_str_get function added to enable dynamic algo selection; pattern is consistent with the allgather counterpart.
src/components/tl/ucp/tl_ucp_service_coll.c Service allgather correctly decoupled into dedicated flat-ring start/progress functions so internal collectives are unaffected by the topo-aware rewrite.
src/components/topo/cuda/ucc_sysinfo_cuda.c Adds a fallback to scan primary CUDA contexts when cuCtxGetCurrent fails; logic and goto use are correct; adds a debug log for visibility.
src/components/tl/ucp/tl_ucp_task.h Adds scratch_mc_header fields to reduce_scatter_ring and reduce_scatterv_ring task structs; straightforward and consistent.
src/components/tl/ucp/tl_ucp_coll.c Switches reduce_scatter default descriptor from static string to dynamic str_get_fn; correct use of the existing pattern.
src/components/tl/ucp/reduce_scatter/reduce_scatter.h Exports new ring init/progress symbols and updates default algo select string macro to accept a printf argument; consistent with usage.

Comments Outside Diff (1)

  1. src/components/tl/ucp/allgather/allgather_ring.c, line 255-283 (link)

    P1 Missing count-divisibility check — silent data truncation

    reduce_scatter_ring_init_common (the parallel path introduced in the same PR) guards against total_count % tsize != 0 and returns UCC_ERR_NOT_SUPPORTED. The new allgather_ring_init_common has no such check.

    When args->dst.info.count is not divisible by ucc_ring_pattern_size(ring, 0), the integer division block_count = count / tsize silently truncates and the last count % tsize elements are never transferred, producing an incorrect allgather result with no error reported.

    tsize = ucc_ring_pattern_size(ring, 0);
    if (count % tsize != 0) {
        tl_error(UCC_TASK_LIB(task),
                 "allgather ring: count %zu not divisible by tsize %u",
                 count, tsize);
        return UCC_ERR_NOT_SUPPORTED;
    }

Reviews (5): Last reviewed commit: "TL/UCP: topo aware ring algo for reduce_..." | Re-trigger Greptile

Comment thread src/components/tl/ucp/allreduce/allreduce_ring.c Outdated
Comment on lines +157 to +160

send_idx = ucc_ring_pattern_get_send_block(ring, ring_id,
rrank, step);
ring_offset = ucc_buffer_block_offset(block_cnt, nrings, ring_id);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Missing send_posted == recv_posted assertion

The allgather ring progress function asserts task->tagged.send_posted == task->tagged.recv_posted at the top of its loop (since sends and recvs are always posted in pairs). The allreduce ring's RS-phase loop has the send_posted > 0 / recv_posted > 0 guards but omits the equality check, making it harder to catch a bookkeeping bug early. Consider adding the same assertion for consistency and diagnosability.

Comment thread src/components/tl/ucp/reduce_scatter/reduce_scatter.c
Comment thread src/components/tl/ucp/allgather/allgather.c
Comment thread src/components/tl/ucp/allgather/allgather_ring.c Outdated
Comment thread src/components/tl/ucp/allgather/allgather_ring.c Outdated
Comment thread src/components/tl/ucp/allreduce/allreduce_ring.c Outdated
Comment thread src/components/tl/ucp/allreduce/allreduce_ring.c Outdated
Comment thread src/components/tl/ucp/allreduce/allreduce_ring.c Outdated
Comment thread src/components/tl/ucp/allreduce/allreduce_ring.c Outdated
Comment on lines +82 to +84
reduce_target = PTR_OFFSET(
sbuf,
(recv_block * block_cnt + ring_offset) * dt_size);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Non-in-place operation silently corrupts the source buffer

For non-in-place reduce_scatter, sbuf = args->src.info.buffer. On every non-final step the reduction writes back into that same buffer:

reduce_target = PTR_OFFSET(sbuf, (recv_block * block_cnt + ring_offset) * dt_size);

The subsequent send loop then reads from the now-mutated sbuf location to forward the intermediate partial sum. The old implementation accumulated partial results in the dedicated s_scratch send buffer and left sbuf read-only. Callers with separate source and destination buffers (i.e., any non-in-place reduce_scatter over CUDA memory when cuda_ring is present) will have their source data silently overwritten.

The simplest safe fix is to reject non-in-place in reduce_scatter_ring_init_common (analogous to the UCC_IS_PERSISTENT guard already there):

if (!UCC_IS_INPLACE(*args)) {
    return UCC_ERR_NOT_SUPPORTED;
}

Or alternatively, allocate a separate per-ring work buffer for intermediate reductions instead of reusing sbuf.

@Juee14Desai
Copy link
Copy Markdown
Collaborator Author

/build

2 similar comments
@Juee14Desai
Copy link
Copy Markdown
Collaborator Author

/build

@janjust
Copy link
Copy Markdown
Collaborator

janjust commented Apr 30, 2026

/build

@Juee14Desai
Copy link
Copy Markdown
Collaborator Author

/build

@Juee14Desai
Copy link
Copy Markdown
Collaborator Author

/build

Replace the default ring allgather with a topo aware
multi ring implementation that uses team->cuda_ring to route data
along NVLink optimal paths (up to 8 parallel rings).

Algorithm changes:
- Ring rank, peer, and block indices are now derived from the
  cuda_ring topology pattern instead of flat team rank ordering.
- Each ring transfers its own slice of each block, enabling
  concurrent data movement across multiple NVLink paths.
- Algorithm auto selected for CUDA memory >4KB when cuda_ring
  is available; falls back to knomial otherwise.

Also fixes CUDA primary context detection in ucc_sysinfo_cuda.c
and decouples the service allgather from the topo aware ring.

Signed-off-by: Juee Himalbhai Desai <jueehimalbha@nvidia.com>
Replace the default ring reduce_scatter with a topo aware
multi ring implementation that uses team->cuda_ring to route data
along NVLink optimal paths (up to 8 parallel rings).

Algorithm changes:
- Ring rank, peer, and block indices are now derived from the
  cuda_ring topology pattern instead of flat team rank ordering.
- Each ring handles its own sub block slice, with per ring GPU
  reductions via the executor before forwarding to the next peer.
- Scratch buffer management simplified to a single mc_alloc/free
  per task lifetime (removed fragmentation logic).

Signed-off-by: Juee Himalbhai Desai <jueehimalbha@nvidia.com>
@Juee14Desai
Copy link
Copy Markdown
Collaborator Author

/build

Comment on lines 76 to +79
}
}

if (algo_num == UCC_TL_UCP_ALLGATHER_ALG_RING && !team->cuda_ring) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Algorithm regression for odd-size non-CUDA teams

When cuda_ring is NULL (any CPU team, or GPU team without NVLink topology), odd-size teams previously used the flat RING algorithm. After this guard, they fall through to KNOMIAL. This silently changes the default for every odd-size non-CUDA workload, including large CPU clusters.

The underlying reason is that allgather_ring_init_common now hard-fails with UCC_ERR_NOT_SUPPORTED when cuda_ring == NULL, so the score-string must avoid selecting it. However, the correct fallback for the "no-topology-info" case is still the flat ring (old behavior), not knomial. Consider keeping the flat-ring algorithm alive under a separate name/enum and using it as the fallback, or only switching to KNOMIAL when the caller is guaranteed to benefit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai-review start ai-review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants