Skip to content

CORE: various fixes for ucc_mem_map#1306

Open
wfaderhold21 wants to merge 2 commits into
openucx:masterfrom
wfaderhold21:topic/memh-fixes
Open

CORE: various fixes for ucc_mem_map#1306
wfaderhold21 wants to merge 2 commits into
openucx:masterfrom
wfaderhold21:topic/memh-fixes

Conversation

@wfaderhold21
Copy link
Copy Markdown
Collaborator

What

Fix a collection of correctness bugs in the ucc_mem_map/ucc_mem_unmap path and specifically TL/UCP that can lead to failures or memory leaks.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 5, 2026

Greptile Summary

This PR fixes a collection of correctness bugs in the ucc_mem_map/ucc_mem_unmap path and the TL/UCP layer, including a wrong loop variable (ctx->tl_ctx[i]ctx->tl_ctx[j]) in the unmap cleanup, a missing offset stride in the offload-import TL scan, missing NULL guards for tl_data, and dangling-pointer issues after buffer release.

  • tl_ucp_context.c: Offset calculation fix for ucc_tl_ucp_mem_map_offload_import, NULL guard for tl_h->tl_data on import failure, ucp_mem_unmap call for IMPORT_OFFLOAD in unmap, and pointer-zeroing after buffer release in error paths.
  • ucc_context.c: Fixes loop variable bug in failed_mem_map, adds NULL input guards, expands tl_h allocation to max(n_tl_ctx, num_tls), adds structured import-error unmap, and cleans up local_memh->tl_h unconditionally on success. Three bugs remain in the EXPORT_OFFLOAD error path (see inline comments).

Confidence Score: 3/5

The core export path has unfixed bugs in the EXPORT_OFFLOAD error handling that corrupt the caller's memory handle and leak memory.

Three distinct defects remain in the EXPORT_OFFLOAD pack-failure path: the packed_buffers array itself is never freed, per-entry cleanup uses the wrong index space and misses buffers, and failed_mem_map unconditionally frees local_memh which is the caller-owned *memh handle. Any pack failure in EXPORT_OFFLOAD mode will corrupt the caller's handle and leak memory.

src/core/ucc_context.c — specifically the failed_pack and failed_mem_map labels and their interaction with EXPORT_OFFLOAD mode

Important Files Changed

Filename Overview
src/core/ucc_context.c Fixes several correctness bugs in mem_map/mem_unmap, but introduces new issues in the EXPORT_OFFLOAD error path: the packed_buffers array itself is never freed on failure, wrong indices are used to clean up packed buffers in EXPORT_OFFLOAD mode, and failed_mem_map unconditionally frees local_memh (= the caller's *memh).
src/components/tl/ucp/tl_ucp_context.c Fixes offset calculation in offload-import TL scan, adds NULL guard for tl_data on import failure, adds ucp_mem_unmap for IMPORT_OFFLOAD in unmap, and zeroes pointers after releasing buffers in error paths. Changes look correct.

Comments Outside Diff (3)

  1. src/core/ucc_context.c, line 1520-1524 (link)

    P1 packed_buffers array pointer leaked in every error path

    failed_pack frees individual entries via the for j loop, but never calls ucc_free(packed_buffers) for the array itself. Because failed_mem_map can be reached via goto before packed_buffers is even allocated (from the mem-map loop), adding a free there would require a NULL guard. The clean fix is to free the array immediately after the per-entry loop and before falling through to failed_mem_map. This leak is triggered on every goto failed_pack (pack failure, exported_memh alloc failure) in both EXPORT and EXPORT_OFFLOAD modes.

  2. src/core/ucc_context.c, line 1520-1538 (link)

    P1 EXPORT_OFFLOAD error path frees the caller's memory handle and may access tl_h out-of-bounds

    In EXPORT_OFFLOAD mode local_memh = *memh (the user-supplied handle, not a locally allocated one). When packing fails, failed_pack falls through to failed_mem_map which unconditionally calls ucc_free(local_memh) and *memh = NULL, destroying the caller's handle and causing a use-after-free on any subsequent access.

    Additionally, failed_mem_map loops j in [0, ctx->n_tl_ctx) and accesses local_memh->tl_h[j]. For EXPORT_OFFLOAD, local_memh->tl_h only has local_memh->num_tls entries. When ctx->n_tl_ctx > local_memh->num_tls, this produces an out-of-bounds read. Both the ucc_free and the unmap loop should be guarded by if (mode == UCC_MEM_MAP_MODE_EXPORT); EXPORT_OFFLOAD mode doesn't map any memory in this call so there is nothing to unmap on failure.

  3. src/core/ucc_context.c, line 1520-1523 (link)

    P1 EXPORT_OFFLOAD: failed_pack frees buffers at wrong indices, leaking packed data

    For EXPORT_OFFLOAD, memh_pack stores results at packed_buffers[tlh_index] where tlh_index is the exporter slot (0..num_tls-1). The cleanup loop frees packed_buffers[0..i-1] where i is the importer TL-context loop counter — an unrelated index space. Any buffer stored at an exporter slot that doesn't coincide with [0, i) is silently leaked. For example, if the first importer iteration (i=0) resolved to tlh_index=2, then a failure on the second iteration (i=1) leaves packed_buffers[2] unreleased. The cleanup should iterate all slots up to ucc_max(ctx->n_tl_ctx, num_tls) rather than [0, i).

Reviews (2): Last reviewed commit: "TL/UCP: various memh fixes" | Re-trigger Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants