Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 14 additions & 1 deletion src/core/ucc_global_opts.c
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,8 @@ ucc_global_config_t ucc_global_config = {
.log_buffer_size = 1024,
.log_data_size = 0,
.log_print_enable = 0,
.log_level_trigger = UCC_LOG_LEVEL_FATAL};
.log_level_trigger = UCC_LOG_LEVEL_FATAL,
.modules = {{NULL, 0}, 0}};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Magic numeric literal for modules.mode

The static initializer uses 0 as the value for modules.mode, relying on the assumption that UCC_CONFIG_ALLOW_LIST_ALLOW_ALL == 0. This is an implicit contract with the UCX library's enum ordering. If UCX ever changes this enum (e.g., adds a value before ALLOW_ALL), or if ucc_component_is_enabled is ever called before ucc_config_read processes the default "all" value, all components would be silently disabled because the mode == UCC_CONFIG_ALLOW_LIST_ALLOW_ALL guard would not fire, the empty array would match nothing, and neither the ALLOW nor NEGATE branches would return 1.

Use the named constant to make the intent explicit and safe:

Suggested change
.modules = {{NULL, 0}, 0}};
.log_level_trigger = UCC_LOG_LEVEL_FATAL,
.modules = {{NULL, 0}, UCC_CONFIG_ALLOW_LIST_ALLOW_ALL}};


ucc_config_field_t ucc_global_config_table[] = {
{"LOG_LEVEL", "warn",
Expand Down Expand Up @@ -108,5 +109,17 @@ ucc_config_field_t ucc_global_config_table[] = {
"Log level to trigger error handling.",
ucc_offsetof(ucc_global_config_t, log_level_trigger), UCC_CONFIG_TYPE_ENUM(ucc_log_level_names)},

{"MODULES", "all",
"Comma-separated list of component modules to load.\n"
"Use <name> to match a component in any framework, or\n"
"<framework>_<name> for a specific framework (e.g. tl_cuda).\n"
"Prefix with '^' to negate.\n"
" - all - load all available modules (default).\n"
" - ^cuda - load all modules except cuda (any framework).\n"
" - ^tl_cuda,ec_cuda - load all modules except tl_cuda and ec_cuda.\n"
" - ucp,nccl - load only ucp and nccl.",
ucc_offsetof(ucc_global_config_t, modules),
UCC_CONFIG_TYPE_ALLOW_LIST},

{NULL}
};
2 changes: 2 additions & 0 deletions src/core/ucc_global_opts.h
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,8 @@ typedef struct ucc_global_config {
/* Coll component libraries path */
char *component_path;
char *install_path;
/* Modules filter (allow/negate list) */
ucc_config_allow_list_t modules;
int initialized;
/* Profiling mode */
uint64_t profile_mode;
Expand Down
49 changes: 49 additions & 0 deletions src/utils/ucc_component.c
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,47 @@
#define IFACE_NAME_LEN_MAX \
(UCC_MAX_FRAMEWORK_NAME_LEN + UCC_MAX_COMPONENT_NAME_LEN + 32)

/* Check if a module name from the allow list matches the given component.
Entries with a framework prefix are matched as <framework>_<component>
(e.g. "tl_cuda"), entries without a known framework prefix are matched
against the component name only (e.g. "cuda" matches cuda in any
framework). */
static int ucc_modules_list_search(const ucc_config_names_array_t *names,
const char *framework_name,
const char *component_name)
{
char qualified[UCC_MAX_FRAMEWORK_NAME_LEN +
UCC_MAX_COMPONENT_NAME_LEN + 2];
unsigned i;

ucc_snprintf_safe(qualified, sizeof(qualified), "%s_%s",
framework_name, component_name);

for (i = 0; i < names->count; i++) {
if ((strcmp(names->names[i], qualified) == 0) ||
(strcmp(names->names[i], component_name) == 0)) {
return i;
}
}
return -1;
}
Comment on lines +40 to +48
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 No diagnostic for unrecognised module names in the allow list

If a user makes a typo in UCC_MODULES (e.g., UCC_MODULES=^tl_cudaa), the entry silently fails to match any component — no warning is emitted and the filtering just has no effect. This can be confusing to debug.

Consider emitting a ucc_warn when a ALLOW or NEGATE list is active and none of the entries match any component that was actually found on disk. At minimum, a debug-level note when an entry in the list is never matched during the whole component-loading scan would help users detect typos in their configuration.


static int ucc_component_is_enabled(const char *framework_name,
const char *component_name)
{
ucc_config_allow_list_t *modules = &ucc_global_config.modules;
int found;

if (modules->mode == UCC_CONFIG_ALLOW_LIST_ALLOW_ALL) {
return 1;
}

found = ucc_modules_list_search(&modules->array, framework_name,
component_name) >= 0;
return ((modules->mode == UCC_CONFIG_ALLOW_LIST_ALLOW) && found) ||
((modules->mode == UCC_CONFIG_ALLOW_LIST_NEGATE) && !found);
}

static ucc_status_t ucc_component_load_one(const char *so_path,
const char *framework_name,
ucc_component_iface_t **c_iface)
Expand Down Expand Up @@ -50,6 +91,14 @@ static ucc_status_t ucc_component_load_one(const char *so_path,
ucc_strncpy_safe(iface_struct, so_path + basename_start,
iface_struct_name_len + 1);

if (!ucc_component_is_enabled(framework_name,
iface_struct + strlen(framework_pattern))) {
ucc_debug("component '%s_%s' is disabled by UCC_MODULES configuration",
framework_name, iface_struct + strlen(framework_pattern));
*c_iface = NULL;
return UCC_ERR_NO_MESSAGE;
Comment on lines +94 to +99
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 ALLOW list silently breaks required frameworks

When UCC_MODULES is set to an ALLOW list (no ^ prefix), the filter is applied per-component across all frameworks. The PR's own documented example UCC_MODULES=ucp,nccl will cause UCC initialization to fail because:

  • The cl framework has components basic, hier, doca_urom — none of which match ucp or nccl.
  • All CL components are filtered out → ucc_components_load("cl", ...) returns UCC_ERR_NOT_FOUND.
  • ucc_constructor.c line 138 treats that as fatal: "no CL components were found in the ucc modules dir".
  • The same problem applies to the mc framework, which is also treated as fatal (lines 159–163).

So the documented ucp,nccl example would crash initialization for any installation that has CL components. The negation examples (^cuda, ^tl_cuda,ec_cuda) work correctly because they only exclude matching components. Consider either:

  1. Filtering only the framework(s) whose component names appear in the allow list (cross-framework awareness), or
  2. Documenting that the ALLOW form requires the user to also include all required CL/MC component names (e.g., UCC_MODULES=ucp,nccl,basic,hier,cpu), and removing the ucp,nccl example from the help string until that is clarified.

}
Comment on lines +94 to +100
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Disabled component returns UCC_ERR_NO_MESSAGE, indistinguishable from load failure

Returning UCC_ERR_NO_MESSAGE for a component that is intentionally disabled by the user mixes two semantically different outcomes under the same error code. The same code is already returned when the .so basename doesn't contain the framework pattern (line 83), and when dlopen/dlsym fails (lines 107/128). In all three cases ucc_components_load simply continues.

This isn't a bug today, but it means callers can never distinguish "this component was deliberately filtered out by the user" from "this component failed to load." If in the future a caller wants to emit a summarized diagnostic (e.g., "3 components were disabled, 1 failed to load"), the distinction will be impossible to recover. Using a separate status (e.g., UCC_ERR_NOT_SUPPORTED, or even UCC_OK with *c_iface = NULL) for the intentional-disable case would preserve that information.


handle = dlopen(so_path, RTLD_LAZY);
if (!handle) {
error = dlerror();
Expand Down
Loading