AZP: Build UCXX conda + wheel packages (Phase #2) #11484

Closed
Alexey-Rivkin wants to merge 3 commits into openucx:master from Alexey-Rivkin:ucxx-azure-phase2-on-phase1

Conversation

@Alexey-Rivkin Alexey-Rivkin commented May 24, 2026

What?

Add a UCXX_build stage that builds the conda C++/Python packages and the libucxx / ucxx / distributed-ucxx wheels from rapidsai/ucxx.

Why?

Phase 2 of moving UCXX CI onto the UCX Azure pipeline. Builds on the Phase 1 UCXX_tests stage (#11473).

How?

Shared buildlib/tools/build_ucxx.sh runner with phase dispatch, called by 5 thin templates. New ucxx_rapidsai_ci_wheel container for wheel builds. Job graph mirrors upstream artifact flow.

Draft: depends on #11473. The non-UCXX stages and Static_check gating will be restored before review.

Wire rapidsai/ucxx tests into the UCX PR pipeline. The new stage runs
in parallel with Coverity and is gated by Static_check. It pulls
rapidsai/ucxx via an Azure secondary repository checkout and runs C++
tests/benchmarks inside a derived rapidsai/ci-conda image on MLNX
agents.

Layout:

* buildlib/dockers/rapidsai-ci-conda.Dockerfile - thin wrapper around
  rapidsai/ci-conda:26.06-latest that adds sudo + opens /opt/conda
  (the stock image fails Azure container job contracts: no sudo, and
  /opt/conda mode 2770 root:conda hides the conda env from the
  Azure-injected step user). Published as
  rdmz-harbor.rdmz.labs.mlnx/ucx/rapidsai-ci-conda:26.06-azp-1
  multi-arch (amd64 + arm64).
* buildlib/azure-pipelines-pr.yml - declare rapidsai/ucxx secondary
  repository checkout (Mellanox-lab endpoint, refs/heads/main).
* buildlib/pr/main.yml - container resources for the CPU stage
  (ucxx_rapidsai_ci_conda) and the GPU stage
  (ucxx_rapidsai_ci_conda_gpu, +DOCKER_OPT_GPU, --user 0:0).
  UCXX_tests stage is invoked via a single template call.
* buildlib/pr/ucxx_tests_stage.yml - one template owns the stage
  definition, the matrix slice list, and the per-slice job body.
  GPU/CPU branching uses ${{ if eq(slice.gpu, ...) }} for the
  job-level config (container, displayName, timeout) and bash runtime
  conditionals (`IS_GPU=${{ slice.gpu }}; if [ "$IS_GPU" = "True" ]`)
  for the body-level differences:
    - CPU: shims missing nvidia-smi in test_common.sh, runs gtest
      with UCX_TLS=tcp,sm,self and GTEST_FILTER=-RMM*.*:CCCL*.*,
      skips ci/test_python.sh.
    - GPU: shims rapids-configure-sccache (py 3.13+ sccache crashes
      on CMake TryCompile scratch dirs), patches python_future_task.h
      to include <unistd.h> (upstream missing include, surfaces on
      newer libstdc++ header chains), and wraps test invocation in
      `sudo -E env ... CUDA_MPS_PIPE_DIRECTORY=/tmp/no-mps-here` so
      the MLNX host MPS daemon does not block the test client.
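
The body-level CPU/GPU split described above can be sketched as a small shell function. This is an illustration under stated assumptions, not the template's actual code: `configure_slice` and `RUN_PYTHON_TESTS` are hypothetical names, the slice flag is passed as an argument instead of being expanded from `${{ slice.gpu }}`, and the MPS variable is exported directly where the pipeline passes it through `sudo -E env`.

```shell
# Hypothetical sketch of the per-slice runtime branch.
# In the pipeline the flag comes from template expansion
# (IS_GPU=${{ slice.gpu }}); here it is a function argument so the
# snippet runs on its own.
configure_slice() {
    IS_GPU="$1"
    if [ "$IS_GPU" = "True" ]; then
        # GPU slice: point the test client away from the host MPS daemon.
        export CUDA_MPS_PIPE_DIRECTORY=/tmp/no-mps-here
        # Assumption: only CPU slices skip ci/test_python.sh.
        RUN_PYTHON_TESTS=yes
    else
        # CPU slice: restrict UCX to non-CUDA transports and exclude
        # the RMM and CCCL gtest suites (patterns after '-' are negative).
        export UCX_TLS=tcp,sm,self
        export GTEST_FILTER='-RMM*.*:CCCL*.*'
        RUN_PYTHON_TESTS=no
    fi
}
```

Calling `configure_slice False` before the gtest invocation would leave `UCX_TLS` and `GTEST_FILTER` set for the CPU case; comparing against the literal string `True` matches how Azure renders a boolean template parameter into the script text.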

Matrix:
* CPU (mirrors upstream conda-cpp-build): cuda {12.9.1, 13.2.0} x
  {x86_64, aarch64}, all py 3.11. Confirmed green on Azure.
* GPU (mirrors x86_64 subset of upstream conda-cpp-tests):
  amd64 slices on cuda 13.0.2 py 3.12 and cuda 13.2.0 py 3.13.
  Best-effort under current MLNX MPS+EXCLUSIVE_PROCESS environment;
  reliability across CUDA-only test suites is left as a Phase 1
  follow-up for the UCXX team.
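
For reference, the matrix above expands to four CPU slices and two GPU slices. The enumeration below is only a sketch: the function names and the output format are hypothetical, and the real slice list lives in buildlib/pr/ucxx_tests_stage.yml, where it may be laid out differently.

```shell
# Hypothetical enumeration of the slices listed in the matrix.
list_cpu_slices() {
    # Full cross product of CUDA version and architecture, all on py 3.11.
    for cuda in 12.9.1 13.2.0; do
        for arch in x86_64 aarch64; do
            echo "cpu cuda=$cuda arch=$arch py=3.11"
        done
    done
}

list_gpu_slices() {
    # amd64 only; each slice pairs one CUDA version with one Python version.
    echo "gpu cuda=13.0.2 arch=amd64 py=3.12"
    echo "gpu cuda=13.2.0 arch=amd64 py=3.13"
}

list_cpu_slices
list_gpu_slices
```

The CPU side is a cross product, while the GPU side is an explicit list, which is why the two are written differently.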
@Alexey-Rivkin Alexey-Rivkin changed the title AZP: UCXX Phase 2 - build conda + wheel packages AZP: UCXX - build conda + wheel packages (Phase #2) May 24, 2026
@Alexey-Rivkin Alexey-Rivkin changed the title AZP: UCXX - build conda + wheel packages (Phase #2) AZP: Build UCXX conda + wheel packages (Phase #2) May 24, 2026
@Alexey-Rivkin Alexey-Rivkin force-pushed the ucxx-azure-phase2-on-phase1 branch 8 times, most recently from c359ce8 to 68ce00f Compare May 24, 2026 18:37
Mark every non-UCXX-related stage in the PR pipeline with condition: false
and override UCXX_tests_stage's dependsOn to [] so the new stage runs
immediately on each PR push without the rest of the matrix consuming MLNX
agents. Pure scaffolding - revert before Phase 5 merge.
Add UCXX_build stage to the UCX PR pipeline. Builds rapidsai/ucxx
conda C++ + Python packages and the libucxx / ucxx / distributed-ucxx
wheels against every UCX PR.

* Single buildlib/tools/build_ucxx.sh runner with phase dispatch
  (conda_cpp / conda_python / wheel_libucxx / wheel_ucxx /
  wheel_distributed_ucxx), called by 5 thin templates.
* New container resource ucxx_rapidsai_ci_wheel
  (buildlib/dockers/rapidsai-ci-wheel.Dockerfile) for the wheel jobs;
  conda jobs reuse the Phase 1 ucxx_rapidsai_ci_conda image.
* Job graph mirrors upstream artifact flow: conda-python-build
  depends on conda-cpp-build; wheel-ucxx depends on wheel-libucxx
  (paired by ARCH/CUDA).
* Shared shims for rapids-download-*-from-github, no-op
  rapids-configure-sccache, and the missing <unistd.h> patch live
  in build_ucxx.sh so all 5 phases get them once.
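
A minimal sketch of how a single runner with phase dispatch and shared shims could look. Everything here is an assumption made for illustration: the function names, the placeholder phase bodies, and the PATH-based no-op shim are hypothetical, and the real buildlib/tools/build_ucxx.sh may be organized differently. `phase_requires` only restates the ordering named in the job graph; in the pipeline that ordering belongs to the job dependencies, not to the script.

```shell
#!/usr/bin/env bash
# Hypothetical sketch only; phase bodies are placeholders.

# Shared setup, run before any phase. Assumption: the no-op shim is a
# tiny script placed first on PATH so it shadows the real tool.
setup_shims() {
    SHIM_DIR=$(mktemp -d)
    printf '#!/bin/sh\nexit 0\n' > "$SHIM_DIR/rapids-configure-sccache"
    chmod +x "$SHIM_DIR/rapids-configure-sccache"
    PATH="$SHIM_DIR:$PATH"
    export PATH
}

phase_conda_cpp()              { echo "build conda C++ packages"; }
phase_conda_python()           { echo "build conda Python packages"; }
phase_wheel_libucxx()          { echo "build libucxx wheel"; }
phase_wheel_ucxx()             { echo "build ucxx wheel"; }
phase_wheel_distributed_ucxx() { echo "build distributed-ucxx wheel"; }

# Map a phase name to its function; reject anything else with status 2.
run_phase() {
    case "$1" in
        conda_cpp|conda_python|wheel_libucxx|wheel_ucxx|wheel_distributed_ucxx)
            setup_shims
            "phase_$1"
            ;;
        *)
            echo "unknown phase: '$1'" >&2
            return 2
            ;;
    esac
}

# Print the prerequisite phase named in the description, if any.
phase_requires() {
    case "$1" in
        conda_python) echo conda_cpp ;;
        wheel_ucxx)   echo wheel_libucxx ;;
        *)            : ;;  # none stated in the description
    esac
}

# Dispatch only when a phase is given on the command line.
if [ "$#" -gt 0 ]; then
    run_phase "$1"
fi
```

Under this layout each thin template would reduce to one call such as `build_ucxx.sh wheel_libucxx`, and a fix to a shim would land in one place for all five phases.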
@Alexey-Rivkin Alexey-Rivkin force-pushed the ucxx-azure-phase2-on-phase1 branch from 68ce00f to c356872 Compare May 24, 2026 19:25
@Alexey-Rivkin

Collapsed into PR #11473. The Phase 2 commits are now part of the unified UCXX integration PR.
