AZP: Build UCXX conda + wheel packages (Phase #2) #11484
Closed
Alexey-Rivkin wants to merge 3 commits into
Wire rapidsai/ucxx tests into the UCX PR pipeline. The new stage runs in
parallel with Coverity, gated by Static_check. It pulls rapidsai/ucxx via
an Azure secondary repository checkout and runs the C++ tests/benchmarks
inside a derived rapidsai/ci-conda image on MLNX agents.
Layout:
* buildlib/dockers/rapidsai-ci-conda.Dockerfile - thin wrapper around
  rapidsai/ci-conda:26.06-latest that adds sudo + opens /opt/conda
  (the stock image fails Azure container job contracts: no sudo, and
  /opt/conda mode 2770 root:conda hides the conda env from the
  Azure-injected step user). Published as
  rdmz-harbor.rdmz.labs.mlnx/ucx/rapidsai-ci-conda:26.06-azp-1
  multi-arch (amd64 + arm64).
* buildlib/azure-pipelines-pr.yml - declare rapidsai/ucxx secondary
  repository checkout (Mellanox-lab endpoint, refs/heads/main).
* buildlib/pr/main.yml - container resources for the CPU stage
  (ucxx_rapidsai_ci_conda) and the GPU stage
  (ucxx_rapidsai_ci_conda_gpu, +DOCKER_OPT_GPU, --user 0:0).
  UCXX_tests stage is invoked via a single template call.
* buildlib/pr/ucxx_tests_stage.yml - one template owns the stage
  definition, the matrix slice list, and the per-slice job body.
  GPU/CPU branching uses ${{ if eq(slice.gpu, ...) }} for the
  job-level config (container, displayName, timeout) and bash runtime
  conditionals (`IS_GPU=${{ slice.gpu }}; if [ "$IS_GPU" = "True" ]`)
  for the body-level differences:
  - CPU: shims missing nvidia-smi in test_common.sh, runs gtest
    with UCX_TLS=tcp,sm,self and GTEST_FILTER=-RMM*.*:CCCL*.*,
    skips ci/test_python.sh.
  - GPU: shims rapids-configure-sccache (py 3.13+ sccache crashes
    on CMake TryCompile scratch dirs), patches python_future_task.h
    to include <unistd.h> (upstream missing include, surfaces on
    newer libstdc++ header chains), and wraps test invocation in
    `sudo -E env ... CUDA_MPS_PIPE_DIRECTORY=/tmp/no-mps-here` so
    the MLNX host MPS daemon does not block the test client.
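
The runtime half of the GPU/CPU branching can be pictured with a small bash sketch. This is not the actual job body from buildlib/pr/ucxx_tests_stage.yml: the `slice_env` helper is hypothetical, and only the variable names and values quoted in the commit message are taken from the source.

```shell
#!/usr/bin/env bash
# Sketch only. In the pipeline, IS_GPU would be filled in by the Azure
# template expression ${{ slice.gpu }}; here it defaults to "False" so
# the script can run on its own.
IS_GPU="${IS_GPU:-False}"

# Hypothetical helper: print the environment settings a slice would use.
slice_env() {
    if [ "$1" = "True" ]; then
        # GPU: point the CUDA client away from the host MPS daemon.
        echo "CUDA_MPS_PIPE_DIRECTORY=/tmp/no-mps-here"
    else
        # CPU: restrict UCX transports and exclude the RMM/CCCL gtest suites.
        echo "UCX_TLS=tcp,sm,self"
        echo "GTEST_FILTER=-RMM*.*:CCCL*.*"
    fi
}

slice_env "$IS_GPU"
```

Running it with `IS_GPU=True` in the environment prints the GPU setting; without it, the CPU settings are printed.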
Matrix:
* CPU (mirrors upstream conda-cpp-build): cuda 12.9.1 + 13.2.0 x
  x86_64 + aarch64, all py 3.11. Confirmed green on Azure.
* GPU (mirrors x86_64 subset of upstream conda-cpp-tests):
  amd64 slices on cuda 13.0.2 py 3.12 and cuda 13.2.0 py 3.13.
  Best-effort under current MLNX MPS+EXCLUSIVE_PROCESS environment;
  reliability across CUDA-only test suites is left as a Phase 1
  follow-up for the UCXX team.
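
Read as a cross product, the matrix above expands to four CPU slices and two GPU slices. The bash sketch below only enumerates them for illustration; the function names and the `key=value` output format are assumptions, not the slice list syntax used in the template.

```shell
#!/usr/bin/env bash
# Sketch only: function names and output format are hypothetical.

# CPU slices: every CUDA version paired with every architecture, py 3.11.
cpu_slices() {
    for cuda in 12.9.1 13.2.0; do
        for arch in x86_64 aarch64; do
            echo "cuda=${cuda} arch=${arch} py=3.11 gpu=False"
        done
    done
}

# GPU slices: the two amd64 combinations named in the commit message.
gpu_slices() {
    echo "cuda=13.0.2 arch=amd64 py=3.12 gpu=True"
    echo "cuda=13.2.0 arch=amd64 py=3.13 gpu=True"
}

cpu_slices
gpu_slices
```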
Mark every non-UCXX-related stage in the PR pipeline with condition: false and override UCXX_tests_stage's dependsOn to [] so the new stage runs immediately on each PR push without the rest of the matrix consuming MLNX agents. Pure scaffolding - revert before Phase 5 merge.
Add UCXX_build stage to the UCX PR pipeline. Builds rapidsai/ucxx conda C++ + Python packages and the libucxx / ucxx / distributed-ucxx wheels against every UCX PR.
* Single buildlib/tools/build_ucxx.sh runner with phase dispatch (conda_cpp / conda_python / wheel_libucxx / wheel_ucxx / wheel_distributed_ucxx), called by 5 thin templates.
* New container resource ucxx_rapidsai_ci_wheel (buildlib/dockers/rapidsai-ci-wheel.Dockerfile) for the wheel jobs; conda jobs reuse the Phase 1 ucxx_rapidsai_ci_conda image.
* Job graph mirrors upstream artifact flow: conda-python-build depends on conda-cpp-build; wheel-ucxx depends on wheel-libucxx (paired by ARCH/CUDA).
* Shared shims for rapids-download-*-from-github, no-op rapids-configure-sccache, and the missing <unistd.h> patch live in build_ucxx.sh so all 5 phases get them once.
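
A single runner with phase dispatch could be structured roughly as follows. This is a minimal sketch, not the contents of buildlib/tools/build_ucxx.sh: the five phase names come from the commit message, while the `run_phase` function and its placeholder bodies are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: run_phase and its echo bodies are placeholders for the
# real build steps, which are not shown in the commit message.
run_phase() {
    case "$1" in
        conda_cpp|conda_python)
            echo "conda build: $1" ;;
        wheel_libucxx|wheel_ucxx|wheel_distributed_ucxx)
            echo "wheel build: $1" ;;
        *)
            echo "unknown phase: $1" >&2
            return 1 ;;
    esac
}

# Dispatch only when a phase name was passed on the command line.
if [ "$#" -gt 0 ]; then
    run_phase "$1"
fi
```

Each thin template would then call the runner with one phase name, so shared setup such as the shims runs from one place.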
Collapsed into PR #11473. Phase 2 commits now part of unified UCXX integration PR.
What?
Add `UCXX_build` stage: conda C++/Python pkgs + libucxx/ucxx/distributed-ucxx wheels from `rapidsai/ucxx`.

Why?
Phase 2 of moving UCXX CI onto UCX Azure pipeline. Builds on Phase 1 `UCXX_tests` stage (#11473).

How?
Shared `buildlib/tools/build_ucxx.sh` runner with phase dispatch, called by 5 thin templates. New `ucxx_rapidsai_ci_wheel` container for wheel builds. Job graph mirrors upstream artifact flow.

Draft - depends on #11473; non-UCXX stages and `Static_check` gating restored before review.