fix(docker): bypass nvcr base-image poisoned cache for cutlass-dsl

z52527 · claude · z52527 · commit 5efdadc1b0d2 · 2026-05-14T23:36:59.000-07:00
nvcr.io/nvidia/pytorch:26.02-py3's pre-populated pip cache contains
an nvcr-built nvidia-cutlass-dsl-libs-base==4.4.1 wheel whose
cute/arch/__init__.py is 9 bytes shorter than PyPI's public 4.4.1
wheel and omits the top-level ProxyKind / SharedSpace re-export
that flash_attn.cute requires. Plain `pip install
'nvidia-cutlass-dsl[cu13]==4.4.1'` hits the bad cached wheel via
pip's extra-resolution code path, even with --no-cache-dir.

Switch to --no-deps with the three cutlass-dsl packages spelled out
explicitly (nvidia-cutlass-dsl, nvidia-cutlass-dsl-libs-base,
nvidia-cutlass-dsl-libs-cu13). That routes pip through the simpler
explicit-args install path, where the cache trap doesn't apply.
Re-pin all three packages on the bundled `pip install` too;
otherwise the dependencies of other packages (quack-kernels,
apache-tvm-ffi) cascade and bump cutlass-dsl to a mismatched newer
minor version.

The verify line `python -c "from cutlass.cute.arch import
ProxyKind, SharedSpace"` fails the build immediately if the
upgrade ever stops taking effect.
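
The verify line checks only that the two symbols import. If the
version cascade described above ever needs its own guard, a check
along the following lines could sit next to it. This is a sketch, not
part of this commit: the function names and the idea of running it as
a separate script are assumptions; only the package names and the
4.4.1 pins come from the Dockerfile.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the Dockerfile in this commit.
PINS = {
    "nvidia-cutlass-dsl": "4.4.1",
    "nvidia-cutlass-dsl-libs-base": "4.4.1",
    "nvidia-cutlass-dsl-libs-cu13": "4.4.1",
}


def find_pin_mismatches(pins, get_version=version):
    """Return {package: installed version or None} for every unmet pin."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


def verify():
    """Hypothetical entry point: exit non-zero if any pin has drifted."""
    bad = find_pin_mismatches(PINS)
    if bad:
        raise SystemExit(f"cutlass-dsl pin mismatch: {bad}")
    # Same symbol check as the Dockerfile verify line.
    from cutlass.cute.arch import ProxyKind, SharedSpace  # noqa: F401
```

Calling `verify()` after the bundled `pip install` would catch both a
bumped package version and the missing re-export, at the cost of one
more build step.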

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Runchu Zhao <zhaorunchu@gmail.com>
diff --git a/docker/Dockerfile b/docker/Dockerfile
@@ -39,21 +39,22 @@ RUN if [ "${TRITONSERVER_BUILD}" = "1" ]; then \
 # Megatron-LM core_v0.13.1: contains d9608004f which gates
 # ChainedOptimizer.count_zeros_fp32 on log_num_zeros_in_grad (~4 ms/step
 # saving on HSTU bf16); 0.12.x always pays the cost.
-#
-# nvidia-cutlass-dsl: base image ships 4.3.0 with a .pth that survives
-# pip uninstall, so we rm -rf the tree before installing 4.4.x (which
-# adds ProxyKind / SharedSpace, required by flash_attn.cute). The final
-# `python -c "from cutlass.cute.arch ..."` fails the build immediately
-# if the upgrade ever stops taking effect.
+# nvidia-cutlass-dsl: --no-deps + --no-cache-dir + 3 explicit subpackages
+# avoids a poisoned base-image pip cache wheel; keep the re-pins on the
+# bundled install too.
 RUN pip uninstall -y nvidia-cutlass-dsl nvidia-cutlass-dsl-libs-base nvidia-cutlass-dsl-libs-cu13 || true && \
     rm -rf /usr/local/lib/python3.12/dist-packages/nvidia_cutlass_dsl* && \
+    pip install --no-cache-dir --no-deps \
+        'nvidia-cutlass-dsl==4.4.1' \
+        'nvidia-cutlass-dsl-libs-base==4.4.1' \
+        'nvidia-cutlass-dsl-libs-cu13==4.4.1' && \
+    python -c "from cutlass.cute.arch import ProxyKind, SharedSpace" && \
     git clone -b core_v0.13.1 https://github.com/NVIDIA/Megatron-LM.git megatron-lm && \
     pip install --no-deps -e ./megatron-lm && \
-    pip install torchx gin-config torchmetrics==1.0.3 typing-extensions iopath pyvers \
-    cloudpickle triton==3.6.0 'nvidia-cutlass-dsl[cu13]==4.4.1' \
-    'quack-kernels>=0.3.3' 'apache-tvm-ffi>=0.1.6' torch-c-dlpack-ext \
-    --no-cache pre-commit && \
-    python -c "from cutlass.cute.arch import ProxyKind, SharedSpace"
+    pip install --no-cache-dir torchx gin-config torchmetrics==1.0.3 typing-extensions iopath pyvers \
+        cloudpickle triton==3.6.0 \
+        'nvidia-cutlass-dsl==4.4.1' 'nvidia-cutlass-dsl-libs-base==4.4.1' 'nvidia-cutlass-dsl-libs-cu13==4.4.1' \
+        'quack-kernels>=0.3.3' 'apache-tvm-ffi>=0.1.6' torch-c-dlpack-ext pre-commit
 
 # -- Layer 3: FBGEMM (long build, own layer for caching) ---
 RUN pip install --no-cache-dir setuptools-git-versioning scikit-build && \