ci(vllm-tensorizer): Separate BuildKit cache slots per matrix variant #164
image-name: vllm-tensorizer
folder: vllm-tensorizer
tag-suffix: ${{ matrix.tag-suffix }}
cache-key: ${{ matrix.tag-suffix }}
The cache key should be the CUDA/Ubuntu version; it shouldn't include the vLLM commit (which is included in the tag-suffix). You won't be building from multiple vLLM versions in a matrix, and if you go forward, you are more than likely not going back, so you would want to use what you can from the last vLLM commit's build's cache, and then go from there.
@JustinPerlman Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/26179428631
@JustinPerlman Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/26180903764
Summary
Pass cache-key: ${{ matrix.tag-suffix }} to the build workflow so each cuda variant of the vllm-tensorizer matrix gets its own BuildKit registry cache slot, instead of fighting over a single shared one.
The problem
Observed in build: the cuda13.2 variant builds in ~3 min while cuda12.9 builds take ~2 hours, even after multiple commits in-PR and with sccache reporting 100% hit rate on cuda12.9.
Root cause:
build.yml computes the BuildKit registry cache reference as ${arch}-${image-name}[-${cache-key}]. We weren't passing cache-key, so both matrix variants pushed to and pulled from the same slot: amd64-vllm-tensorizer.
On each matrix run, both variants pulled the most recently pushed cache. That cache's builder-base layer was built FROM the other cuda's base image, so the digest didn't match and BuildKit had to rebuild every layer from scratch. Once nvcc was invoked, sccache found everything in S3 (hence the 100% hit rate), but fetching tens of thousands of .o files one-by-one from S3 still takes hours.
The fix
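As a rough illustration of the naming scheme described under Root cause, here is a minimal Python sketch of how the cache reference resolves with and without a cache-key. The function cache_ref and the key values cuda12.9 and cuda13.2 are hypothetical stand-ins chosen for this example; the real logic lives in build.yml and may differ in detail.

```python
def cache_ref(arch, image_name, cache_key=None):
    """Mimic the pattern ${arch}-${image-name}[-${cache-key}] (illustrative only)."""
    ref = f"{arch}-{image_name}"
    if cache_key:
        # The suffix is appended only when a cache-key is actually passed.
        ref += f"-{cache_key}"
    return ref


# Hypothetical cache-key values, one per matrix variant.
variants = ["cuda12.9", "cuda13.2"]

# Before: no cache-key is passed, so every variant resolves to the same slot.
shared = {v: cache_ref("amd64", "vllm-tensorizer") for v in variants}

# After: each variant passes its own key and gets its own slot.
separate = {v: cache_ref("amd64", "vllm-tensorizer", v) for v in variants}

print(shared)
print(separate)
```

In the first mapping both variants share amd64-vllm-tensorizer, which is the collision described above. In the second mapping the references differ, so neither variant overwrites the other's cache.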