Update docs and preserve runtime LD_LIBRARY_PATH in entrypoints

fluidnumericsJoe · claude · fluidnumericsJoe · commit 01842e344ed8 · 2026-03-21T13:39:26.000-04:00
Update README and CLAUDE.md to reflect the consolidated cuda/
directory, new none/ environment, and current tagging scheme.

Fix entrypoints in cuda and rocm Dockerfiles to save
LD_LIBRARY_PATH before Spack activation and append the saved
value afterwards, so GPU driver libraries injected by
enroot/pyxis are not clobbered.
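
A minimal sketch of the entrypoint logic, runnable outside the
container. It assumes the Spack activate script overwrites
LD_LIBRARY_PATH; `fake_activate` and both library paths are
hypothetical stand-ins, not taken from the images.

```shell
#!/bin/sh
# Hypothetical stand-in for /opt/spack-environment/activate.sh.
# Assumption: activation overwrites LD_LIBRARY_PATH instead of prepending.
fake_activate() {
    LD_LIBRARY_PATH="/opt/views/view/lib"   # hypothetical view path
    export LD_LIBRARY_PATH
}

# Simulate a driver library directory injected by the container runtime.
LD_LIBRARY_PATH="/usr/local/nvidia/lib64"   # hypothetical runtime path

# Same three steps as the generated /entrypoint.sh: save, activate, append.
_pre_spack_ldlp="${LD_LIBRARY_PATH}"
fake_activate
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}${_pre_spack_ldlp}"

echo "${LD_LIBRARY_PATH}"
```

The Spack view stays first in the search order and the
runtime-injected directory follows it. If activation prepends
rather than overwrites, the saved entries simply appear twice.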

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -1,16 +1,16 @@
 # Repository Guidelines
 
 ## Project Structure & Module Organization
-Source files live under `envs/<cpu>/<gpu>/`, where each leaf directory owns a `spack.yaml` manifest and (optionally) a generated `Dockerfile`. Keep CPU targets (`x86`, …) and accelerator targets (`gfx90a`, `sm72`, `none`) granular so images stay purpose-built, and limit the root `README.md` to high-level context.
+Source files live under `envs/<cpu>/<gpu>/`, where each leaf directory owns a `spack.yaml` manifest and a `Dockerfile`. Current environments are `envs/x86/rocm` (AMD GPUs), `envs/x86/cuda` (NVIDIA GPUs, parameterised by `CUDA_ARCH` and `CUDA_VERSION` build args), and `envs/x86/none` (CPU-only). Keep CPU targets (`x86`, …) and accelerator targets (`rocm`, `cuda`, `none`) as separate directories so images stay purpose-built, and limit the root `README.md` to high-level context.
 
 ## Build, Test, and Development Commands
-- `spack spec -e envs/x86/gfx90a/spack.yaml` — concretizes the manifest locally; run this before opening a PR so dependency drift is caught early.
-- `spack containerize envs/x86/gfx90a/spack.yaml > envs/x86/gfx90a/Dockerfile` — regenerates the Dockerfile after manifest edits (avoid hand-tuning output).
-- `docker build -f envs/x86/gfx90a/Dockerfile -t selfish:gfx90a .` — builds the shareable runtime image; tag images `<cpu>-<gpu>` for clarity.
-- `docker run --rm selfish:gfx90a spack find hdf5` — smoke-tests that the expected view was installed inside the image.
+- `docker build -f envs/x86/rocm/Dockerfile --build-arg GPU_ARCH=gfx90a -t selfish:x86-rocm-gfx90a .` — builds the ROCm image for a specific AMD GPU arch.
+- `docker build -f envs/x86/cuda/Dockerfile --build-arg CUDA_ARCH=70 --build-arg CUDA_VERSION=12.4 -t selfish:x86-cuda-sm70 .` — builds the CUDA image for a specific NVIDIA GPU arch.
+- `docker build -f envs/x86/none/Dockerfile -t selfish:x86-none .` — builds the CPU-only image.
+- `docker run --rm selfish:x86-none spack find hdf5` — smoke-tests that the expected view was installed inside the image.
 
 ## Coding Style & Naming Conventions
-Spack YAML uses 2-space indentation, lowercase keys, and quoted constraint strings (`"target=x86_64_v3"`). Group `specs` alphabetically, keep `packages` overrides sorted by scope, and rely on multiline `RUN` blocks with trailing `\` alignment plus brief comments for non-obvious workarounds. Name new environments after the hardware tuple (`x86/gfx942`, `x86/none`) so downstream scripts can glob predictably.
+Spack YAML uses 2-space indentation, lowercase keys, and quoted constraint strings (`"target=x86_64_v3"`). Group `specs` alphabetically, keep `packages` overrides sorted by scope, and rely on multiline `RUN` blocks with trailing `\` alignment plus brief comments for non-obvious workarounds. Name new environments after the hardware tuple (`x86/cuda`, `x86/none`) so downstream scripts can glob predictably.
 
 ## Testing Guidelines
 For each environment change, run `spack spec` followed by `spack install --fail-fast` inside a disposable builder container to verify concretization. Container builds must pass `docker build` locally before review; capture the last ~20 lines for the PR description. When adding MPI/HDF5 variants, run `docker run --rm <tag> mpichversion` (or another representative binary) to prove runtime availability. There is no coverage gate, but every new spec should ship with at least one build log, and GitHub Actions now double-checks gfx90a builds and publishes them to `higherordermethods/selfish`.
diff --git a/README.md b/README.md
@@ -10,59 +10,76 @@ The core SELF team at Fluid Numerics has adopted enroot+pyxis with Slurm for our
 See [Repository Guidelines](CLAUDE.md) for contributor expectations, build commands, and review checklists.
 
 
-More docs coming soon
+## Organization
 
+The `envs/` directory defines the base environments. Each one builds a base image containing all the dependencies required to develop SELF. The directory layout is `envs/{cpu_platform}/{gpu_backend}`.
 
-## Organization
+| Directory | GPU backend | Build args | MPI |
+|-----------|-------------|------------|-----|
+| `envs/x86/rocm/` | AMD ROCm | `GPU_ARCH` (e.g. `gfx90a`), `GPU_BACKEND_VERSION` (e.g. `6.4.3`) | OpenMPI |
+| `envs/x86/cuda/` | NVIDIA CUDA | `CUDA_ARCH` (e.g. `70`, `100`), `CUDA_VERSION` (e.g. `12.4`, `13.0`) | OpenMPI |
+| `envs/x86/none/` | None (CPU-only) | — | OpenMPI |
 
-The `envs/` subdirectory defines all of the base environments that are aimed at providing base images with all the dependencies required for developing SELF. The subdirectory structure is as `envs/{cpu_platform}/{gpu_backend}`. When `{gpu_platform}=none`, that environment is an environment for working with non-gpu accelerated implementations of SELF.
+Each directory contains a `spack.yaml` manifest, a `Dockerfile`, and a `feq-parse.patch`.
 
 ## Container Images
 
-SELFish provides pre-built container images with all dependencies for GPU-accelerated spectral element computations. Images are tagged using a **version-architecture** naming scheme to support multiple GPU targets.
+SELFish provides pre-built container images with all dependencies for spectral element computations. Images are published to Docker Hub under `higherordermethods/selfish`.
 
 ### Image Tagging Scheme
 
 Images follow the pattern: `higherordermethods/selfish:<version>-<cpu_platform>-<gpu_backend>-<gpu_arch>`
 
-- **`<version>`**: Semantic version (e.g., `v1.2.3`) or release channel (`latest`, `dev`)
-- **`<cpu_platform>`** : Target cpu architecture (e.g. `x86`, `arm` )
-- **`<gpu_backend>`** : GPU backend provider with version (e.g. `rocm643`, `cuda112`)
-- **`<gpu_arch>`**: Target GPU architecture (e.g., `gfx90a`, `gfx906`, `gfx942`)
+- **`<version>`**: `latest` or a commit SHA
+- **`<cpu_platform>`**: Target CPU architecture (e.g. `x86`)
+- **`<gpu_backend>`**: GPU backend with version (e.g. `rocm643`, `cuda124`) or `none` for CPU-only
+- **`<gpu_arch>`**: Target GPU architecture (e.g. `gfx90a`, `sm70`); omitted for CPU-only images
 
-#### Examples:
+#### Examples
 ```bash
-# Stable release for MI210/MI250 (gfx90a)
-docker pull higherordermethods/selfish:v1.2.3-gfx90a
+# AMD MI210/MI250 (gfx90a) with ROCm 6.4.3
+docker pull higherordermethods/selfish:latest-x86-rocm643-gfx90a
+
+# NVIDIA V100 (sm70) with CUDA 12.4
+docker pull higherordermethods/selfish:latest-x86-cuda124-sm70
 
-# Latest stable for Radeon Instinct MI100 (gfx908)
-docker pull higherordermethods/selfish:latest-gfx908
+# NVIDIA Blackwell (sm100) with CUDA 13.0
+docker pull higherordermethods/selfish:latest-x86-cuda130-sm100
 
-# Development build for MI300A (gfx942)
-docker pull higherordermethods/selfish:dev-gfx942
+# CPU-only
+docker pull higherordermethods/selfish:latest-x86-none
 ```
 
-### Supported GPU Architectures
+### Supported Architectures
 
-| Architecture | GPU Models | Tag Suffix |
-|--------------|------------|------------|
-| gfx90a | MI210, MI250, MI250X | `-gfx90a` |
-| gfx908 | MI100 | `-gfx908` |
-| gfx906 | MI50, MI60, Radeon VII | `-gfx906` |
-| gfx942 | MI300A, MI300X | `-gfx942` |
-| sm_72  | V100 | -sm72 |
+#### AMD ROCm
+| Architecture | GPU Models | Tag |
+|--------------|------------|-----|
+| gfx906 | MI50, MI60, Radeon VII | `latest-x86-rocm643-gfx906` |
+| gfx90a | MI210, MI250, MI250X | `latest-x86-rocm643-gfx90a` |
+| gfx942 | MI300A, MI300X | `latest-x86-rocm643-gfx942` |
 
-### Determining Your GPU Architecture
+#### NVIDIA CUDA
+| Architecture | GPU Models | Tag |
+|--------------|------------|-----|
+| sm70 | V100 | `latest-x86-cuda124-sm70` |
+| sm100 | B200, B300 | `latest-x86-cuda130-sm100` |
+
+#### CPU-only
+| Tag |
+|-----|
+| `latest-x86-none` |
 
-If you're unsure which image to use, check your GPU architecture.
+### Determining Your GPU Architecture
 
-For AMD GPUs,
+For AMD GPUs:
 ```bash
-# Using rocminfo
 rocminfo | grep "Name:" | grep "gfx"
+```
 
-# Using rocm-smi
-rocm-smi --showproductname
+For NVIDIA GPUs (a reported compute capability of `7.0` corresponds to `sm70`):
+```bash
+nvidia-smi --query-gpu=compute_cap --format=csv,noheader
 ```
 
 ### Using with Slurm
@@ -71,25 +88,20 @@ Specify the architecture-specific image in your job script:
 ```bash
 #!/bin/bash
 #SBATCH --gpus=1
-#SBATCH --container-image=higherordermethods/selfish:v1.2.3-gfx90a
+#SBATCH --container-image=higherordermethods/selfish:latest-x86-rocm643-gfx90a
 
 ./run_simulation.sh
 ```
 
-### Version Pinning Recommendations
-
-- **Production**: Pin to specific versions (e.g., `v1.2.3-gfx90a`) for reproducibility
-- **Development**: Use `latest-<arch>` for convenience (auto-updates with new releases)
-- **Testing CI**: Use `dev-<arch>` to test against bleeding-edge builds
-
 ### Image Metadata
 
 All images include OCI labels for programmatic inspection:
 ```bash
-docker inspect higherordermethods/selfish:v1.2.3-gfx90a | grep -A5 Labels
+docker inspect higherordermethods/selfish:latest-x86-cuda124-sm70 | grep -A5 Labels
 ```
 
 Key labels:
-- `com.fluidnumerics.rocm.target`: GPU architecture target
-- `com.fluidnumerics.selfish.version`: SELFish version
-- `org.opencontainers.image.version`: Container image version
+- `com.fluidnumerics.rocm.target` / `com.fluidnumerics.cuda.target`: GPU architecture target
+- `com.fluidnumerics.rocm.version` / `com.fluidnumerics.cuda.version`: Backend version
+- `org.opencontainers.image.source`: Source repository
+- `org.opencontainers.image.revision`: Git commit SHA
diff --git a/envs/x86/cuda/Dockerfile b/envs/x86/cuda/Dockerfile
@@ -171,7 +171,11 @@ COPY --from=builder /opt/views /opt/views
 
 RUN { \
       echo '#!/bin/sh' \
+      && echo '# Save LD_LIBRARY_PATH set by the container runtime (e.g. enroot/pyxis)' \
+      && echo '_pre_spack_ldlp="${LD_LIBRARY_PATH}"' \
       && echo '.' /opt/spack-environment/activate.sh \
+      && echo '# Restore runtime-injected paths (NVIDIA driver libs) after Spack activate' \
+      && echo 'export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}${_pre_spack_ldlp}"' \
       && echo 'exec "$@"'; \
     } > /entrypoint.sh \
 && chmod a+x /entrypoint.sh \
diff --git a/envs/x86/rocm/Dockerfile b/envs/x86/rocm/Dockerfile
@@ -185,7 +185,11 @@ COPY --from=builder /opt/views /opt/views
 
 RUN { \
       echo '#!/bin/sh' \
+      && echo '# Save LD_LIBRARY_PATH set by the container runtime (e.g. enroot/pyxis)' \
+      && echo '_pre_spack_ldlp="${LD_LIBRARY_PATH}"' \
       && echo '.' /opt/spack-environment/activate.sh \
+      && echo '# Restore runtime-injected paths (GPU driver libs) after Spack activate' \
+      && echo 'export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}${_pre_spack_ldlp}"' \
       && echo 'exec "$@"'; \
     } > /entrypoint.sh \
 && chmod a+x /entrypoint.sh \