Commit 01842e3

Update docs and preserve runtime LD_LIBRARY_PATH in entrypoints
Update README and CLAUDE.md to reflect the consolidated cuda/ directory, new none/ environment, and current tagging scheme.

Fix entrypoints in cuda and rocm Dockerfiles to save and restore LD_LIBRARY_PATH around Spack activate, so GPU driver libraries injected by enroot/pyxis are not clobbered.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent fa13036 commit 01842e3

4 files changed

Lines changed: 66 additions & 46 deletions

CLAUDE.md

Lines changed: 6 additions & 6 deletions
@@ -1,16 +1,16 @@
 # Repository Guidelines
 
 ## Project Structure & Module Organization
-Source files live under `envs/<cpu>/<gpu>/`, where each leaf directory owns a `spack.yaml` manifest and (optionally) a generated `Dockerfile`. Keep CPU targets (`x86`, …) and accelerator targets (`gfx90a`, `sm72`, `none`) granular so images stay purpose-built, and limit the root `README.md` to high-level context.
+Source files live under `envs/<cpu>/<gpu>/`, where each leaf directory owns a `spack.yaml` manifest and a `Dockerfile`. Current environments are `envs/x86/rocm` (AMD GPUs), `envs/x86/cuda` (NVIDIA GPUs, parameterised by `CUDA_ARCH` and `CUDA_VERSION` build args), and `envs/x86/none` (CPU-only). Keep CPU targets (`x86`, …) and accelerator targets (`rocm`, `cuda`, `none`) as separate directories so images stay purpose-built, and limit the root `README.md` to high-level context.
 
 ## Build, Test, and Development Commands
-- `spack spec -e envs/x86/gfx90a/spack.yaml` concretizes the manifest locally; run this before opening a PR so dependency drift is caught early.
-- `spack containerize envs/x86/gfx90a/spack.yaml > envs/x86/gfx90a/Dockerfile` regenerates the Dockerfile after manifest edits (avoid hand-tuning output).
-- `docker build -f envs/x86/gfx90a/Dockerfile -t selfish:gfx90a .` — builds the shareable runtime image; tag images `<cpu>-<gpu>` for clarity.
-- `docker run --rm selfish:gfx90a spack find hdf5` — smoke-tests that the expected view was installed inside the image.
+- `docker build -f envs/x86/rocm/Dockerfile --build-arg GPU_ARCH=gfx90a -t selfish:x86-rocm-gfx90a .` builds the ROCm image for a specific AMD GPU arch.
+- `docker build -f envs/x86/cuda/Dockerfile --build-arg CUDA_ARCH=70 --build-arg CUDA_VERSION=12.4 -t selfish:x86-cuda-sm70 .` builds the CUDA image for a specific NVIDIA GPU arch.
+- `docker build -f envs/x86/none/Dockerfile -t selfish:x86-none .` — builds the CPU-only image.
+- `docker run --rm selfish:x86-none spack find hdf5` — smoke-tests that the expected view was installed inside the image.
 
 ## Coding Style & Naming Conventions
-Spack YAML uses 2-space indentation, lowercase keys, and quoted constraint strings (`"target=x86_64_v3"`). Group `specs` alphabetically, keep `packages` overrides sorted by scope, and rely on multiline `RUN` blocks with trailing `\` alignment plus brief comments for non-obvious workarounds. Name new environments after the hardware tuple (`x86/gfx942`, `x86/none`) so downstream scripts can glob predictably.
+Spack YAML uses 2-space indentation, lowercase keys, and quoted constraint strings (`"target=x86_64_v3"`). Group `specs` alphabetically, keep `packages` overrides sorted by scope, and rely on multiline `RUN` blocks with trailing `\` alignment plus brief comments for non-obvious workarounds. Name new environments after the hardware tuple (`x86/cuda`, `x86/none`) so downstream scripts can glob predictably.
 
 ## Testing Guidelines
 For each environment change, run `spack spec` followed by `spack install --fail-fast` inside a disposable builder container to verify concretization. Container builds must pass `docker build` locally before review; capture the last ~20 lines for the PR description. When adding MPI/HDF5 variants, run `docker run --rm <tag> mpichversion` (or another representative binary) to prove runtime availability. There is no coverage gate, but every new spec should ship with at least one build log, and GitHub Actions now double-checks gfx90a builds and publishes them to `higherordermethods/selfish`.
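The three `docker build` lines added in this hunk share one shape: a Dockerfile under `envs/x86/<backend>/`, optional `--build-arg` flags, and a local tag. As a rough sketch only, a helper along the following lines could print the matching command so it can be reviewed before it is run. The function name `selfish_build_cmd` is hypothetical and not part of the repository; the build args and local tags are copied from the hunk.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the docker build command for one
# environment, following the patterns in the CLAUDE.md hunk.
# Usage: selfish_build_cmd rocm gfx90a
#        selfish_build_cmd cuda 70 12.4
#        selfish_build_cmd none
selfish_build_cmd() {
    backend="$1"
    arch="${2:-}"
    version="${3:-}"
    case "$backend" in
        rocm)
            # ROCm images take the GPU arch both as build arg and tag suffix.
            printf 'docker build -f envs/x86/rocm/Dockerfile --build-arg GPU_ARCH=%s -t selfish:x86-rocm-%s .\n' "$arch" "$arch"
            ;;
        cuda)
            # CUDA images take a bare compute capability (70) and tag it sm70.
            printf 'docker build -f envs/x86/cuda/Dockerfile --build-arg CUDA_ARCH=%s --build-arg CUDA_VERSION=%s -t selfish:x86-cuda-sm%s .\n' "$arch" "$version" "$arch"
            ;;
        none)
            # CPU-only images need no build args.
            printf 'docker build -f envs/x86/none/Dockerfile -t selfish:x86-none .\n'
            ;;
        *)
            echo "unknown backend: $backend" >&2
            return 1
            ;;
    esac
}
```

Printing rather than executing keeps the sketch safe to try on a machine without Docker; piping the output to `sh` would run the build.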

README.md

Lines changed: 52 additions & 40 deletions
@@ -10,59 +10,76 @@ The core SELF team at Fluid Numerics has adopted enroot+pyxis with Slurm for our
 See [Repository Guidelines](CLAUDE.md) for contributor expectations, build commands, and review checklists.
 
 
-More docs coming soon
+## Organization
 
+The `envs/` subdirectory defines all of the base environments that are aimed at providing base images with all the dependencies required for developing SELF. The subdirectory structure is `envs/{cpu_platform}/{gpu_backend}`.
 
-## Organization
+| Directory | GPU backend | Build args | MPI |
+|-----------|-------------|------------|-----|
+| `envs/x86/rocm/` | AMD ROCm | `GPU_ARCH` (e.g. `gfx90a`), `GPU_BACKEND_VERSION` (e.g. `6.4.3`) | OpenMPI |
+| `envs/x86/cuda/` | NVIDIA CUDA | `CUDA_ARCH` (e.g. `70`, `100`), `CUDA_VERSION` (e.g. `12.4`, `13.0`) | OpenMPI |
+| `envs/x86/none/` | None (CPU-only) | | OpenMPI |
 
-The `envs/` subdirectory defines all of the base environments that are aimed at providing base images with all the dependencies required for developing SELF. The subdirectory structure is as `envs/{cpu_platform}/{gpu_backend}`. When `{gpu_platform}=none`, that environment is an environment for working with non-gpu accelerated implementations of SELF.
+Each directory contains a `spack.yaml` manifest, a `Dockerfile`, and a `feq-parse.patch`.
 
 ## Container Images
 
-SELFish provides pre-built container images with all dependencies for GPU-accelerated spectral element computations. Images are tagged using a **version-architecture** naming scheme to support multiple GPU targets.
+SELFish provides pre-built container images with all dependencies for spectral element computations. Images are published to Docker Hub under `higherordermethods/selfish`.
 
 ### Image Tagging Scheme
 
 Images follow the pattern: `higherordermethods/selfish:<version>-<cpu_platform>-<gpu_backend>-<gpu_arch>`
 
-- **`<version>`**: Semantic version (e.g., `v1.2.3`) or release channel (`latest`, `dev`)
-- **`<cpu_platform>`** : Target cpu architecture (e.g. `x86`, `arm` )
-- **`<gpu_backend>`** : GPU backend provider with version (e.g. `rocm643`, `cuda112`)
-- **`<gpu_arch>`**: Target GPU architecture (e.g., `gfx90a`, `gfx906`, `gfx942`)
+- **`<version>`**: `latest` or a commit SHA
+- **`<cpu_platform>`**: Target CPU architecture (e.g. `x86`)
+- **`<gpu_backend>`**: GPU backend with version (e.g. `rocm643`, `cuda124`) or `none` for CPU-only
+- **`<gpu_arch>`**: Target GPU architecture (e.g. `gfx90a`, `sm70`); omitted for CPU-only images
 
-#### Examples:
+#### Examples
 ```bash
-# Stable release for MI210/MI250 (gfx90a)
-docker pull higherordermethods/selfish:v1.2.3-gfx90a
+# AMD MI210/MI250 (gfx90a) with ROCm 6.4.3
+docker pull higherordermethods/selfish:latest-x86-rocm643-gfx90a
+
+# NVIDIA V100 (sm70) with CUDA 12.4
+docker pull higherordermethods/selfish:latest-x86-cuda124-sm70
 
-# Latest stable for Radeon Instinct MI100 (gfx908)
-docker pull higherordermethods/selfish:latest-gfx908
+# NVIDIA Blackwell (sm100) with CUDA 13.0
+docker pull higherordermethods/selfish:latest-x86-cuda130-sm100
 
-# Development build for MI300A (gfx942)
-docker pull higherordermethods/selfish:dev-gfx942
+# CPU-only
+docker pull higherordermethods/selfish:latest-x86-none
 ```
 
-### Supported GPU Architectures
+### Supported Architectures
 
-| Architecture | GPU Models | Tag Suffix |
-|--------------|------------|------------|
-| gfx90a | MI210, MI250, MI250X | `-gfx90a` |
-| gfx908 | MI100 | `-gfx908` |
-| gfx906 | MI50, MI60, Radeon VII | `-gfx906` |
-| gfx942 | MI300A, MI300X | `-gfx942` |
-| sm_72 | V100 | -sm72 |
+#### AMD ROCm
+| Architecture | GPU Models | Tag |
+|--------------|------------|-----|
+| gfx906 | MI50, MI60, Radeon VII | `latest-x86-rocm643-gfx906` |
+| gfx90a | MI210, MI250, MI250X | `latest-x86-rocm643-gfx90a` |
+| gfx942 | MI300A, MI300X | `latest-x86-rocm643-gfx942` |
 
-### Determining Your GPU Architecture
+#### NVIDIA CUDA
+| Architecture | GPU Models | Tag |
+|--------------|------------|-----|
+| sm70 | V100 | `latest-x86-cuda124-sm70` |
+| sm100 | B200, B300 | `latest-x86-cuda130-sm100` |
+
+#### CPU-only
+| Tag |
+|-----|
+| `latest-x86-none` |
 
-If you're unsure which image to use, check your GPU architecture.
+### Determining Your GPU Architecture
 
-For AMD GPUs,
+For AMD GPUs:
 ```bash
-# Using rocminfo
 rocminfo | grep "Name:" | grep "gfx"
+```
 
-# Using rocm-smi
-rocm-smi --showproductname
+For NVIDIA GPUs:
+```bash
+nvidia-smi --query-gpu=compute_cap --format=csv,noheader
 ```
 
 ### Using with Slurm
@@ -71,25 +88,20 @@ Specify the architecture-specific image in your job script:
 ```bash
 #!/bin/bash
 #SBATCH --gpus=1
-#SBATCH --container-image=higherordermethods/selfish:v1.2.3-gfx90a
+#SBATCH --container-image=higherordermethods/selfish:latest-x86-rocm643-gfx90a
 
 ./run_simulation.sh
 ```
 
-### Version Pinning Recommendations
-
-- **Production**: Pin to specific versions (e.g., `v1.2.3-gfx90a`) for reproducibility
-- **Development**: Use `latest-<arch>` for convenience (auto-updates with new releases)
-- **Testing CI**: Use `dev-<arch>` to test against bleeding-edge builds
-
 ### Image Metadata
 
 All images include OCI labels for programmatic inspection:
 ```bash
-docker inspect higherordermethods/selfish:v1.2.3-gfx90a | grep -A5 Labels
+docker inspect higherordermethods/selfish:latest-x86-cuda124-sm70 | grep -A5 Labels
 ```
 
 Key labels:
-- `com.fluidnumerics.rocm.target`: GPU architecture target
-- `com.fluidnumerics.selfish.version`: SELFish version
-- `org.opencontainers.image.version`: Container image version
+- `com.fluidnumerics.rocm.target` / `com.fluidnumerics.cuda.target`: GPU architecture target
+- `com.fluidnumerics.rocm.version` / `com.fluidnumerics.cuda.version`: Backend version
+- `org.opencontainers.image.source`: Source repository
+- `org.opencontainers.image.revision`: Git commit SHA
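The new README describes the tag pattern `<version>-<cpu_platform>-<gpu_backend>-<gpu_arch>` in prose and by example. A minimal sketch of how those pieces might be assembled in a job script follows. Both function names are hypothetical, and the rule that the backend version loses its dots (`6.4.3` becomes `643`) is inferred from the example tags rather than stated anywhere in the diff.

```shell
#!/bin/sh
# Hypothetical helpers; neither function exists in the repository.

# Compose <version>-<cpu_platform>-<gpu_backend>-<gpu_arch>.
# For CPU-only images pass backend "none" and omit the remaining arguments.
selfish_tag() {
    version="$1"
    cpu="$2"
    backend="$3"
    backend_version="${4:-}"
    arch="${5:-}"
    if [ "$backend" = "none" ]; then
        # CPU-only tags carry no backend version and no GPU arch.
        printf '%s-%s-none\n' "$version" "$cpu"
        return 0
    fi
    # The README examples drop the dots from the backend version: 6.4.3 -> 643.
    compact=$(printf '%s' "$backend_version" | tr -d '.')
    printf '%s-%s-%s%s-%s\n' "$version" "$cpu" "$backend" "$compact" "$arch"
}

# Turn a compute capability such as "7.0" into the "sm70" form used in tags.
compute_cap_to_sm() {
    printf 'sm%s\n' "$(printf '%s' "$1" | tr -d '.')"
}
```

For example, `selfish_tag latest x86 cuda 12.4 "$(compute_cap_to_sm 7.0)"` yields `latest-x86-cuda124-sm70`, which matches the V100 row of the NVIDIA table. Whether a tag for a given combination has actually been published still has to be checked against Docker Hub.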

envs/x86/cuda/Dockerfile

Lines changed: 4 additions & 0 deletions
@@ -171,7 +171,11 @@ COPY --from=builder /opt/views /opt/views
 
 RUN { \
       echo '#!/bin/sh' \
+      && echo '# Save LD_LIBRARY_PATH set by the container runtime (e.g. enroot/pyxis)' \
+      && echo '_pre_spack_ldlp="${LD_LIBRARY_PATH}"' \
       && echo '.' /opt/spack-environment/activate.sh \
+      && echo '# Restore runtime-injected paths (NVIDIA driver libs) after Spack activate' \
+      && echo 'export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}${_pre_spack_ldlp}"' \
       && echo 'exec "$@"'; \
     } > /entrypoint.sh \
 && chmod a+x /entrypoint.sh \
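To see what the added lines do at run time, the save/activate/restore sequence can be exercised outside a container. The sketch below is illustrative only: `fake_spack_activate` is a made-up stand-in for `/opt/spack-environment/activate.sh` and is assumed to overwrite `LD_LIBRARY_PATH`, and both directory names are invented for the demonstration.

```shell
#!/bin/sh
# Hypothetical stand-in for /opt/spack-environment/activate.sh. It is assumed
# here to overwrite LD_LIBRARY_PATH; the path is illustrative only.
fake_spack_activate() {
    LD_LIBRARY_PATH="/opt/views/view/lib"
    export LD_LIBRARY_PATH
}

# Same sequence as the patched entrypoint, minus the final exec "$@".
activate_preserving_ldlp() {
    # Save whatever the container runtime put in place before activation.
    _pre_spack_ldlp="${LD_LIBRARY_PATH}"
    fake_spack_activate
    # Append the saved value after the paths set by activation.
    export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}${_pre_spack_ldlp}"
}

# Pretend the container runtime injected a driver directory (made-up path).
LD_LIBRARY_PATH="/usr/local/nvidia/lib64"
activate_preserving_ldlp
echo "${LD_LIBRARY_PATH}"
# prints /opt/views/view/lib:/usr/local/nvidia/lib64
```

The Spack paths end up first and the runtime-injected paths last, so the injected driver directory is kept but searched after the Spack view. One property worth knowing: when nothing was set before activation, the expression leaves a trailing `:` on the result.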

envs/x86/rocm/Dockerfile

Lines changed: 4 additions & 0 deletions
@@ -185,7 +185,11 @@ COPY --from=builder /opt/views /opt/views
 
 RUN { \
       echo '#!/bin/sh' \
+      && echo '# Save LD_LIBRARY_PATH set by the container runtime (e.g. enroot/pyxis)' \
+      && echo '_pre_spack_ldlp="${LD_LIBRARY_PATH}"' \
       && echo '.' /opt/spack-environment/activate.sh \
+      && echo '# Restore runtime-injected paths (GPU driver libs) after Spack activate' \
+      && echo 'export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}${_pre_spack_ldlp}"' \
       && echo 'exec "$@"'; \
     } > /entrypoint.sh \
 && chmod a+x /entrypoint.sh \
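One way to confirm the fix from inside a running container is to check that a directory the runtime is expected to inject is still an entry of `LD_LIBRARY_PATH` after the entrypoint has run. The helper below is a sketch under that assumption; `ldlp_has` is a hypothetical name, and which directories enroot/pyxis injects depends on the site configuration, so the paths in the usage note are placeholders.

```shell
#!/bin/sh
# Hypothetical check: succeed if directory $1 is an entry of the
# colon-separated list $2 (defaults to the current LD_LIBRARY_PATH).
ldlp_has() {
    dir="$1"
    list="${2-${LD_LIBRARY_PATH:-}}"
    # Wrap both sides in colons so only whole entries match, not prefixes.
    case ":${list}:" in
        *":${dir}:"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For instance, `ldlp_has /opt/rocm/lib "/opt/views/view/lib:/opt/rocm/lib"` succeeds, while `ldlp_has /opt/rocm "/opt/views/view/lib:/opt/rocm/lib"` fails because `/opt/rocm` is only a prefix of an entry.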
