|
| 1 | +name: Security Audit |
| 2 | + |
| 3 | +on: |
| 4 | + push: |
| 5 | + branches: [main] |
| 6 | + pull_request: |
| 7 | + types: [opened, synchronize, reopened] |
| 8 | + |
| 9 | +jobs: |
| 10 | + security-audit: |
| 11 | + runs-on: ubuntu-latest |
| 12 | + permissions: |
| 13 | + contents: read |
| 14 | + steps: |
| 15 | + - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 |
| 16 | + with: |
| 17 | + fetch-depth: 0 |
| 18 | + |
| 19 | + - uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4 |
| 20 | + with: |
| 21 | + node-version: "20" |
| 22 | + |
| 23 | + - name: Install Claude Code |
| 24 | + run: npm install -g @anthropic-ai/claude-code |
| 25 | + |
| 26 | + - name: Generate diff |
| 27 | + run: git diff ${{ github.event.before || github.event.pull_request.base.sha }}...${{ github.sha }} > /tmp/changes.diff |
| 28 | + |
| 29 | + - name: Run security audit |
| 30 | + id: audit |
| 31 | + env: |
| 32 | + ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} |
| 33 | + run: | |
| 34 | + { |
| 35 | + cat <<'PROMPT' |
| 36 | + You are a senior security engineer performing a penetration-test-style review of a |
| 37 | + change pushed to, or proposed in a pull request against, the main branch of the kernels-community project. This |
| 38 | + repository hosts the source code for compute kernels (CUDA, Metal, ROCm, XPU, Triton, |
| 39 | + C++, Python torch extensions) that are built by CI and uploaded to |
| 40 | + `hf.co/kernels-community` on the Hugging Face Hub. End users download and load these |
| 41 | + kernels — and execute the resulting native code — on their own machines (including in |
| 42 | + CI environments and training jobs) via the `kernels` Python package. Treat the attack |
| 43 | + surface accordingly: a backdoor that slips into a merged kernel will be served to |
| 44 | + every downstream user. |
| 45 | + |
| 46 | + A brief overview of the repository layout: |
| 47 | + |
| 48 | + * Each top-level directory (e.g. `flash-attn3`, `relu`, `paged-attention`, `rmsnorm`) |
| 49 | + is a single kernel with the following typical structure: |
| 50 | + - `build.toml`: declares the kernel's name, repo-id, backends, dependencies, and |
| 51 | + source file list. CI uses this to drive the build and the Hub upload. |
| 52 | + - `flake.nix` / `flake.lock`: Nix build pinning. Pulls `kernel-builder` and any |
| 53 | + C/C++/CUDA toolchain dependencies. |
| 54 | + - Backend source directories (`*_cuda`, `*_metal`, `*_xpu`, `*_rocm`, etc.): |
| 55 | + native kernel implementations. |
| 56 | + - `torch-ext/`: Python entry points, `torch_binding.cpp`/`torch_binding.h` that |
| 57 | + register the kernel as a Torch op, and a Python package directory with |
| 58 | + `__init__.py` that exposes the kernel via `._ops`. |
| 59 | + - `tests/`: pytest suite. Run by users and by CI on GPU runners. |
| 60 | + - `benchmarks/`: benchmark scripts. |
| 61 | + * `.github/workflows/`: CI for building, testing, validating and uploading kernels. |
| 62 | + * `scripts/`: maintenance scripts (freshness checks, failure reporting). |
| 63 | + |
| 64 | + The diff of the change follows below. You also have access to the full repository — |
| 65 | + explore it when the diff alone is not sufficient to assess impact (e.g. to check who |
| 66 | + calls a modified function, to confirm whether a hardcoded URL is reachable from a |
| 67 | + build step, or to understand what a CUDA kernel actually computes). |
| 68 | + |
| 69 | + Think like an attacker. The threat model is: a malicious contributor (or compromised |
| 70 | + upstream) lands code that (a) executes attacker-controlled logic on the machine of every user who |
| 71 | + loads this kernel, (b) exfiltrates secrets from users' machines or from this repo's |
| 72 | + CI, (c) injects backdoors into the built artifact uploaded to the Hub, or (d) abuses |
| 73 | + the GPU/CPU side effects of the kernel to leak data or escalate privilege. |
| 74 | + |
| 75 | + Focus on: |
| 76 | + - **Malicious code in kernel sources (CUDA, Metal, ROCm, XPU, Triton, C/C++):** Look |
| 77 | + for out-of-bounds reads/writes that go beyond the declared tensor shapes, |
| 78 | + kernels that scan or copy memory they were not asked to touch, hard-coded device |
| 79 | + pointers, inline PTX/asm that performs unexpected operations, and side channels |
| 80 | + (timing, cache, shared-memory residue) that could leak tensor contents. Watch for |
| 81 | + kernels whose math doesn't match their name (e.g. a "relu" kernel that also writes |
| 82 | + to a second buffer). |
| 83 | + - **Python torch-ext code:** `torch-ext/**/__init__.py` and any helper |
| 84 | + Python modules run inside the user's interpreter the moment the kernel is loaded. |
| 85 | + Flag any of: network calls (`urllib`, `requests`, `socket`, `http.client`, etc.), |
| 86 | + filesystem writes, spawning of other processes (e.g. `subprocess`/`os.system`/`os.exec*`), |
| 87 | + `eval`/`exec`/`compile` of dynamic strings, `ctypes.CDLL`/`cffi` loading of |
| 88 | + arbitrary paths, `importlib` with attacker-controllable module names, base64/hex |
| 89 | + blobs decoded then executed, environment variable reads that change control flow in |
| 90 | + non-obvious ways, and `sys.modules` manipulation that could shadow stdlib or |
| 91 | + third-party modules. |
| 92 | + - **Torch op registration and namespace hygiene:** Every kernel registers ops under a |
| 93 | + namespace derived from its package. Look for registrations that omit |
| 94 | + `add_op_namespace_prefix`, that register ops under a name belonging to another |
| 95 | + kernel or to PyTorch core, that override `aten::` ops, or that mutate |
| 96 | + `torch.ops`/`torch.library` global state in surprising ways. |
| 97 | + - **`torch_binding.cpp` / C++ glue:** TORCH_LIBRARY registrations, dispatch keys, |
| 98 | + mutable-argument annotations, and any C++ that does I/O, dlopen, or shells out |
| 99 | + beyond pure tensor math. |
| 100 | + - **`build.toml` manipulation:** Source-file lists, dependency declarations, |
| 101 | + `repo-id`, and backend selection. A new `.cu`/`.cpp`/`.py` source added to the file |
| 102 | + list that doesn't appear in the kernel's stated functionality, a `repo-id` change |
| 103 | + that retargets uploads to a different Hub repo, or a dependency on an unfamiliar |
| 104 | + package are all red flags. |
| 105 | + - **`flake.nix` / `flake.lock` / build pinning:** Look for new flake inputs pointing |
| 106 | + at attacker-controlled forks, removal of `--no-write-lock-file`-style guards, |
| 107 | + relaxation of the Nix sandbox (`sandbox = relaxed`, `sandbox = false`, |
| 108 | + `extra-sandbox-paths`), addition of `__noChroot` or `allowSubstitutes = false`, |
| 109 | + and changes to `kernel-builder`/`nixpkgs` pins. Stale or downgraded lock entries |
| 110 | + for security-relevant packages (CUDA toolchain, glibc, OpenSSL, PyTorch) are also |
| 111 | + high-impact. |
| 112 | + - **Embedded download URLs and fetch-at-build-time:** `curl`/`wget`/`fetchurl`/ |
| 113 | + `fetchTarball`/`fetchGit` calls with unpinned refs or shell snippets in `build.toml`/`flake.nix`. Anything |
| 114 | + that pulls bytes from the network at build or import time without a SHA pin is a |
| 115 | + supply-chain risk. |
| 116 | + - **Test and benchmark scripts:** `tests/` and `benchmarks/` run on CI GPU runners |
| 117 | + that may have privileged access. Flag tests that download external data, write |
| 118 | + outside the test temp dir, read CI secrets/env vars, or call out to the network. |
| 119 | + Also flag tests that weaken safety checks on attacker-controlled inputs (e.g. `torch.load` |
| 120 | + with `weights_only=False`) or that could poison the `torch.compile` cache. |
| 121 | + - **CI/CD security (`.github/workflows/`, `.github/actions/`, `scripts/`):** Workflow |
| 122 | + permission scopes (`contents: write`, `id-token: write`, `packages: write`, |
| 123 | + `pull-requests: write`), secret exposure in `run:` steps, script injection via |
| 124 | + PR-controlled values (`github.event.pull_request.title`, `head_ref`, commit |
| 125 | + messages, file paths) interpolated into shell, and changes that grant |
| 126 | + `pull_request_target` or that make trusted jobs run untrusted PR code. Verify all |
| 127 | + third-party actions are SHA-pinned (full 40-char SHA), not version-tagged. Flag any |
| 128 | + workflow that hands `secrets.HF_TOKEN`, `secrets.ANTHROPIC_API_KEY`, |
| 129 | + `secrets.SLACK_WEBHOOK_URL*` (including `SLACK_WEBHOOK_URL` and `SLACK_WEBHOOK_URL_SECURITY`), |
| 130 | + or Cachix signing keys to PR-triggered code paths. |
| 131 | + - **Hub upload pipeline (`manual-build-upload.yaml`, release workflows):** Anything |
| 132 | + that widens what gets uploaded, that uploads to a `repo-id` not declared in the |
| 133 | + kernel's `build.toml`, or that uses elevated tokens in steps that also execute |
| 134 | + PR-derived code. |
| 135 | + - **Obfuscation and steganography:** Long base64/hex/zlib blobs in Python or C++; |
| 136 | + Unicode lookalike characters in identifiers (homoglyph attacks); zero-width or |
| 137 | + right-to-left override characters in source; comments that hide payloads; binary |
| 138 | + files (`.so`, `.dylib`, `.bin`, `.safetensors`, `.pt`) checked into the source tree |
| 139 | + that should be built from source instead. |
| 140 | + - **Information disclosure and credential handling:** Any code that reads |
| 141 | + `~/.cache/huggingface/token`, `HF_TOKEN`, `HUGGING_FACE_HUB_TOKEN`, |
| 142 | + `ANTHROPIC_API_KEY`, AWS/GCP/Azure credential paths, SSH keys, or kubeconfigs. |
| 143 | + User-agent strings, error messages, or telemetry that leak system details or |
| 144 | + tokens. |
| 145 | + - **Denial of service:** Kernels with input-dependent unbounded loops or huge |
| 146 | + allocations, Python loaders that recurse on attacker-influenced metadata, build |
| 147 | + steps with no timeout that could be used to grief CI. |
| 148 | + |
| 149 | + For each finding, assess exploitability — not just theoretical presence. A |
| 150 | + hard-coded test fixture URL is lower severity than a `curl | sh` in a build step. |
| 151 | + Distinguish "this code is sketchy" from "an attacker can use this today." |
| 152 | + |
| 153 | + If you find security issues, output your report formatted for Slack using mrkdwn syntax. |
| 154 | + Use this structure: |
| 155 | + |
| 156 | + *[SEVERITY]* `file:lines` — Title |
| 157 | + Description of the vulnerability and how it could be exploited. |
| 158 | + _Suggestion:_ How to fix. |
| 159 | + |
| 160 | + Separate multiple findings with blank lines. Be concise but specific. |
| 161 | + |
| 162 | + If no security issues are found, output exactly: NO_FINDINGS |
| 163 | + |
| 164 | + === DIFF === |
| 165 | + PROMPT |
| 166 | + cat /tmp/changes.diff |
| 167 | + } | claude -p --model claude-opus-4-6 > /tmp/audit_result.txt |
| 168 | + |
| 169 | + if grep -qx "NO_FINDINGS" /tmp/audit_result.txt; then |
| 170 | + echo "has_findings=false" >> "$GITHUB_OUTPUT" |
| 171 | + echo "Security audit complete — no findings." |
| 172 | + else |
| 173 | + echo "has_findings=true" >> "$GITHUB_OUTPUT" |
| 174 | + echo "Security audit complete — findings detected, notifying Slack." |
| 175 | + fi |
| 176 | + |
| 177 | + - name: Notify Slack |
| 178 | + if: steps.audit.outputs.has_findings == 'true' |
| 179 | + env: |
| 180 | + SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL_SECURITY }} |
| 181 | + COMMIT_URL: ${{ github.event.head_commit.url || github.event.pull_request.html_url }} |
| 182 | + COMMIT_MESSAGE: ${{ github.event.head_commit.message || github.event.pull_request.title }} |
| 183 | + COMMIT_AUTHOR: ${{ github.event.head_commit.author.username || github.event.head_commit.author.name || github.event.pull_request.user.login }} |
| 184 | + run: | |
| 185 | + FINDINGS=$(cat /tmp/audit_result.txt) |
| 186 | + COMMIT_TITLE=$(printf '%s\n' "$COMMIT_MESSAGE" | head -n1) |
| 187 | + |
| 188 | + printf -v HEADER '*[kernels-community] Security Audit Finding*\n*Commit:* <%s|%s>\n*Author:* %s\n\n---\n\n' \ |
| 189 | + "$COMMIT_URL" "$COMMIT_TITLE" "$COMMIT_AUTHOR" |
| 190 | + |
| 191 | + jq -n \ |
| 192 | + --arg text "${HEADER}${FINDINGS}" \ |
| 193 | + '{"text": $text}' > /tmp/slack_payload.json |
| 194 | + |
| 195 | + curl -sf -X POST "$SLACK_WEBHOOK_URL" \ |
| 196 | + -H 'Content-Type: application/json' \ |
| 197 | + -d @/tmp/slack_payload.json |
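
The audit step decides between "clean" and "findings" by searching the model output for the `NO_FINDINGS` sentinel, so an empty or whitespace-only result file is reported as a finding. A stricter classification might look like the sketch below. The helper name `classify_audit_result` and its three-way output (`clean`, `findings`, `error`) are assumptions for illustration and are not part of this workflow.

```shell
# Hypothetical helper (not part of the workflow): classify the audit output file.
# Prints "clean", "findings", or "error" on stdout.
classify_audit_result() {
  result_file=$1

  # A missing file means the model call never produced output.
  if [ ! -f "$result_file" ]; then
    echo "error"
    return 0
  fi

  # Strip leading/trailing whitespace on each line and drop blank lines.
  trimmed=$(sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' "$result_file")

  if [ -z "$trimmed" ]; then
    # Empty or whitespace-only output is a failed run, not a finding.
    echo "error"
  elif [ "$trimmed" = "NO_FINDINGS" ]; then
    # The sentinel must be the only content, not merely mentioned somewhere.
    echo "clean"
  else
    echo "findings"
  fi
}
```

In the audit step, the result of `classify_audit_result /tmp/audit_result.txt` could then drive `has_findings`, with `error` failing the job instead of posting an empty message to Slack.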
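
The Slack step interpolates the commit or PR title into a `<url|text>` link. Slack's message formatting treats `&`, `<`, and `>` as control characters, so a title containing them can break the link. A minimal sketch of escaping the title before it reaches `printf -v HEADER` follows. The function name `escape_slack_text` is hypothetical and does not exist in this workflow.

```shell
# Hypothetical helper (not part of the workflow): escape the three characters
# that Slack message formatting treats as control characters.
escape_slack_text() {
  # "&" is replaced first so the entities added afterwards are not re-escaped.
  # In a sed replacement, "\&" is a literal ampersand.
  printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}
```

Under these assumptions, the step would compute `COMMIT_TITLE=$(escape_slack_text "$COMMIT_TITLE")` right after taking the first line of the message.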