[CI] Security audit workflow (#840)

sayakpaul · danieldk · web-flow · commit d7d1339bd85e · 2026-05-12T06:44:59.000+09:00
* add security audit workflow

* enhance targets

* Apply suggestions from code review

Co-authored-by: Daniël de Kok <me@danieldk.eu>

* Update .github/workflows/security-audit.yml

Co-authored-by: Daniël de Kok <me@danieldk.eu>

* up

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
diff --git a/.github/workflows/security-audit.yml b/.github/workflows/security-audit.yml
@@ -0,0 +1,197 @@
+name: Security Audit
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    types: [opened, synchronize, reopened]
+
+jobs:
+  security-audit:
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        with:
+          fetch-depth: 0
+
+      - uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4
+        with:
+          node-version: "20"
+
+      - name: Install Claude Code
+        run: npm install -g @anthropic-ai/claude-code
+
+      - name: Generate diff
+        run: git diff ${{ github.event.before || github.event.pull_request.base.sha }}...${{ github.sha }} > /tmp/changes.diff
+
+      - name: Run security audit
+        id: audit
+        env:
+          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+        run: |
+          {
+            cat <<'PROMPT'
+          You are a senior security engineer performing a penetration-test-style review of a
+          change pushed to, or proposed for, the main branch of the kernels-community project. This
+          repository hosts the source code for compute kernels (CUDA, Metal, ROCm, XPU, Triton,
+          C++, Python torch extensions) that are built by CI and uploaded to
+          `hf.co/kernels-community` on the Hugging Face Hub. End users download and load these
+          kernels — and execute the resulting native code — on their own machines (including in
+          CI environments and training jobs) via the `kernels` Python package. Treat the attack
+          surface accordingly: a backdoor that slips into a merged kernel will be served to
+          every downstream user.
+
+          A brief overview of the repository layout:
+
+          * Each top-level directory (e.g. `flash-attn3`, `relu`, `paged-attention`, `rmsnorm`)
+            is a single kernel with the following typical structure:
+              - `build.toml`: declares the kernel's name, repo-id, backends, dependencies, and
+                source file list. CI uses this to drive the build and the Hub upload.
+              - `flake.nix` / `flake.lock`: Nix build pinning. Pulls `kernel-builder` and any
+                C/C++/CUDA toolchain dependencies.
+              - Backend source directories (`*_cuda`, `*_metal`, `*_xpu`, `*_rocm`, etc.):
+                native kernel implementations.
+              - `torch-ext/`: Python entry points, `torch_binding.cpp`/`torch_binding.h` that
+                register the kernel as a Torch op, and a Python package directory with
+                `__init__.py` that exposes the kernel via `._ops`.
+              - `tests/`: pytest suite. Run by users and by CI on GPU runners.
+              - `benchmarks/`: benchmark scripts.
+          * `.github/workflows/`: CI for building, testing, validating and uploading kernels.
+          * `scripts/`: maintenance scripts (freshness checks, failure reporting).
+
+          The diff of the change follows below. You also have access to the full repository —
+          explore it when the diff alone is not sufficient to assess impact (e.g. to check who
+          calls a modified function, to confirm whether a hardcoded URL is reachable from a
+          build step, or to understand what a CUDA kernel actually computes).
+
+          Think like an attacker. The threat model is: a malicious contributor (or compromised
+          upstream) lands code that (a) executes attacker-controlled logic for every user who
+          loads this kernel, (b) exfiltrates secrets from users' machines or from this repo's
+          CI, (c) injects backdoors into the built artifact uploaded to the Hub, or (d) abuses
+          the GPU/CPU side effects of the kernel to leak data or escalate privilege.
+
+          Focus on:
+          - **Malicious code in kernel sources (CUDA, Metal, ROCm, XPU, Triton, C/C++):** Look
+            for arbitrary out-of-bounds reads/writes that go beyond the declared tensor shapes,
+            kernels that scan or copy memory they were not asked to touch, hard-coded device
+            pointers, inline PTX/asm that performs unexpected operations, and side channels
+            (timing, cache, shared-memory residue) that could leak tensor contents. Watch for
+            kernels whose math doesn't match their name (e.g. a "relu" kernel that also writes
+            to a second buffer).
+          - **Python torch-ext code:** `torch-ext/**/__init__.py` and any helper
+            Python modules run inside the user's interpreter the moment the kernel is loaded.
+            Flag any of: network calls (`urllib`, `requests`, `socket`, `http.client`, etc.),
+            filesystem writes, spawning of other processes (e.g. `subprocess`/`os.system`/`os.exec*`),
+            `eval`/`exec`/`compile` of dynamic strings, `ctypes.CDLL`/`cffi` loading of
+            arbitrary paths, `importlib` with attacker-controllable module names, base64/hex
+            blobs decoded then executed, environment variable reads that change control flow in
+            non-obvious ways, and `sys.modules` manipulation that could shadow stdlib or
+            third-party modules.
+          - **Torch op registration and namespace hygiene:** Every kernel registers ops under a
+            namespace derived from its package. Look for registrations that omit
+            `add_op_namespace_prefix`, that register ops under a name belonging to another
+            kernel or to PyTorch core, that override `aten::` ops, or that mutate
+            `torch.ops`/`torch.library` global state in surprising ways.
+          - **`torch_binding.cpp` / C++ glue:** TORCH_LIBRARY registrations, dispatch keys,
+            mutable-argument annotations, and any C++ that does I/O, dlopen, or shells out
+            beyond pure tensor math.
+          - **`build.toml` manipulation:** Source-file lists, dependency declarations,
+            `repo-id`, and backend selection. A new `.cu`/`.cpp`/`.py` source added to the file
+            list that doesn't appear in the kernel's stated functionality, a `repo-id` change
+            that retargets uploads to a different Hub repo, or a dependency on an unfamiliar
+            package are all red flags.
+          - **`flake.nix` / `flake.lock` / build pinning:** Look for new flake inputs pointing
+            at attacker-controlled forks, removal of `--no-write-lock-file`-style guards,
+            relaxation of the Nix sandbox (`sandbox = relaxed`, `sandbox = false`,
+            `extra-sandbox-paths`), addition of `__noChroot` or `allowSubstitutes = false`,
+            and changes to `kernel-builder`/`nixpkgs` pins. Stale or downgraded lock entries
+            for security-relevant packages (CUDA toolchain, glibc, OpenSSL, PyTorch) are also
+            high-impact.
+          - **Embedded download URLs and fetch-at-build-time:** `curl`/`wget`/`fetchurl`/
+            `fetchTarball`/`fetchGit` calls with unpinned refs, or shell snippets in
+            `build.toml`/`flake.nix`. Anything that pulls bytes from the network at build or import
+            time without a SHA pin is a supply-chain risk.
+          - **Test and benchmark scripts:** `tests/` and `benchmarks/` run on CI GPU runners
+            that may have privileged access. Flag tests that download external data, write
+            outside the test temp dir, read CI secrets/env vars, or call out to the network.
+            Also flag tests that poison the `torch.compile` cache or disable safety checks (e.g.
+            `torch.load` with `weights_only=False`) on attacker-controlled inputs.
+          - **CI/CD security (`.github/workflows/`, `.github/actions/`, `scripts/`):** Workflow
+            permission scopes (`contents: write`, `id-token: write`, `packages: write`,
+            `pull-requests: write`), secret exposure in `run:` steps, script injection via
+            PR-controlled values (`github.event.pull_request.title`, `head_ref`, commit
+            messages, file paths) interpolated into shell, and changes that grant
+            `pull_request_target` or that make trusted jobs run untrusted PR code. Verify all
+            third-party actions are SHA-pinned (full 40-char SHA), not version-tagged. Flag any
+            workflow that hands `secrets.HF_TOKEN`, `secrets.ANTHROPIC_API_KEY`,
+            `secrets.SLACK_WEBHOOK_URL*` (`SLACK_WEBHOOK_URL`, `SLACK_WEBHOOK_URL_SECURITY`), or
+            Cachix signing keys to PR-triggered code paths.
+          - **Hub upload pipeline (`manual-build-upload.yaml`, release workflows):** Anything
+            that widens what gets uploaded, that uploads to a `repo-id` not declared in the
+            kernel's `build.toml`, or that uses elevated tokens in steps that also execute
+            PR-derived code.
+          - **Obfuscation and steganography:** Long base64/hex/zlib blobs in Python or C++;
+            Unicode lookalike characters in identifiers (homoglyph attacks); zero-width or
+            right-to-left override characters in source; comments that hide payloads; binary
+            files (`.so`, `.dylib`, `.bin`, `.safetensors`, `.pt`) checked into the source tree
+            that should be built from source instead.
+          - **Information disclosure and credential handling:** Any code that reads
+            `~/.cache/huggingface/token`, `HF_TOKEN`, `HUGGING_FACE_HUB_TOKEN`,
+            `ANTHROPIC_API_KEY`, AWS/GCP/Azure credential paths, SSH keys, or kubeconfigs.
+            User-agent strings, error messages, or telemetry that leak system details or
+            tokens.
+          - **Denial of service:** Kernels with input-dependent unbounded loops or huge
+            allocations, Python loaders that recurse on attacker-influenced metadata, build
+            steps with no timeout that could be used to grief CI.
+
+          For each finding, assess exploitability — not just theoretical presence. A
+          hard-coded test fixture URL is lower severity than a `curl | sh` in a build step.
+          Distinguish "this code is sketchy" from "an attacker can use this today."
+
+          If you find security issues, output your report formatted for Slack using mrkdwn syntax.
+          Use this structure:
+
+          *[SEVERITY]* `file:lines` — Title
+          Description of the vulnerability and how it could be exploited.
+          _Suggestion:_ How to fix.
+
+          Separate multiple findings with blank lines. Be concise but specific.
+
+          If no security issues are found, output exactly: NO_FINDINGS
+
+          === DIFF ===
+          PROMPT
+            cat /tmp/changes.diff
+          } | claude -p --model claude-opus-4-6 > /tmp/audit_result.txt
+
+          if grep -q "NO_FINDINGS" /tmp/audit_result.txt; then
+            echo "has_findings=false" >> "$GITHUB_OUTPUT"
+            echo "Security audit complete — no findings."
+          else
+            echo "has_findings=true" >> "$GITHUB_OUTPUT"
+            echo "Security audit complete — findings detected, notifying Slack."
+          fi
+
+      - name: Notify Slack
+        if: steps.audit.outputs.has_findings == 'true'
+        env:
+          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL_SECURITY }}
+          COMMIT_URL: ${{ github.event.head_commit.url || github.event.pull_request.html_url }}
+          COMMIT_MESSAGE: ${{ github.event.head_commit.message || github.event.pull_request.title }}
+          COMMIT_AUTHOR: ${{ github.event.head_commit.author.username || github.event.head_commit.author.name || github.event.pull_request.user.login }}
+        run: |
+          FINDINGS=$(cat /tmp/audit_result.txt)
+          COMMIT_TITLE=$(printf '%s\n' "$COMMIT_MESSAGE" | head -n1)
+
+          printf -v HEADER '*[kernels-community] Security Audit Finding*\n*Commit:* <%s|%s>\n*Author:* %s\n\n---\n\n' \
+            "$COMMIT_URL" "$COMMIT_TITLE" "$COMMIT_AUTHOR"
+
+          jq -n \
+            --arg text "${HEADER}${FINDINGS}" \
+            '{"text": $text}' > /tmp/slack_payload.json
+
+          curl -sf -X POST "$SLACK_WEBHOOK_URL" \
+            -H 'Content-Type: application/json' \
+            -d @/tmp/slack_payload.json
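
Note on the findings gate: the `Run security audit` step treats the run as clean when `NO_FINDINGS` appears anywhere in the model output, so a report that merely mentions the sentinel would be suppressed. The sketch below shows a stricter variant. It is an assumption for illustration, not part of this commit: `classify_audit_result` is a hypothetical helper that accepts the sentinel only when it is the entire output (ignoring whitespace) and reports empty output as its own case.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this commit): classify the audit output.
# Prints "clean" when the whole output is the NO_FINDINGS sentinel, "empty"
# when the file is missing or contains only whitespace, and "findings" otherwise.
classify_audit_result() {
  local result_file="$1"
  local content=""

  if [ -r "$result_file" ]; then
    # Drop all whitespace so trailing newlines or spaces do not affect the match.
    content=$(tr -d '[:space:]' < "$result_file")
  fi

  if [ -z "$content" ]; then
    echo "empty"
  elif [ "$content" = "NO_FINDINGS" ]; then
    echo "clean"
  else
    echo "findings"
  fi
}
```

In the workflow, `clean` would map to `has_findings=false` and `findings` to `has_findings=true`. The `empty` case could fail the step rather than post a blank message to Slack.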