Commit d7d1339

sayakpaul and danieldk authored
[CI] Security audit workflow (#840)
* add security audit workflow
* enhance targets
* Apply suggestions from code review

Co-authored-by: Daniël de Kok <me@danieldk.eu>

* Update .github/workflows/security-audit.yml

Co-authored-by: Daniël de Kok <me@danieldk.eu>

* up

---------

Co-authored-by: Daniël de Kok <me@danieldk.eu>
1 parent 373a45c commit d7d1339

1 file changed

Lines changed: 197 additions & 0 deletions

@@ -0,0 +1,197 @@
+name: Security Audit
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    types: [opened, synchronize, reopened]
+
+jobs:
+  security-audit:
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        with:
+          fetch-depth: 0
+
+      - uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4
+        with:
+          node-version: "20"
+
+      - name: Install Claude Code
+        run: npm install -g @anthropic-ai/claude-code
+
+      - name: Generate diff
+        run: git diff ${{ github.event.before || github.event.pull_request.base.sha }}...${{ github.sha }} > /tmp/changes.diff
+
+      - name: Run security audit
+        id: audit
+        env:
+          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+        run: |
+          {
+            cat <<'PROMPT'
+          You are a senior security engineer performing a penetration-test-style review of a
+          change that just landed on the main branch of the kernels-community project. This
+          repository hosts the source code for compute kernels (CUDA, Metal, ROCm, XPU, Triton,
+          C++, Python torch extensions) that are built by CI and uploaded to
+          `hf.co/kernels-community` on the Hugging Face Hub. End users download and load these
+          kernels — and execute the resulting native code — on their own machines (including in
+          CI environments and training jobs) via the `kernels` Python package. Treat the attack
+          surface accordingly: a backdoor that slips into a merged kernel will be served to
+          every downstream user.
+
+          A brief overview of the repository layout:
+
+          * Each top-level directory (e.g. `flash-attn3`, `relu`, `paged-attention`, `rmsnorm`)
+            is a single kernel with the following typical structure:
+            - `build.toml`: declares the kernel's name, repo-id, backends, dependencies, and
+              source file list. CI uses this to drive the build and the Hub upload.
+            - `flake.nix` / `flake.lock`: Nix build pinning. Pulls `kernel-builder` and any
+              C/C++/CUDA toolchain dependencies.
+            - Backend source directories (`*_cuda`, `*_metal`, `*_xpu`, `*_rocm`, etc.):
+              native kernel implementations.
+            - `torch-ext/`: Python entry points, `torch_binding.cpp`/`torch_binding.h` that
+              register the kernel as a Torch op, and a Python package directory with
+              `__init__.py` that exposes the kernel via `._ops`.
+            - `tests/`: pytest suite. Run by users and by CI on GPU runners.
+            - `benchmarks/`: benchmark scripts.
+          * `.github/workflows/`: CI for building, testing, validating and uploading kernels.
+          * `scripts/`: maintenance scripts (freshness checks, failure reporting).
+
+          The diff of the change follows below. You also have access to the full repository —
+          explore it when the diff alone is not sufficient to assess impact (e.g. to check who
+          calls a modified function, to confirm whether a hardcoded URL is reachable from a
+          build step, or to understand what a CUDA kernel actually computes).
+
+          Think like an attacker. The threat model is: a malicious contributor (or compromised
+          upstream) lands code that (a) executes attacker-controlled logic on every user that
+          loads this kernel, (b) exfiltrates secrets from users' machines or from this repo's
+          CI, (c) injects backdoors into the built artifact uploaded to the Hub, or (d) abuses
+          the GPU/CPU side effects of the kernel to leak data or escalate privilege.
+
+          Focus on:
+          - **Malicious code in kernel sources (CUDA, Metal, ROCm, XPU, Triton, C/C++):** Look
+            for arbitrary out-of-bounds reads/writes that go beyond the declared tensor shapes,
+            kernels that scan or copy memory they were not asked to touch, hard-coded device
+            pointers, inline PTX/asm that performs unexpected operations, and side channels
+            (timing, cache, shared-memory residue) that could leak tensor contents. Watch for
+            kernels whose math doesn't match their name (e.g. a "relu" kernel that also writes
+            to a second buffer).
+          - **Python torch-ext code:** `torch-ext/**/__init__.py` and any helper
+            Python modules run inside the user's interpreter the moment the kernel is loaded.
+            Flag any of: network calls (`urllib`, `requests`, `socket`, `http.client`, etc.),
+            filesystem writes, spawning of other processes (e.g. `subprocess`/`os.system`/`os.exec*`),
+            `eval`/`exec`/`compile` of dynamic strings, `ctypes.CDLL`/`cffi` loading of
+            arbitrary paths, `importlib` with attacker-controllable module names, base64/hex
+            blobs decoded then executed, environment variable reads that change control flow in
+            non-obvious ways, and `sys.modules` manipulation that could shadow stdlib or
+            third-party modules.
+          - **Torch op registration and namespace hygiene:** Every kernel registers ops under a
+            namespace derived from its package. Look for registrations that omit
+            `add_op_namespace_prefix`, that register ops under a name belonging to another
+            kernel or to PyTorch core, that override `aten::` ops, or that mutate
+            `torch.ops`/`torch.library` global state in surprising ways.
+          - **`torch_binding.cpp` / C++ glue:** TORCH_LIBRARY registrations, dispatch keys,
+            mutable-argument annotations, and any C++ that does I/O, dlopen, or shells out
+            beyond pure tensor math.
+          - **`build.toml` manipulation:** Source-file lists, dependency declarations,
+            `repo-id`, and backend selection. A new `.cu`/`.cpp`/`.py` source added to the file
+            list that doesn't appear in the kernel's stated functionality, a `repo-id` change
+            that retargets uploads to a different Hub repo, or a dependency on an unfamiliar
+            package are all red flags.
+          - **`flake.nix` / `flake.lock` / build pinning:** Look for new flake inputs pointing
+            at attacker-controlled forks, removal of `--no-write-lock-file`-style guards,
+            relaxation of the Nix sandbox (`sandbox = relaxed`, `sandbox = false`,
+            `extra-sandbox-paths`), addition of `__noChroot` or `allowSubstitutes = false`,
+            and changes to `kernel-builder`/`nixpkgs` pins. Stale or downgraded lock entries
+            for security-relevant packages (CUDA toolchain, glibc, OpenSSL, PyTorch) are also
+            high-impact.
+          - **Embedded download URLs and fetch-at-build-time:** `curl`/`wget`/`fetchurl`/
+            `fetchTarball`/`fetchGit` calls with unpinned refs or shell snippets in `build.toml`/`flake.nix`. Anything
+            that pulls bytes from the network at build or import time without a SHA pin is a
+            supply-chain risk.
+          - **Test and benchmark scripts:** `tests/` and `benchmarks/` run on CI GPU runners
+            that may have privileged access. Flag tests that download external data, write
+            outside the test temp dir, read CI secrets/env vars, or call out to the network.
+            Also flag tests that disable safety checks (e.g. `torch.compile` cache poisoning,
+            `weights_only=False` torch.load) on attacker-controlled inputs.
+          - **CI/CD security (`.github/workflows/`, `.github/actions/`, `scripts/`):** Workflow
+            permission scopes (`contents: write`, `id-token: write`, `packages: write`,
+            `pull-requests: write`), secret exposure in `run:` steps, script injection via
+            PR-controlled values (`github.event.pull_request.title`, `head_ref`, commit
+            messages, file paths) interpolated into shell, and changes that grant
+            `pull_request_target` or that make trusted jobs run untrusted PR code. Verify all
+            third-party actions are SHA-pinned (full 40-char SHA), not version-tagged. Flag any
+            workflow that hands `secrets.HF_TOKEN`, `secrets.ANTHROPIC_API_KEY`,
+            `secrets.SLACK_WEBHOOK_URL*` (including `SLACK_WEBHOOK_URL` and `SLACK_WEBHOOK_URL_SECURITY`),
+            or Cachix signing keys to PR-triggered code paths.
+          - **Hub upload pipeline (`manual-build-upload.yaml`, release workflows):** Anything
+            that widens what gets uploaded, that uploads to a `repo-id` not declared in the
+            kernel's `build.toml`, or that uses elevated tokens in steps that also execute
+            PR-derived code.
+          - **Obfuscation and steganography:** Long base64/hex/zlib blobs in Python or C++;
+            Unicode lookalike characters in identifiers (homoglyph attacks); zero-width or
+            right-to-left override characters in source; comments that hide payloads; binary
+            files (`.so`, `.dylib`, `.bin`, `.safetensors`, `.pt`) checked into the source tree
+            that should be built from source instead.
+          - **Information disclosure and credential handling:** Any code that reads
+            `~/.cache/huggingface/token`, `HF_TOKEN`, `HUGGING_FACE_HUB_TOKEN`,
+            `ANTHROPIC_API_KEY`, AWS/GCP/Azure credential paths, SSH keys, or kubeconfigs.
+            User-agent strings, error messages, or telemetry that leak system details or
+            tokens.
+          - **Denial of service:** Kernels with input-dependent unbounded loops or huge
+            allocations, Python loaders that recurse on attacker-influenced metadata, build
+            steps with no timeout that could be used to grief CI.
+
+          For each finding, assess exploitability — not just theoretical presence. A
+          hard-coded test fixture URL is lower severity than a `curl | sh` in a build step.
+          Distinguish "this code is sketchy" from "an attacker can use this today."
+
+          If you find security issues, output your report formatted for Slack using mrkdwn syntax.
+          Use this structure:
+
+          *[SEVERITY]* `file:lines` — Title
+          Description of the vulnerability and how it could be exploited.
+          _Suggestion:_ How to fix.
+
+          Separate multiple findings with blank lines. Be concise but specific.
+
+          If no security issues are found, output exactly: NO_FINDINGS
+
+          === DIFF ===
+          PROMPT
+            cat /tmp/changes.diff
+          } | claude -p --model claude-opus-4-6 > /tmp/audit_result.txt
+
+          if grep -q "NO_FINDINGS" /tmp/audit_result.txt; then
+            echo "has_findings=false" >> "$GITHUB_OUTPUT"
+            echo "Security audit complete — no findings."
+          else
+            echo "has_findings=true" >> "$GITHUB_OUTPUT"
+            echo "Security audit complete — findings detected, notifying Slack."
+          fi
+
+      - name: Notify Slack
+        if: steps.audit.outputs.has_findings == 'true'
+        env:
+          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL_SECURITY }}
+          COMMIT_URL: ${{ github.event.head_commit.url || github.event.pull_request.html_url }}
+          COMMIT_MESSAGE: ${{ github.event.head_commit.message || github.event.pull_request.title }}
+          COMMIT_AUTHOR: ${{ github.event.head_commit.author.username || github.event.head_commit.author.name || github.event.pull_request.user.login }}
+        run: |
+          FINDINGS=$(cat /tmp/audit_result.txt)
+          COMMIT_TITLE=$(printf '%s\n' "$COMMIT_MESSAGE" | head -n1)
+
+          printf -v HEADER '*[kernels-community] Security Audit Finding*\n*Commit:* <%s|%s>\n*Author:* %s\n\n---\n\n' \
+            "$COMMIT_URL" "$COMMIT_TITLE" "$COMMIT_AUTHOR"
+
+          jq -n \
+            --arg text "${HEADER}${FINDINGS}" \
+            '{"text": $text}' > /tmp/slack_payload.json
+
+          curl -sf -X POST "$SLACK_WEBHOOK_URL" \
+            -H 'Content-Type: application/json' \
+            -d @/tmp/slack_payload.json
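
The audit step sets `has_findings` by running `grep -q "NO_FINDINGS"` on the result file, which matches the sentinel anywhere in the output, including inside a report that merely mentions it. The following is a minimal sketch of a stricter check that reports "no findings" only when the sentinel is the sole non-blank content. `classify_audit_result` is a hypothetical helper used for illustration and is not part of this commit.

```shell
# Hypothetical helper, not part of the commit: classify an audit result file.
# Prints "false" only when the file's sole non-blank content is NO_FINDINGS,
# and "true" in every other case (including an empty file).
classify_audit_result() {
  result_file="$1"
  # Trim leading/trailing whitespace on each line and drop blank lines,
  # so a trailing newline or padding does not affect the comparison.
  trimmed="$(sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' "$result_file")"
  if [ "$trimmed" = "NO_FINDINGS" ]; then
    echo "false"
  else
    echo "true"
  fi
}
```

In the workflow, the value printed by such a helper could be written to `$GITHUB_OUTPUT` as `has_findings=...` in place of the `grep` branch. Treating an empty file as a finding is a design choice made for this sketch, so that a silent failure still reaches a human.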

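
The Notify Slack step keeps only the first line of the commit message and wraps it in a Slack mrkdwn link of the form `<url|label>`. As a rough sketch of that formatting, runnable outside the workflow, the function below builds the same header from three arguments. `build_slack_header` and the sample values in the usage note are assumptions made for illustration, not code from this commit.

```shell
# Hypothetical helper, not part of the commit: build the Slack message header.
# Arguments: commit URL, full commit message (or PR title), author.
build_slack_header() {
  commit_url="$1"
  commit_message="$2"
  commit_author="$3"
  # Keep only the first line of a possibly multi-line commit message.
  commit_title="$(printf '%s\n' "$commit_message" | head -n1)"
  # Slack mrkdwn links use the form <url|label>.
  printf '*[kernels-community] Security Audit Finding*\n*Commit:* <%s|%s>\n*Author:* %s\n\n---\n\n' \
    "$commit_url" "$commit_title" "$commit_author"
}
```

For example, `build_slack_header 'https://example.com/c/1' 'Add workflow' 'octocat'` prints a three-line header followed by a `---` separator, to which the findings text would be appended.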