Skip to content

Commit 99df92c

Browse files
authored
fix: harden API-only scoring review issues
Address CodeQL alerts, tree API error propagation, forwarded-header trust, and hosted API review comments.
1 parent 80c5636 commit 99df92c

39 files changed

Lines changed: 4456 additions & 59 deletions

.dockerignore

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
1+
# The API image needs only pyproject/uv.lock + src/. Everything else is noise.
2+
.git
3+
.github
4+
.venv
5+
__pycache__
6+
**/__pycache__
7+
*.pyc
8+
.pytest_cache
9+
.mypy_cache
10+
.ruff_cache
11+
output
12+
*.db
13+
tests
14+
docs
15+
web
16+
node_modules
17+
*.md

DEPLOY.md

Lines changed: 98 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,98 @@
1+
# Deploying the hosted report
2+
3+
Two deployables:
4+
5+
- **API** (FastAPI engine) → a container host. This guide uses **Fly.io**
6+
(`Dockerfile` + `fly.toml` are ready); Railway works from the same Dockerfile.
7+
- **Frontend** (`web/`, Next.js) → **Vercel**.
8+
9+
Optional but recommended for production:
10+
11+
- **Upstash Redis** — shared cache + per-IP throttle across instances.
12+
- A **GitHub token** (PAT or GitHub App installation token) so the API runs on
13+
the 5 000 req/hr authenticated limit instead of 60 req/hr.
14+
15+
---
16+
17+
## 1. API → Fly.io
18+
19+
```bash
20+
# One-time
21+
fly launch --no-deploy # or `fly apps create ghra-report-api`
22+
fly volumes create ghra_data --region iad --size 1 # persists the waitlist DB
23+
24+
# Secrets (never commit these; set them on Fly)
25+
fly secrets set GHRA_GITHUB_TOKEN=ghp_xxx
26+
fly secrets set GHRA_CORS_ORIGINS=https://your-frontend.vercel.app
27+
# If using Upstash (see §3):
28+
fly secrets set GHRA_REDIS_URL=rediss://default:xxx@xxx.upstash.io:6379
29+
30+
fly deploy
31+
```
32+
33+
`fly.toml` already wires the health check (`GET /api/health`), the `/data`
34+
volume mount, and the non-secret config (`GHRA_REPORT_TTL_SECONDS`,
35+
`GHRA_RATE_LIMIT`, `GHRA_RATE_WINDOW_SECONDS`, `GHRA_WAITLIST_DB=/data/waitlist.db`).
36+
37+
The container runs uvicorn with `--forwarded-allow-ips=*`, so behind Fly's proxy
38+
the per-IP throttle keys on the real client address (no `GHRA_TRUST_FORWARDED_FOR`
39+
needed).
40+
41+
Verify: `curl https://ghra-report-api.fly.dev/api/health`
42+
`{"status":"ok","github_token":true}`.
43+
44+
---
45+
46+
## 2. Frontend → Vercel
47+
48+
```bash
49+
cd web
50+
vercel link
51+
vercel env add NEXT_PUBLIC_API_BASE production # → https://ghra-report-api.fly.dev
52+
vercel --prod
53+
```
54+
55+
Set the project **Root Directory** to `web/` in Vercel (the repo root is the
56+
Python engine). After the frontend URL is known, set it as `GHRA_CORS_ORIGINS`
57+
on the API (§1) so the browser's cross-origin calls are allowed.
58+
59+
> Vercel commit-author gotcha: if `vercel --prod` is blocked on the commit
60+
> author, deploy from a git-free copy of `web/` and `vercel alias set`.
61+
62+
---
63+
64+
## 3. Upstash Redis (production cache + throttle)
65+
66+
Without `GHRA_REDIS_URL` the API uses an in-process store — correct, but
67+
per-instance (cache and throttle don't share across machines). For more than one
68+
instance, create an Upstash Redis database and set its `rediss://` URL as
69+
`GHRA_REDIS_URL` (§1). The `hosting` extra (`redis`) is already installed in the
70+
image. Any Redis server version works (the throttle uses plain `EXPIRE`).
71+
72+
---
73+
74+
## 4. Environment reference
75+
76+
| Variable | Where | Default | Purpose |
77+
| -------------------------- | --------- | ------------------ | -------------------------------------------------- |
78+
| `GHRA_GITHUB_TOKEN` | API | _(none)_ | Server token → 5 000 req/hr + GraphQL repo lists. |
79+
| `GHRA_CORS_ORIGINS` | API | localhost:3000 | Comma-separated allowed browser origins. |
80+
| `GHRA_REDIS_URL` | API | _(in-memory)_ | Upstash/Redis URL for shared cache + throttle. |
81+
| `GHRA_REPORT_TTL_SECONDS` | API | `3600` | Report cache TTL. |
82+
| `GHRA_RATE_LIMIT` | API | `20` | Requests per window per IP. |
83+
| `GHRA_RATE_WINDOW_SECONDS` | API | `3600` | Throttle window. |
84+
| `GHRA_WAITLIST_DB` | API | `<output>/waitlist.db` | SQLite path (point at the mounted volume). |
85+
| `GHRA_TRUST_FORWARDED_FOR` | API | off | Only if not using uvicorn `--forwarded-allow-ips`. |
86+
| `NEXT_PUBLIC_API_BASE` | Frontend | localhost:8080 | API base URL the browser calls. |
87+
88+
---
89+
90+
## 5. Notes & follow-ups
91+
92+
- **Waitlist durability:** SQLite on the mounted Fly volume survives restarts.
93+
For multi-instance writes, migrate the waitlist to Postgres (Neon) — only
94+
`SqliteWaitlistStore` needs a sibling implementation behind the existing
95+
`WaitlistStore` protocol.
96+
- **Local parity:** run the API with
97+
`uv run --extra serve python -m uvicorn --factory src.serve.app:create_app --port 8080`
98+
and the frontend with `pnpm dev` in `web/`.

Dockerfile

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,26 @@
1+
# API server image for the hosted clone-free report (FastAPI engine).
2+
# The Next.js frontend deploys separately to Vercel — see DEPLOY.md.
3+
FROM python:3.12-slim
4+
5+
# uv for fast, reproducible, lockfile-pinned installs.
6+
COPY --from=ghcr.io/astral-sh/uv:0.5 /uv /bin/uv
7+
8+
WORKDIR /app
9+
ENV UV_COMPILE_BYTECODE=1 \
10+
UV_LINK_MODE=copy \
11+
PYTHONUNBUFFERED=1
12+
13+
# Install dependencies only (the app runs from the source tree via `src.*`
14+
# imports, so the project itself isn't packaged). Cached unless deps change.
15+
COPY pyproject.toml uv.lock ./
16+
RUN uv sync --frozen --no-install-project --no-dev --extra serve --extra hosting
17+
18+
COPY src ./src
19+
20+
EXPOSE 8080
21+
# Do not trust spoofable forwarded headers by default. Deployments behind a
22+
# known proxy can opt in with GHRA_TRUST_FORWARDED_FOR and a platform-specific
23+
# Uvicorn forwarded-allow-ips override.
24+
CMD ["uv", "run", "--no-sync", "python", "-m", "uvicorn", \
25+
"--factory", "src.serve.app:create_app", \
26+
"--host", "0.0.0.0", "--port", "8080"]

fly.toml

Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,39 @@
1+
# Fly.io config for the hosted report API. Rename `app` to your Fly app.
2+
# Secrets (GHRA_GITHUB_TOKEN, GHRA_REDIS_URL, GHRA_CORS_ORIGINS) are set with
3+
# `fly secrets set …`, NOT here. See DEPLOY.md.
4+
app = "ghra-report-api"
5+
primary_region = "iad"
6+
7+
[build]
8+
dockerfile = "Dockerfile"
9+
10+
[env]
11+
GHRA_REPORT_TTL_SECONDS = "3600"
12+
GHRA_RATE_LIMIT = "20"
13+
GHRA_RATE_WINDOW_SECONDS = "3600"
14+
# Persisted on the mounted volume below (survives restarts/redeploys).
15+
GHRA_WAITLIST_DB = "/data/waitlist.db"
16+
17+
[http_service]
18+
internal_port = 8080
19+
force_https = true
20+
auto_stop_machines = "suspend"
21+
auto_start_machines = true
22+
min_machines_running = 0
23+
24+
[[http_service.checks]]
25+
interval = "30s"
26+
timeout = "5s"
27+
grace_period = "10s"
28+
method = "GET"
29+
path = "/api/health"
30+
31+
# Persistent volume for the SQLite waitlist. Create it once with:
32+
# fly volumes create ghra_data --region iad --size 1
33+
[mounts]
34+
source = "ghra_data"
35+
destination = "/data"
36+
37+
[[vm]]
38+
size = "shared-cpu-1x"
39+
memory = "512mb"

pyproject.toml

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -59,6 +59,11 @@ serve = [
5959
"jinja2>=3.1",
6060
"python-multipart>=0.0.9",
6161
]
62+
hosting = [
63+
# Optional: a shared Redis/Upstash backend for the hosted report cache and
64+
# per-IP throttle. Without it, the in-memory backend is used (single-instance).
65+
"redis>=5.0",
66+
]
6267
build = [
6368
"shiv>=1.0",
6469
"build>=1.0",

src/api_checkout.py

Lines changed: 176 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,176 @@
1+
"""Materialize a sparse, API-sourced repo skeleton for clone-free scoring.
2+
3+
The audit engine's analyzers read a repo from the local filesystem. To score an
4+
arbitrary public GitHub user *without* cloning every repo (the hosted, multi-tenant
5+
path), this module reconstructs a sparse on-disk skeleton from the GitHub API:
6+
7+
* one Git Trees API call yields every path → directories are created and files are
8+
``touch``-ed so presence-based analyzers (structure, testing, CI, docs, build)
9+
see the real shape of the repo;
10+
* a bounded set of high-signal files (README, dependency manifests) are fetched via
11+
the Contents API and written with real content, so content-based analyzers
12+
(README quality, dependency counts, test-framework detection) still work.
13+
14+
The existing analyzers run against this skeleton unmodified. ``materialize_api_workspace``
15+
mirrors ``cloner.clone_workspace`` exactly (context manager yielding ``{name: Path}``),
16+
so it is a drop-in replacement for the clone step.
17+
18+
Materialization is sequential on purpose: it keeps API access well under GitHub's
19+
secondary rate limits (concurrent-request and points-per-minute caps) that a
20+
parallel burst across many repos would trip.
21+
"""
22+
23+
from __future__ import annotations
24+
25+
import logging
26+
import tempfile
27+
from contextlib import contextmanager
28+
from pathlib import Path
29+
from typing import TYPE_CHECKING, Callable, Generator
30+
31+
from src.models import RepoMetadata
32+
33+
if TYPE_CHECKING:
34+
from src.github_client import GitHubClient
35+
36+
logger = logging.getLogger(__name__)
37+
38+
DEFAULT_MAX_FILES = 5000
39+
DEFAULT_MAX_CONTENT_FILES = 20
40+
41+
# Files whose *content* (not just presence) carries real scoring signal. Matched
42+
# case-insensitively by basename; anything starting with ``readme`` also qualifies.
43+
CONTENT_FILE_NAMES = {
44+
"package.json",
45+
"pyproject.toml",
46+
"requirements.txt",
47+
"setup.py",
48+
"setup.cfg",
49+
"pipfile",
50+
"cargo.toml",
51+
"go.mod",
52+
"pom.xml",
53+
"build.gradle",
54+
"gemfile",
55+
"composer.json",
56+
}
57+
58+
59+
def _is_content_file(path: str) -> bool:
60+
base = path.rsplit("/", 1)[-1].lower()
61+
return base.startswith("readme") or base in CONTENT_FILE_NAMES
62+
63+
64+
def _safe_target(dest: Path, rel: str) -> Path | None:
65+
"""Resolve ``rel`` under ``dest``, rejecting traversal/absolute escapes.
66+
67+
Tree paths come from arbitrary remote repos, so a malicious entry like
68+
``../../etc/passwd`` or ``/abs/evil`` must never resolve outside ``dest``.
69+
"""
70+
rel = rel.strip()
71+
if not rel or rel in (".", "..") or "\x00" in rel:
72+
return None
73+
candidate = (dest / rel).resolve()
74+
dest_resolved = dest.resolve()
75+
if candidate == dest_resolved:
76+
return None
77+
if dest_resolved not in candidate.parents:
78+
return None
79+
return candidate
80+
81+
82+
def materialize_api_checkout(
83+
metadata: RepoMetadata,
84+
client: "GitHubClient",
85+
dest: Path,
86+
*,
87+
max_files: int = DEFAULT_MAX_FILES,
88+
max_content_files: int = DEFAULT_MAX_CONTENT_FILES,
89+
) -> Path:
90+
"""Build a sparse skeleton of one repo under ``dest`` from the GitHub API.
91+
92+
Returns ``dest``. If the repo tree is expectedly unavailable (empty repo,
93+
missing ref, private repo, gone), ``dest`` is created empty so downstream
94+
analyzers score it as a near-empty repo rather than crashing. Transient,
95+
rate-limit, and server errors propagate to the API boundary.
96+
"""
97+
dest = Path(dest)
98+
dest.mkdir(parents=True, exist_ok=True)
99+
100+
owner, _, repo = metadata.full_name.partition("/")
101+
if not owner or not repo:
102+
logger.warning(
103+
"Cannot materialize %r: full_name is not 'owner/repo'",
104+
metadata.full_name,
105+
)
106+
return dest
107+
108+
tree = client.get_repo_tree(owner, repo, metadata.default_branch)
109+
if not tree.get("available"):
110+
return dest
111+
if tree.get("truncated"):
112+
logger.warning(
113+
"Tree truncated for %s — skeleton is incomplete", metadata.full_name
114+
)
115+
116+
for rel in tree.get("dirs", []):
117+
target = _safe_target(dest, rel)
118+
if target is not None:
119+
target.mkdir(parents=True, exist_ok=True)
120+
121+
content_budget = max_content_files
122+
for rel in tree.get("files", [])[:max_files]:
123+
target = _safe_target(dest, rel)
124+
if target is None:
125+
continue
126+
target.parent.mkdir(parents=True, exist_ok=True)
127+
text = ""
128+
if content_budget > 0 and _is_content_file(rel):
129+
fetched = client.get_file_content(
130+
owner, repo, rel, ref=metadata.default_branch
131+
)
132+
if fetched is not None:
133+
text = fetched
134+
content_budget -= 1
135+
target.write_text(text, encoding="utf-8")
136+
137+
return dest
138+
139+
140+
@contextmanager
141+
def materialize_api_workspace(
142+
repos: list[RepoMetadata],
143+
client: "GitHubClient",
144+
*,
145+
on_progress: Callable[[int, int, str], None] | None = None,
146+
on_error: Callable[[str, str], None] | None = None,
147+
max_files: int = DEFAULT_MAX_FILES,
148+
max_content_files: int = DEFAULT_MAX_CONTENT_FILES,
149+
) -> Generator[dict[str, Path], None, None]:
150+
"""Materialize API skeletons for many repos into a session-unique temp dir.
151+
152+
Drop-in replacement for ``cloner.clone_workspace``: yields a dict mapping
153+
repo name → skeleton path. A repo that fails to materialize is skipped with
154+
a warning so one bad repo never aborts a portfolio scan.
155+
"""
156+
with tempfile.TemporaryDirectory(prefix="audit-api-") as tmpdir:
157+
root = Path(tmpdir)
158+
workspace: dict[str, Path] = {}
159+
total = len(repos)
160+
for index, repo in enumerate(repos, 1):
161+
if on_progress:
162+
on_progress(index, total, repo.name)
163+
try:
164+
dest = materialize_api_checkout(
165+
repo,
166+
client,
167+
root / repo.name,
168+
max_files=max_files,
169+
max_content_files=max_content_files,
170+
)
171+
workspace[repo.name] = dest
172+
except Exception as exc: # noqa: BLE001 — one bad repo must not abort the scan
173+
logger.warning("API checkout failed for %s: %s", repo.name, exc)
174+
if on_error:
175+
on_error(repo.name, str(exc))
176+
yield workspace

0 commit comments

Comments
 (0)