saagpatel · Jun 21, 2026 · Jun 20, 2026
diff --git a/.dockerignore b/.dockerignore
@@ -0,0 +1,17 @@
+# The API image needs only pyproject.toml, uv.lock, and src/. Everything else is noise.
+.git
+.github
+.venv
+__pycache__
+**/__pycache__
+*.pyc
+.pytest_cache
+.mypy_cache
+.ruff_cache
+output
+*.db
+tests
+docs
+web
+node_modules
+*.md
diff --git a/DEPLOY.md b/DEPLOY.md
@@ -0,0 +1,98 @@
+# Deploying the hosted report
+
+Two deployables:
+
+- **API** (FastAPI engine) → a container host. This guide uses **Fly.io**
+  (`Dockerfile` + `fly.toml` are ready); Railway works from the same Dockerfile.
+- **Frontend** (`web/`, Next.js) → **Vercel**.
+
+Optional but recommended for production:
+
+- **Upstash Redis** — shared cache + per-IP throttle across instances.
+- A **GitHub token** (PAT or GitHub App installation token) so the API runs on
+  the 5 000 req/hr authenticated limit instead of 60 req/hr.
+
+---
+
+## 1. API → Fly.io
+
+```bash
+# One-time
+fly launch --no-deploy            # or `fly apps create ghra-report-api`
+fly volumes create ghra_data --region iad --size 1   # persists the waitlist DB
+
+# Secrets (never commit these; set them on Fly)
+fly secrets set GHRA_GITHUB_TOKEN=ghp_xxx
+fly secrets set GHRA_CORS_ORIGINS=https://your-frontend.vercel.app
+# If using Upstash (see §3):
+fly secrets set GHRA_REDIS_URL=rediss://default:xxx@xxx.upstash.io:6379
+
+fly deploy
+```
+
+`fly.toml` already wires the health check (`GET /api/health`), the `/data`
+volume mount, and the non-secret config (`GHRA_REPORT_TTL_SECONDS`,
+`GHRA_RATE_LIMIT`, `GHRA_RATE_WINDOW_SECONDS`, `GHRA_WAITLIST_DB=/data/waitlist.db`).
+
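+The container does not trust forwarded headers by default (the Dockerfile's uvicorn
+command passes no `--forwarded-allow-ips`). Behind Fly's proxy, opt in with
+`GHRA_TRUST_FORWARDED_FOR` and a uvicorn `--forwarded-allow-ips` override so the per-IP throttle keys on the real client address.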
+
+Verify: `curl https://ghra-report-api.fly.dev/api/health` →
+`{"status":"ok","github_token":true}`.
+
+---
+
+## 2. Frontend → Vercel
+
+```bash
+cd web
+vercel link
+vercel env add NEXT_PUBLIC_API_BASE production   # → https://ghra-report-api.fly.dev
+vercel --prod
+```
+
+Set the project **Root Directory** to `web/` in Vercel (the repo root is the
+Python engine). After the frontend URL is known, set it as `GHRA_CORS_ORIGINS`
+on the API (§1) so the browser's cross-origin calls are allowed.
+
+> Vercel commit-author gotcha: if `vercel --prod` is blocked on the commit
+> author, deploy from a git-free copy of `web/`, then run `vercel alias set`.
+
+---
+
+## 3. Upstash Redis (production cache + throttle)
+
+Without `GHRA_REDIS_URL` the API uses an in-process store — correct, but
+per-instance (cache and throttle don't share across machines). For more than one
+instance, create an Upstash Redis database and set its `rediss://` URL as
+`GHRA_REDIS_URL` (§1). The `hosting` extra (`redis`) is already installed in the
+image. Any Redis server version works (the throttle uses plain `EXPIRE`).
+
+---
+
+## 4. Environment reference
+
+| Variable                   | Where     | Default            | Purpose                                            |
+| -------------------------- | --------- | ------------------ | -------------------------------------------------- |
+| `GHRA_GITHUB_TOKEN`        | API       | _(none)_           | Server token → 5 000 req/hr + GraphQL repo lists.  |
+| `GHRA_CORS_ORIGINS`        | API       | localhost:3000     | Comma-separated allowed browser origins.           |
+| `GHRA_REDIS_URL`           | API       | _(in-memory)_      | Upstash/Redis URL for shared cache + throttle.     |
+| `GHRA_REPORT_TTL_SECONDS`  | API       | `3600`             | Report cache TTL.                                  |
+| `GHRA_RATE_LIMIT`          | API       | `20`               | Requests per window per IP.                        |
+| `GHRA_RATE_WINDOW_SECONDS` | API       | `3600`             | Throttle window.                                   |
+| `GHRA_WAITLIST_DB`         | API       | `<output>/waitlist.db` | SQLite path (point at the mounted volume).     |
+| `GHRA_TRUST_FORWARDED_FOR` | API       | off                | Opt in to forwarded headers behind a known proxy.  |
+| `NEXT_PUBLIC_API_BASE`     | Frontend  | localhost:8080     | API base URL the browser calls.                    |
+
+---
+
+## 5. Notes & follow-ups
+
+- **Waitlist durability:** SQLite on the mounted Fly volume survives restarts.
+  For multi-instance writes, migrate the waitlist to Postgres (Neon) — only
+  `SqliteWaitlistStore` needs a sibling implementation behind the existing
+  `WaitlistStore` protocol.
+- **Local parity:** run the API with
+  `uv run --extra serve python -m uvicorn --factory src.serve.app:create_app --port 8080`
+  and the frontend with `pnpm dev` in `web/`.
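
Review note on the per-IP throttle described in DEPLOY.md (§3 and the `GHRA_RATE_LIMIT` / `GHRA_RATE_WINDOW_SECONDS` rows): the throttle code itself is not part of this diff, so the following is only a minimal sketch of how a fixed-window, per-IP limiter of that shape might behave. The class name `FixedWindowThrottle`, its `allow` method, and the in-memory dict are hypothetical stand-ins, not the project's implementation; the defaults simply mirror the documented values.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FixedWindowThrottle:
    """Hypothetical in-memory fixed-window limiter keyed by client IP."""

    limit: int = 20             # mirrors the documented GHRA_RATE_LIMIT default
    window_seconds: int = 3600  # mirrors the documented GHRA_RATE_WINDOW_SECONDS default
    # Maps ip -> (window start time, requests counted in that window).
    _windows: dict[str, tuple[float, int]] = field(default_factory=dict)

    def allow(self, ip: str, now: float) -> bool:
        """Return True if ``ip`` may make a request at time ``now`` (seconds)."""
        started, count = self._windows.get(ip, (now, 0))
        if now - started >= self.window_seconds:
            # The window has elapsed (the role a Redis EXPIRE would play): start over.
            started, count = now, 0
        if count >= self.limit:
            return False
        self._windows[ip] = (started, count + 1)
        return True


# Small limits so the behaviour is easy to follow by hand.
throttle = FixedWindowThrottle(limit=2, window_seconds=10)
results = [
    throttle.allow("203.0.113.7", now=0.0),   # first request in the window
    throttle.allow("203.0.113.7", now=1.0),   # second request, reaches the limit
    throttle.allow("203.0.113.7", now=2.0),   # over the limit, rejected
    throttle.allow("198.51.100.9", now=2.0),  # a different IP has its own budget
    throttle.allow("203.0.113.7", now=10.0),  # window elapsed, allowed again
]
```

Because this state lives in one process, two API instances would each grant the full budget, which is the per-instance limitation §3 describes for deployments without `GHRA_REDIS_URL`.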
diff --git a/Dockerfile b/Dockerfile
@@ -0,0 +1,26 @@
+# API server image for the hosted clone-free report (FastAPI engine).
+# The Next.js frontend deploys separately to Vercel — see DEPLOY.md.
+FROM python:3.12-slim
+
+# uv for fast, reproducible, lockfile-pinned installs.
+COPY --from=ghcr.io/astral-sh/uv:0.5 /uv /bin/uv
+
+WORKDIR /app
+ENV UV_COMPILE_BYTECODE=1 \
+    UV_LINK_MODE=copy \
+    PYTHONUNBUFFERED=1
+
+# Install dependencies only (the app runs from the source tree via `src.*`
+# imports, so the project itself isn't packaged). Cached unless deps change.
+COPY pyproject.toml uv.lock ./
+RUN uv sync --frozen --no-install-project --no-dev --extra serve --extra hosting
+
+COPY src ./src
+
+EXPOSE 8080
+# Do not trust spoofable forwarded headers by default. Deployments behind a
+# known proxy can opt in with GHRA_TRUST_FORWARDED_FOR and a platform-specific
+# Uvicorn forwarded-allow-ips override.
+CMD ["uv", "run", "--no-sync", "python", "-m", "uvicorn", \
+     "--factory", "src.serve.app:create_app", \
+     "--host", "0.0.0.0", "--port", "8080"]
diff --git a/fly.toml b/fly.toml
@@ -0,0 +1,39 @@
+# Fly.io config for the hosted report API. Rename `app` to your Fly app.
+# Secrets (GHRA_GITHUB_TOKEN, GHRA_REDIS_URL, GHRA_CORS_ORIGINS) are set with
+# `fly secrets set …`, NOT here. See DEPLOY.md.
+app = "ghra-report-api"
+primary_region = "iad"
+
+[build]
+  dockerfile = "Dockerfile"
+
+[env]
+  GHRA_REPORT_TTL_SECONDS = "3600"
+  GHRA_RATE_LIMIT = "20"
+  GHRA_RATE_WINDOW_SECONDS = "3600"
+  # Persisted on the mounted volume below (survives restarts/redeploys).
+  GHRA_WAITLIST_DB = "/data/waitlist.db"
+
+[http_service]
+  internal_port = 8080
+  force_https = true
+  auto_stop_machines = "suspend"
+  auto_start_machines = true
+  min_machines_running = 0
+
+  [[http_service.checks]]
+    interval = "30s"
+    timeout = "5s"
+    grace_period = "10s"
+    method = "GET"
+    path = "/api/health"
+
+# Persistent volume for the SQLite waitlist. Create it once with:
+#   fly volumes create ghra_data --region iad --size 1
+[mounts]
+  source = "ghra_data"
+  destination = "/data"
+
+[[vm]]
+  size = "shared-cpu-1x"
+  memory = "512mb"
diff --git a/pyproject.toml b/pyproject.toml
@@ -59,6 +59,11 @@ serve = [
     "jinja2>=3.1",
     "python-multipart>=0.0.9",
 ]
+hosting = [
+    # Optional: a shared Redis/Upstash backend for the hosted report cache and
+    # per-IP throttle. Without it, the in-memory backend is used (single-instance).
+    "redis>=5.0",
+]
 build = [
     "shiv>=1.0",
     "build>=1.0",

diff --git a/src/api_checkout.py b/src/api_checkout.py
@@ -0,0 +1,176 @@
+"""Materialize a sparse, API-sourced repo skeleton for clone-free scoring.
+
+The audit engine's analyzers read a repo from the local filesystem. To score an
+arbitrary public GitHub user *without* cloning every repo (the hosted, multi-tenant
+path), this module reconstructs a sparse on-disk skeleton from the GitHub API:
+
+* one Git Trees API call yields every path → directories are created and files are
+  ``touch``-ed so presence-based analyzers (structure, testing, CI, docs, build)
+  see the real shape of the repo;
+* a bounded set of high-signal files (README, dependency manifests) are fetched via
+  the Contents API and written with real content, so content-based analyzers
+  (README quality, dependency counts, test-framework detection) still work.
+
+The existing analyzers run against this skeleton unmodified. ``materialize_api_workspace``
+mirrors ``cloner.clone_workspace`` exactly (context manager yielding ``{name: Path}``),
+so it is a drop-in replacement for the clone step.
+
+Materialization is sequential on purpose: it keeps API access well under GitHub's
+secondary rate limits (concurrent-request and points-per-minute caps) that a
+parallel burst across many repos would trip.
+"""
+
+from __future__ import annotations
+
+import logging
+import tempfile
+from contextlib import contextmanager
+from pathlib import Path
+from typing import TYPE_CHECKING, Callable, Generator
+
+from src.models import RepoMetadata
+
+if TYPE_CHECKING:
+    from src.github_client import GitHubClient
+
+logger = logging.getLogger(__name__)
+
+DEFAULT_MAX_FILES = 5000
+DEFAULT_MAX_CONTENT_FILES = 20
+
+# Files whose *content* (not just presence) carries real scoring signal. Matched
+# case-insensitively by basename; anything starting with ``readme`` also qualifies.
+CONTENT_FILE_NAMES = {
+    "package.json",
+    "pyproject.toml",
+    "requirements.txt",
+    "setup.py",
+    "setup.cfg",
+    "pipfile",
+    "cargo.toml",
+    "go.mod",
+    "pom.xml",
+    "build.gradle",
+    "gemfile",
+    "composer.json",
+}
+
+
+def _is_content_file(path: str) -> bool:
+    base = path.rsplit("/", 1)[-1].lower()
+    return base.startswith("readme") or base in CONTENT_FILE_NAMES
+
+
+def _safe_target(dest: Path, rel: str) -> Path | None:
+    """Resolve ``rel`` under ``dest``, rejecting traversal/absolute escapes.
+
+    Tree paths come from arbitrary remote repos, so a malicious entry like
+    ``../../etc/passwd`` or ``/abs/evil`` must never resolve outside ``dest``.
+    """
+    rel = rel.strip()
+    if not rel or rel in (".", "..") or "\x00" in rel:
+        return None
+    candidate = (dest / rel).resolve()
+    dest_resolved = dest.resolve()
+    if candidate == dest_resolved:
+        return None
+    if dest_resolved not in candidate.parents:
+        return None
+    return candidate
+
+
+def materialize_api_checkout(
+    metadata: RepoMetadata,
+    client: "GitHubClient",
+    dest: Path,
+    *,
+    max_files: int = DEFAULT_MAX_FILES,
+    max_content_files: int = DEFAULT_MAX_CONTENT_FILES,
+) -> Path:
+    """Build a sparse skeleton of one repo under ``dest`` from the GitHub API.
+
+    Returns ``dest``. If the repo tree is expectedly unavailable (empty repo,
+    missing ref, private repo, gone), ``dest`` is created empty so downstream
+    analyzers score it as a near-empty repo rather than crashing. Transient,
+    rate-limit, and server errors propagate to the caller.
+    """
+    dest = Path(dest)
+    dest.mkdir(parents=True, exist_ok=True)
+
+    owner, _, repo = metadata.full_name.partition("/")
+    if not owner or not repo:
+        logger.warning(
+            "Cannot materialize %r: full_name is not 'owner/repo'",
+            metadata.full_name,
+        )
+        return dest
+
+    tree = client.get_repo_tree(owner, repo, metadata.default_branch)
+    if not tree.get("available"):
+        return dest
+    if tree.get("truncated"):
+        logger.warning(
+            "Tree truncated for %s — skeleton is incomplete", metadata.full_name
+        )
+
+    for rel in tree.get("dirs", []):
+        target = _safe_target(dest, rel)
+        if target is not None:
+            target.mkdir(parents=True, exist_ok=True)
+
+    content_budget = max_content_files
+    for rel in tree.get("files", [])[:max_files]:
+        target = _safe_target(dest, rel)
+        if target is None:
+            continue
+        target.parent.mkdir(parents=True, exist_ok=True)
+        text = ""
+        if content_budget > 0 and _is_content_file(rel):
+            content_budget -= 1  # count attempts so failed fetches stay bounded too
+            fetched = client.get_file_content(
+                owner, repo, rel, ref=metadata.default_branch
+            )
+            if fetched is not None:
+                text = fetched
+        target.write_text(text, encoding="utf-8")
+
+    return dest
+
+
+@contextmanager
+def materialize_api_workspace(
+    repos: list[RepoMetadata],
+    client: "GitHubClient",
+    *,
+    on_progress: Callable[[int, int, str], None] | None = None,
+    on_error: Callable[[str, str], None] | None = None,
+    max_files: int = DEFAULT_MAX_FILES,
+    max_content_files: int = DEFAULT_MAX_CONTENT_FILES,
+) -> Generator[dict[str, Path], None, None]:
+    """Materialize API skeletons for many repos into a session-unique temp dir.
+
+    Drop-in replacement for ``cloner.clone_workspace``: yields a dict mapping
+    repo name → skeleton path. A repo that fails to materialize is skipped with
+    a warning so one bad repo never aborts a portfolio scan.
+    """
+    with tempfile.TemporaryDirectory(prefix="audit-api-") as tmpdir:
+        root = Path(tmpdir)
+        workspace: dict[str, Path] = {}
+        total = len(repos)
+        for index, repo in enumerate(repos, 1):
+            if on_progress:
+                on_progress(index, total, repo.name)
+            try:
+                dest = materialize_api_checkout(
+                    repo,
+                    client,
+                    root / repo.name,
+                    max_files=max_files,
+                    max_content_files=max_content_files,
+                )
+                workspace[repo.name] = dest
+            except Exception as exc:  # noqa: BLE001 — one bad repo must not abort the scan
+                logger.warning("API checkout failed for %s: %s", repo.name, exc)
+                if on_error:
+                    on_error(repo.name, str(exc))
+        yield workspace
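
Review note on `_safe_target` in `src/api_checkout.py`: since tree paths come from arbitrary remote repos, the traversal guard is worth exercising on its own. The snippet below copies the function body from the diff (renamed `safe_target` so it runs standalone, without `src.*` imports) and feeds it a few paths; the sample paths are illustrative, not taken from the project's test suite.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def safe_target(dest: Path, rel: str) -> Path | None:
    """Standalone copy of ``_safe_target`` from the diff above."""
    rel = rel.strip()
    if not rel or rel in (".", "..") or "\x00" in rel:
        return None
    candidate = (dest / rel).resolve()
    dest_resolved = dest.resolve()
    if candidate == dest_resolved:
        return None
    if dest_resolved not in candidate.parents:
        return None
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    dest = Path(tmp)

    # An ordinary nested path resolves to a location under dest.
    accepted = safe_target(dest, "src/pkg/main.py")
    accepted_rel = accepted.relative_to(dest.resolve()).as_posix()

    # Each of these would land on or outside dest, so the guard returns None.
    rejected = [
        safe_target(dest, "../../etc/passwd"),  # relative traversal
        safe_target(dest, "/abs/evil"),         # absolute path replaces dest
        safe_target(dest, "src/../.."),         # climbs out after descending
        safe_target(dest, "bad\x00name"),       # embedded NUL byte
        safe_target(dest, ""),                  # empty entry
    ]
```

Comparing resolved paths, rather than scanning the string for `..`, is what lets the same check cover both the absolute-path case and the climb-out-after-descending case.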