Merged
17 changes: 17 additions & 0 deletions .dockerignore
@@ -0,0 +1,17 @@
# The API image needs only pyproject/uv.lock + src/. Everything else is noise.
.git
.github
.venv
__pycache__
**/__pycache__
*.pyc
.pytest_cache
.mypy_cache
.ruff_cache
output
*.db
tests
docs
web
node_modules
*.md
98 changes: 98 additions & 0 deletions DEPLOY.md
@@ -0,0 +1,98 @@
# Deploying the hosted report

Two deployables:

- **API** (FastAPI engine) → a container host. This guide uses **Fly.io**
(`Dockerfile` + `fly.toml` are ready); Railway works from the same Dockerfile.
- **Frontend** (`web/`, Next.js) → **Vercel**.

Optional but recommended for production:

- **Upstash Redis** — shared cache + per-IP throttle across instances.
- A **GitHub token** (PAT or GitHub App installation token) so the API runs on
the 5 000 req/hr authenticated limit instead of 60 req/hr.

---

## 1. API → Fly.io

```bash
# One-time
fly launch --no-deploy # or `fly apps create ghra-report-api`
fly volumes create ghra_data --region iad --size 1 # persists the waitlist DB

# Secrets (never commit these; set them on Fly)
fly secrets set GHRA_GITHUB_TOKEN=ghp_xxx
fly secrets set GHRA_CORS_ORIGINS=https://your-frontend.vercel.app
# If using Upstash (see §3):
fly secrets set GHRA_REDIS_URL=rediss://default:xxx@xxx.upstash.io:6379

fly deploy
```

`fly.toml` already wires the health check (`GET /api/health`), the `/data`
volume mount, and the non-secret config (`GHRA_REPORT_TTL_SECONDS`,
`GHRA_RATE_LIMIT`, `GHRA_RATE_WINDOW_SECONDS`, `GHRA_WAITLIST_DB=/data/waitlist.db`).

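The container does not trust forwarded headers by default, because they are
spoofable (see the `Dockerfile`). Behind Fly's proxy, opt in with
`GHRA_TRUST_FORWARDED_FOR` and a platform-specific uvicorn
`--forwarded-allow-ips` override so the per-IP throttle keys on the real client
address.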

Verify: `curl https://ghra-report-api.fly.dev/api/health` →
`{"status":"ok","github_token":true}`.
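
The same check can be scripted, for example as a post-deploy smoke test. The sketch below assumes only the response shape shown above; `health_ok` is a hypothetical helper name and is not part of the repo, and the idea that `github_token` is `false` when no token is configured is an assumption.

```python
import json


def health_ok(body: str, *, require_token: bool = True) -> bool:
    """Return True if a /api/health response body reports a healthy API."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict) or payload.get("status") != "ok":
        return False
    # Assumption: github_token is false when GHRA_GITHUB_TOKEN was never set.
    return bool(payload.get("github_token")) or not require_token
```

Pass the body returned by the `curl` call above; `require_token=False` accepts a healthy API that is running on the unauthenticated rate limit.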

---

## 2. Frontend → Vercel

```bash
cd web
vercel link
vercel env add NEXT_PUBLIC_API_BASE production # → https://ghra-report-api.fly.dev
vercel --prod
```

Set the project **Root Directory** to `web/` in Vercel (the repo root is the
Python engine). After the frontend URL is known, set it as `GHRA_CORS_ORIGINS`
on the API (§1) so the browser's cross-origin calls are allowed.

> Vercel commit-author gotcha: if `vercel --prod` is blocked on the commit
> author, deploy from a git-free copy of `web/` and `vercel alias set`.

---

## 3. Upstash Redis (production cache + throttle)

Without `GHRA_REDIS_URL` the API uses an in-process store — correct, but
per-instance (cache and throttle don't share across machines). For more than one
instance, create an Upstash Redis database and set its `rediss://` URL as
`GHRA_REDIS_URL` (§1). The `hosting` extra (`redis`) is already installed in the
image. Any Redis server version works (the throttle uses plain `EXPIRE`).
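
For orientation, a fixed-window limiter built on `INCR` plus `EXPIRE` might look like the sketch below. It illustrates the general technique only; the repo's throttle is not shown in this diff and may differ. `FixedWindowThrottle`, `FakeRedis`, and the `throttle:<ip>` key format are hypothetical, and `FakeRedis` stands in for a real client so the example runs without a server.

```python
import time


class FakeRedis:
    """In-memory stand-in that mimics only INCR and EXPIRE (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._values: dict[str, int] = {}
        self._deadlines: dict[str, float] = {}

    def _evict(self, key: str) -> None:
        # Drop the key once its deadline has passed, as Redis would.
        deadline = self._deadlines.get(key)
        if deadline is not None and self._clock() >= deadline:
            self._values.pop(key, None)
            self._deadlines.pop(key, None)

    def incr(self, key: str) -> int:
        self._evict(key)
        self._values[key] = self._values.get(key, 0) + 1
        return self._values[key]

    def expire(self, key: str, seconds: int) -> bool:
        self._evict(key)
        if key not in self._values:
            return False
        self._deadlines[key] = self._clock() + seconds
        return True


class FixedWindowThrottle:
    """Allow at most `limit` requests per client per `window_seconds`."""

    def __init__(self, store, limit: int = 20, window_seconds: int = 3600):
        self.store = store
        self.limit = limit
        self.window_seconds = window_seconds

    def allow(self, client_ip: str) -> bool:
        key = f"throttle:{client_ip}"  # hypothetical key format
        count = self.store.incr(key)
        if count == 1:
            # First hit in this window: start the countdown.
            self.store.expire(key, self.window_seconds)
        return count <= self.limit
```

The defaults mirror `GHRA_RATE_LIMIT` and `GHRA_RATE_WINDOW_SECONDS`. A `redis.Redis` client exposes `incr` and `expire` with compatible signatures, so it could be passed as `store`; note that the two calls are not atomic, so a crash between them would leave a counter without an expiry.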

---

## 4. Environment reference

| Variable | Where | Default | Purpose |
| -------------------------- | --------- | ------------------ | -------------------------------------------------- |
| `GHRA_GITHUB_TOKEN` | API | _(none)_ | Server token → 5 000 req/hr + GraphQL repo lists. |
| `GHRA_CORS_ORIGINS` | API | localhost:3000 | Comma-separated allowed browser origins. |
| `GHRA_REDIS_URL` | API | _(in-memory)_ | Upstash/Redis URL for shared cache + throttle. |
| `GHRA_REPORT_TTL_SECONDS` | API | `3600` | Report cache TTL. |
| `GHRA_RATE_LIMIT` | API | `20` | Requests per window per IP. |
| `GHRA_RATE_WINDOW_SECONDS` | API | `3600` | Throttle window. |
| `GHRA_WAITLIST_DB` | API | `<output>/waitlist.db` | SQLite path (point at the mounted volume). |
| `GHRA_TRUST_FORWARDED_FOR` | API | off | Only if not using uvicorn `--forwarded-allow-ips`. |
| `NEXT_PUBLIC_API_BASE` | Frontend | localhost:8080 | API base URL the browser calls. |
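
As a reading aid, the API-side variables and their defaults could be loaded roughly as follows. This is a sketch under assumptions, not the repo's settings code: `ApiSettings` and `load_settings` are hypothetical names, and the `http://` scheme on the default CORS origin is assumed because the table lists only `localhost:3000`.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ApiSettings:
    github_token: str | None
    cors_origins: list[str]
    redis_url: str | None
    report_ttl_seconds: int
    rate_limit: int
    rate_window_seconds: int


def load_settings(env: Mapping[str, str] = os.environ) -> ApiSettings:
    """Read the GHRA_* variables, falling back to the documented defaults."""
    # Assumed scheme: the table gives the default origin as localhost:3000.
    origins = env.get("GHRA_CORS_ORIGINS", "http://localhost:3000")
    return ApiSettings(
        github_token=env.get("GHRA_GITHUB_TOKEN") or None,
        # Comma-separated list; surrounding whitespace is ignored.
        cors_origins=[o.strip() for o in origins.split(",") if o.strip()],
        redis_url=env.get("GHRA_REDIS_URL") or None,
        report_ttl_seconds=int(env.get("GHRA_REPORT_TTL_SECONDS", "3600")),
        rate_limit=int(env.get("GHRA_RATE_LIMIT", "20")),
        rate_window_seconds=int(env.get("GHRA_RATE_WINDOW_SECONDS", "3600")),
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check in a test.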

---

## 5. Notes & follow-ups

- **Waitlist durability:** SQLite on the mounted Fly volume survives restarts.
For multi-instance writes, migrate the waitlist to Postgres (Neon) — only
`SqliteWaitlistStore` needs a sibling implementation behind the existing
`WaitlistStore` protocol.
- **Local parity:** run the API with
`uv run --extra serve python -m uvicorn --factory src.serve.app:create_app --port 8080`
and the frontend with `pnpm dev` in `web/`.
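
To make the waitlist note concrete, the sketch below shows how a store protocol lets a second backend slot in beside the SQLite one. The real `WaitlistStore` protocol is not shown in this diff, so the method names (`add`, `count`), the table schema, and the class names `WaitlistStoreLike` and `SketchSqliteStore` are all hypothetical.

```python
from __future__ import annotations

import sqlite3
from typing import Protocol


class WaitlistStoreLike(Protocol):
    """Hypothetical method set; the repo's actual protocol may differ."""

    def add(self, email: str) -> bool: ...

    def count(self) -> int: ...


class SketchSqliteStore:
    """SQLite-backed store; pass a file path such as /data/waitlist.db."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS waitlist (email TEXT PRIMARY KEY)"
        )

    def add(self, email: str) -> bool:
        """Insert an email; return False if it was already present."""
        with self._conn:  # commits on success, rolls back on error
            cur = self._conn.execute(
                "INSERT OR IGNORE INTO waitlist (email) VALUES (?)",
                (email.strip().lower(),),
            )
        return cur.rowcount == 1

    def count(self) -> int:
        return self._conn.execute("SELECT COUNT(*) FROM waitlist").fetchone()[0]


def signup(store: WaitlistStoreLike, email: str) -> bool:
    # Callers depend only on the protocol, so a Postgres-backed class with
    # the same two methods could be swapped in without touching this code.
    return store.add(email)
```

Because `signup` is typed against the protocol rather than a concrete class, a Postgres sibling only needs to provide the same methods.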
26 changes: 26 additions & 0 deletions Dockerfile
@@ -0,0 +1,26 @@
# API server image for the hosted clone-free report (FastAPI engine).
# The Next.js frontend deploys separately to Vercel — see DEPLOY.md.
FROM python:3.12-slim

# uv for fast, reproducible, lockfile-pinned installs.
COPY --from=ghcr.io/astral-sh/uv:0.5 /uv /bin/uv

WORKDIR /app
ENV UV_COMPILE_BYTECODE=1 \
    UV_LINK_MODE=copy \
    PYTHONUNBUFFERED=1

# Install dependencies only (the app runs from the source tree via `src.*`
# imports, so the project itself isn't packaged). Cached unless deps change.
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-install-project --no-dev --extra serve --extra hosting

COPY src ./src

EXPOSE 8080
# Do not trust spoofable forwarded headers by default. Deployments behind a
# known proxy can opt in with GHRA_TRUST_FORWARDED_FOR and a platform-specific
# Uvicorn forwarded-allow-ips override.
CMD ["uv", "run", "--no-sync", "python", "-m", "uvicorn", \
     "--factory", "src.serve.app:create_app", \
     "--host", "0.0.0.0", "--port", "8080"]
39 changes: 39 additions & 0 deletions fly.toml
@@ -0,0 +1,39 @@
# Fly.io config for the hosted report API. Rename `app` to your Fly app.
# Secrets (GHRA_GITHUB_TOKEN, GHRA_REDIS_URL, GHRA_CORS_ORIGINS) are set with
# `fly secrets set …`, NOT here. See DEPLOY.md.
app = "ghra-report-api"
primary_region = "iad"

[build]
dockerfile = "Dockerfile"

[env]
GHRA_REPORT_TTL_SECONDS = "3600"
GHRA_RATE_LIMIT = "20"
GHRA_RATE_WINDOW_SECONDS = "3600"
# Persisted on the mounted volume below (survives restarts/redeploys).
GHRA_WAITLIST_DB = "/data/waitlist.db"

[http_service]
internal_port = 8080
force_https = true
auto_stop_machines = "suspend"
auto_start_machines = true
min_machines_running = 0

[[http_service.checks]]
interval = "30s"
timeout = "5s"
grace_period = "10s"
method = "GET"
path = "/api/health"

# Persistent volume for the SQLite waitlist. Create it once with:
# fly volumes create ghra_data --region iad --size 1
[mounts]
source = "ghra_data"
destination = "/data"

[[vm]]
size = "shared-cpu-1x"
memory = "512mb"
5 changes: 5 additions & 0 deletions pyproject.toml
@@ -59,6 +59,11 @@ serve = [
    "jinja2>=3.1",
    "python-multipart>=0.0.9",
]
hosting = [
    # Optional: a shared Redis/Upstash backend for the hosted report cache and
    # per-IP throttle. Without it, the in-memory backend is used (single-instance).
    "redis>=5.0",
]
build = [
    "shiv>=1.0",
    "build>=1.0",
176 changes: 176 additions & 0 deletions src/api_checkout.py
@@ -0,0 +1,176 @@
"""Materialize a sparse, API-sourced repo skeleton for clone-free scoring.

The audit engine's analyzers read a repo from the local filesystem. To score an
arbitrary public GitHub user *without* cloning every repo (the hosted, multi-tenant
path), this module reconstructs a sparse on-disk skeleton from the GitHub API:

* one Git Trees API call yields every path → directories are created and files are
  ``touch``-ed so presence-based analyzers (structure, testing, CI, docs, build)
  see the real shape of the repo;
* a bounded set of high-signal files (README, dependency manifests) is fetched via
  the Contents API and written with real content, so content-based analyzers
  (README quality, dependency counts, test-framework detection) still work.

The existing analyzers run against this skeleton unmodified. ``materialize_api_workspace``
mirrors ``cloner.clone_workspace`` exactly (context manager yielding ``{name: Path}``),
so it is a drop-in replacement for the clone step.

Materialization is sequential on purpose: it keeps API access well under GitHub's
secondary rate limits (concurrent-request and points-per-minute caps) that a
parallel burst across many repos would trip.
"""

from __future__ import annotations

import logging
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import TYPE_CHECKING, Callable, Generator

from src.models import RepoMetadata

if TYPE_CHECKING:
    from src.github_client import GitHubClient

logger = logging.getLogger(__name__)

DEFAULT_MAX_FILES = 5000
DEFAULT_MAX_CONTENT_FILES = 20

# Files whose *content* (not just presence) carries real scoring signal. Matched
# case-insensitively by basename; anything starting with ``readme`` also qualifies.
CONTENT_FILE_NAMES = {
    "package.json",
    "pyproject.toml",
    "requirements.txt",
    "setup.py",
    "setup.cfg",
    "pipfile",
    "cargo.toml",
    "go.mod",
    "pom.xml",
    "build.gradle",
    "gemfile",
    "composer.json",
}


def _is_content_file(path: str) -> bool:
    base = path.rsplit("/", 1)[-1].lower()
    return base.startswith("readme") or base in CONTENT_FILE_NAMES


def _safe_target(dest: Path, rel: str) -> Path | None:
    """Resolve ``rel`` under ``dest``, rejecting traversal/absolute escapes.

    Tree paths come from arbitrary remote repos, so a malicious entry like
    ``../../etc/passwd`` or ``/abs/evil`` must never resolve outside ``dest``.
    """
    rel = rel.strip()
    if not rel or rel in (".", "..") or "\x00" in rel:
        return None
    candidate = (dest / rel).resolve()
    dest_resolved = dest.resolve()
    if candidate == dest_resolved:
        return None
    if dest_resolved not in candidate.parents:
        return None
    return candidate


def materialize_api_checkout(
    metadata: RepoMetadata,
    client: "GitHubClient",
    dest: Path,
    *,
    max_files: int = DEFAULT_MAX_FILES,
    max_content_files: int = DEFAULT_MAX_CONTENT_FILES,
) -> Path:
    """Build a sparse skeleton of one repo under ``dest`` from the GitHub API.

    Returns ``dest``. If the repo tree is expectedly unavailable (empty repo,
    missing ref, private repo, gone), ``dest`` is created empty so downstream
    analyzers score it as a near-empty repo rather than crashing. Transient,
    rate-limit, and server errors propagate to the API boundary.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)

    owner, _, repo = metadata.full_name.partition("/")
    if not owner or not repo:
        logger.warning(
            "Cannot materialize %r: full_name is not 'owner/repo'",
            metadata.full_name,
        )
        return dest

    tree = client.get_repo_tree(owner, repo, metadata.default_branch)
    if not tree.get("available"):
        return dest
    if tree.get("truncated"):
        logger.warning(
            "Tree truncated for %s — skeleton is incomplete", metadata.full_name
        )

    for rel in tree.get("dirs", []):
        target = _safe_target(dest, rel)
        if target is not None:
            target.mkdir(parents=True, exist_ok=True)

    content_budget = max_content_files
    for rel in tree.get("files", [])[:max_files]:
        target = _safe_target(dest, rel)
        if target is None:
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        text = ""
        if content_budget > 0 and _is_content_file(rel):
            fetched = client.get_file_content(
                owner, repo, rel, ref=metadata.default_branch
            )
            if fetched is not None:
                text = fetched
                content_budget -= 1
        target.write_text(text, encoding="utf-8")

    return dest


@contextmanager
def materialize_api_workspace(
    repos: list[RepoMetadata],
    client: "GitHubClient",
    *,
    on_progress: Callable[[int, int, str], None] | None = None,
    on_error: Callable[[str, str], None] | None = None,
    max_files: int = DEFAULT_MAX_FILES,
    max_content_files: int = DEFAULT_MAX_CONTENT_FILES,
) -> Generator[dict[str, Path], None, None]:
    """Materialize API skeletons for many repos into a session-unique temp dir.

    Drop-in replacement for ``cloner.clone_workspace``: yields a dict mapping
    repo name → skeleton path. A repo that fails to materialize is skipped with
    a warning so one bad repo never aborts a portfolio scan.
    """
    with tempfile.TemporaryDirectory(prefix="audit-api-") as tmpdir:
        root = Path(tmpdir)
        workspace: dict[str, Path] = {}
        total = len(repos)
        for index, repo in enumerate(repos, 1):
            if on_progress:
                on_progress(index, total, repo.name)
            try:
                dest = materialize_api_checkout(
                    repo,
                    client,
                    root / repo.name,
                    max_files=max_files,
                    max_content_files=max_content_files,
                )
                workspace[repo.name] = dest
            except Exception as exc:  # noqa: BLE001 — one bad repo must not abort the scan
                logger.warning("API checkout failed for %s: %s", repo.name, exc)
                if on_error:
                    on_error(repo.name, str(exc))
        yield workspace
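
The path guard in `_safe_target` is the piece most worth exercising in isolation. The sketch below restates its logic as `safe_target` so the snippet runs on its own, then checks a few tree paths against a throwaway root; `classify` and the sample paths are illustrative and not part of the module. The results noted in the usage remark assume a POSIX filesystem.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def safe_target(dest: Path, rel: str) -> Path | None:
    # Same checks as _safe_target in src/api_checkout.py, restated here so
    # the example is self-contained.
    rel = rel.strip()
    if not rel or rel in (".", "..") or "\x00" in rel:
        return None
    candidate = (dest / rel).resolve()
    dest_resolved = dest.resolve()
    if candidate == dest_resolved or dest_resolved not in candidate.parents:
        return None
    return candidate


def classify(paths: list[str]) -> dict[str, bool]:
    """Map each tree path to True if it may be written under a temp root."""
    with tempfile.TemporaryDirectory(prefix="audit-api-") as tmp:
        root = Path(tmp)
        return {p: safe_target(root, p) is not None for p in paths}


result = classify(
    ["src/app.py", "../../etc/passwd", "/abs/evil", "a/../b.txt", "a/.."]
)
```

Here `src/app.py` and `a/../b.txt` are accepted because they resolve to locations strictly inside the root, while the traversal path, the absolute path, and `a/..` (which resolves to the root itself) are rejected.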