Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
223 changes: 223 additions & 0 deletions docs/firecracker-snapshots.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,223 @@
# RFC: Firecracker snapshots for instant bazel-diff starts

**Status:** Draft
**Audience:** bazel-diff maintainers / contributors
**Scope decided:** CLI hooks in the Kotlin tool + a Go orchestration tool, capturing a
*full warm Bazel server*, targeting *self-hosted CI* (we control the host kernel and CPU model).

---

## 1. Motivation

bazel-diff's own JVM CLI starts in well under a second. That is not where the time goes. The
canonical workflow ([`bazel-diff-example.sh`](../bazel-diff-example.sh)) is:

1. `bazel run :bazel-diff` β€” build the tool
2. `git checkout <prev>` β†’ `generate-hashes` β†’ runs `bazel query deps(//...:all-targets)`
3. `git checkout <final>` β†’ `generate-hashes` β†’ another `bazel query`
4. `get-impacted-targets` β€” cheap, pure JSON diff

The cost is the `bazel query` in steps 2/3, which forces:

- **Bazel server startup** (JVM warmup).
- **External-repo / bzlmod resolution + repository-cache fetch.** `BazelQueryService` even shells
out to `bazel mod dump_repo_mapping` and `bazel mod show_repo`
([`BazelQueryService.kt`](../cli/src/main/kotlin/com/bazel_diff/bazel/BazelQueryService.kt)).
- **Full Skyframe graph load + package analysis** for `deps(//...)`.

On a large monorepo this is minutes per cold start. A Firecracker microVM snapshot lets us capture
that warm state once and restore it in ~sub-second, so the PR-time path re-analyzes only the changed
packages.

**Goal:** instant starts of bazel-diff by restoring a microVM whose Bazel server already has the
build graph loaded and external repos fetched.

---

## 2. Key architectural split

The Firecracker record/restore itself is a **host-level concern** β€” it talks to the Firecracker REST
API over a unix socket (optionally via `jailer`). It is *not* something the Kotlin CLI does. The work
therefore splits into two pieces:

| Piece | Where | Responsibility |
| --- | --- | --- |
| **CLI hooks** | Kotlin (`cli/`) | Make snapshots deterministic and *safe*: warm-then-signal, emit a cache key, bake base hashes. |
| **Orchestration tool** | Go (`tools/firecracker/`) | Boot/warm/snapshot and restore/checkout/run the microVM via the Firecracker API. |

Consume needs **no new bazel-diff command** β€” it is the existing `generate-hashes` +
`get-impacted-targets` run against base hashes baked into the snapshot.

---

## 3. Lifecycle

### Record (per base SHA β€” on merge to master, or nightly)

```
host: build read-only rootfs ──► boot Firecracker microVM (TAP net for fetch)
(bazel + JDK + git + bazel-diff binary + workspace @ baseSHA)
VM: bazel-diff warmup ──► bazel query deps(//...) loads Skyframe + fetches externals
──► writes /snap/base_hashes.json + /snap/fingerprint.json
──► exits 0 = "safe to snapshot"
host: pause VM ──► snapshot {mem_file, vmstate} ──► freeze rootfs as backing image
store keyed by fingerprint + baseSHA
```

### Consume (per PR / target SHA β€” the hot path)

```
host: fingerprint(targetEnv) == snapshot.fingerprint? ── no ──► fall back to cold run
β”‚ yes
restore microVM (COW overlay on disk, UFFD lazy memory load) ~sub-second
VM: git checkout <target> ──► warm server does INCREMENTAL re-analysis of changed pkgs
bazel-diff generate-hashes (fast β€” server already warm)
bazel-diff get-impacted-targets -sh /snap/base_hashes.json -fh <new> -o <out>
host: extract impacted targets ──► discard overlay
```

---

## 4. New CLI surface

Both new subcommands slot into the existing picocli `subcommands` list in
[`BazelDiff.kt`](../cli/src/main/kotlin/com/bazel_diff/cli/BazelDiff.kt) alongside
`GenerateHashesCommand` and `GetImpactedTargetsCommand`.

### 4.1 `bazel-diff warmup`

The record-side entrypoint. Effectively `generate-hashes` for the base revision, plus:

- Writes base hashes to a known path (`--base-hashes`, default `/snap/base_hashes.json`).
- Writes the fingerprint file (see Β§5).
- Exits `0` **only** once the query has completed and the server is warm + quiesced. The host
watches for this clean exit as the "safe to snapshot" signal.

Implementation reuses `GenerateHashesCommand`'s plumbing; warmup is essentially generate-hashes with
metadata side-effects and a clear success contract.

### 4.2 `bazel-diff fingerprint`

Computes the snapshot **cache key** and writes it as JSON. Used both at record time (to tag the
snapshot) and at consume time (to validate a candidate snapshot before trusting it). See Β§5.

### 4.3 Consume

No new command. The orchestrator runs the existing `generate-hashes` for the target revision, then
`get-impacted-targets -sh /snap/base_hashes.json -fh <new>`.

---

## 5. Correctness β€” the cache key and the fail-safe

bazel-diff's core promise is *"an incorrect affected set is worse than none."* A restored snapshot
must produce **the same answer as a cold run**. Two layers of defense:

### 5.1 bazel-diff already re-hashes file content itself

`SourceFileHasher` reads and hashes source file contents independently of the Bazel server. So
*content* correctness does not depend on the warm server's incrementality β€” only the **graph
structure / rule attributes** returned by `bazel query` do, and Bazel's incremental analysis is the
trusted core there.

### 5.2 The fingerprint (cache key)

A snapshot is only safe to consume when the consuming environment matches the recording environment
on everything that could change the graph. The fingerprint is a hash over:

- **Bazel version** (already detected in `BazelQueryService.determineBazelVersion`).
- **`MODULE.bazel.lock`** (bzlmod resolution state).
- **`.bazelrc`** (and any imported rc files).
- **bazel-diff version** (`VersionProvider`).
- **The relevant flag set** β€” `--useCquery`, `cqueryCommandOptions`, `bazelCommandOptions`,
`startupOptions`, `--includeTargetType`, `--targetType`, fine-grained external-repo config, etc.
(anything that changes what `generate-hashes` queries or how it hashes).

**Fail-safe rule:** any fingerprint mismatch β†’ do **not** use the snapshot; fall back to a cold run.
A stale snapshot is never silently trusted.

### 5.3 CI canary

Recommended: a periodic CI job that runs the *snapshot-consumed* result against a *cold* result for
the same revision pair and asserts set equality. This builds and maintains empirical trust and catches
any Bazel incremental-analysis edge case (env vars, repository-rule re-trigger conditions, untracked
files) before it reaches users.

---

## 6. Firecracker specifics (self-hosted)

Controlling the host makes several normally-hard issues tractable:

- **CPU model pinning.** Snapshots only restore on a matching microarchitecture. Pin the CI instance
type or set a Firecracker CPU template. (This is exactly the constraint that the cloud-portability
option would have made painful β€” out of scope here.)
- **Disk.** Read-only backing rootfs + a per-restore copy-on-write overlay so each consumed VM is
isolated and disposable.
- **Memory.** Use diff snapshots + UFFD on-demand page loading; keep the mem file on local NVMe/tmpfs
for fast restore.
- **Clock + network.** Resync the guest clock on resume (a known snapshot gotcha) and re-attach the
TAP device. To make consume fully offline, pre-bake full git history into the rootfs so
`git checkout <target>` needs no network.
- **Isolation.** Run under `jailer` in CI.

---

## 7. Snapshot store layout

Keyed by `fingerprint + baseSHA`:

```
<store>/<fingerprint>/<baseSHA>/
mem_file # guest memory image (diff snapshot)
vmstate # Firecracker microVM state
rootfs.backing # frozen read-only disk image
base_hashes.json # produced by `bazel-diff warmup`
metadata.json # fingerprint, baseSHA, bazel version, created-at, bazel-diff version
```

Consume resolves a snapshot by: matching fingerprint, then choosing a `baseSHA` that is an ancestor
of the target SHA (git merge-base), preferring the nearest ancestor to minimize incremental
re-analysis.

---

## 8. Orchestration tool (Go, `tools/firecracker/`)

Go chosen for the official `firecracker-go-sdk`, clean API access, and a static binary for CI. UX
mirrors `bazel-diff-example.sh` as the familiar entrypoint.

```
bazel-diff-snap record --workspace <path> --base-sha <sha> --store <dir> [firecracker opts]
bazel-diff-snap consume --workspace <path> --target-sha <sha> --store <dir> --out <impacted.txt>
```

- `record`: build/prepare rootfs β†’ boot VM β†’ run `bazel-diff warmup` + `fingerprint` β†’ pause β†’
snapshot β†’ freeze rootfs β†’ write store entry.
- `consume`: compute `fingerprint` for target env β†’ resolve compatible snapshot (else cold fall-back)
β†’ restore (COW overlay, UFFD) β†’ `git checkout` β†’ `generate-hashes` β†’ `get-impacted-targets` β†’
extract β†’ tear down.

---

## 9. Phasing

1. **CLI hooks** β€” `fingerprint` + `warmup` subcommands, pure Kotlin, fully unit-testable, no VM
required. Lands value independently (the fingerprint is useful for any snapshot/caching scheme).
2. **Orchestration tool** β€” `tools/firecracker/` in Go: `record` / `consume`.
3. **Correctness canary + docs** β€” snapshot-vs-cold equality check in CI; README section.

---

## 10. Open questions

- **Quiescence detection.** How does `warmup` know the server is fully idle (no background Skyframe
work) before exit? Likely "query returned + process exited 0" is sufficient since the query is
synchronous, but worth validating.
- **Snapshot freshness policy.** How far back can a base SHA be before incremental re-analysis stops
being worth it vs. cold? Needs measurement; drives record cadence (every merge vs. nightly).
- **rootfs build pipeline.** Reuse an existing base image + inject workspace, or build per-record?
Affects record time and store size.
- **Flag-set canonicalization.** Exact list of flags that must enter the fingerprint vs. those that
are snapshot-neutral β€” needs an explicit, reviewed allow/deny list to avoid both false mismatches
(wasted cold runs) and false matches (incorrect results).
Loading