Merged
210 changes: 210 additions & 0 deletions specs/SPEC-14- Cloud Git Versioning & GitHub Backup.md
@@ -0,0 +1,210 @@
---
title: 'SPEC-14: Cloud Git Versioning & GitHub Backup'
type: spec
permalink: specs/spec-14-cloud-git-versioning
tags:
- git
- github
- backup
- versioning
- cloud
related:
- specs/spec-9-multi-project-bisync
- specs/spec-9-follow-ups-conflict-sync-and-observability
status: deferred
---

# SPEC-14: Cloud Git Versioning & GitHub Backup

**Status: DEFERRED** - Postponed until multi-user/teams feature development. v1 uses S3 versioning (SPEC-9.1) instead.

## Why Deferred

**Original goals can be met with simpler solutions:**
- Version history → **S3 bucket versioning** (automatic, zero config)
- Offsite backup → **Tigris global replication** (built-in)
- Restore capability → **S3 version restore** (`bm cloud restore --version-id`)
- Collaboration → **Deferred to teams/multi-user feature** (not v1 requirement)

**Complexity vs value trade-off:**
- Git integration adds: committer service, puller service, webhooks, LFS, merge conflicts
- Risk: Loop detection between Git ↔ rclone bisync ↔ local edits
- S3 versioning delivers an estimated 80% of the value at about 5% of the complexity

**When to revisit:**
- Teams/multi-user features (PR-based collaboration workflow)
- User requests for commit messages and branch-based workflows
- Need for fine-grained audit trail beyond S3 object metadata

---

## Original Specification (for reference)

## Why
Early access users want **transparent version history**, easy **offsite backup**, and a familiar **restore/branching** workflow. Git/GitHub integration would provide:
- Auditable history of every change (who/when/why)
- Branches/PRs for review and collaboration
- Offsite private backup under the user's control
- Escape hatch: users can always `git clone` their knowledge base

**Note:** These goals are now addressed via S3 versioning (SPEC-9.1) for the single-user use case.

## Goals
- **Transparent**: Users keep using Basic Memory; Git runs behind the scenes.
- **Private**: Push to a **private GitHub repo** owned by the user (or by the tenant org).
- **Reliable**: No data loss, deterministic mapping of filesystem ↔ Git.
- **Composable**: Plays nicely with SPEC‑9 bisync and upcoming conflict features (SPEC‑9 Follow‑Ups).

**Non‑Goals (for v1):**
- Fine‑grained per‑file encryption in Git history (can be layered later).
- Large media optimization beyond Git LFS defaults.

## User Stories
1. *As a user*, I connect my GitHub and choose a private backup repo.
2. *As a user*, every change I make in cloud (or via bisync) is **committed** and **pushed** automatically.
3. *As a user*, I can **restore** a file/folder/project to a prior version.
4. *As a power user*, I can **git pull/push** directly to collaborate outside the app.
5. *As an admin*, I can enforce repo ownership (tenant org) and least‑privilege scopes.

## Scope
- **In scope:** Full repo backup of `/app/data/` (all projects) with optional selective subpaths.
- **Out of scope (v1):** Partial shallow mirrors; encrypted Git; cross‑provider SCM (GitLab/Bitbucket).

## Architecture
### Topology
- **Authoritative working tree**: `/app/data/` (bucket mount) remains the source of truth (SPEC‑9).
- **Bare repo** lives alongside: `/app/git/${tenant}/knowledge.git` (server‑side).
- **Mirror remote**: `github.com/<owner>/<repo>.git` (private).

```mermaid
flowchart LR
A[/Users & Agents/] -->|writes/edits| B["/app/data/"]
B -->|file events| C[Committer Service]
C -->|git commit| D[(Bare Repo)]
D -->|push| E[(GitHub Private Repo)]
E -->|"webhook (push)"| F[Puller Service]
F -->|git pull/merge| D
D -->|checkout/merge| B
```

### Services
- **Committer Service** (daemon):
  - Watches `/app/data/` for changes (inotify/poll)
  - Batches changes (debounce, e.g. 2–5s)
  - Writes `.bmmeta` (if present) into commit message trailer (see Follow‑Ups)
  - `git add -A && git commit -m "chore(sync): <summary>" -m "BM-Meta: <json>"`
  - Periodic `git push` to GitHub mirror (configurable interval)
- **Puller Service** (webhook target):
  - Receives GitHub webhook (push) → `git fetch`
  - **Fast‑forward** merges to `main` only; reject non‑FF unless policy allows
  - Applies changes back to `/app/data/` via clean checkout
  - Emits sync events for Basic Memory indexers
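
The committer's batching step can be sketched as follows. This is a minimal illustration, not the planned implementation: the names `ChangeBatcher` and `build_commit_message` are hypothetical, the summary wording is invented, and timestamps are passed in explicitly so the logic can be exercised without a real file watcher.

```python
from __future__ import annotations

import json
from typing import Optional


class ChangeBatcher:
    """Collects changed paths and releases them after a quiet period (debounce)."""

    def __init__(self, quiet_seconds: float = 2.0) -> None:
        self.quiet_seconds = quiet_seconds
        self._pending: set[str] = set()
        self._last_event: Optional[float] = None

    def record(self, path: str, now: float) -> None:
        """Register a file event observed at time `now` (seconds)."""
        self._pending.add(path)
        self._last_event = now

    def flush_if_quiet(self, now: float) -> list[str]:
        """Return the sorted batch once no event has arrived for quiet_seconds."""
        if self._last_event is None or now - self._last_event < self.quiet_seconds:
            return []
        batch = sorted(self._pending)
        self._pending.clear()
        self._last_event = None
        return batch


def build_commit_message(paths: list[str], meta: Optional[dict] = None) -> str:
    """Build the commit message; append a BM-Meta trailer when meta is given.

    The summary text is an assumption; the spec only fixes the
    "chore(sync): <summary>" prefix and the "BM-Meta: <json>" trailer.
    """
    summary = f"chore(sync): {len(paths)} file(s) changed"
    if meta is None:
        return summary
    trailer = "BM-Meta: " + json.dumps(meta, sort_keys=True, separators=(",", ":"))
    # A blank line separates the subject from the trailer paragraph.
    return f"{summary}\n\n{trailer}"
```

A daemon loop would call `record` from the watcher callback and poll `flush_if_quiet` on a timer, committing whenever a non-empty batch comes back.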

### Auth & Security
- **GitHub App** (recommended): minimal scopes: `contents:read/write`, `metadata:read`, webhook.
- Tenant‑scoped installation; repo created in user account or tenant org.
- Tokens stored in KMS/secret manager; rotated automatically.
- Optional policy: allow only **FF merges** on `main`; non‑FF requires PR.

### Repo Layout
- **Monorepo** (default): one repo per tenant mirrors `/app/data/` with subfolders per project.
- Optional multi‑repo mode (later): one repo per project.

### File Handling
- Honor `.gitignore` generated from `.bmignore.rclone` + BM defaults (cache, temp, state).
- **Git LFS** for large binaries (images, media) — auto track by extension/size threshold.
- Normalize newline + Unicode (aligns with Follow‑Ups).
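
The LFS and normalization rules above could look roughly like this. It is a sketch under assumptions: the extension list is invented for illustration, the 10485760-byte threshold is taken from the sample config, and NFC plus LF line endings is one plausible reading of "normalize newline + Unicode" (the spec does not name a form).

```python
from __future__ import annotations

import unicodedata
from pathlib import PurePosixPath

# Hypothetical default extension list; the spec does not enumerate one.
LFS_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".mp4", ".pdf"}
# 10 MiB, matching "sizeThreshold": 10485760 in the sample config.
LFS_SIZE_THRESHOLD = 10 * 1024 * 1024


def should_use_lfs(path: str, size_bytes: int, threshold: int = LFS_SIZE_THRESHOLD) -> bool:
    """Track a file with LFS if its extension is listed or it reaches the threshold."""
    suffix = PurePosixPath(path).suffix.lower()
    return suffix in LFS_EXTENSIONS or size_bytes >= threshold


def normalize_text(text: str) -> str:
    """Convert CRLF/CR line endings to LF, then apply Unicode NFC (assumed form)."""
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return unicodedata.normalize("NFC", unified)
```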

### Conflict Model
- **Primary concurrency**: SPEC‑9 Follow‑Ups (`.bmmeta`, conflict copies) stays the first line of defense.
- **Git merges** are a **secondary** mechanism:
  - Server only auto‑merges **text** conflicts when trivial (FF or clean 3‑way).
  - Otherwise, create `name (conflict from <branch>, <ts>).md` and surface via events.
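
The two branches of this model can be sketched in a few lines. The fast-forward test is shown over an in-memory parent map so it runs without a repository; against a real repo, `git merge-base --is-ancestor <current> <incoming>` answers the same question. The function names, the timestamp format, and the replacement of `/` in branch names are assumptions made for illustration.

```python
from __future__ import annotations

from datetime import datetime, timezone
from pathlib import PurePosixPath


def is_fast_forward(parents: dict[str, list[str]], current: str, incoming: str) -> bool:
    """True if `current` is an ancestor of (or equal to) `incoming`."""
    stack, seen = [incoming], set()
    while stack:
        commit = stack.pop()
        if commit == current:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return False


def conflict_copy_path(path: str, branch: str, ts: datetime) -> str:
    """Build `name (conflict from <branch>, <ts>).md` next to the original file."""
    p = PurePosixPath(path)
    # Assumed timestamp format: UTC, with no colons so the name is portable.
    stamp = ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
    safe_branch = branch.replace("/", "-")  # a file name cannot contain "/"
    return str(p.with_name(f"{p.stem} (conflict from {safe_branch}, {stamp}){p.suffix}"))
```

When `is_fast_forward` returns `False` and a 3-way merge does not apply cleanly, the incoming version would be written to the path returned by `conflict_copy_path` rather than overwriting the existing file.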

### Data Flow vs Bisync
- Bisync (rclone) continues between local sync dir ↔ bucket.
- Git sits **cloud‑side** between bucket and GitHub.
- On **pull** from GitHub → files written to `/app/data/` → picked up by indexers & eventually by bisync back to users.

## CLI & UX
New commands (cloud mode):
- `bm cloud git connect` — Launch GitHub App installation; create private repo; store installation id.
- `bm cloud git status` — Show connected repo, last push time, last webhook delivery, pending commits.
- `bm cloud git push` — Manual push (rarely needed).
- `bm cloud git pull` — Manual pull/FF (admin only by default).
- `bm cloud snapshot -m "message"` — Create a tagged point‑in‑time snapshot (git tag).
- `bm restore <path> --to <commit|tag>` — Restore file/folder/project to prior version.

Settings:
- `bm config set git.autoPushInterval=5s`
- `bm config set git.lfs.sizeThreshold=10MB`
- `bm config set git.allowNonFF=false`
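
The string values above have to be converted before use. A minimal sketch of that parsing follows, assuming binary units (so `10MB` equals the `10485760` bytes shown in the sample config) and a small set of duration suffixes; `parse_size` and `parse_duration` are hypothetical helpers, not existing `bm` functions.

```python
from __future__ import annotations

import re

# Binary multiples are assumed because the sample config writes 10MB as 10485760.
_SIZE_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}
_DURATION_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def parse_size(value: str) -> int:
    """Parse strings such as '10MB' into a byte count."""
    match = re.fullmatch(r"(\d+)\s*(B|KB|MB|GB)", value.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {value!r}")
    return int(match.group(1)) * _SIZE_UNITS[match.group(2).upper()]


def parse_duration(value: str) -> float:
    """Parse strings such as '5s' or '500ms' into seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(ms|s|m)", value.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {value!r}")
    return float(match.group(1)) * _DURATION_UNITS[match.group(2)]
```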

## Migration & Backfill
- On connect, if the repo is empty: initial commit of the entire `/app/data/`.
- If the repo has content: require a **one‑time import** path (clone to staging, reconcile, choose direction).

## Edge Cases
- Massive deletes: gated by SPEC‑9 `max_delete` **and** Git pre‑push hook checks.
- Case changes and rename detection: rely on git rename heuristics + Follow‑Ups move hints.
- Secrets: default ignore common secret patterns; allow custom deny list.
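
The massive-delete gate could be checked before pushing along these lines. The sketch counts `D` entries in the output of `git diff --name-status` and assumes `max_delete` is a maximum file count; how SPEC‑9 actually defines that limit is not restated here, and both function names are hypothetical.

```python
from __future__ import annotations


def count_deletions(name_status: str) -> int:
    """Count deleted files in `git diff --name-status` output (status 'D')."""
    return sum(
        1 for line in name_status.splitlines() if line.split("\t", 1)[0] == "D"
    )


def check_max_delete(name_status: str, max_delete: int) -> None:
    """Raise if the pending push deletes more files than max_delete allows."""
    deleted = count_deletions(name_status)
    if deleted > max_delete:
        raise RuntimeError(
            f"refusing to push: {deleted} deletions exceed max_delete={max_delete}"
        )
```

A pre-push hook would feed it the diff between the remote tip and the local tip and exit non-zero on `RuntimeError`.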

## Telemetry & Observability
- Emit `git_commit`, `git_push`, `git_pull`, `git_conflict` events with correlation IDs.
- `bm sync --report` extended with Git stats (commit count, delta bytes, push latency).
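
One possible shape for these events is sketched below. Only the four event names come from this spec; the field names (`event`, `correlation_id`, `ts`) and the helper functions are assumptions for illustration.

```python
from __future__ import annotations

import json
import uuid
from datetime import datetime, timezone
from typing import Any, Optional

EVENT_TYPES = frozenset({"git_commit", "git_push", "git_pull", "git_conflict"})


def make_event(event_type: str, correlation_id: Optional[str] = None, **fields: Any) -> dict[str, Any]:
    """Build an event dict; reuse correlation_id to link related events."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    return {
        "event": event_type,
        "correlation_id": correlation_id or uuid.uuid4().hex,
        "ts": datetime.now(timezone.utc).isoformat(),
        **fields,
    }


def to_json_line(event: dict[str, Any]) -> str:
    """Serialize an event as one JSON line for a log or event stream."""
    return json.dumps(event, sort_keys=True)
```

Passing the commit event's `correlation_id` into the later push event lets a report join the two and derive push latency.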

## Phased Plan
### Phase 0 — Prototype (1 sprint)
- Server: bare repo init + simple committer (batch every 10s) + manual GitHub token.
- CLI: `bm cloud git connect --token <PAT>` (dev‑only)
- Success: edits in `/app/data/` appear in GitHub within 30s.

### Phase 1 — GitHub App & Webhooks (1–2 sprints)
- Switch to GitHub App installs; create private repo; store installation id.
- Committer hardened (debounce 2–5s, backoff, retries).
- Puller service with webhook → FF merge → checkout to `/app/data/`.
- LFS auto‑track + `.gitignore` generation.
- CLI surfaces status + logs.

### Phase 2 — Restore & Snapshots (1 sprint)
- `bm restore` for file/folder/project with dry‑run.
- `bm cloud snapshot` tags + list/inspect.
- Policy: PR‑only non‑FF, admin override.

### Phase 3 — Selective & Multi‑Repo (nice‑to‑have)
- Include/exclude projects; optional per‑project repos.
- Advanced policies (branch protections, required reviews).

## Acceptance Criteria
- Changes to `/app/data/` are committed and pushed automatically within a configurable interval (default ≤5s).
- GitHub webhook pull results in updated files in `/app/data/` (FF‑only by default).
- LFS configured and functioning; large files don't bloat history.
- `bm cloud git status` shows connected repo and last push/pull times.
- `bm restore` restores a file/folder to a prior commit with a clear audit trail.
- End‑to‑end works alongside SPEC‑9 bisync without loops or data loss.

## Risks & Mitigations
- **Loop risk (Git ↔ Bisync)**: Writes to `/app/data/` → bisync → local → user edits → back again. *Mitigation*: Debounce, commit squashing, idempotent `.bmmeta` versioning, and watch exclusion windows during pull.
- **Repo bloat**: Lots of binary churn. *Mitigation*: default LFS, size threshold, optional media‑only repo later.
- **Security**: Token leakage. *Mitigation*: GitHub App with short‑lived tokens, KMS storage, scoped permissions.
- **Merge complexity**: Non‑trivial conflicts. *Mitigation*: prefer FF; otherwise conflict copies + events; require PR for non‑FF.
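
The "watch exclusion windows during pull" mitigation could be sketched as follows: the puller marks the paths it is about to write, and the committer's watcher drops events for those paths until the window expires. The class name and the 5-second default are assumptions. A time window alone can also hide a genuine edit made inside it, which is presumably why the mitigation pairs it with `.bmmeta` versioning.

```python
from __future__ import annotations


class WatchExclusion:
    """Suppresses watcher events for paths recently written by the puller."""

    def __init__(self, window_seconds: float = 5.0) -> None:
        self.window_seconds = window_seconds
        self._expires: dict[str, float] = {}

    def mark_written(self, paths: list[str], now: float) -> None:
        """Called by the puller just before it writes `paths` to the working tree."""
        for path in paths:
            self._expires[path] = now + self.window_seconds

    def should_ignore(self, path: str, now: float) -> bool:
        """True while `path` is inside its exclusion window."""
        expiry = self._expires.get(path)
        if expiry is None:
            return False
        if now >= expiry:
            del self._expires[path]  # window elapsed; resume normal watching
            return False
        return True
```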

## Open Questions
- Do we default to **monorepo** per tenant, or offer project‑per‑repo at connect time?
- Should `restore` write to a branch and open a PR, or directly modify `main`?
- How do we expose Git history in UI (timeline view) without users dropping to CLI?

## Appendix: Sample Config
```json
{
  "git": {
    "enabled": true,
    "repo": "https://github.com/<owner>/<repo>.git",
    "autoPushInterval": "5s",
    "allowNonFF": false,
    "lfs": { "sizeThreshold": 10485760 }
  }
}
```