kagent-dev
diff --git a/.agents/skills/kagent b/.agents/skills/kagent
Lines changed: 1 addition & 0 deletions
diff --git a/.agents/skills/kagent-dev b/.agents/skills/kagent-dev
Lines changed: 1 addition & 0 deletions
diff --git a/.claude/skills/kagent-dev/SKILL.md b/.claude/skills/kagent-dev/SKILL.md
Lines changed: 6 additions & 2 deletions
diff --git a/.claude/skills/kagent-dev/references/ci-failures.md b/.claude/skills/kagent-dev/references/ci-failures.md
Lines changed: 2 additions & 0 deletions
diff --git a/.claude/skills/kagent-dev/references/database-migrations.md b/.claude/skills/kagent-dev/references/database-migrations.md
Lines changed: 155 additions & 0 deletions
diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml
Lines changed: 22 additions & 4 deletions
diff --git a/.github/workflows/image-scan.yaml b/.github/workflows/image-scan.yaml
Lines changed: 1 addition & 0 deletions
diff --git a/.github/workflows/migration-immutability.yaml b/.github/workflows/migration-immutability.yaml
Lines changed: 35 additions & 0 deletions
diff --git a/.github/workflows/sqlc-generate-check.yaml b/.github/workflows/sqlc-generate-check.yaml
Lines changed: 35 additions & 0 deletions
diff --git a/.github/workflows/tag.yaml b/.github/workflows/tag.yaml
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+../../.claude/skills/kagent
@@ -0,0 +1 @@
+../../.claude/skills/kagent-dev
@@ -18,6 +18,9 @@ make helm-install  # Builds images and deploys to Kind
 make controller-manifests  # generate + copy CRDs to helm (recommended)
 make -C go generate         # DeepCopy methods only
 
+# sqlc (after editing go/core/internal/database/queries/*.sql)
+cd go/core/internal/database && sqlc generate  # regenerate gen/ — commit both
+
 # Build & test
 make -C go test               # Unit tests (includes golden file checks)
 make -C go e2e                # E2E tests (needs KAGENT_URL)
@@ -43,7 +46,7 @@ kagent/
 │   ├── api/                 # Shared types module
 │   │   ├── v1alpha2/        # Current CRD types (agent_types.go, etc.)
 │   │   ├── adk/             # ADK config types (types.go) — flows to Python runtime
-│   │   ├── database/        # GORM models
+│   │   ├── database/        # database models
 │   │   ├── httpapi/         # HTTP API types
 │   │   └── config/crd/bases/ # Generated CRD YAML
 │   ├── core/                # Infrastructure module
@@ -231,7 +234,7 @@ curl -v $KAGENT_URL/healthz                                   # Controller reach
 
 **Reproducing locally (without cluster):** Follow `go/core/test/e2e/README.md` — extract agent config, start mock LLM server, run agent with `kagent-adk test`. Much faster iteration than full cluster.
 
-**CI-specific:** E2E runs in matrix (`sqlite` + `postgres`). If only one database variant fails, it's likely database-related. If both fail, it's infrastructure. Most common CI-only failure: mock LLM unreachability because `KAGENT_LOCAL_HOST` detection fails on Linux.
+**CI-specific:** Most common CI-only failure: mock LLM unreachability because `KAGENT_LOCAL_HOST` detection fails on Linux.
 
 See `references/e2e-debugging.md` for comprehensive debugging techniques.
 
@@ -349,3 +352,4 @@ Don't use Go template syntax (`{{ }}`) in doc comments — Helm will try to pars
 - `references/translator-guide.md` - Translator patterns, `deployments.go` and `adk_api_translator.go`
 - `references/e2e-debugging.md` - Comprehensive E2E debugging, local reproduction
 - `references/ci-failures.md` - CI failure patterns and fixes
+- `references/database-migrations.md` - Migration authoring rules, sqlc workflow, multi-instance safety, expand/contract pattern
@@ -7,6 +7,7 @@ Common GitHub Actions CI failures and how to fix them.
 | Failure | Likely Cause | Quick Fix |
 |---------|--------------|-----------|
 | manifests-check | CRD manifests out of date | `make -C go generate && cp go/api/config/crd/bases/*.yaml helm/kagent-crds/templates/` |
+| sqlc-generate-check | `gen/` out of sync with queries | `cd go/core/internal/database && sqlc generate`, commit `gen/` |
 | go-lint depguard | Forbidden package used | Replace with allowed alternative (e.g., `slices.Sort` not `sort.Strings`) |
 | test-e2e timeout | Agent not starting or KAGENT_URL wrong | Check pod status, verify KAGENT_URL setup in CI |
 | golden files mismatch | Translator output changed | `UPDATE_GOLDEN=true make -C go test` and commit |
@@ -520,6 +521,7 @@ make init-git-hooks
 Before submitting PR:
 
 - [ ] Ran `make -C go generate` after CRD changes
+- [ ] Ran `cd go/core/internal/database && sqlc generate` after query changes, committed `gen/`
 - [ ] Ran `make lint` and fixed issues
 - [ ] Ran `make -C go test` and all pass
 - [ ] Regenerated golden files if translator changed
 
@@ -0,0 +1,155 @@
+# Database Migrations Guide
+
+kagent uses [golang-migrate](https://github.com/golang-migrate/migrate) with embedded SQL files and [sqlc](https://sqlc.dev/) for type-safe query generation. Migrations run **in-app at startup** — the controller applies them before accepting traffic.
+
+## Structure
+
+```
+go/core/pkg/migrations/
+├── migrations.go          # Embeds the FS (go:embed); exports FS for downstream consumers
+├── runner.go              # RunUp (applies pending migrations at startup)
+├── core/                  # Core schema (tracked in schema_migrations table)
+│   ├── 000001_initial.up.sql / .down.sql
+│   ├── 000002_add_session_source.up.sql / .down.sql
+│   └── ...
+└── vector/                # pgvector schema (tracked in vector_schema_migrations table)
+    ├── 000001_vector_support.up.sql / .down.sql
+    └── ...
+
+go/core/internal/database/
+├── queries/               # Hand-written SQL queries (source of truth)
+│   ├── sessions.sql
+│   ├── memory.sql
+│   └── ...
+├── gen/                   # sqlc-generated Go code — DO NOT edit manually
+│   ├── db.go
+│   ├── models.go
+│   └── *.sql.go
+└── sqlc.yaml              # sqlc configuration
+```
+
+Migrations manage two independent tracks — `core` and `vector` — and roll back both if either fails. The `--database-vector-enabled` flag (default `true`) controls whether the vector track runs.
+
+## sqlc Workflow
+
+When you add or change a SQL query:
+
+1. Edit (or add) a `.sql` file under `go/core/internal/database/queries/`
+2. Regenerate:
+   ```bash
+   cd go/core/internal/database && sqlc generate
+   ```
+3. Commit both the query file and the updated `gen/` files together.
+
+A CI check (`.github/workflows/sqlc-generate-check.yaml`) fails the PR if `gen/` is out of sync with the queries. Never edit `gen/` by hand.
+
+**sqlc annotations used:**
+- `:one` — returns a single row
+- `:many` — returns a slice
+- `:exec` — returns only error (use for INSERT/UPDATE/DELETE that don't need the result)
+
+## Writing Migrations
+
+### Backward-compatible schema changes
+
+During a rolling deploy, old pods will be reading and writing a schema that has already been upgraded. **Every migration must be backward-compatible with the previous version's code.**
+
+| Change | Old code behavior | Safe? |
+|--------|------------------|-------|
+| Add nullable column | SELECT ignores it; INSERT omits it (goes NULL) | ✅ |
+| Add column with `DEFAULT x` | INSERT omits it; DB fills default | ✅ |
+| Add NOT NULL column **without** default | Old INSERT missing the column → error | ❌ |
+| Add index | Invisible to application code | ✅ |
+| Add foreign key | Old INSERT may fail constraint | ❌ |
+| Drop/rename a column that old code references | Old SELECT/INSERT errors | ❌ |
+| Change a column to a compatible type (e.g. `int` → `bigint`) | Usually fine | ⚠️ |
+
+**Expand-then-contract pattern for schema changes:**
+1. **Version N (Expand)**: add the new column/table (nullable or with default); old code still works
+2. **Version N (Deploy)**: ship new code that uses the new structure
+3. **Version N+1 (Contract)**: drop the old column/table once version N is fully deployed and no pods run version N-1
+
+### Idempotency and cross-track safety
+
+All DDL statements must use `IF EXISTS` / `IF NOT EXISTS` guards:
+
+```sql
+-- Up
+CREATE TABLE IF NOT EXISTS foo (...);
+ALTER TABLE foo ADD COLUMN IF NOT EXISTS bar TEXT;
+
+-- Down
+DROP TABLE IF EXISTS foo;
+ALTER TABLE foo DROP COLUMN IF EXISTS bar;
+```
+
+Guards provide defense-in-depth for crash recovery and dirty-state cleanup, where a partially-applied migration may be re-run or rolled back.
+
+### Naming
+
+Files must follow `NNNNNN_description.up.sql` / `NNNNNN_description.down.sql` with zero-padded 6-digit sequence numbers.
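The naming rule above can be checked mechanically. The following is a minimal sketch, not code from this PR: `parseMigrationName` is a hypothetical helper, and the set of characters allowed in the description (lowercase letters, digits, underscores) is an assumption based on the file names shown in this guide.

```go
package main

import (
	"fmt"
	"regexp"
)

// migrationName matches NNNNNN_description.up.sql / .down.sql.
// The description character set is an assumption, not a documented rule.
var migrationName = regexp.MustCompile(`^(\d{6})_([a-z0-9_]+)\.(up|down)\.sql$`)

// parseMigrationName splits a file name into its sequence number,
// description, and direction. ok is false when the name breaks the rule.
func parseMigrationName(name string) (seq, desc, dir string, ok bool) {
	m := migrationName.FindStringSubmatch(name)
	if m == nil {
		return "", "", "", false
	}
	return m[1], m[2], m[3], true
}

func main() {
	for _, name := range []string{
		"000002_add_session_source.up.sql",
		"000002_add_session_source.down.sql",
		"2_add_session_source.up.sql", // sequence is not zero-padded
		"000003_add_index.sql",        // direction suffix is missing
	} {
		_, _, _, ok := parseMigrationName(name)
		fmt.Printf("%-40s valid=%v\n", name, ok)
	}
}
```

The first two names are accepted and the last two are rejected, one for the unpadded sequence number and one for the missing `.up`/`.down` suffix.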
+
+### Down migrations
+
+Every `.up.sql` must have a corresponding `.down.sql` that exactly reverses it. Down migrations serve both manual rollbacks and the automatic rollback that runs when a migration fails. They must be **idempotent**, because the two-track rollback logic (roll back core if vector fails) may call them more than once in failure scenarios.
+
+## Multi-Instance Safety
+
+### How the advisory lock works
+
+The migration runner acquires a PostgreSQL **session-level** advisory lock (`pg_advisory_lock`) before running.
+
+### Rolling deploy concurrency
+
+If multiple pods start simultaneously (e.g., rolling deploy with replicas > 1):
+1. One controller acquires the advisory lock and runs migrations.
+2. Others block on `pg_advisory_lock`.
+3. When the winner finishes and its connection closes, the next waiter acquires the lock, calls `Up()`, gets `ErrNoChange`, and exits immediately.
+
+This is safe. The only risk is if the winning controller crashes mid-migration (see "Dirty state recovery" below).
+
+### Dirty state recovery
+
+If the controller crashes mid-migration, the migration runner records the version as `dirty = true` in the tracking table. The next startup detects dirty state and calls `rollbackToVersion`, which:
+1. Calls `mg.Force(version - 1)` to clear the dirty flag.
+2. Runs the down migration to restore the previous clean state.
+3. Re-runs the failed up migration.
+
+**Requirement**: down migrations must be idempotent and correctly reverse their up migration. A missing or broken down migration requires manual recovery.
+
+### Rollout strategy
+
+For backward-compatible migrations a rolling update is safe:
+
+1. New pod starts → migration runner applies pending migrations (advisory lock serializes concurrent runs)
+2. New pod passes readiness probe → old pod terminates
+3. Backward-compatible schema means old pods continue operating during the window
+
+For a migration that is **not** backward-compatible, restructure it using the expand-then-contract pattern (add new column/table in version N, ship code that uses it, drop the old column in version N+1).
+
+## Static Analysis Enforcement
+
+The policies above are enforced by static analysis tests in `go/core/pkg/migrations/cross_track_test.go`. These run against the embedded SQL files — no database required.
+
+| Test | What it enforces |
+|------|-----------------|
+| `TestNoCrossTrackDDL` | No track may `ALTER TABLE` or `CREATE INDEX ON` a table owned by another track |
+| `TestMigrationGuards` | Up migrations must use `IF NOT EXISTS` on all `CREATE`/`ADD COLUMN`; down migrations must use `IF EXISTS` on all `DROP` statements |
+
+**Adding a new track**: add the track directory name to the `tracks` slice in each test so the new track is covered by the same checks.
+
+These tests catch policy violations at PR time without needing a running database. They complement the integration tests in `runner_test.go`, which verify the runner's rollback and concurrency behavior against a real Postgres instance.
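As an illustration of what a guard check of this kind might look like, here is a simplified sketch. It is not the code in `cross_track_test.go`: `checkGuards` and the regular expressions are hypothetical, they only cover `TABLE`, `INDEX`, and `COLUMN` statements, and they do not skip SQL comments.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// The optional second group captures the guard when it is present.
	createRe = regexp.MustCompile(`(?i)\b(CREATE\s+(?:UNIQUE\s+)?(?:TABLE|INDEX)|ADD\s+COLUMN)(\s+IF\s+NOT\s+EXISTS)?`)
	dropRe   = regexp.MustCompile(`(?i)\b(DROP\s+(?:TABLE|INDEX|COLUMN))(\s+IF\s+EXISTS)?`)
)

// unguarded returns the DDL fragments in sql that lack the guard captured by re.
func unguarded(re *regexp.Regexp, sql string) []string {
	var out []string
	for _, m := range re.FindAllStringSubmatch(sql, -1) {
		if m[2] == "" {
			// Normalize internal whitespace before reporting the fragment.
			out = append(out, strings.Join(strings.Fields(m[1]), " "))
		}
	}
	return out
}

// checkGuards applies the down rule to *.down.sql files and the up rule otherwise.
func checkGuards(name, sql string) []string {
	if strings.HasSuffix(name, ".down.sql") {
		return unguarded(dropRe, sql)
	}
	return unguarded(createRe, sql)
}

func main() {
	up := `CREATE TABLE IF NOT EXISTS foo (id BIGINT);
ALTER TABLE foo ADD COLUMN bar TEXT;`
	down := `DROP TABLE foo;`

	fmt.Println(checkGuards("000003_foo.up.sql", up))     // [ADD COLUMN]
	fmt.Println(checkGuards("000003_foo.down.sql", down)) // [DROP TABLE]
}
```

A regex scan like this is deliberately coarse. It trades precision for the ability to run without a database, which is the same trade-off the static analysis tests make.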
+
+## Downstream Extension Model
+
+The migration layer is designed for downstream consumers to extend with their own migrations alongside OSS. The extension points are:
+
+1. **SQL files as the contract.** The migration files in `go/core/pkg/migrations/core/` and `vector/` are the stable interface. Downstream consumers sync these files into their own repos and build their own migration runners. Don't move or reorganize migration file paths without considering downstream impact.
+
+2. **`MigrationRunner` DI callback.** Downstream consumers pass a custom `MigrationRunner` to `app.Start` to take full ownership of the migration process — running OSS migrations alongside their own in whatever order they need. The signature `func(ctx context.Context, url string, vectorEnabled bool) error` is stable.
+
+3. **Vector track stays separate.** The vector track is conditionally applied and has its own tracking table. Downstream extensions should not modify vector-owned tables (enforced by `TestNoCrossTrackDDL`).
+
+### What this means for OSS development
+
+- **Migration immutability is cross-repo.** Once a migration file is merged and tagged, downstream consumers may have synced it. Modifying it breaks their tracking table state.
@@ -15,6 +15,7 @@ env:
   # Cache key components for better organization
   CACHE_KEY_PREFIX: kagent-v2
   BRANCH_CACHE_KEY: ${{ github.head_ref || github.ref_name }}
+  AGENT_SANDBOX_VERSION: v0.3.10
   # Consistent builder configuration
   BUILDX_BUILDER_NAME: kagent-builder-v0.23.0
   BUILDX_VERSION: v0.23.0
@@ -66,6 +67,17 @@ jobs:
         with:
           install_only: true
 
+      - name: Create Kind cluster
+        run: |
+          make create-kind-cluster
+
+      - name: Install agent-sandbox
+        run: |
+          kubectl apply -f "https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${AGENT_SANDBOX_VERSION}/manifest.yaml"
+          kubectl wait --for=condition=Established crd/sandboxes.agents.x-k8s.io --timeout=90s
+          kubectl rollout status deployment/agent-sandbox-controller -n agent-sandbox-system --timeout=120s
+          kubectl wait --for=condition=Ready pod -l app=agent-sandbox-controller -n agent-sandbox-system --timeout=120s
+
       - name: Install Kagent
         id: install-kagent
         env:
@@ -79,10 +91,11 @@ jobs:
             --platform=linux/amd64
             --push
         run: |
-          make create-kind-cluster
           echo "Cache key: ${{ needs.setup.outputs.cache-key }}"
           make helm-install
           make push-test-agent push-test-skill
+          kubectl rollout status deployment/kagent-controller -n kagent --timeout=120s
+          kubectl wait --for=condition=Ready pod -l app.kubernetes.io/component=controller -n kagent --timeout=120s
           kubectl wait --for=condition=Ready  agents.kagent.dev -n kagent --all --timeout=60s || kubectl get po -n kagent -o wide ||:
           kubectl wait --for=condition=Ready  agents.kagent.dev -n kagent --all --timeout=60s
 
@@ -113,15 +126,15 @@ jobs:
         run: |
           # Upgrade helm to use namespace-scoped RBAC
           make helm-install-provider
-          
+
           # Wait for controller to be ready after upgrade
           kubectl rollout status deployment/kagent-controller -n kagent --timeout=90s
-          
+
           # Setup environment variables (reusing logic from previous step)
           HOST_IP=$(docker network inspect kind -f '{{range .IPAM.Config}}{{if .Gateway}}{{.Gateway}}{{"\n"}}{{end}}{{end}}' | grep -E '^[0-9]+\.' | head -1)
           export KAGENT_LOCAL_HOST=$HOST_IP
           export KAGENT_URL="http://$(kubectl get svc -n kagent kagent-controller -o jsonpath='{.status.loadBalancer.ingress[0].ip}'):8083"
-          
+
           # Run critical tests with namespace-scoped RBAC to verify the controller didn't lose needed permissions
           cd go
           go test -v github.com/kagent-dev/kagent/go/core/test/e2e -run '^TestE2EInvokeInlineAgent$|^TestE2EInvokeDeclarativeAgentWithMcpServerTool$' -failfast
@@ -131,6 +144,10 @@ jobs:
           echo "::error::Failed to run e2e tests"
           echo "::error::Kubectl get pods -n kagent"
           kubectl describe pods -n kagent
+          echo "::error::Kubectl get pods -n agent-sandbox-system"
+          kubectl get pods -n agent-sandbox-system -o wide || true
+          echo "::error::Kubectl logs -n agent-sandbox-system deployment/agent-sandbox-controller"
+          kubectl logs -n agent-sandbox-system deployment/agent-sandbox-controller || true
           echo "::error::Kubectl get events -n kagent"
           kubectl get events -n kagent
           echo "::error::Kubectl get agents -n kagent"
@@ -248,6 +265,7 @@ jobs:
           - app
           - cli
           - golang-adk
+          - golang-adk-full
           - skills-init
     runs-on: ubuntu-latest
     services:
 
@@ -31,6 +31,7 @@ jobs:
           - app
           - skills-init
           - golang-adk
+          - golang-adk-full
     runs-on: ubuntu-latest
     services:
       registry:
 
@@ -0,0 +1,35 @@
+name: Migration Immutability
+
+on:
+  pull_request:
+    branches: [main]
+    paths:
+      - "go/core/pkg/migrations/**"
+
+jobs:
+  check:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - name: Fail if any existing migration file was modified
+        run: |
+          # List files under go/core/pkg/migrations/ that were changed relative
+          # to the merge base of this PR.  We only care about modifications (M)
+          # and renames (R); additions (A) are fine.
+          BASE=$(git merge-base HEAD origin/${{ github.base_ref }})
+          MODIFIED=$(git diff --name-only --diff-filter=MR "$BASE" HEAD \
+            -- 'go/core/pkg/migrations/**/*.sql')
+
+          if [ -n "$MODIFIED" ]; then
+            echo "ERROR: The following migration files were modified."
+            echo "Migration files are immutable once merged."
+            echo "Fix bugs with a new migration instead."
+            echo ""
+            echo "$MODIFIED"
+            exit 1
+          fi
+
+          echo "OK: no existing migration files were modified."
@@ -0,0 +1,35 @@
+name: sqlc Generate Check
+
+on:
+  pull_request:
+    branches: [main]
+    paths:
+      - "go/core/internal/database/queries/**"
+      - "go/core/internal/database/sqlc.yaml"
+      - "go/core/pkg/migrations/**"
+
+jobs:
+  check:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-go@v6
+        with:
+          go-version: "1.26"
+          cache: true
+          cache-dependency-path: go/go.sum
+
+      - name: Run sqlc generate
+        working-directory: go
+        run: make sqlc-generate
+
+      - name: Fail if generated files differ
+        run: |
+          if ! git diff --quiet go/core/internal/database/gen/; then
+            echo "ERROR: sqlc generate produced changes. Run sqlc generate locally and commit the result."
+            echo ""
+            git diff go/core/internal/database/gen/
+            exit 1
+          fi
+          echo "OK: generated files are up to date."
@@ -21,6 +21,7 @@ jobs:
           - ui
           - app
           - golang-adk
+          - golang-adk-full
           - skills-init
     runs-on: ubuntu-latest
     permissions: