feat(phase-1): garage-init + dbt scaffolding + e2e fixes #21

Open
Islanders-Treasure0969 wants to merge 3 commits into main from feat/garage-init
Conversation

@Islanders-Treasure0969

Owner

Summary

A series of fixes that gets Phase 1 (the minimum E2E pipeline) running cleanly end to end via make compose-up → make phase1. Real GitHub data now flows into Garage S3, passes through DuckDB bronze → silver → gold, and is displayed in Streamlit (http://localhost:8501).

Base: fix/garage-secrets-via-env (PR #20). Once PR #20 is merged, this branch will be rebased onto main.

What's new

  • scripts/garage-init.sh: idempotently runs the Garage cluster layout, bucket, and S3 key import setup
  • make garage-init / make dbt-install Makefile targets
  • transform/requirements.txt: pins dbt-core 1.10 / dbt-duckdb 1.10
  • .local/ directory added to .gitignore

Pipeline fixes (issues that surfaced when actually running the pipeline)

| # | Issue | Fix |
|---|-------|-----|
| 1 | DuckDB accesses Garage with region ap-northeast-1 and gets AuthorizationHeaderMalformed | Set s3_region: garage in profiles.yml |
| 2 | anthropics/claude-code has a null license, so the column is inferred as INTEGER and its type conflicts with the other repos | union_by_name=true + try_cast(license_spdx as varchar) |
| 3 | cardinality(parents) fails with "can only operate on MAPs" | len(parents) (for LISTs) |
| 4 | DuckDB v1.5 binder bug: bronze view + silver qualify + multiple TIMESTAMP columns | Materialize bronze as tables |
| 5 | dbt is not on PATH | transform/.venv + make dbt-install |
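
Fix 2 in the table combines two ideas: aligning columns by name across files, and coercing a column whose inferred type differs per file to one type. The following is a minimal pure-Python sketch of those semantics, not DuckDB's implementation. The helper names `union_by_name` and `try_cast_varchar` and the sample rows are hypothetical, chosen only for illustration.

```python
from typing import Any, Optional


def union_by_name(batches: list[list[dict[str, Any]]]) -> list[dict[str, Any]]:
    """Stack row batches, aligning columns by name and filling gaps with None."""
    columns: list[str] = []
    for batch in batches:
        for row in batch:
            for name in row:
                if name not in columns:
                    columns.append(name)
    return [
        {name: row.get(name) for name in columns}
        for batch in batches
        for row in batch
    ]


def try_cast_varchar(value: Any) -> Optional[str]:
    """Return the value as text and keep None as None.

    str() cannot fail here, so the "failed cast becomes NULL" branch of a
    real try_cast is not modelled.
    """
    return None if value is None else str(value)


# Hypothetical sample data: one batch carries a license string, the other
# only a null, which is the situation that led to mismatched inferred types.
with_license = [{"repo": "duckdb/duckdb", "license_spdx": "MIT"}]
without_license = [{"repo": "anthropics/claude-code", "license_spdx": None}]

rows = union_by_name([with_license, without_license])
licenses = [try_cast_varchar(row["license_spdx"]) for row in rows]
```

After the cast every value in the column is either text or None, so rows from all repos can share one column type.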

Verified

  • Bronze: 5 + 7,833 + 390 rows from 5 OSS repos
  • Silver: 5 + 7,366 + 467 + 390 rows
  • Gold: 2 tables populated
  • dbt: PASS=28 ERROR=0
  • Streamlit: HTTP 200 @ :8501

Sample data (gold.repo_health_snapshot)

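| repo | stars | recent_commits_28d | recent_prs_28d | recent_issues_28d |
|------|-------|--------------------|----------------|-------------------|
| anthropics/claude-code | 121,960 | 11 | 19 | 3,984 |
| duckdb/duckdb | 38,025 | 300 | 97 | 20 |
| temporalio/temporal | 20,138 | 24 | 69 | 8 |
| dbt-labs/dbt-core | 12,734 | 10 | 32 | 10 |
| apache/iceberg | 8,836 | 45 | 106 | 19 |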

Islanders-Treasure0969 and others added 2 commits May 10, 2026 00:55
Two compose-time bugs surfaced when first running `make compose-up`:

1. Garage refused to start with `Invalid RPC secret key`. The
   `rpc_secret = "REPLACE_ME_AT_BOOTSTRAP_via_env"` placeholder in
   garage.toml was being read literally — no env override was wired.

2. Postgres 18 changed the on-disk layout (PR docker-library/postgres#1259):
   it now refuses to mount at /var/lib/postgresql/data and demands a
   parent-dir mount with versioned subdirs. Three of our four services
   crashed in lockstep ("PostgreSQL data in /var/lib/postgresql/data
   (unused mount/volume)").

Fixes:
- garage.toml: drop `rpc_secret` / `admin_token` / `metrics_token` lines.
  Garage reads `GARAGE_RPC_SECRET` etc. from the environment at start.
- docker-compose.yml: add `environment:` block on the garage service that
  pulls GARAGE_{RPC_SECRET,ADMIN_TOKEN} from the host env, with
  `${VAR:?...}` validation so a missing op-run wrap fails loudly instead
  of silently using empty values.
- docker-compose.yml: revert postgres 18-alpine → 17-alpine (digest pinned).
  Postgres 17 is supported through 2029; the 18 path-shape change is
  better solved in a future PR with proper PGDATA + parent-mount layout.
- Makefile: `compose-up` and `compose-up-streaming` now go through
  `$(OP_RUN)` (and depend on `env-check`), so secrets are always injected.
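
The `${VAR:?...}` form treats an unset or empty variable as an error instead of substituting an empty string. A rough Python equivalent of that fail-loud check is sketched below; `require_env` is a hypothetical helper and is not part of this repository.

```python
import os
from typing import Mapping


def require_env(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the variable's value, or raise if it is unset or empty."""
    value = env.get(name, "")
    if not value:
        # Same intent as ${VAR:?message}: stop immediately with a clear
        # message rather than continuing with an empty secret.
        raise RuntimeError(f"{name} is unset or empty")
    return value
```

Called as `require_env("GARAGE_RPC_SECRET")`, it either returns the secret or stops startup with an error that names the missing variable.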

Verified locally: postgres / temporal-db / lakekeeper-db / temporal /
temporal-ui all healthy. Garage starts cleanly (cluster-layout init is a
follow-up). Lakekeeper DB-migration step is also a follow-up (Phase 2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 1 of the project (minimal end-to-end pipeline) now runs cleanly
from `make compose-up` through `make phase1` to a live Streamlit
dashboard at http://localhost:8501. Real GitHub data lands in Garage S3,
flows through DuckDB bronze→silver→gold, and is visualised.

Major additions
---------------
- scripts/garage-init.sh — idempotent layout + bucket + key import.
  Fixes the "no role assigned / Quorum not available" state Garage
  ships in. Pulls GARAGE_S3_ACCESS_KEY/SECRET_KEY from env (op run).
- Makefile: `make garage-init`, `make dbt-install`, and `phase1` target
  chains compose-up → garage-init → ingest → dbt → dashboard.
- transform/requirements.txt: pin dbt-core 1.10 + dbt-duckdb 1.10.
- transform/.venv setup via `make dbt-install` (separate from
  ingestion's venv to keep dependency surfaces clean).

Pipeline fixes discovered while running for real
------------------------------------------------
1. profiles.yml.example: add `s3_region: garage` so DuckDB's S3 client
   stops sending `ap-northeast-1` to Garage and getting
   AuthorizationHeaderMalformed back.
2. read_parquet(union_by_name=true) on all three bronze stg models —
   needed because anthropics/claude-code has license=null and pyarrow
   inferred its `license_spdx` column as INTEGER while every other repo
   wrote VARCHAR. Combined with `try_cast(license_spdx as varchar)` on
   the projection.
3. Bronze stg_* models materialized as `table` (not `view`). Working
   around DuckDB v1.5 binder INTERNAL Error: when bronze is a view and
   silver does `select <many cols incl. timestamps> from {{ ref(stg_*) }}
   qualify row_number() over (... order by fetched_at desc) = 1`, the
   binder errors with "Failed to bind column reference '': inequal
   types (TIMESTAMP != VARCHAR)". Reproduced the bug in isolation; the
   triggers are (a) source = subquery/view, (b) qualify with TIMESTAMP
   ordering, (c) ≥2 TIMESTAMP columns in projection. Persisting bronze
   sidesteps the inline-binder path.
4. Silver fct_*/dim_repos rewritten with `qualify` (replaces the
   `with ranked as (select *, row_number() ...)` pattern, which also
   triggered the same binder bug even with table-bronze).
5. fct_commits: replace `cardinality(parents)` with `len(parents)`.
   `parents` is `VARCHAR[]` (a list); DuckDB's `cardinality()` is for
   MAPs. `len()` is the canonical list/array length.
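
The `qualify row_number() over (... order by fetched_at desc) = 1` pattern in items 3 and 4 keeps only the most recently fetched row per key. A minimal Python sketch of that semantics follows, assuming a hypothetical key column `sha` and made-up rows; it illustrates the intent and is not the project's SQL.

```python
from datetime import datetime
from typing import Any


def latest_per_key(
    rows: list[dict[str, Any]], key: str, order_by: str = "fetched_at"
) -> list[dict[str, Any]]:
    """Keep one row per key: the one with the greatest order_by value."""
    best: dict[Any, dict[str, Any]] = {}
    for row in rows:
        k = row[key]
        # On ties the first row seen is kept; SQL row_number() would pick
        # an arbitrary row among ties as well.
        if k not in best or row[order_by] > best[k][order_by]:
            best[k] = row
    return list(best.values())


# Hypothetical rows: commit "a1" was fetched twice, "b2" once.
rows = [
    {"sha": "a1", "fetched_at": datetime(2026, 5, 9, 10, 0), "message": "old"},
    {"sha": "a1", "fetched_at": datetime(2026, 5, 9, 12, 0), "message": "new"},
    {"sha": "b2", "fetched_at": datetime(2026, 5, 9, 11, 0), "message": "only"},
]
deduped = latest_per_key(rows, key="sha")
```

Re-ingesting the same entity therefore replaces the older copy in silver instead of duplicating it.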

Verified end-to-end run
-----------------------
- Bronze: 5 + 7,833 + 390 rows ingested from 5 OSS repos
- Silver: 5 + 7,366 + 467 + 390 rows
- Gold:   repo_health_snapshot 5 rows, repo_daily_metrics populated
- dbt:    PASS=28 ERROR=0 SKIP=0 (3 bronze + 4 silver + 2 gold + 19 tests)
- Streamlit: HTTP 200 on :8501 reading gold tables

The .gitignore now also covers `.local/`, used for personal phase
trackers. See `.local/phase-1/{plan,status,log}.md` for the running
notes (gitignored).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 9, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🏷️ Required labels (at least one) (1)
  • needs-review

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 54c1181d-c9f9-4c9e-941e-1d228e391373

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.

@Islanders-Treasure0969 Islanders-Treasure0969 changed the base branch from fix/garage-secrets-via-env to main May 9, 2026 16:27
@Islanders-Treasure0969 Islanders-Treasure0969 enabled auto-merge (squash) May 9, 2026 16:27
dbt-duckdb's default generate_schema_name macro produces
<target_schema>_<custom_schema>, so a model declared as
`+schema: gold` lands at `main_gold.repo_health_snapshot` rather than
`gold.repo_health_snapshot`. The Streamlit dashboard (and any
downstream SQL example) references the cleaner `gold.*` form, so the
dashboard fell into its CatalogException-caught fallback ("Gold models
not yet materialized").

Override the macro to use the custom schema verbatim, matching what's
declared in dbt_project.yml. Now bronze/silver/gold tables live in the
schemas of those names, and the dashboard renders correctly.

The case for keeping the prefix is multi-target schema isolation
(dev_gold vs prod_gold sharing one DuckDB file), which we don't
currently need; we have a single dev profile.
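
The difference between the default naming rule and the override can be modelled in a few lines of Python. This is a sketch of the naming logic only, assuming the target schema is `main` as in the example above; the real change is a Jinja macro, and both function names here are hypothetical.

```python
from typing import Optional


def default_schema_name(target_schema: str, custom_schema: Optional[str]) -> str:
    """Default rule: <target_schema>_<custom_schema> when a custom schema is set."""
    if custom_schema is None:
        return target_schema
    return f"{target_schema}_{custom_schema.strip()}"


def verbatim_schema_name(target_schema: str, custom_schema: Optional[str]) -> str:
    """Override: use the custom schema as declared, with no target prefix."""
    if custom_schema is None:
        return target_schema
    return custom_schema.strip()


before = default_schema_name("main", "gold")
after = verbatim_schema_name("main", "gold")
```

Here `before` is `main_gold` and `after` is `gold`. Models without a custom schema resolve to the target schema under both rules.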

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>