feat(phase-1): garage-init + dbt scaffolding + e2e fixes #21
Open
Islanders-Treasure0969 wants to merge 3 commits into
Two compose-time bugs surfaced when first running `make compose-up`:

1. Garage refused to start with `Invalid RPC secret key`. The
   `rpc_secret = "REPLACE_ME_AT_BOOTSTRAP_via_env"` placeholder in
   garage.toml was being read literally; no env override was wired.
2. Postgres 18 changed the on-disk layout (PR
   docker-library/postgres#1259): it now refuses to mount at
   /var/lib/postgresql/data and demands a parent-dir mount with
   versioned subdirs. Three of our four services crashed in lockstep
   ("PostgreSQL data in /var/lib/postgresql/data (unused mount/volume)").

Fixes:

- garage.toml: drop `rpc_secret` / `admin_token` / `metrics_token`
  lines. Garage reads `GARAGE_RPC_SECRET` etc. from the environment at
  start.
- docker-compose.yml: add `environment:` block on the garage service
  that pulls GARAGE_{RPC_SECRET,ADMIN_TOKEN} from the host env, with
  `${VAR:?...}` validation so a missing op-run wrap fails loudly
  instead of silently using empty values.
- docker-compose.yml: revert postgres 18-alpine → 17-alpine (digest
  pinned). Postgres 17 is supported through 2029; the 18 path-shape
  change is better solved in a future PR with proper PGDATA +
  parent-mount layout.
- Makefile: `compose-up` and `compose-up-streaming` now go through
  `$(OP_RUN)` (and depend on `env-check`), so secrets are always
  injected.

Verified locally: postgres / temporal-db / lakekeeper-db / temporal /
temporal-ui all healthy. Garage starts cleanly (cluster-layout init is a
follow-up). Lakekeeper DB-migration step is also a follow-up (Phase 2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
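The `${VAR:?...}` form used in docker-compose.yml follows the POSIX shell parameter expansion of the same shape, so its fail-loud behaviour can be tried in plain `sh`. The following is only a sketch: `check_garage_env` and the error messages are hypothetical and not part of this PR.

```shell
#!/bin/sh
# Sketch of the ${VAR:?message} check in POSIX sh.
# check_garage_env is a hypothetical helper, not code from this PR.
check_garage_env() {
    # Run in a subshell: a failed expansion aborts only this check,
    # not the calling shell.
    (
        # The ":?" form fails when the variable is unset OR empty.
        : "${GARAGE_RPC_SECRET:?GARAGE_RPC_SECRET is unset or empty, wrap the command in op run}"
        : "${GARAGE_ADMIN_TOKEN:?GARAGE_ADMIN_TOKEN is unset or empty, wrap the command in op run}"
    )
}
```

Calling `check_garage_env` with either variable missing or empty prints the message to stderr and returns a non-zero status; with both set it returns 0. The colon in `:?` is what rejects empty values, which matches the "fails loudly instead of silently using empty values" intent above.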
Phase 1 of the project (minimal end-to-end pipeline) now runs cleanly
from `make compose-up` through `make phase1` to a live Streamlit
dashboard at http://localhost:8501. Real GitHub data lands in Garage S3,
flows through DuckDB bronze→silver→gold, and is visualised.
Major additions
---------------
- scripts/garage-init.sh — idempotent layout + bucket + key import.
Fixes the "no role assigned / Quorum not available" state Garage
ships in. Pulls GARAGE_S3_ACCESS_KEY/SECRET_KEY from env (op run).
- Makefile: new `make garage-init` and `make dbt-install` targets, plus a
`phase1` target that chains compose-up → garage-init → ingest → dbt →
dashboard.
- transform/requirements.txt: pin dbt-core 1.10 + dbt-duckdb 1.10.
- transform/.venv setup via `make dbt-install` (separate from
ingestion's venv to keep dependency surfaces clean).
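The contents of scripts/garage-init.sh are not shown here, so the following is only a sketch of the general check-then-apply pattern that makes an init script safe to re-run. Every name in it (`ensure_step`, `bucket_exists`, `bucket_create`, the marker file) is hypothetical; the stand-in functions use a temp file so the sketch runs without a Garage cluster.

```shell
#!/bin/sh
# Generic check-then-apply helper; all names here are hypothetical.
# Usage: ensure_step NAME CHECK_FN APPLY_FN
ensure_step() {
    if "$2"; then
        echo "skip: $1 already in place"
    else
        echo "apply: $1"
        "$3"
    fi
}

# Stand-in check/apply pair backed by a marker file. A real init script
# would query and mutate the Garage cluster at these two points instead.
MARKER="${TMPDIR:-/tmp}/garage-init-demo.$$"
bucket_exists() { [ -e "$MARKER" ]; }
bucket_create() { : > "$MARKER"; }
```

Running `ensure_step bucket bucket_exists bucket_create` twice applies the step the first time and skips it the second, which is the property that lets `make phase1` call the init step on every run.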
Pipeline fixes discovered while running for real
------------------------------------------------
1. profiles.yml.example: add `s3_region: garage` so DuckDB's S3 client
stops sending `ap-northeast-1` to Garage and getting
AuthorizationHeaderMalformed back.
2. read_parquet(union_by_name=true) on all three bronze stg models —
needed because anthropics/claude-code has license=null and pyarrow
inferred its `license_spdx` column as INTEGER while every other repo
wrote VARCHAR. Combined with `try_cast(license_spdx as varchar)` on
the projection.
3. Bronze stg_* models materialized as `table` (not `view`). Working
around DuckDB v1.5 binder INTERNAL Error: when bronze is a view and
silver does `select <many cols incl. timestamps> from {{ ref(stg_*) }}
qualify row_number() over (... order by fetched_at desc) = 1`, the
binder errors with "Failed to bind column reference '': inequal
types (TIMESTAMP != VARCHAR)". Reproduced the bug in isolation; the
triggers are (a) source = subquery/view, (b) qualify with TIMESTAMP
ordering, (c) ≥2 TIMESTAMP columns in projection. Persisting bronze
sidesteps the inline-binder path.
4. Silver fct_*/dim_repos rewritten with `qualify` (replaces the
`with ranked as (select *, row_number() ...)` pattern, which also
triggered the same binder bug even with table-bronze).
5. fct_commits: replace `cardinality(parents)` with `len(parents)`.
`parents` is `VARCHAR[]` (a list); DuckDB's `cardinality()` is for
MAPs. `len()` is the canonical list/array length.
Verified end-to-end run
-----------------------
- Bronze: 5 + 7,833 + 390 rows ingested from 5 OSS repos
- Silver: 5 + 7,366 + 467 + 390 rows
- Gold: repo_health_snapshot 5 rows, repo_daily_metrics populated
- dbt: PASS=28 ERROR=0 SKIP=0 (3 bronze + 4 silver + 2 gold + 19 tests)
- Streamlit: HTTP 200 on :8501 reading gold tables
The .gitignore now also covers `.local/`, used for personal phase
trackers. See `.local/phase-1/{plan,status,log}.md` for the running
notes (gitignored).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dbt-duckdb's default generate_schema_name macro produces
<target_schema>_<custom_schema>, so a model declared as
`+schema: gold` lands at `main_gold.repo_health_snapshot` rather than
`gold.repo_health_snapshot`. The Streamlit dashboard (and any
downstream SQL example) references the cleaner `gold.*` form, so the
dashboard fell into its CatalogException-caught fallback ("Gold models
not yet materialized").
Override the macro to use the custom schema verbatim, matching what's
declared in dbt_project.yml. Now bronze/silver/gold tables live in the
schemas of those names, and the dashboard renders correctly.
The case for keeping the prefix is multi-target schema isolation
(dev_gold vs prod_gold sharing one DuckDB file), which we don't
currently need; we have a single dev profile.
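To make the naming difference concrete, here is a small shell model of the two rules. It models only the resulting schema names; the real macro is Jinja inside the dbt project. `schema_name` and its mode argument are hypothetical, and the fallback to the target schema when no custom schema is declared is an assumption about the override, not something stated in this commit.

```shell
#!/bin/sh
# Shell model of the schema naming rule only (not the dbt macro itself).
# Usage: schema_name MODE TARGET_SCHEMA [CUSTOM_SCHEMA]
schema_name() {
    mode=$1
    target=$2
    custom=${3:-}
    if [ -z "$custom" ]; then
        # No +schema declared: use the target schema (assumed for the override).
        echo "$target"
    elif [ "$mode" = "default" ]; then
        # dbt default: <target_schema>_<custom_schema>
        echo "${target}_${custom}"
    else
        # Override: custom schema used verbatim
        echo "$custom"
    fi
}
```

With a `main` target and `+schema: gold`, `schema_name default main gold` yields `main_gold`, while `schema_name verbatim main gold` yields `gold`, the form the dashboard queries.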
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
A series of fixes that bring Phase 1 (the minimum E2E pipeline) to a clean end-to-end run from `make compose-up` through `make phase1`. Real GitHub data now flows into Garage S3, passes through DuckDB bronze → silver → gold, and is displayed in Streamlit (http://localhost:8501).
What's new
- scripts/garage-init.sh: idempotently runs the Garage cluster layout / bucket / S3 key import
- `make garage-init` / `make dbt-install` Makefile targets
- transform/requirements.txt: pins dbt-core 1.10 / dbt-duckdb 1.10
- gitignore entry for the `.local/` directory
Pipeline fixes (found while running for real)
- `s3_region: garage`
- The null license on anthropics/claude-code was inferred as INTEGER, a type mismatch with the other repos: `union_by_name=true` + `try_cast(license_spdx as varchar)`
- `cardinality(parents)` fails with "can only operate on MAPs": `len(parents)` (for LISTs)
- `qualify` + multiple TIMESTAMP columns
- `transform/.venv` + `make dbt-install`
Verified
PASS=28 ERROR=0
Sample data (gold.repo_health_snapshot)