
Implement BigQuery Atomic View Swapping & High-Performance GCS Streaming#74

Merged
BLMgithub merged 15 commits into
mainfrom
feature/bigquery-pointer-view
Apr 24, 2026


@BLMgithub
Owner

Transitions pipeline publishing from file-based pointers to a dual-system architecture built on BigQuery Authorized Views. Hardens core I/O to ensure zero-downtime BI delivery and to keep the pipeline within an 8GB memory footprint.

Implemented

  • Atomic View Swapping: Added swap_bigquery_view logic to redirect stable published_ views via versioned BigQuery External Tables.
  • Native GCS Streaming: Replaced legacy BigQuery Storage API with direct Polars scan_parquet via BigQuery metadata URI resolution (scan_gcs_uris_from_bigquery).
  • Discovery-First ID Mapping: Refactored the ID Registrar to resolve all entity mappings in a pre-contract Discovery Phase, consolidating I/O into a single surgical pass across the 40M-row historical registry regardless of raw file format (CSV/Parquet).
  • Temporal Normalization: Enforced terminal us (microsecond) timestamp resolution at the source, transformation, and sink layers to prevent ns schema panics.
  • Primitive Integer Pipeline: Integrated UInt32 ID Registrar for UUID-to-Int mapping; decoupled Assembly/Semantic I/O to reduce join-memory overhead by >60%.
  • Infra-as-Code: Provisioned Semantic Datasets, BigLake Connections, and 1-month auto-expiration policies via Terraform.
  • Environmental Parity: Unified local/cloud I/O paths using disk volume mounts and regional context injection (GCP_REGION).
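
The view swap described above can be sketched as a single DDL statement. This is a minimal illustration, not the actual swap_bigquery_view implementation from this PR: the function name `build_view_swap_sql`, its parameters, and the example dataset and table names are hypothetical. It assumes the stable view is repointed with BigQuery's `CREATE OR REPLACE VIEW`, which replaces the view definition in one statement so readers never see a missing view.

```python
import re

# Conservative identifier checks (assumption: names follow these simple patterns).
_PROJECT = re.compile(r"[a-z][a-z0-9\-]*")
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_view_swap_sql(project: str, dataset: str, view: str, versioned_table: str) -> str:
    """Build the DDL that repoints a stable view at a new versioned table.

    Hypothetical helper for illustration only; it builds the statement and
    leaves execution to whichever BigQuery client the pipeline uses.
    """
    if not _PROJECT.fullmatch(project):
        raise ValueError(f"invalid project id: {project!r}")
    for name in (dataset, view, versioned_table):
        # Reject anything that could break out of the backtick-quoted path.
        if not _IDENT.fullmatch(name):
            raise ValueError(f"invalid identifier: {name!r}")
    return (
        f"CREATE OR REPLACE VIEW `{project}.{dataset}.{view}` AS\n"
        f"SELECT * FROM `{project}.{dataset}.{versioned_table}`"
    )


if __name__ == "__main__":
    print(build_view_swap_sql("my-proj", "semantic", "published_orders", "orders_v20260424"))
```

The resulting string could be submitted through a BigQuery client (for example `google.cloud.bigquery.Client.query`). Because BI tools only ever reference the `published_` view name, the versioned table underneath can change without touching the Power BI data source.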

Impact

  • BI Stability: Eliminates Power BI "Dynamic Data Source" refresh blockers by providing static, native SQL endpoints.
  • I/O Efficiency: Eliminates the "Multiplier Trap" by reducing registry read-redundancy by 66% for entities appearing in multiple tables (e.g., order_id). Maximizes Parquet row-group pruning by consolidating UUID predicates into a unified global Hash-Set.
  • Reliability: Guarantees environment neutrality; resolves all "ns vs us" data type mismatches across local and cloud environments.
  • High-Throughput Scaling: Stays within the 8GB memory ceiling. Benchmarks show 40M rows (~5.5GB Parquet) processed at ~307k rows/sec, with full pipeline execution completing in 130 seconds.
  • Auditability: Maintains immutable, versioned GCS snapshots with corresponding BigQuery metadata pointers for point-in-time recovery.
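
The "Multiplier Trap" and the unified hash-set mentioned above can be illustrated with a small pure-Python sketch. It is a simplified stand-in, not the ID Registrar from this PR: the function names, the in-memory row dictionaries, and the `(uuid, int_id)` registry tuples are all assumptions made for illustration, and the Polars/Parquet scanning is not reproduced here.

```python
from typing import Dict, Iterable, List, Set, Tuple


def discover_ids(tables: Dict[str, List[dict]], id_column: str) -> Set[str]:
    """Discovery phase: collect each distinct UUID for one entity across all tables."""
    wanted: Set[str] = set()
    for rows in tables.values():
        for row in rows:
            value = row.get(id_column)
            if value is not None:
                wanted.add(value)
    return wanted


def resolve_in_one_pass(
    registry: Iterable[Tuple[str, int]], wanted: Set[str]
) -> Dict[str, int]:
    """Read the registry once, keeping only the UUIDs found during discovery."""
    return {uuid: int_id for uuid, int_id in registry if uuid in wanted}


if __name__ == "__main__":
    # order_id appears in three tables, but the registry is still read once.
    tables = {
        "orders": [{"order_id": "a"}, {"order_id": "b"}],
        "payments": [{"order_id": "a"}, {"order_id": "c"}],
        "reviews": [{"order_id": "b"}],
    }
    registry = iter([("a", 1), ("b", 2), ("z", 9), ("c", 3)])
    mapping = resolve_in_one_pass(registry, discover_ids(tables, "order_id"))
    print(mapping)
```

In this sketch an entity referenced by three tables costs one registry pass instead of three. The single combined set is also what would let a columnar reader apply one membership predicate per scan rather than one per table.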

…lumns; refactor validation and contract stage to polars
…er IDs (from id_mappping/) for all operations.
…ant registry I/O; update unit tests accordingly
@BLMgithub BLMgithub self-assigned this Apr 24, 2026
@BLMgithub BLMgithub added the enhancement New feature or request label Apr 24, 2026
@BLMgithub BLMgithub merged commit 1f6f57f into main Apr 24, 2026
1 check passed
@BLMgithub BLMgithub deleted the feature/bigquery-pointer-view branch May 17, 2026 13:56
