Implement BigQuery Atomic View Swapping & High-Performance GCS Streaming by BLMgithub · Pull Request #74 · BLMgithub/operations-analytics-pipeline

BLMgithub · 2026-04-24T16:01:23Z

Transitions pipeline publishing from file-based pointers to a dual-system architecture built on BigQuery Authorized Views. Hardens core I/O to ensure zero-downtime BI delivery and to keep the memory footprint within 8GB.

Implemented

Atomic View Swapping: Added swap_bigquery_view logic to redirect stable published_ views via versioned BigQuery External Tables.
Native GCS Streaming: Replaced legacy BigQuery Storage API with direct Polars scan_parquet via BigQuery metadata URI resolution (scan_gcs_uris_from_bigquery).
Discovery-First ID Mapping: Refactored the ID Registrar to resolve all entity mappings in a pre-contract Discovery Phase, consolidating I/O into a single surgical pass across the 40M-row historical registry regardless of raw file format (CSV/Parquet).
Temporal Normalization: Enforced microsecond (us) resolution at the source, transformation, and sink layers to prevent nanosecond (ns) schema panics.
Primitive Integer Pipeline: Integrated UInt32 ID Registrar for UUID-to-Int mapping; decoupled Assembly/Semantic I/O to reduce join-memory overhead by >60%.
Infra-as-Code: Provisioned Semantic Datasets, BigLake Connections, and 1-month auto-expiration policies via Terraform.
Environmental Parity: Unified local/cloud I/O paths using disk volume mounts and regional context injection (GCP_REGION).
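
The view swap described above can be sketched as a single CREATE OR REPLACE VIEW statement that repoints a stable published_ view at one versioned table. This is a minimal sketch, not the PR's swap_bigquery_view implementation: the function names, the project/dataset/table values, and the identifier checks are all hypothetical, and `client` is assumed to be a google.cloud.bigquery.Client.

```python
import re

# Conservative patterns for illustration; real naming rules may be broader.
_PROJECT = re.compile(r"^[a-z][a-z0-9\-]*$")
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_view_swap_sql(project, dataset, view_name, versioned_table):
    """Return DDL that repoints a stable view at one versioned table."""
    if not _PROJECT.match(project):
        raise ValueError(f"unsafe project id: {project!r}")
    for part in (dataset, view_name, versioned_table):
        if not _IDENT.match(part):
            raise ValueError(f"unsafe identifier: {part!r}")
    return (
        f"CREATE OR REPLACE VIEW `{project}.{dataset}.{view_name}` AS\n"
        f"SELECT * FROM `{project}.{dataset}.{versioned_table}`"
    )


def swap_view(client, project, dataset, view_name, versioned_table):
    """Run the swap. `client` is assumed to be a google.cloud.bigquery.Client."""
    sql = build_view_swap_sql(project, dataset, view_name, versioned_table)
    client.query(sql).result()  # wait for the DDL job to finish
    return sql
```

Because the view name never changes, a BI tool connected to the published_ view keeps the same endpoint while the underlying version changes. Rolling back is the same call with an older versioned table name.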

Impact

BI Stability: Eliminates Power BI "Dynamic Data Source" refresh blockers by providing static, native SQL endpoints.
I/O Efficiency: Eliminates the "Multiplier Trap" by reducing registry read-redundancy by 66% for entities appearing in multiple tables (e.g., order_id). Maximizes Parquet row-group pruning by consolidating UUID predicates into a unified global Hash-Set.
Reliability: Guarantees environment neutrality; resolves all "ns vs us" data type mismatches across local and cloud environments.
High-Throughput Scaling: Successfully defends the 8GB memory ceiling. Benchmarks confirm high-efficiency processing of 40M rows (~5.5GB Parquet) at a throughput of ~307k rows/sec, achieving full pipeline execution in 130 seconds.
Auditability: Maintains immutable, versioned GCS snapshots with corresponding BigQuery metadata pointers for point-in-time recovery.
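
The discovery-first idea behind the I/O Efficiency item can be illustrated with plain Python: gather every UUID from all raw tables into one set, then resolve them in a single pass over the registry. This is a sketch under stated assumptions, not the PR's ID Registrar; the function names, table names, and sample keys are hypothetical, and short strings stand in for UUIDs.

```python
UINT32_MAX = 2**32 - 1  # largest value a UInt32 ID can hold


def discover_ids(tables):
    """Discovery phase: collect every distinct key across all raw tables."""
    wanted = set()
    for keys in tables.values():
        wanted.update(keys)
    return wanted


def resolve_ids(wanted, registry_rows, next_id):
    """Resolve keys in one registry pass, then assign IDs to unseen keys."""
    mapping = {}
    for key, int_id in registry_rows:  # single pass over the registry
        if key in wanted:
            mapping[key] = int_id
    for key in sorted(wanted.difference(mapping)):  # keys not yet registered
        if next_id > UINT32_MAX:
            raise OverflowError("UInt32 ID space exhausted")
        mapping[key] = next_id
        next_id += 1
    return mapping


# Hypothetical sample: the same keys appear in three tables.
tables = {
    "orders": ["u1", "u2"],
    "payments": ["u2", "u3"],
    "items": ["u1", "u3", "u4"],
}
registry = [("u1", 0), ("u9", 1), ("u2", 2)]
mapping = resolve_ids(discover_ids(tables), registry, next_id=3)
# mapping == {"u1": 0, "u2": 2, "u3": 3, "u4": 4}
```

Resolving each table separately would scan the registry once per table. If an entity appears in three tables (an assumption; the count is not stated here), one shared pass replaces three, which is in line with the roughly 66% reduction quoted above.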

BLMgithub added 15 commits April 14, 2026 08:03

feat: add bigquery atomic view swap to manage published versioning for BI connectivity

70d89ab

feat: add provision for semantics dataset; enable and bind bigquery role for data pipeline job via terraform

f93888a

feat: implement persistent ID registrar for UInt32 mapping of UUID columns; refactor validation and contract stage to polars

bb447f6

refactor: transition assembly and semantic stages to use UInt32 integer IDs (from id_mappping/) for all operations.

24ebbef

docs: update docstring and stage markdown documents to match refactoring and new implementation

4234873

feat: implement disk volume mount provisioning and document workaround to enable codifying new gcp feature

b93ad7a

refactor: refactor: replace id_registrar memory-intensive anti-join with stream filtering

33d1786

refactor: decouple assembly and semantic stage IO

091b196

test: update validation unit test to match refactoring

70ef08c

feat: add local and gcp path IO adapter for id_mapping; add unit test for id_registrar

54fdefa

feat: implement direct GCS streaming by URIs via BigQuery metadata; update docs and unit tests accordingly

b6e27f3

fix: id_registrar replacing the accumulated mapped ids when GCS storage adapter

887efa6

feat: implement discovery-first global ID mapping to eliminate redundant registry I/O; update unit tests accordingly

d89ca84

feat: codify bigquery datasets and external tables; update ci pipeline and iac docs

840f973

docs: align project documentation and benchmarks with new pipeline architecture

16004aa

BLMgithub self-assigned this Apr 24, 2026

BLMgithub added the enhancement New feature or request label Apr 24, 2026

BLMgithub added this to Seller Fulfillment Intervention Monitor Apr 24, 2026

github-project-automation Bot moved this to Backlog in Seller Fulfillment Intervention Monitor Apr 24, 2026

BLMgithub merged commit 1f6f57f into main Apr 24, 2026
1 check passed

github-project-automation Bot moved this from Backlog to Completed in Seller Fulfillment Intervention Monitor Apr 24, 2026

BLMgithub deleted the feature/bigquery-pointer-view branch May 17, 2026 13:56

BLMgithub merged 15 commits into main from feature/bigquery-pointer-view