You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: README.md
+25-16Lines changed: 25 additions & 16 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -63,29 +63,38 @@ The pipeline does not just move data; it actively defends the analytical layer f
63
63
***End-to-End Traceability:** A single `run_id` is propagated through all raw snapshots, metadata logs, and published artifacts to provide absolute lineage tracking.
64
64
***Resilient Logging:** Even in the event of a fatal crash, the orchestrator's `finally` block guarantees that partial logs and stage reports are synced back to cloud storage before the local workspace is purged, ensuring debuggability.
The pipeline is explicitly engineered to process large-scale historical data without breaching the strict memory constraints of serverless compute (Cloud Run). To achieve this, the core execution engine was migrated from Pandas to Polars, utilizing `LazyFrames` and streaming evaluation.
68
+
The pipeline is explicitly engineered to process massive datasets within the rigid memory constraints of serverless compute (Cloud Run). By leveraging the Polars Rust engine (Lazy API & Streaming), the system achieves near-perfect memory density, operating consistently at the physical hardware ceiling.
69
69
70
-
**The Benchmark Constraint: 4GB RAM / 2 vCPU**
70
+
**GCP Stress-Test Metrics (18 Million Row Snapshot)**
>The dataset for this chart is available at [`benchmark`](/assets/benchmarks/) and the instruction to download 15m rows dataset found in [`data/`](/data/README)
***Measurement Methodology:** Performance profiles were captured by executing the pipeline locally via [`docker-compose.benchmark`](docker-compose.benchmark.yml) configured to precisely mirror the Cloud Run constraints (`memory="4G" cpus="2" POLARS_MAX_THREADS=2`). Resource footprints were tracked sequentially via a PowerShell polling script:
***The Pandas Ceiling (4M Rows in 88s):** Under the legacy Pandas engine, memory usage became fully saturated (100% / 4GiB) when processing a 4-million-row dataset. Because Pandas executes eagerly and loads entire datasets into memory, any dataset larger than 4M rows resulted in an inevitable Out-Of-Memory (OOM) crash.
80
-
***The Polars Migration (15M Rows in 67s):** By switching to the Polars Lazy API, the pipeline now processes a dataset nearly 4x larger (15 million rows) while actually reducing execution time from 88 seconds to 67 seconds within the exact same 4GB/2vCPU constraint.
81
-
***Streaming Evaluation:** Instead of eagerly loading the whole dataset, Polars processes data in batches, drastically reducing the memory footprint.
82
-
***Multi-Core Utilization:** Unlike single-threaded Pandas (peaking at ~100% CPU), the Polars engine effectively parallelizes the workload, consistently utilizing ~200% CPU across both provisioned cores.
83
-
***Zero-Copy Export:** The Semantic stage leverages `sink_parquet` to write analytical models directly to disk via streaming, ensuring memory is freed instantaneously during the final Gold-layer assembly.
74
+
> The data used for this chart [`benchmarks/`](/assets/benchmarks/polars/18mrows_dataset_stats_log.csv) and the 18m rows dataset can be found her [`data/`](/data/)
75
+
76
+
77
+
| Metric | Value (18M Row Peak Load) |
78
+
| :--- | :--- |
79
+
|**Throughput (Processing)**|~116,000 Rows / Second |
|**Effective Data Headroom**|~6.5 GiB (Active Transformation) |
84
+
85
+
***Linear Vertical Scaling:** Bumping the Cloud Run provision to 32GiB allows the same architecture to process ~72 Million rows without code changes.
86
+
***Predictable Capacity:** Identifying the 1.5GB "Memory Tax" allows for precise resource governance, ensuring jobs never fail due to unpredictable Signal 9 (OOM) events.
87
+
***Zero-Idle Economics:** 100% serverless execution ensures zero billable time during idle periods, significantly reducing the Total Cost of Ownership (TCO) compared to dedicated cluster solutions.
88
+
89
+
**Measurement Methodology**
90
+
***Performance Profiling:** Captured from production telemetry via the pipeline's native `run_duration` metadata, calculating the precise delta between `started_at` and `completed_at` timestamps.
91
+
***Memory Utilization:** Monitored via an integrated [`psutil.virtual_memory().used`](/assets/benchmarks/polars/README.md) profiling implementation to verify the actual resource footprint and confirm the physical ceiling for an 8GiB provision.
92
+
***Throughput Efficiency:** Leverages Polars' streaming evaluation to maintain high throughput and minimize CPU idle time during GCS I/O, providing a significant performance advantage over traditional eager-loading engines.
Operational maturity requires assuming things will eventually break. The pipeline features a comprehensive observability suite managed natively via Google Cloud Monitoring and Cloud Logging, codified entirely in Terraform.
91
100
@@ -111,7 +120,7 @@ The system monitors specific log payloads across the infrastructure and dispatch
111
120
112
121
## Repository Structure
113
122
114
-
```text
123
+
```
115
124
operations-analytics-pipeline/
116
125
├── .gcp/
117
126
│ └── terraforms/ # IaC for all GCP resources (Cloud Run, Eventarc, Storage, IAM)
This section provides proof that the memory metrics in the root README were captured from a real Cloud Run execution of the 18M row dataset.
4
+
5
+
The telemetry logger below was added **temporarily** to the orchestrator for a specific benchmarking run. This code was pushed directly to the Cloud Artifact Registry as an experimental image tag (`mem-record`) and is not part of the permanent git repository history.
6
+
7
+
```python
8
+
import psutil
9
+
import threading
10
+
import time
11
+
12
+
defmemory_logger(stop_event: threading.Event):
13
+
"""Temporary: Logs RAM usage to stdout every 1s for benchmarking."""
***Source:** Real-time stdout logs from the Cloud Run job execution.
34
+
***Extraction:** Log entries with the `METRIC_MEM` prefix were filtered and exported as a CSV.
35
+
***Status:** This methodology ensures that the reported peak loads and "V-shaped" memory reclamation drops are reproducible and based on actual hardware performance.
Copy file name to clipboardExpand all lines: data/README.md
+1-50Lines changed: 1 addition & 50 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -18,53 +18,4 @@ The downloaded archive contains the following partitions:
18
18
**Execute the local pipeline:**
19
19
```
20
20
python -m data_pipeline.run_pipeline
21
-
```
22
-
23
-
## Data Dictionary: Contract-Compliant Schema (Silver Layer)
24
-
The following tables represent the technical contracts enforced during the **Contract Stage**. Source [`table_configs.py`](../data_pipeline/shared/table_configs.py).
0 commit comments