data/README.md
Lines changed: 21 additions & 9 deletions
@@ -1,21 +1,32 @@
1
1
# Data & Synthetic Benchmarks
2
2
3
-
This directory serves as the local state provider for the pipeline when executing in a non-cloud environment. It mimics the structure of the Google Cloud Storage (GCS) buckets, allowing for high-fidelity local simulation and performance benchmarking.
3
+
This directory serves as the local state provider for the pipeline when executing in a non-cloud environment. It mimics the structure of the Google Cloud Storage (GCS) buckets.
4
4
5
5
## Synthetic Dataset
6
-
To replicate the high-volume environment described in the [GCP Stress-Test Metrics (Scaling Efficiency)](/README.md#gcp-stress-test-metrics-scaling-efficiency) section, you can download the 36M-row synthetic dataset here: [**Kaggle Dataset Link**](https://www.kaggle.com/datasets/melvidabryan/e-commerce-synthetic-dataset)
6
+
To replicate the high-volume environment described in the [GCP Stress-Test Metrics (Scaling Efficiency)](/README.md#gcp-stress-test-metrics-scaling-efficiency) section, you can download the 40M-row synthetic dataset here: [**Kaggle Dataset Link**](https://www.kaggle.com/datasets/melvidabryan/e-commerce-synthetic-dataset)
7
7
8
-
>*Note: This upload contains the **Contracted Version** of the dataset. The original "Raw" state—totaling approximately 24GB of unrefined CSVs was omitted to prioritize transfer efficiency.*
8
+
>*Note: This upload contains the **Contracted Version** of the dataset. The original "Raw" state, totaling approximately 26 GB of unrefined CSVs, was omitted to prioritize transfer efficiency.*
9
9
10
-
###File Structure & Purpose
11
-
The dataset is divided into two primary directories to facilitate different stages of pipeline testing:
10
+
## File Structure & Purpose
11
+
The dataset is divided into three primary directories to facilitate different stages of pipeline testing:
12
12
13
13
| Directory | Files | Description |
14
14
| :--- | :--- | :--- |
15
-
|`contracted/`| 110 files |**Production-Scale Test:** The full 36M row dataset (~4.04 GB) formatted to strict enterprise schema requirements. |
16
-
|`raw/`| 5 files |**Delta Sample (Validation):** Small-scale samples (~10k rows each) representing **daily incoming deltas**. These files are intentionally "noisy" to exhibit the full range of injected data quality errors. |
15
+
|`contracted/`| 125 files |**Production-Scale Test:** The full 36M-row dataset (~5.34 GB) formatted to strict schema requirements. |
16
+
|`id_mapping/customer_id/`| 1 file |**Metadata Registry:** Central lookup mapping Customer UUIDs to Uint32 surrogate keys. |
17
+
|`id_mapping/order_id/`| 40 files |**Metadata Registry (Sharded):** Fragmented lookup (40M+ keys) to test high-cardinality ID resolution. |
18
+
|`id_mapping/product_id/`| 1 file |**Metadata Registry:** Central lookup mapping Product UUIDs to Uint32 surrogate keys. |
19
+
|`id_mapping/seller_id/`| 1 file |**Metadata Registry:** Central lookup mapping Seller UUIDs to Uint32 surrogate keys. |
20
+
|`raw/`| 5 files |**Delta Sample (Validation):** Small-scale samples (~20k rows each) representing **daily incoming deltas**. These files are intentionally "noisy" to exhibit the full range of injected data quality errors. |
17
21
18
-
### Included Tables
22
+
---
23
+
24
+
### ID Mapping & Surrogate Key Simulation
25
+
The `id_mapping/` directory acts as a simulated metadata registrar for surrogate key generation. The pipeline uses these registries to resolve raw source UUIDs into memory-efficient Uint32 identifiers while enforcing global deduplication and referential integrity.
26
+
27
+
To benchmark ***[`mapping`](/data_pipeline/contract/id_registrar.py) throughput and memory footprint***, the `order_id` registry is partitioned into 40 sharded files (1M rows each). This fragmentation simulates the ingestion pressure of high-cardinality transactional data (40M+ unique keys) on serverless compute. The dimension-level registries (Customer, Product, Seller) remain unfragmented because their lower cardinality does not reach the resource-exhaustion thresholds these benchmarks are designed to trigger.
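The resolution step described above can be illustrated with a minimal sketch. It is not the code in `id_registrar.py`: the `ShardedRegistry` class, its method names, and the `(source_id, surrogate_key)` row shape are hypothetical, chosen only to show how shards could be merged into one lookup that deduplicates IDs and keeps keys within the Uint32 range.

```python
from typing import Dict, Iterable, Tuple

UINT32_MAX = 2**32 - 1  # largest value a Uint32 surrogate key can hold


class ShardedRegistry:
    """Hypothetical in-memory registry mapping source UUID strings to Uint32 keys."""

    def __init__(self) -> None:
        self._keys: Dict[str, int] = {}
        self._next_key = 0

    def load_shard(self, rows: Iterable[Tuple[str, int]]) -> None:
        """Merge one shard of (source_id, surrogate_key) rows into the lookup."""
        for source_id, surrogate in rows:
            if not 0 <= surrogate <= UINT32_MAX:
                raise ValueError(f"key {surrogate} does not fit in Uint32")
            existing = self._keys.get(source_id)
            if existing is not None and existing != surrogate:
                # The same source ID must never map to two different keys.
                raise ValueError(f"conflicting keys for {source_id}")
            self._keys[source_id] = surrogate
            self._next_key = max(self._next_key, surrogate + 1)

    def resolve(self, source_id: str) -> int:
        """Return the existing key, or assign the next free one (deduplicated)."""
        key = self._keys.get(source_id)
        if key is None:
            if self._next_key > UINT32_MAX:
                raise OverflowError("Uint32 key space exhausted")
            key = self._next_key
            self._keys[source_id] = key
            self._next_key += 1
        return key
```

In this sketch, loading the 40 `order_id` shards would mean calling `load_shard` once per file, so peak memory grows with the total number of unique keys rather than with the size of any single shard.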
28
+
29
+
## Included Tables
19
30
20
31
The dataset provides a complete relational snapshot of an e-commerce ecosystem:
21
32
@@ -28,7 +39,8 @@ The dataset provides a complete relational snapshot of an e-commerce ecosystem:
28
39
## Local Execution Setup
29
40
1. Extract the downloaded dataset archive.
30
41
2. Copy the `raw/` and `contracted/` directories into this `data/` folder.
31
-
3. The `RunContext` manager is configured to strictly recognize `.parquet` and `.csv` extensions; all other file types are ignored to prevent ingestion noise.
42
+
3. Use the commented-out local path in [`RunContext.create()`](../data_pipeline/shared/run_context.py#L62).
43
+
4. The `RunContext` manager is configured to strictly recognize `.parquet` and `.csv` extensions; all other file types are ignored to prevent ingestion noise.
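
To check which files would survive the extension filter after copying the directories, a small helper like the one below can be used. It is a sketch under stated assumptions, not the `RunContext` implementation: the function name `list_ingestible_files` is hypothetical, and the exact, case-sensitive suffix match is an assumption about what "strictly recognize" means.

```python
from pathlib import Path
from typing import List

# Extensions named in the setup steps; everything else is skipped.
ALLOWED_EXTENSIONS = {".parquet", ".csv"}


def list_ingestible_files(root: Path) -> List[Path]:
    """Return files under root whose suffix is exactly .parquet or .csv."""
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix in ALLOWED_EXTENSIONS
    )
```

Calling `list_ingestible_files(Path("data"))` from the repository root would list the candidate files, which makes stray archives or editor artifacts easy to spot before a run.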