This repository handles the HTTP Archive data pipeline, which takes the results …
The pipelines run in the Dataform service on Google Cloud Platform (GCP) and are kicked off automatically on crawl completion and other events. The code in the `main` branch is used for each triggered pipeline run.
### HTTP Archive Crawl
Tag: `crawl_complete`
- Crawl dataset `httparchive.crawl.*`
Consumers:
- Public dataset and [BQ Sharing Listing](https://console.cloud.google.com/bigquery/analytics-hub/discovery/projects/httparchive/locations/us/dataExchanges/httparchive/listings/crawl)
- Blink Features Report `httparchive.blink_features.usage`
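As an illustration for consumers of the public crawl dataset, a sketch of a BigQuery query is shown below; it assumes the current public `httparchive.crawl.pages` schema (`date`, `client`, `is_root_page` columns), and the date should be adjusted to an existing monthly crawl:

```sql
-- Count root pages crawled per client for one monthly crawl.
-- Illustrative only; verify the table schema and pick a real crawl date.
SELECT
  client,
  COUNT(0) AS pages
FROM `httparchive.crawl.pages`
WHERE date = '2024-06-01'  -- crawls are keyed by the first day of the month
  AND is_root_page
GROUP BY client
```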
To unify the workflow triggering mechanism, we use [a Cloud Run function](./infra/README.md) that can be invoked in a number of ways (e.g. by listening to Pub/Sub messages), performs intermediate checks, and triggers the appropriate Dataform workflow execution configuration.
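The routing step of such a function can be sketched as follows. This is a hypothetical illustration, not the actual implementation in `infra/`: the tag names come from this README, but the mapping, function name, and payload shape are assumptions.

```javascript
// Sketch: map an incoming event tag to a Dataform workflow configuration.
// The real function performs additional checks and then calls the
// Dataform API to start the corresponding workflow execution.
const TAG_TO_WORKFLOW = {
  crawl_complete: 'crawl',   // hypothetical workflow config names
  crux_ready: 'crux',
};

function resolveWorkflow(message) {
  // Pub/Sub delivers the payload base64-encoded in message.data.
  const payload = JSON.parse(Buffer.from(message.data, 'base64').toString());
  const workflow = TAG_TO_WORKFLOW[payload.tag];
  if (!workflow) {
    throw new Error(`No workflow configured for tag: ${payload.tag}`);
  }
  return workflow;
}
```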
Runs the batch processing workflows. There are two Dataform repositories, one for [development](https://console.cloud.google.com/bigquery/dataform/locations/us-central1/repositories/crawl-data-test/details/workspaces?authuser=7&project=httparchive) and one for [production](https://console.cloud.google.com/bigquery/dataform/locations/us-central1/repositories/crawl-data/details/workspaces?authuser=7&project=httparchive).
The test repository is used [for development and testing purposes](https://cloud.google.com/dataform/docs/workspaces) and is not connected to the rest of the pipeline infrastructure.
The pipeline can be [run manually](https://cloud.google.com/dataform/docs/code-lifecycle) from the Dataform UI.
[Configuration](./tf/dataform.tf)
## Dataform Development Workspace
1. [Create a new dev workspace](https://cloud.google.com/dataform/docs/quickstart-dev-environments) in the test Dataform repository.
2. Make adjustments to the Dataform configuration files and manually run a workflow to verify them.
3. Push all your changes to a dev branch and open a PR with a link to the BigQuery artifacts generated in the test workflow.
*Some useful hints:*
1. In the workflow settings vars, set `dev_name: dev` to process sampled data in the dev workspace.
2. Change the `current_month` variable to a month in the past. This may be helpful for testing pipelines based on `chrome-ux-report` data.
3. The `definitions/extra/test_env.sqlx` script helps set up the tables required to run pipelines in a dev workspace. It is disabled by default.
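For reference, the two variables mentioned in the hints above live in the workflow settings. A hypothetical sketch of the relevant `vars` block (the variable names come from this README, but the file layout and values are illustrative; check the repository's actual settings file):

```yaml
# Illustrative only — verify against the repository's settings file.
vars:
  dev_name: dev              # process sampled data in a dev workspace
  current_month: 2024-06-01  # set to a past month to test CrUX-based pipelines
```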
## Workspace hints
1. In `workflow_settings.yaml` set `environment: dev` to process sampled data.
2. For development and testing, you can modify variables in `includes/constants.js`, but note that these are programmatically generated.
## Repository Structure
- `definitions/` - Contains the core Dataform SQL definitions and declarations
- `output/` - Contains the main pipeline transformation logic
- `declarations/` - Contains declarations of referenced tables/views and other resource definitions
- `includes/` - Contains shared JavaScript utilities and constants
- `infra/` - Infrastructure code and deployment configurations
- `bigquery-export/` - BigQuery export service
- `dataform-service/` - Cloud Run function for Dataform workflow automation
- `tf/` - Terraform configurations
- `docs/` - Additional documentation
## GitHub to Dataform connection
A GitHub PAT is saved to a [Secret Manager secret](https://console.cloud.google.com/security/secret-manager/secret/GitHub_max-ostapenko_dataform_PAT/versions?authuser=7&project=httparchive).