Commit 53c133e

Merge branch 'main' into standard_reports

2 parents: 53dab5c + 1996661


54 files changed: +2869, -7688 lines

.github/dependabot.yml

Lines changed: 8 additions & 3 deletions
@@ -10,12 +10,17 @@ updates:
     directories:
       - "/"
       - "/infra/bigquery-export"
-      - "infra/dataform-export"
-      - "infra/dataform-trigger"
+      - "/infra/dataform-service"
     schedule:
       interval: "weekly"

   - package-ecosystem: "terraform"
-    directory: "infra/tf/"
+    directories:
+      - "/infra/tf/"
+    schedule:
+      interval: "weekly"
+
+  - package-ecosystem: "github-actions"
+    directory: "/"
     schedule:
       interval: "weekly"

.github/workflows/ci.yaml

Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
+---
+name: CI
+
+on:
+  workflow_dispatch:
+  pull_request:
+    branches:
+      - main
+
+permissions:
+  contents: read
+  packages: read
+
+jobs:
+  lint:
+    name: Lint
+    runs-on: ubuntu-latest
+
+    permissions:
+      statuses: write # To report GitHub Actions status checks
+
+    steps:
+      - name: Checkout Code
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - name: Lint Code Base
+        uses: super-linter/super-linter/slim@v8.0.0
+        env:
+          DEFAULT_BRANCH: main
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          LINTER_RULES_PATH: .
+          VALIDATE_JSCPD: false
+          VALIDATE_JAVASCRIPT_PRETTIER: false
+          VALIDATE_MARKDOWN_PRETTIER: false
+          VALIDATE_CHECKOV: false
+          VALIDATE_GIT_COMMITLINT: false
+
+  dependabot:
+    name: Dependabot auto-merge
+    runs-on: ubuntu-latest
+    needs: lint
+    if: github.event.pull_request.user.login == 'dependabot[bot]' && github.repository == 'HTTPArchive/dataform'
+
+    permissions:
+      contents: write
+      pull-requests: write
+
+    steps:
+      - name: Dependabot metadata
+        id: metadata
+        uses: dependabot/fetch-metadata@v2
+        with:
+          github-token: "${{ secrets.GITHUB_TOKEN }}"
+
+      - name: Enable auto-merge for Dependabot PRs
+        if: steps.metadata.outputs.update-type == 'version-update:semver-patch' || steps.metadata.outputs.update-type == 'version-update:semver-minor'
+        run: gh pr merge --auto --squash "$PR_URL"
+        env:
+          PR_URL: ${{github.event.pull_request.html_url}}
+          GH_TOKEN: ${{secrets.GITHUB_TOKEN}}

.github/workflows/linter.yaml

Lines changed: 0 additions & 37 deletions
This file was deleted.

.gitignore

Lines changed: 1 addition & 5 deletions
@@ -1,8 +1,4 @@
 node_modules/
 .DS_Store
-.venv/
-
-# Terraform
 infra/tf/.terraform/
-infra/tf/tmp/
-**/*.zip
+.env

Makefile

Lines changed: 0 additions & 6 deletions
@@ -11,9 +11,3 @@ tf_plan:

 tf_apply:
 	terraform -chdir=infra/tf init && terraform -chdir=infra/tf apply -auto-approve
-
-bigquery_export_deploy:
-	cd infra/bigquery-export && npm run build
-
-#bigquery_export_spark_deploy:
-#	cd infra/bigquery_export_spark && gcloud builds submit --region=global --tag us-docker.pkg.dev/httparchive/bigquery-spark-procedures/firestore_export:latest

README.md

Lines changed: 74 additions & 40 deletions
@@ -6,38 +6,35 @@

 The pipelines are run in Dataform service in Google Cloud Platform (GCP) and are kicked off automatically on crawl completion and other events. The code in the `main` branch is used on each triggered pipeline run.

-### Crawl results
+### HTTP Archive Crawl

 Tag: `crawl_complete`

-- httparchive.crawl.pages
-- httparchive.crawl.parsed_css
-- httparchive.crawl.requests
+- Crawl dataset `httparchive.crawl.*`

-### Core Web Vitals Technology Report
+Consumers:

-Tag: `crux_ready`
+- public dataset and [BQ Sharing Listing](https://console.cloud.google.com/bigquery/analytics-hub/discovery/projects/httparchive/locations/us/dataExchanges/httparchive/listings/crawl)

-- httparchive.core_web_vitals.technologies
+- Blink Features Report `httparchive.blink_features.usage`

-Consumers:
+Consumers:

-- [HTTP Archive Tech Report](https://httparchive.org/reports/techreport/landing)
+- [chromestatus.com](https://chromestatus.com/metrics/feature/timeline/popularity/2089)

-### Blink Features Report
+### HTTP Archive Technology Report

-Tag: `crawl_complete`
+Tag: `crux_ready`

-- httparchive.blink_features.features
-- httparchive.blink_features.usage
+- `httparchive.reports.cwv_tech_*` and `httparchive.reports.tech_*`

-Consumers:
+Consumers:

-- chromestatus.com - [example](https://chromestatus.com/metrics/feature/timeline/popularity/2089)
+- [HTTP Archive Tech Report](https://httparchive.org/reports/techreport/landing)

 ## Schedules

-1. [crawl-complete](https://console.cloud.google.com/cloudpubsub/subscription/detail/dataformTrigger?authuser=7&project=httparchive) PubSub subscription
+1. [crawl-complete](https://console.cloud.google.com/cloudpubsub/subscription/detail/dataform-service-crawl-complete?authuser=2&project=httparchive) PubSub subscription

    Tags: ["crawl_complete"]

@@ -49,30 +46,66 @@ Consumers:

 In order to unify the workflow triggering mechanism, we use [a Cloud Run function](./infra/README.md) that can be invoked in a number of ways (e.g. listen to PubSub messages), do intermediate checks and trigger the particular Dataform workflow execution configuration.

-## Contributing
-
-### Dataform development
-
-1. [Create new dev workspace](https://cloud.google.com/dataform/docs/quickstart-dev-environments) in Dataform.
-2. Make adjustments to the dataform configuration files and manually run a workflow to verify.
-3. Push all your changes to a dev branch & open a PR with the link to the BigQuery artifacts generated in the test workflow.
-
-#### Workspace hints
-
-1. In `workflow_settings.yaml` set `environment: dev` to process sampled data.
-2. For development and testing, you can modify variables in `includes/constants.js`, but note that these are programmatically generated.
-
-## Repository Structure
-
-- `definitions/` - Contains the core Dataform SQL definitions and declarations
-  - `output/` - Contains the main pipeline transformation logic
-  - `declarations/` - Contains referenced tables/views declarations and other resources definitions
-- `includes/` - Contains shared JavaScript utilities and constants
-- `infra/` - Infrastructure code and deployment configurations
-  - `dataform-trigger/` - Cloud Run function for workflow automation
-  - `tf/` - Terraform configurations
-  - `bigquery-export/` - BigQuery export configurations
-- `docs/` - Additional documentation
+## Cloud resources overview
+
+```mermaid
+graph TB;
+    subgraph Cloud Run
+        dataform-service[dataform-service service]
+        bigquery-export[bigquery-export job]
+    end
+
+    subgraph PubSub
+        crawl-complete[crawl-complete topic]
+        dataform-service-crawl-complete[dataform-service-crawl-complete subscription]
+        crawl-complete --> dataform-service-crawl-complete
+    end
+
+    dataform-service-crawl-complete --> dataform-service
+
+    subgraph Cloud_Scheduler
+        bq-poller-crux-ready[bq-poller-crux-ready Poller Scheduler Job]
+        bq-poller-crux-ready --> dataform-service
+    end
+
+    subgraph Dataform
+        dataform[Dataform Repository]
+        dataform_release_config[dataform Release Configuration]
+        dataform_workflow[dataform Workflow Execution]
+    end
+
+    dataform-service --> dataform[Dataform Repository]
+    dataform --> dataform_release_config
+    dataform_release_config --> dataform_workflow
+
+    subgraph BigQuery
+        bq_jobs[BigQuery jobs]
+        bq_datasets[BigQuery table updates]
+        bq_jobs --> bq_datasets
+    end
+
+    dataform_workflow --> bq_jobs
+
+    bq_jobs --> bigquery-export
+
+    subgraph Monitoring
+        cloud_run_logs[Cloud Run logs]
+        dataform_logs[Dataform logs]
+        bq_logs[BigQuery logs]
+        alerting_policies[Alerting Policies]
+        slack_notifications[Slack notifications]
+
+        cloud_run_logs --> alerting_policies
+        dataform_logs --> alerting_policies
+        bq_logs --> alerting_policies
+        alerting_policies --> slack_notifications
+    end
+
+    dataform-service --> cloud_run_logs
+    dataform_workflow --> dataform_logs
+    bq_jobs --> bq_logs
+    bigquery-export --> cloud_run_logs
+```

 ## Development Setup

@@ -86,6 +119,7 @@ In order to unify the workflow triggering mechanism, we use [a Cloud Run functio

 - `npm run format` - Format code using Standard.js, fix Markdown issues, and format Terraform files
 - `npm run lint` - Run linting checks on JavaScript, Markdown files, and compile Dataform configs
+- `make tf_apply` - Apply Terraform configurations

 ## Code Quality
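The README describes the trigger path in prose: a Cloud Run function receives events (e.g. PubSub push messages), does intermediate checks, and starts the matching Dataform workflow. A minimal sketch of that decode-and-dispatch step is below; the function names and event-to-tag mapping are illustrative assumptions, not the actual `dataform-service` code, and the real service would go on to call the Dataform API.

```javascript
// Hypothetical sketch of the dataform-service trigger flow (names assumed).
// A PubSub push request wraps the payload base64-encoded under message.data;
// we decode it and select the Dataform tags described in the README.

function decodePubSubMessage (body) {
  const data = body && body.message && body.message.data
  if (!data) return null
  return JSON.parse(Buffer.from(data, 'base64').toString('utf8'))
}

// Event-to-tags mapping mirroring the README's tags (assumed shape).
const EVENT_TAGS = {
  crawl_complete: ['crawl_complete'],
  crux_ready: ['crux_ready']
}

function tagsForEvent (event) {
  return (event && EVENT_TAGS[event.name]) || []
}

// Example: a crawl-complete message selects the crawl_complete tag.
const body = {
  message: {
    data: Buffer.from(JSON.stringify({ name: 'crawl_complete' })).toString('base64')
  }
}
console.log(tagsForEvent(decodePubSubMessage(body)))
```

The same dispatch shape covers the Cloud Scheduler poller path, since both entry points reduce to "event name in, workflow tags out".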
dataform.md

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+# Dataform
+
+Runs the batch processing workflows. There are two Dataform repositories for [development](https://console.cloud.google.com/bigquery/dataform/locations/us-central1/repositories/crawl-data-test/details/workspaces?authuser=7&project=httparchive) and [production](https://console.cloud.google.com/bigquery/dataform/locations/us-central1/repositories/crawl-data/details/workspaces?authuser=7&project=httparchive).
+
+The test repository is used [for development and testing purposes](https://cloud.google.com/dataform/docs/workspaces) and is not connected to the rest of the pipeline infra.
+
+The pipeline can be [run manually](https://cloud.google.com/dataform/docs/code-lifecycle) from the Dataform UI.
+
+[Configuration](./tf/dataform.tf)
+
+## Dataform Development Workspace
+
+1. [Create a new dev workspace](https://cloud.google.com/dataform/docs/quickstart-dev-environments) in the test Dataform repository.
+2. Make adjustments to the Dataform configuration files and manually run a workflow to verify.
+3. Push all your changes to a dev branch & open a PR with the link to the BigQuery artifacts generated in the test workflow.
+
+*Some useful hints:*
+
+1. In workflow settings vars, set `dev_name: dev` to process sampled data in the dev workspace.
+2. Change the `current_month` variable to a month in the past; this can be helpful for testing pipelines based on `chrome-ux-report` data.
+3. The `definitions/extra/test_env.sqlx` script helps set up the tables required to run pipelines in a dev workspace. It's disabled by default.
+
+## Workspace hints
+
+1. In `workflow_settings.yaml` set `environment: dev` to process sampled data.
+2. For development and testing, you can modify variables in `includes/constants.js`, but note that these are programmatically generated.
+
+## Repository Structure
+
+- `definitions/` - Contains the core Dataform SQL definitions and declarations
+  - `output/` - Contains the main pipeline transformation logic
+  - `declarations/` - Contains referenced tables/views declarations and other resources definitions
+- `includes/` - Contains shared JavaScript utilities and constants
+- `infra/` - Infrastructure code and deployment configurations
+  - `bigquery-export/` - BigQuery export service
+  - `dataform-service/` - Cloud Run function for Dataform workflow automation
+  - `tf/` - Terraform configurations
+- `docs/` - Additional documentation
+
+## GitHub to Dataform connection
+
+A GitHub PAT is saved to a [Secret Manager secret](https://console.cloud.google.com/security/secret-manager/secret/GitHub_max-ostapenko_dataform_PAT/versions?authuser=7&project=httparchive).
+
+- repository: HTTPArchive/dataform
+- permissions:
+  - Commit statuses: read
+  - Contents: read, write
+
+## Monitoring
+
+- [Production Dataform workflow execution logs](https://console.cloud.google.com/bigquery/dataform/locations/us-central1/repositories/crawl-data/details/workflows?authuser=7&project=httparchive)
+
+- [Dataform Workflow Invocation Failed](https://console.cloud.google.com/monitoring/alerting/policies/16526940745374967367?authuser=7&project=httparchive) policy

definitions/output/crawl/pages.js

Lines changed: 1 addition & 1 deletion
@@ -82,7 +82,7 @@ publish('pages', {
   tags: ['crawl_complete'],
   dependOnDependencyAssertions: true
 }).preOps(ctx => `
-SET @@RESERVATION='projects/httparchive/locations/US/reservations/enterprise';
+SET @@RESERVATION='${constants.reservation_id}';

 DELETE FROM ${ctx.self()}
 WHERE date = '${constants.currentMonth}' AND
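The `preOps` changes in this commit replace a hard-coded BigQuery reservation with `constants.reservation_id`, so the value is defined once in `includes/constants.js`. A minimal sketch of how such a constants module could expose the two values the diffs interpolate is below; the real file is programmatically generated and differs, so everything here beyond the two names `reservation_id` and `currentMonth` (and the reservation path, which the removed line shows) is an assumption.

```javascript
// Hypothetical sketch of includes/constants.js (the actual file is generated
// and differs). The diffs reference reservation_id and currentMonth.
const reservation_id = 'projects/httparchive/locations/US/reservations/enterprise'

// First day of the current month (UTC), matching the crawl date partitions.
const currentMonth = new Date().toISOString().slice(0, 7) + '-01'

module.exports = { reservation_id, currentMonth }
```

Centralizing the reservation string means a future reservation change touches one constant instead of every `preOps` template.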

definitions/output/crawl/parsed_css.js

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ publish('parsed_css', {
   },
   tags: ['crawl_complete']
 }).preOps(ctx => `
-SET @@RESERVATION='projects/httparchive/locations/US/reservations/enterprise';
+SET @@RESERVATION='${constants.reservation_id}';

 DELETE FROM ${ctx.self()}
 WHERE date = '${constants.currentMonth}'

definitions/output/crawl/requests.js

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ publish('requests', {
   },
   tags: ['crawl_complete']
 }).preOps(ctx => `
-SET @@RESERVATION='projects/httparchive/locations/US/reservations/enterprise';
+SET @@RESERVATION='${constants.reservation_id}';

 FOR client_var IN (SELECT * FROM UNNEST(['desktop', 'mobile']) AS value) DO
 FOR is_root_page_var IN (SELECT * FROM UNNEST([TRUE, FALSE]) AS value) DO
