Merged
6 changes: 3 additions & 3 deletions import-automation/README.md
Expand Up @@ -9,12 +9,12 @@ Import jobs own the task of fetching data from external data sources and making

The scheduler job triggers a GCP Workflow which then creates a GCP Batch job for each data import. An import job performs multiple tasks such as downloading data, processing it, and generating resolved mcf and copying it to GCS. It relies on the DataCommons [Import tool](https://github.com/datacommonsorg/import/blob/master/docs/usage.md) to perform mcf generation. Additionally, several validations are performed as part of the import job to ensure data quality. More details about the validation framework and supported validations can be found in the [README](https://github.com/datacommonsorg/data/tree/master/tools/import_validation).

-Status of various import jobs can be monitored in the ImportStatus spanner table via the [Looker Studio dashboard](https://lookerstudio.google.com/c/reporting/e88fda74-50c9-46c6-88aa-c84342ceba48/).
+Status of various import jobs can be monitored in the ImportStatus spanner table via the [Data Studio dashboard](https://datastudio.google.com/c/reporting/e88fda74-50c9-46c6-88aa-c84342ceba48).

## Ingestion Pipeline
DataCommons runs various import jobs on cloud batch that generate the output MCF data on GCS. The output from these jobs is consumed by the graph ingestion pipeline (Dataflow) to push data into the knowledge graph (Spanner). More details about the ingestion pipeline are available [here](https://github.com/datacommonsorg/import/tree/master/pipeline/ingestion).

-A GCP [cloud workflow](workflow/spanner-ingestion-workflow.yaml) is used to coordinate control between auto-refresh import jobs and the ingestion dataflow pipeline. To maintain data consistency, a global lock is used to ensure that only a single execution of the workflow is active at any time. The workflow relies on various [Spanner tables](workflow/spanner_schema.sql) for metadata management and [helper cloud functions](workflow/ingestion-helper/README.md) to control the execution.
+A GCP [cloud workflow](https://github.com/datacommonsorg/import/blob/master/pipeline/workflow/spanner-ingestion-workflow.yaml) is used to coordinate control between auto-refresh import jobs and the ingestion dataflow pipeline. To maintain data consistency, a global lock is used to ensure that only a single execution of the workflow is active at any time. The workflow relies on various [Spanner tables](https://github.com/datacommonsorg/import/blob/master/pipeline/workflow/ingestion-helper/schema.sql) for metadata management and [helper cloud functions](https://github.com/datacommonsorg/import/blob/master/pipeline/workflow/ingestion-helper/README.md) to control the execution.

-Infrastructure deployment for the various components in the import automation stack is automated using a [Terraform script](terraform/main.tf).
+Infrastructure deployment for the various components in the import automation stack is automated using a [Terraform script](https://github.com/datacommonsorg/import/blob/master/pipeline/terraform/main.tf).

379 changes: 0 additions & 379 deletions import-automation/terraform/main.tf

This file was deleted.
