Google Dataproc PySpark Workflow Template

This documentation describes the details and parameters required to create a Google Dataproc PySpark Workflow Template within the Witboost platform.

Prerequisites

A System must already exist so that the new component can be attached to it.

Creation Wizard

The Creation Wizard allows you to create a new Google Dataproc PySpark Workflow Template.

Component metadata

This section covers the basic information that any component must have.

  • Name: Required. The name of the component.
  • Description: Required. Help others understand what this component is for.
  • Data Product: Required. The System this Workload belongs to. Be sure to choose the right one as it cannot be changed.
  • Identifier: Autogenerated from the information above. A unique identifier for the component. It will not be editable after creation and is a string composed of [a-zA-Z] separated by any of [-_].
  • Owner: Automatically selected from the System metadata. System owner.
  • Reads From: A Workload can read from components in other Systems or from external components. This information is used for lineage purposes.
  • Depends On: A component can depend on other components in the same System. This information is used to deploy the components in an order that ensures their dependencies already exist.
  • Tags: Tags for the component.

Example:

| Field name | Example value |
| ---------- | ------------- |
| Name | Cloud Vaccinations Workload |
| Description | Contains data on COVID-19 Vaccinations |
| Domain | domain:healthcare |
| Parent | system:healthcare.vaccinationsdp.0 |
| Identifier | Will look something like this: healthcare.vaccinationsdp.0.vaccinations-workload. Depends on the name you gave to the component and the System it belongs to. |
| Owner | Will look something like this: group:datameshplatform. Depends on the System owner. |
| Depends On | |
| Reads From | |
| Tags | |

Google Dataproc Workflow Template Deployment Information

This section covers specific information related to the Google Dataproc Workflow Template.

| Field name | Description |
| ---------- | ----------- |
| Project ID | The GCP project ID where the Dataproc Workflow Template will be deployed. |
| Region | The GCP region where the Dataproc Workflow Template will be deployed. |
| Time To Live (TTL) | The number of seconds before the job is killed. |
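
The values in this section map onto the Workflow Template resource created in GCP. As a rough illustration only (not the exact template generated by the platform), here is a minimal sketch using the google-cloud-dataproc Python client; all values shown are placeholders:

```python
# Minimal sketch of how Project ID, Region and TTL could map onto a
# Dataproc Workflow Template. All values are placeholders; the template
# actually generated by the platform may differ.
from google.cloud import dataproc_v1

project_id = "my-gcp-project"   # Project ID
region = "europe-west1"         # Region
ttl_seconds = 3600              # Time To Live (TTL)

client = dataproc_v1.WorkflowTemplateServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

template = {
    "id": "vaccinations-workload",
    # TTL expressed as a Duration: the workflow is killed once it expires
    "dag_timeout": {"seconds": ttl_seconds},
    # "jobs" and "placement" (see the following sections) are required before creation
}

# client.create_workflow_template(
#     parent=f"projects/{project_id}/regions/{region}", template=template
# )
```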

Artifact details

| Field name | Description |
| ---------- | ----------- |
| Name | The name of the artifact to reference in the Workflow Template. |
| Version | The artifact version to reference in the Workflow Template. |
| GCS Internal Storage Area | GCS Storage Area component, used to store the artifact of the Workflow Template. |
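
How the artifact ends up in the internal storage area is handled entirely by the platform. Purely to illustrate how these fields relate, a hypothetical sketch that composes a gs:// prefix from the artifact name and version; the bucket name and path layout are assumptions, not the actual layout used:

```python
# Hypothetical naming only: the real layout is decided by the GCS Internal
# Storage Area component and the deployment process, not by this snippet.
artifact_name = "vaccinations-etl"            # Name
artifact_version = "1.0.0"                    # Version
internal_bucket = "my-internal-storage-area"  # assumed bucket of the GCS Storage Area component

artifact_prefix = f"gs://{internal_bucket}/{artifact_name}/{artifact_version}"
```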

Job details

| Field name | Description |
| ---------- | ----------- |
| Type | The job type to pass to the Workflow Template. It is "PYSPARK" by default. |
| Main Python File | The main Python file for the job. It can be provided as a file name or as a relative path such as "path/to/main.py". |
| Additional Python Files | Additional Python files used to split up the business logic of the job. Each value can be provided as a file name or as a relative path such as "path/to/additional.py". |
| Jar Files | The JAR files to pass to the PySpark workload. This is initialized with a default jar. |
| Arguments | The arguments to pass as input to the PySpark job. |
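
These fields correspond to the PySpark job entry inside the Workflow Template. As a sketch only, with placeholder values (the gs:// prefix and the jar shown are assumptions, not the defaults produced by the template):

```python
# Sketch of a job entry for the Workflow Template's "jobs" list.
# The gs:// prefix and the jar are illustrative assumptions.
artifact_prefix = "gs://my-internal-storage-area/vaccinations-etl/1.0.0"

job = {
    "step_id": "vaccinations-step",
    "pyspark_job": {                                                       # Type: PYSPARK
        "main_python_file_uri": f"{artifact_prefix}/path/to/main.py",      # Main Python File
        "python_file_uris": [f"{artifact_prefix}/path/to/additional.py"],  # Additional Python Files
        "jar_file_uris": [
            "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"       # Jar Files (example only)
        ],
        "args": ["--ingestion-date", "2021-01-01"],                        # Arguments
    },
}

# The job would be attached to the template as: template["jobs"] = [job]
```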

Cluster details

| Field name | Description |
| ---------- | ----------- |
| Name | The name of the cluster that will be generated as soon as the Dataproc component is instantiated. |
| Service Account | The service account used by the cluster to handle permissions. |
| Base Image | The starting image for the cluster. |
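
In Dataproc API terms, these fields end up under the template's managed cluster placement. A sketch with placeholder values (field names follow the Dataproc ClusterConfig API; the values are assumptions):

```python
# Sketch of the managed cluster placement; all values are placeholders.
placement = {
    "managed_cluster": {
        "cluster_name": "vaccinations-cluster",  # Name
        "config": {
            "gce_cluster_config": {
                # Service Account used by the cluster nodes
                "service_account": "dataproc-sa@my-gcp-project.iam.gserviceaccount.com",
            },
            "software_config": {
                "image_version": "2.1-debian11",  # Base Image
            },
            # master_config / worker_config: see the node sections below
        },
    },
}

# Attached to the template as: template["placement"] = placement
```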

Master Nodes details

| Field name | Description |
| ---------- | ----------- |
| Count | The number of master nodes to instantiate. |
| Machine Type | The machine type, which determines the amount of resources available to master nodes. |
| Primary Disk Size (GB) | The primary disk size for master nodes, expressed in gigabytes. |
| Primary Disk Type | SSD or Standard Disk. |
| Number of local SSD | The number of local SSDs. |

Worker Nodes details

| Field name | Description |
| ---------- | ----------- |
| Count | The number of worker nodes to instantiate. |
| Machine Type | The machine type, which determines the amount of resources available to worker nodes. |
| Primary Disk Size (GB) | The primary disk size for worker nodes, expressed in gigabytes. |
| Primary Disk Type | SSD or Standard Disk. |
| Number of local SSD | The number of local SSDs. |
| Secondary Workers | The number of secondary workers. |
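
Master, worker, and secondary worker settings correspond to InstanceGroupConfig entries inside the managed cluster config. A sketch with placeholder counts, machine types, and disk sizes:

```python
# Sketch of the node configuration; counts, machine types and disk sizes
# are placeholders. Disk types map to "pd-ssd" or "pd-standard".
master_config = {
    "num_instances": 1,                   # Count
    "machine_type_uri": "n2-standard-4",  # Machine Type
    "disk_config": {
        "boot_disk_size_gb": 500,         # Primary Disk Size (GB)
        "boot_disk_type": "pd-ssd",       # Primary Disk Type
        "num_local_ssds": 0,              # Number of local SSD
    },
}

worker_config = {
    "num_instances": 2,
    "machine_type_uri": "n2-standard-4",
    "disk_config": {
        "boot_disk_size_gb": 500,
        "boot_disk_type": "pd-standard",
        "num_local_ssds": 0,
    },
}

secondary_worker_config = {"num_instances": 0}  # Secondary Workers

# These go under the managed cluster config, e.g.:
# placement["managed_cluster"]["config"]["master_config"] = master_config
```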

Creation

After this step, the system will show you a summary of the information provided. You can go back and edit it if you notice any mistakes; otherwise, you can go ahead and create the component.

After clicking on "Create", the component registration will start. If no errors occur, it will go through the three phases (Fetching, Publishing and Registering) and show you the links to the newly created repository inside GitLab and to the new component in the Builder Catalog.

When deploying the System, deployment of this component will create the Dataproc Workflow Template inside the specified project.

Be careful not to delete the catalog-info.yml and ensure that the project structure remains as given.

Edit Wizard

The Edit Wizard allows you to edit most information about the component after you have created it. The sections are the same as the Creation Wizard, so you can refer to the documentation above, but some fields will be locked as they cannot be changed after creation.