This documentation describes the details and parameters required to create a Google Dataproc PySpark Workflow Template within the Witboost platform.
A System must already exist so that the new component can be attached to it.
The Creation Wizard allows you to create a new Google Dataproc PySpark Workflow Template.
This section covers the basic information that any component must have.
- Name: Required. The name of the component.
- Description: Required. Help others understand what this component is for.
- Data Product: Required. The System this Workload belongs to. Be sure to choose the right one, as it cannot be changed after creation.
- Identifier: Autogenerated from the information above. A unique identifier for the component. It will not be editable after creation and is a string composed of [a-zA-Z] separated by any of [-_].
- Owner: Automatically selected from the System metadata. System owner.
- Reads From: A Workload could read from other components in other Systems or from external components. This information will be used for lineage purposes.
- Depends On: A component could depend on other components in the same System. This information will be used to deploy the components in such an order that their dependencies already exist.
- Tags: Tags for the component.
Example:
| Field name | Example value |
|---|---|
| Name | Cloud Vaccinations Workload |
| Description | Contains data on COVID-19 Vaccinations |
| Domain | domain:healthcare |
| Parent | system:healthcare.vaccinationsdp.0 |
| Identifier | Will look something like this: healthcare.vaccinationsdp.0.vaccinations-workload. Depends on the name you gave to the component and the System it belongs to. |
| Owner | Will look something like this: group:datameshplatform. Depends on the System owner. |
| Depends On | |
| Reads From | |
| Tags | |
This section covers specific information related to the Google Dataproc Workflow Template.
| Field name | Description |
|---|---|
| Project ID | The GCP project ID where the Dataproc Workflow Template will be deployed. |
| Region | The GCP region where the Dataproc Workflow Template will be deployed. |
| Time To Live (TTL) | The maximum lifetime of the workflow, in seconds. The job is terminated once the TTL expires. |
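As a purely illustrative sketch, the fields above roughly correspond to the parent resource and `dagTimeout` of a Dataproc `workflowTemplates.create` request. The field names come from the public Dataproc v1 REST API; the project, region, and TTL values are hypothetical placeholders, and the exact mapping performed by the platform is an assumption.

```python
# Hypothetical wizard values; replace with your own.
project_id = "my-gcp-project"
region = "europe-west1"
ttl_seconds = 3600  # Time To Live from the wizard

# Parent resource under which the Workflow Template is created.
parent = f"projects/{project_id}/regions/{region}"

# Minimal request body; jobs and cluster placement are covered in later steps.
template = {
    "id": "vaccinations-workload-template",  # hypothetical template id
    # dagTimeout is a Duration string in seconds, e.g. "3600s".
    "dagTimeout": f"{ttl_seconds}s",
    "jobs": [],
    "placement": {},
}
```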
The following fields describe the artifact to be deployed with the Workflow Template.
| Field name | Description |
|---|---|
| Name | The name of the artifact to reference in the Workflow Template. |
| Version | The artifact version to reference in the Workflow Template. |
| GCS Internal Storage Area | GCS Storage Area component, used to store the artifact of the Workflow Template. |
The following fields describe the job that the Workflow Template will run.
| Field name | Description |
|---|---|
| Type | The job type to pass to the Workflow Template. It is "PYSPARK" by default. |
| Main Python File | The main Python file for the job. It can be provided as a file name or as a relative path such as "path/to/main.py". |
| Additional Python Files | Additional Python files used to split up the business logic of the job. Each value can be provided as a file name or as a relative path such as "path/to/additional.py". |
| Jar Files | The Jar files to pass to the Python PySpark workload. This is initialized with a default jar. |
| Arguments | The arguments to pass in input to the PySpark job. |
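To make the job fields concrete, here is an illustrative sketch of the `pysparkJob` section of a workflow-template job, using field names from the public Dataproc v1 API. The bucket path and file names are hypothetical; in practice the platform resolves the real artifact locations from the GCS Internal Storage Area.

```python
# Hypothetical artifact location inside the GCS Storage Area component.
artifact_root = "gs://my-storage-area/artifacts/vaccinations-workload/1.0.0"

job = {
    "stepId": "pyspark-step",  # hypothetical step id
    "pysparkJob": {
        # Main Python File, resolved against the artifact storage area.
        "mainPythonFileUri": f"{artifact_root}/main.py",
        # Additional Python Files holding the rest of the business logic.
        "pythonFileUris": [f"{artifact_root}/path/to/additional.py"],
        # Jar Files; the template initializes this field with a default jar.
        "jarFileUris": [f"{artifact_root}/default-deps.jar"],
        # Arguments passed as input to the PySpark job.
        "args": ["--date", "2023-01-01"],
    },
}
```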
The following fields describe the ephemeral cluster on which the job will run.
| Field name | Description |
|---|---|
| Name | The name of the cluster that will be created when the Dataproc component is instantiated. |
| Service Account | The service account used by the cluster to handle permissions. |
| Base Image | The starting image for the cluster. |
The following fields configure the Master nodes of the cluster.
| Field name | Description |
|---|---|
| Count | The number of Master nodes to instantiate. |
| Machine Type | Determines the type and amount of resources available to Master nodes. |
| Primary Disk Size (GB) | The primary disk size for Master nodes, expressed in gigabytes. |
| Primary Disk Type | SSD or Standard Disk. |
| Number of local SSDs | The number of local SSDs to attach to each Master node. |
The following fields configure the Worker nodes of the cluster.
| Field name | Description |
|---|---|
| Count | The number of Worker nodes to instantiate. |
| Machine Type | Determines the type and amount of resources available to Worker nodes. |
| Primary Disk Size (GB) | The primary disk size for Worker nodes, expressed in gigabytes. |
| Primary Disk Type | SSD or Standard Disk. |
| Number of local SSDs | The number of local SSDs to attach to each Worker node. |
| Secondary Workers | The number of secondary workers to instantiate. |
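Combining the cluster, Master-node, and Worker-node fields above, a managed-cluster placement might look like the following sketch. Field names come from the public Dataproc v1 API; all values (cluster name, service account, image version, machine types, disk sizes) are hypothetical examples, not defaults of the template.

```python
# Illustrative "managedCluster" placement for a workflow template.
placement = {
    "managedCluster": {
        "clusterName": "vaccinations-cluster",  # hypothetical name
        "config": {
            "gceClusterConfig": {
                # Service Account used by the cluster to handle permissions.
                "serviceAccount": "dataproc-sa@my-gcp-project.iam.gserviceaccount.com",
            },
            "softwareConfig": {
                # Base Image for the cluster.
                "imageVersion": "2.1-debian11",
            },
            "masterConfig": {
                "numInstances": 1,                  # Count
                "machineTypeUri": "n2-standard-4",  # Machine Type
                "diskConfig": {
                    "bootDiskSizeGb": 500,          # Primary Disk Size (GB)
                    "bootDiskType": "pd-ssd",       # Primary Disk Type
                    "numLocalSsds": 0,              # Number of local SSDs
                },
            },
            "workerConfig": {
                "numInstances": 2,
                "machineTypeUri": "n2-standard-4",
                "diskConfig": {
                    "bootDiskSizeGb": 500,
                    "bootDiskType": "pd-standard",
                    "numLocalSsds": 1,
                },
            },
            "secondaryWorkerConfig": {
                "numInstances": 2,                  # Secondary Workers
            },
        },
    },
}
```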
After this step, the system will show you a summary of the information provided. You can go back and edit it if you notice any mistakes; otherwise, you can go ahead and create the component.
After clicking "Create", the component registration will start. If no errors occur, it will go through the three phases (Fetching, Publishing, and Registering) and show you links to the newly created repository inside GitLab and to the new System component in the Builder Catalog.
When deploying the System, deployment of this component will create the Dataproc Workflow Template inside the specified project and region.
Be careful not to delete the catalog-info.yml file, and ensure that the project structure remains as given.
The Edit Wizard allows you to edit most information about the component after you have created it. The sections are the same as the Creation Wizard, so you can refer to the documentation above, but some fields will be locked as they cannot be changed after creation.