This documentation describes the details and parameters required to create a Google Dataproc PySpark Workflow Template within the Witboost platform.
A System must already exist so that the new component can be attached to it.
The Creation Wizard allows you to create a new Google Dataproc PySpark Workflow Template.
This section covers the basic information that any component must have.
- Name: Required. The name of the component.
- Description: Required. Help others understand what this component is for.
- Data Product: Required. The System this Workload belongs to. Be sure to choose the right one, as it cannot be changed after creation.
- Identifier: Autogenerated from the information above. A unique identifier for the component. It will not be editable after creation and is a string composed of [a-zA-Z] separated by any of [-_].
- Owner: Automatically selected from the System metadata. System owner.
- Reads From: A Workload could read from other components in other Systems or from external components. This information will be used for lineage purposes.
- Depends On: A component could depend on other components in the same System. This information will be used to deploy the components in such an order that their dependencies already exist.
- Tags: Tags for the component.
Example:
| Field name | Example value |
|---|---|
| Name | Cloud Vaccinations Workload |
| Description | Contains data on COVID-19 Vaccinations |
| Domain | domain:healthcare |
| Parent | system:healthcare.vaccinationsdp.0 |
| Identifier | Will look something like this: healthcare.vaccinationsdp.0.vaccinations-workload. Depends on the name you gave to the component and the System it belongs to. |
| Owner | Will look something like this: group:datameshplatform. Depends on the System owner. |
| Depends On | |
| Reads From | |
| Tags | |
This section covers specific information related to the Google Dataproc Workflow Template.
| Field name | Description |
|---|---|
| Project ID | The GCP project ID where the Dataproc Workflow Template will be deployed. |
| Region | The GCP region where the Dataproc Workflow Template will be deployed. |
| Time To Live (TTL) | The maximum lifetime of the workflow, in seconds. The job is terminated once the TTL expires. |
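As a purely illustrative sketch, the fields above roughly correspond to the parent resource and `dagTimeout` of a Dataproc `workflowTemplates.create` request. The field names come from the public Dataproc v1 REST API; the project, region, and TTL values are hypothetical placeholders, and the exact mapping performed by the platform is an assumption.

```python
# Hypothetical wizard values; replace with your own.
project_id = "my-gcp-project"
region = "europe-west1"
ttl_seconds = 3600  # Time To Live from the wizard

# Parent resource under which the Workflow Template is created.
parent = f"projects/{project_id}/regions/{region}"

# Minimal request body; jobs and cluster placement are covered in later steps.
template = {
    "id": "vaccinations-workload-template",  # hypothetical template id
    # dagTimeout is a Duration string in seconds, e.g. "3600s".
    "dagTimeout": f"{ttl_seconds}s",
    "jobs": [],
    "placement": {},
}
```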
The following fields describe the artifact to be deployed with the Workflow Template.
| Field name | Description |
|---|---|
| Name | The name of the artifact to reference in the Workflow Template. |
| Version | The artifact version to reference in the Workflow Template. |
| GCS Internal Storage Area | GCS Storage Area component, used to store the artifact of the Workflow Template. |
The following fields describe the job that the Workflow Template will run.
| Field name | Description |
|---|---|
| Type | The job type to pass to the Workflow Template. It is "PYSPARK" by default. |
| Main Python File | The main Python file for the job. It can be provided as a file name or as a relative path such as "path/to/main.py". |
| Additional Python Files | Additional Python files used to split up the business logic of the job. Each value can be provided as a file name or as a relative path such as "path/to/additional.py". |
| Jar Files | The Jar files to pass to the Python PySpark workload. This is initialized with a default jar. |
| Arguments | The arguments to pass in input to the PySpark job. |
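To make the job fields concrete, here is an illustrative sketch of the `pysparkJob` section of a workflow-template job, using field names from the public Dataproc v1 API. The bucket path and file names are hypothetical; in practice the platform resolves the real artifact locations from the GCS Internal Storage Area.

```python
# Hypothetical artifact location inside the GCS Storage Area component.
artifact_root = "gs://my-storage-area/artifacts/vaccinations-workload/1.0.0"

job = {
    "stepId": "pyspark-step",  # hypothetical step id
    "pysparkJob": {
        # Main Python File, resolved against the artifact storage area.
        "mainPythonFileUri": f"{artifact_root}/main.py",
        # Additional Python Files holding the rest of the business logic.
        "pythonFileUris": [f"{artifact_root}/path/to/additional.py"],
        # Jar Files; the template initializes this field with a default jar.
        "jarFileUris": [f"{artifact_root}/default-deps.jar"],
        # Arguments passed as input to the PySpark job.
        "args": ["--date", "2023-01-01"],
    },
}
```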
The following fields describe the ephemeral cluster on which the job will run.
| Field name | Description |
|---|---|
| Name | The name of the cluster that will be created when the Dataproc component is instantiated. |
| Service Account | The service account used by the cluster to handle permissions. |
| Base Image | The starting image for the cluster. |
The following fields configure the Master nodes of the cluster.
| Field name | Description |
|---|---|
| Count | The number of Master nodes to instantiate. |
| Machine Type | Determines the type and amount of resources available to Master nodes. |
| Primary Disk Size (GB) | The primary disk size for Master nodes, expressed in gigabytes. |
| Primary Disk Type | SSD or Standard Disk. |
| Number of local SSDs | The number of local SSDs to attach to each Master node. |
The following fields configure the Worker nodes of the cluster.
| Field name | Description |
|---|---|
| Count | The number of Worker nodes to instantiate. |
| Machine Type | Determines the type and amount of resources available to Worker nodes. |
| Primary Disk Size (GB) | The primary disk size for Worker nodes, expressed in gigabytes. |
| Primary Disk Type | SSD or Standard Disk. |
| Number of local SSDs | The number of local SSDs to attach to each Worker node. |
| Secondary Workers | The number of secondary workers to instantiate. |
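Combining the cluster, Master-node, and Worker-node fields above, a managed-cluster placement might look like the following sketch. Field names come from the public Dataproc v1 API; all values (cluster name, service account, image version, machine types, disk sizes) are hypothetical examples, not defaults of the template.

```python
# Illustrative "managedCluster" placement for a workflow template.
placement = {
    "managedCluster": {
        "clusterName": "vaccinations-cluster",  # hypothetical name
        "config": {
            "gceClusterConfig": {
                # Service Account used by the cluster to handle permissions.
                "serviceAccount": "dataproc-sa@my-gcp-project.iam.gserviceaccount.com",
            },
            "softwareConfig": {
                # Base Image for the cluster.
                "imageVersion": "2.1-debian11",
            },
            "masterConfig": {
                "numInstances": 1,                  # Count
                "machineTypeUri": "n2-standard-4",  # Machine Type
                "diskConfig": {
                    "bootDiskSizeGb": 500,          # Primary Disk Size (GB)
                    "bootDiskType": "pd-ssd",       # Primary Disk Type
                    "numLocalSsds": 0,              # Number of local SSDs
                },
            },
            "workerConfig": {
                "numInstances": 2,
                "machineTypeUri": "n2-standard-4",
                "diskConfig": {
                    "bootDiskSizeGb": 500,
                    "bootDiskType": "pd-standard",
                    "numLocalSsds": 1,
                },
            },
            "secondaryWorkerConfig": {
                "numInstances": 2,                  # Secondary Workers
            },
        },
    },
}
```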
After this step, the system will show you a summary of the information provided. You can go back and edit it if you notice any mistakes; otherwise, you can go ahead and create the component.
After clicking "Create", the component registration will start. If no errors occur, it will go through the three phases (Fetching, Publishing, and Registering) and show you links to the newly created repository inside GitLab and to the new System component in the Builder Catalog.
When deploying the System, deployment of this component will create the Dataproc Workflow Template inside the specified project and region.
Be careful not to delete the catalog-info.yml file, and ensure that the project structure remains as given.
The Edit Wizard allows you to edit most information about the component after you have created it. The sections are the same as the Creation Wizard, so you can refer to the documentation above, but some fields will be locked as they cannot be changed after creation.