
Data Job

a_git_a edited this page Dec 18, 2023 · 7 revisions

Overview

VDK is the Versatile Data Kit SDK.

It provides standard functionality for data ingestion and processing, and a CLI for managing the lifecycle of a Data Job.

A Data Job is a data processing unit that allows data engineers to implement automated pull ingestion (the E in ELT) or batch data transformation into a Data Warehouse (the T in ELT). At its core, a Data Job is a directory containing scripts and configuration files.

Data Job Steps

A Data Job consists of steps. A step is a single unit of work within a Data Job. Which scripts or files are treated as steps and executed by vdk is customizable.

By default, there are two types of steps:

  • SQL steps (SQL files)
  • Python steps (Python files implementing a run(job_input) function)

By default, steps are executed in alphanumeric order of their file names.
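The default ordering is a plain lexicographic (string) sort of the file names. A short sketch makes the consequence clear and explains why numeric prefixes are conventionally zero-padded (the file names below are illustrative):

```python
# Illustrative sketch of vdk's default step ordering: a plain
# lexicographic (string) sort of the step file names.
step_files = [
    "30_ingest_to_table.py",
    "10_drop_table.sql",
    "20_create_table.sql",
]

execution_order = sorted(step_files)
# -> ["10_drop_table.sql", "20_create_table.sql", "30_ingest_to_table.py"]

# Because the sort compares strings, not numbers, "9_step.sql" would run
# AFTER "10_step.sql" -- hence the convention of zero-padded prefixes.
```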

See example:
(Diagram: data job step sequence)

The steps will be executed in the order of their file names: 10_drop_table.sql, 20_create_table.sql, and 30_ingest_to_table.py.
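As a sketch, a Python step such as 30_ingest_to_table.py defines a run(job_input) function at module level, which vdk calls when the step executes. The payload and destination table below are made-up examples, not part of this page:

```python
# 30_ingest_to_table.py -- a minimal Python step (illustrative sketch).
# vdk discovers this file, imports it, and calls run() with a job_input
# object; the payload and table name below are hypothetical examples.

def run(job_input):
    payload = {"id": 1, "name": "example"}  # hypothetical row to ingest
    # send_object_for_ingestion queues the payload for ingestion into
    # the given destination table.
    job_input.send_object_for_ingestion(
        payload=payload,
        destination_table="example_table",  # hypothetical table name
    )
```

SQL steps, by contrast, need no function: the SQL in the file is executed directly.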

Create Your First Data Job

To create your first Data Job, you need to:

  1. Install Quickstart VDK
  2. Execute the vdk create command
  3. Follow the Create First Data Job page

Data Job Execution

An instance of a running Data Job deployment is called an execution.

To execute your Data Job, you need to:

  1. Execute the vdk run command
  2. Follow the output of the run

Local executions always comprise a single attempt.

➡️ Next section: Ingestion
