# Dictionary
## Data Job

Data processing unit that allows data engineers to implement automated pull ingestion (the E in ELT) or batch data transformation into a Data Warehouse (the T in ELT).
## Data Job Step

A single unit of work for a Data Job, usually a SQL or Python file. Steps are executed in alphanumeric order of their file names. Plugins can provide different types of steps or change the execution order.
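As an illustration of the alphanumeric ordering described above (the file names are made up for this example):

```python
# Illustrative only: step files are executed in alphanumeric order of
# their names, so a common convention is to prefix each with a number.
step_files = ["20_transform.sql", "10_ingest.py", "30_publish.py"]

execution_order = sorted(step_files)
print(execution_order)
# ['10_ingest.py', '20_transform.sql', '30_publish.py']
```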
## Data Job Source

All Python and SQL files, requirements.txt, and other files of the Data Job. These are versioned and tracked per job.
## Data Job Properties

Any saved state, configuration, and secrets of a Data Job. These are tracked per deployment and can be used across different job executions. In the future, they should also be versioned. They can be accessed using the IProperties Python interface within a data job, or via the `vdk properties` CLI command.
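A minimal sketch of how a step typically uses properties to keep state between executions. The `Properties` class below is an in-memory stand-in for illustration only (the real implementation is backed by the Control Service and persists across executions); the `get_property`/`set_all_properties` method names follow the IProperties interface, and the property name and values are hypothetical:

```python
class Properties:
    """In-memory stand-in for vdk's IProperties interface (illustrative)."""

    def __init__(self):
        self._props = {}

    def get_property(self, name, default=None):
        return self._props.get(name, default)

    def set_all_properties(self, properties):
        self._props = dict(properties)


def run(job_input):
    # Typical pattern: read state saved by a previous execution,
    # falling back to a default on the very first run.
    last_ts = job_input.get_property("last_ingested_ts", "1970-01-01T00:00:00")
    # ... ingest data newer than last_ts ...
    # Save the new watermark for the next execution.
    job_input.set_all_properties({"last_ingested_ts": "2024-01-01T00:00:00"})
    return last_ts
```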
## Data Job Deployment

A deployment takes both the build/code and the deployment-specific properties, then builds and packages them. Once a Data Job is deployed, it is ready for immediate execution in the execution environment, and it can be scheduled to run periodically.
## Data Job Attempt

A single run of a Data Job. "Data Job run" is the term usually used when a Data Job is executed locally. A job may have multiple attempts (or runs) in a job execution. See Data Job Execution.
## Data Job Execution

An instance of a running Data Job deployment is called an execution. A Data Job execution can run a Data Job one or more times: if a run (attempt) fails due to a platform error, the job can be automatically re-run (this is configurable by Control Service operators). This applies only to executions in the "Cloud" (Kubernetes); local executions always consist of a single attempt.
## Data Job Run

See Data Job Attempt.
## Data Job Arguments

A Data Job Execution or Data Job Run can accept any number of arguments when started. These are unique to each execution/run. They are typically used when an execution is triggered manually by a user, or when a user is developing locally.
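Arguments are passed as a JSON string when a run is started (for example, `vdk run` accepts an `--arguments` flag) and are available inside a step via `job_input.get_arguments()`. A sketch of the parsing involved, with a hypothetical argument payload:

```python
import json

# What the CLI would receive on the command line (hypothetical values):
raw_arguments = '{"target_date": "2024-01-01", "dry_run": true}'

# Inside the job, the arguments arrive as an already-parsed dict.
arguments = json.loads(raw_arguments)
print(arguments["target_date"])
# 2024-01-01
```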
## VDK

VDK is the Versatile Data Kit SDK. It provides common functionality for data ingestion and processing, and a CLI for managing the lifecycle of a Data Job (see Control Service).
## Control Service

The "backend" part. It provides an API for managing the lifecycle of Data Jobs.
## VDK Plugins

Versatile Data Kit provides a way to extend or change the behavior of any SDK command using VDK Plugins. One can plug into any stage of the job runtime. For a list of already developed plugins, see the plugins directory; for how to install and develop a plugin, see the plugin documentation.
## OpId

OpId identifies the trigger that initiated a job. If left empty, it is auto-generated. OpId is similar to a trace ID in OpenTracing: it enables tracing multiple operations across different services and datasets. For example, it is possible to have N jobs with the same OpId (if Job1 started Job2, then Job1.opId = Job2.opId). In HTTP requests it is passed as the 'X-OPID' header by the Control Service.
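A sketch of how OpId propagation works; the helper function is hypothetical, and only the 'X-OPID' header name and the auto-generation behavior come from the definition above:

```python
import uuid

def outgoing_headers(op_id=None):
    # Auto-generate an OpId when the trigger did not supply one.
    return {"X-OPID": op_id or uuid.uuid4().hex}

# Job1 triggers Job2 and forwards its OpId, so Job1.opId == Job2.opId
# and both operations can be traced as one.
job1_headers = outgoing_headers("op-123")
job2_headers = outgoing_headers(job1_headers["X-OPID"])
print(job2_headers["X-OPID"])
# op-123
```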
## ExecutionId

See the definition of Execution.

## AttemptId

See the definition of Attempt.