# Databricks Lakeflow Jobs with StackQL-Deploy

A complete end-to-end demonstration of deploying and managing **Databricks Lakeflow jobs** using **StackQL-Deploy** for infrastructure provisioning and **Databricks Asset Bundles (DABs)** for data pipeline management.

[![Databricks DAB CI/CD](https://github.com/stackql/databricks-lakeflow-jobs-example/actions/workflows/databricks-dab.yml/badge.svg)](https://github.com/stackql/databricks-lakeflow-jobs-example/actions/workflows/databricks-dab.yml)

## 🎯 Project Overview

This repository demonstrates modern DataOps practices by combining:

- **🏗️ Infrastructure as Code**: Using [StackQL](https://stackql.io) and [stackql-deploy](https://stackql-deploy.io) for SQL-based infrastructure management
- **📊 Data Pipeline Management**: Using [Databricks Asset Bundles](https://docs.databricks.com/en/dev-tools/bundles/index.html) for job orchestration and deployment
- **🚀 GitOps CI/CD**: Automated infrastructure provisioning and data pipeline deployment via GitHub Actions

### What This Project Does

1. **Provisions Databricks Infrastructure** using StackQL-Deploy (an illustrative manifest fragment follows this list):
   - AWS IAM roles and cross-account permissions
   - S3 buckets for workspace storage
   - Databricks workspace with Unity Catalog
   - Storage credentials and external locations

2. **Deploys a Retail Data Pipeline** using Databricks Asset Bundles:
   - Multi-stage data processing (Bronze → Silver → Gold)
   - Parallel task execution with dependency management
   - State-based conditional processing
   - For-each loops for parallel state processing

3. **Automates Everything** with GitHub Actions:
   - Infrastructure provisioning on push to main
   - DAB validation and deployment
   - Multi-environment support (dev/prod)

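The infrastructure for step 1 is declared in `infrastructure/stackql_manifest.yml`. As a rough orientation, a stackql-deploy manifest can look like the following minimal sketch; the resource names, globals, and values here are illustrative assumptions rather than this repository's actual configuration:

```yaml
# Minimal, illustrative stackql_manifest.yml fragment (resource and
# variable names are hypothetical; see infrastructure/stackql_manifest.yml
# for the real configuration)
version: 1
name: databricks-lakeflow-infra
providers:
  - aws
  - databricks_account
globals:
  - name: region
    value: "{{ vars.AWS_REGION }}"
resources:
  - name: aws/workspace_bucket          # S3 bucket for workspace root storage
  - name: aws/cross_account_role        # IAM role that Databricks assumes
  - name: databricks_account/workspace  # the Databricks workspace itself
```
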
## 🏛️ Architecture

```mermaid
graph TB
    subgraph "GitHub Repository"
        A[infrastructure/] --> B[StackQL-Deploy]
        C[retail-job/] --> D[Databricks Asset Bundle]
    end

    subgraph "AWS Cloud"
        B --> E[IAM Roles]
        B --> F[S3 Buckets]
        B --> G[VPC/Security Groups]
    end

    subgraph "Databricks Platform"
        B --> H[Workspace]
        D --> I[Lakeflow Jobs]
        H --> I
        I --> J[Bronze Tables]
        I --> K[Silver Tables]
        I --> L[Gold Tables]
    end

    subgraph "CI/CD Pipeline"
        M[GitHub Actions] --> B
        M --> D
        M --> N[Multi-Environment Deployment]
    end
```

## 📁 Repository Structure

```
databricks-lakeflow-jobs-example/
├── infrastructure/                  # StackQL infrastructure templates
│   ├── README.md                    # Infrastructure setup guide
│   ├── stackql_manifest.yml         # StackQL deployment configuration
│   └── resources/                   # Cloud resource templates
│       ├── aws/                     # AWS resources (IAM, S3)
│       ├── databricks_account/      # Account-level Databricks resources
│       └── databricks_workspace/    # Workspace configurations
├── retail-job/                      # Databricks Asset Bundle
│   ├── databricks.yml               # DAB configuration
│   └── Task Files/                  # Data pipeline notebooks
│       ├── 01_data_ingestion/       # Bronze layer data ingestion
│       ├── 02_data_loading/         # Customer data loading
│       ├── 03_data_processing/      # Silver layer transformations
│       ├── 04_data_transformation/  # Gold layer clean data
│       └── 05_state_processing/     # State-specific processing
└── .github/workflows/               # CI/CD automation
    └── databricks-dab.yml           # GitHub Actions workflow
```

## 🚀 Quick Start

### Prerequisites

- AWS account with administrative permissions
- Databricks account (see [infrastructure setup guide](./infrastructure/README.md))
- Python 3.8+ and Git

### 1. Clone Repository

```bash
git clone https://github.com/stackql/databricks-lakeflow-jobs-example.git
cd databricks-lakeflow-jobs-example
```

### 2. Set Up Infrastructure

Follow the comprehensive [Infrastructure Setup Guide](./infrastructure/README.md) to:
- Configure AWS and Databricks accounts
- Set up service principals and permissions
- Deploy infrastructure using StackQL-Deploy

### 3. Deploy Data Pipeline

Once the infrastructure is provisioned:

```bash
cd retail-job

# Validate the bundle
databricks bundle validate --target dev

# Deploy the data pipeline
databricks bundle deploy --target dev

# Run the complete pipeline
databricks bundle run retail_data_processing_job --target dev
```

## 📊 Data Pipeline Deep Dive

The retail data pipeline demonstrates a complete **medallion architecture** (Bronze → Silver → Gold):

### Pipeline Stages

1. **🥉 Bronze Layer - Data Ingestion**
   - **Orders Ingestion**: Loads raw sales orders data
   - **Sales Ingestion**: Loads raw sales transaction data
   - Tables: `orders_bronze`, `sales_bronze`

2. **🥈 Silver Layer - Data Processing**
   - **Customer Loading**: Loads customer master data
   - **Data Joining**: Joins customers with sales and orders
   - **Duplicate Removal**: Conditional deduplication based on data quality
   - Tables: `customers_bronze`, `customer_sales_silver`, `customer_orders_silver`

3. **🥇 Gold Layer - Data Transformation**
   - **Clean & Transform**: Business-ready, curated datasets
   - **State Processing**: Parallel processing for each US state using for-each loops
   - Tables: `retail_gold`, `state_summary_gold` (a sketch of the resulting task graph follows this list)

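The stage dependencies above translate directly into a Lakeflow job task graph. A hedged sketch of how such a graph can be wired in `retail-job/databricks.yml` follows; the task keys and notebook paths are illustrative assumptions, not the bundle's actual definitions:

```yaml
# Hedged sketch of the medallion task graph (task keys and notebook
# paths are hypothetical; see retail-job/databricks.yml for the real job)
resources:
  jobs:
    retail_data_processing_job:
      name: retail_data_processing_job
      tasks:
        - task_key: ingest_orders            # Bronze: writes orders_bronze
          notebook_task:
            notebook_path: ./Task Files/01_data_ingestion/ingest_orders
        - task_key: ingest_sales             # Bronze: writes sales_bronze
          notebook_task:
            notebook_path: ./Task Files/01_data_ingestion/ingest_sales
        - task_key: join_customer_sales      # Silver: waits on both ingests
          depends_on:
            - task_key: ingest_orders
            - task_key: ingest_sales
          notebook_task:
            notebook_path: ./Task Files/03_data_processing/join_customer_sales
```
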
### Advanced DAB Features Demonstrated

- **🔄 Parallel Execution**: Multiple tasks run concurrently where dependencies allow
- **🎯 Conditional Tasks**: Deduplication only runs if duplicates are detected (sketched below)
- **🔁 For-Each Loops**: State processing runs in parallel for multiple states (sketched below)
- **📧 Notifications**: Email alerts on job success/failure
- **⏱️ Timeouts & Limits**: Job execution controls and concurrent run limits
- **🎛️ Parameters**: Dynamic state-based processing with base parameters

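Conditional tasks and for-each loops are native Lakeflow job constructs. The sketch below shows the general shape of both; the task keys, the `duplicate_count` task value, and the notebook path are hypothetical:

```yaml
# Hypothetical condition_task: run deduplication only when duplicates exist
- task_key: check_for_duplicates
  depends_on:
    - task_key: detect_duplicates
  condition_task:
    op: GREATER_THAN
    left: "{{tasks.detect_duplicates.values.duplicate_count}}"
    right: "0"

# Hypothetical for_each_task: process states in parallel with a concurrency cap
- task_key: process_states
  for_each_task:
    inputs: '["CA", "NY", "TX", "WA"]'
    concurrency: 4
    task:
      task_key: process_state_iteration
      notebook_task:
        notebook_path: ./Task Files/05_state_processing/process_state
        base_parameters:
          state: "{{input}}"
```
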
## 🔄 CI/CD Pipeline

The GitHub Actions workflow ([`.github/workflows/databricks-dab.yml`](./.github/workflows/databricks-dab.yml)) provides complete automation:

### Workflow Triggers

- **Pull Requests**: Validates changes against the dev environment
- **Main Branch Push**: Deploys to the production environment
- **Path-Based**: Only triggers on infrastructure or job configuration changes (see the trigger sketch below)

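In GitHub Actions terms, path-based triggering looks like the following; these filters are a plausible reading of the behavior described above rather than a verbatim copy of the workflow file:

```yaml
# Illustrative trigger block (check .github/workflows/databricks-dab.yml
# for the actual branch and path filters)
on:
  pull_request:
    branches: [main]
    paths:
      - 'infrastructure/**'
      - 'retail-job/**'
  push:
    branches: [main]
    paths:
      - 'infrastructure/**'
      - 'retail-job/**'
```
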
### Deployment Steps

1. **🏗️ Infrastructure Provisioning**
   ```yaml
   - name: Deploy Infrastructure with StackQL
     uses: stackql/stackql-deploy-action@v1.0.2
     with:
       command: 'build'
       stack_dir: 'infrastructure'
       stack_env: ${{ env.ENVIRONMENT }}
   ```

2. **📊 Workspace Configuration** (a hedged sketch of this step follows the list)
   - Extracts workspace details from the StackQL deployment
   - Configures the Databricks CLI with workspace credentials
   - Sets up environment-specific configurations

3. **✅ DAB Validation & Deployment**
   ```yaml
   - name: Validate Databricks Asset Bundle
     run: databricks bundle validate --target ${{ env.ENVIRONMENT }}

   - name: Deploy Databricks Jobs
     run: databricks bundle deploy --target ${{ env.ENVIRONMENT }}
   ```

4. **🧪 Pipeline Testing**
   - Runs the complete data pipeline
   - Validates job execution and data quality
   - Reports results and generates summaries

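Step 2 is the glue between the two tools: values produced by the StackQL deployment are handed to the Databricks CLI. A hedged sketch of such a step follows; the step id `stackql-deploy`, its `workspace_url` output, and the secret name are assumptions for illustration, not the workflow's actual identifiers:

```yaml
# Hypothetical wiring step: the step id, output name, and secret name
# are assumptions; the .databrickscfg host/token format is standard
- name: Configure Databricks CLI
  env:
    DATABRICKS_HOST: ${{ steps.stackql-deploy.outputs.workspace_url }}
    DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
  run: |
    cat > ~/.databrickscfg <<EOF
    [DEFAULT]
    host  = ${DATABRICKS_HOST}
    token = ${DATABRICKS_TOKEN}
    EOF
```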
|
### Environment Management

The workflow supports multiple environments with automatic detection:
- **Dev Environment**: For pull requests and feature development
- **Production Environment**: For main branch deployments

Environment-specific configurations are managed through:
- StackQL environment variables and stack environments
- Databricks Asset Bundle targets (`dev`, `prd`), sketched below
- GitHub repository secrets for credentials

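Targets are how one bundle maps onto multiple workspaces. An illustrative `targets` block for `retail-job/databricks.yml`; the workspace hosts are placeholders:

```yaml
# Illustrative targets block (workspace URLs are placeholders)
targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://dev-workspace.cloud.databricks.com
  prd:
    mode: production
    workspace:
      host: https://prd-workspace.cloud.databricks.com
```
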
## 🛠️ Key Technologies

### StackQL & stackql-deploy
- **SQL-based Infrastructure**: Manage cloud resources using familiar SQL syntax
- **State-free Operations**: No state files; infrastructure is queried directly from provider APIs (see the exports sketch below)
- **Multi-cloud Support**: Consistent interface across AWS, Azure, GCP, and SaaS providers
- **GitOps Ready**: Native CI/CD integration with GitHub Actions

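State-free in practice: values that would otherwise be persisted in a state file are re-queried from the cloud APIs at deploy time and passed between resources as exports. A hedged manifest fragment with hypothetical resource, export, and property names:

```yaml
# Hypothetical export wiring: no state file; the bucket ARN is resolved by
# querying the AWS API and consumed by the next resource at deploy time
resources:
  - name: aws/workspace_bucket
    exports:
      - workspace_bucket_arn
  - name: databricks_account/storage_configuration
    props:
      - name: bucket_arn
        value: "{{ workspace_bucket_arn }}"
```
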
### Databricks Asset Bundles
- **Environment Consistency**: Deploy the same code across dev/staging/prod
- **Version Control**: Infrastructure and code in sync with Git workflows
- **Advanced Orchestration**: Complex dependencies, conditions, and parallel execution
- **Resource Management**: Automated cluster provisioning and job scheduling

### Modern DataOps Practices
- **Infrastructure as Code**: Everything versioned and reproducible
- **GitOps Workflows**: Pull request-based infrastructure changes
- **Environment Parity**: Identical configurations across environments
- **Automated Testing**: Pipeline validation and data quality checks

## 📚 Learn More

- **[Infrastructure Setup Guide](./infrastructure/README.md)**: Complete StackQL-Deploy setup and usage
- **[StackQL Documentation](https://stackql.io/docs)**: Learn SQL-based infrastructure management
- **[Databricks Asset Bundles](https://docs.databricks.com/en/dev-tools/bundles/)**: DAB concepts and advanced patterns
- **[stackql-deploy GitHub Action](https://github.com/stackql/stackql-deploy-action)**: CI/CD integration guide

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.

## ⚠️ Important Notes

- **Cost Management**: This project provisions billable cloud resources. Always run teardown commands after testing.
- **Cleanup Required**: Cancel your Databricks subscription after completing the exercise to avoid ongoing charges.
- **Security**: Never commit credentials to version control. Use environment variables and CI/CD secrets.

---

*Demonstrating the future of DataOps with SQL-based infrastructure management and modern data pipeline orchestration.*