# Databricks Lakeflow Jobs with StackQL-Deploy

A complete end-to-end demonstration of deploying and managing **Databricks Lakeflow jobs** using **StackQL-Deploy** for infrastructure provisioning and **Databricks Asset Bundles (DABs)** for data pipeline management.

[![Databricks Asset Bundle CI/CD](https://github.com/stackql/databricks-lakeflow-jobs-example/actions/workflows/databricks-dab.yml/badge.svg)](https://github.com/stackql/databricks-lakeflow-jobs-example/actions/workflows/databricks-dab.yml)

## 🎯 Project Overview

This repository demonstrates modern DataOps practices by combining:

- **🏗️ Infrastructure as Code**: Using [StackQL](https://stackql.io) and [stackql-deploy](https://stackql-deploy.io) for SQL-based infrastructure management
- **📊 Data Pipeline Management**: Using [Databricks Asset Bundles](https://docs.databricks.com/en/dev-tools/bundles/index.html) for job orchestration and deployment
- **🚀 GitOps CI/CD**: Automated infrastructure provisioning and data pipeline deployment via GitHub Actions

### What This Project Does

1. **Provisions Databricks Infrastructure** using StackQL-Deploy:
   - AWS IAM roles and cross-account permissions
   - S3 buckets for workspace storage
   - Databricks workspace with Unity Catalog
   - Storage credentials and external locations

2. **Deploys a Retail Data Pipeline** using Databricks Asset Bundles:
   - Multi-stage data processing (Bronze → Silver → Gold)
   - Parallel task execution with dependency management
   - State-based conditional processing
   - For-each loops for parallel state processing

3. **Automates Everything** with GitHub Actions:
   - Infrastructure provisioning on push to main
   - DAB validation and deployment
   - Multi-environment support (dev/prod)

## 🏛️ Architecture

```mermaid
graph TB
    subgraph "GitHub Repository"
        A[infrastructure/] --> B[StackQL-Deploy]
        C[retail-job/] --> D[Databricks Asset Bundle]
    end

    subgraph "AWS Cloud"
        B --> E[IAM Roles]
        B --> F[S3 Buckets]
        B --> G[VPC/Security Groups]
    end

    subgraph "Databricks Platform"
        B --> H[Workspace]
        D --> I[Lakeflow Jobs]
        H --> I
        I --> J[Bronze Tables]
        I --> K[Silver Tables]
        I --> L[Gold Tables]
    end

    subgraph "CI/CD Pipeline"
        M[GitHub Actions] --> B
        M --> D
        M --> N[Multi-Environment Deployment]
    end
```

## 📁 Repository Structure

```
databricks-lakeflow-jobs-example/
├── infrastructure/                 # StackQL infrastructure templates
│   ├── README.md                   # Infrastructure setup guide
│   ├── stackql_manifest.yml        # StackQL deployment configuration
│   └── resources/                  # Cloud resource templates
│       ├── aws/                    # AWS resources (IAM, S3)
│       ├── databricks_account/     # Account-level Databricks resources
│       └── databricks_workspace/   # Workspace configurations
├── retail-job/                     # Databricks Asset Bundle
│   ├── databricks.yml              # DAB configuration
│   └── Task Files/                 # Data pipeline notebooks
│       ├── 01_data_ingestion/      # Bronze layer data ingestion
│       ├── 02_data_loading/        # Customer data loading
│       ├── 03_data_processing/     # Silver layer transformations
│       ├── 04_data_transformation/ # Gold layer clean data
│       └── 05_state_processing/    # State-specific processing
└── .github/workflows/              # CI/CD automation
    └── databricks-dab.yml          # GitHub Actions workflow
```

## 🚀 Quick Start

### Prerequisites

- AWS account with administrative permissions
- Databricks account (see the [infrastructure setup guide](./infrastructure/README.md))
- Python 3.8+ and Git

### 1. Clone Repository

```bash
git clone https://github.com/stackql/databricks-lakeflow-jobs-example.git
cd databricks-lakeflow-jobs-example
```

### 2. Set Up Infrastructure

Follow the comprehensive [Infrastructure Setup Guide](./infrastructure/README.md) to:

- Configure AWS and Databricks accounts
- Set up service principals and permissions
- Deploy infrastructure using StackQL-Deploy

### 3. Deploy Data Pipeline

Once the infrastructure is provisioned:

```bash
cd retail-job

# Validate the bundle
databricks bundle validate --target dev

# Deploy the data pipeline
databricks bundle deploy --target dev

# Run the complete pipeline
databricks bundle run retail_data_processing_job --target dev
```

## 📊 Data Pipeline Deep Dive

The retail data pipeline demonstrates a complete **medallion architecture** (Bronze → Silver → Gold):

### Pipeline Stages

1. **🥉 Bronze Layer - Data Ingestion**
   - **Orders Ingestion**: Loads raw sales orders data
   - **Sales Ingestion**: Loads raw sales transaction data
   - Tables: `orders_bronze`, `sales_bronze`

2. **🥈 Silver Layer - Data Processing**
   - **Customer Loading**: Loads customer master data
   - **Data Joining**: Joins customers with sales and orders
   - **Duplicate Removal**: Conditional deduplication based on data quality
   - Tables: `customers_bronze`, `customer_sales_silver`, `customer_orders_silver`

3. **🥇 Gold Layer - Data Transformation**
   - **Clean & Transform**: Business-ready, curated datasets
   - **State Processing**: Parallel processing for each US state using for-each loops
   - Tables: `retail_gold`, `state_summary_gold`

### Advanced DAB Features Demonstrated

- **🔄 Parallel Execution**: Multiple tasks run concurrently where dependencies allow
- **🎯 Conditional Tasks**: Deduplication runs only if duplicates are detected (see the sketch below)
- **🔁 For-Each Loops**: State processing runs in parallel across multiple states (see the sketch below)
- **📧 Notifications**: Email alerts on job success or failure
- **⏱️ Timeouts & Limits**: Job execution controls and concurrent run limits
- **🎛️ Parameters**: Dynamic state-based processing with base parameters

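As a rough sketch of how these patterns are expressed in a bundle's job definition (task keys, notebook paths, addresses, and the state list below are illustrative placeholders, not this repository's exact configuration; see [`retail-job/databricks.yml`](./retail-job/databricks.yml) for the real definitions):

```yaml
resources:
  jobs:
    retail_data_processing_job:
      name: retail-data-processing
      max_concurrent_runs: 1        # limit overlapping runs
      timeout_seconds: 3600         # abort runs that exceed one hour
      email_notifications:
        on_failure:
          - data-team@example.com   # illustrative address
      tasks:
        # Conditional task: resolves to "true" or "false" at run time
        - task_key: check_duplicates
          depends_on:
            - task_key: detect_duplicates   # upstream detection task omitted for brevity
          condition_task:
            op: GREATER_THAN
            left: "{{tasks.detect_duplicates.values.duplicate_count}}"
            right: "0"
        # Runs only when the condition above resolves to true
        - task_key: deduplicate
          depends_on:
            - task_key: check_duplicates
              outcome: "true"
          notebook_task:
            notebook_path: ./03_data_processing/deduplicate.py
        # For-each task: one parallel iteration per state
        - task_key: process_states
          for_each_task:
            inputs: '["CA", "NY", "TX", "WA"]'
            concurrency: 4
            task:
              task_key: process_state_iteration
              notebook_task:
                notebook_path: ./05_state_processing/process_state.py
                base_parameters:
                  state: "{{input}}"
```
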
## 🔄 CI/CD Pipeline

The GitHub Actions workflow ([`.github/workflows/databricks-dab.yml`](./.github/workflows/databricks-dab.yml)) provides complete automation.

### Workflow Triggers

- **Pull Requests**: Validate changes against the dev environment
- **Main Branch Push**: Deploy to the production environment
- **Path-Based Filtering**: Trigger only on infrastructure or job configuration changes (see the sketch below)

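A minimal sketch of path-filtered triggers (the exact paths in [`databricks-dab.yml`](./.github/workflows/databricks-dab.yml) may differ):

```yaml
on:
  pull_request:
    paths:
      - 'infrastructure/**'   # StackQL templates
      - 'retail-job/**'       # DAB configuration and notebooks
  push:
    branches:
      - main
    paths:
      - 'infrastructure/**'
      - 'retail-job/**'
```
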
### Deployment Steps

1. **🏗️ Infrastructure Provisioning**

   ```yaml
   - name: Deploy Infrastructure with StackQL
     uses: stackql/stackql-deploy-action@v1.0.2
     with:
       command: 'build'
       stack_dir: 'infrastructure'
       stack_env: ${{ env.ENVIRONMENT }}
   ```

2. **📊 Workspace Configuration**
   - Extracts workspace details from the StackQL deployment
   - Configures the Databricks CLI with workspace credentials
   - Sets up environment-specific configurations (see the sketch below)

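   A sketch of this step, assuming an earlier step has exported the workspace URL as `WORKSPACE_URL` and the token is stored as a repository secret (both names are placeholders; the Databricks CLI does read `DATABRICKS_HOST` and `DATABRICKS_TOKEN` from the environment):

   ```yaml
   - name: Configure Databricks CLI
     run: |
       # The Databricks CLI picks these variables up from the environment
       echo "DATABRICKS_HOST=${{ env.WORKSPACE_URL }}" >> "$GITHUB_ENV"
       echo "DATABRICKS_TOKEN=${{ secrets.DATABRICKS_TOKEN }}" >> "$GITHUB_ENV"
   ```
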
3. **✅ DAB Validation & Deployment**

   ```yaml
   - name: Validate Databricks Asset Bundle
     run: databricks bundle validate --target ${{ env.ENVIRONMENT }}

   - name: Deploy Databricks Jobs
     run: databricks bundle deploy --target ${{ env.ENVIRONMENT }}
   ```

4. **🧪 Pipeline Testing**
   - Runs the complete data pipeline (see the sketch below)
   - Validates job execution and data quality
   - Reports results and generates summaries

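   A sketch of the run step, reusing the job key from the Quick Start above:

   ```yaml
   - name: Run retail data pipeline
     working-directory: retail-job
     run: databricks bundle run retail_data_processing_job --target ${{ env.ENVIRONMENT }}
   ```
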
### Environment Management

The workflow supports multiple environments with automatic detection:

- **Dev Environment**: For pull requests and feature development
- **Production Environment**: For main branch deployments

Environment-specific configuration is managed through:

- StackQL environment variables and stack environments
- Databricks Asset Bundle targets (`dev`, `prd`) - see the sketch below
- GitHub repository secrets for credentials

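The bundle targets map to a `targets` block in `databricks.yml`; a minimal sketch, with a placeholder workspace URL:

```yaml
# retail-job/databricks.yml (abridged, illustrative sketch)
bundle:
  name: retail-job

targets:
  dev:
    mode: development   # deployed resources are prefixed per user for isolation
    default: true
  prd:
    mode: production
    workspace:
      host: https://dbc-xxxxxxxx-xxxx.cloud.databricks.com   # placeholder URL
```
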
## 🛠️ Key Technologies

### StackQL & stackql-deploy

- **SQL-based Infrastructure**: Manage cloud resources using familiar SQL syntax
- **State-free Operations**: No state files; infrastructure is queried directly from provider APIs
- **Multi-cloud Support**: Consistent interface across AWS, Azure, GCP, and SaaS providers
- **GitOps Ready**: Native CI/CD integration with GitHub Actions (see the manifest sketch below)

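For orientation, a heavily abridged sketch of the shape a `stackql_manifest.yml` can take (indicative only; see [`infrastructure/stackql_manifest.yml`](./infrastructure/stackql_manifest.yml) and the stackql-deploy docs for the authoritative schema):

```yaml
# indicative shape only, not this repository's actual manifest
name: databricks-lakeflow-stack
providers:
  - aws
  - databricks
globals:
  - name: region
    value: "{{ vars.AWS_REGION }}"
resources:
  - name: workspace_bucket            # e.g. an S3 bucket for workspace storage
    props:
      - name: bucket_name
        value: "{{ stack_name }}-{{ stack_env }}-bucket"
```
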
### Databricks Asset Bundles

- **Environment Consistency**: Deploy the same code across dev/staging/prod
- **Version Control**: Infrastructure and code kept in sync with Git workflows
- **Advanced Orchestration**: Complex dependencies, conditions, and parallel execution
- **Resource Management**: Automated cluster provisioning and job scheduling

### Modern DataOps Practices

- **Infrastructure as Code**: Everything versioned and reproducible
- **GitOps Workflows**: Pull request-based infrastructure changes
- **Environment Parity**: Identical configurations across environments
- **Automated Testing**: Pipeline validation and data quality checks

## 📚 Learn More

- **[Infrastructure Setup Guide](./infrastructure/README.md)**: Complete StackQL-Deploy setup and usage
- **[StackQL Documentation](https://stackql.io/docs)**: Learn SQL-based infrastructure management
- **[Databricks Asset Bundles](https://docs.databricks.com/en/dev-tools/bundles/)**: DAB concepts and advanced patterns
- **[stackql-deploy GitHub Action](https://github.com/stackql/stackql-deploy-action)**: CI/CD integration guide

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.

## ⚠️ Important Notes

- **Cost Management**: This project provisions billable cloud resources. Always run the teardown commands after testing.
- **Cleanup Required**: Cancel your Databricks subscription after completing the exercise to avoid ongoing charges.
- **Security**: Never commit credentials to version control. Use environment variables and CI/CD secrets.

---

*Demonstrating the future of DataOps with SQL-based infrastructure management and modern data pipeline orchestration.*
