In this workshop, we will explore Docker fundamentals and data engineering workflows using Docker containers. It is part of Module 1 of the Data Engineering Zoomcamp.
Data Engineering is the design and development of systems for collecting, storing and analyzing data at scale.

Prerequisites:

- Basic understanding of Python
- Basic SQL knowledge (helpful but not required)
- Docker and Python installed on your machine
- Git (optional)

Workshop contents:

- Introduction to Docker - What is Docker, why use it, basic commands
- Virtual Environments and Data Pipelines - Setting up Python environments with uv
- Dockerizing the Pipeline - Creating a Dockerfile for a simple pipeline
- Running PostgreSQL with Docker - Dockerizing a PostgreSQL database
- NY Taxi Dataset and Data Ingestion - Working with real data, pandas, SQLAlchemy
- Creating the Data Ingestion Script - Converting notebook to Python script
- pgAdmin - Database Management Tool - Web-based database management
- Dockerizing the Ingestion Script - Containerizing the pipeline
- Docker Compose - Multi-container orchestration
- SQL Refresher - SQL joins, aggregations, and queries
- Cleanup - Cleaning up Docker resources