An end-to-end local data engineering project that simulates website clickstream traffic, publishes events to Apache Kafka, validates and processes them in near real-time, routes invalid events to a dead-letter queue (DLQ), and writes rolling summary metrics to CSV.
This project is designed for learning and portfolio use, and demonstrates core streaming concepts:
- Event generation and producer patterns
- Kafka topic setup and partitioning
- Consumer-side validation and DLQ handling
- Manual offset commit after processing
- Periodic aggregation and sink output
Data flow:
click_generator.py -> producer.py -> Kafka topic `website_clicks` -> consumer.py
- valid events -> rolling metrics -> `data/processed_click_summary.csv`
- invalid events -> Kafka topic `website_clicks_dlq`
- Python
- Apache Kafka (running in local KRaft mode via Docker Compose)
- `confluent-kafka` Python client
- `python-dotenv` for environment-based configuration
.
|-- docker-compose.yml
|-- requirements.txt
|-- README.md
|-- data/
| `-- processed_click_summary.csv
`-- src/
|-- __init__.py
|-- click_generator.py
|-- config.py
|-- consumer.py
|-- producer.py
|-- topic_setup.py
`-- utils.py
- Python 3.10+
- Docker Desktop (or Docker Engine + Compose)
`pip install -r requirements.txt`
Why: installs the Kafka client and environment-loading packages needed by the Python applications.
`docker compose up -d`
Why: starts a single-node Kafka broker (KRaft mode) defined in docker-compose.yml.
`python -m src.topic_setup`
Creates the topics if missing (a sketch of the creation logic follows this list):
- `website_clicks`
- `website_clicks_dlq`
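For reference, a minimal sketch of the kind of topic creation src/topic_setup.py performs, using the confluent-kafka AdminClient. The import from src.config assumes the settings are exposed as module-level constants, and the real script's error handling may differ.

```python
# Sketch only: create both topics if they do not already exist.
from confluent_kafka.admin import AdminClient, NewTopic

from src.config import (  # assumed module-level constants
    BOOTSTRAP_SERVERS, CLICK_TOPIC, DLQ_TOPIC, PARTITIONS, REPLICATION_FACTOR,
)

admin = AdminClient({"bootstrap.servers": BOOTSTRAP_SERVERS})

topics = [
    NewTopic(CLICK_TOPIC, num_partitions=PARTITIONS, replication_factor=REPLICATION_FACTOR),
    NewTopic(DLQ_TOPIC, num_partitions=PARTITIONS, replication_factor=REPLICATION_FACTOR),
]

# create_topics() returns {topic_name: future}; resolving a future raises
# if creation failed, e.g. when the topic already exists.
for name, future in admin.create_topics(topics).items():
    try:
        future.result()
        print(f"Created topic {name}")
    except Exception as exc:
        print(f"Topic {name} not created: {exc}")
```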
`python -m src.consumer`
Consumer responsibilities (a sketch of the loop follows this list):
- Reads from `website_clicks`
- Validates each event
- Routes invalid events to `website_clicks_dlq`
- Maintains running counters
- Prints a snapshot every 10 seconds
- Writes snapshots to `data/processed_click_summary.csv`
- Commits offsets only after processing
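A minimal sketch of that loop, under stated assumptions: validate_event and build_dlq_record stand in for whatever helpers src/utils.py actually exposes, the counter layout is illustrative, and the CSV writing step is omitted here for brevity.

```python
# Sketch only: poll, validate, route failures to the DLQ, then commit the offset.
import json
import time
from collections import Counter

from confluent_kafka import Consumer, Producer

from src.utils import build_dlq_record, validate_event  # assumed helper names

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "clickstream-consumer-group",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,          # offsets are committed manually below
})
consumer.subscribe(["website_clicks"])
dlq_producer = Producer({"bootstrap.servers": "localhost:9092"})

total_clicks, users, page_counts = 0, set(), Counter()
last_snapshot = time.time()

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue

        event = json.loads(msg.value())
        ok, reason = validate_event(event)
        if ok:
            total_clicks += 1
            users.add(event["user_id"])
            page_counts[event["page_url"]] += 1
        else:
            dlq_producer.produce(
                "website_clicks_dlq",
                key=msg.key(),
                value=json.dumps(build_dlq_record(event, reason)).encode("utf-8"),
            )
            dlq_producer.poll(0)

        # Commit only after the message has been fully handled.
        consumer.commit(message=msg, asynchronous=False)

        if time.time() - last_snapshot >= 10:
            print(f"clicks={total_clicks} users={len(users)}")
            last_snapshot = time.time()
finally:
    dlq_producer.flush()
    consumer.close()
```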
`python -m src.producer`
Producer responsibilities (a sketch follows this list):
- Generates click events continuously
- Injects a small percentage of intentionally invalid events
- Serializes event payloads to JSON
- Uses `user_id` as the message key
- Publishes to `website_clicks`
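A minimal sketch of the producer loop, assuming src/click_generator.py exposes a generate_click_event() function and src/config.py exposes the settings below as constants; how the real producer injects invalid events is not shown here.

```python
# Sketch only: serialize each event to JSON and key it by user_id.
import json
import time

from confluent_kafka import Producer

from src.click_generator import generate_click_event  # assumed function name
from src.config import BOOTSTRAP_SERVERS, CLICK_TOPIC, PRODUCER_DELAY_SECONDS

producer = Producer({"bootstrap.servers": BOOTSTRAP_SERVERS})

def on_delivery(err, msg):
    # Report delivery failures instead of silently dropping them.
    if err is not None:
        print(f"Delivery failed: {err}")

try:
    while True:
        event = generate_click_event()
        producer.produce(
            CLICK_TOPIC,
            key=str(event.get("user_id", "")).encode("utf-8"),
            value=json.dumps(event).encode("utf-8"),
            callback=on_delivery,
        )
        producer.poll(0)                  # serve delivery callbacks
        time.sleep(PRODUCER_DELAY_SECONDS)
except KeyboardInterrupt:
    pass
finally:
    producer.flush()
```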
Configuration is centralized in src/config.py and loaded from environment variables.
Supported variables and defaults:
| Variable | Default | Description |
|---|---|---|
| `BOOTSTRAP_SERVERS` | `localhost:9092` | Kafka broker address |
| `CLICK_TOPIC` | `website_clicks` | Main topic for click events |
| `DLQ_TOPIC` | `website_clicks_dlq` | Topic for invalid events |
| `CONSUMER_GROUP` | `clickstream-consumer-group` | Consumer group id |
| `PARTITIONS` | `3` | Partition count when creating topics |
| `REPLICATION_FACTOR` | `1` | Replication factor when creating topics |
| `PRODUCER_DELAY_SECONDS` | `0.4` | Delay in seconds between produced events |
You can set these in your shell or in a local .env file at the project root.
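As a sketch, src/config.py might load these values roughly like this, assuming it uses python-dotenv plus plain os.getenv lookups with the defaults above; the real module may be organized differently.

```python
# Sketch of src/config.py: environment-based settings with defaults.
import os

from dotenv import load_dotenv

load_dotenv()  # picks up a local .env file at the project root, if present

BOOTSTRAP_SERVERS = os.getenv("BOOTSTRAP_SERVERS", "localhost:9092")
CLICK_TOPIC = os.getenv("CLICK_TOPIC", "website_clicks")
DLQ_TOPIC = os.getenv("DLQ_TOPIC", "website_clicks_dlq")
CONSUMER_GROUP = os.getenv("CONSUMER_GROUP", "clickstream-consumer-group")
PARTITIONS = int(os.getenv("PARTITIONS", "3"))
REPLICATION_FACTOR = int(os.getenv("REPLICATION_FACTOR", "1"))
PRODUCER_DELAY_SECONDS = float(os.getenv("PRODUCER_DELAY_SECONDS", "0.4"))
```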
Generated click events include:
`event_id`, `event_type`, `user_id`, `session_id`, `page_url`, `referrer`, `device_type`, `country`, `timestamp`
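For illustration only, a sketch of the kind of payload the generator might build; the page, referrer, and country pools and the ID formats below are invented examples, not the project's actual values.

```python
# Sketch only: one possible shape for a generated click event.
import random
import uuid
from datetime import datetime, timezone

PAGES = ["/", "/products", "/cart", "/checkout"]      # example values
REFERRERS = ["google", "direct", "newsletter"]        # example values
DEVICES = ["mobile", "desktop", "tablet"]
COUNTRIES = ["US", "GB", "DE", "IN"]                  # example values

def generate_click_event() -> dict:
    return {
        "event_id": str(uuid.uuid4()),
        "event_type": "click",
        "user_id": f"user_{random.randint(1, 50)}",
        "session_id": str(uuid.uuid4()),
        "page_url": random.choice(PAGES),
        "referrer": random.choice(REFERRERS),
        "device_type": random.choice(DEVICES),
        "country": random.choice(COUNTRIES),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```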
Validation rules in src/utils.py (a sketch follows the DLQ list below):
- Required fields must exist
- `event_type` must be `click`
- `device_type` must be one of: `mobile`, `desktop`, `tablet`
- `page_url` must start with `/`
Invalid events are sent to the DLQ with:
- validation failure reason
- original event payload
- failure timestamp
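A sketch of how the validation rules and the DLQ payload could be expressed; the function names and return shape are assumptions rather than the actual src/utils.py API.

```python
# Sketch only: validate an event and build the DLQ record for failures.
from datetime import datetime, timezone

REQUIRED_FIELDS = {
    "event_id", "event_type", "user_id", "session_id",
    "page_url", "referrer", "device_type", "country", "timestamp",
}
ALLOWED_DEVICES = {"mobile", "desktop", "tablet"}

def validate_event(event: dict) -> tuple[bool, str | None]:
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        return False, f"missing fields: {sorted(missing)}"
    if event["event_type"] != "click":
        return False, "event_type must be 'click'"
    if event["device_type"] not in ALLOWED_DEVICES:
        return False, "device_type must be one of mobile, desktop, tablet"
    if not str(event["page_url"]).startswith("/"):
        return False, "page_url must start with '/'"
    return True, None

def build_dlq_record(event: dict, reason: str) -> dict:
    # The DLQ message carries the failure reason, the original payload,
    # and a failure timestamp.
    return {
        "reason": reason,
        "original_event": event,
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }
```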
The consumer writes summary snapshots to data/processed_click_summary.csv.
CSV columns:
`snapshot_time`, `total_clicks`, `unique_users`, `top_page`, `top_page_clicks`
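A sketch of appending one snapshot row with those columns; the helper name and counter structure are assumptions, not the consumer's actual code.

```python
# Sketch only: append a summary row, writing the header on first use.
import csv
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path

SUMMARY_PATH = Path("data/processed_click_summary.csv")
COLUMNS = ["snapshot_time", "total_clicks", "unique_users", "top_page", "top_page_clicks"]

def write_snapshot(total_clicks: int, user_ids: set, page_counts: Counter) -> None:
    top_page, top_clicks = (page_counts.most_common(1) or [("", 0)])[0]
    SUMMARY_PATH.parent.mkdir(parents=True, exist_ok=True)
    new_file = not SUMMARY_PATH.exists()
    with SUMMARY_PATH.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(COLUMNS)
        writer.writerow([
            datetime.now(timezone.utc).isoformat(),
            total_clicks,
            len(user_ids),
            top_page,
            top_clicks,
        ])
```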
- Consumer uses `auto.offset.reset=earliest`
- Auto-commit is disabled
- Offsets are committed synchronously after each processed message
These choices make processing behavior explicit and easier to reason about for learning and debugging.
Stop Python apps with Ctrl+C in each terminal.
Stop Kafka:
`docker compose down`
Optionally remove broker volumes/state (if you add volumes in the future):
`docker compose down -v`
- Confirm Docker is running.
- Check broker container status: `docker ps`
- Verify port `9092` is not blocked by another process.
- Ensure Kafka is healthy before running topic setup.
- Re-run: `python -m src.topic_setup`
- Confirm the producer is running.
- Confirm both apps point to the same `BOOTSTRAP_SERVERS` and topic names.
- The consumer writes snapshots every 10 seconds.
- Ensure valid events are being produced (not all routed to DLQ).
Possible next upgrades:
- Add schema enforcement (Avro/JSON Schema/Protobuf)
- Add unit/integration tests for generator, validation, and consumer flow
- Expose metrics with Prometheus
- Persist output to a database or object store instead of CSV
- Add dashboards for clickstream analytics