Time to first run: 30 minutes
Goal: Get the entire platform running locally and understand the data flow.
A GoldenSource for Crypto that:
- Polls exchange APIs (Binance, Kraken) hourly for instrument specs
- Stores raw responses in Bronze (Iceberg tables)
- Transforms to normalized history in Silver (SCD Type 2)
- Creates cross-exchange mappings in Gold (symbology)
- Serves data via REST API with point-in-time queries
Why it matters: Trading systems need accurate instrument specifications (tick size, lot size) with full historical audit trails.
Install these first (5 minutes):
# Python 3.11+
python3 --version # Should be 3.11 or higher
# uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Docker & Docker Compose
docker --version
docker compose version
# DuckDB CLI (for debugging)
brew install duckdb # macOS
# or
wget https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip
# Clone repository
git clone https://github.com/k2/k2-reference-data-platform.git
cd k2-reference-data-platform
# Install dependencies
make install-dev
# Verify installation
make help # Should show all available commands
What this does: Installs Python packages, sets up pre-commit hooks, configures development environment.
# Start Docker services
make docker-up
# Wait for services to be ready (~30 seconds)
# You should see:
# ✔ Container kafka Started
# ✔ Container postgres Started
# ✔ Container minio Started
# ✔ Container iceberg-rest Started
What this does: Starts Kafka, PostgreSQL, MinIO (S3), and Iceberg REST catalog.
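Rather than waiting a fixed 30 seconds, you can poll the services until they respond. The following is a minimal sketch using only the standard library; the function names (`http_ok`, `wait_until_ready`) are hypothetical and not part of the repository, and the two URLs are the MinIO and Iceberg endpoints this guide uses for verification.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Assumed endpoints: the same ones this guide checks with curl.
HEALTH_URLS = {
    "minio": "http://localhost:9000/minio/health/live",
    "iceberg-rest": "http://localhost:8181/v1/config",
}


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx HTTP error.
        return False


def wait_until_ready(
    urls: dict[str, str],
    check: Callable[[str], bool] = http_ok,
    timeout: float = 60.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Poll until every service passes `check`.

    Returns the names of services still failing when the timeout is reached
    (an empty list means everything is up).
    """
    pending = dict(urls)
    waited = 0.0
    while pending:
        # Keep only the services that are still failing.
        pending = {name: url for name, url in pending.items() if not check(url)}
        if not pending or waited >= timeout:
            break
        sleep(interval)
        waited += interval
    return sorted(pending)
```

Calling `wait_until_ready(HEALTH_URLS)` after `make docker-up` returns `[]` once both endpoints answer. Kafka and PostgreSQL do not speak HTTP, so they would need a different check (for example, a TCP connect).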
Verify services:
# Check all containers running
docker ps | grep -E "kafka|postgres|minio|iceberg"
# Test MinIO (S3)
curl -I http://localhost:9000/minio/health/live
# Expected: HTTP/1.1 200 OK (the response body is empty, so -I is needed to see the status)
# Test Iceberg catalog
curl http://localhost:8181/v1/config
# Expected: JSON config
# Create Iceberg tables and register Avro schemas
make init-infra
# Expected output:
# ✓ Created Iceberg namespace: refdata
# ✓ Created table: bronze_instruments_binance
# ✓ Created table: bronze_instruments_kraken
# ✓ Registered schema: refdata-binance-instrument-raw-value
# ✓ Registered schema: refdata-kraken-instrument-raw-value
What this does: Creates empty Iceberg tables in MinIO and registers Avro schemas with Schema Registry.
# Ingest data from Binance
make ingest-binance
# Expected output:
# INFO: Ingesting instruments exchange=binance
# INFO: Fetched 500 instruments
# INFO: Published to Kafka topic=refdata.instruments.binance.raw
# INFO: Updated state store
Verify data landed:
# Check Kafka topic
kafka-console-consumer \
--bootstrap-server localhost:9092 \
--topic refdata.instruments.binance.raw \
--from-beginning \
--max-messages 1
# Should see Avro-serialized message
# Transform Bronze → Silver → Gold
make dbt-run
# Expected output:
# Running with dbt=1.5.0
# Completed successfully
#
# Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
Verify transformations:
# Query Silver table
duckdb -c "
SELECT COUNT(*) FROM iceberg_scan('s3://refdata-warehouse/silver/instruments')
"
# Expected: ~500 rows
# Query Gold symbology
duckdb -c "
SELECT * FROM iceberg_scan('s3://refdata-warehouse/gold/symbology') LIMIT 5
"
# Expected: Canonical IDs like BTC-USD-SPOT
# Start FastAPI server
make api-dev
# Expected output:
# INFO: Starting K2 Reference Data API version=1.0.0
# INFO: Connection pool initialized pool_size=5
# INFO: Uvicorn running on http://0.0.0.0:8001
Test API:
# Health check
curl http://localhost:8001/health
# Expected: {"status": "healthy", "version": "1.0.0"}
# Query instruments
curl "http://localhost:8001/v1/instruments?exchange=binance&limit=5"
# Expected: JSON with 5 instruments
# Interactive docs
open http://localhost:8001/docs
# Run unit tests (fast, no Docker needed)
make test-unit
# Run integration tests (uses Docker services)
make test-integration
# Expected: All tests passing
# ======================== 71 passed in 5.23s ========================
┌─────────────┐
│   Binance   │  GET /api/v3/exchangeInfo
│   Kraken    │  GET /0/public/AssetPairs
└──────┬──────┘
       │ Python ingestion (hourly cron)
       ▼
┌─────────────┐
│    Kafka    │  refdata.instruments.{exchange}.raw
└──────┬──────┘
       │ Kafka → Iceberg consumer
       ▼
┌─────────────┐
│   Bronze    │  Raw JSON in Iceberg tables
│  (Iceberg)  │  Partitioned by day, 7-day retention
└──────┬──────┘
       │ DBT transformation (hourly)
       ▼
┌─────────────┐
│   Silver    │  Normalized, historized (SCD Type 2)
│  (Iceberg)  │  Bitemporal: valid_from/to + record_created_at
└──────┬──────┘
       │ DBT symbology mapping
       ▼
┌─────────────┐
│    Gold     │  Canonical IDs, cross-exchange mapping
│  (Iceberg)  │  BTC-USD-SPOT → {binance: BTCUSDT, kraken: XBT/USD}
└──────┬──────┘
       │ DuckDB queries
       ▼
┌─────────────┐
│   FastAPI   │  REST API with bitemporal queries
│ (Port 8001) │  GET /v1/instruments?as_of=...
└─────────────┘
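The Bronze to Silver hop in the diagram amounts to flattening nested exchange JSON into typed columns. The platform does this in DBT SQL; the Python below is only an illustrative sketch of the same idea for a Binance-style payload. The function name `flatten_binance_symbol` and the exact output column set are assumptions, not the project's actual model.

```python
from decimal import Decimal
from typing import Any


def flatten_binance_symbol(raw: dict[str, Any]) -> dict[str, Any]:
    """Flatten one entry of Binance's exchangeInfo `symbols` array into a flat row."""
    # Index the filters by type so each lookup is a plain dict access.
    filters = {f["filterType"]: f for f in raw.get("filters", [])}
    tick = filters.get("PRICE_FILTER", {}).get("tickSize")
    return {
        "exchange": "binance",
        "symbol": raw["symbol"],
        "base_asset": raw["baseAsset"],
        "quote_asset": raw["quoteAsset"],
        # Decimal keeps "0.01" exact; a float would introduce rounding noise.
        "tick_size": Decimal(tick) if tick is not None else None,
    }


row = flatten_binance_symbol(
    {
        "symbol": "BTCUSDT",
        "baseAsset": "BTC",
        "quoteAsset": "USDT",
        "filters": [{"filterType": "PRICE_FILTER", "tickSize": "0.01"}],
    }
)
```

Parsing the tick size as a decimal string rather than a float matters here because downstream price validation depends on exact multiples of the tick.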
1. Exchange API Response (Bronze):
{
"symbol": "BTCUSDT",
"baseAsset": "BTC",
"quoteAsset": "USDT",
"filters": [
{"filterType": "PRICE_FILTER", "tickSize": "0.01"}
]
}
2. DBT Transformation (Silver):
-- Parsed to relational schema
exchange: 'binance'
symbol: 'BTCUSDT'
tick_size: 0.01
valid_from: '2024-01-23 10:00:00'
valid_to: NULL -- Current record
record_created_at: '2024-01-23 10:05:00'
3. Symbology Mapping (Gold):
canonical_id: 'BTC-USD-SPOT'
base_asset: 'BTC' -- Normalized
quote_asset: 'USD' -- USDT → USD
binance_symbol: 'BTCUSDT'
kraken_symbol: 'XBT/USD' -- Different symbol, same instrument!
4. API Query:
# Current state
curl "http://localhost:8001/v1/instruments?exchange=binance&symbol=BTCUSDT"
# Historical state (point-in-time)
curl "http://localhost:8001/v1/instruments?exchange=binance&symbol=BTCUSDT&as_of=2024-01-15T10:00:00Z"
# Symbology lookup
curl "http://localhost:8001/v1/symbology/BTC-USD-SPOT"
Problem: Exchange announces "tick size changes Jan 15" on Jan 10. We ingest Jan 11. On Jan 16, they correct: "Actually Jan 14".
Solution: Track TWO timestamps:
- Business Time (valid_from, valid_to): When the spec was effective in reality
- System Time (record_created_at): When we learned about it
Query: "What was tick_size on Jan 14 10pm?"
- Looks at valid_from <= Jan 14 10pm < valid_to
- If multiple corrections, uses latest record_created_at
Read: ADR-001: Bitemporal Modeling
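The lookup rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the API's implementation: the `SpecVersion` class, the `tick_size_as_of` function, the optional `known_at` filter, and the tick values 0.01 and 0.10 are all hypothetical. The rows replay the scenario described: a change first recorded as effective Jan 15, then corrected on Jan 16 to Jan 14.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class SpecVersion:
    tick_size: Decimal
    valid_from: datetime          # business time, inclusive
    valid_to: Optional[datetime]  # business time, exclusive; None = still open
    record_created_at: datetime   # system time: when we recorded this row


def tick_size_as_of(rows, business_time, known_at=None):
    """Return the tick size effective at `business_time`.

    If `known_at` is given, only rows recorded on or before it are considered,
    which answers "what did we believe at that moment?".
    """
    candidates = [
        r for r in rows
        if r.valid_from <= business_time
        and (r.valid_to is None or business_time < r.valid_to)
        and (known_at is None or r.record_created_at <= known_at)
    ]
    if not candidates:
        return None
    # Several rows can cover the same instant after a correction: latest wins.
    return max(candidates, key=lambda r: r.record_created_at).tick_size


d = lambda day, hour=0: datetime(2024, 1, day, hour)
rows = [
    # Recorded Jan 11: change believed to take effect Jan 15.
    SpecVersion(Decimal("0.01"), d(1), d(15), d(11)),
    SpecVersion(Decimal("0.10"), d(15), None, d(11)),
    # Recorded Jan 16: correction, the change actually took effect Jan 14.
    SpecVersion(Decimal("0.01"), d(1), d(14), d(16)),
    SpecVersion(Decimal("0.10"), d(14), None, d(16)),
]

corrected = tick_size_as_of(rows, d(14, 22))                  # uses the Jan 16 correction
as_believed = tick_size_as_of(rows, d(14, 22), known_at=d(12))  # ignores it
```

With these rows, `corrected` is the new tick size and `as_believed` is the old one, which is exactly the distinction the second timestamp makes possible.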
Problem: Instrument specs change over time. How to track history?
Solution: Don't UPDATE records, INSERT new versions:
-- Old record (closed)
valid_from: 2024-01-10
valid_to: 2024-01-15 -- Closed (equals the new record's valid_from, so intervals do not overlap)
-- New record (current)
valid_from: 2024-01-15
valid_to: NULL -- Current
Result: Full audit trail of all changes.
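The close-and-insert pattern can be sketched as follows. In the platform this logic lives in the DBT Silver models; the Python function `apply_change` and the dict-based row shape are hypothetical and only illustrate the mechanics.

```python
from datetime import datetime


def apply_change(history: list[dict], new_attrs: dict, effective: datetime) -> list[dict]:
    """Return a new history where the open version is closed and a new one appended.

    The input list and its rows are left untouched.
    """
    current = next((r for r in history if r["valid_to"] is None), None)
    # No-op when the attributes did not actually change.
    if current is not None and all(current.get(k) == v for k, v in new_attrs.items()):
        return list(history)
    # Close the open version at the moment the new one becomes effective.
    closed = [{**r, "valid_to": effective} if r is current else r for r in history]
    base = current if current is not None else {}
    new_row = {**base, **new_attrs, "valid_from": effective, "valid_to": None}
    return closed + [new_row]


history = [
    {"symbol": "BTCUSDT", "tick_size": "0.01",
     "valid_from": datetime(2024, 1, 10), "valid_to": None},
]
# Hypothetical new tick size, effective Jan 15.
updated = apply_change(history, {"tick_size": "0.10"}, datetime(2024, 1, 15))
```

After the call, `updated` holds two rows: the old one closed at Jan 15 and a new open one starting at Jan 15, so every instant is covered by exactly one version.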
Problem: Same instrument, different symbols:
- Binance: BTCUSDT
- Kraken: XBT/USD (XBT instead of BTC!)
- Coinbase: BTC-USD
Solution: Canonical IDs: BTC-USD-SPOT
Read: ADR-004: Symbology Mapping
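A minimal sketch of how such canonical IDs could be derived, assuming the two normalizations mentioned in this guide (XBT → BTC, USDT → USD). The names `ASSET_ALIASES`, `normalize_asset`, `canonical_id`, and `split_delimited` are hypothetical; the real mapping rules are defined in the Gold models and ADR-004.

```python
import re

# Assumed alias table, limited to the two cases this guide mentions.
ASSET_ALIASES = {"XBT": "BTC", "USDT": "USD"}


def normalize_asset(code: str) -> str:
    """Map an exchange-specific asset code to its canonical form."""
    code = code.strip().upper()
    return ASSET_ALIASES.get(code, code)


def canonical_id(base: str, quote: str, instrument_type: str = "SPOT") -> str:
    """Build an ID of the form BASE-QUOTE-TYPE from raw asset codes."""
    return f"{normalize_asset(base)}-{normalize_asset(quote)}-{instrument_type}"


def split_delimited(symbol: str) -> tuple[str, str]:
    """Split symbols such as 'XBT/USD' or 'BTC-USD' into (base, quote)."""
    base, quote = re.split(r"[/-]", symbol, maxsplit=1)
    return base, quote


# Binance has no delimiter in BTCUSDT, so use its baseAsset/quoteAsset fields.
binance_id = canonical_id("BTC", "USDT")
kraken_id = canonical_id(*split_delimited("XBT/USD"))
coinbase_id = canonical_id(*split_delimited("BTC-USD"))
```

All three calls produce the same canonical ID, which is what lets the Gold layer join instruments across exchanges.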
# Development
make install-dev # Install dependencies
make docker-up # Start services
make docker-down # Stop services
make docker-clean # Stop + remove volumes
# Ingestion
make ingest-binance # Ingest Binance data
make ingest-kraken # Ingest Kraken data
make ingest-now # Ingest all exchanges
# DBT
make dbt-run # Run transformations
make dbt-test # Run data quality tests
make dbt-docs # Generate and view docs
make dbt-clean # Clean build artifacts
# API
make api-dev # Start API (auto-reload)
make api # Start API (production mode)
# Testing
make test-unit # Unit tests (fast)
make test-integration # Integration tests (Docker)
make test-all # All tests
make coverage # Test coverage report
# Code Quality
make lint # Lint code
make format # Format code
make type-check # Type checking
make quality # All quality checks
# Check if ports already in use
lsof -i :9092 # Kafka
lsof -i :5432 # PostgreSQL
lsof -i :9000 # MinIO
# Kill conflicting processes or change ports in docker-compose.refdata.yml
# Check Kafka is running
docker ps | grep kafka
# Check Kafka logs
docker logs kafka
# Test Kafka connection
kafka-topics --list --bootstrap-server localhost:9092
# Re-run initialization
make init-infra
# Verify tables exist
curl http://localhost:8181/v1/namespaces/refdata/tables
# Check if data exists in Silver
duckdb -c "SELECT COUNT(*) FROM iceberg_scan('s3://refdata-warehouse/silver/instruments')"
# If 0, run ingestion + DBT
make ingest-now
make dbt-run
# Clean and reinstall
make clean
make install-dev
# Run tests with verbose output
pytest -v -s tests/unit/
Now that you're running, read these in order:
- Developer Onboarding - Day-by-day learning plan (Week 1)
- Common Workflows - How to add exchanges, fix bugs, etc.
- DBT Guide - Deep dive on transformations
- API Guide - Using the REST API
- Architecture Decision Records - Why we made key choices
Join the team:
- Slack: #k2-refdata-platform
- Standup: Daily 10am
- Office hours: Tuesday/Thursday 2-3pm
Project Structure:
src/refdata/
├── ingestion/ # Exchange clients, Kafka producers
├── api/ # FastAPI endpoints
├── common/ # Config, logging, DB utils
└── cli/ # CLI commands
dbt/
├── models/ # SQL transformations
│ ├── bronze/ # Source definitions
│ ├── silver/ # SCD Type 2 models
│ └── gold/ # Symbology master
└── macros/ # Custom SQL functions
tests/
├── unit/ # Fast, isolated tests
├── integration/ # Docker-based tests
└── e2e/ # Full pipeline tests
Key Files:
- CLAUDE.md - Development standards
- README.md - Project overview
- Makefile - All commands
- pyproject.toml - Dependencies
Help:
make help # Show all commands
dbt --help # DBT help
pytest --help # Testing help
Welcome to the team! 🚀
Questions? Ask in #k2-refdata-platform or ping @data-engineering-lead