Getting Started - K2 Reference Data Platform

Time to first run: 30 minutes

Goal: Get the entire platform running locally and understand the data flow.


What You're Building

A GoldenSource for Crypto that:

  1. Polls exchange APIs (Binance, Kraken) hourly for instrument specs
  2. Stores raw responses in Bronze (Iceberg tables)
  3. Transforms to normalized history in Silver (SCD Type 2)
  4. Creates cross-exchange mappings in Gold (symbology)
  5. Serves data via REST API with point-in-time queries

Why it matters: Trading systems need accurate instrument specifications (tick size, lot size) with full historical audit trails.


Prerequisites

Install these first (5 minutes):

# Python 3.11+
python3 --version  # Should be 3.11 or higher

# uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Docker & Docker Compose
docker --version
docker compose version

# DuckDB CLI (for debugging)
brew install duckdb  # macOS
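# or, on Linux:
wget https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip  # extracts the duckdb binary into the current directory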

Quick Setup (25 minutes)

Step 1: Clone and Install (5 min)

# Clone repository
git clone https://github.com/k2/k2-reference-data-platform.git
cd k2-reference-data-platform

# Install dependencies
make install-dev

# Verify installation
make help  # Should show all available commands

What this does: Installs Python packages, sets up pre-commit hooks, and configures the development environment.

Step 2: Start Infrastructure (5 min)

# Start Docker services
make docker-up

# Wait for services to be ready (~30 seconds)
# You should see:
# ✔ Container kafka        Started
# ✔ Container postgres     Started
# ✔ Container minio        Started
# ✔ Container iceberg-rest Started

What this does: Starts Kafka, PostgreSQL, MinIO (S3), and Iceberg REST catalog.

Verify services:

# Check all containers running
docker ps | grep -E "kafka|postgres|minio|iceberg"

# Test MinIO (S3)
curl -i http://localhost:9000/minio/health/live
# Expected: HTTP/1.1 200 OK (the body is empty, so -i is needed to see the status line)

# Test Iceberg catalog
curl http://localhost:8181/v1/config
# Expected: JSON config
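
The "wait for services to be ready" step can also be scripted instead of eyeballed. The following is a minimal sketch, not part of the repository: it polls the two health URLs shown above until both answer with HTTP 200. The names `SERVICES`, `http_ok`, and `wait_until_ready` are hypothetical, and the 30 attempts at 1-second intervals are an assumption based on the "~30 seconds" estimate.

```python
import time
import urllib.request
from typing import Callable, Dict, List

# URLs taken from the verification commands above.
SERVICES: Dict[str, str] = {
    "minio": "http://localhost:9000/minio/health/live",
    "iceberg-rest": "http://localhost:8181/v1/config",
}


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def wait_until_ready(
    urls: Dict[str, str],
    probe: Callable[[str], bool] = http_ok,
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until every service passes `probe`.

    Returns the names of services that are still failing after `attempts`
    rounds (an empty list means everything is up).
    """
    pending = dict(urls)
    for _ in range(attempts):
        pending = {name: url for name, url in pending.items() if not probe(url)}
        if not pending:
            return []
        sleep(delay)
    return sorted(pending)
```

Calling `wait_until_ready(SERVICES)` after `make docker-up` returns an empty list once both endpoints respond. The `probe` and `sleep` parameters are injectable so the loop can be exercised without Docker running.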

Step 3: Initialize Data Warehouse (5 min)

# Create Iceberg tables and register Avro schemas
make init-infra

# Expected output:
# ✓ Created Iceberg namespace: refdata
# ✓ Created table: bronze_instruments_binance
# ✓ Created table: bronze_instruments_kraken
# ✓ Registered schema: refdata-binance-instrument-raw-value
# ✓ Registered schema: refdata-kraken-instrument-raw-value

What this does: Creates empty Iceberg tables in MinIO and registers the Avro schemas with the Schema Registry.

Step 4: Run First Ingestion (3 min)

# Ingest data from Binance
make ingest-binance

# Expected output:
# INFO: Ingesting instruments exchange=binance
# INFO: Fetched 500 instruments
# INFO: Published to Kafka topic=refdata.instruments.binance.raw
# INFO: Updated state store

Verify data landed:

# Check Kafka topic
kafka-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic refdata.instruments.binance.raw \
  --from-beginning \
  --max-messages 1

# You should see one Avro-serialized message (binary, so only partly human-readable)

Step 5: Run DBT Transformations (3 min)

# Transform Bronze → Silver → Gold
make dbt-run

# Expected output:
# Running with dbt=1.5.0
# Completed successfully
#
# Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

Verify transformations:

# Query Silver table
duckdb -c "
SELECT COUNT(*) FROM iceberg_scan('s3://refdata-warehouse/silver/instruments')
"
# Expected: a count of ~500

# Query Gold symbology
duckdb -c "
SELECT * FROM iceberg_scan('s3://refdata-warehouse/gold/symbology') LIMIT 5
"
# Expected: Canonical IDs like BTC-USD-SPOT

Step 6: Start API (2 min)

# Start FastAPI server
make api-dev

# Expected output:
# INFO: Starting K2 Reference Data API version=1.0.0
# INFO: Connection pool initialized pool_size=5
# INFO: Uvicorn running on http://0.0.0.0:8001

Test API:

# Health check
curl http://localhost:8001/health
# Expected: {"status": "healthy", "version": "1.0.0"}

# Query instruments
curl "http://localhost:8001/v1/instruments?exchange=binance&limit=5"
# Expected: JSON with 5 instruments

# Interactive docs
open http://localhost:8001/docs

Step 7: Run Tests (2 min)

# Run unit tests (fast, no Docker needed)
make test-unit

# Run integration tests (uses Docker services)
make test-integration

# Expected: All tests passing
# ======================== 71 passed in 5.23s ========================

Understanding the Data Flow

Visual Flow

┌─────────────┐
│  Binance    │  GET /api/v3/exchangeInfo
│  Kraken     │  GET /0/public/AssetPairs
└──────┬──────┘
       │ Python ingestion (hourly cron)
       ▼
┌─────────────┐
│   Kafka     │  refdata.instruments.{exchange}.raw
└──────┬──────┘
       │ Kafka → Iceberg consumer
       ▼
┌─────────────┐
│  Bronze     │  Raw JSON in Iceberg tables
│  (Iceberg)  │  Partitioned by day, 7-day retention
└──────┬──────┘
       │ DBT transformation (hourly)
       ▼
┌─────────────┐
│  Silver     │  Normalized, historized (SCD Type 2)
│  (Iceberg)  │  Bitemporal: valid_from/to + record_created_at
└──────┬──────┘
       │ DBT symbology mapping
       ▼
┌─────────────┐
│   Gold      │  Canonical IDs, cross-exchange mapping
│  (Iceberg)  │  BTC-USD-SPOT → {binance: BTCUSDT, kraken: XBT/USD}
└──────┬──────┘
       │ DuckDB queries
       ▼
┌─────────────┐
│  FastAPI    │  REST API with bitemporal queries
│  (Port 8001)│  GET /v1/instruments?as_of=...
└─────────────┘

Example: Single Instrument Journey

1. Exchange API Response (Bronze):

{
  "symbol": "BTCUSDT",
  "baseAsset": "BTC",
  "quoteAsset": "USDT",
  "filters": [
    {"filterType": "PRICE_FILTER", "tickSize": "0.01"}
  ]
}

2. DBT Transformation (Silver):

-- Parsed to relational schema
exchange: 'binance'
symbol: 'BTCUSDT'
tick_size: 0.01
valid_from: '2024-01-23 10:00:00'
valid_to: NULL  -- Current record
record_created_at: '2024-01-23 10:05:00'
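
The Bronze → Silver mapping above is performed by the dbt SQL models. Purely as an illustration of the same logic, here is a Python sketch that pulls `tickSize` out of the Binance `filters` array and shapes it into the Silver columns. The function names `extract_tick_size` and `to_silver_row` are hypothetical and do not exist in the repository.

```python
from decimal import Decimal
from typing import Any, Dict, Optional


def extract_tick_size(symbol_info: Dict[str, Any]) -> Optional[Decimal]:
    """Return the PRICE_FILTER tick size, or None if the filter is absent."""
    for flt in symbol_info.get("filters", []):
        if flt.get("filterType") == "PRICE_FILTER":
            # The exchange sends numbers as strings; Decimal keeps them exact.
            return Decimal(flt["tickSize"])
    return None


def to_silver_row(
    symbol_info: Dict[str, Any],
    exchange: str,
    valid_from: str,
    record_created_at: str,
) -> Dict[str, Any]:
    """Shape one raw instrument into the Silver columns shown above."""
    return {
        "exchange": exchange,
        "symbol": symbol_info["symbol"],
        "tick_size": extract_tick_size(symbol_info),
        "valid_from": valid_from,
        "valid_to": None,  # None marks the current record
        "record_created_at": record_created_at,
    }


# The raw payload from step 1.
raw = {
    "symbol": "BTCUSDT",
    "baseAsset": "BTC",
    "quoteAsset": "USDT",
    "filters": [{"filterType": "PRICE_FILTER", "tickSize": "0.01"}],
}
row = to_silver_row(raw, "binance", "2024-01-23 10:00:00", "2024-01-23 10:05:00")
```

Parsing into `Decimal` rather than `float` is a deliberate choice in this sketch: tick sizes are compared for equality when detecting changes, and binary floats can make identical specs look different.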

3. Symbology Mapping (Gold):

canonical_id: 'BTC-USD-SPOT'
base_asset: 'BTC'  -- Normalized
quote_asset: 'USD'  -- USDT → USD
binance_symbol: 'BTCUSDT'
kraken_symbol: 'XBT/USD'  -- Different symbol, same instrument!

4. API Query:

# Current state
curl "http://localhost:8001/v1/instruments?exchange=binance&symbol=BTCUSDT"

# Historical state (point-in-time)
curl "http://localhost:8001/v1/instruments?exchange=binance&symbol=BTCUSDT&as_of=2024-01-15T10:00:00Z"

# Symbology lookup
curl "http://localhost:8001/v1/symbology/BTC-USD-SPOT"
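
The same queries can be issued from Python. This is a minimal sketch using only the standard library. The helper names are hypothetical, and the shape of the JSON response is not documented here, so the code returns it undecoded beyond `json.load`.

```python
import json
import urllib.request
from typing import Any, Optional
from urllib.parse import urlencode

BASE_URL = "http://localhost:8001"  # port from the "Start API" step


def instruments_url(exchange: str, symbol: str, as_of: Optional[str] = None) -> str:
    """Build the /v1/instruments URL; `as_of` switches to a point-in-time query."""
    params = {"exchange": exchange, "symbol": symbol}
    if as_of is not None:
        params["as_of"] = as_of  # ISO-8601 timestamp, e.g. 2024-01-15T10:00:00Z
    # urlencode percent-encodes the colons in the timestamp.
    return f"{BASE_URL}/v1/instruments?{urlencode(params)}"


def fetch_instruments(exchange: str, symbol: str, as_of: Optional[str] = None) -> Any:
    """GET the URL and decode the JSON body (requires the API to be running)."""
    url = instruments_url(exchange, symbol, as_of)
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)
```

With the API running, `fetch_instruments("binance", "BTCUSDT", as_of="2024-01-15T10:00:00Z")` is the equivalent of the second curl command.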

Core Concepts (Read This!)

Bitemporal Modeling

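Problem: On Jan 10, an exchange announces "tick size changes Jan 15". We ingest the announcement on Jan 11. On Jan 16, the exchange corrects itself: "Actually, Jan 14". We now hold two conflicting versions of when the change took effect.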

Solution: Track TWO timestamps:

  1. Business Time (valid_from, valid_to): When spec was effective in reality
  2. System Time (record_created_at): When we learned about it

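Query: "What was tick_size on Jan 14 10pm?"

  • Selects records where valid_from <= Jan 14 10pm < valid_to (a NULL valid_to means the record is still current)
  • If corrections produced more than one match, uses the one with the latest record_created_at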

Read: ADR-001: Bitemporal Modeling
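
The lookup rule above can be sketched in a few lines of Python. This is an illustration, not the platform's query code (the API runs its queries through DuckDB). The `SpecVersion` class, the `tick_size_as_of` function, the year 2024, and the tick values 0.01 and 0.10 are all assumptions made up for the example. The optional `known_at` argument is also an addition: it answers "what did we believe at that time?" by ignoring records created later.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import List, Optional


@dataclass(frozen=True)
class SpecVersion:
    tick_size: Decimal
    valid_from: datetime            # business time, inclusive
    valid_to: Optional[datetime]    # business time, exclusive; None = current
    record_created_at: datetime     # system time


def tick_size_as_of(
    versions: List[SpecVersion],
    business_time: datetime,
    known_at: Optional[datetime] = None,
) -> Optional[Decimal]:
    """Return the tick size effective at `business_time`.

    Among matching records, the one with the latest record_created_at wins.
    """
    matches = [
        v for v in versions
        if v.valid_from <= business_time
        and (v.valid_to is None or business_time < v.valid_to)
        and (known_at is None or v.record_created_at <= known_at)
    ]
    if not matches:
        return None
    return max(matches, key=lambda v: v.record_created_at).tick_size


d = datetime  # shorthand for the sample data below
versions = [
    # Ingested Jan 11: change believed to take effect Jan 15.
    SpecVersion(Decimal("0.01"), d(2024, 1, 1), d(2024, 1, 15), d(2024, 1, 11)),
    SpecVersion(Decimal("0.10"), d(2024, 1, 15), None, d(2024, 1, 11)),
    # Ingested Jan 16: correction, the change actually took effect Jan 14.
    SpecVersion(Decimal("0.01"), d(2024, 1, 1), d(2024, 1, 14), d(2024, 1, 16)),
    SpecVersion(Decimal("0.10"), d(2024, 1, 14), None, d(2024, 1, 16)),
]

jan14_10pm = d(2024, 1, 14, 22, 0)
corrected = tick_size_as_of(versions, jan14_10pm)
as_believed_jan12 = tick_size_as_of(versions, jan14_10pm, known_at=d(2024, 1, 12))
```

Here `corrected` is `Decimal("0.10")` because the Jan 16 correction wins, while `as_believed_jan12` is `Decimal("0.01")` because on Jan 12 only the original announcement was known.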

SCD Type 2 (Slowly Changing Dimensions)

Problem: Instrument specs change over time. How to track history?

Solution: Don't UPDATE records in place. Close the current version and INSERT a new one:

-- Old record (closed)
valid_from: 2024-01-10
valid_to: 2024-01-15  -- Closed (exclusive, equals the new record's valid_from so the two never overlap)

-- New record (current)
valid_from: 2024-01-15
valid_to: NULL  -- Current

Result: Full audit trail of all changes.
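
As a rough illustration of the SCD Type 2 pattern (the real logic lives in the dbt Silver models), the sketch below closes the current record and appends a new version whenever a tracked value changes. The function name `apply_change` is hypothetical. It assumes half-open intervals, so the old record's valid_to is set to exactly the new record's valid_from, consistent with the valid_from <= t < valid_to rule used for queries.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


def apply_change(history: List[Record], new_values: Record, effective_at: str) -> List[Record]:
    """Return a new history with `new_values` applied as of `effective_at`.

    The input list and its records are not mutated. If the current record
    already holds the same values, the history is returned unchanged.
    """
    current: Optional[Record] = next(
        (r for r in history if r["valid_to"] is None), None
    )
    if current is not None and all(current.get(k) == v for k, v in new_values.items()):
        return history  # no real change, so no new version

    # Close the current record by copying it with valid_to filled in.
    updated = [
        {**r, "valid_to": effective_at} if r is current else r
        for r in history
    ]
    # Append the new current version.
    updated.append({**new_values, "valid_from": effective_at, "valid_to": None})
    return updated


history = [{"tick_size": "0.01", "valid_from": "2024-01-10", "valid_to": None}]
history_v2 = apply_change(history, {"tick_size": "0.10"}, "2024-01-15")
```

After the call, `history_v2` contains the closed record (valid_to `2024-01-15`) followed by the new current record, and re-applying the same value later leaves the history untouched.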

Symbology Normalization

Problem: Same instrument, different symbols:

  • Binance: BTCUSDT
  • Kraken: XBT/USD (XBT instead of BTC!)
  • Coinbase: BTC-USD

Solution: Canonical IDs: BTC-USD-SPOT

Read: ADR-004: Symbology Mapping
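
One way to picture the normalization is as an alias table applied to the base and quote assets before the canonical ID is assembled. The sketch below is an assumption-laden illustration, not the Gold model itself: `ASSET_ALIASES`, `normalize_asset`, and `canonical_id` are hypothetical names, and the alias table contains only the two mappings mentioned in this guide (XBT → BTC, USDT → USD).

```python
from typing import Dict

# Only the aliases mentioned in this guide; the real mapping is defined in the Gold layer.
ASSET_ALIASES: Dict[str, str] = {
    "XBT": "BTC",   # Kraken's code for Bitcoin
    "USDT": "USD",  # treated as USD for symbology purposes
}


def normalize_asset(asset: str) -> str:
    """Upper-case the asset code and resolve known aliases."""
    code = asset.upper()
    return ASSET_ALIASES.get(code, code)


def canonical_id(base: str, quote: str, instrument_type: str = "SPOT") -> str:
    """Build a canonical ID such as BTC-USD-SPOT from exchange-specific codes."""
    return f"{normalize_asset(base)}-{normalize_asset(quote)}-{instrument_type}"


binance_id = canonical_id("BTC", "USDT")   # from BTCUSDT
kraken_id = canonical_id("XBT", "USD")     # from XBT/USD
coinbase_id = canonical_id("BTC", "USD")   # from BTC-USD
```

All three calls produce `BTC-USD-SPOT`. The function takes base and quote separately because a symbol like BTCUSDT has no delimiter, so the split has to come from the exchange metadata (for Binance, the `baseAsset` and `quoteAsset` fields).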


Common Commands

# Development
make install-dev          # Install dependencies
make docker-up            # Start services
make docker-down          # Stop services
make docker-clean         # Stop + remove volumes

# Ingestion
make ingest-binance       # Ingest Binance data
make ingest-kraken        # Ingest Kraken data
make ingest-now           # Ingest all exchanges

# DBT
make dbt-run              # Run transformations
make dbt-test             # Run data quality tests
make dbt-docs             # Generate and view docs
make dbt-clean            # Clean build artifacts

# API
make api-dev              # Start API (auto-reload)
make api                  # Start API (production mode)

# Testing
make test-unit            # Unit tests (fast)
make test-integration     # Integration tests (Docker)
make test-all             # All tests
make coverage             # Test coverage report

# Code Quality
make lint                 # Lint code
make format               # Format code
make type-check           # Type checking
make quality              # All quality checks

Troubleshooting

Docker services won't start

# Check if ports already in use
lsof -i :9092  # Kafka
lsof -i :5432  # PostgreSQL
lsof -i :9000  # MinIO

# Kill conflicting processes or change ports in docker-compose.refdata.yml

Ingestion fails with Kafka connection error

# Check Kafka is running
docker ps | grep kafka

# Check Kafka logs
docker logs kafka

# Test Kafka connection
kafka-topics --list --bootstrap-server localhost:9092

DBT can't find Iceberg tables

# Re-run initialization
make init-infra

# Verify tables exist
curl http://localhost:8181/v1/namespaces/refdata/tables

API returns empty results

# Check if data exists in Silver
duckdb -c "SELECT COUNT(*) FROM iceberg_scan('s3://refdata-warehouse/silver/instruments')"

# If 0, run ingestion + DBT
make ingest-now
make dbt-run

Tests failing

# Clean and reinstall
make clean
make install-dev

# Run tests with verbose output
pytest -v -s tests/unit/

Next Steps

Now that the platform is running, read these in order:

  1. Developer Onboarding - Day-by-day learning plan (Week 1)
  2. Common Workflows - How to add exchanges, fix bugs, etc.
  3. DBT Guide - Deep dive on transformations
  4. API Guide - Using the REST API
  5. Architecture Decision Records - Why we made key choices

Join the team:

  • Slack: #k2-refdata-platform
  • Standup: Daily 10am
  • Office hours: Tuesday/Thursday 2-3pm

Quick Reference

Project Structure:

src/refdata/
├── ingestion/      # Exchange clients, Kafka producers
├── api/            # FastAPI endpoints
├── common/         # Config, logging, DB utils
└── cli/            # CLI commands

dbt/
├── models/         # SQL transformations
│   ├── bronze/     # Source definitions
│   ├── silver/     # SCD Type 2 models
│   └── gold/       # Symbology master
└── macros/         # Custom SQL functions

tests/
├── unit/           # Fast, isolated tests
├── integration/    # Docker-based tests
└── e2e/            # Full pipeline tests

Key Files:

  • CLAUDE.md - Development standards
  • README.md - Project overview
  • Makefile - All commands
  • pyproject.toml - Dependencies

Help:

make help           # Show all commands
dbt --help          # DBT help
pytest --help       # Testing help

Welcome to the team! 🚀

Questions? Ask in #k2-refdata-platform or ping @data-engineering-lead