Getting Started - K2 Reference Data Platform

Time to first run: 30 minutes

Goal: Get the entire platform running locally and understand the data flow.

What You're Building

A golden source of crypto reference data that:

Polls exchange APIs (Binance, Kraken) hourly for instrument specs
Stores raw responses in Bronze (Iceberg tables)
Transforms to normalized history in Silver (SCD Type 2)
Creates cross-exchange mappings in Gold (symbology)
Serves data via REST API with point-in-time queries

Why it matters: Trading systems need accurate instrument specifications (tick size, lot size) with full historical audit trails.

Prerequisites

Install these first (5 minutes):

# Python 3.11+
python3 --version  # Should be 3.11 or higher

# uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Docker & Docker Compose
docker --version
docker compose version

# DuckDB CLI (for debugging)
brew install duckdb  # macOS
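# or, on Linux (download, unzip, then put the duckdb binary on your PATH)
wget https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip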

Quick Setup (25 minutes)

Step 1: Clone and Install (5 min)

# Clone repository
git clone https://github.com/k2/k2-reference-data-platform.git
cd k2-reference-data-platform

# Install dependencies
make install-dev

# Verify installation
make help  # Should show all available commands

What this does: Installs Python packages, sets up pre-commit hooks, and configures the development environment.

Step 2: Start Infrastructure (5 min)

# Start Docker services
make docker-up

# Wait for services to be ready (~30 seconds)
# You should see:
# ✔ Container kafka        Started
# ✔ Container postgres     Started
# ✔ Container minio        Started
# ✔ Container iceberg-rest Started

What this does: Starts Kafka, PostgreSQL, MinIO (S3), and Iceberg REST catalog.

Verify services:

# Check all containers running
docker ps | grep -E "kafka|postgres|minio|iceberg"

# Test MinIO (S3); -i prints the response headers, since the body is empty
curl -i http://localhost:9000/minio/health/live
# Expected: a status line containing 200 OK

# Test Iceberg catalog
curl http://localhost:8181/v1/config
# Expected: JSON config
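The manual checks above can also be scripted. The following is a minimal sketch, not part of the repository: it polls the two HTTP endpoints used in the curl commands until both answer with HTTP 200. The function names (`http_ok`, `wait_until_ready`) and the retry settings are assumptions made for illustration.

```python
import http.client
import time
import urllib.request
from typing import Callable, Dict, List

# URLs copied from the curl checks in this step.
SERVICE_URLS: Dict[str, str] = {
    "minio": "http://localhost:9000/minio/health/live",
    "iceberg-rest": "http://localhost:8181/v1/config",
}


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts and HTTP errors all mean "not ready".
        return False


def wait_until_ready(
    urls: Dict[str, str],
    probe: Callable[[str], bool] = http_ok,
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until every URL passes `probe`; return the names still failing."""
    pending = dict(urls)
    for _ in range(attempts):
        pending = {name: url for name, url in pending.items() if not probe(url)}
        if not pending:
            return []
        sleep(delay)
    return sorted(pending)
```

Calling `wait_until_ready(SERVICE_URLS)` after `make docker-up` returns an empty list once both services respond, or the names of the services that never became ready. Kafka and PostgreSQL do not speak HTTP, so they are not covered by this sketch.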

Step 3: Initialize Data Warehouse (5 min)

# Create Iceberg tables and register Avro schemas
make init-infra

# Expected output:
# ✓ Created Iceberg namespace: refdata
# ✓ Created table: bronze_instruments_binance
# ✓ Created table: bronze_instruments_kraken
# ✓ Registered schema: refdata-binance-instrument-raw-value
# ✓ Registered schema: refdata-kraken-instrument-raw-value

What this does: Creates empty Iceberg tables in MinIO and registers Avro schemas with the Schema Registry.

Step 4: Run First Ingestion (3 min)

# Ingest data from Binance
make ingest-binance

# Expected output:
# INFO: Ingesting instruments exchange=binance
# INFO: Fetched 500 instruments
# INFO: Published to Kafka topic=refdata.instruments.binance.raw
# INFO: Updated state store

Verify data landed:

# Check Kafka topic
kafka-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic refdata.instruments.binance.raw \
  --from-beginning \
  --max-messages 1

# You should see one Avro-serialized message (binary, so parts of it may not be readable)

Step 5: Run DBT Transformations (3 min)

# Transform Bronze → Silver → Gold
make dbt-run

# Expected output:
# Running with dbt=1.5.0
# Completed successfully
#
# Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

Verify transformations:

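# Query Silver table
# Note: iceberg_scan needs DuckDB's iceberg extension and S3 settings that point at the local MinIO endpoint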
duckdb -c "
SELECT COUNT(*) FROM iceberg_scan('s3://refdata-warehouse/silver/instruments')
"
# Expected: ~500 rows

# Query Gold symbology
duckdb -c "
SELECT * FROM iceberg_scan('s3://refdata-warehouse/gold/symbology') LIMIT 5
"
# Expected: Canonical IDs like BTC-USD-SPOT

Step 6: Start API (2 min)

# Start FastAPI server
make api-dev

# Expected output:
# INFO: Starting K2 Reference Data API version=1.0.0
# INFO: Connection pool initialized pool_size=5
# INFO: Uvicorn running on http://0.0.0.0:8001

Test API:

# Health check
curl http://localhost:8001/health
# Expected: {"status": "healthy", "version": "1.0.0"}

# Query instruments
curl "http://localhost:8001/v1/instruments?exchange=binance&limit=5"
# Expected: JSON with 5 instruments

# Interactive docs
open http://localhost:8001/docs  # macOS; on Linux use xdg-open

Step 7: Run Tests (2 min)

# Run unit tests (fast, no Docker needed)
make test-unit

# Run integration tests (uses Docker services)
make test-integration

# Expected: All tests passing
# ======================== 71 passed in 5.23s ========================

Understanding the Data Flow

Visual Flow

┌─────────────┐
│  Binance    │  GET /api/v3/exchangeInfo
│  Kraken     │  GET /0/public/AssetPairs
└──────┬──────┘
       │ Python ingestion (hourly cron)
       ▼
┌─────────────┐
│   Kafka     │  refdata.instruments.{exchange}.raw
└──────┬──────┘
       │ Kafka → Iceberg consumer
       ▼
┌─────────────┐
│  Bronze     │  Raw JSON in Iceberg tables
│  (Iceberg)  │  Partitioned by day, 7-day retention
└──────┬──────┘
       │ DBT transformation (hourly)
       ▼
┌─────────────┐
│  Silver     │  Normalized, historized (SCD Type 2)
│  (Iceberg)  │  Bitemporal: valid_from/to + record_created_at
└──────┬──────┘
       │ DBT symbology mapping
       ▼
┌─────────────┐
│   Gold      │  Canonical IDs, cross-exchange mapping
│  (Iceberg)  │  BTC-USD-SPOT → {binance: BTCUSDT, kraken: XBT/USD}
└──────┬──────┘
       │ DuckDB queries
       ▼
┌─────────────┐
│  FastAPI    │  REST API with bitemporal queries
│  (Port 8001)│  GET /v1/instruments?as_of=...
└─────────────┘

Example: Single Instrument Journey

1. Exchange API Response (Bronze):

{
  "symbol": "BTCUSDT",
  "baseAsset": "BTC",
  "quoteAsset": "USDT",
  "filters": [
    {"filterType": "PRICE_FILTER", "tickSize": "0.01"}
  ]
}

2. DBT Transformation (Silver):

-- Parsed to relational schema
exchange: 'binance'
symbol: 'BTCUSDT'
tick_size: 0.01
valid_from: '2024-01-23 10:00:00'
valid_to: NULL  -- Current record
record_created_at: '2024-01-23 10:05:00'
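To make the Bronze to Silver step concrete, here is a rough Python sketch of the flattening it performs on the sample response. The platform does this in DBT SQL models, so this is only an illustration of the logic; the function name `flatten_binance_instrument` is hypothetical, and the temporal columns (valid_from, valid_to, record_created_at) are left out.

```python
import json
from decimal import Decimal
from typing import Any, Dict, Optional

# The sample Bronze payload from step 1.
RAW_RESPONSE = """
{
  "symbol": "BTCUSDT",
  "baseAsset": "BTC",
  "quoteAsset": "USDT",
  "filters": [
    {"filterType": "PRICE_FILTER", "tickSize": "0.01"}
  ]
}
"""


def flatten_binance_instrument(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the nested PRICE_FILTER tick size up into a flat row."""
    tick_size: Optional[Decimal] = None
    for entry in raw.get("filters", []):
        if entry.get("filterType") == "PRICE_FILTER":
            # Decimal keeps the exchange's exact value; float could round it.
            tick_size = Decimal(entry["tickSize"])
            break
    return {
        "exchange": "binance",
        "symbol": raw["symbol"],
        "base_asset": raw["baseAsset"],
        "quote_asset": raw["quoteAsset"],
        "tick_size": tick_size,
    }


row = flatten_binance_instrument(json.loads(RAW_RESPONSE))
print(row["symbol"], row["tick_size"])  # BTCUSDT 0.01
```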

3. Symbology Mapping (Gold):

canonical_id: 'BTC-USD-SPOT'
base_asset: 'BTC'  -- Normalized
quote_asset: 'USD'  -- USDT → USD
binance_symbol: 'BTCUSDT'
kraken_symbol: 'XBT/USD'  -- Different symbol, same instrument!

4. API Query:

# Current state
curl "http://localhost:8001/v1/instruments?exchange=binance&symbol=BTCUSDT"

# Historical state (point-in-time)
curl "http://localhost:8001/v1/instruments?exchange=binance&symbol=BTCUSDT&as_of=2024-01-15T10:00:00Z"

# Symbology lookup
curl "http://localhost:8001/v1/symbology/BTC-USD-SPOT"
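The same queries can be issued from Python. This is a small sketch using only the standard library; the helper names (`instruments_url`, `fetch_json`) are made up for this example, and the response shape is not assumed beyond it being JSON.

```python
import json
import urllib.request
from typing import Any, Optional
from urllib.parse import urlencode

BASE_URL = "http://localhost:8001"  # port taken from the API step above


def instruments_url(
    exchange: str,
    symbol: str,
    as_of: Optional[str] = None,
    base_url: str = BASE_URL,
) -> str:
    """Build the /v1/instruments URL; `as_of` enables a point-in-time query."""
    params = {"exchange": exchange, "symbol": symbol}
    if as_of is not None:
        params["as_of"] = as_of  # ISO 8601 timestamp, URL-encoded below
    return f"{base_url}/v1/instruments?{urlencode(params)}"


def fetch_json(url: str, timeout: float = 5.0) -> Any:
    """GET `url` and decode the JSON body (requires the API to be running)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


current = instruments_url("binance", "BTCUSDT")
historical = instruments_url("binance", "BTCUSDT", as_of="2024-01-15T10:00:00Z")
print(current)
print(historical)
```

With the API started via `make api-dev`, `fetch_json(historical)` should return the instrument as it was at the requested time.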

Core Concepts (Read This!)

Bitemporal Modeling

Problem: On Jan 10, an exchange announces that a tick size will change on Jan 15. We ingest the announcement on Jan 11. On Jan 16, the exchange issues a correction: the change actually took effect on Jan 14.

Solution: Track TWO timestamps:

Business Time (valid_from, valid_to): When spec was effective in reality
System Time (record_created_at): When we learned about it

Query: "What was tick_size on Jan 14 10pm?"

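Selects records where valid_from <= Jan 14 10pm < valid_to (a NULL valid_to means the record is still current)
If corrections produced several matching records, uses the one with the latest record_created_at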
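The lookup rule above can be sketched in a few lines of Python. This is an illustration of the rule, not the platform's query code: the `InstrumentVersion` class, the `tick_size_as_of` function, the year 2024, and the tick sizes 0.01 and 0.10 are all assumptions chosen to replay the Jan 10 to Jan 16 scenario.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import List, Optional


@dataclass(frozen=True)
class InstrumentVersion:
    tick_size: Decimal
    valid_from: datetime            # business time: start of validity
    valid_to: Optional[datetime]    # business time: end (None = still current)
    record_created_at: datetime     # system time: when we learned this


def tick_size_as_of(
    versions: List[InstrumentVersion],
    business_time: datetime,
    known_as_of: Optional[datetime] = None,
) -> Optional[Decimal]:
    """Tick size effective at `business_time`, using the latest knowledge
    (optionally only knowledge recorded up to `known_as_of`)."""
    candidates = [
        v for v in versions
        if v.valid_from <= business_time
        and (v.valid_to is None or business_time < v.valid_to)
        and (known_as_of is None or v.record_created_at <= known_as_of)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda v: v.record_created_at).tick_size


JAN = lambda day, hour=0: datetime(2024, 1, day, hour)  # hypothetical year
HISTORY = [
    # Ingested Jan 11: change announced for Jan 15.
    InstrumentVersion(Decimal("0.01"), JAN(1), JAN(15), JAN(11)),
    InstrumentVersion(Decimal("0.10"), JAN(15), None, JAN(11)),
    # Correction ingested Jan 16: the change was effective Jan 14.
    InstrumentVersion(Decimal("0.01"), JAN(1), JAN(14), JAN(16)),
    InstrumentVersion(Decimal("0.10"), JAN(14), None, JAN(16)),
]

print(tick_size_as_of(HISTORY, JAN(14, 22)))                       # 0.10
print(tick_size_as_of(HISTORY, JAN(14, 22), known_as_of=JAN(12)))  # 0.01
```

The second call shows why system time is kept: it reproduces what the platform believed on Jan 12, before the correction arrived.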

Read: ADR-001: Bitemporal Modeling

SCD Type 2 (Slowly Changing Dimensions)

Problem: Instrument specs change over time. How to track history?

Solution: Never overwrite a record's attributes. Close the current version by setting its valid_to, then INSERT a new version:

-- Old record (closed)
valid_from: 2024-01-10
valid_to: 2024-01-15  -- Closed at the new record's valid_from, so the two never overlap

-- New record (current)
valid_from: 2024-01-15
valid_to: NULL  -- Current

Result: Full audit trail of all changes.
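A minimal in-memory sketch of this versioning pattern follows. It assumes half-open validity intervals, matching the valid_from <= t < valid_to rule used for point-in-time queries. The Silver models implement this in DBT; the `Version` class and `apply_change` function are hypothetical names, and the tick sizes are example values.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Version:
    tick_size: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None marks the current version


def apply_change(
    history: List[Version], tick_size: str, effective_from: datetime
) -> List[Version]:
    """Return a new history with the current version closed and a new one added."""
    if history and history[-1].valid_to is None:
        current = history[-1]
        if current.tick_size == tick_size:
            return list(history)  # nothing changed, so no new version
        # Close the old version exactly where the new one starts.
        closed = replace(current, valid_to=effective_from)
        return [*history[:-1], closed, Version(tick_size, effective_from)]
    return [*history, Version(tick_size, effective_from)]


history: List[Version] = []
history = apply_change(history, "0.01", datetime(2024, 1, 10))
history = apply_change(history, "0.10", datetime(2024, 1, 15))
for version in history:
    print(version)
```

Both versions remain in `history`, so earlier states stay queryable after the change.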

Symbology Normalization

Problem: Same instrument, different symbols:

Binance: BTCUSDT
Kraken: XBT/USD (XBT instead of BTC!)
Coinbase: BTC-USD

Solution: Canonical IDs: BTC-USD-SPOT

Read: ADR-004: Symbology Mapping
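As a rough illustration of the idea, the sketch below derives the canonical ID from the three exchange symbols listed above. The alias table contains only the two mappings mentioned in this guide (XBT to BTC, USDT to USD); the real mapping rules are described in ADR-004 and are likely more involved. `canonical_id` and `ASSET_ALIASES` are hypothetical names.

```python
# Asset aliases taken from the examples in this guide; the real table is assumed larger.
ASSET_ALIASES = {"XBT": "BTC", "USDT": "USD"}


def normalize_asset(asset: str) -> str:
    """Map an exchange-specific asset code to its canonical code."""
    code = asset.upper()
    return ASSET_ALIASES.get(code, code)


def canonical_id(base: str, quote: str, product: str = "SPOT") -> str:
    """Build a canonical ID such as BTC-USD-SPOT."""
    return f"{normalize_asset(base)}-{normalize_asset(quote)}-{product}"


# Binance symbols have no separator, so use baseAsset/quoteAsset from its API response.
print(canonical_id("BTC", "USDT"))            # BTC-USD-SPOT
# Kraken and Coinbase symbols can be split on their separators.
print(canonical_id(*"XBT/USD".split("/")))    # BTC-USD-SPOT
print(canonical_id(*"BTC-USD".split("-")))    # BTC-USD-SPOT
```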

Common Commands

# Development
make install-dev          # Install dependencies
make docker-up            # Start services
make docker-down          # Stop services
make docker-clean         # Stop + remove volumes

# Ingestion
make ingest-binance       # Ingest Binance data
make ingest-kraken        # Ingest Kraken data
make ingest-now           # Ingest all exchanges

# DBT
make dbt-run              # Run transformations
make dbt-test             # Run data quality tests
make dbt-docs             # Generate and view docs
make dbt-clean            # Clean build artifacts

# API
make api-dev              # Start API (auto-reload)
make api                  # Start API (production mode)

# Testing
make test-unit            # Unit tests (fast)
make test-integration     # Integration tests (Docker)
make test-all             # All tests
make coverage             # Test coverage report

# Code Quality
make lint                 # Lint code
make format               # Format code
make type-check           # Type checking
make quality              # All quality checks

Troubleshooting

Docker services won't start

# Check if ports already in use
lsof -i :9092  # Kafka
lsof -i :5432  # PostgreSQL
lsof -i :9000  # MinIO

# Kill conflicting processes or change ports in docker-compose.refdata.yml
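Where lsof is not available, a similar check can be approximated in Python. This is a sketch under the assumption that a successful TCP connection to localhost means something is already listening on the port; it does not tell you which process owns it. The function names are made up for this example.

```python
import socket
from typing import Dict

# Ports from the lsof commands above.
REFDATA_PORTS: Dict[str, int] = {"kafka": 9092, "postgres": 5432, "minio": 9000}


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


def busy_ports(ports: Dict[str, int]) -> Dict[str, int]:
    """Return the subset of `ports` that already has a listener."""
    return {name: port for name, port in ports.items() if port_in_use(port)}
```

Running `busy_ports(REFDATA_PORTS)` before `make docker-up` lists the services whose ports are already taken.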

Ingestion fails with Kafka connection error

# Check Kafka is running
docker ps | grep kafka

# Check Kafka logs
docker logs kafka

# Test Kafka connection
kafka-topics --list --bootstrap-server localhost:9092

DBT can't find Iceberg tables

# Re-run initialization
make init-infra

# Verify tables exist
curl http://localhost:8181/v1/namespaces/refdata/tables

API returns empty results

# Check if data exists in Silver
duckdb -c "SELECT COUNT(*) FROM iceberg_scan('s3://refdata-warehouse/silver/instruments')"

# If 0, run ingestion + DBT
make ingest-now
make dbt-run

Tests failing

# Clean and reinstall
make clean
make install-dev

# Run tests with verbose output
pytest -v -s tests/unit/

Next Steps

Now that you're running, read these in order:

Developer Onboarding - Day-by-day learning plan (Week 1)
Common Workflows - How to add exchanges, fix bugs, etc.
DBT Guide - Deep dive on transformations
API Guide - Using the REST API
Architecture Decision Records - Why we made key choices

Join the team:

Slack: #k2-refdata-platform
Standup: Daily 10am
Office hours: Tuesday/Thursday 2-3pm

Quick Reference

Project Structure:

src/refdata/
├── ingestion/      # Exchange clients, Kafka producers
├── api/            # FastAPI endpoints
├── common/         # Config, logging, DB utils
└── cli/            # CLI commands

dbt/
├── models/         # SQL transformations
│   ├── bronze/     # Source definitions
│   ├── silver/     # SCD Type 2 models
│   └── gold/       # Symbology master
└── macros/         # Custom SQL functions

tests/
├── unit/           # Fast, isolated tests
├── integration/    # Docker-based tests
└── e2e/            # Full pipeline tests

Key Files:

CLAUDE.md - Development standards
README.md - Project overview
Makefile - All commands
pyproject.toml - Dependencies

Help:

make help           # Show all commands
dbt --help          # DBT help
pytest --help       # Testing help

Welcome to the team! 🚀

Questions? Ask in #k2-refdata-platform or ping @data-engineering-lead
