Troubleshooting Guide

Quick reference for when things break.

Each section: Symptom → Diagnosis → Fix → Prevention


Services Won't Start

Docker Compose Fails

Symptom:

make docker-up
# Error: port already in use

Diagnosis:

# Check which ports are in use
lsof -i :9092  # Kafka
lsof -i :5432  # PostgreSQL
lsof -i :9000  # MinIO
lsof -i :8181  # Iceberg REST

Fix:

# Option 1: Stop the conflicting process (use the PID from the lsof output)
kill <PID>  # Escalate to kill -9 <PID> only if the process ignores SIGTERM

# Option 2: Change ports in docker-compose.refdata.yml
# Edit ports: "9093:9092" instead of "9092:9092"

# Option 3: Clean everything and restart
make docker-clean
make docker-up

Prevention:

  • Use project-specific ports (avoid common ones)
  • Always run make docker-down when finished
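
A pre-flight check can catch port conflicts before `make docker-up` runs. The following is a minimal sketch, not part of the project: the function names are hypothetical, the port list is copied from the Diagnosis step, and it only probes IPv4 localhost.

```python
import socket

# Ports taken from the Diagnosis step above; adjust to match your compose file.
REQUIRED_PORTS = {
    9092: "Kafka",
    5432: "PostgreSQL",
    9000: "MinIO",
    8181: "Iceberg REST",
}


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


def find_conflicts(ports: dict[int, str]) -> dict[int, str]:
    """Return the subset of ports that already have a listener."""
    return {port: name for port, name in ports.items() if port_in_use(port)}


if __name__ == "__main__":
    for port, name in find_conflicts(REQUIRED_PORTS).items():
        print(f"Port {port} ({name}) is already in use")
```

If the script prints nothing, none of the listed ports has a listener on localhost.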

Container Crashes on Startup

Symptom:

docker ps | grep kafka
# Shows container restarting repeatedly

Diagnosis:

# Check container logs
docker logs kafka --tail=100

# Common errors:
# - "Address already in use"
# - "Out of memory"
# - "Permission denied"

Fix:

If out of memory:

# Increase Docker memory
# Docker Desktop → Settings → Resources → Memory → 8GB+

# Restart Docker

If permission denied:

# Fix volume permissions
sudo chown -R $USER:$USER infrastructure/data/

# Recreate volumes
make docker-clean
make docker-up

If address in use:

# Clean up old containers
docker ps -a | grep kafka
docker rm -f <container_id>

Ingestion Failures

API Connection Errors

Symptom:

make ingest-binance
# ERROR: Connection refused to api.binance.com

Diagnosis:

# Test connectivity
curl https://api.binance.com/api/v3/exchangeInfo
ping api.binance.com

# Check DNS
nslookup api.binance.com

# Check firewall
# Corporate firewall may block crypto exchanges

Fix:

If network issue:

# Try different network (mobile hotspot)
# Or configure proxy

export HTTP_PROXY=http://proxy.company.com:8080
export HTTPS_PROXY=http://proxy.company.com:8080

If rate limited:

# Increase delay in client
# src/refdata/ingestion/sources/base.py

self.rate_limit_rps = 5  # Decrease from 10

Prevention:

  • Respect rate limits
  • Implement exponential backoff
  • Monitor error rates
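
Exponential backoff can be sketched as a small retry wrapper. This is an illustration under assumptions, not the project's client code: `call_with_backoff` and its parameters are hypothetical, and it retries on any exception, which a real client would likely narrow to network and rate-limit errors.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of retries: surface the last error
            # Upper bound doubles each attempt: base, 2*base, 4*base, ... capped
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random time up to the bound so that
            # many clients do not retry in lockstep
            sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so tests can pass a stub instead of waiting for real delays.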

Kafka Connection Errors

Symptom:

make ingest-binance
# ERROR: Failed to publish to Kafka: Connection refused

Diagnosis:

# Check Kafka is running
docker ps | grep kafka

# Check Kafka logs
docker logs kafka --tail=50

# Test connection
kafka-topics --bootstrap-server localhost:9092 --list

Fix:

If Kafka not running:

make docker-down
make docker-up

If wrong bootstrap server:

# Check .env file
grep KAFKA_BOOTSTRAP .env

# Should be: localhost:9092 (for local dev)
# Not: kafka:29092 (that's internal Docker network)

Prevention:

  • Always verify Kafka health before ingestion
  • Add health check to ingestion script
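
One way to add such a health check is to wait for the broker port before publishing. The sketch below uses only the standard library and hypothetical function names. It confirms TCP reachability only, so it catches "Kafka not running" and a wrong bootstrap address, but it does not prove the broker is fully healthy.

```python
import socket
import time


def parse_bootstrap(server: str) -> tuple[str, int]:
    """Split a 'host:port' bootstrap string into (host, port)."""
    host, _, port = server.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"Expected 'host:port', got {server!r}")
    return host, int(port)


def wait_for_broker(server: str, timeout_s: float = 30.0, interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection succeeds, False if timeout_s elapses."""
    host, port = parse_bootstrap(server)
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused, DNS failure, or timeout: retry until the deadline
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

An ingestion script could call `wait_for_broker("localhost:9092")` first and exit with a clear message when it returns False.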

Schema Registry Errors

Symptom:

make ingest-binance
# ERROR: Schema registration failed: 409 Conflict

Diagnosis:

# Check existing schemas
curl http://localhost:8081/subjects

# Get schema details
curl http://localhost:8081/subjects/refdata-binance-instrument-raw-value/versions/latest

Fix:

If 409 Conflict (schema incompatible):

# Option 1: Fix schema to be BACKWARD compatible
# Add new fields as optional (with default: null)

# Option 2: Delete and re-register (DEV ONLY!)
curl -X DELETE http://localhost:8081/subjects/refdata-binance-instrument-raw-value
python scripts/register_schemas.py

If 422 Invalid Schema:

# Validate the Avro schema file
# Use an online validator, or at minimum confirm the file is well-formed JSON.
# Note: this catches JSON syntax errors only, not Avro-specific mistakes.
python -c "
import json

with open('config/schemas/binance_instrument_raw.avsc') as f:
    json.load(f)  # Raises json.JSONDecodeError if the file is not valid JSON
"

Prevention:

  • Always test schema changes in dev first
  • Use BACKWARD compatibility mode
  • Never delete schemas in production
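
Under BACKWARD compatibility, a reader using the new schema must still be able to read data written with the old one, so every added field needs a default. The sketch below checks that one rule for top-level record fields. The function name and the sample schemas are hypothetical, and it is not a substitute for the Schema Registry's own compatibility check.

```python
def fields_missing_defaults(old_schema: dict, new_schema: dict) -> list[str]:
    """Return names of fields added in new_schema that have no default."""
    old_names = {field["name"] for field in old_schema["fields"]}
    return [
        field["name"]
        for field in new_schema["fields"]
        if field["name"] not in old_names and "default" not in field
    ]


# Illustrative schemas, not the project's real .avsc files
v1 = {
    "type": "record",
    "name": "Instrument",
    "fields": [{"name": "symbol", "type": "string"}],
}
v2_ok = {
    "type": "record",
    "name": "Instrument",
    "fields": [
        {"name": "symbol", "type": "string"},
        # Optional field: a union with null first, and a null default
        {"name": "tick_size", "type": ["null", "string"], "default": None},
    ],
}
v2_bad = {
    "type": "record",
    "name": "Instrument",
    "fields": [
        {"name": "symbol", "type": "string"},
        {"name": "tick_size", "type": "string"},  # No default: breaks old data
    ],
}

print(fields_missing_defaults(v1, v2_ok))   # []
print(fields_missing_defaults(v1, v2_bad))  # ['tick_size']
```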

DBT Errors

Connection Errors

Symptom:

make dbt-run
# Database Error: Could not connect to DuckDB

Diagnosis:

# Check DuckDB can access S3
duckdb -c "
SELECT * FROM iceberg_scan('s3://refdata-warehouse/bronze/instruments/binance')
LIMIT 1
"

# Check credentials
echo $DBT_S3_ACCESS_KEY_ID
echo $DBT_S3_ENDPOINT

Fix:

If credentials missing:

# Set environment variables
export DBT_S3_ENDPOINT=localhost:9000
export DBT_S3_ACCESS_KEY_ID=admin
export DBT_S3_SECRET_ACCESS_KEY=password

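# Or append them to the .env file
# (>> appends; a single > would overwrite existing entries such as the Kafka settings)
cat >> .env << EOF
DBT_S3_ENDPOINT=localhost:9000
DBT_S3_ACCESS_KEY_ID=admin
DBT_S3_SECRET_ACCESS_KEY=password
EOF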

If MinIO not accessible:

# Check MinIO running
curl http://localhost:9000/minio/health/live

# Restart if needed
docker restart minio

Compilation Errors

Symptom:

make dbt-run
# Compilation Error: Undefined macro 'normalize_asset'

Diagnosis:

cd dbt

# Check macro exists
ls -la macros/normalize_asset.sql

# Compile specific model
dbt compile --select silver_instruments

Fix:

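If macro missing:

# Check that a .sql file under macros/ defines the macro.
# dbt resolves macros by the name in {% macro ... %}, not by the filename.
grep -rn "macro normalize_asset" macros/

# Refresh DBT project
dbt clean
dbt deps  # If using packages
dbt compile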

If syntax error in macro:

# View compiled SQL to see error
cat target/compiled/k2_refdata/models/silver/silver_instruments.sql

# Look for Jinja errors

SQL Errors

Symptom:

make dbt-run
# Database Error: column "tick_size" does not exist

Diagnosis:

# View compiled SQL
cat target/run/k2_refdata/models/silver/silver_instruments.sql

# Check source table schema
duckdb -c "
DESCRIBE iceberg_scan('s3://refdata-warehouse/bronze/instruments/binance')
"

Fix:

If column name mismatch:

-- Fix in DBT model
-- Change: tick_size
-- To: json_extract_string(data, '$.tickSize')

If table not found:

# Re-create Iceberg tables
python scripts/init_iceberg_catalog.py

# Full refresh
dbt run --full-refresh

Test Failures

Symptom:

make dbt-test
# FAIL 1 unique_silver_instruments_instrument_sk

Diagnosis:

# View failed test SQL
cat target/compiled/k2_refdata/models/silver/silver_instruments.yml/unique_silver_instruments_instrument_sk.sql

# Run test SQL manually
duckdb -c "$(cat target/compiled/.../unique_silver_instruments_instrument_sk.sql)"

Fix:

If duplicates found:

# Option 1: Fix source data
# Investigate why duplicates exist

# Option 2: Add DISTINCT in model
# (Usually wrong approach - fix root cause)

# Option 3: Full refresh
dbt run --full-refresh --select silver_instruments

Prevention:

  • Always run tests before committing
  • Add tests for new fields
  • Monitor test results in CI/CD

API Issues

API Won't Start

Symptom:

make api-dev
# ERROR: Unable to start server

Diagnosis:

# Check port available
lsof -i :8001

# Check Python environment
which python
python --version

# Try running directly
python -m refdata.api.main

Fix:

If port in use:

# Kill process
lsof -i :8001 | grep LISTEN | awk '{print $2}' | xargs kill -9

# Or change port
uvicorn refdata.api.main:app --port 8002

If import errors:

# Reinstall dependencies
make clean
make install-dev

# Check module installed
pip list | grep refdata

Connection Pool Errors

Symptom:

curl http://localhost:8001/v1/instruments
# ERROR: No connection available within 30s

Diagnosis:

# Check API logs
kubectl logs -l app=refdata-api | grep "connection"

# Check pool size
# Look for: "Connection pool initialized pool_size=X"

Fix:

If pool exhausted:

# Increase pool size
# src/refdata/common/duckdb_pool.py

DuckDBConnectionPool(
    min_connections=10,  # Increase
    max_connections=100,  # Increase
)

If connection leak:

# Ensure connections are returned
# Use context manager:

with pool.get_connection() as conn:
    result = conn.execute(query).fetchall()
# Connection automatically returned
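
To show why the context manager prevents leaks, here is a minimal sketch of a pool whose `get_connection` returns the connection in a `finally` block. `SimplePool` is hypothetical and is not the project's `DuckDBConnectionPool`; the connection factory is a stand-in for whatever opens a real connection.

```python
import queue
from contextlib import contextmanager
from typing import Callable, Iterator


class SimplePool:
    """Hypothetical fixed-size pool; not the project's DuckDBConnectionPool."""

    def __init__(self, factory: Callable[[], object], size: int, timeout_s: float = 30.0):
        self._timeout_s = timeout_s
        self._idle: queue.Queue = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def get_connection(self) -> Iterator[object]:
        # Blocks until a connection is free; raises queue.Empty after timeout_s
        conn = self._idle.get(timeout=self._timeout_s)
        try:
            yield conn
        finally:
            # Runs even if the caller's query raises, so nothing leaks
            self._idle.put(conn)

    @property
    def available(self) -> int:
        """Number of idle connections right now."""
        return self._idle.qsize()
```

Code that takes a connection without the `with` block has to return it on every path, including error paths. A missed error path is a common source of a pool that slowly drains.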

Slow Queries

Symptom:

curl http://localhost:8001/v1/instruments
# Takes > 5 seconds

Diagnosis:

# Check API logs for duration
kubectl logs -l app=refdata-api | grep duration_ms

# Profile query
duckdb -c "
EXPLAIN ANALYZE
SELECT * FROM iceberg_scan('s3://refdata-warehouse/silver/instruments')
WHERE exchange = 'binance'
"

Fix:

If full table scan:

-- Add filters to prune partitions
WHERE exchange = 'binance'  -- Partitioned by exchange
  AND valid_from >= '2024-01-01'  -- Partitioned by months(valid_from)

If large result set:

# Add pagination
limit: int = Query(100, le=1000)  # Max 1000

If Iceberg metadata loading is slow:

# Increase DuckDB memory
# dbt/profiles.yml
settings:
  memory_limit: '8GB'  # Increase from 4GB

Data Quality Problems

Duplicate Records

Symptom:

make dbt-test
# FAIL unique_silver_instruments_instrument_sk

Diagnosis:

# Find duplicates
duckdb -c "
SELECT instrument_sk, COUNT(*)
FROM iceberg_scan('s3://refdata-warehouse/silver/instruments')
WHERE valid_to IS NULL
GROUP BY instrument_sk
HAVING COUNT(*) > 1
"

Fix:

If SCD Type 2 not closing records:

-- Check the UPDATE logic in the DBT model
-- It should set valid_to on the current record before inserting the new version
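
The close-then-insert order can be sketched with an in-memory SQLite table. SQLite is only a stand-in here: the project uses DBT on Iceberg. The `upsert_scd2` function and the sample key are hypothetical, while the column names follow the queries in this guide.

```python
import sqlite3


def upsert_scd2(conn: sqlite3.Connection, instrument_sk: str, tick_size: str, as_of: str) -> None:
    """Close the current version of a record, then insert the new version."""
    with conn:  # One transaction: both statements commit or neither does
        conn.execute(
            "UPDATE instruments SET valid_to = ? "
            "WHERE instrument_sk = ? AND valid_to IS NULL",
            (as_of, instrument_sk),
        )
        conn.execute(
            "INSERT INTO instruments (instrument_sk, tick_size, valid_from, valid_to) "
            "VALUES (?, ?, ?, NULL)",
            (instrument_sk, tick_size, as_of),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instruments "
    "(instrument_sk TEXT, tick_size TEXT, valid_from TEXT, valid_to TEXT)"
)
upsert_scd2(conn, "binance:BTCUSDT", "0.01", "2024-01-01")
upsert_scd2(conn, "binance:BTCUSDT", "0.10", "2024-02-01")

# Same check as the duplicate query above: open records per key
open_rows = conn.execute(
    "SELECT COUNT(*) FROM instruments WHERE valid_to IS NULL"
).fetchone()[0]
print(open_rows)  # 1
```

If the UPDATE were skipped, both versions would have `valid_to IS NULL` and the uniqueness test on open records would fail.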

If same record ingested twice:

# Check state store
psql -h localhost -U postgres -d refdata -c "
SELECT exchange, last_hash, last_ingestion_timestamp
FROM ingestion_state
"

# If hash not updating, fix state store logic
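
Hash-based deduplication can be sketched as follows. This assumes, based on the `last_hash` column above, that the state store keeps one content hash per exchange. The function names are hypothetical and the real state store logic may differ.

```python
import hashlib
import json
from typing import Optional


def payload_hash(payload: dict) -> str:
    """Hash a payload so that key order and whitespace do not affect the result."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def should_ingest(payload: dict, last_hash: Optional[str]) -> tuple[bool, str]:
    """Return (ingest?, new_hash); skip when the payload matches the stored hash."""
    new_hash = payload_hash(payload)
    return new_hash != last_hash, new_hash
```

The caller still has to write `new_hash` back to the state store after a successful publish. If that write is skipped or fails silently, every run looks like new data and the same record is ingested again.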

Missing Data

Symptom:

duckdb -c "SELECT COUNT(*) FROM iceberg_scan('s3://refdata-warehouse/silver/instruments')"
# Returns 0 or fewer rows than expected

Diagnosis:

# Check each layer
# Bronze:
duckdb -c "SELECT COUNT(*) FROM iceberg_scan('s3://refdata-warehouse/bronze/instruments/binance')"

# If Bronze has data but Silver doesn't:
# DBT transformation issue

# If Bronze has no data:
# Ingestion issue

Fix:

If Bronze missing:

# Re-run ingestion
make ingest-now

If Silver missing:

# Check DBT incremental logic
# May be filtering out all records

# Full refresh
dbt run --full-refresh --select silver_instruments

Null Values

Symptom:

curl http://localhost:8001/v1/instruments | jq '.data[0].tick_size'
# Returns null instead of value

Diagnosis:

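# Check Bronze has value
# -noheader -list prints the bare column value so jq can parse it
# (assumes api_response_raw holds the raw JSON response as a string)
duckdb -noheader -list -c "
SELECT api_response_raw
FROM iceberg_scan('s3://refdata-warehouse/bronze/instruments/binance')
LIMIT 1
" | jq '.symbols[0].filters'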

# Check Silver parsing
duckdb -c "
SELECT tick_size
FROM iceberg_scan('s3://refdata-warehouse/silver/instruments')
WHERE exchange = 'binance'
LIMIT 10
"

Fix:

If Bronze has value but Silver doesn't:

-- Fix JSON parsing in DBT model
-- Check field path is correct

-- Before:
json_extract_string(data, '$.tickSize')

-- After (if nested):
json_extract_string(
  json_extract(data, '$.filters[0]'),
  '$.tickSize'
)
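
A positional path such as `$.filters[0]` only works if the price filter is always the first entry. The sketch below shows a lookup by `filterType` instead, which is a quick way to check what value the model should produce. The function name is hypothetical and the payload is a hand-written, trimmed sample, not a real API response.

```python
import json
from typing import Optional


def extract_tick_size(symbol: dict) -> Optional[str]:
    """Return tickSize from the PRICE_FILTER entry, or None if it is absent."""
    for flt in symbol.get("filters", []):
        if flt.get("filterType") == "PRICE_FILTER":
            return flt.get("tickSize")
    return None


# Trimmed, illustrative payload with PRICE_FILTER deliberately not first
raw = json.loads("""
{"symbols": [{"symbol": "BTCUSDT", "filters": [
    {"filterType": "LOT_SIZE", "stepSize": "0.00001"},
    {"filterType": "PRICE_FILTER", "tickSize": "0.01"}
]}]}
""")

print(extract_tick_size(raw["symbols"][0]))  # 0.01
```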

Performance Issues

High API Latency

Symptom:

curl http://localhost:8001/v1/instruments
# Takes > 2 seconds

Diagnosis:

# Check metrics
curl http://localhost:8001/metrics | grep http_request_duration

# Check DuckDB query time
# Add logging to queries.py

Fix:

  1. Increase connection pool size
  2. Add query caching
  3. Optimize Iceberg partitioning
  4. Add filters to queries
  5. Increase API replicas (horizontal scaling)

See COMMON-WORKFLOWS.md for details.
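
As a rough illustration of query caching (item 2), the sketch below keeps results in memory for a fixed time-to-live. `TTLCache` is hypothetical, is not thread-safe, and never evicts expired keys. A production setup would more likely use an existing caching library or an external cache.

```python
import time
from typing import Any, Callable, Hashable


class TTLCache:
    """Hypothetical in-process cache with per-entry expiry."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        # key -> (expiry time, cached value)
        self._entries: dict[Hashable, tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        """Return the cached value for key, calling compute() only when stale."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # Still fresh: skip the query
        value = compute()
        self._entries[key] = (now + self._ttl_s, value)
        return value
```

Reference data changes rarely, so even a short TTL can remove most repeated scans. The trade-off is that clients may see data up to `ttl_s` seconds old.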

DBT Runs Slow

Symptom:

make dbt-run
# Takes > 10 minutes

Diagnosis:

# Time each model
dbt run --select silver_instruments --debug

# Check model materialization
grep "materialized" dbt/models/silver/silver_instruments.sql

Fix:

If running a full refresh when the model should be incremental:

# Remove --full-refresh flag
dbt run --select silver_instruments  # Without --full-refresh

If incremental logic slow:

-- Optimize WHERE clause
{% if is_incremental() %}
WHERE ingestion_timestamp > (
  -- Cache this value instead of recalculating
  SELECT MAX(record_created_at) FROM {{ this }}
)
{% endif %}

If too many models:

# Run in parallel
dbt run --threads 8

Getting Help

Self-Service

  1. Search docs: grep -r "error message" docs/
  2. Check logs: Always check logs first
  3. Read error message: It usually tells you what's wrong
  4. Google: Search for exact error message

Escalation

  1. Slack: #k2-refdata-platform

    • Include: error message, what you tried, logs
  2. Office Hours: Tuesday/Thursday 2-3pm

    • Come prepared with specifics
  3. PagerDuty: Production emergencies only

    • On-call: data-engineering-oncall

Creating Good Bug Reports

Include:

  • What you expected: "API should return data"
  • What happened: "API returns 500 error"
  • Steps to reproduce:
    1. make docker-up
    2. make api-dev
    3. curl http://localhost:8001/v1/instruments
  • Environment: Local/staging/production
  • Logs: Relevant log excerpts
  • What you tried: "I restarted the API, same error"

Good example:

BUG: API returns 500 on /v1/instruments

Expected: Return list of instruments
Actual: 500 Internal Server Error

Steps to reproduce:
1. make docker-up
2. make ingest-binance
3. make dbt-run
4. make api-dev
5. curl http://localhost:8001/v1/instruments

Environment: Local development

Logs:
ERROR: DuckDB connection pool exhausted
Traceback: ...

What I tried:
- Restarted API: same error
- Checked DuckDB connection: works manually
- Increased pool size to 20: still fails

Help needed: How to debug connection pool exhaustion?

Common Error Messages

"Connection refused"

Meaning: Service not running or wrong port

Fix:

  • Check service is running: docker ps
  • Check correct port in config
  • Restart service

"Address already in use"

Meaning: Port conflict

Fix:

  • Kill process: lsof -i :<port> | grep LISTEN | awk '{print $2}' | xargs kill
  • Or change port

"Schema validation failed"

Meaning: Avro schema incompatible

Fix:

  • Make schema BACKWARD compatible
  • Add default values to new fields
  • Never remove required fields

"Permission denied"

Meaning: File/directory permissions

Fix:

  • sudo chown -R $USER:$USER <directory>
  • Check Docker volume permissions

"Out of memory"

Meaning: Container/process exceeded memory limit

Fix:

  • Increase Docker memory (Docker Desktop settings)
  • Increase DuckDB memory limit in profiles.yml
  • Optimize queries to use less memory

Still stuck? Ask in #k2-refdata-platform - we're here to help!