Troubleshooting Guide

Quick reference for when things break.

Each section: Symptom → Diagnosis → Fix → Prevention

Services Won't Start
Ingestion Failures
DBT Errors
API Issues
Data Quality Problems
Performance Issues

Services Won't Start

Docker Compose Fails

Symptom:

make docker-up
# Error: port already in use

Diagnosis:

# Check which ports are in use
lsof -i :9092  # Kafka
lsof -i :5432  # PostgreSQL
lsof -i :9000  # MinIO
lsof -i :8181  # Iceberg REST

Fix:

# Option 1: Kill conflicting processes
kill -9 <PID>

# Option 2: Change ports in docker-compose.refdata.yml
# Edit ports: "9093:9092" instead of "9092:9092"

# Option 3: Clean everything and restart
make docker-clean
make docker-up

Prevention:

Use project-specific ports (avoid common ones)
Always run make docker-down when finished
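
The lsof checks above can also be scripted and run before make docker-up. This is a minimal sketch, assuming the four services listen on localhost at the ports listed in the diagnosis step. The helper names port_in_use and busy_ports are illustrative and not part of the project.

```python
import socket

# Ports taken from the diagnosis step above.
SERVICE_PORTS = {"Kafka": 9092, "PostgreSQL": 5432, "MinIO": 9000, "Iceberg REST": 8181}


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising.
        return sock.connect_ex((host, port)) == 0


def busy_ports(ports: dict[str, int]) -> dict[str, int]:
    """Return the subset of services whose port is already taken."""
    return {name: port for name, port in ports.items() if port_in_use(port)}


if __name__ == "__main__":
    for name, port in busy_ports(SERVICE_PORTS).items():
        print(f"{name}: port {port} is already in use")
```

An empty result means none of the listed ports is taken. It does not tell you which process holds a busy port, so lsof is still needed for that.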

Container Crashes on Startup

Symptom:

docker ps | grep kafka
# Shows container restarting repeatedly

Diagnosis:

# Check container logs
docker logs kafka --tail=100

# Common errors:
# - "Address already in use"
# - "Out of memory"
# - "Permission denied"

Fix:

If out of memory:

# Increase Docker memory
# Docker Desktop → Settings → Resources → Memory → 8GB+

# Restart Docker

If permission denied:

# Fix volume permissions
sudo chown -R "$(id -u):$(id -g)" infrastructure/data/

# Recreate volumes
make docker-clean
make docker-up

If address in use:

# Clean up old containers
docker ps -a | grep kafka
docker rm -f <container_id>

Ingestion Failures

API Connection Errors

Symptom:

make ingest-binance
# ERROR: Connection refused to api.binance.com

Diagnosis:

# Test connectivity
curl https://api.binance.com/api/v3/exchangeInfo
ping api.binance.com

# Check DNS
nslookup api.binance.com

# Check firewall
# Corporate firewall may block crypto exchanges

Fix:

If network issue:

# Try different network (mobile hotspot)
# Or configure proxy

export HTTP_PROXY=http://proxy.company.com:8080
export HTTPS_PROXY=http://proxy.company.com:8080

If rate limited:

# Increase delay in client
# src/refdata/ingestion/sources/base.py

self.rate_limit_rps = 5  # Decrease from 10
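
Lowering rate_limit_rps spaces requests further apart. As a rough illustration of how such a setting could be enforced, here is a minimum-interval throttle. MinIntervalThrottle is a hypothetical class written for this guide, not the client in base.py, whose actual mechanism may differ.

```python
import time


class MinIntervalThrottle:
    """Block so that calls to wait() are at least 1/rps seconds apart."""

    def __init__(self, rps: float, clock=time.monotonic, sleep=time.sleep):
        if rps <= 0:
            raise ValueError("rps must be positive")
        self.min_interval = 1.0 / rps
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if needed; return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        self._last_call = now + delay
        return delay
```

Calling wait() before each request keeps the client at or below the configured rate. With rps=5, back-to-back calls are spaced 0.2 seconds apart.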

Prevention:

Respect rate limits
Implement exponential backoff
Monitor error rates
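
Exponential backoff could look roughly like the sketch below. It assumes the wrapped call raises ConnectionError or TimeoutError on transient failures. retry_with_backoff and its parameters are illustrative names, not existing project code.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(); on a retryable error wait up to base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, delay))
```

The upper bound on the wait doubles after each failure (1s, 2s, 4s, ...) up to max_delay. Errors not listed in retry_on propagate immediately.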

Kafka Connection Errors

Symptom:

make ingest-binance
# ERROR: Failed to publish to Kafka: Connection refused

Diagnosis:

# Check Kafka is running
docker ps | grep kafka

# Check Kafka logs
docker logs kafka --tail=50

# Test connection
kafka-topics --bootstrap-server localhost:9092 --list

Fix:

If Kafka not running:

make docker-down
make docker-up

If wrong bootstrap server:

# Check .env file
grep KAFKA_BOOTSTRAP .env

# Should be: localhost:9092 (for local dev)
# Not: kafka:29092 (that's internal Docker network)

Prevention:

Always verify Kafka health before ingestion
Add health check to ingestion script
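
A pre-ingestion health check could be as simple as waiting for the broker port to accept connections. The sketch below uses only the standard library. parse_bootstrap and wait_for_broker are hypothetical helpers. A successful TCP connect shows only that the port is open, not that the broker is fully ready.

```python
import socket
import time


def parse_bootstrap(server: str) -> tuple[str, int]:
    """Split 'host:port' into (host, port)."""
    host, _, port = server.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {server!r}")
    return host, int(port)


def wait_for_broker(server: str, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to the broker succeeds, False on timeout."""
    host, port = parse_bootstrap(server)
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

An ingestion script could call wait_for_broker("localhost:9092") first and exit with a clear message when it returns False.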

Schema Registry Errors

Symptom:

make ingest-binance
# ERROR: Schema registration failed: 409 Conflict

Diagnosis:

# Check existing schemas
curl http://localhost:8081/subjects

# Get schema details
curl http://localhost:8081/subjects/refdata-binance-instrument-raw-value/versions/latest

Fix:

If 409 Conflict (schema incompatible):

# Option 1: Fix schema to be BACKWARD compatible
# Add new fields as optional (with default: null)

# Option 2: Delete and re-register (DEV ONLY!)
curl -X DELETE http://localhost:8081/subjects/refdata-binance-instrument-raw-value
python scripts/register_schemas.py

If 422 Invalid Schema:

# Validate Avro schema
# Use an online Avro validator for a full check, or run a quick JSON syntax check
# (this catches malformed JSON only; it does not check Avro semantics):
python -c "
import json

with open('config/schemas/binance_instrument_raw.avsc') as f:
    json.load(f)  # Raises json.JSONDecodeError on malformed JSON
"

Prevention:

Always test schema changes in dev first
Use BACKWARD compatibility mode
Never delete schemas in production
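
Before registering a change, you can check the "add new fields as optional" rule locally. The sketch below compares only top-level record fields, so it is far less thorough than the Schema Registry's own compatibility check. The schemas and field names are made up for illustration.

```python
def added_fields_without_default(old_schema: dict, new_schema: dict) -> list[str]:
    """Names of fields present only in new_schema that lack a default value."""
    old_names = {field["name"] for field in old_schema["fields"]}
    return [
        field["name"]
        for field in new_schema["fields"]
        if field["name"] not in old_names and "default" not in field
    ]


# Hypothetical schemas for illustration; field names are not from the project.
OLD = {"type": "record", "name": "Instrument",
       "fields": [{"name": "symbol", "type": "string"}]}
NEW = {"type": "record", "name": "Instrument",
       "fields": [
           {"name": "symbol", "type": "string"},
           # Optional field: union with "null" listed first, default null.
           {"name": "tick_size", "type": ["null", "string"], "default": None},
           # New field without a default: old records cannot be read with it.
           {"name": "lot_size", "type": "string"},
       ]}

print(added_fields_without_default(OLD, NEW))  # ['lot_size']
```

An empty list means every newly added field carries a default, which is the condition the fix above asks for.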

DBT Errors

Connection Errors

Symptom:

make dbt-run
# Database Error: Could not connect to DuckDB

Diagnosis:

# Check DuckDB can access S3
duckdb -c "
SELECT * FROM iceberg_scan('s3://refdata-warehouse/bronze/instruments/binance')
LIMIT 1
"

# Check credentials
echo $DBT_S3_ACCESS_KEY_ID
echo $DBT_S3_ENDPOINT

Fix:

If credentials missing:

# Set environment variables
export DBT_S3_ENDPOINT=localhost:9000
export DBT_S3_ACCESS_KEY_ID=admin
export DBT_S3_SECRET_ACCESS_KEY=password

# Or append to the .env file (>> keeps existing entries; > would overwrite them)
cat >> .env << EOF
DBT_S3_ENDPOINT=localhost:9000
DBT_S3_ACCESS_KEY_ID=admin
DBT_S3_SECRET_ACCESS_KEY=password
EOF

If MinIO not accessible:

# Check MinIO running
curl http://localhost:9000/minio/health/live

# Restart if needed
docker restart minio
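
The credential check can be automated so a missing variable is reported by name instead of surfacing as a generic connection error. A minimal sketch, using the variable names from the fix above. missing_env_vars is an illustrative helper, not project code.

```python
import os

# Variable names taken from the fix above.
REQUIRED_VARS = ("DBT_S3_ENDPOINT", "DBT_S3_ACCESS_KEY_ID", "DBT_S3_SECRET_ACCESS_KEY")


def missing_env_vars(names=REQUIRED_VARS, env=None) -> list[str]:
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]
```

Running missing_env_vars() in the shell where you invoke make dbt-run lists exactly which variables still need to be exported.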

Compilation Errors

Symptom:

make dbt-run
# Compilation Error: Undefined macro 'normalize_asset'

Diagnosis:

cd dbt

# Check macro exists
ls -la macros/normalize_asset.sql

# Compile specific model
dbt compile --select silver_instruments

Fix:

If macro missing:

# Check that the macro is defined somewhere under macros/
# dbt resolves macros by the name in {% macro ... %}, not by the filename
grep -rn "macro normalize_asset" macros/

# Refresh DBT project
dbt clean
dbt deps  # If using packages
dbt compile

If syntax error in macro:

# View compiled SQL to see error
cat target/compiled/k2_refdata/models/silver/silver_instruments.sql

# Look for Jinja errors

SQL Errors

Symptom:

make dbt-run
# Database Error: column "tick_size" does not exist

Diagnosis:

# View compiled SQL
cat target/run/k2_refdata/models/silver/silver_instruments.sql

# Check source table schema
duckdb -c "
DESCRIBE iceberg_scan('s3://refdata-warehouse/bronze/instruments/binance')
"

Fix:

If column name mismatch:

-- Fix in DBT model
-- Change: tick_size
-- To: json_extract_string(data, '$.tickSize')

If table not found:

# Re-create Iceberg tables
python scripts/init_iceberg_catalog.py

# Full refresh
dbt run --full-refresh

Test Failures

Symptom:

make dbt-test
# FAIL 1 unique_silver_instruments_instrument_sk

Diagnosis:

# View failed test SQL
cat target/compiled/k2_refdata/models/silver/silver_instruments.yml/unique_silver_instruments_instrument_sk.sql

# Run test SQL manually
duckdb -c "$(cat target/compiled/.../unique_silver_instruments_instrument_sk.sql)"

Fix:

If duplicates found:

# Option 1: Fix source data
# Investigate why duplicates exist

# Option 2: Add DISTINCT in model
# (Usually wrong approach - fix root cause)

# Option 3: Full refresh
dbt run --full-refresh --select silver_instruments

Prevention:

Always run tests before committing
Add tests for new fields
Monitor test results in CI/CD

API Issues

API Won't Start

Symptom:

make api-dev
# ERROR: Unable to start server

Diagnosis:

# Check port available
lsof -i :8001

# Check Python environment
which python
python --version

# Try running directly
python -m refdata.api.main

Fix:

If port in use:

# Kill process
lsof -i :8001 | grep LISTEN | awk '{print $2}' | xargs kill -9

# Or change port
uvicorn refdata.api.main:app --port 8002

If import errors:

# Reinstall dependencies
make clean
make install-dev

# Check module installed
pip list | grep refdata

Connection Pool Errors

Symptom:

curl http://localhost:8001/v1/instruments
# ERROR: No connection available within 30s

Diagnosis:

# Check API logs
kubectl logs -l app=refdata-api | grep "connection"

# Check pool size
# Look for: "Connection pool initialized pool_size=X"

Fix:

If pool exhausted:

# Increase pool size
# src/refdata/common/duckdb_pool.py

DuckDBConnectionPool(
    min_connections=10,  # Increase
    max_connections=100,  # Increase
)

If connection leak:

# Ensure connections are returned
# Use context manager:

with pool.get_connection() as conn:
    result = conn.execute(query).fetchall()
# Connection automatically returned
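
To show why the context manager prevents leaks, here is a stripped-down pool built on queue.Queue. SimplePool is a hypothetical stand-in for illustration; the real DuckDBConnectionPool in duckdb_pool.py may be implemented differently.

```python
import queue
from contextlib import contextmanager


class SimplePool:
    """Fixed-size pool that hands out pre-created connections."""

    def __init__(self, connections, timeout: float = 30.0):
        self._timeout = timeout
        self._idle = queue.Queue()
        for conn in connections:
            self._idle.put(conn)

    @contextmanager
    def get_connection(self):
        try:
            conn = self._idle.get(timeout=self._timeout)
        except queue.Empty:
            raise TimeoutError(f"No connection available within {self._timeout}s") from None
        try:
            yield conn
        finally:
            # Runs on normal exit and on exceptions, so the slot is never leaked.
            self._idle.put(conn)
```

Code that takes a connection without the with block and then raises before returning it removes that slot from the pool permanently. Repeated often enough, this produces the timeout shown in the symptom.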

Slow Queries

Symptom:

curl http://localhost:8001/v1/instruments
# Takes > 5 seconds

Diagnosis:

# Check API logs for duration
kubectl logs -l app=refdata-api | grep duration_ms

# Profile query
duckdb -c "
EXPLAIN ANALYZE
SELECT * FROM iceberg_scan('s3://refdata-warehouse/silver/instruments')
WHERE exchange = 'binance'
"

Fix:

If full table scan:

-- Add filters to prune partitions
WHERE exchange = 'binance'  -- Partitioned by exchange
  AND valid_from >= '2024-01-01'  -- Partitioned by months(valid_from)

If large result set:

# Add pagination
limit: int = Query(100, le=1000)  # Max 1000

If Iceberg metadata loading is slow:

# Increase DuckDB memory
# dbt/profiles.yml
settings:
  memory_limit: '8GB'  # Increase from 4GB

Data Quality Problems

Duplicate Records

Symptom:

make dbt-test
# FAIL unique_silver_instruments_instrument_sk

Diagnosis:

# Find duplicates
duckdb -c "
SELECT instrument_sk, COUNT(*)
FROM iceberg_scan('s3://refdata-warehouse/silver/instruments')
WHERE valid_to IS NULL
GROUP BY instrument_sk
HAVING COUNT(*) > 1
"

Fix:

If SCD Type 2 not closing records:

-- Check DBT model UPDATE logic
-- Should set valid_to for old records before INSERT new
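
The close-then-insert order can be illustrated in memory. This sketch models rows as dictionaries with the instrument_sk, valid_from and valid_to columns used in the query above. It shows the intended invariant and is not the DBT model's actual logic.

```python
from datetime import datetime


def apply_scd2_change(rows: list[dict], new_row: dict, now: datetime) -> None:
    """Close the open version for new_row's key, then append the new version."""
    for row in rows:
        if row["instrument_sk"] == new_row["instrument_sk"] and row["valid_to"] is None:
            row["valid_to"] = now  # close the previous version first
    rows.append({**new_row, "valid_from": now, "valid_to": None})


def open_versions(rows: list[dict], key) -> int:
    """Count versions of key that are still open (valid_to IS NULL)."""
    return sum(1 for r in rows if r["instrument_sk"] == key and r["valid_to"] is None)
```

After any number of changes, open_versions should return exactly 1 per key. If the closing step is skipped, the count grows with each change, which is what the duplicate query detects.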

If same record ingested twice:

# Check state store
psql -h localhost -U postgres -d refdata -c "
SELECT exchange, last_hash, last_ingestion_timestamp
FROM ingestion_state
"

# If hash not updating, fix state store logic
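
Hash-based change detection usually follows the pattern below. It is a sketch under assumptions: the state store is modeled as a plain dictionary keyed by exchange, and SHA-256 over canonical JSON stands in for whatever hash the project actually stores in last_hash.

```python
import hashlib
import json


def content_hash(payload: dict) -> str:
    """SHA-256 of the payload serialized with sorted keys, so key order is irrelevant."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def should_ingest(state: dict, exchange: str, payload: dict) -> bool:
    """Return True and record the new hash only when the payload changed."""
    new_hash = content_hash(payload)
    if state.get(exchange) == new_hash:
        return False  # identical to the last ingested payload: skip
    state[exchange] = new_hash  # if this write is skipped, every run re-ingests
    return True
```

If last_hash in ingestion_state never changes between runs, the write-back step is the first place to look.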

Missing Data

Symptom:

duckdb -c "SELECT COUNT(*) FROM iceberg_scan('s3://refdata-warehouse/silver/instruments')"
# Returns 0 or a count lower than expected

Diagnosis:

# Check each layer
# Bronze:
duckdb -c "SELECT COUNT(*) FROM iceberg_scan('s3://refdata-warehouse/bronze/instruments/binance')"

# If Bronze has data but Silver doesn't:
# DBT transformation issue

# If Bronze has no data:
# Ingestion issue

Fix:

If Bronze missing:

# Re-run ingestion
make ingest-now

If Silver missing:

# Check DBT incremental logic
# May be filtering out all records

# Full refresh
dbt run --full-refresh --select silver_instruments

Null Values

Symptom:

curl http://localhost:8001/v1/instruments | jq '.data[0].tick_size'
# Returns null instead of value

Diagnosis:

# Check Bronze has value
# (-noheader -list prints the raw column value so jq can parse it)
duckdb -noheader -list -c "
SELECT api_response_raw
FROM iceberg_scan('s3://refdata-warehouse/bronze/instruments/binance')
LIMIT 1
" | jq '.symbols[0].filters'

# Check Silver parsing
duckdb -c "
SELECT tick_size
FROM iceberg_scan('s3://refdata-warehouse/silver/instruments')
WHERE exchange = 'binance'
LIMIT 10
"

Fix:

If Bronze has value but Silver doesn't:

-- Fix JSON parsing in DBT model
-- Check field path is correct

-- Before:
json_extract_string(data, '$.tickSize')

-- After (if nested):
json_extract_string(
  json_extract(data, '$.filters[0]'),
  '$.tickSize'
)
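
Indexing filters[0] only works if the price filter happens to come first. The Python sketch below shows a lookup by filterType, which does not depend on position. The sample payload is trimmed and its values are illustrative; extract_tick_size is a hypothetical helper, and the equivalent change in the DBT model would be written in SQL.

```python
def extract_tick_size(symbol: dict):
    """Return tickSize from the PRICE_FILTER entry, or None if absent."""
    for flt in symbol.get("filters", []):
        if flt.get("filterType") == "PRICE_FILTER":
            return flt.get("tickSize")
    return None


# Trimmed example in the shape of a Binance exchangeInfo symbol entry.
symbol = {
    "symbol": "BTCUSDT",
    "filters": [
        {"filterType": "LOT_SIZE", "stepSize": "0.00001"},
        {"filterType": "PRICE_FILTER", "tickSize": "0.01"},
    ],
}
print(extract_tick_size(symbol))  # 0.01
```

Running the same lookup against a raw Bronze record is a quick way to confirm whether the null comes from the source data or from the parsing path.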

Performance Issues

High API Latency

Symptom:

curl http://localhost:8001/v1/instruments
# Takes > 2 seconds

Diagnosis:

# Check metrics
curl http://localhost:8001/metrics | grep http_request_duration

# Check DuckDB query time
# Add logging to queries.py

Fix:

Increase connection pool size
Add query caching
Optimize Iceberg partitioning
Add filters to queries
Increase API replicas (horizontal scaling)

See COMMON-WORKFLOWS.md for details.
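
Query caching can start as a small time-based cache in front of the query function. The sketch below is a generic illustration: TTLCache is a hypothetical class, it is not thread-safe, and a suitable TTL depends on how stale reference data is allowed to be.

```python
import time


class TTLCache:
    """Cache values for ttl seconds; recompute on miss or expiry."""

    def __init__(self, ttl: float, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value
```

A handler could call cache.get_or_compute(("instruments", exchange), run_query), where run_query is whatever function executes the DuckDB query, so repeated requests within the TTL skip the query entirely.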

DBT Runs Slow

Symptom:

make dbt-run
# Takes > 10 minutes

Diagnosis:

# Time each model
dbt run --select silver_instruments --debug

# Check model materialization
cat dbt/models/silver/silver_instruments.sql | grep "materialized"

Fix:

If running a full refresh when the model should be incremental:

# Remove --full-refresh flag
dbt run --select silver_instruments  # Without --full-refresh

If incremental logic is slow:

-- Optimize WHERE clause
{% if is_incremental() %}
WHERE ingestion_timestamp > (
  -- Cache this value instead of recalculating
  SELECT MAX(record_created_at) FROM {{ this }}
)
{% endif %}

If too many models:

# Run in parallel
dbt run --threads 8

Getting Help

Self-Service

Search docs: grep -r "error message" docs/
Check logs: Always check logs first
Read error message: It usually tells you what's wrong
Google: Search for exact error message

Escalation

Slack: #k2-refdata-platform
- Include: error message, what you tried, logs
Office Hours: Tuesday/Thursday 2-3pm
- Come prepared with specifics
PagerDuty: Production emergencies only
- On-call: data-engineering-oncall

Creating Good Bug Reports

Include:

What you expected: "API should return data"
What happened: "API returns 500 error"

Steps to reproduce:

1. make docker-up
2. make api-dev
3. curl http://localhost:8001/v1/instruments

Environment: Local/staging/production
Logs: Relevant log excerpts
What you tried: "I restarted the API, same error"

Good example:

BUG: API returns 500 on /v1/instruments

Expected: Return list of instruments
Actual: 500 Internal Server Error

Steps to reproduce:
1. make docker-up
2. make ingest-binance
3. make dbt-run
4. make api-dev
5. curl http://localhost:8001/v1/instruments

Environment: Local development

Logs:
ERROR: DuckDB connection pool exhausted
Traceback: ...

What I tried:
- Restarted API: same error
- Checked DuckDB connection: works manually
- Increased pool size to 20: still fails

Help needed: How to debug connection pool exhaustion?

Common Error Messages

"Connection refused"

Meaning: Service not running or wrong port

Fix:

Check service is running: docker ps
Check correct port in config
Restart service

"Address already in use"

Meaning: Port conflict

Fix:

Kill process: lsof -i :<port> | grep LISTEN | awk '{print $2}' | xargs kill
Or change port

"Schema validation failed"

Meaning: Avro schema incompatible

Fix:

Make schema BACKWARD compatible
Add default values to new fields
Never remove required fields

"Permission denied"

Meaning: File/directory permissions

Fix:

sudo chown -R "$(id -u):$(id -g)" <directory>
Check Docker volume permissions

"Out of memory"

Meaning: Container/process exceeded memory limit

Fix:

Increase Docker memory (Docker Desktop settings)
Increase DuckDB memory limit in profiles.yml
Optimize queries to use less memory

Still stuck? Ask in #k2-refdata-platform - we're here to help!
