Quick reference for when things break.
Each section: Symptom → Diagnosis → Fix → Prevention
- Services Won't Start
- Ingestion Failures
- DBT Errors
- API Issues
- Data Quality Problems
- Performance Issues
Symptom:
make docker-up
# Error: port already in use
Diagnosis:
# Check which ports are in use
lsof -i :9092 # Kafka
lsof -i :5432 # PostgreSQL
lsof -i :9000 # MinIO
lsof -i :8181 # Iceberg REST
Fix:
# Option 1: Kill conflicting processes
kill -9 <PID>
# Option 2: Change ports in docker-compose.refdata.yml
# Edit ports: "9093:9092" instead of "9092:9092"
# Option 3: Clean everything and restart
make docker-clean
make docker-up
Prevention:
- Use project-specific ports (avoid common ones)
- Always run make docker-down when finished
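The port check from the diagnosis step can also be scripted so it runs before `make docker-up`. The following is a minimal sketch, not part of the project: the helper names `is_port_in_use` and `find_conflicts` are hypothetical, and the port map simply mirrors the ports listed above.

```python
import socket

# Ports copied from the diagnosis step above; adjust to match your compose file.
SERVICE_PORTS = {"Kafka": 9092, "PostgreSQL": 5432, "MinIO": 9000, "Iceberg REST": 8181}


def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds, i.e. a listener exists.
        return sock.connect_ex((host, port)) == 0


def find_conflicts(ports: dict[str, int]) -> dict[str, int]:
    """Return the subset of ports that are already taken (hypothetical helper)."""
    return {name: port for name, port in ports.items() if is_port_in_use(port)}


# Example use before starting the stack:
#   for name, port in find_conflicts(SERVICE_PORTS).items():
#       print(f"{name}: port {port} is already in use")
```

This only detects listeners on the local loopback interface; `lsof` remains the better tool for finding out which process holds the port.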
Symptom:
docker ps | grep kafka
# Shows container restarting repeatedly
Diagnosis:
# Check container logs
docker logs kafka --tail=100
# Common errors:
# - "Address already in use"
# - "Out of memory"
# - "Permission denied"
Fix:
If out of memory:
# Increase Docker memory
# Docker Desktop → Settings → Resources → Memory → 8GB+
# Restart Docker
If permission denied:
# Fix volume permissions
sudo chown -R $USER:$USER infrastructure/data/
# Recreate volumes
make docker-clean
make docker-up
If address in use:
# Clean up old containers
docker ps -a | grep kafka
docker rm -f <container_id>
Symptom:
make ingest-binance
# ERROR: Connection refused to api.binance.com
Diagnosis:
# Test connectivity
curl https://api.binance.com/api/v3/exchangeInfo
ping api.binance.com
# Check DNS
nslookup api.binance.com
# Check firewall
# Corporate firewall may block crypto exchanges
Fix:
If network issue:
# Try different network (mobile hotspot)
# Or configure proxy
export HTTP_PROXY=http://proxy.company.com:8080
export HTTPS_PROXY=http://proxy.company.com:8080
If rate limited:
# Increase delay in client
# src/refdata/ingestion/sources/base.py
self.rate_limit_rps = 5 # Decrease from 10
Prevention:
- Respect rate limits
- Implement exponential backoff
- Monitor error rates
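The exponential backoff mentioned above could look roughly like this. It is a sketch under assumptions: `call_with_backoff` is a hypothetical helper, not something in `base.py`, and it retries only on `ConnectionError`; a real client would likely also need to handle HTTP 429 responses from the exchange.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on ConnectionError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # Out of retries: surface the original error.
            # The delay ceiling doubles each attempt (0.5s, 1s, 2s, ...), capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter: sleep a random amount up to the ceiling so clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists only so the delay can be stubbed out in tests; production callers can leave it at the default.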
Symptom:
make ingest-binance
# ERROR: Failed to publish to Kafka: Connection refused
Diagnosis:
# Check Kafka is running
docker ps | grep kafka
# Check Kafka logs
docker logs kafka --tail=50
# Test connection
kafka-topics --bootstrap-server localhost:9092 --list
Fix:
If Kafka not running:
make docker-down
make docker-up
If wrong bootstrap server:
# Check .env file
cat .env | grep KAFKA_BOOTSTRAP
# Should be: localhost:9092 (for local dev)
# Not: kafka:29092 (that's internal Docker network)
Prevention:
- Always verify Kafka health before ingestion
- Add health check to ingestion script
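One way to add such a health check is to wait for the broker port before publishing. The sketch below assumes a hypothetical helper `wait_for_broker`; it only confirms that the TCP port accepts connections, which is weaker than a real broker health check but catches the "Kafka not running" and "wrong bootstrap server" cases above.

```python
import socket
import time


def wait_for_broker(bootstrap: str, timeout_s: float = 30.0, interval_s: float = 1.0) -> bool:
    """Poll until a TCP connection to 'host:port' succeeds or timeout_s elapses."""
    host, _, port = bootstrap.rpartition(":")
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError (e.g. connection refused) on failure.
            with socket.create_connection((host, int(port)), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


# Example use at the top of an ingestion script:
#   if not wait_for_broker("localhost:9092"):
#       raise SystemExit("Kafka is not reachable; run make docker-up first")
```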
Symptom:
make ingest-binance
# ERROR: Schema registration failed: 409 Conflict
Diagnosis:
# Check existing schemas
curl http://localhost:8081/subjects
# Get schema details
curl http://localhost:8081/subjects/refdata-binance-instrument-raw-value/versions/latest
Fix:
If 409 Conflict (schema incompatible):
# Option 1: Fix schema to be BACKWARD compatible
# Add new fields as optional (with default: null)
# Option 2: Delete and re-register (DEV ONLY!)
curl -X DELETE http://localhost:8081/subjects/refdata-binance-instrument-raw-value
python scripts/register_schemas.py
If 422 Invalid Schema:
# Validate Avro schema
# Use online validator or:
python -c "
import json
# Note: this checks only that the file is well-formed JSON, not full Avro validity
with open('config/schemas/binance_instrument_raw.avsc') as f:
    schema_str = f.read()
schema = json.loads(schema_str) # Should not raise error
"
Prevention:
- Always test schema changes in dev first
- Use BACKWARD compatibility mode
- Never delete schemas in production
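Before registering a changed schema, the "new fields need defaults" rule can be checked locally. This is a partial sketch, assuming top-level record schemas: `fields_missing_defaults` is a hypothetical helper and the example schemas are made up. It does not cover other compatibility rules such as type changes, so the Schema Registry remains the authority.

```python
import json


def fields_missing_defaults(old_schema: dict, new_schema: dict) -> list[str]:
    """Return names of fields added in new_schema that have no default.

    Under BACKWARD compatibility the new schema must be able to read data
    written with the old one, so every added field needs a default value.
    """
    old_names = {field["name"] for field in old_schema.get("fields", [])}
    return [
        field["name"]
        for field in new_schema.get("fields", [])
        if field["name"] not in old_names and "default" not in field
    ]


# Illustrative schemas only; these are not the project's real .avsc files.
OLD = json.loads('{"type": "record", "name": "Instrument", "fields": ['
                 '{"name": "symbol", "type": "string"}]}')
NEW = json.loads('{"type": "record", "name": "Instrument", "fields": ['
                 '{"name": "symbol", "type": "string"},'
                 '{"name": "tick_size", "type": ["null", "string"], "default": null},'
                 '{"name": "status", "type": "string"}]}')

problems = fields_missing_defaults(OLD, NEW)
```

Here `problems` lists `status`, because it was added without a default, while `tick_size` passes.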
Symptom:
make dbt-run
# Database Error: Could not connect to DuckDB
Diagnosis:
# Check DuckDB can access S3
duckdb -c "
SELECT * FROM iceberg_scan('s3://refdata-warehouse/bronze/instruments/binance')
LIMIT 1
"
# Check credentials
echo $DBT_S3_ACCESS_KEY_ID
echo $DBT_S3_ENDPOINT
Fix:
If credentials missing:
# Set environment variables
export DBT_S3_ENDPOINT=localhost:9000
export DBT_S3_ACCESS_KEY_ID=admin
export DBT_S3_SECRET_ACCESS_KEY=password
# Or create .env file
cat > .env << EOF
DBT_S3_ENDPOINT=localhost:9000
DBT_S3_ACCESS_KEY_ID=admin
DBT_S3_SECRET_ACCESS_KEY=password
EOF
If MinIO not accessible:
# Check MinIO running
curl http://localhost:9000/minio/health/live
# Restart if needed
docker restart minio
Symptom:
make dbt-run
# Compilation Error: Undefined macro 'normalize_asset'
Diagnosis:
cd dbt
# Check macro exists
ls -la macros/normalize_asset.sql
# Compile specific model
dbt compile --select silver_instruments
Fix:
If macro missing:
# Check file exists and has correct name
# Filename must match macro name
# Refresh DBT project
dbt clean
dbt deps # If using packages
dbt compile
If syntax error in macro:
# View compiled SQL to see error
cat target/compiled/k2_refdata/models/silver/silver_instruments.sql
# Look for Jinja errors
Symptom:
make dbt-run
# Database Error: column "tick_size" does not exist
Diagnosis:
# View compiled SQL
cat target/run/k2_refdata/models/silver/silver_instruments.sql
# Check source table schema
duckdb -c "
DESCRIBE iceberg_scan('s3://refdata-warehouse/bronze/instruments/binance')
"
Fix:
If column name mismatch:
-- Fix in DBT model
-- Change: tick_size
-- To: json_extract_string(data, '$.tickSize')
If table not found:
# Re-create Iceberg tables
python scripts/init_iceberg_catalog.py
# Full refresh
dbt run --full-refresh
Symptom:
make dbt-test
# FAIL 1 unique_silver_instruments_instrument_sk
Diagnosis:
# View failed test SQL
cat target/compiled/k2_refdata/models/silver/silver_instruments.yml/unique_silver_instruments_instrument_sk.sql
# Run test SQL manually
duckdb -c "$(cat target/compiled/.../unique_silver_instruments_instrument_sk.sql)"
Fix:
If duplicates found:
# Option 1: Fix source data
# Investigate why duplicates exist
# Option 2: Add DISTINCT in model
# (Usually wrong approach - fix root cause)
# Option 3: Full refresh
dbt run --full-refresh --select silver_instruments
Prevention:
- Always run tests before committing
- Add tests for new fields
- Monitor test results in CI/CD
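When investigating a failed uniqueness test on rows already pulled into Python (for example from a DuckDB query result), the same check can be reproduced in a few lines. This is an illustrative sketch only: `duplicate_keys` is a hypothetical helper and the sample rows are invented.

```python
from collections import Counter
from typing import Any, Hashable, Iterable, Mapping


def duplicate_keys(rows: Iterable[Mapping[str, Any]], key: str) -> dict[Hashable, int]:
    """Return {key value: count} for every value of `key` that occurs more than once."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


# Invented sample rows; real rows would come from silver_instruments.
rows = [
    {"instrument_sk": "a1", "exchange": "binance"},
    {"instrument_sk": "b2", "exchange": "binance"},
    {"instrument_sk": "a1", "exchange": "binance"},
]
dupes = duplicate_keys(rows, "instrument_sk")
```

Here `dupes` maps `a1` to a count of 2, which mirrors what the dbt `unique` test flags.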
Symptom:
make api-dev
# ERROR: Unable to start server
Diagnosis:
# Check port available
lsof -i :8001
# Check Python environment
which python
python --version
# Try running directly
python -m refdata.api.main
Fix:
If port in use:
# Kill process
lsof -i :8001 | grep LISTEN | awk '{print $2}' | xargs kill -9
# Or change port
uvicorn refdata.api.main:app --port 8002
If import errors:
# Reinstall dependencies
make clean
make install-dev
# Check module installed
pip list | grep refdata
Symptom:
curl http://localhost:8001/v1/instruments
# ERROR: No connection available within 30s
Diagnosis:
# Check API logs
kubectl logs -l app=refdata-api | grep "connection"
# Check pool size
# Look for: "Connection pool initialized pool_size=X"
Fix:
If pool exhausted:
# Increase pool size
# src/refdata/common/duckdb_pool.py
DuckDBConnectionPool(
    min_connections=10,  # Increase
    max_connections=100,  # Increase
)
If connection leak:
# Ensure connections are returned
# Use context manager:
with pool.get_connection() as conn:
    result = conn.execute(query).fetchall()
# Connection automatically returned
Symptom:
curl http://localhost:8001/v1/instruments
# Takes > 5 seconds
Diagnosis:
# Check API logs for duration
kubectl logs -l app=refdata-api | grep duration_ms
# Profile query
duckdb -c "
EXPLAIN ANALYZE
SELECT * FROM iceberg_scan('s3://refdata-warehouse/silver/instruments')
WHERE exchange = 'binance'
"
Fix:
If full table scan:
-- Add filters to prune partitions
WHERE exchange = 'binance' -- Partitioned by exchange
AND valid_from >= '2024-01-01' -- Partitioned by months(valid_from)
If large result set:
# Add pagination
limit: int = Query(100, le=1000) # Max 1000
If Iceberg metadata loading:
# Increase DuckDB memory
# dbt/profiles.yml
settings:
  memory_limit: '8GB' # Increase from 4GB
Symptom:
make dbt-test
# FAIL unique_silver_instruments_instrument_sk
Diagnosis:
# Find duplicates
duckdb -c "
SELECT instrument_sk, COUNT(*)
FROM iceberg_scan('s3://refdata-warehouse/silver/instruments')
WHERE valid_to IS NULL
GROUP BY instrument_sk
HAVING COUNT(*) > 1
"
Fix:
If SCD Type 2 not closing records:
-- Check DBT model UPDATE logic
-- Should set valid_to on the old record before inserting the new one
If same record ingested twice:
# Check state store
psql -h localhost -U postgres -d refdata -c "
SELECT exchange, last_hash, last_ingestion_timestamp
FROM ingestion_state
"
# If hash not updating, fix state store logic
Symptom:
duckdb -c "SELECT COUNT(*) FROM iceberg_scan('s3://refdata-warehouse/silver/instruments')"
# Returns 0 or less than expected
Diagnosis:
# Check each layer
# Bronze:
duckdb -c "SELECT COUNT(*) FROM iceberg_scan('s3://refdata-warehouse/bronze/instruments/binance')"
# If Bronze has data but Silver doesn't:
# DBT transformation issue
# If Bronze has no data:
# Ingestion issue
Fix:
If Bronze missing:
# Re-run ingestion
make ingest-now
If Silver missing:
# Check DBT incremental logic
# May be filtering out all records
# Full refresh
dbt run --full-refresh --select silver_instruments
Symptom:
curl http://localhost:8001/v1/instruments | jq '.data[0].tick_size'
# Returns null instead of value
Diagnosis:
# Check Bronze has value
duckdb -c "
SELECT api_response_raw
FROM iceberg_scan('s3://refdata-warehouse/bronze/instruments/binance')
LIMIT 1
" | jq '.symbols[0].filters'
# Check Silver parsing
duckdb -c "
SELECT tick_size
FROM iceberg_scan('s3://refdata-warehouse/silver/instruments')
WHERE exchange = 'binance'
LIMIT 10
"
Fix:
If Bronze has value but Silver doesn't:
-- Fix JSON parsing in DBT model
-- Check field path is correct
-- Before:
json_extract_string(data, '$.tickSize')
-- After (if nested):
json_extract_string(
    json_extract(data, '$.filters[0]'),
    '$.tickSize'
)
Symptom:
curl http://localhost:8001/v1/instruments
# Takes > 2 seconds
Diagnosis:
# Check metrics
curl http://localhost:8001/metrics | grep http_request_duration
# Check DuckDB query time
# Add logging to queries.py
Fix:
- Increase connection pool size
- Add query caching
- Optimize Iceberg partitioning
- Add filters to queries
- Increase API replicas (horizontal scaling)
See COMMON-WORKFLOWS.md for details.
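As an illustration of the "add query caching" option, a small time-based cache in front of the query function might look like the sketch below. It assumes reference data can be served slightly stale; `TTLCache` and its method names are hypothetical, and the class is not thread-safe as written.

```python
import time
from typing import Any, Callable, Hashable


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_s seconds."""

    def __init__(self, ttl_s: float = 60.0, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock  # Injectable so expiry can be tested without sleeping.
        self._entries: dict[Hashable, tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        """Return the cached value for key, calling compute() only on a miss or expiry."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl_s:
            return hit[1]
        value = compute()
        self._entries[key] = (now, value)
        return value


# Example use around a slow query (run_query is a placeholder for your own function):
#   cache = TTLCache(ttl_s=30)
#   rows = cache.get_or_compute(("instruments", "binance"), lambda: run_query("binance"))
```

A short TTL keeps the staleness window small while still absorbing bursts of identical requests.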
Symptom:
make dbt-run
# Takes > 10 minutes
Diagnosis:
# Time each model
dbt run --select silver_instruments --debug
# Check model materialization
cat dbt/models/silver/silver_instruments.sql | grep "materialized"
Fix:
If full refresh when should be incremental:
# Remove --full-refresh flag
dbt run --select silver_instruments # Without --full-refresh
If incremental logic slow:
-- Optimize WHERE clause
{% if is_incremental() %}
WHERE ingestion_timestamp > (
    -- Cache this value instead of recalculating
    SELECT MAX(record_created_at) FROM {{ this }}
)
{% endif %}
If too many models:
# Run in parallel
dbt run --threads 8
- Search docs: grep -r "error message" docs/
- Check logs: Always check logs first
- Read error message: It usually tells you what's wrong
- Google: Search for exact error message
- Slack: #k2-refdata-platform
  - Include: error message, what you tried, logs
- Office Hours: Tuesday/Thursday 2-3pm
  - Come prepared with specifics
- PagerDuty: Production emergencies only
  - On-call: data-engineering-oncall
Include:
- What you expected: "API should return data"
- What happened: "API returns 500 error"
- Steps to reproduce:
  1. make docker-up
  2. make api-dev
  3. curl http://localhost:8001/v1/instruments
- Environment: Local/staging/production
- Logs: Relevant log excerpts
- What you tried: "I restarted the API, same error"
Good example:
BUG: API returns 500 on /v1/instruments
Expected: Return list of instruments
Actual: 500 Internal Server Error
Steps to reproduce:
1. make docker-up
2. make ingest-binance
3. make dbt-run
4. make api-dev
5. curl http://localhost:8001/v1/instruments
Environment: Local development
Logs:
ERROR: DuckDB connection pool exhausted
Traceback: ...
What I tried:
- Restarted API: same error
- Checked DuckDB connection: works manually
- Increased pool size to 20: still fails
Help needed: How to debug connection pool exhaustion?
Meaning: Service not running or wrong port
Fix:
- Check service is running: docker ps
- Check correct port in config
- Restart service
Meaning: Port conflict
Fix:
- Kill process: lsof -i :<port> | grep LISTEN | awk '{print $2}' | xargs kill
- Or change port
Meaning: Avro schema incompatible
Fix:
- Make schema BACKWARD compatible
- Add default values to new fields
- Never remove required fields
Meaning: File/directory permissions
Fix:
sudo chown -R $USER:$USER <directory>
- Check Docker volume permissions
Meaning: Container/process exceeded memory limit
Fix:
- Increase Docker memory (Docker Desktop settings)
- Increase DuckDB memory limit in profiles.yml
- Optimize queries to use less memory
Still stuck? Ask in #k2-refdata-platform - we're here to help!