Purpose: Comprehensive checklist for production deployment of K2 Reference Data Platform.
Use this before: Initial production deployment, major version upgrades, disaster recovery
- Kubernetes cluster version 1.25+ verified
- Kafka cluster accessible (shared with k2-market-data-platform)
- PostgreSQL 14+ database created (`refdata` schema)
- MinIO/S3 bucket created (`refdata-warehouse`)
- Iceberg REST catalog deployed and accessible
- Docker registry credentials configured
- Network policies reviewed and approved
- kubectl context configured for production cluster
- S3/MinIO credentials generated (IAM user: `refdata-platform`)
- Kafka credentials generated (ACLs: `refdata.*` topics)
- PostgreSQL credentials generated (user: `refdata_app`)
- Schema Registry access verified
- Secrets created in Kubernetes namespace
- Code reviewed and approved (PR merged to `main`)
- All unit tests passing (`make test-unit`)
- All integration tests passing (`make test-integration`)
- DBT models validated (`dbt test`)
- API tests passing (`make test-integration`)
- Code quality checks passing (`make quality`)
- Docker images built and pushed to registry
- Configuration values verified (no hardcoded dev credentials)
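The test and quality gates in this list can be chained so that one command reports every failing gate. The sketch below is only an illustration: it assumes the `make` targets and the dbt project are reachable from the current working directory, and `run_predeploy_checks` is a hypothetical helper name.

```shell
# Sketch only: assumes the make targets and the dbt project are available
# from the current working directory.
run_predeploy_checks() {
  failed=""
  for check in "make test-unit" "make test-integration" "dbt test" "make quality"; do
    echo "==> $check"
    # Unquoted on purpose so the string splits into a command and its argument.
    $check || failed="$failed [$check]"
  done
  if [ -n "$failed" ]; then
    echo "Failed checks:$failed" >&2
    return 1
  fi
  echo "All pre-deployment checks passed"
}
```

Running every gate instead of stopping at the first failure gives a full picture in one pass, which tends to be more useful right before a release.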
- Create Kubernetes namespace: `refdata-platform`
- Label namespace: `env=production`
- Create S3 credentials secret
- Create Kafka credentials secret
- Create PostgreSQL credentials secret
- Verify secrets: `kubectl get secrets -n refdata-platform`
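The namespace and secret steps above might be scripted roughly as follows. This is a sketch under stated assumptions: the secret names (`refdata-s3-credentials`, `refdata-kafka-credentials`, `refdata-postgres-credentials`), the key names, and the environment variable names are all hypothetical and should be replaced with whatever the deployment manifests actually reference.

```shell
# Sketch only: secret names, key names, and environment variable names are
# illustrative assumptions, not the platform's actual manifests.
NAMESPACE="refdata-platform"

create_namespace() {
  # "create --dry-run=client -o yaml | apply" keeps the step safe to re-run.
  kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -
  kubectl label namespace "$NAMESPACE" env=production --overwrite
}

create_secrets() {
  # ${VAR:?} aborts with an error if a credential is missing from the environment.
  kubectl create secret generic refdata-s3-credentials -n "$NAMESPACE" \
    --from-literal=access-key="${S3_ACCESS_KEY:?}" \
    --from-literal=secret-key="${S3_SECRET_KEY:?}"
  kubectl create secret generic refdata-kafka-credentials -n "$NAMESPACE" \
    --from-literal=username="${KAFKA_USERNAME:?}" \
    --from-literal=password="${KAFKA_PASSWORD:?}"
  kubectl create secret generic refdata-postgres-credentials -n "$NAMESPACE" \
    --from-literal=username=refdata_app \
    --from-literal=password="${PG_PASSWORD:?}"
}
```

Values passed with `--from-literal` can show up in shell history and process listings, so a secrets manager or `--from-env-file` may be a better fit for real credentials.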
- Create Iceberg namespace: `refdata`
- Create Bronze tables: `bronze_instruments_binance`, `bronze_instruments_kraken`
- Register Avro schemas to Schema Registry
- Create Kafka topics: `refdata.instruments.binance.raw`, `refdata.instruments.kraken.raw`
- Initialize PostgreSQL state store tables
- Verify Iceberg catalog: `curl http://iceberg-rest:8181/v1/namespaces`
Script: make init-infra (run from deployment pod)
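The catalog verification step can be turned into a pass/fail check rather than a visual inspection of the `curl` output. The sketch below assumes the catalog answers in the Iceberg REST format `{"namespaces": [["refdata"], ...]}`; the helper names `has_namespace` and `verify_iceberg_namespace` are hypothetical.

```shell
# Sketch only: assumes GET /v1/namespaces returns the Iceberg REST format
# {"namespaces": [["refdata"], ...]}.
has_namespace() {
  # Usage: has_namespace NAME < response.json
  # Strip whitespace so the match does not depend on JSON formatting.
  tr -d '[:space:]' | grep -qF "[\"$1\"]"
}

verify_iceberg_namespace() {
  # -f makes curl fail on HTTP errors instead of passing an error body along.
  curl -fsS "http://iceberg-rest:8181/v1/namespaces" | has_namespace "refdata"
}
```

The string match only covers single-level namespaces such as `refdata`; a JSON-aware tool would be more robust if nested namespaces are introduced.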
- Apply API deployment: `kubectl apply -f api-deployment.yaml`
- Verify pods running: `kubectl get pods -l app=refdata-api`
- Check logs: `kubectl logs -l app=refdata-api --tail=100`
- Test health endpoint: `curl http://refdata-api:8001/health`
- Verify connection pool initialized (logs show "pool_size=5")
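Pods can take a while to become ready, so a single `curl` right after `kubectl apply` may fail spuriously. One possible approach, sketched below, is to wait for the rollout and then poll the health endpoint with retries. The helper names, the retry counts, and the 120s timeout are assumptions chosen for illustration.

```shell
# Sketch only: helper names, retry counts, and timeouts are assumptions.
wait_for_health() {
  # Usage: wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
  url="$1"
  attempts="${2:-10}"
  delay="${3:-3}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl return non-zero on HTTP 4xx/5xx responses.
    if curl -fsS -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "health check failed after $attempts attempts: $url" >&2
  return 1
}

verify_api_rollout() {
  # rollout status blocks until the Deployment is fully rolled out or times out.
  kubectl rollout status deployment/refdata-api -n refdata-platform --timeout=120s &&
    wait_for_health "http://refdata-api:8001/health"
}
```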
- Apply Ingress: `kubectl apply -f api-ingress.yaml`
- Verify external URL accessible: `curl https://refdata-api.k2.com/health`
- Test SSL certificate valid
- Apply ingestion CronJob: `kubectl apply -f ingestion-cronjob.yaml`
- Manually trigger first run: `kubectl create job --from=cronjob/refdata-ingestion manual-1`
- Monitor job logs: `kubectl logs job/manual-1 -f`
- Verify Kafka message published: Check `refdata.instruments.binance.raw` topic
- Verify Bronze table populated: Query Iceberg table
- Check state store updated: Query PostgreSQL
- Verify no errors in logs
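A manual run can also be triggered and awaited in one step, so the checklist item only passes once the Job has actually completed. The following is a sketch: `run_cronjob_once`, the `-manual-<timestamp>` naming scheme, and the 600s default timeout are assumptions.

```shell
# Sketch only: the helper name, job naming scheme, and timeout are assumptions.
run_cronjob_once() {
  # Usage: run_cronjob_once CRONJOB [NAMESPACE] [TIMEOUT]
  cronjob="$1"
  ns="${2:-refdata-platform}"
  wait_timeout="${3:-600s}"
  # A timestamp suffix keeps job names unique across manual runs.
  job="${cronjob}-manual-$(date +%s)"
  kubectl create job "$job" --from="cronjob/${cronjob}" -n "$ns" >&2 || return 1
  # kubectl wait only watches for the Complete condition here, so a failed
  # Job is reported when the timeout expires.
  if ! kubectl wait --for=condition=complete "job/${job}" -n "$ns" --timeout="$wait_timeout" >&2; then
    kubectl logs "job/${job}" -n "$ns" --tail=50 >&2
    return 1
  fi
  # Print the job name so callers can inspect logs afterwards.
  echo "$job"
}
```

For example, `job=$(run_cronjob_once refdata-ingestion) && kubectl logs "job/$job" -n refdata-platform` would run the ingestion once and then show its logs.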
- Apply DBT CronJob: `kubectl apply -f dbt-cronjob.yaml`
- Manually trigger first run: `kubectl create job --from=cronjob/refdata-dbt dbt-manual-1` (a distinct name, because `manual-1` is already taken by the ingestion run in the same namespace)
- Monitor DBT logs: `kubectl logs job/dbt-manual-1 -f`
- Verify Silver table populated: Query via API
- Verify Gold table populated: Check symbology endpoint
- Verify all DBT tests passed
- Check DBT docs generated: `make dbt-docs`
- Apply ServiceMonitor for Prometheus: `kubectl apply -f servicemonitor.yaml`
- Import Grafana dashboard: `refdata-platform-dashboard.json`
- Verify metrics scraping: Check Prometheus targets
- Set up alerts:
  - API p95 latency > 200ms
  - Ingestion job failure
  - DBT job failure
  - API error rate > 1%
- Configure PagerDuty integration
- Test alert delivery: Trigger test alert
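One way to trigger a test alert is to post a synthetic alert straight to Alertmanager's v2 API and confirm it arrives in PagerDuty. The sketch below assumes Alertmanager is reachable at a URL you supply; the alert name and labels are hypothetical and must match whatever routing rules send alerts to PagerDuty in your configuration.

```shell
# Sketch only: alert name and labels are hypothetical; adjust them to match
# the Alertmanager routes that forward to PagerDuty.
send_test_alert() {
  # Usage: send_test_alert ALERTMANAGER_URL
  curl -fsS -X POST "$1/api/v2/alerts" \
    -H 'Content-Type: application/json' \
    -d '[{"labels": {"alertname": "RefdataTestAlert", "severity": "info", "service": "refdata-api"},
          "annotations": {"summary": "Test alert: verifying delivery path"}}]'
}
```

A call such as `send_test_alert http://alertmanager:9093` (hostname assumed) exercises the delivery path without waiting for a real threshold breach.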
- Health Check: `curl https://refdata-api.k2.com/health` returns 200
- List Instruments: `curl "https://refdata-api.k2.com/v1/instruments?limit=10"` returns data (quote the URL so the shell does not interpret `?`)
- Point-in-Time Query: Test with historical timestamp
- Instrument History: Test audit trail endpoint
- Symbology Lookup: Test canonical ID lookup
- Reverse Symbology: Test exchange symbol resolution
- Error Handling: Test 404 response for non-existent instrument
- API Documentation: Verify `/docs` and `/redoc` accessible
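Several of the functional checks above reduce to "this path returns this status code", which is easy to script. The sketch below covers only the endpoints named in this checklist. The path `/v1/instruments/does-not-exist` is an assumed route shape for the 404 test, and `expect_status` and `run_smoke_tests` are hypothetical helper names.

```shell
# Sketch only: the 404 path is an assumed route shape, not a documented endpoint.
BASE_URL="${BASE_URL:-https://refdata-api.k2.com}"

expect_status() {
  # Usage: expect_status EXPECTED_CODE PATH
  actual=$(curl -s -o /dev/null -w '%{http_code}' "${BASE_URL}$2")
  if [ "$actual" = "$1" ]; then
    echo "PASS $2 ($actual)"
  else
    echo "FAIL $2 (expected $1, got $actual)"
    return 1
  fi
}

run_smoke_tests() {
  failures=0
  expect_status 200 "/health" || failures=$((failures + 1))
  expect_status 200 "/v1/instruments?limit=10" || failures=$((failures + 1))
  expect_status 404 "/v1/instruments/does-not-exist" || failures=$((failures + 1))
  expect_status 200 "/docs" || failures=$((failures + 1))
  expect_status 200 "/redoc" || failures=$((failures + 1))
  # Succeed only if every check passed.
  [ "$failures" -eq 0 ]
}
```

Status codes alone do not prove the payload is correct, so the point-in-time, history, and symbology checks still need a look at the returned data.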
- API p50 latency < 50ms
- API p95 latency < 100ms
- API p99 latency < 200ms
- Connection pool utilization < 80% under load
- Load test: 100 concurrent requests/sec for 5 minutes
- No memory leaks (monitor pod memory over 1 hour)
- No connection pool exhaustion
Tools: ab -n 10000 -c 100 https://refdata-api.k2.com/health
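The latency targets above can be checked mechanically against the percentile table that `ab` prints at the end of a run. The sketch below assumes the output was saved to a file and that it contains the usual "Percentage of the requests served within a certain time (ms)" section; the helper names are hypothetical.

```shell
# Sketch only: parses the percentile table from saved ab output.
ab_percentile() {
  # Usage: ab_percentile PCT < ab-output.txt
  # Prints the latency in ms that ab reports for the given percentile.
  awk -v pct="$1%" '$1 == pct { print $2; exit }'
}

check_latency_targets() {
  # Usage: check_latency_targets AB_OUTPUT_FILE
  # Targets from this checklist: p50 < 50ms, p95 < 100ms, p99 < 200ms.
  rc=0
  for target in 50:50 95:100 99:200; do
    pct="${target%%:*}"
    limit="${target##*:}"
    actual=$(ab_percentile "$pct" < "$1")
    if [ -n "$actual" ] && [ "$actual" -lt "$limit" ]; then
      echo "PASS p$pct = ${actual}ms (target < ${limit}ms)"
    else
      echo "FAIL p$pct = ${actual:-n/a}ms (target < ${limit}ms)"
      rc=1
    fi
  done
  return "$rc"
}
```

For example: `ab -n 10000 -c 100 https://refdata-api.k2.com/health > ab.txt && check_latency_targets ab.txt`. Note that this measures only the health endpoint, so data endpoints may behave differently under the same load.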
- Silver table row count matches Bronze ingestion
- No duplicate `instrument_sk` values
- All current records have `valid_to = NULL`
- No temporal overlaps (SCD Type 2 test passes)
- Gold symbology count matches unique canonical IDs
- All DBT tests passing (`dbt test`)
- No orphaned records (referential integrity)
- Secrets not visible in logs
- No hardcoded credentials in code
- Network policies enforced (API can't access Kafka directly)
- HTTPS enforced (HTTP redirects to HTTPS)
- CORS configured correctly (no `allow_origins=*` in production)
- Rate limiting enabled (if implemented)
- API key authentication enabled (if implemented)
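Two of the security items above lend themselves to quick automated checks: the HTTP-to-HTTPS redirect and the absence of secrets in logs. The sketch below is illustrative only. The helper names are hypothetical, and the log patterns are a small, incomplete sample that will produce both false positives and false negatives.

```shell
# Sketch only: helper names are hypothetical and the secret patterns are an
# illustrative, incomplete sample.
redirects_to_https() {
  # Usage: redirects_to_https HOST
  # %{redirect_url} is the URL curl would follow; it is empty without a redirect.
  result=$(curl -s -o /dev/null -w '%{http_code} %{redirect_url}' "http://$1/health")
  code="${result%% *}"
  target="${result#* }"
  case "$code" in
    301|302|307|308) ;;
    *) return 1 ;;
  esac
  case "$target" in
    https://*) return 0 ;;
    *) return 1 ;;
  esac
}

logs_are_clean() {
  # Usage: kubectl logs ... | logs_are_clean
  # Succeeds only when no input line matches a secret-like pattern.
  ! grep -Eiq 'password=|secret[_-]?key|BEGIN [A-Z ]*PRIVATE KEY'
}
```

For example: `redirects_to_https refdata-api.k2.com` and `kubectl logs -n refdata-platform -l app=refdata-api --tail=1000 | logs_are_clean`.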
- README.md updated with production URLs
- API-GUIDE.md reviewed and accurate
- DBT-GUIDE.md up to date
- DEPLOYMENT.md runbook created
- MANUAL-OVERRIDE.md runbook created
- ADRs documented (5 required ADRs exist)
- OpenAPI spec published
- Grafana dashboard imported and verified
- Prometheus metrics exposed (`/metrics` endpoint)
- Alerts configured in Alertmanager
- PagerDuty integration tested
- Slack notifications configured (#k2-refdata-platform)
- Log aggregation configured (Loki/Elasticsearch)
- Distributed tracing enabled (Jaeger/Tempo)
- Deployment runbook tested
- Manual override runbook tested (in dev environment)
- Rollback procedure documented and tested
- Disaster recovery plan documented
- On-call runbook created
- Escalation matrix defined
- Iceberg snapshot retention configured (30 days)
- PostgreSQL state store backup scheduled (daily)
- Kafka topic retention configured (7 days)
- S3 bucket versioning enabled
- Disaster recovery drill scheduled (within 30 days)
- RTO/RPO defined (RTO: 1 hour, RPO: 1 hour)
- Send deployment notification to stakeholders
- Freeze code changes (release branch locked)
- Final smoke test in staging environment
- On-call engineer assigned
- Backup on-call engineer assigned
- Rollback plan reviewed
- Execute deployment steps (Deployment section above)
- Monitor Grafana dashboard continuously (first 30 minutes)
- Monitor error logs in real-time
- Verify API requests succeeding
- Check Kafka message flow
- Verify DBT run successful
- Verify metrics stable:
  - API request rate > 0
  - Error rate < 1%
  - p95 latency < 100ms
- Verify ingestion ran successfully
- Verify DBT transformations completed
- Check data quality (row counts, tests)
- No critical alerts fired
- Review past 24 hours of logs
- Check resource utilization (CPU, memory)
- Verify scheduled CronJobs ran (ingestion + DBT)
- Review metrics dashboard
- Document any issues encountered
- Send deployment success notification
- Schedule post-mortem (if issues occurred)
Roll back immediately if:
- API error rate > 10% for 5 minutes
- API unavailable for 5 minutes
- Data corruption detected (duplicate records, missing data)
- Critical security vulnerability discovered
- Production incident (P0/P1) caused by deployment
Rollback Procedure:
- Notify stakeholders: #k2-refdata-platform
- Execute rollback: `kubectl rollout undo deployment/refdata-api`
- Verify rollback successful: Check health endpoint
- Investigate root cause
- Create incident report
- Schedule fix deployment
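The execute-and-verify steps of the rollback procedure could be combined so that the rollback is only reported as successful once the previous revision is serving traffic again. This is a sketch: `rollback_api` is a hypothetical helper and the 180s timeout is an assumption.

```shell
# Sketch only: the helper name and the timeout value are assumptions.
rollback_api() {
  ns="refdata-platform"
  # Revert the Deployment to its previous revision.
  kubectl rollout undo deployment/refdata-api -n "$ns" || return 1
  # Wait until the reverted pods are fully rolled out.
  kubectl rollout status deployment/refdata-api -n "$ns" --timeout=180s || return 1
  # Confirm the API answers on its public health endpoint.
  curl -fsS -o /dev/null "https://refdata-api.k2.com/health"
}
```

This reverts only the API Deployment; CronJob manifests and any data written by the failed release would need separate handling.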
Deployment is successful when:
- All functional tests passing
- Performance targets met (p95 < 100ms)
- Zero critical/high severity bugs
- Monitoring & alerting operational
- Documentation complete
- Data quality tests passing
- Runbooks tested
- On-call team trained
SLOs (Service Level Objectives):
- Availability: 99.9% (43 minutes downtime/month)
- Latency: p95 < 100ms, p99 < 200ms
- Data Freshness: < 2 hours lag (ingestion + DBT)
- Data Quality: 100% DBT tests passing
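The 43-minute figure follows from the availability target: a 30-day month has 30 × 24 × 60 = 43,200 minutes, and 0.1% of that is 43.2 minutes. A small helper (hypothetical name, sketched here for convenience) reproduces the arithmetic for other targets or window lengths.

```shell
# Sketch only: computes the allowed downtime for an availability target.
downtime_budget_minutes() {
  # Usage: downtime_budget_minutes AVAILABILITY_PCT [DAYS]
  awk -v a="$1" -v d="${2:-30}" \
    'BEGIN { printf "%.1f\n", d * 24 * 60 * (100 - a) / 100 }'
}

# downtime_budget_minutes 99.9      # prints 43.2
# downtime_budget_minutes 99.9 31   # prints 44.6
```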
- Monitor production metrics daily
- Review error logs daily
- Hold daily standup with on-call team
- Address any high-priority bugs
- Update documentation based on learnings
- Conduct disaster recovery drill
- Review and optimize performance
- Add missing monitoring/alerts
- Update runbooks based on incidents
- Collect user feedback
- Review SLO metrics
- Plan Phase 2 features (Bybit, Coinbase)
- Optimize infrastructure costs
- Security audit
- Capacity planning
Deployment completed by: _____________________________ Date: _______
Verified by: _____________________________ Date: _______
Approved for production: _____________________________ Date: _______
# Check all pods
kubectl get pods -n refdata-platform
# Check API health
curl https://refdata-api.k2.com/health
# View API logs
kubectl logs -n refdata-platform -l app=refdata-api --tail=100 -f
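# Check latest ingestion job (sorted by creation time so the last line is the newest)
kubectl get jobs -n refdata-platform --sort-by=.metadata.creationTimestamp | grep ingestion | tail -1
# Check latest DBT job (sorted by creation time so the last line is the newest)
kubectl get jobs -n refdata-platform --sort-by=.metadata.creationTimestamp | grep dbt | tail -1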
# Force ingestion run
kubectl create job --from=cronjob/refdata-ingestion manual-$(date +%s) -n refdata-platform
# Force DBT run
kubectl create job --from=cronjob/refdata-dbt manual-$(date +%s) -n refdata-platform
# Rollback deployment
kubectl rollout undo deployment/refdata-api -n refdata-platform
# Scale API
kubectl scale deployment refdata-api --replicas=5 -n refdata-platform

Last Updated: 2026-01-23
Owner: Data Engineering Team
Next Review: 2026-04-23