Production Deployment Checklist

Purpose: Comprehensive checklist for production deployment of the K2 Reference Data Platform.

Use this before: Initial production deployment, major version upgrades, disaster recovery

Pre-Deployment

Infrastructure

Kubernetes cluster version 1.25+ verified
Kafka cluster accessible (shared with k2-market-data-platform)
PostgreSQL 14+ database created (refdata schema)
MinIO/S3 bucket created (refdata-warehouse)
Iceberg REST catalog deployed and accessible
Docker registry credentials configured
Network policies reviewed and approved

Access & Credentials

kubectl context configured for production cluster
S3/MinIO credentials generated (IAM user: refdata-platform)
Kafka credentials generated (ACLs: refdata.* topics)
PostgreSQL credentials generated (user: refdata_app)
Schema Registry access verified
Secrets created in Kubernetes namespace

Code & Configuration

Code reviewed and approved (PR merged to main)
All unit tests passing (make test-unit)
All integration tests passing (make test-integration)
DBT models validated (dbt test)
API tests passing (make test-integration)
Code quality checks passing (make quality)
Docker images built and pushed to registry
Configuration values verified (no hardcoded dev credentials)

Deployment

Step 1: Namespace & Secrets

Create Kubernetes namespace: refdata-platform
Label namespace: env=production
Create S3 credentials secret
Create Kafka credentials secret
Create PostgreSQL credentials secret
Verify secrets: kubectl get secrets -n refdata-platform
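
The Step 1 items can be sketched as one small script. The secret names (s3-credentials, kafka-credentials, postgres-credentials) and the ./secrets/*.env paths are assumptions for illustration and are not taken from the platform's manifests. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
# Sketch of Step 1. Secret names and env-file paths are hypothetical.
NAMESPACE="refdata-platform"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

create_namespace_and_secrets() {
  run kubectl create namespace "$NAMESPACE" || return 1
  run kubectl label namespace "$NAMESPACE" env=production || return 1
  # One generic secret per credential set, loaded from a local env file
  # so that secret values never appear on the command line.
  for name in s3 kafka postgres; do
    run kubectl create secret generic "${name}-credentials" \
      --from-env-file="./secrets/${name}.env" -n "$NAMESPACE" || return 1
  done
  run kubectl get secrets -n "$NAMESPACE"
}

# Usage: DRY_RUN=1; create_namespace_and_secrets
```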

Step 2: Initialize Infrastructure

Create Iceberg namespace: refdata
Create Bronze tables: bronze_instruments_binance, bronze_instruments_kraken
Register Avro schemas to Schema Registry
Create Kafka topics: refdata.instruments.binance.raw, refdata.instruments.kraken.raw
Initialize PostgreSQL state store tables
Verify Iceberg catalog: curl http://iceberg-rest:8181/v1/namespaces

Script: make init-infra (run from deployment pod)
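
The catalog verification can be turned into a pass/fail check rather than a visual inspection. This is a minimal sketch that assumes the catalog returns the standard Iceberg REST ListNamespaces shape, for example {"namespaces": [["refdata"]]}; the helper name has_iceberg_namespace is made up for this example.

```shell
# Succeeds if the ListNamespaces JSON on stdin contains the given
# single-level namespace, e.g. {"namespaces": [["refdata"]]}.
has_iceberg_namespace() {
  # Strip whitespace so pretty-printed and compact JSON look the same,
  # then search for the literal ["<name>"] entry.
  tr -d '[:space:]' | grep -qF "[\"$1\"]"
}

# Usage (URL from the checklist):
#   curl -fsS http://iceberg-rest:8181/v1/namespaces | has_iceberg_namespace refdata
```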

Step 3: Deploy API

Apply API deployment: kubectl apply -f api-deployment.yaml
Verify pods running: kubectl get pods -n refdata-platform -l app=refdata-api
Check logs: kubectl logs -n refdata-platform -l app=refdata-api --tail=100
Test health endpoint: curl http://refdata-api:8001/health
Verify connection pool initialized (logs show "pool_size=5")
Apply Ingress: kubectl apply -f api-ingress.yaml
Verify external URL accessible: curl https://refdata-api.k2.com/health
Test SSL certificate valid
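
Pods can take a few seconds to pass readiness after the deployment is applied, so a single curl may fail spuriously. The sketch below waits for the rollout and then retries the health endpoint. The retry counts, delay, and 120s timeout are arbitrary assumptions, and retry and verify_api are illustrative helper names.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  attempts=$1; delay=$2; shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "failed after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Wait for the rollout, then poll the in-cluster health endpoint.
verify_api() {
  kubectl rollout status deployment/refdata-api -n refdata-platform --timeout=120s &&
    retry 10 3 curl -fsS -o /dev/null http://refdata-api:8001/health
}
```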

Step 4: Deploy Ingestion CronJob

Apply ingestion CronJob: kubectl apply -f ingestion-cronjob.yaml
Manually trigger first run: kubectl create job --from=cronjob/refdata-ingestion ingestion-manual-1 -n refdata-platform
Monitor job logs: kubectl logs job/ingestion-manual-1 -n refdata-platform -f
Verify Kafka message published: Check refdata.instruments.binance.raw topic
Verify Bronze table populated: Query Iceberg table
Check state store updated: Query PostgreSQL
Verify no errors in logs
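
Triggering the job and waiting for it can be combined so the step fails loudly instead of relying on someone watching the logs. A minimal sketch follows; the 600s timeout is an assumption and run_manual_job is an illustrative name. Note that kubectl wait --for=condition=complete does not return early when a Job fails, so a failed run only surfaces once the timeout expires.

```shell
# run_manual_job CRONJOB JOB_NAME [NAMESPACE]
run_manual_job() {
  cronjob=$1; job=$2; ns=${3:-refdata-platform}
  kubectl create job --from="cronjob/$cronjob" "$job" -n "$ns" || return 1
  # Block until the Job reports the Complete condition or the timeout expires.
  if ! kubectl wait --for=condition=complete "job/$job" -n "$ns" --timeout=600s; then
    # Show the tail of the logs to help diagnose the failure.
    kubectl logs "job/$job" -n "$ns" --tail=50
    return 1
  fi
}

# Usage: run_manual_job refdata-ingestion "manual-$(date +%s)"
```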

Step 5: Deploy DBT CronJob

Apply DBT CronJob: kubectl apply -f dbt-cronjob.yaml
Manually trigger first run: kubectl create job --from=cronjob/refdata-dbt dbt-manual-1 -n refdata-platform
Monitor DBT logs: kubectl logs job/dbt-manual-1 -n refdata-platform -f
Verify Silver table populated: Query via API
Verify Gold table populated: Check symbology endpoint
Verify all DBT tests passed
Check DBT docs generated: make dbt-docs
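
"Verify all DBT tests passed" can be checked mechanically from the job logs. The sketch below assumes dbt ends its output with a summary line of the form "Done. PASS=… WARN=… ERROR=… SKIP=… TOTAL=…"; the exact format can vary between dbt versions, so treat the pattern as an assumption. dbt_run_clean is an illustrative name.

```shell
# Succeeds only if the last dbt summary line on stdin reports ERROR=0.
dbt_run_clean() {
  summary=$(grep -E 'PASS=[0-9]+ .*ERROR=[0-9]+' | tail -n 1)
  if [ -z "$summary" ]; then
    echo "no dbt summary line found" >&2
    return 1
  fi
  # Extract the number that follows ERROR= in the summary line.
  errors=$(printf '%s\n' "$summary" | sed -E 's/.*ERROR=([0-9]+).*/\1/')
  [ "$errors" -eq 0 ]
}

# Usage: kubectl logs job/<dbt-job-name> -n refdata-platform | dbt_run_clean
```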

Step 6: Configure Monitoring

Post-Deployment Verification

Functional Testing

Health Check: curl https://refdata-api.k2.com/health returns 200
List Instruments: curl 'https://refdata-api.k2.com/v1/instruments?limit=10' returns data (quote the URL so the shell does not treat ? as a glob)
Point-in-Time Query: Test with historical timestamp
Instrument History: Test audit trail endpoint
Symbology Lookup: Test canonical ID lookup
Reverse Symbology: Test exchange symbol resolution
Error Handling: Test 404 response for non-existent instrument
API Documentation: Verify /docs and /redoc accessible
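
The status-code checks above can be run as one smoke test. This sketch uses only the endpoints named in the checklist, except the path for the 404 case, which is a hypothetical route and should be replaced with the real one. BASE_URL, expect_status, and smoke_test are illustrative names.

```shell
BASE_URL="${BASE_URL:-https://refdata-api.k2.com}"

# expect_status PATH EXPECTED_CODE
expect_status() {
  code=$(curl -s -o /dev/null -w '%{http_code}' "$BASE_URL$1")
  if [ "$code" = "$2" ]; then
    echo "ok   $1 -> $code"
  else
    echo "FAIL $1 -> $code (expected $2)"
    return 1
  fi
}

# Runs every check and reports failure if any one of them failed.
smoke_test() {
  failed=0
  expect_status /health 200 || failed=1
  expect_status '/v1/instruments?limit=10' 200 || failed=1
  expect_status /docs 200 || failed=1
  expect_status /redoc 200 || failed=1
  # Hypothetical path for a non-existent instrument; adjust to the real route.
  expect_status /v1/instruments/does-not-exist 404 || failed=1
  return "$failed"
}
```

The point-in-time, history, and symbology checks are left out because their routes are not given here.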

Performance Testing

API p50 latency < 50ms
API p95 latency < 100ms
API p99 latency < 200ms
Connection pool utilization < 80% under load
Load test: 100 concurrent requests/sec for 5 minutes
No memory leaks (monitor pod memory over 1 hour)
No connection pool exhaustion

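Tools: ab -n 10000 -c 100 https://refdata-api.k2.com/health (quick run: 10,000 requests at concurrency 100); for the 5-minute load test, ab -t 300 -n 1000000 -c 100 https://refdata-api.k2.com/health (-n must follow -t, which otherwise caps the run at 50,000 requests)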
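
The latency targets can be checked directly against ab's percentile table instead of being read by eye. The sketch below parses the 50%, 95%, and 99% rows of ab's "Percentage of the requests served within a certain time (ms)" section and applies the thresholds listed above. check_ab_latency is an illustrative name.

```shell
# Reads ab output on stdin; succeeds if p50 < 50, p95 < 100 and p99 < 200 (ms).
check_ab_latency() {
  awk '
    $1 == "50%" { p50 = $2 }
    $1 == "95%" { p95 = $2 }
    $1 == "99%" { p99 = $2 }
    END {
      if (p50 == "" || p95 == "" || p99 == "") {
        print "percentiles not found" > "/dev/stderr"
        exit 2
      }
      printf "p50=%d p95=%d p99=%d\n", p50, p95, p99
      # Exit 0 when all three thresholds hold, 1 otherwise.
      exit !(p50 < 50 && p95 < 100 && p99 < 200)
    }'
}

# Usage: ab -n 10000 -c 100 https://refdata-api.k2.com/health | check_ab_latency
```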

Data Quality

Silver table row count matches Bronze ingestion
No duplicate instrument_sk values
All current records have valid_to = NULL
No temporal overlaps (SCD Type 2 test passes)
Gold symbology count matches unique canonical IDs
All DBT tests passing (dbt test)
No orphaned records (referential integrity)
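
The temporal-overlap check can also be run outside DBT on an exported extract. This is a rough sketch, assuming a CSV export with rows of the form key,valid_from,valid_to, ISO-8601 UTC timestamps (so string order equals time order), an empty valid_to for current records, and half-open intervals. The column layout and the name find_scd2_overlaps are assumptions, not the platform's schema.

```shell
# Reads "key,valid_from,valid_to" rows on stdin and reports versions that
# start before the previous version of the same key has ended.
find_scd2_overlaps() {
  sort -t, -k1,1 -k2,2 | awk -F, '
    # Same key as the previous row, and the previous row is still open
    # (empty valid_to) or ends after this row starts: overlap.
    $1 == key && (prev_to == "" || $2 < prev_to) {
      print "overlap: " $1 " at " $2
      bad = 1
    }
    { key = $1; prev_to = $3 }
    END { exit bad + 0 }'
}
```

Because rows are sorted by start time within each key, comparing each version only with its predecessor is enough to detect whether any overlap exists. The same pass also catches a key with more than one open (valid_to empty) record.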

Security

Secrets not visible in logs
No hardcoded credentials in code
Network policies enforced (API can't access Kafka directly)
HTTPS enforced (HTTP redirects to HTTPS)
CORS configured correctly (no allow_origins=* in production)
Rate limiting enabled (if implemented)
API key authentication enabled (if implemented)
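
The "HTTP redirects to HTTPS" item can be verified from the status code and redirect target that curl reports. A minimal sketch follows; is_https_redirect is an illustrative name, and the set of accepted redirect codes is an assumption.

```shell
# is_https_redirect STATUS_CODE REDIRECT_URL
# Succeeds when the response is a redirect whose target uses https.
is_https_redirect() {
  case "$1" in
    301|302|307|308) ;;
    *) return 1 ;;
  esac
  case "$2" in
    https://*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   out=$(curl -s -o /dev/null -w '%{http_code} %{redirect_url}' http://refdata-api.k2.com/health)
#   is_https_redirect "${out%% *}" "${out#* }"
```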

Operational Readiness

Documentation

README.md updated with production URLs
API-GUIDE.md reviewed and accurate
DBT-GUIDE.md up to date
DEPLOYMENT.md runbook created
MANUAL-OVERRIDE.md runbook created
ADRs documented (5 required ADRs exist)
OpenAPI spec published

Monitoring & Alerting

Grafana dashboard imported and verified
Prometheus metrics exposed (/metrics endpoint)
Alerts configured in Alertmanager
PagerDuty integration tested
Slack notifications configured (#k2-refdata-platform)
Log aggregation configured (Loki/Elasticsearch)
Distributed tracing enabled (Jaeger/Tempo)

Runbooks

Deployment runbook tested
Manual override runbook tested (in dev environment)
Rollback procedure documented and tested
Disaster recovery plan documented
On-call runbook created
Escalation matrix defined

Backup & DR

Iceberg snapshot retention configured (30 days)
PostgreSQL state store backup scheduled (daily)
Kafka topic retention configured (7 days)
S3 bucket versioning enabled
Disaster recovery drill scheduled (within 30 days)
RTO/RPO defined (RTO: 1 hour, RPO: 1 hour)
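
Kafka expresses topic retention in milliseconds (retention.ms), so the 7-day target needs converting. The sketch below does the conversion and prints, without executing, a kafka-configs.sh command for one topic. The bootstrap address kafka:9092 is a placeholder assumption, and the helper names are illustrative.

```shell
# Convert whole days to milliseconds.
days_to_ms() {
  echo $(( $1 * 24 * 60 * 60 * 1000 ))
}

# retention_command TOPIC DAYS [BOOTSTRAP_SERVER]
# Prints the command for review; it is not run here.
retention_command() {
  topic=$1; days=$2; bootstrap=${3:-kafka:9092}
  echo "kafka-configs.sh --bootstrap-server $bootstrap --alter" \
    "--entity-type topics --entity-name $topic" \
    "--add-config retention.ms=$(days_to_ms "$days")"
}

# Usage: retention_command refdata.instruments.binance.raw 7
```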

Go-Live Checklist

Pre-Launch (T-24 hours)

Send deployment notification to stakeholders
Freeze code changes (release branch locked)
Final smoke test in staging environment
On-call engineer assigned
Backup on-call engineer assigned
Rollback plan reviewed

Launch (T-0)

Execute deployment steps (Deployment section above)
Monitor Grafana dashboard continuously (first 30 minutes)
Monitor error logs in real-time
Verify API requests succeeding
Check Kafka message flow
Verify DBT run successful

Post-Launch (T+1 hour)

Verify metrics stable:
- API request rate > 0
- Error rate < 1%
- Latency < 100ms p95
Verify ingestion ran successfully
Verify DBT transformations completed
Check data quality (row counts, tests)
No critical alerts fired

Post-Launch (T+24 hours)

Review past 24 hours of logs
Check resource utilization (CPU, memory)
Verify scheduled CronJobs ran (ingestion + DBT)
Review metrics dashboard
Document any issues encountered
Send deployment success notification
Schedule post-mortem (if issues occurred)

Rollback Triggers

Roll back immediately if:

API error rate > 10% for 5 minutes
API unavailable for 5 minutes
Data corruption detected (duplicate records, missing data)
Critical security vulnerability discovered
Production incident (P0/P1) caused by deployment

Rollback Procedure:

Notify stakeholders: #k2-refdata-platform
Execute rollback: kubectl rollout undo deployment/refdata-api -n refdata-platform
Verify rollback successful: Check health endpoint
Investigate root cause
Create incident report
Schedule fix deployment
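
The error-rate trigger and the rollback commands can be sketched together. The error and request counts are assumed to come from a 5-minute window of whatever metrics source is in use; how they are obtained is out of scope here. error_rate_exceeds and rollback_api are illustrative names, and the 120s timeout is an assumption.

```shell
# error_rate_exceeds ERRORS TOTAL [THRESHOLD_PCT]
# Succeeds when ERRORS/TOTAL is strictly above the threshold (default 10%).
error_rate_exceeds() {
  awk -v e="$1" -v t="$2" -v th="${3:-10}" 'BEGIN {
    if (t <= 0) exit 1            # no traffic: rate undefined, do not trigger
    exit !((e / t) * 100 > th)
  }'
}

# Undo the last rollout, wait for it to settle, then confirm health.
rollback_api() {
  kubectl rollout undo deployment/refdata-api -n refdata-platform &&
    kubectl rollout status deployment/refdata-api -n refdata-platform --timeout=120s &&
    curl -fsS -o /dev/null https://refdata-api.k2.com/health
}

# Usage: error_rate_exceeds "$errors_5m" "$requests_5m" && rollback_api
```

Zero traffic is deliberately not treated as an error-rate breach, since "API unavailable for 5 minutes" is a separate trigger.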

Success Criteria

Deployment is successful when:

SLOs (Service Level Objectives):

Availability: 99.9% (43 minutes downtime/month)
Latency: p95 < 100ms, p99 < 200ms
Data Freshness: < 2 hours lag (ingestion + DBT)
Data Quality: 100% DBT tests passing
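
The downtime figure follows from the availability target: over a 30-day month, 0.1% of 43,200 minutes is about 43 minutes. A small helper makes it easy to recompute the budget for other targets; downtime_budget_minutes is an illustrative name and the 30-day month is an assumption.

```shell
# downtime_budget_minutes AVAILABILITY_PCT [DAYS]
# Prints the allowed downtime in minutes for the period (default 30 days).
downtime_budget_minutes() {
  awk -v a="$1" -v d="${2:-30}" \
    'BEGIN { printf "%.1f\n", (100 - a) / 100 * d * 24 * 60 }'
}

# Usage: downtime_budget_minutes 99.9      # 30-day month
#        downtime_budget_minutes 99.95 30
```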

Post-Deployment Tasks

Immediate (Week 1)

Monitor production metrics daily
Review error logs daily
Hold daily standup with on-call team
Address any high-priority bugs
Update documentation based on learnings

Short-term (Month 1)

Conduct disaster recovery drill
Review and optimize performance
Add missing monitoring/alerts
Update runbooks based on incidents
Collect user feedback

Long-term (Quarter 1)

Sign-Off

Deployment completed by: _____________________________ Date: _______

Verified by: _____________________________ Date: _______

Approved for production: _____________________________ Date: _______

Appendix: Quick Commands

# Check all pods
kubectl get pods -n refdata-platform

# Check API health
curl https://refdata-api.k2.com/health

# View API logs
kubectl logs -n refdata-platform -l app=refdata-api --tail=100 -f

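# Check latest ingestion job (sorted by creation time so the last line is the newest)
kubectl get jobs -n refdata-platform --sort-by=.metadata.creationTimestamp | grep ingestion | tail -1

# Check latest DBT job (sorted by creation time so the last line is the newest)
kubectl get jobs -n refdata-platform --sort-by=.metadata.creationTimestamp | grep dbt | tail -1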

# Force ingestion run
kubectl create job --from=cronjob/refdata-ingestion manual-$(date +%s) -n refdata-platform

# Force DBT run
kubectl create job --from=cronjob/refdata-dbt manual-$(date +%s) -n refdata-platform

# Rollback deployment
kubectl rollout undo deployment/refdata-api -n refdata-platform

# Scale API
kubectl scale deployment refdata-api --replicas=5 -n refdata-platform

Last Updated: 2026-01-23 | Owner: Data Engineering Team | Next Review: 2026-04-23
