Production Deployment Checklist

Purpose: Comprehensive checklist for production deployment of the K2 Reference Data Platform.

Use this before: Initial production deployment, major version upgrades, disaster recovery


Pre-Deployment

Infrastructure

  • Kubernetes cluster version 1.25+ verified
  • Kafka cluster accessible (shared with k2-market-data-platform)
  • PostgreSQL 14+ database created (refdata schema)
  • MinIO/S3 bucket created (refdata-warehouse)
  • Iceberg REST catalog deployed and accessible
  • Docker registry credentials configured
  • Network policies reviewed and approved

Access & Credentials

  • kubectl context configured for production cluster
  • S3/MinIO credentials generated (IAM user: refdata-platform)
  • Kafka credentials generated (ACLs: refdata.* topics)
  • PostgreSQL credentials generated (user: refdata_app)
  • Schema Registry access verified
  • Secrets created in Kubernetes namespace

Code & Configuration

  • Code reviewed and approved (PR merged to main)
  • All unit tests passing (make test-unit)
  • All integration tests passing (make test-integration)
  • DBT models validated (dbt test)
  • API tests passing (make test-integration)
  • Code quality checks passing (make quality)
  • Docker images built and pushed to registry
  • Configuration values verified (no hardcoded dev credentials)

Deployment

Step 1: Namespace & Secrets

  • Create Kubernetes namespace: refdata-platform
  • Label namespace: env=production
  • Create S3 credentials secret
  • Create Kafka credentials secret
  • Create PostgreSQL credentials secret
  • Verify secrets: kubectl get secrets -n refdata-platform
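
The namespace and secret steps above could be scripted roughly as follows. This is a minimal sketch, not the platform's actual tooling: the secret names (`refdata-s3-credentials` and so on), the key names inside each secret, and the environment variables that carry the credential values are all assumptions and need to match whatever the deployment manifests reference.

```shell
# Sketch of Step 1. Secret names, key names, and the environment variables
# holding the credential values are hypothetical; align them with your manifests.
NAMESPACE="refdata-platform"

bootstrap_namespace() {
  # --overwrite makes the label step safe to re-run
  kubectl create namespace "$NAMESPACE" &&
  kubectl label namespace "$NAMESPACE" env=production --overwrite
}

create_credentials_secrets() {
  # Abort early if any credential is missing from the environment
  : "${S3_ACCESS_KEY_ID:?}" "${S3_SECRET_ACCESS_KEY:?}"
  : "${KAFKA_USERNAME:?}" "${KAFKA_PASSWORD:?}" "${POSTGRES_PASSWORD:?}"

  kubectl create secret generic refdata-s3-credentials -n "$NAMESPACE" \
    --from-literal=access-key-id="$S3_ACCESS_KEY_ID" \
    --from-literal=secret-access-key="$S3_SECRET_ACCESS_KEY" &&
  kubectl create secret generic refdata-kafka-credentials -n "$NAMESPACE" \
    --from-literal=username="$KAFKA_USERNAME" \
    --from-literal=password="$KAFKA_PASSWORD" &&
  kubectl create secret generic refdata-postgres-credentials -n "$NAMESPACE" \
    --from-literal=username=refdata_app \
    --from-literal=password="$POSTGRES_PASSWORD" &&
  # Lists secret names only; values are not printed
  kubectl get secrets -n "$NAMESPACE"
}

# Usage: bootstrap_namespace && create_credentials_secrets
```

Values passed with `--from-literal` can end up in shell history and process listings, so a secrets manager or `--from-env-file` with a short-lived file may be preferable in practice.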

Step 2: Initialize Infrastructure

  • Create Iceberg namespace: refdata
  • Create Bronze tables: bronze_instruments_binance, bronze_instruments_kraken
  • Register Avro schemas to Schema Registry
  • Create Kafka topics: refdata.instruments.binance.raw, refdata.instruments.kraken.raw
  • Initialize PostgreSQL state store tables
  • Verify Iceberg catalog: curl http://iceberg-rest:8181/v1/namespaces

Script: make init-infra (run from deployment pod)
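
The contents of `make init-infra` are not shown in this checklist. Purely as an illustration of two of the steps above, the Iceberg namespace and the raw Kafka topics could be created by hand as sketched below. The Kafka bootstrap address, partition count, and replication factor are assumptions, and a secured cluster would also need client properties passed via `--command-config`.

```shell
# Illustrative only: not the implementation behind `make init-infra`.
ICEBERG_REST_URL="${ICEBERG_REST_URL:-http://iceberg-rest:8181}"
KAFKA_BOOTSTRAP="${KAFKA_BOOTSTRAP:-kafka:9092}"          # hypothetical address
KAFKA_TOPICS_BIN="${KAFKA_TOPICS_BIN:-kafka-topics.sh}"   # path to the Kafka CLI

create_iceberg_namespace() {
  # Iceberg REST catalog: POST /v1/namespaces creates a namespace
  curl -fsS -X POST "$ICEBERG_REST_URL/v1/namespaces" \
    -H 'Content-Type: application/json' \
    -d '{"namespace": ["refdata"], "properties": {}}'
}

create_raw_topics() {
  local topic
  for topic in refdata.instruments.binance.raw refdata.instruments.kraken.raw; do
    # --if-not-exists keeps the step idempotent; sizing values are assumptions
    "$KAFKA_TOPICS_BIN" --bootstrap-server "$KAFKA_BOOTSTRAP" \
      --create --if-not-exists --topic "$topic" \
      --partitions 3 --replication-factor 3 || return 1
  done
}

# Usage: create_iceberg_namespace && create_raw_topics
```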

Step 3: Deploy API

  • Apply API deployment: kubectl apply -f api-deployment.yaml
  • Verify pods running: kubectl get pods -l app=refdata-api
  • Check logs: kubectl logs -l app=refdata-api --tail=100
  • Test health endpoint: curl http://refdata-api:8001/health
  • Verify connection pool initialized (logs show "pool_size=5")
  • Apply Ingress: kubectl apply -f api-ingress.yaml
  • Verify external URL accessible: curl https://refdata-api.k2.com/health
  • Test SSL certificate valid
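
Pods can report Running a few seconds before the API answers, so the health checks above are easier to automate with a bounded retry. The sketch below waits for the rollout and then polls the health endpoint; the attempt count, delay, and timeout are arbitrary assumptions.

```shell
# Poll a URL until it returns HTTP 200 or the attempts run out.
# Arguments: URL, max attempts (default 30), delay in seconds (default 2).
wait_for_health() {
  local url attempts delay i code
  url="$1"; attempts="${2:-30}"; delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -w prints only the status code; curl reports 000 when it cannot connect
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || true)
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s): $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts: $url (last status: ${code:-none})" >&2
  return 1
}

verify_api_rollout() {
  # Blocks until the Deployment's new ReplicaSet is fully available
  kubectl rollout status deployment/refdata-api -n refdata-platform --timeout=180s &&
  wait_for_health "https://refdata-api.k2.com/health"
}

# Usage: verify_api_rollout
```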

Step 4: Deploy Ingestion CronJob

  • Apply ingestion CronJob: kubectl apply -f ingestion-cronjob.yaml
  • Manually trigger first run: kubectl create job --from=cronjob/refdata-ingestion refdata-ingestion-manual-1
  • Monitor job logs: kubectl logs job/refdata-ingestion-manual-1 -f
  • Verify Kafka message published: Check refdata.instruments.binance.raw topic
  • Verify Bronze table populated: Query Iceberg table
  • Check state store updated: Query PostgreSQL
  • Verify no errors in logs
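
The manual trigger above can be wrapped so that it blocks until the Job finishes instead of relying on someone watching the log stream. A minimal sketch follows; the job-name pattern and the 15-minute timeout are assumptions.

```shell
# Create a one-off Job from a CronJob, wait for it to complete, print its logs.
# Arguments: CronJob name, namespace (default refdata-platform).
run_cronjob_once() {
  local cronjob namespace job
  cronjob="$1"; namespace="${2:-refdata-platform}"
  # Timestamp suffix keeps repeated manual runs from colliding on the name
  job="${cronjob}-manual-$(date +%s)"

  kubectl create job "$job" --from="cronjob/${cronjob}" -n "$namespace" &&
  # Returns non-zero if the Complete condition is not reached within the
  # timeout, which is also what happens when the Job fails
  kubectl wait --for=condition=complete "job/${job}" -n "$namespace" --timeout=900s &&
  kubectl logs "job/${job}" -n "$namespace" --tail=50
}

# Usage: run_cronjob_once refdata-ingestion
```

The same helper works for the DBT CronJob by passing `refdata-dbt`.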

Step 5: Deploy DBT CronJob

  • Apply DBT CronJob: kubectl apply -f dbt-cronjob.yaml
  • Manually trigger first run: kubectl create job --from=cronjob/refdata-dbt refdata-dbt-manual-1
  • Monitor DBT logs: kubectl logs job/refdata-dbt-manual-1 -f
  • Verify Silver table populated: Query via API
  • Verify Gold table populated: Check symbology endpoint
  • Verify all DBT tests passed
  • Check DBT docs generated: make dbt-docs

Step 6: Configure Monitoring

  • Apply ServiceMonitor for Prometheus: kubectl apply -f servicemonitor.yaml
  • Import Grafana dashboard: refdata-platform-dashboard.json
  • Verify metrics scraping: Check Prometheus targets
  • Set up alerts:
    • API p95 latency > 200ms
    • Ingestion job failure
    • DBT job failure
    • API error rate > 1%
  • Configure PagerDuty integration
  • Test alert delivery: Trigger test alert

Post-Deployment Verification

Functional Testing

  • Health Check: curl https://refdata-api.k2.com/health returns 200
  • List Instruments: curl 'https://refdata-api.k2.com/v1/instruments?limit=10' returns data (quote the URL so the shell does not interpret the ?)
  • Point-in-Time Query: Test with historical timestamp
  • Instrument History: Test audit trail endpoint
  • Symbology Lookup: Test canonical ID lookup
  • Reverse Symbology: Test exchange symbol resolution
  • Error Handling: Test 404 response for non-existent instrument
  • API Documentation: Verify /docs and /redoc accessible

Performance Testing

  • API p50 latency < 50ms
  • API p95 latency < 100ms
  • API p99 latency < 200ms
  • Connection pool utilization < 80% under load
  • Load test: 100 concurrent connections sustained for 5 minutes
  • No memory leaks (monitor pod memory over 1 hour)
  • No connection pool exhaustion

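Tools: ab -t 300 -n 1000000 -c 100 https://refdata-api.k2.com/health (-t 300 caps the run at 5 minutes; -n comes after -t because -t on its own implies a 50,000-request limit)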
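
ApacheBench prints a percentile table at the end of its report, so the p50/p95/p99 targets above can be checked mechanically rather than by eye. The sketch below parses that table from a saved report; the function names are illustrative.

```shell
# Extract one percentile (in ms) from ApacheBench's
# "Percentage of the requests served within a certain time" table on stdin.
ab_percentile() {
  awk -v p="$1%" '$1 == p { print $2; exit }'
}

# Succeeds only if the report meets p50 < 50ms, p95 < 100ms, p99 < 200ms.
check_latency_targets() {
  local report p50 p95 p99
  report="$1"
  p50=$(printf '%s\n' "$report" | ab_percentile 50)
  p95=$(printf '%s\n' "$report" | ab_percentile 95)
  p99=$(printf '%s\n' "$report" | ab_percentile 99)
  if [ -z "$p50" ] || [ -z "$p95" ] || [ -z "$p99" ]; then
    echo "percentile table not found in report" >&2
    return 2
  fi
  echo "p50=${p50}ms p95=${p95}ms p99=${p99}ms"
  [ "$p50" -lt 50 ] && [ "$p95" -lt 100 ] && [ "$p99" -lt 200 ]
}

# Usage: save ab's output to a file, then
#   check_latency_targets "$(cat ab-report.txt)"
```

Note that `/health` is unlikely to touch the database, so a run against a data endpoint is needed to say anything about connection pool utilization.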

Data Quality

  • Silver table row count matches Bronze ingestion
  • No duplicate instrument_sk values
  • All current records have valid_to IS NULL
  • No temporal overlaps (SCD Type 2 test passes)
  • Gold symbology count matches unique canonical IDs
  • All DBT tests passing (dbt test)
  • No orphaned records (referential integrity)
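
Two of the checks above, duplicate surrogate keys and SCD Type 2 temporal overlaps, can be spot-checked outside DBT on an exported extract. The sketch below assumes a hypothetical CSV export with columns `instrument_id,valid_from,valid_to`, no header, ISO-8601 UTC timestamps in one fixed format, and an empty `valid_to` for the current version. How the extract is produced depends on the query engine in use and is not covered here.

```shell
# Print key values that occur more than once (one key per input line).
find_duplicate_keys() {
  LC_ALL=C sort | uniq -d
}

# Print every row whose validity window starts before the previous window
# for the same instrument has ended (or while it is still open-ended).
find_temporal_overlaps() {
  # Fixed-format ISO-8601 timestamps sort and compare correctly as strings
  LC_ALL=C sort -t, -k1,1 -k2,2 | awk -F, '
    $1 == prev_id && (prev_to == "" || (prev_to "") > ($2 "")) { print }
    { prev_id = $1; prev_to = $3 }
  '
}

# Usage (file names are placeholders):
#   cut -d, -f1 silver_keys.csv | find_duplicate_keys
#   find_temporal_overlaps < silver_validity.csv
```

Empty output from both functions means the extract passed; any printed row is a violation to investigate.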

Security

  • Secrets not visible in logs
  • No hardcoded credentials in code
  • Network policies enforced (API can't access Kafka directly)
  • HTTPS enforced (HTTP redirects to HTTPS)
  • CORS configured correctly (no allow_origins=* in production)
  • Rate limiting enabled (if implemented)
  • API key authentication enabled (if implemented)
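
The HTTPS and CORS items above can be probed from outside the cluster with curl. A minimal sketch, assuming the `/health` path is reachable over plain HTTP at the ingress and using `https://example.invalid` as a deliberately untrusted origin:

```shell
# Succeeds if plain HTTP on the given host answers with a redirect to https://
check_https_redirect() {
  local result code target
  result=$(curl -s -o /dev/null -w '%{http_code} %{redirect_url}' "http://$1/health")
  code=${result%% *}
  target=${result#* }
  case "$code" in
    301|302|307|308) ;;
    *) return 1 ;;
  esac
  case "$target" in
    https://*) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeeds if the API neither returns a wildcard nor echoes back an
# untrusted Origin in Access-Control-Allow-Origin.
check_cors_not_permissive() {
  local header
  header=$(curl -s -D - -o /dev/null -H 'Origin: https://example.invalid' "$1" |
    tr -d '\r' | grep -i '^access-control-allow-origin:')
  case "$header" in
    *'*'*|*example.invalid*) return 1 ;;
    *) return 0 ;;
  esac
}

# Usage:
#   check_https_redirect refdata-api.k2.com
#   check_cors_not_permissive https://refdata-api.k2.com/health
```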

Operational Readiness

Documentation

  • README.md updated with production URLs
  • API-GUIDE.md reviewed and accurate
  • DBT-GUIDE.md up to date
  • DEPLOYMENT.md runbook created
  • MANUAL-OVERRIDE.md runbook created
  • ADRs documented (5 required ADRs exist)
  • OpenAPI spec published

Monitoring & Alerting

  • Grafana dashboard imported and verified
  • Prometheus metrics exposed (/metrics endpoint)
  • Alerts configured in Alertmanager
  • PagerDuty integration tested
  • Slack notifications configured (#k2-refdata-platform)
  • Log aggregation configured (Loki/Elasticsearch)
  • Distributed tracing enabled (Jaeger/Tempo)

Runbooks

  • Deployment runbook tested
  • Manual override runbook tested (in dev environment)
  • Rollback procedure documented and tested
  • Disaster recovery plan documented
  • On-call runbook created
  • Escalation matrix defined

Backup & DR

  • Iceberg snapshot retention configured (30 days)
  • PostgreSQL state store backup scheduled (daily)
  • Kafka topic retention configured (7 days)
  • S3 bucket versioning enabled
  • Disaster recovery drill scheduled (within 30 days)
  • RTO/RPO defined (RTO: 1 hour, RPO: 1 hour)
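
Kafka expresses topic retention in milliseconds, so the 7-day target above has to be converted before it is applied. The sketch below shows the conversion plus the corresponding CLI calls; the bootstrap address is an assumption, and the `aws s3api` call applies to S3 (a MinIO deployment would use its own client or an `--endpoint-url` override).

```shell
KAFKA_BOOTSTRAP="${KAFKA_BOOTSTRAP:-kafka:9092}"            # hypothetical address
KAFKA_CONFIGS_BIN="${KAFKA_CONFIGS_BIN:-kafka-configs.sh}"  # path to the Kafka CLI

# Convert whole days to milliseconds.
days_to_ms() {
  echo $(( $1 * 24 * 60 * 60 * 1000 ))
}

# Arguments: topic name, retention in days.
set_topic_retention() {
  "$KAFKA_CONFIGS_BIN" --bootstrap-server "$KAFKA_BOOTSTRAP" \
    --alter --entity-type topics --entity-name "$1" \
    --add-config "retention.ms=$(days_to_ms "$2")"
}

# Argument: bucket name.
enable_bucket_versioning() {
  aws s3api put-bucket-versioning --bucket "$1" \
    --versioning-configuration Status=Enabled
}

# Usage:
#   days_to_ms 7        # 604800000
#   set_topic_retention refdata.instruments.binance.raw 7
#   enable_bucket_versioning refdata-warehouse
```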

Go-Live Checklist

Pre-Launch (T-24 hours)

  • Send deployment notification to stakeholders
  • Freeze code changes (release branch locked)
  • Final smoke test in staging environment
  • On-call engineer assigned
  • Backup on-call engineer assigned
  • Rollback plan reviewed

Launch (T-0)

  • Execute deployment steps (Deployment section above)
  • Monitor Grafana dashboard continuously (first 30 minutes)
  • Monitor error logs in real-time
  • Verify API requests succeeding
  • Check Kafka message flow
  • Verify DBT run successful

Post-Launch (T+1 hour)

  • Verify metrics stable:
    • API request rate > 0
    • Error rate < 1%
    • p95 latency < 100ms
  • Verify ingestion ran successfully
  • Verify DBT transformations completed
  • Check data quality (row counts, tests)
  • No critical alerts fired

Post-Launch (T+24 hours)

  • Review past 24 hours of logs
  • Check resource utilization (CPU, memory)
  • Verify scheduled CronJobs ran (ingestion + DBT)
  • Review metrics dashboard
  • Document any issues encountered
  • Send deployment success notification
  • Schedule post-mortem (if issues occurred)

Rollback Triggers

Roll back immediately if:

  • API error rate > 10% for 5 minutes
  • API unavailable for 5 minutes
  • Data corruption detected (duplicate records, missing data)
  • Critical security vulnerability discovered
  • Production incident (P0/P1) caused by deployment

Rollback Procedure:

  1. Notify stakeholders: #k2-refdata-platform
  2. Execute rollback: kubectl rollout undo deployment/refdata-api -n refdata-platform
  3. Verify rollback successful: Check health endpoint
  4. Investigate root cause
  5. Create incident report
  6. Schedule fix deployment
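
Steps 2 and 3 of the procedure above can be chained so that the rollback is only reported as done once the previous revision is serving traffic. A minimal sketch; the 180-second timeout is an assumption.

```shell
# Roll the API Deployment back one revision and verify it is healthy.
rollback_api() {
  local ns deploy
  ns="refdata-platform"; deploy="deployment/refdata-api"

  # Record the revision list first so the incident report can cite it
  kubectl rollout history "$deploy" -n "$ns" &&
  kubectl rollout undo "$deploy" -n "$ns" &&
  # Wait until the restored ReplicaSet is fully available
  kubectl rollout status "$deploy" -n "$ns" --timeout=180s &&
  # -f makes curl fail on HTTP 4xx/5xx responses
  curl -fsS -o /dev/null "https://refdata-api.k2.com/health" &&
  echo "rollback verified: health endpoint returned success"
}

# Usage: rollback_api
```

`kubectl rollout undo` only reverts the Deployment. CronJob definitions and any data already written by the new version are not rolled back and need separate handling.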

Success Criteria

Deployment is successful when:

  • All functional tests passing
  • Performance targets met (p95 < 100ms)
  • Zero critical/high severity bugs
  • Monitoring & alerting operational
  • Documentation complete
  • Data quality tests passing
  • Runbooks tested
  • On-call team trained

SLOs (Service Level Objectives):

  • Availability: 99.9% (about 43 minutes of downtime per 30-day month)
  • Latency: p95 < 100ms, p99 < 200ms
  • Data Freshness: < 2 hours lag (ingestion + DBT)
  • Data Quality: 100% DBT tests passing
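
The downtime figure attached to the availability SLO follows from the target and the length of the period: allowed downtime = (100 - availability) / 100 × days × 24 × 60 minutes. A small helper makes it easy to recompute for other targets or periods; the function name is illustrative.

```shell
# Allowed downtime in minutes for an availability target (percent)
# over a period in days (default 30).
downtime_budget_minutes() {
  LC_ALL=C awk -v a="$1" -v d="${2:-30}" \
    'BEGIN { printf "%.1f\n", (100 - a) / 100 * d * 24 * 60 }'
}

# Usage:
#   downtime_budget_minutes 99.9 30   # 43.2
#   downtime_budget_minutes 99.9 31   # 44.6
```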

Post-Deployment Tasks

Immediate (Week 1)

  • Monitor production metrics daily
  • Review error logs daily
  • Hold daily standup with on-call team
  • Address any high-priority bugs
  • Update documentation based on learnings

Short-term (Month 1)

  • Conduct disaster recovery drill
  • Review and optimize performance
  • Add missing monitoring/alerts
  • Update runbooks based on incidents
  • Collect user feedback

Long-term (Quarter 1)

  • Review SLO metrics
  • Plan Phase 2 features (Bybit, Coinbase)
  • Optimize infrastructure costs
  • Security audit
  • Capacity planning

Sign-Off

Deployment completed by: _____________________________ Date: _______

Verified by: _____________________________ Date: _______

Approved for production: _____________________________ Date: _______


Appendix: Quick Commands

# Check all pods
kubectl get pods -n refdata-platform

# Check API health
curl https://refdata-api.k2.com/health

# View API logs
kubectl logs -n refdata-platform -l app=refdata-api --tail=100 -f

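# Check latest ingestion job (sorted by creation time so tail -1 is the newest)
kubectl get jobs -n refdata-platform --sort-by=.metadata.creationTimestamp | grep ingestion | tail -1

# Check latest DBT job (sorted by creation time so tail -1 is the newest)
kubectl get jobs -n refdata-platform --sort-by=.metadata.creationTimestamp | grep dbt | tail -1

# Force ingestion run (name includes "ingestion" so the check above finds it)
kubectl create job --from=cronjob/refdata-ingestion refdata-ingestion-manual-$(date +%s) -n refdata-platform

# Force DBT run (name includes "dbt" so the check above finds it)
kubectl create job --from=cronjob/refdata-dbt refdata-dbt-manual-$(date +%s) -n refdata-platform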

# Rollback deployment
kubectl rollout undo deployment/refdata-api -n refdata-platform

# Scale API
kubectl scale deployment refdata-api --replicas=5 -n refdata-platform

Last Updated: 2026-01-23
Owner: Data Engineering Team
Next Review: 2026-04-23