DevOps interviews focus on CI/CD pipeline design, infrastructure as code, container orchestration, monitoring, incident response, and SRE principles. This guide covers common questions, scenarios, and key concepts to help you prepare.
Scenario: Design a CI/CD pipeline. Requirements: 20+ microservices, multiple teams, frequent deployments
- Source control: monorepo vs multi-repo (trade-offs)
- Monorepo: easier code sharing, atomic changes across services, complex CI triggers
- Multi-repo: clearer ownership, independent versioning, simpler per-service CI
- Pipeline stages:
- Code commit triggers pipeline (webhook from GitHub/GitLab)
- Build: compile, lint, unit tests
- Container image build and push to registry
- Security scanning (SAST, container image scanning)
- Deploy to staging environment
- Integration and end-to-end tests
- Manual approval gate (for production)
- Deploy to production (canary or blue-green)
- Post-deployment smoke tests and monitoring
- Tools: GitHub Actions, GitLab CI, Jenkins, ArgoCD, Flux
- GitOps core principles:
- Git is the single source of truth for infrastructure and application state
- Changes are made through pull requests, not direct cluster access
- Automated reconciliation ensures desired state matches actual state
- Implementation:
- Application manifests stored in a Git repository
- ArgoCD or Flux watches the repository for changes
- Pull-based deployment (agent pulls config, more secure than push)
- Drift detection alerts when actual state diverges from desired state
- Benefits: audit trail, rollback via git revert, consistent environments
- Documentation: https://argo-cd.readthedocs.io/en/stable/
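The reconciliation idea above can be sketched in a few lines. This is a toy model, not how ArgoCD or Flux are implemented: the `reconcile` function, the `Action` type, and the resource names are all hypothetical, and each resource is reduced to a single version string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str  # "create", "update", or "delete"
    name: str


def reconcile(desired: dict[str, str], actual: dict[str, str]) -> list[Action]:
    """Return the actions needed to make `actual` match `desired`.

    Keys are resource names; values stand in for a manifest hash or image tag.
    """
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(Action("create", name))
        elif actual[name] != spec:
            actions.append(Action("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(Action("delete", name))
    return actions


desired = {"api": "v2", "worker": "v1"}      # state declared in Git
actual = {"api": "v1", "legacy-cron": "v1"}  # state observed in the cluster
for action in reconcile(desired, actual):
    print(action.kind, action.name)
```

An empty result means there is no drift. In an interview, this is a convenient way to explain why `git revert` works as a rollback: the desired state changes, and the same loop converges the cluster back.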
- Use migration tools: Flyway, Liquibase, Alembic, golang-migrate
- Migrations are versioned and stored in source control
- Forward-only migrations (never edit a migration that has been applied)
- Backward-compatible changes for zero-downtime deployments:
- Add columns as nullable first
- Create new tables before removing old ones
- Use expand-contract pattern for schema changes
- Run migrations as a separate step before application deployment
- Include rollback scripts for every migration
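A minimal sketch of a versioned, forward-only migration runner, assuming an in-memory SQLite database. The `MIGRATIONS` list, the `schema_version` table, and the `migrate` function are hypothetical names for illustration; real tools such as Flyway or Alembic add checksums, locking, and rollback handling on top of this.

```python
import sqlite3

# Hypothetical migrations, ordered by version. Applied entries are never edited;
# a fix ships as a new, higher-numbered migration.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "ALTER TABLE users ADD COLUMN email TEXT"),  # nullable, so old code keeps working
]


def migrate(conn: sqlite3.Connection) -> list[int]:
    """Apply pending migrations in version order and return the versions applied."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_version")}
    newly_applied = []
    for version, statement in sorted(MIGRATIONS):
        if version in applied:
            continue
        conn.execute(statement)
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
        conn.commit()
        newly_applied.append(version)
    return newly_applied


conn = sqlite3.connect(":memory:")
print(migrate(conn))  # first run applies everything pending
print(migrate(conn))  # second run finds nothing to do
```

Running the migration step twice is safe because applied versions are recorded, which is what lets it run as its own pipeline stage before every application deployment.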
- Never store secrets in source control
- Use secret management tools:
- HashiCorp Vault
- AWS Secrets Manager
- Azure Key Vault
- Google Secret Manager
- CI/CD platform secrets (GitHub Actions secrets, GitLab CI variables)
- For Kubernetes: External Secrets Operator to sync cloud secrets
- Rotate secrets automatically on a regular schedule
- Audit secret access and usage
Scenario: Your team accidentally deleted the Terraform state file. What do you do?
- Prevention:
- Store state in a remote backend (S3 + DynamoDB, Azure Blob, GCS)
- Enable state file versioning on the storage bucket
- Enable state locking to prevent concurrent modifications
- Restrict access to the state file (contains sensitive data)
- Recovery:
- Restore from versioned backup in the storage backend
- If no backup: use `terraform import` to re-associate resources
- Never recreate resources that are already running in production
- Documentation: https://developer.hashicorp.com/terraform/language/state
Scenario: Two engineers are running Terraform apply simultaneously. What happens?
- Without state locking: state corruption, resource conflicts, potential outages
- With state locking (DynamoDB, Consul, Azure Blob lease): the second operation fails immediately with a lock error (or waits, if `-lock-timeout` is set)
- Best practice: always enable state locking, run Terraform in CI/CD only (not locally)
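The "second operation fails immediately" behavior can be illustrated with an exclusive lock file. This is an analogy only, not how Terraform backends implement locking; `state_lock`, `LockHeldError`, and the lock path are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager


class LockHeldError(RuntimeError):
    """Raised when another operation already holds the lock."""


@contextmanager
def state_lock(lock_path: str):
    """Hold an exclusive lock for the duration of the `with` block.

    O_CREAT | O_EXCL makes creation atomic: if the file already exists,
    os.open raises FileExistsError instead of waiting.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise LockHeldError(f"state is locked: {lock_path}") from None
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def second_apply_is_rejected(lock_path: str) -> bool:
    """Simulate two overlapping applies; return True if the second one is refused."""
    with state_lock(lock_path):          # first engineer's apply is in progress
        try:
            with state_lock(lock_path):  # second engineer starts an apply
                return False
        except LockHeldError:
            return True


lock_path = os.path.join(tempfile.mkdtemp(), "terraform.tfstate.lock")
print(second_apply_is_rejected(lock_path))
```

The key property is that acquiring the lock is a single atomic operation, so two writers can never both believe they hold it.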
- Keep modules focused on a single responsibility
- Use input variables with descriptions, types, and validation rules
- Output essential values that other modules need
- Version modules using Git tags (semantic versioning)
- Use a module registry (Terraform Cloud, private Git repos)
- Test modules with Terratest or terraform-compliance
- Example module structure:

    modules/
    ├── networking/
    │   ├── main.tf
    │   ├── variables.tf
    │   ├── outputs.tf
    │   └── README.md
    ├── compute/
    └── database/

- Documentation: https://developer.hashicorp.com/terraform/language/modules/develop
- Drift: actual infrastructure differs from Terraform state
- Detection: run `terraform plan` regularly (scheduled in CI/CD)
- Causes: manual changes in console, other tools modifying resources
- Response options:
- Import the manual change: `terraform import`
- Revert the manual change: `terraform apply` to enforce desired state
- Update Terraform config to match the manual change (if intentional)
- Prevention: restrict console access, enforce changes through IaC only
- Tools: Driftctl, Spacelift drift detection, Terraform Cloud drift detection
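One way to schedule the detection step is to run `terraform plan -detailed-exitcode` and branch on its exit code (0 for no changes, 1 for an error, 2 for pending changes). The sketch below assumes `terraform` is on the PATH and the working directory is already initialized; the function names and outcome labels are hypothetical. Note that exit code 2 covers any pending change, including unapplied config edits, not only manual drift.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_OUTCOMES = {
    0: "in_sync",  # plan succeeded, no changes
    1: "error",    # plan failed
    2: "drift",    # plan succeeded, changes are pending
}


def classify_plan_exit(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit code to an outcome label."""
    return PLAN_OUTCOMES.get(code, "error")


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` and report the outcome (requires terraform on PATH)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

A scheduled CI job could call `check_drift` per workspace and open a ticket whenever the result is `"drift"`.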
Scenario: Pods are stuck in CrashLoopBackOff. How do you diagnose?
- Check pod events: `kubectl describe pod <pod-name>`
- Check container logs: `kubectl logs <pod-name> --previous`
- Common causes:
- Application crash on startup (configuration error, missing env vars)
- Failed health checks (readiness/liveness probes misconfigured)
- OOM killed (memory limits too low)
- Missing dependencies (database not reachable, secrets not mounted)
- Check resource usage against limits: `kubectl top pod`
- Exec into the container for debugging: `kubectl exec -it <pod-name> -- /bin/sh`
Scenario: A service is unreachable from other pods in the cluster.
- Verify the Service exists: `kubectl get svc`
- Check Service selector matches pod labels: `kubectl describe svc <svc-name>`
- Verify endpoints: `kubectl get endpoints <svc-name>` (should list pod IPs)
- Check network policies: `kubectl get networkpolicies`
- Test DNS resolution: `kubectl exec -it <pod> -- nslookup <svc-name>`
- Check pod readiness: unready pods are removed from Service endpoints
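Two of the checks above (selector mismatch and pod readiness) explain most empty endpoint lists. The following is a simplified model of equality-based selector matching, not Kubernetes code; the pod data, IPs, and function names are made up for illustration.

```python
def selects(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """True when every key/value pair in the selector is present in the pod labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def endpoints(selector: dict[str, str], pods: list[dict]) -> list[str]:
    """Return IPs of ready pods whose labels match the selector."""
    return [p["ip"] for p in pods if p["ready"] and selects(selector, p["labels"])]


pods = [
    {"ip": "10.0.0.4", "ready": True, "labels": {"app": "checkout", "tier": "web"}},
    {"ip": "10.0.0.5", "ready": False, "labels": {"app": "checkout", "tier": "web"}},
    {"ip": "10.0.0.6", "ready": True, "labels": {"app": "check-out", "tier": "web"}},
]
print(endpoints({"app": "checkout"}, pods))
```

Only the first pod qualifies: the second fails its readiness check and the third has a label typo, which is exactly the kind of mismatch `kubectl describe svc` and `kubectl get endpoints` help uncover.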
- Rolling update: gradually replaces old pods with new pods
- Configurable: maxSurge and maxUnavailable
- Roll back manually with `kubectl rollout undo` (a Deployment does not roll back on its own when a rollout fails)
- No additional infrastructure needed
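A short worked calculation shows what `maxSurge` and `maxUnavailable` mean for pod counts during a rolling update. The helper names are hypothetical; the rounding follows the Kubernetes rule that a percentage `maxSurge` rounds up and a percentage `maxUnavailable` rounds down, and 25% is the default for both.

```python
def resolve(value, replicas: int, round_up: bool) -> int:
    """Resolve an absolute count or a percentage string such as "25%"."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        if round_up:
            return -(-replicas * percent // 100)  # integer ceiling division
        return replicas * percent // 100
    return int(value)


def rollout_bounds(replicas: int, max_surge="25%", max_unavailable="25%"):
    """Return (minimum available pods, maximum total pods) during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas - unavailable, replicas + surge


print(rollout_bounds(10))
print(rollout_bounds(4, max_surge=1, max_unavailable=0))
```

With 10 replicas and the defaults, at least 8 pods stay available and at most 13 exist at once. Setting `maxUnavailable` to 0 trades rollout speed for never dropping below full capacity.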
- Blue-green: run two identical environments (blue = current, green = new)
- Switch traffic by updating the Service selector
- Instant rollback by switching selector back
- Requires double the resources during deployment
- Canary: deploy new version to a small subset of pods
- Route a percentage of traffic to canary pods
- Monitor error rates and latency
- Gradually increase traffic to canary
- Tools: Flagger, Argo Rollouts, Istio traffic splitting
- Documentation: https://argoproj.github.io/argo-rollouts/
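The canary steps above amount to a loop with a health gate. This sketch is not the Flagger or Argo Rollouts API; the step weights, the 1% error threshold, and the metric callables are assumptions chosen for illustration.

```python
def run_canary(steps, error_rate_at, max_error_rate=0.01):
    """Walk through traffic weights, aborting if the canary's error rate is too high.

    steps:          traffic percentages to try, in increasing order
    error_rate_at:  callable returning the observed canary error rate at a weight
    Returns (final_weight, status); a rollback sets the weight back to 0.
    """
    for weight in steps:
        observed = error_rate_at(weight)
        if observed > max_error_rate:
            return 0, f"rolled back at {weight}% (error rate {observed:.2%})"
    return 100, "promoted"


def healthy(weight):
    """Stand-in metric source: the canary stays healthy at every weight."""
    return 0.002


def degrades_under_load(weight):
    """Stand-in metric source: errors spike once the canary takes half the traffic."""
    return 0.002 if weight < 50 else 0.03


print(run_canary([5, 25, 50, 100], healthy))
print(run_canary([5, 25, 50, 100], degrades_under_load))
```

In a real rollout the metric callable would query a monitoring system after a soak period at each step, and latency would usually be gated alongside error rate.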
- A/B testing: route traffic based on request attributes (headers, cookies, user ID)
- Requires service mesh or advanced ingress controller
- Used for feature testing and user experience experiments
- Always set resource requests and limits for CPU and memory
- Requests: guaranteed resources (used for scheduling)
- Limits: maximum resources (a container is throttled if it exceeds its CPU limit and OOM-killed if it exceeds its memory limit)
- Use Vertical Pod Autoscaler (VPA) to recommend resource values
- Use Horizontal Pod Autoscaler (HPA) to scale pod count based on metrics
- Use Pod Disruption Budgets (PDBs) to ensure availability during maintenance
- Documentation: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
- Four pillars of observability:
- Metrics: Prometheus + Grafana (or Datadog, CloudWatch)
- Logs: EFK stack (Elasticsearch, Fluentd, Kibana) or Loki + Grafana
- Traces: Jaeger, Zipkin, or cloud-native (X-Ray, Cloud Trace)
- Events: Kubernetes events, CloudWatch Events, custom event streams
- Infrastructure: CPU, memory, disk, network utilization
- Application: request rate, error rate, latency (RED method)
- Business: sign-ups, orders, revenue, user engagement
- Saturation: queue depth, thread pool usage, connection pool usage
- Alert on symptoms, not causes (e.g., "high error rate" not "CPU is high")
- Use severity levels: critical (page), warning (ticket), info (log)
- Avoid alert fatigue - every alert should require action
- Include runbook links in alert notifications
- Implement alerting hierarchy:
- Automated remediation (self-healing)
- Dashboard with manual intervention
- Notification to team channel
- Page on-call engineer
- Review and tune alerts regularly (suppress noisy alerts, add missing ones)
- Define SLIs: availability, latency, throughput, error rate
- Set SLOs: 99.9% availability, p99 latency under 200ms
- Alert on error budget burn rate, not individual metric thresholds
- Fast burn alert: consuming error budget at 14.4x rate (pages immediately)
- Slow burn alert: consuming error budget at 1x rate (creates ticket)
- Documentation: https://sre.google/workbook/alerting-on-slos/
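The burn-rate figures are easier to remember with the arithmetic behind them. Burn rate is the observed error rate divided by the error budget (1 - SLO). The sketch below assumes a 30-day SLO period; under that assumption a 14.4x burn sustained for one hour consumes 2% of the period's budget, which is where the fast-burn threshold comes from. Function names are illustrative.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1 - slo)


def budget_consumed(rate: float, window_hours: float, period_hours: float = 30 * 24) -> float:
    """Fraction of the period's error budget spent if `rate` holds for the window."""
    return rate * window_hours / period_hours


slo = 0.999
fast = burn_rate(error_rate=0.0144, slo=slo)  # 1.44% of requests failing
print(round(fast, 1), round(budget_consumed(fast, window_hours=1), 3))
print(round(budget_consumed(1.0, window_hours=30 * 24), 3))
```

A 1x burn rate spends exactly the whole budget over the full period, which is why it only warrants a ticket rather than a page.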
- Triage: check monitoring dashboards for scope and impact
- Communicate: update status page, notify stakeholders
- Investigate:
- Check recent deployments (rollback if correlated)
- Review application logs for error messages
- Check infrastructure health (database, cache, external services)
- Verify DNS and certificate status
- Mitigate: apply the fastest fix first (rollback, restart, scale up)
- Resolve: apply permanent fix after mitigation
- Follow up: write post-mortem, create action items
- Identify the problem queries: slow query log, Performance Insights, pg_stat_activity
- Check for recent changes: new deployment, schema change, increased traffic
- Immediate mitigation: kill long-running queries, add read replica, scale up
- Root cause: missing index, N+1 query, inefficient join, table lock contention
- Long-term fix: optimize queries, add indexes, implement caching, connection pooling
- Blameless culture - focus on systems and processes, not individuals
- Timeline of events with timestamps
- Root cause analysis (use 5 Whys technique)
- What went well and what could be improved
- Action items with owners and due dates
- Share findings broadly to prevent recurrence
- Template: https://sre.google/sre-book/postmortem-culture/
- SLI (Service Level Indicator): quantitative measure of service reliability
- Availability: successful requests / total requests
- Latency: proportion of requests faster than threshold
- Throughput: requests processed per second
- SLO (Service Level Objective): target value for an SLI
- Example: 99.9% of requests should succeed (allows 8.76 hours downtime per year)
- SLA (Service Level Agreement): contractual obligation with consequences
- Typically less aggressive than SLOs (SLO of 99.95%, SLA of 99.9%)
- Error budget = 1 - SLO target
- Example: 99.9% SLO means 0.1% error budget (43.8 minutes per month)
- When error budget is healthy: push new features, experiment
- When error budget is consumed: freeze deployments, focus on reliability
- Aligns development velocity with reliability goals
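The downtime figures quoted for a 99.9% SLO follow directly from the error budget formula. This short calculation assumes a 365-day year and an average month of 730 hours; the function name is illustrative.

```python
def allowed_downtime_minutes(slo: float, period_hours: float) -> float:
    """Minutes of full outage that fit inside the error budget for the period."""
    return (1 - slo) * period_hours * 60


HOURS_PER_YEAR = 365 * 24              # 8760
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # 730, an average month

for slo in (0.99, 0.999, 0.9999):
    monthly = allowed_downtime_minutes(slo, HOURS_PER_MONTH)
    yearly_hours = allowed_downtime_minutes(slo, HOURS_PER_YEAR) / 60
    print(f"{slo:.2%}: {monthly:.1f} min/month, {yearly_hours:.2f} h/year")
```

Each extra nine cuts the budget by a factor of ten, which is a useful point to make when asked how to choose an SLO target.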
- Toil: manual, repetitive, automatable work that scales linearly with service size
- Goal: keep toil below 50% of engineering time
- Prioritize automating tasks that are:
- Frequent (daily or weekly)
- Time-consuming
- Error-prone when done manually
- Predictable and well-defined
- Automation examples: scaling, deployment, certificate renewal, log rotation
- How does DNS resolution work? Recursive resolver, root servers, TLD servers, authoritative servers
- Explain TCP three-way handshake. SYN, SYN-ACK, ACK
- What is a reverse proxy? Receives client requests on behalf of backend servers (Nginx, HAProxy)
- HTTP status codes: 2xx success, 3xx redirect, 4xx client error, 5xx server error
- How do you troubleshoot high CPU? `top`, `htop`, `strace`, `perf`
- How do you troubleshoot disk space? `df -h`, `du -sh *`, `find / -size +100M`
- How do you troubleshoot network issues? `ping`, `traceroute`, `netstat`, `ss`, `tcpdump`
- What is the difference between a process and a thread? Processes have separate memory spaces; threads share memory within a process
- What is the principle of least privilege? Grant only the minimum permissions needed
- Explain mTLS. Both client and server present certificates for mutual authentication
- What is a zero-trust network? Trust nothing, verify everything - authentication and authorization at every boundary
- The Phoenix Project by Gene Kim (DevOps culture)
- Site Reliability Engineering by Google: https://sre.google/sre-book/table-of-contents/
- Kubernetes Documentation: https://kubernetes.io/docs/home/
- Terraform Documentation: https://developer.hashicorp.com/terraform/docs
- The DevOps Handbook by Gene Kim, Jez Humble, Patrick Debois, John Willis