This guide covers comprehensive DevOps interview questions from fundamentals to advanced topics, including scenario-based questions commonly asked at FAANG/MAANG companies.
Q: Define DevOps and its core principles.
A: DevOps is a cultural and technical movement that combines development (Dev) and operations (Ops) to enable faster, more reliable software delivery. Core principles:
- Culture: Break down silos, shared responsibility
- Automation: CI/CD, IaC, configuration management
- Lean: Eliminate waste, continuous improvement
- Measurement: Data-driven decisions, monitoring
- Sharing: Knowledge sharing, blameless postmortems
Q: Explain the difference between DevOps, SRE, and Platform Engineering.
A:
- DevOps: Cultural movement focused on collaboration and automation
- SRE: Engineering discipline that originated at Google, often described as one implementation of DevOps, with a focus on reliability (SLOs, error budgets)
- Platform Engineering: Building internal platforms to abstract infrastructure complexity
Q: Explain CI/CD and its benefits.
A:
- CI (Continuous Integration): Automatically build, test, and validate code on every commit
- CD (Continuous Delivery): Automatically prepare releases for deployment
- CD (Continuous Deployment): Automatically deploy to production
Benefits: Faster feedback, reduced risk, consistent releases, improved quality
Q: What's the difference between Continuous Delivery and Continuous Deployment?
A:
- Continuous Delivery: Code is always deployable, but requires manual approval
- Continuous Deployment: Every passing change automatically deploys to production
Q: Explain the difference between Docker image and container.
A:
- Image: Read-only template containing application and dependencies
- Container: Running instance of an image with its own filesystem, network, process space
Q: How do Docker layers work?
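A: Instructions that change the filesystem (RUN, COPY, ADD) each create a layer; other instructions only add image metadata. Layers are:
- Cached and reusable
- Stacked on top of each other
- Read-only; a running container adds a thin writable layer on top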
- Shared between images using same base
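The caching behavior listed above can be pictured with a toy model in Python. This is a simplified sketch, not Docker's actual implementation: it assumes each build step's cache key depends only on its parent's key and the instruction text, whereas the real builder also checksums the files used by COPY and ADD. The helper names `layer_keys` and `reused_layers` are made up for this example.

```python
import hashlib


def layer_keys(instructions):
    """Return one cache key per build step; each key chains the parent key."""
    keys = []
    parent = ""
    for instruction in instructions:
        digest = hashlib.sha256((parent + "\n" + instruction).encode()).hexdigest()
        keys.append(digest[:12])
        parent = digest[:12]
    return keys


def reused_layers(old_build, new_build):
    """Count the leading steps whose cache keys still match the previous build."""
    count = 0
    for old_key, new_key in zip(layer_keys(old_build), layer_keys(new_build)):
        if old_key != new_key:
            break  # every later step is invalidated as well
        count += 1
    return count


before = ["FROM node:18", "WORKDIR /app", "COPY . .", "RUN npm ci && npm run build"]
after = ["FROM node:18", "WORKDIR /srv", "COPY . .", "RUN npm ci && npm run build"]
print(reused_layers(before, after))  # prints 1: only the FROM step is reused
```

Because a change invalidates every step after it, instructions that change rarely are usually placed early in the Dockerfile.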
Q: What is a multi-stage build and why use it?
A: Multi-stage builds use multiple FROM statements to:
- Separate build and runtime environments
- Reduce final image size (no build tools)
- Improve security (fewer attack vectors)
# Build stage
FROM node:18 AS builder
WORKDIR /app
COPY . .
RUN npm ci && npm run build
# Runtime stage
FROM nginx:alpine
COPY --from=builder /app/dist /usr/share/nginx/html
Q: How would you troubleshoot a container that keeps crashing?
A:
- Check logs: docker logs <container>
- Check exit code: docker inspect --format='{{.State.ExitCode}}' <container>
- Run interactively: docker run -it --entrypoint sh <image>
- Check resource limits: Memory/CPU constraints
- Verify health checks and startup probes
- Check for missing environment variables or configs
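The exit code from the checklist above is often the fastest clue. The following is a minimal Python sketch for reading it, assuming the usual convention that codes above 128 mean "terminated by signal (code - 128)" and the 125/126/127 meanings documented for `docker run`. The function name `explain_exit_code` and the small signal table are illustrative choices, not part of any Docker tooling.

```python
# A few common signals; extend the table as needed.
SIGNALS = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}

# Exit codes that `docker run` reserves for its own failures.
DOCKER_CODES = {
    125: "the Docker daemon failed to run the container",
    126: "the container command could not be invoked",
    127: "the container command was not found",
}


def explain_exit_code(code):
    """Translate a container exit code into a short human-readable hint."""
    if code == 0:
        return "exited cleanly"
    if code in DOCKER_CODES:
        return DOCKER_CODES[code]
    if code > 128:
        signum = code - 128
        name = SIGNALS.get(signum, "signal " + str(signum))
        return "terminated by " + name
    return "application error (exit status " + str(code) + ")"


print(explain_exit_code(137))  # terminated by SIGKILL
print(explain_exit_code(1))    # application error (exit status 1)
```

A 137 is frequently, though not always, an out-of-memory kill; `docker inspect --format='{{.State.OOMKilled}}' <container>` can confirm it.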
Q: Explain Kubernetes architecture.
A:
- Control Plane: API Server, etcd, Scheduler, Controller Manager
- Worker Nodes: kubelet, kube-proxy, Container Runtime
- API Server: Front-end, validates requests, updates etcd
- etcd: Distributed key-value store for cluster state
- Scheduler: Assigns pods to nodes
- kubelet: Node agent, manages pod lifecycle
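To make "Scheduler: Assigns pods to nodes" concrete, here is a toy Python sketch of the filter-then-score idea. It is an assumption-heavy simplification: it considers only free CPU in millicores and prefers the node with the most headroom, while the real kube-scheduler runs many filtering and scoring plugins. The node names and the `schedule` function are hypothetical.

```python
def schedule(pod_cpu_millis, free_cpu_by_node):
    """Pick a node for a pod: filter out nodes that cannot fit it, then score."""
    # Filter: keep only nodes with enough free CPU for the pod's request.
    feasible = {
        node: free
        for node, free in free_cpu_by_node.items()
        if free >= pod_cpu_millis
    }
    if not feasible:
        return None  # the pod would stay Pending
    # Score: prefer the node with the most free CPU; ties break by name.
    return max(sorted(feasible), key=lambda node: feasible[node])


nodes = {"node-a": 300, "node-b": 2000, "node-c": 800}
print(schedule(500, nodes))   # node-b
print(schedule(4000, nodes))  # None
```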
Q: What's the difference between Deployment and StatefulSet?
A:
- Deployment: Stateless apps, pods are interchangeable, random names
- StatefulSet: Stateful apps, stable pod identity, ordered deployment, persistent storage per pod
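The "stable pod identity" of a StatefulSet can be sketched in a few lines of Python. The sketch assumes the default ordered pod management policy, a headless Service named `db-headless`, and the default cluster domain `cluster.local`; those names and both helper functions are illustrative.

```python
def statefulset_pod_names(name, replicas):
    """Pods are named <statefulset>-<ordinal>, with ordinals starting at 0."""
    return [name + "-" + str(ordinal) for ordinal in range(replicas)]


def pod_dns(pod, service, namespace="default", cluster_domain="cluster.local"):
    """Stable DNS name of a StatefulSet pod behind its headless Service."""
    return ".".join([pod, service, namespace, "svc", cluster_domain])


pods = statefulset_pod_names("db", 3)
print(pods)                             # ['db-0', 'db-1', 'db-2'] (creation order)
print(list(reversed(pods)))             # ['db-2', 'db-1', 'db-0'] (scale-down order)
print(pod_dns("db-0", "db-headless"))   # db-0.db-headless.default.svc.cluster.local
```

A Deployment pod, by contrast, gets a generated suffix and a new name every time it is replaced.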
Q: Explain Kubernetes networking model.
A:
- Every pod gets a unique IP
- Pods can reach any other pod, on any node, without NAT
- Services abstract pod IPs (ClusterIP, NodePort, LoadBalancer)
- Ingress for external HTTP routing
Q: How would you debug a pod stuck in CrashLoopBackOff?
A:
# Check pod events
kubectl describe pod <pod-name>
# Check logs
kubectl logs <pod-name> --previous
# Check resource usage (requires metrics-server)
kubectl top pod <pod-name>
# Exec into pod (if possible)
kubectl exec -it <pod-name> -- sh
# Check YAML configuration
kubectl get pod <pod-name> -o yaml
Q: Explain Terraform state and why it's important.
A: State tracks:
- Resource IDs mapped to configuration
- Metadata and dependencies
- Enables plan/apply operations
Best practices:
- Remote state (S3, GCS)
- State locking (e.g., DynamoDB with the S3 backend)
- Never edit state manually
- Use workspaces for environments
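Why state "enables plan/apply operations" is easier to see with a toy diff. The Python sketch below is only a mental model, not how Terraform is implemented: it assumes configuration and state are plain dictionaries keyed by resource address, and it ignores provider refreshes and dependencies. The resource addresses and the `plan` function are hypothetical.

```python
def plan(config, state):
    """Compare desired configuration with recorded state; return planned actions."""
    desired, recorded = set(config), set(state)
    return {
        # In config but unknown to state: must be created.
        "create": sorted(desired - recorded),
        # In both, but the attributes differ: must be updated.
        "update": sorted(a for a in desired & recorded if config[a] != state[a]),
        # In state but no longer in config: must be destroyed.
        "delete": sorted(recorded - desired),
    }


config = {
    "aws_s3_bucket.logs": {"versioning": True},
    "aws_instance.web": {"instance_type": "t3.small"},
}
state = {
    "aws_instance.web": {"instance_type": "t3.micro"},
    "aws_instance.old": {"instance_type": "t3.micro"},
}
print(plan(config, state))
```

Without the recorded state, the tool could not tell that `aws_instance.old` was ever managed, which is one reason losing or hand-editing state is risky.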
Q: How do you handle secrets in Terraform?
A:
- Never commit secrets to version control
- Use environment variables (TF_VAR_<name>) or -var flags
- Integrate with Vault or AWS Secrets Manager
- Mark sensitive outputs: sensitive = true
- Use SOPS for encrypted tfvars
Q: Explain Terraform modules.
A: Modules are reusable Terraform configurations:
- Encapsulate related resources
- Accept input variables
- Expose outputs
- Version controlled
- Enable DRY infrastructure
Q: What are the three pillars of observability?
A:
- Metrics: Numerical data over time (Prometheus)
- Logs: Detailed event records (Loki, ELK)
- Traces: Request flow across services (Jaeger, Zipkin)
Q: Explain the difference between monitoring and observability.
A:
- Monitoring: Collecting known metrics, alerts on thresholds
- Observability: Understanding system state from outputs, debugging unknown issues
Q: How would you design alerting for a microservices architecture?
A:
- Alert on symptoms (user impact), not causes
- Use SLO-based alerting
- Implement alert hierarchy (page/ticket/log)
- Avoid alert fatigue with proper thresholds
- Include runbooks with alerts
- Use multi-window burn rate alerts
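Multi-window burn rate alerting can be worked through with a little arithmetic. The Python sketch below assumes a 30-day SLO period and the commonly cited paging rule of "2% of the error budget spent in 1 hour", checked over a long and a short window; the function names and sample error ratios are illustrative, not taken from a specific monitoring tool.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo)


def threshold_for(budget_fraction, window_hours, period_hours=720):
    """Burn rate that spends budget_fraction of the budget within window_hours."""
    return budget_fraction * period_hours / window_hours


def should_page(long_window_errors, short_window_errors, slo, threshold):
    """Page only if both windows burn too fast, so stale incidents stop alerting."""
    return (
        burn_rate(long_window_errors, slo) > threshold
        and burn_rate(short_window_errors, slo) > threshold
    )


slo = 0.999                           # 99.9% availability target
page_threshold = threshold_for(0.02, 1)  # about 14.4 for a 30-day period

# 2% errors in both the 1h and 5m windows: burn rate is about 20, so page.
print(should_page(0.02, 0.02, slo, page_threshold))
# The 1h window is still high but the last 5m recovered: do not page.
print(should_page(0.02, 0.0005, slo, page_threshold))
```

The short window is what keeps the alert from firing long after the problem has been mitigated.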
Q: How do you handle secrets in Kubernetes?
A:
- Kubernetes Secrets (base64-encoded, not encrypted at rest by default)
- Enable encryption at rest for etcd
- External Secrets Operator with Vault/AWS Secrets Manager
- Sealed Secrets for GitOps
- SOPS with age/KMS
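The point that base64 is an encoding rather than encryption is quick to demonstrate. This short Python sketch uses a made-up value; anyone who can read the Secret object can reverse it without a key.

```python
import base64

# What ends up in the `data` field of a Secret manifest.
encoded = base64.b64encode(b"admin").decode()
print(encoded)  # YWRtaW4=

# Decoding needs no key or password, only read access to the Secret.
decoded = base64.b64decode(encoded).decode()
print(decoded)  # admin
```

This is why the remaining items in the list (encryption at rest, external secret stores, sealed or SOPS-encrypted manifests) matter.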
Q: Explain the principle of least privilege.
A: Grant only minimum permissions required:
- Time-limited access (JIT)
- Role-based access control
- Regular access reviews
- Separate service accounts per component
Q: A production deployment caused errors. How do you handle it?
A:
- Immediate: Roll back deployment
- Communicate: Notify stakeholders
- Investigate: Check logs, metrics, recent changes
- Root cause: Analyze what went wrong
- Fix: Implement proper fix
- Prevent: Add tests, improve CI/CD gates
- Document: Blameless postmortem
Q: Design a zero-downtime deployment strategy.
A:
- Blue-Green: Two identical environments, switch traffic
- Canary: Gradual rollout to percentage of users
- Rolling: Replace pods gradually
Key considerations:
- Backward-compatible database changes
- Health checks before receiving traffic
- Quick rollback capability
- Feature flags for new functionality
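For the rolling strategy, the capacity bounds during a rollout follow from two settings. The Python sketch below assumes the Kubernetes Deployment defaults of 25% `maxSurge` (rounded up) and 25% `maxUnavailable` (rounded down); the function name and return keys are made up for illustration.

```python
def rolling_update_bounds(replicas, max_surge_pct=25, max_unavailable_pct=25):
    """Return the pod-count limits a Deployment keeps during a rolling update."""
    # maxSurge percentages round up; integer math avoids float surprises.
    surge = -(-replicas * max_surge_pct // 100)
    # maxUnavailable percentages round down.
    unavailable = replicas * max_unavailable_pct // 100
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


print(rolling_update_bounds(10))
# {'max_total_pods': 13, 'min_available_pods': 8}
```

Setting `max_unavailable_pct=0` in this model keeps full capacity throughout the rollout at the cost of needing room for the surge pods.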
Q: How would you reduce deployment time from 30 minutes to 5 minutes?
A:
- Parallelize test execution
- Use faster CI runners (self-hosted, larger instances)
- Implement efficient caching (dependencies, Docker layers)
- Optimize Docker builds (multi-stage, minimal base images)
- Skip unnecessary steps (incremental builds)
- Use test selection (only affected tests)
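Parallelizing tests pays off most when the shards are balanced. The following Python sketch shows one possible approach, a greedy "longest test first" assignment; the test names and durations are invented, and real CI systems may use their own timing-based splitting.

```python
import heapq


def shard_tests(durations, runners):
    """Assign tests to runners, longest first, always to the least-loaded runner.

    Returns a list of (total_seconds, runner_index, test_names) tuples.
    """
    heap = [(0, index, []) for index in range(runners)]
    heapq.heapify(heap)
    longest_first = sorted(durations.items(), key=lambda item: item[1], reverse=True)
    for name, seconds in longest_first:
        total, index, tests = heapq.heappop(heap)  # least-loaded runner
        tests.append(name)
        heapq.heappush(heap, (total + seconds, index, tests))
    return sorted(heap, key=lambda shard: shard[1])


durations = {"api": 300, "db": 200, "ui": 200, "auth": 100, "lint": 100}
shards = shard_tests(durations, runners=2)
for total, index, tests in shards:
    print(index, total, tests)

# Wall-clock time is the slowest shard, not the sum of all tests.
print(max(total for total, _, _ in shards))  # 500, versus 900 when run serially
```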
Q: A critical service is experiencing 50% error rate. Walk through your debugging process.
A:
- Assess scope: Which endpoints? Which users?
- Check recent changes: Deployments, config changes
- Review metrics: CPU, memory, connections, queue depth
- Check dependencies: Database, external APIs, DNS
- Analyze logs: Error patterns, stack traces
- Trace requests: Identify where failures occur
- Mitigate: Scale, rollback, failover
- Document: Timeline, actions, recovery
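The first step, assessing scope, usually means breaking the overall error rate down by endpoint. A minimal Python sketch, assuming request records are available as `(endpoint, status_code)` pairs and that only 5xx responses count as errors; the sample data is invented.

```python
from collections import Counter


def error_rate_by_endpoint(requests):
    """Return the fraction of 5xx responses for each endpoint."""
    totals, errors = Counter(), Counter()
    for endpoint, status in requests:
        totals[endpoint] += 1
        if status >= 500:
            errors[endpoint] += 1
    return {endpoint: errors[endpoint] / totals[endpoint] for endpoint in totals}


sample = [
    ("/checkout", 500), ("/checkout", 502), ("/checkout", 500), ("/checkout", 200),
    ("/search", 200), ("/search", 200), ("/search", 200), ("/search", 500),
]
print(error_rate_by_endpoint(sample))
# {'/checkout': 0.75, '/search': 0.25}
```

An overall 50% error rate that is concentrated in one endpoint points at that endpoint's dependencies or latest change rather than at shared infrastructure.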
# Docker
docker build -t app:v1 .
docker run -d -p 8080:80 app:v1
docker exec -it <container> sh
docker logs -f <container>
# Kubernetes
kubectl get pods -A
kubectl describe pod <pod>
kubectl logs -f <pod>
kubectl exec -it <pod> -- sh
kubectl rollout restart deployment/<name>
kubectl rollout undo deployment/<name>
# Terraform
terraform init
terraform plan -out=plan.tfplan
terraform apply plan.tfplan
terraform state list
terraform import <resource> <id>