| last-updated | 2026-05-03 |
|---|---|
| difficulty | intermediate |
This comprehensive roadmap guides DevOps Engineers and Site Reliability Engineers through essential skills, tools, and certifications needed to excel in modern infrastructure, automation, and reliability engineering roles.
- Core Competencies
- Foundational Knowledge
- Essential Tools & Technologies
- Certification Paths
- Learning Path Recommendations
- Career Progression
Essential Skills:
- Command-line proficiency (bash, zsh)
- File system management and permissions
- Process management and monitoring
- Package management (apt, yum, dnf)
- User and group management
- System performance tuning
- Log management and analysis
- Kernel basics and system calls
Key Areas:
- Systemd and service management
- Cron jobs and scheduling
- Shell scripting and automation
- SSH and secure remote access
- Network configuration and troubleshooting
- Storage management (LVM, RAID)
Core Knowledge:
- OSI and TCP/IP models
- IP addressing and subnetting
- DNS, DHCP, and routing
- Load balancing concepts
- Firewalls and security groups
- VPN and VPC architecture
- CDN and edge computing
- Network monitoring and debugging
Tools:
- tcpdump, Wireshark
- netstat, ss, ip commands
- dig, nslookup, host
- curl, wget, httpie
- iptables, nftables
Languages (Priority Order):
- Python - Automation, APIs, data processing
- Go - High-performance tools, Kubernetes operators
- Bash - System automation, CI/CD scripts
- JavaScript/TypeScript - Infrastructure as Code, serverless
- Ruby - Configuration management (Chef)
Key Concepts:
- REST API design and consumption
- Error handling and logging
- Testing (unit, integration)
- Version control with Git
- Code review practices
- Documentation standards
Git Mastery:
- Branching strategies (GitFlow, trunk-based)
- Merge vs. rebase workflows
- Git hooks and automation
- Resolving conflicts
- Cherry-picking and history rewriting
- Submodules and subtrees
- Git LFS for large files
Platforms:
- GitHub (Actions, Pages, Packages)
- GitLab (CI/CD, Container Registry)
- Bitbucket (Pipelines)
- Azure Repos
- AWS CodeCommit
Core Concepts:
- HCL syntax and structure
- Providers and resources
- State management (local, remote)
- Modules and reusability
- Workspaces and environments
- Data sources and outputs
- Dynamic blocks and meta-arguments
Advanced Topics:
- State locking and backends
- Import existing resources
- Terraform Cloud/Enterprise
- Policy as Code (Sentinel, OPA)
- Testing with Terratest
- Multi-cloud strategies
Best Practices:
- Directory structure conventions
- Variable management
- Secret handling
- Version pinning
- CI/CD integration
- Drift detection
Key Features:
- Templates (JSON/YAML)
- Stacks and StackSets
- Change sets and drift detection
- Nested stacks
- Custom resources with Lambda
- CloudFormation Registry
ARM Templates:
- Template structure
- Parameters and variables
- Linked and nested templates
- Deployment modes
Bicep:
- Simplified syntax
- Modules and registry
- Converting ARM to Bicep
- VS Code integration
Concepts:
- Configuration files (YAML/Python)
- Templates and schemas
- Composite types
- Deployment lifecycle
Features:
- Multi-language support (Python, TypeScript, Go)
- State management
- Pulumi Cloud
- Policy as Code
Configuration Management:
- Playbooks and roles
- Inventory management
- Variables and facts
- Handlers and notifications
- Ansible Vault for secrets
- Galaxy for community roles
Use Cases:
- Server configuration
- Application deployment
- Orchestration tasks
- Cloud provisioning
Fundamentals:
- Container vs. VM architecture
- Docker engine and daemon
- Dockerfile best practices
- Multi-stage builds
- Image layers and caching
- Docker Compose for multi-container apps
- Volume and network management
- Docker registry operations
Security:
- Image scanning (Trivy, Clair)
- Rootless containers
- Security profiles (AppArmor, SELinux)
- Content trust and signing
- Minimal base images (Alpine, distroless)
Core Concepts:
- Architecture (control plane, nodes)
- Pods, ReplicaSets, Deployments
- Services and Ingress
- ConfigMaps and Secrets
- Persistent Volumes (PV/PVC)
- Namespaces and RBAC
- StatefulSets and DaemonSets
- Jobs and CronJobs
Advanced Topics:
- Helm charts and releases
- Operators and CRDs
- Network policies
- Resource quotas and limits
- HPA and VPA autoscaling
- Service mesh (Istio, Linkerd)
- GitOps with ArgoCD/Flux
- Kubernetes Federation
Tools:
- kubectl and kustomize
- Helm package manager
- K9s terminal UI
- Lens IDE
- kubeadm for cluster setup
- Rancher for management
Distributions:
- Amazon EKS
- Azure AKS
- Google GKE
- Red Hat OpenShift
- Rancher Kubernetes Engine (RKE)
- K3s for edge/IoT
Core Features:
- Declarative vs. Scripted pipelines
- Jenkinsfile syntax
- Agent configuration
- Shared libraries
- Credentials management
- Plugin ecosystem
- Distributed builds
- Blue Ocean UI
Best Practices:
- Pipeline as Code
- Parameterized builds
- Artifacts management
- Build triggers
- Post-build actions
- Security hardening
Features:
- .gitlab-ci.yml configuration
- Runners (shared, group, project)
- Pipeline stages and jobs
- Variables and environments
- Container registry integration
- Auto DevOps
- Review apps
- Deployment strategies
Concepts:
- Workflows and events
- Jobs and steps
- Actions marketplace
- Self-hosted runners
- Matrix builds
- Caching strategies
- Secrets management
- Reusable workflows
Components:
- Azure Pipelines (YAML/Classic)
- Azure Repos (Git)
- Azure Artifacts
- Azure Test Plans
- Azure Boards (Agile)
Pipeline Features:
- Multi-stage pipelines
- Deployment groups
- Service connections
- Variable groups
- Release gates
Integration:
- CodeCommit for source
- CodeBuild for CI
- CodeDeploy for deployment
- Integration with third-party tools
- Cross-region deployments
Common Features:
- YAML-based configuration
- Docker layer caching
- Parallel execution
- Workflow orchestration
- Cloud and self-hosted options
Core Concepts:
- Time-series data model
- PromQL query language
- Exporters and instrumentation
- Service discovery
- Alerting rules
- Recording rules
- Federation
- Long-term storage solutions
Exporters:
- Node exporter (system metrics)
- Blackbox exporter (probing)
- Custom application exporters
- Third-party exporters
Features:
- Dashboard creation and management
- Data source integration
- Alerting and notifications
- Variables and templating
- Plugins ecosystem
- Grafana Loki for logs
- Grafana Tempo for traces
- Organizations and teams
Elasticsearch:
- Index management
- Search and aggregations
- Cluster configuration
- Shard management
Logstash:
- Input, filter, output plugins
- Grok patterns
- Data transformation
- Pipeline management
Kibana:
- Visualizations and dashboards
- Discover and search
- Canvas and lens
- Alerting and machine learning
Fluentd:
- Log collection and forwarding
- Plugin architecture
- Buffer and retry mechanisms
- Routing and filtering
Features:
- Agent deployment
- APM and distributed tracing
- Log management
- Infrastructure monitoring
- Synthetic monitoring
- Dashboards and alerts
- Integration catalog
Capabilities:
- Application performance monitoring
- Infrastructure monitoring
- Browser and mobile monitoring
- Synthetic monitoring
- Distributed tracing
- NRQL query language
Features:
- Log aggregation and analysis
- SPL (Search Processing Language)
- Dashboards and reports
- Alerts and automation
- Machine learning toolkit
Services:
- Metrics and alarms
- Logs and Insights
- Events and EventBridge
- Dashboards
- Synthetics
- ServiceLens for X-Ray integration
Components:
- Application Insights
- Log Analytics
- Metrics and alerts
- Workbooks
- Azure Monitor for containers
Tools:
- Cloud Monitoring
- Cloud Logging
- Cloud Trace
- Cloud Profiler
- Error Reporting
Use Cases:
- Server provisioning
- Application deployment
- Configuration drift management
- Ad-hoc task execution
Components:
- Chef Server
- Chef Workstation
- Chef Client
- Cookbooks and recipes
- InSpec for compliance
Features:
- Puppet DSL
- Manifest files
- Modules and classes
- Hiera for data separation
- PuppetDB
Architecture:
- Master-minion model
- Salt states
- Execution modules
- Event-driven automation
- Remote execution
Core Services:
- EC2, ECS, EKS (Compute)
- S3, EBS, EFS (Storage)
- VPC, Route 53, CloudFront (Networking)
- RDS, DynamoDB, ElastiCache (Databases)
- Lambda, API Gateway (Serverless)
- CloudFormation, CDK (IaC)
- IAM (Identity and Access)
- CloudWatch, X-Ray (Monitoring)
- Systems Manager, OpsWorks (Operations)
DevOps Tools:
- CodeCommit, CodeBuild, CodeDeploy, CodePipeline
- Elastic Beanstalk
- AWS Batch
- Step Functions
Core Services:
- Virtual Machines, App Service, AKS (Compute)
- Blob Storage, Managed Disks (Storage)
- Virtual Network, Load Balancer, CDN (Networking)
- SQL Database, Cosmos DB (Databases)
- Functions, Logic Apps (Serverless)
- ARM Templates, Bicep (IaC)
- Azure AD (Identity)
- Monitor, Application Insights (Monitoring)
DevOps Tools:
- Azure DevOps (Pipelines, Repos, Boards)
- Azure Resource Manager
- Azure Automation
Core Services:
- Compute Engine, GKE, Cloud Run (Compute)
- Cloud Storage, Persistent Disk (Storage)
- VPC, Cloud Load Balancing, Cloud CDN (Networking)
- Cloud SQL, Firestore, Bigtable (Databases)
- Cloud Functions, Cloud Run (Serverless)
- Deployment Manager, Terraform (IaC)
- Cloud IAM (Identity)
- Cloud Monitoring, Cloud Logging (Operations)
DevOps Tools:
- Cloud Build
- Container Registry
- Artifact Registry
Prerequisites:
- Recommended: AWS Solutions Architect Associate or Developer Associate
- 2+ years of hands-on AWS experience
Exam Details:
- Duration: 180 minutes
- Format: 75 multiple choice/multiple answer
- Cost: $300 USD
- Passing Score: 750/1000
- Validity: 3 years
Domain Breakdown:
-
SDLC Automation (22%)
- CI/CD pipelines with CodePipeline
- CodeBuild for compilation and testing
- CodeDeploy deployment strategies
- Blue/green and canary deployments
- Artifact management with S3 and CodeArtifact
-
Configuration Management and IaC (19%)
- CloudFormation templates and stacks
- AWS CDK (Cloud Development Kit)
- Systems Manager Parameter Store
- OpsWorks for configuration management
- Elastic Beanstalk for application deployment
-
Monitoring and Logging (15%)
- CloudWatch metrics, alarms, and dashboards
- CloudWatch Logs Insights
- X-Ray for distributed tracing
- CloudTrail for API auditing
- EventBridge for event-driven architecture
-
Policies and Standards Automation (10%)
- IAM policies and roles
- Service Control Policies (SCPs)
- AWS Config rules
- AWS Organizations
- Tagging strategies
-
Incident and Event Response (18%)
- Automated remediation with Lambda
- Systems Manager Automation
- AWS Backup strategies
- Disaster recovery planning
- Auto Scaling and health checks
-
Security and Compliance (16%)
- Secrets Manager and Parameter Store
- KMS encryption
- Security Hub and GuardDuty
- Inspector for vulnerability scanning
- WAF and Shield for protection
Study Resources:
- A Cloud Guru AWS DevOps Pro course
- Linux Academy hands-on labs
- AWS Documentation and whitepapers
- AWS Well-Architected Framework
- Practice exams on Tutorials Dojo
Key Services to Master:
- CloudFormation/CDK
- CodePipeline, CodeBuild, CodeDeploy
- ECS/EKS for containers
- Lambda for serverless
- CloudWatch and X-Ray
- Systems Manager
- IAM and security services
Prerequisites:
- Required: Azure Administrator Associate OR Azure Developer Associate
- Experience with Azure administration and development
Exam Details:
- Exam Code: AZ-400
- Duration: 180 minutes
- Format: 40-60 questions
- Cost: $165 USD
- Passing Score: 700/1000
- Validity: 1 year (annual renewal)
Domain Breakdown:
-
Configure processes and communications (10-15%)
- Agile and DevOps practices
- Azure Boards for work tracking
- Communication and collaboration tools
- Sprint planning and retrospectives
-
Design and implement source control (15-20%)
- Git workflow strategies
- Azure Repos
- Branch policies and pull requests
- Code review processes
- Git hooks and automation
-
Define and implement continuous integration (20-25%)
- Azure Pipelines YAML
- Build agents (Microsoft-hosted vs. self-hosted)
- Build triggers and schedules
- Artifact management
- Package management with Azure Artifacts
- Container registry integration
-
Define and implement continuous delivery (10-15%)
- Release pipelines
- Deployment patterns (blue/green, canary)
- Approval gates
- Deployment slots for App Service
- AKS deployment strategies
- Rollback procedures
-
Develop a security and compliance plan (10-15%)
- Azure Key Vault integration
- Service principals and managed identities
- Azure Policy for governance
- Secure DevOps practices
- Dependency scanning
- OWASP top 10
-
Implement an instrumentation strategy (10-15%)
- Application Insights
- Log Analytics workspaces
- KQL (Kusto Query Language)
- Alerts and action groups
- Availability tests
- Distributed tracing
Study Resources:
- Microsoft Learn learning paths
- Pluralsight Azure DevOps courses
- Azure DevOps Labs
- Microsoft Documentation
- Practice assessments on Microsoft Learn
Key Technologies:
- Azure DevOps (Pipelines, Repos, Boards, Artifacts)
- Azure Resource Manager and Bicep
- Azure Kubernetes Service
- Application Insights and Log Analytics
- Azure Key Vault
- Azure Policy and Blueprints
Recommended as prerequisite
Focus Areas:
- Azure identity and governance
- Storage management
- Virtual networking
- Monitoring and backup
- Compute resources
Prerequisites:
- Recommended: Associate Cloud Engineer certification
- 3+ years of industry experience with 1+ year on GCP
Exam Details:
- Duration: 120 minutes
- Format: 50-60 multiple choice
- Cost: $200 USD
- Validity: 2 years
Domain Breakdown:
-
Bootstrapping a Google Cloud organization (10%)
- Resource hierarchy (org, folders, projects)
- IAM policies and roles
- Billing accounts and budgets
- Organization policies
-
Building and implementing CI/CD pipelines (20%)
- Cloud Build pipelines
- Container Registry and Artifact Registry
- Binary Authorization
- Deployment strategies
- Integration with GitHub/GitLab/Bitbucket
-
Applying site reliability engineering practices (20%)
- SLI, SLO, SLA definitions
- Error budgets
- Toil reduction
- Blameless postmortems
- Capacity planning
-
Optimizing service performance (20%)
- GKE autoscaling (HPA, VPA, cluster autoscaler)
- Cloud CDN and load balancing
- Performance profiling with Cloud Profiler
- Caching strategies
- Database optimization
-
Managing service incidents (15%)
- Incident response procedures
- On-call rotations
- Escalation paths
- Alert configuration
- Debugging in production
-
Monitoring and logging (15%)
- Cloud Monitoring (formerly Stackdriver)
- Cloud Logging
- Log-based metrics
- Uptime checks
- Dashboards and SLO monitoring
Study Resources:
- Google Cloud Skills Boost (Qwiklabs)
- Coursera GCP specializations
- Official GCP documentation
- Practice exams
- SRE books (Google)
Key Services:
- Cloud Build
- GKE (Google Kubernetes Engine)
- Cloud Run
- Cloud Functions
- Cloud Monitoring and Logging
- Cloud Trace and Profiler
- Deployment Manager
- Binary Authorization
Exam Details:
- Duration: 2 hours
- Format: Performance-based (hands-on)
- Cost: $395 USD (includes one free retake)
- Validity: 3 years
Domains:
- Cluster Architecture, Installation & Configuration (25%)
- Workloads & Scheduling (15%)
- Services & Networking (20%)
- Storage (10%)
- Troubleshooting (30%)
Preparation:
- Killer.sh practice exams
- Kubernetes documentation practice
- KodeKloud CKA course
- Hands-on lab environment
Focus:
- Application deployment
- Container design patterns
- Multi-container pods
- ConfigMaps and Secrets
- Observability
- Services and networking
Prerequisites:
- Valid CKA certification required
Focus:
- Cluster setup and hardening
- System hardening
- Minimize microservice vulnerabilities
- Supply chain security
- Monitoring, logging, and runtime security
Exam Details:
- Duration: 60 minutes
- Format: 57 multiple choice
- Cost: $70.50 USD
- Validity: 2 years
Topics:
- IaC concepts
- Terraform purpose and workflow
- Terraform basics (init, plan, apply, destroy)
- Terraform modules
- State management
- Terraform Cloud and Enterprise
Study Resources:
- HashiCorp Learn platform
- Terraform documentation
- Practice labs
- Sample questions
Focus:
- System administration
- Ansible automation
- Bash scripting
- Performance tuning
Performance-based exam covering:
- Essential commands
- File system management
- Networking
- User management
- Service configuration
Topics:
- Observability concepts
- Prometheus fundamentals
- PromQL
- Instrumentation
- Alerting
Coverage:
- Dashboard creation
- Data sources
- Alerting
- User management
Month 1-2: Foundations
- Linux fundamentals
- Networking basics
- Git and version control
- Python/Bash scripting basics
Month 3-4: Core DevOps Tools
- Docker containerization
- CI/CD concepts with Jenkins or GitLab CI
- Basic AWS/Azure/GCP services
- Terraform basics
Month 5-6: First Certification
- Choose: AWS Solutions Architect Associate, Azure Administrator, or GCP Associate Cloud Engineer
- Build small projects
- Practice hands-on labs
Month 6-9:
- Kubernetes fundamentals
- Advanced CI/CD patterns
- Monitoring with Prometheus/Grafana
- Configuration management (Ansible)
Month 10-12:
- Kubernetes deep dive
- Infrastructure as Code mastery
- Cloud-native architecture
- CKA or CKAD certification
Month 13-18:
- SRE principles
- Advanced monitoring and observability
- Security best practices
- Cloud DevOps Professional certification
Focus Areas:
- Multi-cloud strategies
- Service mesh implementation
- GitOps workflows
- Platform engineering
- Chaos engineering
- FinOps practices
Certifications:
- CKS (Kubernetes Security)
- Multiple cloud DevOps Professional certs
- Specialized certifications (Prometheus, Istio, etc.)
Junior DevOps Engineer
- Required: Linux basics, scripting, version control
- Salary Range: $60,000 - $85,000
- Certifications: Cloud Associate level
Build/Release Engineer
- Required: CI/CD tools, build systems, scripting
- Salary Range: $65,000 - $90,000
DevOps Engineer
- Required: IaC, containers, CI/CD, cloud platforms
- Salary Range: $90,000 - $140,000
- Certifications: Professional level cloud certs, CKA
Site Reliability Engineer
- Required: Programming, system design, monitoring
- Salary Range: $100,000 - $150,000
- Certifications: Cloud Professional, CKA
Cloud Engineer
- Required: Multi-cloud expertise, architecture
- Salary Range: $95,000 - $145,000
Senior DevOps/SRE Engineer
- Required: Architecture, mentoring, automation at scale
- Salary Range: $130,000 - $180,000
- Certifications: Multiple Professional certs, CKS
DevOps Architect
- Required: Enterprise architecture, strategy, leadership
- Salary Range: $140,000 - $200,000
Platform Engineer
- Required: Internal tooling, developer experience
- Salary Range: $135,000 - $190,000
Staff/Principal SRE
- Required: Technical leadership, system design at scale
- Salary Range: $160,000 - $250,000+
Engineering Manager - DevOps/SRE
- Required: People management, strategy, technical depth
- Salary Range: $150,000 - $220,000
VP of Engineering/Infrastructure
- Required: Executive leadership, business acumen
- Salary Range: $200,000 - $400,000+
-
Version Control Everything
- All infrastructure code in Git
- Meaningful commit messages
- Code review process
-
Modular and Reusable
- Create modules for common patterns
- DRY principle
- Parameterize configurations
-
Testing
- Unit tests for modules
- Integration tests for deployments
- Policy validation
-
Documentation
- README for each module
- Architecture diagrams
- Runbooks for operations
-
Pipeline as Code
- Version controlled pipelines
- Declarative configuration
- Reusable pipeline templates
-
Fast Feedback
- Parallel execution where possible
- Fail fast on errors
- Clear error messages
-
Security Integration
- SAST/DAST scanning
- Dependency vulnerability checks
- Secret scanning
- Container image scanning
-
Artifact Management
- Immutable artifacts
- Versioning strategy
- Retention policies
-
The Four Golden Signals
- Latency
- Traffic
- Errors
- Saturation
-
Alerting Philosophy
- Alert on symptoms, not causes
- Actionable alerts only
- Avoid alert fatigue
- Clear escalation paths
-
Observability
- Metrics, logs, and traces
- Distributed tracing
- Structured logging
- Correlation IDs
-
SLOs and Error Budgets
- Define SLIs based on user experience
- Set realistic SLOs
- Use error budgets for decision making
-
Incident Management
- Clear incident response procedures
- Blameless postmortems
- Action items tracked and completed
-
Capacity Planning
- Monitor resource utilization
- Plan for growth
- Load testing
-
Toil Reduction
- Automate repetitive tasks
- Measure and track toil
- Prioritize automation work
-
Shift Left Security
- Security in CI/CD pipeline
- Early vulnerability detection
- Developer security training
-
Secrets Management
- Never commit secrets to Git
- Use secret management tools
- Rotate secrets regularly
- Least privilege access
-
Compliance
- Audit logging
- Policy as Code
- Regular security assessments
-
The Phoenix Project - Gene Kim
- DevOps principles through narrative
- Three ways of DevOps
-
The DevOps Handbook - Gene Kim, Jez Humble
- Practical DevOps implementation
- Case studies and patterns
-
Site Reliability Engineering - Google
- SRE principles from Google
- Error budgets and monitoring
-
The Site Reliability Workbook - Google
- Practical SRE implementation
- Real-world examples
-
Accelerate - Nicole Forsgren, Jez Humble
- Research-backed DevOps metrics
- High-performing teams
-
Continuous Delivery - Jez Humble, David Farley
- Deployment pipeline patterns
- Release strategies
-
Infrastructure as Code - Kief Morris
- IaC patterns and practices
- Tool-agnostic principles
-
Kubernetes Patterns - Bilgin Ibryam, Roland Huß
- Container design patterns
- Best practices for K8s
Blogs and Websites:
- AWS Blog (DevOps section)
- Azure DevOps Blog
- Google Cloud Blog
- HashiCorp Blog
- CNCF Blog
- DevOps.com
- SRE Weekly newsletter
Practice Platforms:
- Katacoda (interactive scenarios)
- Kubernetes.io tutorials
- AWS Workshops
- Azure Citadel
- Google Cloud Skills Boost
- TerraformTutorials.com
Communities:
- DevOps subreddit
- SRE subreddit
- CNCF Slack
- Kubernetes Slack
- HashiCorp Community Forum
- Stack Overflow
This roadmap provides a comprehensive guide for DevOps Engineers and Site Reliability Engineers to build expertise in automation, CI/CD, monitoring, and Infrastructure as Code. The certification paths for AWS, Azure, and GCP offer structured learning objectives and industry recognition.
Key Takeaways:
- Start with fundamentals: Linux, networking, and programming form the foundation
- Master core tools: Docker, Kubernetes, Terraform, and CI/CD platforms
- Choose a cloud platform: Deep dive into AWS, Azure, or GCP
- Implement observability: Monitoring, logging, and tracing are critical
- Embrace automation: Infrastructure as Code and CI/CD pipelines
- Practice SRE principles: SLOs, error budgets, and incident response
- Continuous learning: Technology evolves rapidly; stay current
Success Metrics:
- Reduced deployment time
- Increased deployment frequency
- Lower change failure rate
- Faster mean time to recovery (MTTR)
- Improved system reliability
- Enhanced security posture
The journey from junior engineer to senior SRE/DevOps architect takes dedication, hands-on practice, and continuous learning. Use this roadmap as a guide, but adapt it to your specific interests and career goals. Focus on practical experience alongside certifications, and contribute to open-source projects to build your portfolio.
Remember: DevOps and SRE are not just about tools and technologies - they're about culture, collaboration, and continuous improvement. Technical skills will get you started, but soft skills like communication, problem-solving, and teamwork will accelerate your career growth.