MTTR (Mean Time to Recovery) Calculator

Overview

This tool calculates Mean Time to Recovery (MTTR), one of the four key DORA metrics. MTTR measures how quickly a team recovers from failures in production, which directly affects customer experience and system reliability.

Features

- Multiple incident management platform support:
  - PagerDuty
  - OpsGenie
  - AWS CloudWatch Alarms
- Severity-based analysis
- Service-level MTTR tracking
- Statistical analysis with percentiles
- Trend analysis over time
- Performance level classification

Prerequisites

Python 3.8+
API access to your incident management platform

Installation

pip install -r requirements.txt

Configuration

Create a config.yaml file:

# Incident source: pagerduty, opsgenie, cloudwatch
incident_source: pagerduty

time_range:
  start_date: "30d"
  end_date: "now"

# Optional: Filter by severity
severity_filter: ["P1", "P2"]

pagerduty:
  api_key: ${PAGERDUTY_TOKEN}
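
The relative values in time_range ("30d", "now") have to be turned into concrete timestamps before incidents can be queried. The following is a minimal sketch of one way to do that; the helper name resolve_time and the supported unit suffixes (m, h, d, w) are assumptions for illustration, not the tool's documented behavior.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed unit suffixes; the tool itself may accept a different set.
_UNITS = {"m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def resolve_time(value, now=None):
    """Hypothetical helper: map "now", "30d", or an ISO 8601 date to a UTC datetime."""
    now = now or datetime.now(timezone.utc)
    if value == "now":
        return now
    match = re.fullmatch(r"(\d+)([mhdw])", value)
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        # "30d" means "30 days before now"
        return now - timedelta(**{_UNITS[unit]: amount})
    parsed = datetime.fromisoformat(value)
    # Treat naive timestamps as UTC
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)
```

With this sketch, start_date: "30d" and end_date: "now" resolve to a window covering the last 30 days, and an explicit date such as "2024-01-01" is also accepted.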

Usage

# Generate report
python mttr_calculator.py

# Export as JSON
python mttr_calculator.py --output json --output-file mttr.json

# With custom config
python mttr_calculator.py --config production_config.yaml

Understanding MTTR

Calculation

MTTR = Sum of (Incident Resolved Time - Incident Created Time) / Number of Incidents
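
As an illustration of the calculation, here is a minimal sketch that averages recovery durations over a list of (created, resolved) timestamp pairs. The function name and the input shape are assumptions made for this example, not the tool's internal API.

```python
from datetime import datetime


def mean_time_to_recovery(incidents):
    """Return the mean recovery time in minutes, or None if there are no incidents.

    `incidents` is an iterable of (created, resolved) datetime pairs.
    """
    durations = [
        (resolved - created).total_seconds() / 60
        for created, resolved in incidents
    ]
    if not durations:
        return None  # avoid dividing by zero when nothing matched
    return sum(durations) / len(durations)


incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),  # 30 minutes
    (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 10, 30)),   # 90 minutes
]
print(mean_time_to_recovery(incidents))  # 60.0
```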

Performance Levels

Level	MTTR
Elite	Less than one hour
High	Less than one day
Medium	Less than one week
Low	More than one week
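
The thresholds in the table can be expressed as a small lookup. This is a sketch only: the function name is hypothetical, and an MTTR of exactly one week is classified as Low here because the table does not say which side that boundary falls on.

```python
def classify_mttr(mttr_minutes):
    """Map a mean recovery time in minutes to a performance level."""
    hour = 60
    day = 24 * hour
    week = 7 * day
    if mttr_minutes < hour:
        return "Elite"
    if mttr_minutes < day:
        return "High"
    if mttr_minutes < week:
        return "Medium"
    return "Low"  # assumption: exactly one week counts as Low
```

For example, a mean of 145.3 minutes falls between one hour and one day, so it is classified as High.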

Key Metrics

Mean MTTR: Average recovery time
Median MTTR: Middle value (less affected by outliers)
P90/P95: 90th/95th percentile, the recovery time that 90%/95% of incidents fall at or below (highlights the slowest recoveries)
By Severity: Breakdown by incident priority
By Service: Identify problematic services
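
To make these metrics concrete, the sketch below computes them from a list of recovery durations in minutes. It uses the nearest-rank percentile definition, which is an assumption; the tool may interpolate differently. The function names and the dictionary keys ("severity", "minutes") are hypothetical.

```python
import statistics
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(durations):
    """Return mean, median, P90, and P95 for a non-empty list of durations."""
    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "p90": percentile(durations, 90),
        "p95": percentile(durations, 95),
    }


def mean_by(incidents, key):
    """Group incident dicts by `key` (e.g. "severity") and average their minutes."""
    groups = defaultdict(list)
    for incident in incidents:
        groups[incident[key]].append(incident["minutes"])
    return {name: statistics.mean(values) for name, values in groups.items()}
```

Calling mean_by with "service" instead of "severity" gives the per-service breakdown, provided each incident record carries that field.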

API Configuration

PagerDuty

Generate API key: PagerDuty API Access Keys
Required permissions: Read access to incidents
Set environment variable: export PAGERDUTY_TOKEN=your_token
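
To check that the token works before running the calculator, a request against the PagerDuty REST API v2 incidents endpoint can be built as in the sketch below. The function name is hypothetical and this is not necessarily how mttr_calculator.py queries PagerDuty; only the request is constructed here, so nothing is sent until it is passed to urllib.request.urlopen.

```python
import urllib.parse
import urllib.request

PAGERDUTY_INCIDENTS_URL = "https://api.pagerduty.com/incidents"


def build_incidents_request(token, since, until, limit=100):
    """Build a GET request for resolved incidents between two ISO 8601 timestamps."""
    query = urllib.parse.urlencode(
        [
            ("since", since),
            ("until", until),
            ("statuses[]", "resolved"),  # only resolved incidents have a recovery time
            ("limit", limit),
        ]
    )
    request = urllib.request.Request(f"{PAGERDUTY_INCIDENTS_URL}?{query}")
    # PagerDuty REST API keys are sent as "Token token=<key>"
    request.add_header("Authorization", f"Token token={token}")
    request.add_header("Accept", "application/vnd.pagerduty+json;version=2")
    return request
```

In practice the token would come from os.environ["PAGERDUTY_TOKEN"], matching the environment variable set above.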

OpsGenie

Create API key: Settings → API key management
Required permissions: Read access to alerts
Set environment variable: export OPSGENIE_TOKEN=your_token

AWS CloudWatch

Configure AWS CLI: aws configure

Required IAM permissions:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "cloudwatch:DescribeAlarmHistory",
      "cloudwatch:DescribeAlarms"
    ],
    "Resource": "*"
  }]
}

Output Example

MEAN TIME TO RECOVERY (MTTR) REPORT
==================================================

Period: 2024-01-01T00:00:00 to 2024-01-31T23:59:59

Summary:
  Total Incidents: 47
  Performance Level: High

MTTR Statistics:
  Mean: 145.3 minutes (2.4 hours)
  Median: 87.5 minutes (1.5 hours)
  Min: 5.2 minutes
  Max: 1440.7 minutes
  Std Dev: 234.1 minutes

MTTR by Severity:
┌─────────────┬─────────────┬───────┐
│ Severity    │ Avg Minutes │ Count │
├─────────────┼─────────────┼───────┤
│ P1-Critical │ 45.2        │ 8     │
│ P2-High     │ 132.7       │ 15    │
│ P3-Medium   │ 198.4       │ 24    │
└─────────────┴─────────────┴───────┘

Best Practices

Incident Classification: Ensure consistent severity classification
Automation: Integrate with your incident response workflow
Regular Reviews: Analyze MTTR trends in post-mortems
Service Ownership: Track MTTR by service and team
Runbooks: Create runbooks for common incidents to reduce MTTR

Integration Examples

Grafana Dashboard

# Average MTTR by severity
avg by (severity) (incident_recovery_time_minutes)

# MTTR trend
avg_over_time(incident_recovery_time_minutes[7d])

Slack Notification

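# Send weekly MTTR report (Mondays at 09:00)
# A crontab entry must be a single line; backslash continuations are not supported.
# Slack incoming webhooks expect a JSON body with a "text" field, so the piped output must have that shape.
0 9 * * MON python mttr_calculator.py --output json | curl -X POST -H 'Content-type: application/json' --data @- https://hooks.slack.com/services/YOUR/WEBHOOK/URL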
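
Slack incoming webhooks expect a JSON body containing a "text" field, so the report usually needs to be wrapped before it is posted. The sketch below shows one way to do that. The report keys (total_incidents, mean_minutes, performance_level) are assumptions for illustration; the actual JSON schema emitted by --output json may differ.

```python
import json


def slack_payload(report):
    """Wrap a report dict in the {"text": ...} body expected by Slack webhooks.

    The keys read from `report` are hypothetical; adjust them to the real schema.
    """
    text = (
        f"Weekly MTTR report: {report['total_incidents']} incidents, "
        f"mean {report['mean_minutes']:.1f} min, "
        f"level {report['performance_level']}"
    )
    return json.dumps({"text": text})
```

The returned string can be sent as the body of the POST request to the webhook URL.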
