AutoSRE

Self-healing cloud operations agent — detect, diagnose, remediate, and report incidents using AI reasoning and automation.

Overview

AutoSRE is an agentic AI system that closes the loop from alert to recovery: it consumes incidents (simulated or from CloudWatch), reasons over logs and deployment history with Amazon Nova, plans corrective actions, executes them via UI automation (Nova Act) or direct AWS APIs, verifies recovery, and publishes a post-mortem to Slack.

Goal: Reduce mean time to recovery (MTTR) from tens of minutes to under two minutes for common failure modes.

Step	What AutoSRE does
Detect	Ingest one incident (simulated or CloudWatch alarm).
Analyze	Root-cause analysis with Nova over logs and deployment history → diagnosis and recommended action.
Plan	Map diagnosis to concrete actions (e.g. rollback, restart, scale up).
Execute	Run actions on an operations dashboard (Nova Act) or via AWS (e.g. Lambda alias rollback).
Verify	Poll health endpoint or CloudWatch until healthy or timeout.
Report	Publish a post-mortem to Slack (when configured).

Features

Dual run modes — Simulated incidents + demo dashboard, or real AWS (CloudWatch alarms + Lambda rollback).
Pluggable reasoning — Amazon Nova (Bedrock) for root-cause analysis, or stub for CI/local runs without AWS.
UI and API remediation — Nova Act for browser-based dashboards, or direct boto3 for Lambda alias rollback.
Recovery verification — Configurable health polling with timeout and recovery-time tracking.
Slack post-mortems — Block Kit reports with timeline, root cause, and action taken.
Optional persistence — File-backed incident/log storage for RCA context.

Requirements

Python 3.11+ (3.12 recommended)
Optional: AWS account and Bedrock access for Nova reasoning
Optional: Slack app (Bot Token + channel) for post-mortems
Optional: Nova Act API key for real UI automation (otherwise stub)

Quick Start

# Clone and enter project
git clone <repository-url>
cd AutoSRE

# Virtual environment
python -m venv .venv
.venv\Scripts\activate   # Windows
# source .venv/bin/activate   # macOS / Linux

# Install (editable + dev tools)
pip install -e ".[dev]"

# Configure (copy and edit)
cp .env.example .env
# Set OPERATIONS_DASHBOARD_URL and optionally AWS/Slack

# Start demo dashboard (separate terminal)
cd dashboard && python -m uvicorn app:app --host 127.0.0.1 --port 3000

# Run one cycle (simulated incident → stub reasoning → stub UI → verify → report)
autosre

Deterministic demo (fixed incident ID, optional narrative):

autosre --demo

Exit code: 0 on successful recovery, 1 on failure or escalation.

Installation

Method	Command
Editable (recommended)	`pip install -e .`
With dev dependencies	`pip install -e ".[dev]"`

Run either command from the repo root so that the autosre CLI and python -m autosre both work. The CLI is registered as autosre; run autosre --version to confirm.

Configuration

All configuration is via environment variables (and optionally a .env file in the project root). Copy .env.example to .env and adjust.

Essential

Variable	Description	Default
`OPERATIONS_DASHBOARD_URL`	Base URL of the operations dashboard	`http://localhost:3000`
`REASONING_USE_BEDROCK`	Use Amazon Nova for RCA (`true`/`false`)	`false`
`UI_STUB`	Stub UI automation only (`true`/`false`)	`true`

AWS (Bedrock / real integration)

Variable	Description	Default
`AWS_REGION`	AWS region	`us-east-1`
`NOVA_MODEL_ID`	Bedrock model for reasoning	`us.amazon.nova-2-lite-v1:0`
`USE_AWS_INTEGRATION`	Use CloudWatch + Lambda instead of dashboard	`false`
`CLOUDWATCH_ALARM_NAMES`	Comma-separated alarm names	—
`LAMBDA_FUNCTION_NAME`	Lambda name for rollback	—
`LAMBDA_ALIAS_NAME`	Alias to roll back (e.g. `live`)	`live`
`LAMBDA_LOG_GROUP_NAME`	Log group for RCA (optional)	—

Slack

Variable	Description
`SLACK_BOT_TOKEN`	Bot token (e.g. `xoxb-...`)
`SLACK_CHANNEL_ID`	Channel ID (e.g. `C...`)
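
As an illustration of the Block Kit post-mortems listed under Features, the sketch below builds a request body for Slack's chat.postMessage method from a timeline, root cause, and action taken. The function name `build_postmortem_payload`, the block layout, and the sample texts are assumptions for illustration, not AutoSRE's actual report format.

```python
import json
import os


def build_postmortem_payload(channel, incident_id, root_cause, action, timeline):
    """Return a chat.postMessage body with Block Kit blocks (hypothetical layout)."""
    timeline_text = "\n".join(f"- {when}: {what}" for when, what in timeline)

    def section(title, body):
        # One mrkdwn section block per report field.
        return {"type": "section", "text": {"type": "mrkdwn", "text": f"*{title}*\n{body}"}}

    return {
        "channel": channel,
        # Plain-text fallback shown in notifications.
        "text": f"Post-mortem for {incident_id}",
        "blocks": [
            {"type": "header", "text": {"type": "plain_text", "text": f"Post-mortem: {incident_id}"}},
            section("Root cause", root_cause),
            section("Action taken", action),
            section("Timeline", timeline_text),
        ],
    }


payload = build_postmortem_payload(
    channel=os.environ.get("SLACK_CHANNEL_ID", "C0000000000"),  # placeholder channel ID
    incident_id="inc-demo0001",
    root_cause="Latency regression after the latest deployment (example text).",
    action="Rolled back to the previous version.",
    timeline=[("12:00:05", "Incident detected"), ("12:01:10", "Service healthy")],
)
print(json.dumps(payload, indent=2))
```

Sending the payload would additionally require an authenticated POST using SLACK_BOT_TOKEN, which is omitted here.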

Other

Variable	Description	Default
`METRICS_URL`	Health URL for verification (empty = dashboard + `/api/health`)	—
`LOG_STORAGE_DATA_DIR`	Directory for incident/log persistence	—
`REASONING_MAX_RETRIES`	Retries for reasoning agent	`2`
`RECOVERY_VERIFY_TIMEOUT_SECONDS`	Maximum time to wait for a healthy status, in seconds	`120.0`
`NOVA_ACT_API_KEY`	API key for Nova Act (when not using stub)	—

See .env.example for the full list and comments.
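
To make the defaults above concrete, here is a minimal sketch that reads a few of these variables using only the standard library. The project structure lists config.py as Pydantic settings, so this is not the project's implementation; the `Settings` dataclass, the `load_settings` helper, and the boolean parsing rule are hypothetical.

```python
import os
from dataclasses import dataclass


def _as_bool(value: str) -> bool:
    """Treat 'true', '1', and 'yes' (any case) as True; this rule is an assumption."""
    return value.strip().lower() in {"true", "1", "yes"}


@dataclass(frozen=True)
class Settings:
    operations_dashboard_url: str
    reasoning_use_bedrock: bool
    ui_stub: bool
    reasoning_max_retries: int
    recovery_verify_timeout_seconds: float
    metrics_url: str


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (os.environ by default) using the documented defaults."""
    env = os.environ if env is None else env
    dashboard = env.get("OPERATIONS_DASHBOARD_URL", "http://localhost:3000").rstrip("/")
    return Settings(
        operations_dashboard_url=dashboard,
        reasoning_use_bedrock=_as_bool(env.get("REASONING_USE_BEDROCK", "false")),
        ui_stub=_as_bool(env.get("UI_STUB", "true")),
        reasoning_max_retries=int(env.get("REASONING_MAX_RETRIES", "2")),
        recovery_verify_timeout_seconds=float(env.get("RECOVERY_VERIFY_TIMEOUT_SECONDS", "120.0")),
        # An empty METRICS_URL falls back to the dashboard health endpoint.
        metrics_url=env.get("METRICS_URL") or f"{dashboard}/api/health",
    )


settings = load_settings({"UI_STUB": "false", "REASONING_MAX_RETRIES": "3"})
print(settings)
```

Passing a plain dict instead of os.environ keeps the loader easy to exercise in tests without touching the real environment.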

Usage

CLI

Command	Description
`autosre`	One cycle: default incident type `latency_spike`, generated ID.
`autosre --demo`	Deterministic run with incident `inc-demo0001`; optional `demo_narrative.txt` in cwd.
`autosre --incident-type <type>`	One cycle with given type: `latency_spike`, `crash_loop`, `memory_leak`, `deployment_failure`.
`autosre --version`	Print version.

Demo narrative (optional)

For autosre --demo, create demo_narrative.txt in the current directory (gitignored) with [section] blocks:

[intro] — Title or intro line
[scenario] — Scenario description
[dashboard_note] — Shown if dashboard health check fails
[running] — Text while workflow runs
[success] / [failure] — Result messages

If the file is missing, minimal default text is used.
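
The exact file syntax is not specified beyond the [section] markers, so the following is only a sketch of how such a file could be parsed. It assumes each section starts with a line containing just `[name]` and runs until the next header; `parse_narrative` and the fallback texts in `DEFAULTS` are hypothetical, not the tool's real defaults.

```python
import re

SECTION_HEADER = re.compile(r"^\[(\w+)\]$")

# Hypothetical fallback text; the tool's real defaults are not documented here.
DEFAULTS = {"intro": "AutoSRE demo", "success": "Recovered.", "failure": "Not recovered."}


def parse_narrative(text: str) -> dict:
    """Split text into {section_name: body}; lines before the first header are ignored."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = SECTION_HEADER.match(line.strip())
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    parsed = {name: "\n".join(lines).strip() for name, lines in sections.items()}
    # Sections found in the file override the fallback text.
    return {**DEFAULTS, **parsed}


sample = """[intro]
Checkout latency demo

[scenario]
A bad deploy slows the checkout service.
"""
narrative = parse_narrative(sample)
print(narrative["intro"])
print(narrative["success"])
```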

Operations dashboard (demo target)

The included dashboard is a demo UI for the agent (or manual testing): login, list services, open a service, and trigger a rollback. It is not suitable for production.

Security: The dashboard has no real authentication (any credentials accepted) and no authorization on API routes. Use only locally; do not expose to the internet.

Start:

cd dashboard
python -m uvicorn app:app --host 127.0.0.1 --port 3000

Then open http://localhost:3000. Flow: Login → Services → Service detail → Deployments tab → Rollback.
GET /api/health returns {"status": "degraded"} until a rollback is executed, then {"status": "healthy"}.

Method	Path	Description
GET	`/`, `/services`, `/services/{id}`	Static pages (login, services, service detail).
POST	`/api/login`	Demo login (redirects to `/services`).
GET	`/api/services`, `/api/services/{id}`, `/api/services/{id}/deployments`	JSON data.
POST	`/api/services/{id}/rollback`	Body: `{"to_version": "v1.4.1"}`.
GET	`/api/health`	`{"status": "degraded" | "healthy"}`.
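
For manual testing, the routes above can be driven from a short script. The sketch below builds the rollback request and polls the health endpoint until it reports healthy. The helper names and the service id "checkout" are made up for illustration, and the polling interval is an assumption; only the paths and JSON shapes come from the table.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:3000"  # default OPERATIONS_DASHBOARD_URL


def rollback_request(service_id, to_version, base_url=BASE_URL):
    """Build the POST /api/services/{id}/rollback request without sending it."""
    body = json.dumps({"to_version": to_version}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/services/{service_id}/rollback",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_health(base_url=BASE_URL):
    """Return the "status" field of GET /api/health (needs a running dashboard)."""
    with urllib.request.urlopen(f"{base_url}/api/health", timeout=5) as response:
        return json.load(response)["status"]


def wait_until_healthy(fetch=fetch_health, timeout=120.0, interval=2.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll fetch() until it returns "healthy"; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        try:
            if fetch() == "healthy":
                return clock() - start
        except OSError:
            pass  # dashboard unreachable; keep polling until the deadline
        if clock() - start + interval > timeout:
            return None
        sleep(interval)


req = rollback_request("checkout", "v1.4.1")  # "checkout" is a made-up service id
print(req.get_method(), req.full_url)
# With the dashboard running:
#   urllib.request.urlopen(req)
#   print(wait_until_healthy())
```

The `fetch`, `sleep`, and `clock` parameters are injectable so the polling logic can be checked without a live server.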

Real AWS integration

Set USE_AWS_INTEGRATION=true and configure:

Incidents — CloudWatch alarms (optionally filtered by CLOUDWATCH_ALARM_NAMES) in ALARM state.
Logs — CloudWatch Logs from LAMBDA_LOG_GROUP_NAME (or default /aws/lambda/<LAMBDA_FUNCTION_NAME>) for the incident window.
Remediation — Lambda alias rollback: alias (e.g. live) is pointed to the previous published version.
Verification — Poll CloudWatch until alarm(s) return to OK or timeout.

Minimal IAM permissions: cloudwatch:DescribeAlarms, logs:FilterLogEvents, lambda:GetAlias, lambda:ListVersionsByFunction, lambda:UpdateAlias.

Setup: Deploy a Lambda with at least two versions and an alias; create a CloudWatch alarm (e.g. on Errors); set the env vars above; run autosre.
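
The alias rollback described above could look roughly like the sketch below, which uses the three Lambda calls named in the IAM list. It is not the project's executor: `rollback_alias` and `previous_version` are hypothetical helpers, and "previous published version" is assumed to mean the highest numbered version below the one the alias currently points to. A small in-memory fake stands in for `boto3.client("lambda")` so the example runs without AWS.

```python
from typing import Optional


def previous_version(current: str, versions: list) -> Optional[str]:
    """Return the highest numbered version below `current`, or None if there is none."""
    if not current.isdigit():  # e.g. the alias points at $LATEST
        return None
    older = sorted(int(v) for v in versions if v.isdigit() and int(v) < int(current))
    return str(older[-1]) if older else None


def rollback_alias(client, function_name: str, alias_name: str = "live") -> Optional[str]:
    """Point the alias at the previous published version; return it, or None if unchanged."""
    current = client.get_alias(FunctionName=function_name, Name=alias_name)["FunctionVersion"]
    versions, marker = [], None
    while True:  # ListVersionsByFunction is paginated via Marker / NextMarker
        kwargs = {"FunctionName": function_name}
        if marker:
            kwargs["Marker"] = marker
        page = client.list_versions_by_function(**kwargs)
        versions += [v["Version"] for v in page["Versions"]]
        marker = page.get("NextMarker")
        if not marker:
            break
    target = previous_version(current, versions)
    if target is not None:
        client.update_alias(FunctionName=function_name, Name=alias_name, FunctionVersion=target)
    return target


class FakeLambda:
    """In-memory stand-in for boto3.client("lambda"), for illustration only."""

    def __init__(self):
        self.alias_version = "3"

    def get_alias(self, FunctionName, Name):
        return {"FunctionVersion": self.alias_version}

    def list_versions_by_function(self, FunctionName, Marker=None):
        return {"Versions": [{"Version": v} for v in ("$LATEST", "1", "2", "3")]}

    def update_alias(self, FunctionName, Name, FunctionVersion):
        self.alias_version = FunctionVersion
        return {"FunctionVersion": FunctionVersion}


fake = FakeLambda()
print(rollback_alias(fake, "my-function"))  # prints 2
# Against AWS: rollback_alias(boto3.client("lambda"), LAMBDA_FUNCTION_NAME, LAMBDA_ALIAS_NAME)
```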

Workflow (single run)

Detect — One incident from stream (simulated or CloudWatch).
Store — Record incident in LogStore; load logs and deployment history for the service.
Analyze — Reasoning agent (Nova or stub) → Diagnosis (summary, confidence, recommended action). Retries on failure; fallback to escalate.
Plan — Planner → list of PlannedAction (e.g. navigate, click_rollback). Escalate → empty list.
Execute — UI agent (stub or Nova Act) or AWS executor runs actions.
Verify — Recovery monitor polls health until healthy or RECOVERY_VERIFY_TIMEOUT_SECONDS.
Report — Post-mortem to Slack (if token and channel set). Success only when status is recovered.
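
The steps above can be sketched as a single function with the reasoning, execution, verification, and reporting stages passed in as callables. This is a simplified illustration rather than the contents of workflow.py: the `Diagnosis` field names, the `playbook` mapping, and the "escalated" and "failed" status strings are assumptions, while the retry-then-escalate behavior and the empty plan on escalation follow the description.

```python
from dataclasses import dataclass


@dataclass
class Diagnosis:
    summary: str
    confidence: float
    recommended_action: str  # e.g. "rollback" or "escalate"


def analyze_with_retries(analyze, incident, max_retries=2):
    """Call analyze(incident); if every attempt raises, fall back to escalation."""
    for _ in range(max_retries + 1):  # first attempt plus max_retries retries
        try:
            return analyze(incident)
        except Exception:
            continue
    return Diagnosis("reasoning failed", 0.0, "escalate")


def plan(diagnosis):
    """Map a diagnosis to action names; escalation yields an empty plan."""
    playbook = {"rollback": ["navigate", "click_rollback"]}  # hypothetical mapping
    return playbook.get(diagnosis.recommended_action, [])


def run_cycle(incident, analyze, execute, verify, report):
    """Run one analyze-to-report cycle and return an exit code (0 = recovered)."""
    diagnosis = analyze_with_retries(analyze, incident)
    actions = plan(diagnosis)
    for action in actions:
        execute(action)
    recovered = bool(actions) and verify()
    if recovered:
        status = "recovered"
    else:
        status = "failed" if actions else "escalated"
    report(incident, diagnosis, status)
    return 0 if recovered else 1


events = []
code = run_cycle(
    incident={"id": "inc-demo0001", "type": "latency_spike"},
    analyze=lambda inc: Diagnosis("bad deploy", 0.9, "rollback"),
    execute=events.append,
    verify=lambda: True,
    report=lambda inc, diag, status: events.append(status),
)
print(code, events)  # 0 ['navigate', 'click_rollback', 'recovered']
```

Injecting the stages as callables mirrors the stub-versus-real split described under Features: the same loop can run with stubs in CI or with Nova and AWS executors in a real environment.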

Project structure

AutoSRE/
├── dashboard/                 # Demo operations UI (FastAPI + static HTML)
│   ├── app.py
│   └── static/                # login, services, service detail
├── src/autosre/
│   ├── cli.py                 # Entry point: autosre, --demo, --incident-type
│   ├── config.py              # Pydantic settings from env
│   ├── models.py              # IncidentEvent, Diagnosis, PlannedAction, PostMortemReport, etc.
│   ├── workflow.py            # Closed loop: detect → analyze → plan → act → verify → report
│   ├── incident_detection/    # Simulated or CloudWatch incident stream
│   ├── log_storage/           # Incidents, logs, deployment history (optional file persistence)
│   ├── reasoning_agent/       # Root-cause analysis (Nova or stub)
│   ├── planner/               # Diagnosis → PlannedAction list
│   ├── ui_automation/         # Nova Act or stub
│   ├── remediation/           # AWS executor (Lambda rollback)
│   ├── recovery_verification/  # Health polling / CloudWatch
│   └── slack_reporter/        # Post-mortem to Slack
├── tests/
├── scripts/                   # e.g. CloudFormation demo
├── pyproject.toml
├── requirements.txt
└── .env.example

Development

Tests:

pytest tests/ -v

Lint / format:

ruff check src dashboard tests
ruff format --check src dashboard tests

CI (GitHub Actions): Runs on pushes and pull requests to main, and on pushes to phase-2-operations-dashboard. A matrix of Python 3.11 and 3.12 installs the project with pip install -e ".[dev]", runs pytest, then runs ruff check and ruff format --check.

License

MIT.
