Overview | Quick Start | Installation | Usage | Project Structure | How to Cite
AIOpsLab is a holistic framework for designing, developing, and evaluating autonomous AIOps agents, and for building reproducible, standardized, interoperable, and scalable benchmarks. AIOpsLab can deploy microservice cloud environments, inject faults, generate workloads, and export telemetry data, while orchestrating these components and providing interfaces for interacting with and evaluating agents.
Moreover, AIOpsLab provides a built-in benchmark suite with a set of problems to evaluate AIOps agents in an interactive environment. This suite can be easily extended to meet user-specific needs. See the problem list here.
- Python >= 3.11
- Additional requirements depend on the deployment option selected, which is explained in the next section
Recommended installation:
sudo apt install python3.11 python3.11-venv python3.11-dev python3-pip # poetry requires python >= 3.11
We recommend Poetry for managing dependencies. You can also use a standard pip install -e . to install the dependencies.
git clone --recurse-submodules <CLONE_PATH_TO_THE_REPO>
cd AIOpsLab
poetry env use python3.11
export PATH="$HOME/.local/bin:$PATH" # export poetry to PATH if needed
poetry install # -vvv for verbose output
poetry self add poetry-plugin-shell # installs poetry shell plugin
poetry shell
Choose either a) or b) to set up your cluster and then proceed to the next steps.
AIOpsLab can be run on a local simulated cluster using kind on your local machine. Please look at this README for a list of prerequisites.
# For x86 machines
kind create cluster --config kind/kind-config-x86.yaml
# For ARM machines
kind create cluster --config kind/kind-config-arm.yaml
If you're running into issues, consider building a Docker image for your machine by following this README. Please also open an issue.
After finishing cluster creation, proceed to the next "Update config.yml" step.
AIOpsLab supports any remote Kubernetes cluster that your kubectl context is set to, whether it's a cluster from a cloud provider or one you build yourself. We provide Ansible playbooks to set up clusters on providers like CloudLab and on our own machines. Follow this README to set up your own cluster, and then proceed to the next "Update config.yml" step.
cd aiopslab
cp config.yml.example config.yml
Update your config.yml so that k8s_host is the host name of the control plane node of your cluster. Update k8s_user to be your username on the control plane node. If you are using a kind cluster, your k8s_host should be kind. If you're running AIOpsLab on the cluster itself, your k8s_host should be localhost.
Human as the agent:
python3 cli.py
(aiopslab) $ start misconfig_app_hotel_res-detection-1 # or choose any problem you want to solve
# ... wait for the setup ...
(aiopslab) $ submit("Yes") # submit solution
Run GPT-4 baseline agent:
export OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
python3 clients/gpt.py # you can also change the problem to solve in the main() function
You can conveniently check the running status of the cluster using k9s or other cluster monitoring tools.
This section documents the modifications needed to run AIOpsLab with Azure OpenAI instead of standard OpenAI.
- Azure OpenAI Resource: You need an Azure OpenAI resource with a deployed model (e.g., GPT-4)
- Environment Configuration: Set up proper environment variables for Azure OpenAI authentication
Create a .env file in the project root with your Azure OpenAI credentials:
# Environment variables for AIOpsLab - Azure OpenAI Configuration
OPENAI_API_KEY="your-azure-openai-api-key"
OPENAI_API_BASE="https://your-resource-name.openai.azure.com/"
OPENAI_API_TYPE="azure"
OPENAI_API_VERSION="2025-01-01-preview"
AZURE_OPENAI_ENDPOINT="https://your-resource-name.openai.azure.com/"
AZURE_OPENAI_DEPLOYMENT_NAME="gpt-4"
Important: Replace the following with your actual values:
- your-azure-openai-api-key: Your Azure OpenAI API key
- your-resource-name: Your Azure OpenAI resource name
- gpt-4: Your actual deployment name from Azure OpenAI Studio
The following files have been modified to support Azure OpenAI:
- Added Azure OpenAI client support alongside standard OpenAI
- Automatic detection of Azure vs standard OpenAI based on environment variables
- Proper model name mapping for Azure deployments
- Fixed authentication to use OPENAI_API_KEY instead of OPENAI_KEY
- Added Azure OpenAI support for qualitative evaluation
- Proper error handling for Azure endpoints
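The detection and deployment-name mapping described above could look roughly like the following. This is a minimal sketch, not the code in the modified files: the function names `detect_provider` and `resolve_model` are hypothetical, and the rule "explicit OPENAI_API_TYPE wins, otherwise an Azure endpoint implies Azure" is an assumption.

```python
from typing import Mapping


def detect_provider(env: Mapping[str, str]) -> str:
    """Return "azure" or "openai" from environment-style settings."""
    # Assumption: OPENAI_API_TYPE="azure" or a non-empty AZURE_OPENAI_ENDPOINT
    # is treated as the signal for Azure; anything else is standard OpenAI.
    api_type = env.get("OPENAI_API_TYPE", "").strip().lower()
    if api_type == "azure" or env.get("AZURE_OPENAI_ENDPOINT"):
        return "azure"
    return "openai"


def resolve_model(env: Mapping[str, str], default_model: str = "gpt-4") -> str:
    """Use the Azure deployment name on Azure, else the plain model name."""
    if detect_provider(env) == "azure":
        return env.get("AZURE_OPENAI_DEPLOYMENT_NAME", default_model)
    return default_model
```

Passing `os.environ` as `env` would apply the same check to the variables loaded from the .env file.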
Added missing storage classes required for MongoDB pods:
# Created storage classes: geo-storage, profile-storage, rate-storage,
# recommendation-storage, reservation-storage, user-storage
# All pointing to openebs.io/local provisioner
Load environment variables in PowerShell:
Get-Content .env | ForEach-Object {
if($_ -match '^([^#][^=]+)=(.+)$') {
$name = $matches[1]; $value = $matches[2].Trim('"');
[Environment]::SetEnvironmentVariable($name, $value, 'Process')
}
}
- Set up environment variables (load .env file as shown above)
- Enable qualitative evaluation in config.yml: qualitative_eval: true
- Run agents normally:
# GPT Agent
poetry run python clients/gpt.py
# Flash Agent (single scenario)
poetry run python test_flash_single.py
# Flash Agent (all scenarios)
poetry run python clients/flash.py
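For shells other than PowerShell, the .env loading step can be sketched in Python. The snippet below reuses the same line pattern as the PowerShell loader (skip lines starting with `#`, split on the first `=`, trim surrounding double quotes). The helper names `parse_env` and `load_env` are hypothetical and not part of AIOpsLab.

```python
import os
import re
from pathlib import Path

# Same pattern as the PowerShell loader: the line must not start with '#',
# the name ends at the first '=', and the value must be non-empty.
_LINE = re.compile(r"^([^#][^=]+)=(.+)$")


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY="value" lines into a dict, ignoring comments and blanks."""
    values = {}
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            values[match.group(1).strip()] = match.group(2).strip().strip('"')
    return values


def load_env(path: str = ".env") -> dict[str, str]:
    """Read a .env file and export its entries into this process only."""
    values = parse_env(Path(path).read_text(encoding="utf-8"))
    os.environ.update(values)
    return values
```

As with the PowerShell version, the variables only live in the current process, so an agent would have to be started from the same Python process (or the loader called at the top of the client script).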
If pods get stuck in Pending state due to PVC issues:
kubectl apply -f fix-storage-classes.yaml
- Verify API key is correct and not expired
- Ensure deployment name matches your Azure OpenAI Studio deployment
- Check that API version is supported
Make sure to reload environment variables in each new PowerShell session:
# Load from .env file
Get-Content .env | ForEach-Object { if($_ -match '^([^#][^=]+)=(.+)$') { [Environment]::SetEnvironmentVariable($matches[1], $matches[2].Trim('"'), 'Process') } }
Save agent execution logs for analysis:
# Save output to file
poetry run python clients/gpt.py > gpt_output.log 2>&1
# Save with display (using Tee-Object)
poetry run python test_flash_single.py | Tee-Object -FilePath flash_output.log
Security Note: The .gitignore file has been updated to exclude all log files and environment files to prevent accidental commit of sensitive information.
This section documents all the changes and fixes needed to run AIOpsLab successfully on Windows with kind cluster and Azure OpenAI.
Before starting, ensure you have:
- Python 3.11+ installed
- Poetry installed and configured
- Docker Desktop running
- kind installed and available in PATH
- kubectl installed and available in PATH
- Helm installed and available in PATH
- Azure OpenAI resource with deployed model (e.g., GPT-4)
# Clone with submodules
git clone --recurse-submodules <repository-url>
cd AIOpsLab
# Install dependencies
poetry install
poetry shell
# Create kind cluster (choose based on your architecture)
kind create cluster --config kind/kind-config-x86.yaml # for x86
# OR
kind create cluster --config kind/kind-config-arm.yaml # for ARM
# Verify cluster is running
kubectl cluster-info
git submodule init
git submodule update
Install Helm and add repositories:
# Add required Helm repositories
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add openebs https://openebs.github.io/charts
helm repo update
# Install OpenEBS (required for storage)
helm install openebs openebs/openebs --namespace openebs --create-namespace
Install Prometheus monitoring stack:
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
Create and apply storage class fix:
# Apply the storage class fix
kubectl apply -f fix-storage-classes.yaml
The fix-storage-classes.yaml file creates the following storage classes:
- geo-storage
- profile-storage
- rate-storage
- recommendation-storage
- reservation-storage
- user-storage
All of them point to the openebs.io/local provisioner.
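If fix-storage-classes.yaml needs to be recreated, a manifest of this shape can be generated with a few lines of Python. This is a sketch under assumptions: it emits only the fields shown in this README (apiVersion, kind, metadata.name, provisioner), and the real file may set additional fields that are not documented here. The function names are hypothetical.

```python
STORAGE_CLASSES = [
    "geo-storage",
    "profile-storage",
    "rate-storage",
    "recommendation-storage",
    "reservation-storage",
    "user-storage",
]
PROVISIONER = "openebs.io/local"


def render_storage_class(name: str, provisioner: str = PROVISIONER) -> str:
    """Render one minimal StorageClass document as YAML text."""
    return (
        "apiVersion: storage.k8s.io/v1\n"
        "kind: StorageClass\n"
        "metadata:\n"
        f"  name: {name}\n"
        f"provisioner: {provisioner}\n"
    )


def render_all(names: list[str] = STORAGE_CLASSES) -> str:
    """Join all documents with the YAML document separator."""
    return "---\n".join(render_storage_class(n) for n in names)
```

Writing `render_all()` to a file and running `kubectl apply -f` on it would create the six classes; `kubectl get storageclass` then confirms they exist.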
# Deploy the hotel reservation microservices
kubectl apply -f aiopslab-applications/hotelReservation/kubernetes/
Create .env file with Azure OpenAI credentials:
# .env file content
OPENAI_API_KEY="your-azure-openai-api-key"
OPENAI_API_BASE="https://your-resource-name.openai.azure.com/"
OPENAI_API_TYPE="azure"
OPENAI_API_VERSION="2025-01-01-preview"
AZURE_OPENAI_ENDPOINT="https://your-resource-name.openai.azure.com/"
AZURE_OPENAI_DEPLOYMENT_NAME="gpt-4"
Load environment variables in PowerShell:
Get-Content .env | ForEach-Object {
if($_ -match '^([^#][^=]+)=(.+)$') {
$name = $matches[1]; $value = $matches[2].Trim('"');
[Environment]::SetEnvironmentVariable($name, $value, 'Process')
}
}
cd aiopslab
cp config.yml.example config.yml
Update config.yml:
- Set k8s_host: kind (for kind cluster)
- Set k8s_user: <your-username>
- Set qualitative_eval: true (to enable Azure OpenAI evaluation)
- Added detection for Azure vs standard OpenAI configuration
- Implemented proper Azure OpenAI client initialization
- Added support for deployment name mapping
- Fixed authentication to use correct environment variable (OPENAI_API_KEY)
- Added Azure OpenAI client support for evaluation system
- Enhanced error handling for Azure endpoints
Created missing storage classes required for MongoDB persistence:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: geo-storage
provisioner: openebs.io/local
# ... (similar for other storage classes)
Added patterns to exclude sensitive files:
# Environment files
.env*
!.env.example
# Log files
*.log
*_output.log
# Common secret patterns
*secret*
*key*
*token*
# Load environment first
Get-Content .env | ForEach-Object { if($_ -match '^([^#][^=]+)=(.+)$') { [Environment]::SetEnvironmentVariable($matches[1], $matches[2].Trim('"'), 'Process') } }
# Run GPT agent
poetry run python clients/gpt.py
# Run single Flash scenario
poetry run python test_flash_single.py
# Run all Flash scenarios
poetry run python clients/flash.py
Problem: MongoDB pods stuck in Pending due to missing storage classes
Solution: Apply storage class fix
kubectl apply -f fix-storage-classes.yaml
Problem: "Invalid API key" or "Resource not found"
Solution:
- Verify API key is correct and active
- Ensure deployment name matches Azure OpenAI Studio
- Check API version compatibility
Problem: Azure OpenAI credentials not recognized
Solution: Reload environment in each new PowerShell session:
Get-Content .env | ForEach-Object { if($_ -match '^([^#][^=]+)=(.+)$') { [Environment]::SetEnvironmentVariable($matches[1], $matches[2].Trim('"'), 'Process') } }
Problem: Application deployments fail due to missing charts
Solution: Initialize and update submodules:
git submodule init
git submodule update
Problem: Charts fail to install due to missing repositories
Solution: Add and update Helm repositories:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add openebs https://openebs.github.io/charts
helm repo update
Check cluster status:
# Check all pods are running
kubectl get pods --all-namespaces
# Check storage classes
kubectl get storageclass
# Check persistent volume claims
kubectl get pvc --all-namespaces
# Monitor pod status with watch
kubectl get pods -w
Check Helm deployments:
# List all Helm releases
helm list --all-namespaces
# Check Prometheus status
helm status prometheus -n monitoring
# Check OpenEBS status
helm status openebs -n openebs
Save agent execution logs:
# Save with output display
poetry run python clients/gpt.py | Tee-Object -FilePath gpt_output.log
# Save without display
poetry run python test_flash_single.py > flash_output.log 2>&1
- Never commit sensitive files: .env, log files, and API keys are excluded via .gitignore
- Check git status before commits: Verify no secrets are staged
- Use environment variables: Store all credentials in .env file
- Review logs before sharing: Use search tools to check for sensitive data
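One way to review a log before sharing it is a small scan for key-like strings. The sketch below is illustrative only: the pattern is a hypothetical heuristic (a name containing api_key, token, secret, or password followed by a long value) and will miss secrets in other shapes, so it does not replace a manual review.

```python
import re

# Hypothetical heuristic: NAME containing api_key/token/secret/password,
# then ':' or '=', an optional quote, and a value of 16+ key-like characters.
SECRET_RE = re.compile(
    r"((?:api[_-]?key|token|secret|password)\w*\s*[:=]\s*['\"]?)"
    r"([A-Za-z0-9_\-]{16,})",
    re.IGNORECASE,
)


def find_secret_lines(text: str) -> list[int]:
    """Return 1-based line numbers that look like they contain a secret."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if SECRET_RE.search(line)
    ]


def redact(text: str) -> str:
    """Replace the value part of every suspected secret with a placeholder."""
    return SECRET_RE.sub(r"\g<1>[REDACTED]", text)
```

Running `find_secret_lines` over gpt_output.log or flash_output.log before attaching them to an issue gives a quick first pass.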
AIOpsLab can be used in the following ways:
AIOpsLab makes it extremely easy to develop and evaluate your agents. You can onboard your agent to AIOpsLab in 3 simple steps:
- Create your agent: You are free to develop agents using any framework of your choice. The only requirements are:
  - Wrap your agent in a Python class, say Agent
  - Add an async method get_action to the class:
    # given the current state, returns the agent's action
    async def get_action(self, state: str) -> str:
        # <your agent's logic here>
- Register your agent with AIOpsLab: You can now register the agent with AIOpsLab's orchestrator. The orchestrator will manage the interaction between your agent and the environment:
    from aiopslab.orchestrator import Orchestrator
    agent = Agent() # create an instance of your agent
    orch = Orchestrator() # get AIOpsLab's orchestrator
    orch.register_agent(agent) # register your agent with AIOpsLab
- Evaluate your agent on a problem:
  - Initialize a problem: AIOpsLab provides a list of problems that you can evaluate your agent on. Find the list of available problems here or using orch.probs.get_problem_ids(). Now initialize a problem by its ID:
    problem_desc, instructs, apis = orch.init_problem("k8s_target_port-misconfig-mitigation-1")
  - Set agent context: Use the problem description, instructions, and APIs available to set context for your agent. (This step depends on your agent's design and is left to the user)
  - Start the problem: Start the problem by calling the start_problem method. You can specify the maximum number of steps too:
    import asyncio
    asyncio.run(orch.start_problem(max_steps=30))
This process will create a Session with the orchestrator, where the agent will solve the problem. The orchestrator will evaluate your agent's solution and provide results (stored under data/results/). You can use these to improve your agent.
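To make the agent contract concrete, here is a toy agent that satisfies the two requirements above (a class with an async get_action method). It is a sketch, not a useful agent: the `set_context` helper is hypothetical, and the action strings are placeholders. `submit("Yes")` mirrors the CLI example earlier in this README, while `exec_shell(...)` is an assumed action name; the real set of actions comes from the APIs returned by init_problem.

```python
import asyncio


class Agent:
    """Toy agent: look at the cluster once, then submit an answer."""

    def __init__(self) -> None:
        self.context = ""
        self.history: list[str] = []

    def set_context(self, problem_desc: str, instructions: str, apis: str) -> None:
        # Hypothetical helper: keep whatever init_problem returned.
        self.context = f"{problem_desc}\n{instructions}\n{apis}"

    async def get_action(self, state: str) -> str:
        # Record the observation, then pick the next action.
        self.history.append(state)
        if len(self.history) == 1:
            return 'exec_shell("kubectl get pods -A")'  # assumed action name
        return 'submit("Yes")'


# Wiring into AIOpsLab, following the steps above (needs a running cluster):
#   from aiopslab.orchestrator import Orchestrator
#   agent = Agent()
#   orch = Orchestrator()
#   orch.register_agent(agent)
#   problem_desc, instructs, apis = orch.init_problem("k8s_target_port-misconfig-mitigation-1")
#   agent.set_context(problem_desc, instructs, apis)
#   asyncio.run(orch.start_problem(max_steps=30))
```

The class can be exercised without a cluster by calling `asyncio.run(agent.get_action("some state"))`, which is handy for unit-testing the decision logic before a full run.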
AIOpsLab provides a default list of applications to evaluate agents for operations tasks. However, as a developer you can add new applications to AIOpsLab and design problems around them.
Note: for auto-deployment of some apps with K8S, we integrate Helm charts (you can also install with kubectl, as done for the HotelRes application). More on Helm here.
To add a new application to AIOpsLab with Helm, you need to:
- Add application metadata
  - Application metadata is a JSON object that describes the application.
  - Include any field such as the app's name, desc, namespace, etc.
  - We recommend also including a special Helm Config field, as follows:
    "Helm Config": {
        "release_name": "<name for the Helm release to deploy>",
        "chart_path": "<path to the Helm chart of the app>",
        "namespace": "<K8S namespace where app should be deployed>"
    }
    Note: The Helm Config is used by the orchestrator to auto-deploy your app when a problem associated with it is started.
    Note: The orchestrator will auto-provide all other fields as context to the agent for any problem associated with this app.
  Create a JSON file with this metadata and save it in the metadata directory. For example the social-network app: social-network.json
- Add application class
  Extend the base class in a new Python file in the apps directory:
    from aiopslab.service.apps.base import Application
    class MyApp(Application):
        def __init__(self):
            super().__init__("<path to app metadata JSON>")
  The Application class provides a base implementation for the application. You can override methods as needed and add new ones to suit your application's requirements, but the base class should suffice for most applications.
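The metadata file from the first step can also be produced and sanity-checked with a short script. This is a sketch under assumptions: the top-level key names "Name", "Desc", and "Namespace" are hypothetical (the README only says to include such fields), while the three Helm Config keys are the ones listed above. The helper names are not part of AIOpsLab.

```python
import json
from pathlib import Path

REQUIRED_HELM_KEYS = {"release_name", "chart_path", "namespace"}


def build_metadata(name: str, desc: str, namespace: str,
                   release_name: str, chart_path: str) -> dict:
    """Assemble an application metadata object with a Helm Config field."""
    return {
        "Name": name,            # hypothetical key name
        "Desc": desc,            # hypothetical key name
        "Namespace": namespace,  # hypothetical key name
        "Helm Config": {
            "release_name": release_name,
            "chart_path": chart_path,
            "namespace": namespace,
        },
    }


def validate_metadata(meta: dict) -> None:
    """Raise ValueError if a Helm Config is present but incomplete."""
    helm = meta.get("Helm Config")
    if helm is None:
        return  # the field is recommended, not required
    missing = REQUIRED_HELM_KEYS - helm.keys()
    if missing:
        raise ValueError(f"Helm Config is missing keys: {sorted(missing)}")


def save_metadata(meta: dict, path: str) -> Path:
    """Validate and write the metadata as pretty-printed JSON."""
    validate_metadata(meta)
    out = Path(path)
    out.write_text(json.dumps(meta, indent=4) + "\n", encoding="utf-8")
    return out
```

The saved path is what the application class would pass to `super().__init__(...)`.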
Similar to applications, AIOpsLab provides a default list of problems to evaluate agents. However, as a developer you can add new problems to AIOpsLab and design them around your applications.
Each problem in AIOpsLab has 5 components:
- Application: The application on which the problem is based.
- Task: The AIOps task that the agent needs to perform. Currently we support: Detection, Localization, Analysis, and Mitigation.
- Fault: The fault being introduced in the application.
- Workload: The workload that is generated for the application.
- Evaluator: The evaluator that checks the agent's performance.
To add a new problem to AIOpsLab, create a new Python file in the problems directory, as follows:
- Setup. Import your chosen application (say MyApp) and task (say LocalizationTask):
    from aiopslab.service.apps.myapp import MyApp
    from aiopslab.orchestrator.tasks.localization import LocalizationTask
- Define. To define a problem, create a class that inherits from your chosen Task, and defines 3 methods: start_workload, inject_fault, and eval:
    class MyProblem(LocalizationTask):
        def __init__(self):
            self.app = MyApp()
        def start_workload(self):
            # <your workload logic here>
        def inject_fault(self):
            # <your fault injection logic here>
        def eval(self, soln, trace, duration):
            # <your evaluation logic here>
- Register. Finally, add your problem to the orchestrator's registry here.
See a full example of a problem here.
Description of the three methods in detail:
- start_workload: Initiates the application's workload. Use your own generator or AIOpsLab's default, which is based on wrk2:
    from aiopslab.generators.workload.wrk import Wrk
    wrk = Wrk(rate=100, duration=10)
    wrk.start_workload(payload="<wrk payload script>", url="<app URL>")
  Relevant Code: aiopslab/generators/workload/wrk.py
- inject_fault: Introduces a fault into the application. Use your own injector or AIOpsLab's built-in one, which you can also extend. E.g., a misconfig in the K8S layer:
    from aiopslab.generators.fault.inject_virtual import *
    inj = VirtualizationFaultInjector(testbed="<namespace>")
    inj.inject_fault(microservices=["<service-name>"], fault_type="misconfig")
  Relevant Code: aiopslab/generators/fault
- eval: Evaluates the agent's solution using 3 params: (1) soln: agent's submitted solution if any, (2) trace: agent's action trace, and (3) duration: time taken by the agent.
  Here, you can use built-in default evaluators for each task and/or add custom evaluations. The results are stored in self.results:
    def eval(self, soln, trace, duration) -> dict:
        super().eval(soln, trace, duration) # default evaluation
        self.add_result("myMetric", my_metric(...)) # add custom metric
        return self.results
  Note: When an agent starts a problem, the orchestrator creates a Session object that stores the agent's interaction. The trace parameter is this session's recorded trace.
  Relevant Code: aiopslab/orchestrator/evaluators/
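As an illustration of what a custom metric passed to add_result might compute, the sketch below scores a localization answer and counts agent steps. Both functions are hypothetical, and the shapes of soln (a service name or list of names) and trace (a list of dicts with a "role" key) are assumptions; check the Session and evaluator code for the actual formats.

```python
from typing import Iterable, Optional, Union


def localization_accuracy(soln: Optional[Union[str, Iterable[str]]],
                          expected: list[str]) -> float:
    """Fraction of expected faulty services named in the agent's answer."""
    if not expected:
        return 0.0
    if isinstance(soln, str):
        soln = [soln]
    # Compare case-insensitively and ignore surrounding whitespace.
    predicted = {s.strip().lower() for s in (soln or [])}
    hits = sum(1 for e in expected if e.strip().lower() in predicted)
    return hits / len(expected)


def steps_taken(trace: list[dict]) -> int:
    """Count agent turns, assuming each trace item carries a "role" key."""
    return sum(1 for item in trace if item.get("role") == "assistant")
```

Inside eval, such a function would be used as `self.add_result("myMetric", localization_accuracy(soln, ["<service-name>"]))`.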
aiopslab
Generators
generators - the problem generators for aiopslab
├── fault - the fault generator organized by fault injection level
│   ├── base.py
│   ├── inject_app.py
│   ...
│   └── inject_virtual.py
└── workload - the workload generator organized by workload type
    └── wrk.py - wrk tool interface
Orchestrator
orchestrator
├── orchestrator.py - the main orchestration engine
├── parser.py - parser for agent responses
├── evaluators - eval metrics in the system
│   ├── prompts.py - prompts for LLM-as-a-Judge
│   ├── qualitative.py - qualitative metrics
│   └── quantitative.py - quantitative metrics
├── problems - problem definitions in aiopslab
│   ├── k8s_target_port_misconfig - e.g., A K8S TargetPort misconfig problem
│   ...
│   └── registry.py
├── actions - actions that agents can perform organized by AIOps task type
│   ├── base.py
│   ├── detection.py
│   ├── localization.py
│   ├── analysis.py
│   └── mitigation.py
└── tasks - individual AIOps task definition that agents need to solve
    ├── base.py
    ├── detection.py
    ├── localization.py
    ├── analysis.py
    └── mitigation.py
Service
service
├── apps - interfaces/impl. of each app
├── helm.py - helm interface to interact with the cluster
├── kubectl.py - kubectl interface to interact with the cluster
├── shell.py - shell interface to interact with the cluster
├── metadata - metadata and configs for each app
└── telemetry - observability tools besides observer, e.g., in-memory log telemetry for the agent
Observer
observer
├── filebeat - Filebeat installation
├── logstash - Logstash installation
├── prometheus - Prometheus installation
├── log_api.py - API to store the log data on disk
├── metric_api.py - API to store the metrics data on disk
└── trace_api.py - API to store the traces data on disk
Utils
├── config.yml - aiopslab configs
├── config.py - config parser
├── paths.py - paths and constants
├── session.py - aiopslab session manager
└── utils
    ├── actions.py - helpers for actions that agents can perform
    ├── cache.py - cache manager
    └── status.py - aiopslab status, error, and warnings
cli.py: A command line interface to interact with AIOpsLab, e.g., used by human operators.
@inproceedings{
chen2025aiopslab,
title={{AIO}psLab: A Holistic Framework to Evaluate {AI} Agents for Enabling Autonomous Clouds},
author={Yinfang Chen and Manish Shetty and Gagan Somashekar and Minghua Ma and Yogesh Simmhan and Jonathan Mace and Chetan Bansal and Rujia Wang and Saravan Rajmohan},
booktitle={Eighth Conference on Machine Learning and Systems},
year={2025},
url={https://openreview.net/forum?id=3EXBLwGxtq}
}
@inproceedings{shetty2024building,
title = {Building AI Agents for Autonomous Clouds: Challenges and Design Principles},
author = {Shetty, Manish and Chen, Yinfang and Somashekar, Gagan and Ma, Minghua and Simmhan, Yogesh and Zhang, Xuchao and Mace, Jonathan and Vandevoorde, Dax and Las-Casas, Pedro and Gupta, Shachee Mishra and Nath, Suman and Bansal, Chetan and Rajmohan, Saravan},
year = {2024},
booktitle = {Proceedings of 15th ACM Symposium on Cloud Computing},
}
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT license.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
