
Commit 97c2b23

Add aks-sreclaw extension - AI troubleshooting assistant for AKS
1 parent 2718c5d commit 97c2b23

74 files changed

Lines changed: 36341 additions & 0 deletions

src/aks-sreclaw/.gitignore

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# Ignore Poetry artifacts
poetry.lock
pyproject.toml

src/aks-sreclaw/HISTORY.rst

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
1+
.. :changelog:
2+
3+
Release History
4+
===============
5+
6+
Guidance
7+
++++++++
8+
If there is no rush to release a new version, please just add a description of the modification under the *Pending* section.
9+
10+
To release a new version, please select a new version number (usually plus 1 to last patch version, X.Y.Z -> Major.Minor.Patch, more details in `\doc <https://semver.org/>`_), and then add a new section named as the new version number in this file, the content should include the new modifications and everything from the *Pending* section. Finally, update the `VERSION` variable in `setup.py` with this new version number.
11+
12+
Pending
13+
+++++++
14+
15+
1.0.0b1
16+
+++++++
17+
* Add AKS SREClaw `az aks claw`.
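
The "plus 1 to last patch version" rule from the Guidance section can be sketched as follows. This is only an illustration: ``bump_patch`` is a hypothetical helper, not part of the extension, and it deliberately accepts only plain ``X.Y.Z`` strings, so a pre-release such as ``1.0.0b1`` is rejected rather than guessed at.

```python
import re


def bump_patch(version: str) -> str:
    """Return `version` with its patch component incremented by one.

    Hypothetical helper: only plain X.Y.Z versions are accepted.
    """
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"expected X.Y.Z, got {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return f"{major}.{minor}.{patch + 1}"
```

The result would then be used both as the new section title in this file and as the ``VERSION`` value in ``setup.py``.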

src/aks-sreclaw/README.rst

Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@
Azure CLI AKS SREClaw Extension
================================

This extension provides commands to manage AKS SREClaw, an autonomous AI-powered troubleshooting assistant for Azure Kubernetes Service clusters.

Installation
------------

To install the extension:

.. code-block:: bash

    az extension add --name aks-sreclaw

Usage
-----

Deploy SREClaw to your AKS cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Initialize and deploy SREClaw with interactive LLM configuration:

.. code-block:: bash

    az aks claw create --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system

This command will:

1. Prompt you to select an LLM provider (Azure OpenAI or OpenAI)
2. Guide you through entering model names and API credentials
3. Validate the connection to your LLM provider
4. Prompt for a Kubernetes service account name
5. Deploy the SREClaw helm chart to your cluster
6. Wait for pods to be ready (up to 5 minutes)
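
The final step above amounts to polling with a deadline. The sketch below shows that pattern under stated assumptions: ``wait_until_ready``, its parameters, and the 5-second poll interval are hypothetical and are not taken from the extension; only the 5-minute ceiling comes from the list above. The readiness check is supplied by the caller, so no Kubernetes API is assumed.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_seconds: float = 300.0,  # 5 minutes, as stated in the README
    poll_interval: float = 5.0,      # assumed value, not from the source
) -> bool:
    """Poll `is_ready` until it returns True or the timeout elapses.

    Returns True if readiness was observed, False on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if is_ready():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval, remaining))
```

Checking readiness before checking the deadline means a zero timeout still performs one check, which suits a "check once, do not block" call.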

Deploy without waiting for completion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    az aks claw create --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system --no-wait

Check deployment status
~~~~~~~~~~~~~~~~~~~~~~~

View the current status of your SREClaw deployment:

.. code-block:: bash

    az aks claw status --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system

This displays:

- Helm release status
- Deployment replica counts
- Pod status and readiness
- Configured LLM providers with models and API endpoints

Connect to SREClaw service
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Establish a port-forward connection to access the SREClaw web interface:

.. code-block:: bash

    az aks claw connect --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system

The command will:

- Display the gateway authentication token
- Create a port-forward to localhost:18789
- Provide instructions to open the service in your browser

To use a different local port:

.. code-block:: bash

    az aks claw connect --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system --local-port 8080

Press Ctrl+C to stop the port-forwarding.

Delete SREClaw
~~~~~~~~~~~~~~

Uninstall SREClaw and clean up all resources:

.. code-block:: bash

    az aks claw delete --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system

This command will:

1. Prompt for confirmation
2. Uninstall the SREClaw helm chart
3. Delete all associated secrets and configurations
4. Wait for pods to be removed

To delete without waiting:

.. code-block:: bash

    az aks claw delete --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system --no-wait

LLM Provider Configuration
---------------------------

Azure OpenAI
~~~~~~~~~~~~

When prompted during deployment, select Azure OpenAI and provide:

- **Models**: Comma-separated model names (e.g., ``gpt-5.4,gpt-5.1``)
- **API Key**: Your Azure OpenAI API key
- **API Base**: Your Azure OpenAI endpoint (e.g., ``https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/``)

OpenAI
~~~~~~

When prompted during deployment, select OpenAI and provide:

- **Models**: Comma-separated model names (e.g., ``gpt-5.4,gpt-5.1``)
- **API Key**: Your OpenAI API key
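
Both providers take the model list as a single comma-separated string. The sketch below shows one way such input could be normalised. ``parse_model_names`` is a hypothetical helper, and trimming whitespace, dropping empty entries, and removing duplicates are assumptions for illustration; the README does not say how the extension treats these cases.

```python
from typing import List


def parse_model_names(raw: str) -> List[str]:
    """Split a comma-separated model string into a clean, ordered list.

    Assumed behaviour (not confirmed by the source): surrounding whitespace
    is trimmed, empty entries are skipped, duplicates keep first position.
    """
    names: List[str] = []
    for item in raw.split(","):
        name = item.strip()
        if name and name not in names:
            names.append(name)
    if not names:
        raise ValueError("at least one model name is required")
    return names
```

With the example value from the lists above, ``parse_model_names("gpt-5.4,gpt-5.1")`` yields the two names in the order given.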

Prerequisites
-------------

- Azure CLI installed
- An AKS cluster
- kubectl configured to access your cluster
- Appropriate permissions to deploy resources to your AKS cluster
- An LLM provider account (Azure OpenAI or OpenAI) with API access

Service Account Requirements
-----------------------------

SREClaw requires a Kubernetes service account with:

- Appropriate Role and RoleBinding in the target namespace
- For Azure resource access: annotation with ``azure.workload.identity/client-id: <managed-identity-client-id>``

Ensure you create these before running ``az aks claw create``.
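
As a rough illustration of the service account described above, the following sketch builds a ServiceAccount manifest as a plain dictionary. ``build_service_account`` and the names in the usage lines are hypothetical, and the Role and RoleBinding are left out because the README does not state which permissions SREClaw needs.

```python
import json
from typing import Optional


def build_service_account(name: str, namespace: str,
                          client_id: Optional[str] = None) -> dict:
    """Return a ServiceAccount manifest, optionally annotated for
    Azure workload identity (hypothetical helper for illustration)."""
    manifest = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {"name": name, "namespace": namespace},
    }
    if client_id:
        # Annotation key taken from the requirements list above.
        manifest["metadata"]["annotations"] = {
            "azure.workload.identity/client-id": client_id,
        }
    return manifest


if __name__ == "__main__":
    # "sreclaw-sa" and the client ID are placeholders, not real values.
    sa = build_service_account("sreclaw-sa", "kube-system",
                               "<managed-identity-client-id>")
    print(json.dumps(sa, indent=2))
```

The printed JSON can be saved to a file and applied with ``kubectl apply -f``, which accepts JSON as well as YAML manifests.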

Troubleshooting
---------------

Check deployment status
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    az aks claw status --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system

View pod logs
~~~~~~~~~~~~~

.. code-block:: bash

    kubectl logs -n kube-system -l app.kubernetes.io/name=aks-sreclaw

Verify helm release
~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    helm list -n kube-system

Uninstall and reinstall
~~~~~~~~~~~~~~~~~~~~~~~~

If you encounter issues:

.. code-block:: bash

    az aks claw delete --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system
    az aks claw create --resource-group MyResourceGroup --name MyAKSCluster --namespace kube-system

Support
-------

For issues and feature requests, please visit:
https://github.com/Azure/azure-cli-extensions

License
-------

This extension is licensed under the MIT License. See LICENSE.txt for details.
Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for license information.
# --------------------------------------------------------------------------------------------

from azext_aks_sreclaw._client_factory import CUSTOM_MGMT_AKS

# pylint: disable=unused-import
from azure.cli.core import AzCommandsLoader
from azure.cli.core.profiles import register_resource_type


def register_aks_sreclaw_resource_type():
    register_resource_type(
        "latest",
        CUSTOM_MGMT_AKS,
        None,
    )


class ContainerServiceCommandsLoader(AzCommandsLoader):

    def __init__(self, cli_ctx=None):
        from azure.cli.core.commands import CliCommandType
        register_aks_sreclaw_resource_type()

        aks_sreclaw_custom = CliCommandType(operations_tmpl='azext_aks_sreclaw.custom#{}')
        super().__init__(
            cli_ctx=cli_ctx,
            custom_command_type=aks_sreclaw_custom,
        )

    def load_command_table(self, args):
        super().load_command_table(args)
        from azext_aks_sreclaw.commands import load_command_table
        load_command_table(self, args)
        return self.command_table

    def load_arguments(self, command):
        super().load_arguments(command)
        from azext_aks_sreclaw._params import load_arguments
        load_arguments(self, command)


COMMAND_LOADER_CLS = ContainerServiceCommandsLoader
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for license information.
# --------------------------------------------------------------------------------------------

from azure.cli.core.commands.client_factory import get_mgmt_service_client
from azure.cli.core.profiles import CustomResourceType

CUSTOM_MGMT_AKS = CustomResourceType('azext_aks_sreclaw.vendored_sdks.azure_mgmt_containerservice.2025_10_01',
                                     'ContainerServiceClient')

# Note: cf_xxx, as the client_factory option value of a command group at command declaration, it should ignore
# parameters other than cli_ctx; get_xxx_client is used as the client of other services in the command implementation,
# and usually accepts subscription_id as a parameter to reconfigure the subscription when sending the request


# container service clients
def get_container_service_client(cli_ctx, subscription_id=None):
    return get_mgmt_service_client(cli_ctx, CUSTOM_MGMT_AKS, subscription_id=subscription_id)


def cf_managed_clusters(cli_ctx, *_):
    return get_container_service_client(cli_ctx).managed_clusters
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for license information.
# --------------------------------------------------------------------------------------------

import os

# Configuration paths
home_dir = os.path.expanduser("~")

AGENT_NAMESPACE = "kube-system"
AKS_SRECLAW_LABEL_SELECTOR = "app.kubernetes.io/name=aks-sreclaw"

# Kubernetes WebSocket exec protocol constants
RESIZE_CHANNEL = 4  # WebSocket channel for terminal resize messages
# WebSocket heartbeat configuration (matching kubectl client-go)
# Based on kubernetes/client-go/tools/remotecommand/websocket.go#L59-L65
# pingPeriod = 5 * time.Second
# pingReadDeadline = (pingPeriod * 12) + (1 * time.Second)
# The read deadline is calculated to allow up to 12 missed pings plus 1 second buffer
# This provides tolerance for network delays while detecting actual connection failures
HEARTBEAT_INTERVAL = 5.0  # pingPeriod: 5 seconds between pings
HEARTBEAT_TIMEOUT = (HEARTBEAT_INTERVAL * 12) + 1  # pingReadDeadline: 61 seconds total timeout

# AKS SREClaw Version (shared by helm chart and docker image)
AKS_SRECLAW_VERSION = "0.0.0"

# Helm Configuration
HELM_VERSION = "3.16.0"
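
The heartbeat constants in the file above work out to a 61-second read deadline (12 ping periods of 5 seconds, plus a 1-second buffer). The sketch below repeats that arithmetic and shows one way a staleness check could use it. ``connection_is_stale`` and its parameters are hypothetical; the commit excerpt does not show how the extension consumes these constants.

```python
# Values copied from the constants file above.
HEARTBEAT_INTERVAL = 5.0                           # seconds between pings
HEARTBEAT_TIMEOUT = (HEARTBEAT_INTERVAL * 12) + 1  # 61.0 seconds


def connection_is_stale(last_pong_at: float, now: float,
                        timeout: float = HEARTBEAT_TIMEOUT) -> bool:
    """Return True when no pong has arrived within the read deadline.

    Hypothetical helper: both timestamps are seconds on the same
    monotonic clock (for example, values from time.monotonic()).
    """
    return (now - last_pong_at) > timeout
```

Under this sketch a connection whose last pong was exactly 61 seconds ago is still treated as live; only a strictly longer silence counts as a failure.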
