
Commit 830db18

Merge branch 'master' into grafana-alerts-docs

2 parents 3ef5097 + 603f252
5 files changed: 254 additions & 0 deletions

docs/configuration/holmesgpt/builtin_toolsets.rst

Lines changed: 6 additions & 0 deletions

@@ -10,6 +10,7 @@ Builtin Toolsets
    toolsets/aws
    toolsets/confluence
    toolsets/coralogix_logs
+   toolsets/datadog_logs
    toolsets/datetime
    toolsets/docker
    toolsets/grafanaloki
@@ -62,6 +63,11 @@ by the user by providing credentials or API keys to external systems.
       :link: toolsets/coralogix_logs
       :link-type: doc

+   .. grid-item-card:: :octicon:`cpu;1em;` Datadog logs
+      :class-card: sd-bg-light sd-bg-text-light
+      :link: toolsets/datadog_logs
+      :link-type: doc
+
    .. grid-item-card:: :octicon:`cpu;1em;` Datetime
       :class-card: sd-bg-light sd-bg-text-light
       :link: toolsets/datetime

docs/configuration/holmesgpt/toolsets/_toolsets_that_provide_logging.inc.rst

Lines changed: 1 addition & 0 deletions

@@ -2,5 +2,6 @@ HolmesGPT provides several out-of-the-box alternatives for log access. You can s

 * :ref:`kubernetes/logs <toolset_kubernetes_logs>`: Access logs directly through Kubernetes. **This is the default toolset.**
 * :ref:`coralogix/logs <toolset_coralogix_logs>`: Access logs through Coralogix.
+* :ref:`datadog/logs <toolset_datadog_logs>`: Access logs through Datadog.
 * :ref:`grafana/loki <toolset_grafana_loki>`: Access Loki logs by proxying through a Grafana instance.
 * :ref:`opensearch/logs <toolset_opensearch_logs>`: Access logs through OpenSearch.
docs/configuration/holmesgpt/toolsets/datadog_logs.rst

Lines changed: 187 additions & 0 deletions

@@ -0,0 +1,187 @@
.. _toolset_datadog_logs:

Datadog logs
============

By enabling this toolset, HolmesGPT will fetch pod logs from `Datadog <https://www.datadoghq.com/>`_.

You **should** enable this toolset to replace the default :ref:`kubernetes/logs <toolset_kubernetes_logs>`
toolset if all your Kubernetes pod logs are consolidated inside Datadog. It makes it easier for HolmesGPT
to fetch incident logs, including the ability to query past logs precisely.


.. include:: ./_toolsets_that_provide_logging.inc.rst

Configuration
^^^^^^^^^^^^^

.. md-tab-set::

   .. md-tab-item:: Robusta Helm Chart

      .. code-block:: yaml

         holmes:
           toolsets:
             datadog/logs:
               enabled: true
               config:
                 dd_api_key: <your-datadog-api-key> # Required. Your Datadog API key
                 dd_app_key: <your-datadog-app-key> # Required. Your Datadog Application key
                 site_api_url: https://api.datadoghq.com # Required. Your Datadog site URL (e.g. https://api.us3.datadoghq.com for US3)
                 indexes: ["*"] # Optional. List of Datadog indexes to search. Default: ["*"]
                 storage_tiers: ["indexes"] # Optional. Ordered list of storage tiers to query (fallback mechanism). Options: "indexes", "online-archives", "flex". Default: ["indexes"]
                 labels: # Optional. Map Datadog labels to Kubernetes resources
                   pod: "pod_name"
                   namespace: "kube_namespace"
                 page_size: 300 # Optional. Number of logs per API page. Default: 300
                 default_limit: 1000 # Optional. Maximum logs to fetch when the LLM does not specify a limit. Default: 1000
                 request_timeout: 60 # Optional. API request timeout in seconds. Default: 60

             kubernetes/logs:
               enabled: false # HolmesGPT's default logging mechanism MUST be disabled


      .. include:: ./_toolset_configuration.inc.rst

   .. md-tab-item:: Holmes CLI

      Add the following to **~/.holmes/config.yaml**, creating the file if it doesn't exist:

      .. code-block:: yaml

         toolsets:
           datadog/logs:
             enabled: true
             config:
               dd_api_key: <your-datadog-api-key> # Required. Your Datadog API key
               dd_app_key: <your-datadog-app-key> # Required. Your Datadog Application key
               site_api_url: https://api.datadoghq.com # Required. Your Datadog site URL (e.g. https://api.us3.datadoghq.com for US3)
               indexes: ["*"] # Optional. List of Datadog indexes to search. Default: ["*"]
               storage_tiers: ["indexes"] # Optional. Ordered list of storage tiers to query (fallback mechanism). Options: "indexes", "online-archives", "flex". Default: ["indexes"]
               labels: # Optional. Map Datadog labels to Kubernetes resources
                 pod: "pod_name"
                 namespace: "kube_namespace"
               page_size: 300 # Optional. Number of logs per API page. Default: 300
               default_limit: 1000 # Optional. Maximum logs to fetch when the LLM does not specify a limit. Default: 1000
               request_timeout: 60 # Optional. API request timeout in seconds. Default: 60

           kubernetes/logs:
             enabled: false # HolmesGPT's default logging mechanism MUST be disabled

Getting API and Application Keys
********************************

To use this toolset, you need both a Datadog API key and an Application key:

1. **API Key**: Go to Organization Settings > API Keys in your Datadog console

   * The API key must have the ``logs_read_data`` permission scope
   * When creating a new key, ensure this permission is enabled

2. **Application Key**: Go to Organization Settings > Application Keys in your Datadog console

For more information, see the `Datadog API documentation <https://docs.datadoghq.com/api/latest/authentication/>`_.

Configuring Site URL
********************

The ``site_api_url`` must match your Datadog site. Common values include:

* ``https://api.datadoghq.com`` - US1
* ``https://api.us3.datadoghq.com`` - US3
* ``https://api.us5.datadoghq.com`` - US5
* ``https://api.datadoghq.eu`` - EU
* ``https://api.ap1.datadoghq.com`` - AP1

For a complete list of site URLs, see the `Datadog site documentation <https://docs.datadoghq.com/getting_started/site/>`_.

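Before enabling the toolset, you can sanity-check a key against your site using Datadog's key-validation endpoint. This is an optional sketch using only the standard library; ``build_validate_request`` is a hypothetical helper name, not part of the toolset:

```python
import urllib.request

def build_validate_request(site_api_url: str, dd_api_key: str) -> urllib.request.Request:
    # GET /api/v1/validate checks whether an API key is valid for this site.
    # Send with urllib.request.urlopen(req); a good key returns {"valid": true}.
    req = urllib.request.Request(f"{site_api_url}/api/v1/validate")
    req.add_header("DD-API-KEY", dd_api_key)
    return req

req = build_validate_request("https://api.datadoghq.eu", "<your-datadog-api-key>")
```

If the request returns 403, the key does not match the site URL you configured.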
Configuring Storage Tiers
*************************

Datadog offers different storage tiers for logs, with varying retention and costs:

.. list-table::
   :header-rows: 1
   :widths: 20 40 40

   * - Storage Tier
     - Description
     - Use Case
   * - indexes
     - Hot storage for recent logs (default)
     - Real-time analysis and alerting
   * - online-archives
     - Warm storage for older logs
     - Historical log analysis
   * - flex
     - Cost-effective storage
     - Long-term retention

The toolset uses storage tiers as a fallback mechanism: each subsequent tier is queried only if the previous tier returned no results.
For example, if the toolset is configured with ``storage_tiers: ["indexes", "online-archives"]``, then:

* Holmes first runs the query against the ``indexes`` tier
* If that returns no results at all, Holmes then queries ``online-archives``

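The fallback behavior described above can be sketched as follows. This is a minimal illustration, not the toolset's actual code; ``run_query`` stands in for the Datadog query call:

```python
def query_with_fallback(query, storage_tiers, run_query):
    """Try each configured tier in order; stop at the first tier that returns logs."""
    for tier in storage_tiers:
        results = run_query(query, tier)
        if results:
            return tier, results
    # No tier returned anything
    return None, []
```

Note that a later tier is consulted only when the earlier tier returns nothing at all, not when it returns partial results.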
Handling Rate Limits
********************

If you encounter rate limiting from Datadog (visible as warning messages in the Holmes logs), you can adjust the following parameters:

* **page_size**: Reduce this value to fetch fewer logs per API request. This helps avoid hitting rate limits on individual requests.
* **default_limit**: Lower this value to reduce the total number of logs fetched when no explicit limit is specified.

Example configuration for rate-limited environments:

.. code-block:: yaml

   toolsets:
     datadog/logs:
       enabled: true
       config:
         page_size: 100 # Reduced from the default of 300
         default_limit: 500 # Reduced from the default of 1000

When rate limiting occurs, Holmes automatically retries with exponential backoff. You'll see warnings like:
``DataDog logs toolset is rate limited/throttled. Waiting X.Xs until reset time``

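The retry behavior can be illustrated with a generic exponential-backoff sketch. The names here are hypothetical, and Holmes's actual implementation waits until Datadog's reported reset time rather than a computed delay:

```python
import time

class RateLimited(Exception):
    """Raised by the (stubbed) fetch call when the API throttles us."""

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Retry a rate-limited call, doubling the wait between attempts."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RateLimited:
            wait = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            time.sleep(wait)
    raise RuntimeError("still rate limited after retries")
```

Lowering ``page_size`` reduces how often this path is hit in the first place, which is usually preferable to relying on retries.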
Configuring Labels
******************

You can customize the labels the toolset uses to identify Kubernetes resources. This is **optional** and only needed if your
Datadog logs use different field names than the defaults.

.. code-block:: yaml

   toolsets:
     datadog/logs:
       enabled: true
       config:
         labels:
           pod: "pod_name" # The field name for the Kubernetes pod name in your Datadog logs
           namespace: "kube_namespace" # The field name for the Kubernetes namespace in your Datadog logs

To find the correct field names in your Datadog logs:

1. Go to Logs > Search in your Datadog console
2. View a sample log entry
3. Identify the field names used for the pod name and namespace
4. Update the labels configuration accordingly

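To see how the mapping is used, the label configuration can be combined into a Datadog-style ``field:value`` search query. This is an assumption about the query shape for illustration only; ``build_log_query`` is a hypothetical helper, not part of the toolset:

```python
def build_log_query(pod: str, namespace: str, labels: dict) -> str:
    """Combine the configured field names into a Datadog-style search query."""
    return f'{labels["pod"]}:{pod} {labels["namespace"]}:{namespace}'

# With the default mapping from the YAML above:
labels = {"pod": "pod_name", "namespace": "kube_namespace"}
query = build_log_query("payment-api-7d9f", "prod", labels)
# 'pod_name:payment-api-7d9f kube_namespace:prod'
```

If your agent tags pods differently, only the ``labels`` mapping needs to change; the query construction stays the same.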
.. include:: ./_disable_default_logging_toolset.inc.rst


Capabilities
^^^^^^^^^^^^

.. include:: ./_toolset_capabilities.inc.rst

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Tool Name
     - Description
   * - fetch_pod_logs
     - Retrieve logs from Datadog with support for filtering, time ranges, and multiple storage tiers

helm/robusta/values.yaml

Lines changed: 1 addition & 0 deletions

@@ -91,6 +91,7 @@ lightActions:
   - holmes_issue_chat
   - holmes_chat
   - holmes_workload_chat
+  - list_pods

 # install prometheus, alert-manager, and grafana along with Robusta?
 enablePrometheusStack: false

playbooks/robusta_playbooks/k8s_resource_enrichments.py

Lines changed: 59 additions & 0 deletions

@@ -380,3 +380,62 @@ def status_enricher(event: KubernetesResourceEvent, params: StatusEnricherParams
         ),
     ]
 )
+
+
+class ListPodsParams(ActionParams):
+    """
+    :var name: Filter pods whose name contains this parameter
+    :var namespace: Filter pods in this namespace. If omitted, pods from all namespaces are returned
+    :var limit: Max number of pods to return
+    """
+
+    name: str
+    namespace: Optional[str] = None
+    limit: int = 50
+
+
+@action
+def list_pods(event: ExecutionBaseEvent, params: ListPodsParams):
+    """
+    List pods by name and, optionally, by namespace
+    """
+    cluster = event.get_context().cluster_name
+    filtered_pods = []
+    continue_token = None
+    batch_size = 300  # Load pods in batches
+
+    # Keep fetching until we have enough matching pods or no more pods
+    while len(filtered_pods) < params.limit:
+        if params.namespace:
+            pod_list = client.CoreV1Api().list_namespaced_pod(
+                namespace=params.namespace,
+                limit=batch_size,
+                _continue=continue_token,
+            )
+        else:
+            pod_list = client.CoreV1Api().list_pod_for_all_namespaces(
+                limit=batch_size,
+                _continue=continue_token,
+            )
+        # Filter pods by name from the current batch
+        batch_filtered = [
+            pod for pod in pod_list.items
+            if params.name.lower() in pod.metadata.name.lower()
+        ]
+        filtered_pods.extend(batch_filtered)
+
+        # Check whether there are more pods to fetch. The Python client exposes
+        # the list continue token as `_continue` (`continue` is a reserved word),
+        # so getattr(metadata, 'continue', None) would always return None.
+        continue_token = pod_list.metadata._continue
+        if not continue_token:
+            break
+
+    # Apply the final limit
+    limited_pods = filtered_pods[:params.limit]
+
+    # Convert to RelatedPod format (same as the related_pods action)
+    pod_objects = [to_pod_obj(pod, cluster, include_raw_data=False) for pod in limited_pods]
+
+    # Return as JSON
+    event.add_enrichment([
+        JsonBlock(json.dumps([pod.dict() for pod in pod_objects], default=str))
+    ])
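The pagination pattern used by `list_pods` can be exercised on its own with a stubbed page fetcher. This is a simplified sketch of the loop above (working on plain name strings instead of Kubernetes pod objects; `fetch_page` is a hypothetical stand-in for the API call):

```python
def filter_pods_paged(fetch_page, name_filter, limit, batch_size=300):
    """Page through pods, keeping names containing the filter, until the limit is hit."""
    matches, token = [], None
    while len(matches) < limit:
        # fetch_page returns (names_in_batch, continue_token_or_None)
        items, token = fetch_page(batch_size, token)
        matches.extend(name for name in items if name_filter.lower() in name.lower())
        if not token:  # no more pages
            break
    return matches[:limit]
```

Fetching in batches keeps memory bounded on large clusters, and the loop stops early once enough matches are found rather than listing every pod.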
