Commit 009042b

dkirov-dd and steveny91 authored
[New integration] Hugging Face TGI (DataDog#20905)
* Add skeleton
* Add payload fixture
* Add tests
* Update models
* Fill out metadata.csv
* Fix tests
* Implement python logic
* Fix missing import
* Fix E2E test
* Fix ddev validations
* Run linter
* Rename changelog
* Fix metadata.csv metric types and unit names
* Fix manifest.json
* Fix metric types in metadata.csv again
* Sync CI
* Update README
* Add metrics check to manifest
* Sync configuration files
* Sync configuration files
* Fix metadata.csv for validation
* Sync labeler
* Attempt a real E2E env
* Revert "Attempt a real E2E env"

  This reverts commit 11c00ea.
* Remove images
* Bump minimum base package version
* Remove `le` label rename

---------

Co-authored-by: Steven Yuen <steven.yuen@datadoghq.com>
1 parent d65319f commit 009042b

29 files changed

Lines changed: 3593 additions & 0 deletions

.codecov.yml

Lines changed: 9 additions & 0 deletions
@@ -286,6 +286,10 @@ coverage:
         target: 75
         flags:
         - hazelcast
+      Hugging_Face_TGI:
+        target: 75
+        flags:
+        - hugging_face_tgi
       IBM_ACE:
         target: 75
         flags:
@@ -1183,6 +1187,11 @@ flags:
     paths:
     - http_check/datadog_checks/http_check
     - http_check/tests
+  hugging_face_tgi:
+    carryforward: true
+    paths:
+    - hugging_face_tgi/datadog_checks/hugging_face_tgi
+    - hugging_face_tgi/tests
   ibm_ace:
     carryforward: true
     paths:

.github/workflows/config/labeler.yml

Lines changed: 2 additions & 0 deletions
@@ -299,6 +299,8 @@ integration/hubspot_content_hub:
 - hubspot_content_hub/**/*
 integration/hudi:
 - hudi/**/*
+integration/hugging_face_tgi:
+- hugging_face_tgi/**/*
 integration/hyperv:
 - hyperv/**/*
 integration/iam_access_analyzer:

.github/workflows/test-all.yml

Lines changed: 19 additions & 0 deletions
@@ -1522,6 +1522,25 @@ jobs:
       minimum-base-package: ${{ inputs.minimum-base-package }}
       pytest-args: ${{ inputs.pytest-args }}
     secrets: inherit
+  jc3781e1:
+    uses: ./.github/workflows/test-target.yml
+    with:
+      job-name: Hugging Face TGI
+      target: hugging_face_tgi
+      platform: linux
+      runner: '["ubuntu-22.04"]'
+      repo: "${{ inputs.repo }}"
+      python-version: "${{ inputs.python-version }}"
+      latest: ${{ inputs.latest }}
+      agent-image: "${{ inputs.agent-image }}"
+      agent-image-py2: "${{ inputs.agent-image-py2 }}"
+      agent-image-windows: "${{ inputs.agent-image-windows }}"
+      agent-image-windows-py2: "${{ inputs.agent-image-windows-py2 }}"
+      test-py2: ${{ inputs.test-py2 }}
+      test-py3: ${{ inputs.test-py3 }}
+      minimum-base-package: ${{ inputs.minimum-base-package }}
+      pytest-args: ${{ inputs.pytest-args }}
+    secrets: inherit
   j5a9585a:
     uses: ./.github/workflows/test-target.yml
     with:

hugging_face_tgi/CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+# CHANGELOG - Hugging Face TGI
+
+<!-- towncrier release notes start -->
+

hugging_face_tgi/README.md

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+# Agent Check: Hugging Face TGI
+
+## Overview
+
+This check monitors [Hugging Face Text Generation Inference (TGI)][1] through the Datadog Agent. TGI is a toolkit for deploying and serving Large Language Models (LLMs) optimized for text generation with features like continuous batching, tensor parallelism, token streaming, and production-ready optimizations.
+
+The integration provides comprehensive monitoring of your TGI servers by collecting:
+- Request performance metrics including latency, throughput, and token generation rates
+- Batch processing metrics for inference optimization insights
+- Queue depth and request flow monitoring
+- Model serving health and operational metrics
+
+This enables teams to optimize LLM inference performance, track resource utilization, troubleshoot bottlenecks, and ensure reliable model serving at scale.
+
+## Setup
+
+Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the [Autodiscovery Integration Templates][3] for guidance on applying these instructions.
+
+### Installation
+
+The Hugging Face TGI check is included in the [Datadog Agent][2] package.
+No additional installation is needed on your server.
+
+### Configuration
+
+#### Metrics
+
+1. Ensure that your TGI server is exposing Prometheus metrics on the default metrics endpoint. TGI automatically exposes metrics at `/metrics` endpoint when running. For more information about TGI monitoring, see the [official documentation][10].
+
+2. Edit the `hugging_face_tgi.d/conf.yaml` file, which is located in the `conf.d/` folder at the root of your [Agent's configuration directory][11], to start collecting your Hugging Face TGI performance data. See the [sample hugging_face_tgi.d/conf.yaml][4] for all available configuration options.
+
+   ```yaml
+   instances:
+     - openmetrics_endpoint: http://localhost:80/metrics
+   ```
+
+3. [Restart the Agent][5].
+
+### Validation
+
+[Run the Agent's status subcommand][6] and look for `hugging_face_tgi` under the Checks section.
+
+## Data Collected
+
+### Metrics
+
+See [metadata.csv][7] for a list of metrics provided by this integration.
+
+Key metrics include:
+
+- **Request metrics**: Total requests, successful requests, failed requests, and request duration
+- **Queue metrics**: Queue size and queue duration for monitoring throughput bottlenecks
+- **Token metrics**: Generated tokens, input length, and mean time per token for performance analysis
+- **Batch metrics**: Batch size, batch concatenation, and batch processing durations for optimization insights
+- **Inference metrics**: Forward pass duration, decode duration, and filter duration for model performance monitoring
+
+### Events
+
+The Hugging Face TGI integration does not include any events.
+
+### Service Checks
+
+See [service_checks.json][8] for a list of service checks provided by this integration.
+
+## Troubleshooting
+
+In containerized environments, ensure that the Agent has network access to the TGI metrics endpoint specified in the `hugging_face_tgi.d/conf.yaml` file.
+
+Need help? Contact [Datadog support][9].
+
+
+[1]: https://huggingface.co/docs/text-generation-inference/index
+[2]: /account/settings/agent/latest
+[3]: https://docs.datadoghq.com/agent/kubernetes/integrations/
+[4]: https://github.com/DataDog/integrations-core/blob/master/hugging_face_tgi/datadog_checks/hugging_face_tgi/data/conf.yaml.example
+[5]: https://docs.datadoghq.com/agent/guide/agent-commands/#start-stop-and-restart-the-agent
+[6]: https://docs.datadoghq.com/agent/guide/agent-commands/#agent-status-and-information
+[7]: https://github.com/DataDog/integrations-core/blob/master/hugging_face_tgi/metadata.csv
+[8]: https://github.com/DataDog/integrations-core/blob/master/hugging_face_tgi/assets/service_checks.json
+[9]: https://docs.datadoghq.com/help/
+[10]: https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/monitoring
+[11]: https://docs.datadoghq.com/agent/configuration/agent-configuration-files/#agent-configuration-directory
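
Step 1 of the README's Configuration section asks you to confirm that TGI is serving Prometheus metrics before pointing the Agent at it. A minimal Python sketch of that check, using only the standard library, might look like the following. The helper names `metric_names` and `fetch_metrics` are hypothetical, the sample payload is illustrative rather than captured from a real server, and the `tgi_` metric names in it are assumptions.

```python
import re
import urllib.request

# Matches the metric name at the start of a Prometheus exposition sample line.
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)')


def metric_names(exposition_text):
    """Return the sorted, de-duplicated metric names found in Prometheus text output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and `# HELP` / `# TYPE` comment lines.
        if not line or line.startswith('#'):
            continue
        match = _SAMPLE_RE.match(line)
        if match:
            names.add(match.group(1))
    return sorted(names)


def fetch_metrics(endpoint='http://localhost:80/metrics', timeout=5):
    """Fetch the raw payload from the endpoint configured in conf.yaml."""
    with urllib.request.urlopen(endpoint, timeout=timeout) as response:
        return response.read().decode('utf-8')


# Illustrative payload only (assumed names); a real server exposes more series.
SAMPLE = """\
# HELP tgi_queue_size Current queue size
# TYPE tgi_queue_size gauge
tgi_queue_size 3
# TYPE tgi_request_count counter
tgi_request_count 42
"""

print(metric_names(SAMPLE))
```

Against a running server, `metric_names(fetch_metrics())` should return a non-empty list; an empty list or a connection error suggests the endpoint in `conf.yaml` needs adjusting.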
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+name: Hugging Face TGI
+files:
+- name: hugging_face_tgi.yaml
+  options:
+  - template: init_config
+    options:
+    - template: init_config/openmetrics
+  - template: instances
+    options:
+    - template: instances/openmetrics
+      overrides:
+        openmetrics_endpoint.value.example: http://localhost:80/metrics
+        openmetrics_endpoint.description: |
+          Endpoint exposing Hugging Face TGI's Prometheus metrics. For more information, refer to
+          https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/monitoring
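
The `overrides` keys in the spec above are dotted paths into the options supplied by the `instances/openmetrics` template. The sketch below illustrates that idea with plain nested dictionaries. It is not ddev's implementation: the `apply_overrides` helper and the template contents are hypothetical stand-ins chosen only to show how a dotted path can replace one nested value while leaving its siblings untouched.

```python
import copy


def apply_overrides(template, overrides):
    """Return a copy of `template` with each dotted-path override applied."""
    result = copy.deepcopy(template)
    for dotted_path, value in overrides.items():
        *parents, leaf = dotted_path.split('.')
        node = result
        for key in parents:
            # Create intermediate mappings on demand so new paths can be set.
            node = node.setdefault(key, {})
        node[leaf] = value
    return result


# Hypothetical, heavily simplified stand-in for the shared template.
template = {
    'openmetrics_endpoint': {
        'description': 'Generic endpoint description.',
        'value': {'type': 'string', 'example': 'http://localhost:9090/metrics'},
    }
}

overrides = {
    'openmetrics_endpoint.value.example': 'http://localhost:80/metrics',
    'openmetrics_endpoint.description': "Endpoint exposing Hugging Face TGI's Prometheus metrics.",
}

rendered = apply_overrides(template, overrides)
print(rendered['openmetrics_endpoint']['value']['example'])
```

Working on a deep copy keeps the shared template reusable by other integrations, which is the property the override mechanism depends on.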
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Initial Release
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+# (C) Datadog, Inc. 2025-present
+# All rights reserved
+# Licensed under a 3-clause BSD style license (see LICENSE)
+__version__ = '0.0.1'
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+# (C) Datadog, Inc. 2025-present
+# All rights reserved
+# Licensed under a 3-clause BSD style license (see LICENSE)
+from .__about__ import __version__
+from .check import HuggingFaceTgiCheck
+
+__all__ = ['__version__', 'HuggingFaceTgiCheck']
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+# (C) Datadog, Inc. 2025-present
+# All rights reserved
+# Licensed under a 3-clause BSD style license (see LICENSE)
+from datadog_checks.base import OpenMetricsBaseCheckV2
+from datadog_checks.hugging_face_tgi.metrics import METRIC_MAP, RENAME_LABELS_MAP
+
+
+class HuggingFaceTgiCheck(OpenMetricsBaseCheckV2):
+    __NAMESPACE__ = 'hugging_face_tgi'
+
+    DEFAULT_METRIC_LIMIT = 0
+
+    def __init__(self, name, init_config, instances=None):
+        super(HuggingFaceTgiCheck, self).__init__(
+            name,
+            init_config,
+            instances,
+        )
+
+    def get_default_config(self):
+        return {
+            'metrics': [METRIC_MAP],
+            'rename_labels': RENAME_LABELS_MAP,
+        }
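
`get_default_config` supplies the metric mapping and label renames that apply when an instance does not set its own. The `metrics` module is not shown in this part of the diff, so the sketch below uses hypothetical stand-ins for `METRIC_MAP` and `RENAME_LABELS_MAP` (the raw `tgi_*` names, the mapped names, and the label rename are all assumptions). It models the layering with `collections.ChainMap` so that instance settings take precedence over defaults, runs without the Agent installed, and does not attempt to reproduce the base class's internals.

```python
from collections import ChainMap

# Hypothetical mapping from raw Prometheus names to Datadog metric suffixes.
METRIC_MAP = {
    'tgi_queue_size': 'queue.size',
    'tgi_batch_current_size': 'batch.current.size',
    'tgi_request_duration': 'request.duration',
}

# Hypothetical label rename; the real module may define different entries.
RENAME_LABELS_MAP = {'version': 'tgi_version'}

NAMESPACE = 'hugging_face_tgi'


def default_config():
    """Mirror the shape returned by the check's get_default_config()."""
    return {'metrics': [METRIC_MAP], 'rename_labels': RENAME_LABELS_MAP}


def effective_config(instance):
    """Layer instance settings over the defaults; instance keys win."""
    return ChainMap(instance, default_config())


def datadog_name(raw_name, config):
    """Resolve a raw metric name to its namespaced name, or None if unmapped."""
    for mapping in config['metrics']:
        if raw_name in mapping:
            return f'{NAMESPACE}.{mapping[raw_name]}'
    return None


instance = {'openmetrics_endpoint': 'http://localhost:80/metrics'}
config = effective_config(instance)
print(datadog_name('tgi_queue_size', config))
```

In this model, an instance that defines its own `metrics` list replaces the default mapping entirely, which is why an unlisted series resolves to `None` instead of falling back to `METRIC_MAP`.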
