| 1 | +# Agent Check: Hugging Face TGI |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This check monitors [Hugging Face Text Generation Inference (TGI)][1] through the Datadog Agent. TGI is a toolkit for deploying and serving Large Language Models (LLMs). It is optimized for text generation and supports features such as continuous batching, tensor parallelism, token streaming, and other production-oriented optimizations. |
| 6 | + |
| 7 | +The integration provides comprehensive monitoring of your TGI servers by collecting: |
| 8 | +- Request performance metrics including latency, throughput, and token generation rates |
| 9 | +- Batch processing metrics for inference optimization insights |
| 10 | +- Queue depth and request flow monitoring |
| 11 | +- Model serving health and operational metrics |
| 12 | + |
| 13 | +These metrics help teams optimize LLM inference performance, track resource utilization, troubleshoot bottlenecks, and keep model serving reliable at scale. |
| 14 | + |
| 15 | +## Setup |
| 16 | + |
| 17 | +Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the [Autodiscovery Integration Templates][3] for guidance on applying these instructions. |
| 18 | + |
| 19 | +### Installation |
| 20 | + |
| 21 | +The Hugging Face TGI check is included in the [Datadog Agent][2] package. |
| 22 | +No additional installation is needed on your server. |
| 23 | + |
| 24 | +### Configuration |
| 25 | + |
| 26 | +#### Metrics |
| 27 | + |
| 28 | +1. Ensure that your TGI server exposes Prometheus metrics. TGI automatically exposes them at the `/metrics` endpoint while it is running. For more information about TGI monitoring, see the [official documentation][10]. |
| 29 | + |
| 30 | +2. Edit the `hugging_face_tgi.d/conf.yaml` file, which is located in the `conf.d/` folder at the root of your [Agent's configuration directory][11], to start collecting your Hugging Face TGI performance data. See the [sample hugging_face_tgi.d/conf.yaml][4] for all available configuration options. |
| 31 | + |
| 32 | +   ```yaml |
| 33 | +   instances: |
| 34 | +     - openmetrics_endpoint: http://localhost:80/metrics |
| 35 | +   ``` |
| 36 | + |
| 37 | +3. [Restart the Agent][5]. |
| 38 | + |
| 39 | +### Validation |
| 40 | + |
| 41 | +[Run the Agent's status subcommand][6] and look for `hugging_face_tgi` under the Checks section. |
| 42 | + |
| 43 | +## Data Collected |
| 44 | + |
| 45 | +### Metrics |
| 46 | + |
| 47 | +See [metadata.csv][7] for a list of metrics provided by this integration. |
| 48 | + |
| 49 | +Key metrics include: |
| 50 | + |
| 51 | +- **Request metrics**: Total requests, successful requests, failed requests, and request duration |
| 52 | +- **Queue metrics**: Queue size and queue duration for monitoring throughput bottlenecks |
| 53 | +- **Token metrics**: Generated tokens, input length, and mean time per token for performance analysis |
| 54 | +- **Batch metrics**: Batch size, batch concatenation, and batch processing durations for optimization insights |
| 55 | +- **Inference metrics**: Forward pass duration, decode duration, and filter duration for model performance monitoring |
| 56 | + |
| 57 | +### Events |
| 58 | + |
| 59 | +The Hugging Face TGI integration does not include any events. |
| 60 | + |
| 61 | +### Service Checks |
| 62 | + |
| 63 | +See [service_checks.json][8] for a list of service checks provided by this integration. |
| 64 | + |
| 65 | +## Troubleshooting |
| 66 | + |
| 67 | +In containerized environments, ensure that the Agent has network access to the TGI metrics endpoint specified in the `hugging_face_tgi.d/conf.yaml` file. |
| 68 | + |
| 69 | +Need help? Contact [Datadog support][9]. |
| 70 | + |
| 71 | + |
| 72 | +[1]: https://huggingface.co/docs/text-generation-inference/index |
| 73 | +[2]: https://app.datadoghq.com/account/settings/agent/latest |
| 74 | +[3]: https://docs.datadoghq.com/agent/kubernetes/integrations/ |
| 75 | +[4]: https://github.com/DataDog/integrations-core/blob/master/hugging_face_tgi/datadog_checks/hugging_face_tgi/data/conf.yaml.example |
| 76 | +[5]: https://docs.datadoghq.com/agent/guide/agent-commands/#start-stop-and-restart-the-agent |
| 77 | +[6]: https://docs.datadoghq.com/agent/guide/agent-commands/#agent-status-and-information |
| 78 | +[7]: https://github.com/DataDog/integrations-core/blob/master/hugging_face_tgi/metadata.csv |
| 79 | +[8]: https://github.com/DataDog/integrations-core/blob/master/hugging_face_tgi/assets/service_checks.json |
| 80 | +[9]: https://docs.datadoghq.com/help/ |
| 81 | +[10]: https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/monitoring |
| 82 | +[11]: https://docs.datadoghq.com/agent/configuration/agent-configuration-files/#agent-configuration-directory |
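
The Troubleshooting section in the README above asks readers to confirm that the Agent can reach the TGI metrics endpoint. A minimal Python sketch of that check follows, using only the standard library. The helper names `tgi_metric_names` and `check_endpoint` are hypothetical, and the `tgi_` metric-name prefix is an assumption about how TGI names its Prometheus metrics; adjust it if your server reports different names.

```python
from __future__ import annotations

import urllib.request


def tgi_metric_names(exposition_text: str, prefix: str = "tgi_") -> set[str]:
    """Collect sample names starting with `prefix` from Prometheus text output.

    The default prefix "tgi_" is an assumption, not something the README states.
    """
    names: set[str] = set()
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return names


def check_endpoint(url: str, timeout: float = 5.0) -> set[str]:
    """Fetch `url` and return the TGI sample names it exposes.

    Raises urllib.error.URLError if the endpoint cannot be reached.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
    return tgi_metric_names(body)
```

Running `check_endpoint("http://localhost:80/metrics")` from the same host or container as the Agent exercises the same network path the check uses. An empty result, or a `URLError`, suggests the `openmetrics_endpoint` value in `hugging_face_tgi.d/conf.yaml` needs attention before restarting the Agent.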