This repository was archived by the owner on Mar 21, 2026. It is now read-only.

Commit b4adbf2

KOKOSde, Fahad Alghanim, and alvarobartt authored
docs: add AWS (EC2/SageMaker) deployment + benchmarking guide (#3352)
* docs: add AWS (EC2/SageMaker) deployment + benchmarking guide

- Add an AWS deployment tutorial for EC2 + SageMaker
- Fix SageMaker example indentation and link to the new guide
- Add the new guide to the docs toctree

* docs: address AWS deployment review feedback

* Apply suggestion from code review

---------

Co-authored-by: Fahad Alghanim <fkalghan@email.sc.edu>
Co-authored-by: Alvaro Bartolome <36760800+alvarobartt@users.noreply.github.com>
1 parent db931fc commit b4adbf2

3 files changed

Lines changed: 139 additions & 19 deletions

File tree

docs/source/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -36,6 +36,8 @@
   title: Serving Private & Gated Models
 - local: basic_tutorials/using_cli
   title: Using TGI CLI
+- local: basic_tutorials/deploy_aws
+  title: Deploying on AWS (EC2 and SageMaker)
 - local: basic_tutorials/non_core_models
   title: Non-core Model Serving
 - local: basic_tutorials/safety
Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
# Deploying TGI on AWS (EC2 and SageMaker)

This guide shows how to deploy **Text Generation Inference (TGI)** on AWS and how to benchmark it in a way that is useful for capacity planning.

## Deploy on EC2 (Docker)

For most setups, the simplest path is to run the official container on an EC2 GPU instance.

1. **Launch an EC2 GPU instance** (for example `g5.*` for NVIDIA GPUs).
2. **Install Docker + NVIDIA Container Toolkit** (see [Using TGI with Nvidia GPUs](../installation_nvidia) and NVIDIA’s installation docs).
3. **Run TGI**:

```bash
model=HuggingFaceH4/zephyr-7b-beta
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.3.5 \
    --model-id "$model"
```

4. **Send a Chat Completions API request**:

```bash
curl 127.0.0.1:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"model":"tgi","messages":[{"role":"user","content":"What is Deep Learning?"}]}'
```

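The same request can also be sent from Python. The following is a minimal sketch using only the standard library; it assumes the container from step 3 is reachable on `127.0.0.1:8080`. The helper names `build_chat_request` and `send_chat_request` are illustrative and not part of TGI, and the `max_tokens` field is an optional addition that the curl example does not set.

```python
import json
import urllib.request

# Assumption: TGI is listening on the host port mapped in the docker command (8080).
TGI_URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_chat_request(prompt: str, max_tokens: int = 128) -> dict:
    """Build a Chat Completions payload like the one in the curl example."""
    return {
        "model": "tgi",
        "messages": [{"role": "user", "content": prompt}],
        # Optional cap on generated tokens (not present in the curl example).
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict, url: str = TGI_URL, timeout: float = 60.0) -> dict:
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Against a running container, `send_chat_request(build_chat_request("What is Deep Learning?"))` should return an OpenAI-style response, with the generated text under `choices[0]["message"]["content"]`.
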
## Deploy on SageMaker (real-time endpoint)

```bash
pip install "sagemaker<3.0.0" --upgrade --quiet
```

> [!WARNING]
> [SageMaker Python SDK v3 has been recently released](https://github.com/aws/sagemaker-python-sdk), so unless specified otherwise, all the documentation and tutorials are still using the [SageMaker Python SDK v2](https://github.com/aws/sagemaker-python-sdk/tree/master-v2). We are actively working on updating all the tutorials and examples, but in the meantime make sure to install the SageMaker SDK as `pip install "sagemaker<3.0.0"`.

TGI includes a SageMaker compatibility route (`POST /invocations`) and a SageMaker entrypoint (`sagemaker-entrypoint.sh`) that maps SageMaker environment variables to TGI launcher settings. The `/invocations` route forwards requests to `/v1/chat/completions` underneath.

> **Warning:** For this flow, use the AWS SageMaker SDK `< 3.0`. For example: `pip install "sagemaker<3"`.

If you are using Hugging Face’s SageMaker integration (recommended), you typically only need to set the model environment variables:

- **`HF_MODEL_ID`**: model id on the Hub (required)
- **`HF_MODEL_REVISION`**: optional revision
- **`SM_NUM_GPUS`**: number of GPUs (SageMaker sets this)
- **`HF_MODEL_QUANTIZE`**: optional quantization
- **`HF_MODEL_TRUST_REMOTE_CODE`**: optional trust remote code flag

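As a sketch of how the variables above might be assembled into the `env` dictionary passed to the container, the helper below adds the optional entries only when they are set. The function name `build_tgi_env` is hypothetical, and encoding the trust-remote-code flag as the string `"true"` is an assumption about what the entrypoint accepts.

```python
import json
from typing import Optional


def build_tgi_env(
    model_id: str,
    num_gpus: int = 1,
    revision: Optional[str] = None,
    quantize: Optional[str] = None,
    trust_remote_code: bool = False,
) -> dict:
    """Assemble the container environment from the variables listed above.

    Every value is a string, because environment variables are strings.
    """
    env = {
        "HF_MODEL_ID": model_id,
        # JSON-encoded int, matching the json.dumps(1) convention used in this guide.
        "SM_NUM_GPUS": json.dumps(num_gpus),
    }
    if revision is not None:
        env["HF_MODEL_REVISION"] = revision
    if quantize is not None:
        env["HF_MODEL_QUANTIZE"] = quantize
    if trust_remote_code:
        # Assumption: the flag is passed as the JSON boolean string "true".
        env["HF_MODEL_TRUST_REMOTE_CODE"] = json.dumps(True)
    return env
```

Leaving optional keys out entirely, rather than setting them to empty strings, keeps the container on the launcher defaults for anything not explicitly requested.
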
For a minimal example using the Hugging Face SageMaker SDK and the official TGI image URI:

```python
import json
import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

hub = {
    "HF_MODEL_ID": "HuggingFaceH4/zephyr-7b-beta",
    # SageMaker expects SM_NUM_GPUS to be a JSON-encoded int
    "SM_NUM_GPUS": json.dumps(1),
}

huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="3.3.5"),
    env=hub,
    role=role,
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)

predictor.predict(
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is deep learning?"},
        ]
    }
)
```

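Once an endpoint exists, it can also be called without the SageMaker SDK, for example from a separate client process. The sketch below uses the boto3 `sagemaker-runtime` client and its `invoke_endpoint` call; the helper names `build_invocation` and `invoke_chat` are illustrative, and the endpoint name is whatever the deployment produced (with SDK v2, `predictor.endpoint_name`).

```python
import json


def build_invocation(endpoint_name: str, messages: list) -> dict:
    """Build the keyword arguments for the SageMaker runtime invoke_endpoint call."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        # Same chat-style body that predictor.predict() sends in the example above.
        "Body": json.dumps({"messages": messages}).encode("utf-8"),
    }


def invoke_chat(endpoint_name: str, messages: list, region_name=None) -> dict:
    """Send a chat request to a deployed endpoint and decode the JSON reply."""
    import boto3  # imported lazily so the payload helper works without AWS access

    runtime = boto3.client("sagemaker-runtime", region_name=region_name)
    response = runtime.invoke_endpoint(**build_invocation(endpoint_name, messages))
    return json.loads(response["Body"].read())
```

A call such as `invoke_chat(predictor.endpoint_name, [{"role": "user", "content": "What is deep learning?"}])` requires valid AWS credentials and a running endpoint.
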
## Benchmarking (what to measure, and how)

For meaningful benchmarks, measure both:

- **Client-visible latency and throughput** (end-to-end): p50/p95 latency, time-to-first-token (TTFT), tokens/sec
- **Server-side performance metrics** (to attribute bottlenecks): see [metrics](../reference/metrics)

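To make the client-side numbers concrete, here is a small sketch that turns raw per-request samples into p50/p95 latency and tokens/sec. It uses the nearest-rank percentile definition, which is one of several common conventions; the function names and the result keys are illustrative.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_s, generated_tokens, wall_time_s):
    """Summarize one benchmark run.

    latencies_s: end-to-end latency of each successful request, in seconds
    generated_tokens: number of tokens generated by each successful request
    wall_time_s: total duration of the measurement window, in seconds
    """
    return {
        "p50_s": percentile(latencies_s, 50),
        "p95_s": percentile(latencies_s, 95),
        "tokens_per_sec": sum(generated_tokens) / wall_time_s,
    }
```

Dividing by the wall-clock window, rather than by the sum of latencies, gives aggregate throughput under concurrency, which is usually the figure that matters for capacity planning.
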
### End-to-end HTTP benchmark (recommended for EC2/SageMaker)

Use a load generator from *outside* the instance/endpoint VPC when possible (to include network overhead), and run a warmup phase before measuring.

You can use [inference-benchmarker](https://github.com/huggingface/inference-benchmarker) for end-to-end HTTP benchmarking.

Example approach:

1. Warm up with a small number of requests.
2. Run a fixed-duration load test at a target concurrency.
3. Record p50/p95 latency, error rate, and generated tokens/sec.

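The three steps above can be sketched as a small thread-based load loop. This is an illustration of the approach, not a replacement for a dedicated tool. `send_request` is a hypothetical caller-supplied function that performs one request and returns the number of generated tokens, raising an exception on failure.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(send_request, concurrency=4, duration_s=30.0, warmup_requests=5):
    """Warm up, then keep `concurrency` workers busy for `duration_s` seconds."""
    # Step 1: warmup. Results and failures here are not part of the measurement.
    for _ in range(warmup_requests):
        try:
            send_request()
        except Exception:
            pass

    latencies, tokens, errors = [], [], []
    start = time.perf_counter()
    deadline = start + duration_s

    def worker():
        # Step 2: each worker issues requests back to back until the deadline.
        while time.perf_counter() < deadline:
            t0 = time.perf_counter()
            try:
                generated = send_request()
            except Exception:
                errors.append(1)
                continue
            latencies.append(time.perf_counter() - t0)
            tokens.append(generated)

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(worker) for _ in range(concurrency)]
        for future in futures:
            future.result()  # re-raise unexpected worker errors
    elapsed = time.perf_counter() - start

    # Step 3: raw material for p50/p95, error rate, and tokens/sec.
    total = len(latencies) + len(errors)
    return {
        "requests": total,
        "error_rate": len(errors) / total if total else 0.0,
        "latencies_s": latencies,
        "tokens_per_sec": sum(tokens) / elapsed,
    }
```

Threads are adequate here because the workers spend their time waiting on network I/O. The shared lists rely on `list.append` being atomic in CPython.
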
### Microbenchmark (model server only)

TGI also provides `text-generation-benchmark` (see the [benchmarking tool README](https://github.com/huggingface/text-generation-inference/tree/main/benchmark#readme)). This tool connects directly to the model server over a Unix socket and bypasses the router, so it’s useful for low-level profiling and batch-size sweeps, but it is **not** an end-to-end benchmark for SageMaker/HTTP.

docs/source/reference/api_reference.md

Lines changed: 23 additions & 19 deletions
@@ -141,7 +141,9 @@ TGI can be deployed on various cloud providers for scalable and robust text gene
 
 ## Amazon SageMaker
 
-Amazon Sagemaker natively supports the message API:
+Amazon SageMaker natively supports the Chat Completions API.
+
+For a complete deployment + benchmarking guide (including EC2), see [Deploying on AWS (EC2 and SageMaker)](../basic_tutorials/deploy_aws).
 
 ```python
 import json
@@ -150,36 +152,38 @@ import boto3
 from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
 
 try:
-role = sagemaker.get_execution_role()
+    role = sagemaker.get_execution_role()
 except ValueError:
-iam = boto3.client('iam')
-role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']
+    iam = boto3.client("iam")
+    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]
 
 # Hub Model configuration. https://huggingface.co/models
 hub = {
-'HF_MODEL_ID':'HuggingFaceH4/zephyr-7b-beta',
-'SM_NUM_GPUS': json.dumps(1),
+    "HF_MODEL_ID": "HuggingFaceH4/zephyr-7b-beta",
+    "SM_NUM_GPUS": json.dumps(1),
 }
 
 # create Hugging Face Model Class
 huggingface_model = HuggingFaceModel(
-image_uri=get_huggingface_llm_image_uri("huggingface",version="3.3.5"),
-env=hub,
-role=role,
+    image_uri=get_huggingface_llm_image_uri("huggingface", version="3.3.5"),
+    env=hub,
+    role=role,
 )
 
 # deploy model to SageMaker Inference
 predictor = huggingface_model.deploy(
-initial_instance_count=1,
-instance_type="ml.g5.2xlarge",
-container_startup_health_check_timeout=300,
-)
+    initial_instance_count=1,
+    instance_type="ml.g5.2xlarge",
+    container_startup_health_check_timeout=300,
+)
 
 # send request
-predictor.predict({
-"messages": [
-{"role": "system", "content": "You are a helpful assistant." },
-{"role": "user", "content": "What is deep learning?"}
-]
-})
+predictor.predict(
+    {
+        "messages": [
+            {"role": "system", "content": "You are a helpful assistant."},
+            {"role": "user", "content": "What is deep learning?"},
+        ]
+    }
+)
 ```
