|
| 1 | +# Deploying TGI on AWS (EC2 and SageMaker) |
| 2 | + |
| 3 | +This guide shows how to deploy **Text Generation Inference (TGI)** on AWS and how to benchmark it in a way that is useful for capacity planning. |
| 4 | + |
| 5 | +## Deploy on EC2 (Docker) |
| 6 | + |
| 7 | +For most setups, the simplest path is to run the official container on an EC2 GPU instance. |
| 8 | + |
| 9 | +1. **Launch an EC2 GPU instance** (for example `g5.*` for NVIDIA GPUs). |
| 10 | +2. **Install Docker + NVIDIA Container Toolkit** (see [Using TGI with Nvidia GPUs](../installation_nvidia) and NVIDIA’s installation docs). |
| 11 | +3. **Run TGI**: |
| 12 | + |
| 13 | +```bash |
| 14 | +model=HuggingFaceH4/zephyr-7b-beta |
| 15 | +volume=$PWD/data |
| 16 | + |
| 17 | +docker run --gpus all --shm-size 1g -p 8080:80 -v "$volume:/data" \ |
| 18 | + ghcr.io/huggingface/text-generation-inference:3.3.5 \ |
| 19 | + --model-id "$model" |
| 20 | +``` |
| 21 | + |
| 22 | +4. **Send a Chat Completions API request**: |
| 23 | + |
| 24 | +```bash |
| 25 | +curl 127.0.0.1:8080/v1/chat/completions \ |
| 26 | + -X POST \ |
| 27 | + -H 'Content-Type: application/json' \ |
| 28 | + -d '{"model":"tgi","messages":[{"role":"user","content":"What is Deep Learning?"}]}' |
| 29 | +``` |
| 30 | + |
| 31 | +## Deploy on SageMaker (real-time endpoint) |
| 32 | + |
| 33 | +```bash |
| 34 | +pip install "sagemaker<3.0.0" --upgrade --quiet |
| 35 | +``` |
| 36 | + |
| 37 | +> [!WARNING] |
| 38 | +> [SageMaker Python SDK v3 was recently released](https://github.com/aws/sagemaker-python-sdk), so unless specified otherwise, the documentation and tutorials still use the [SageMaker Python SDK v2](https://github.com/aws/sagemaker-python-sdk/tree/master-v2). The tutorials and examples are being updated; in the meantime, install the SageMaker SDK with `pip install "sagemaker<3.0.0"`. |
| 39 | +TGI includes a SageMaker compatibility route (`POST /invocations`) and a SageMaker entrypoint (`sagemaker-entrypoint.sh`) that maps SageMaker environment variables to TGI launcher settings. The `/invocations` route forwards requests to `/v1/chat/completions` underneath. |
| 40 | + |
| 41 | +> **Warning:** For this flow, use the AWS SageMaker SDK `< 3.0`. For example: `pip install "sagemaker<3"`. |
| 42 | + |
| 43 | +If you are using Hugging Face’s SageMaker integration (recommended), you typically only need to set the model environment variables: |
| 44 | + |
| 45 | +- **`HF_MODEL_ID`**: model id on the Hub (required) |
| 46 | +- **`HF_MODEL_REVISION`**: optional revision |
| 47 | +- **`SM_NUM_GPUS`**: number of GPUs (SageMaker sets this) |
| 48 | +- **`HF_MODEL_QUANTIZE`**: optional quantization |
| 49 | +- **`HF_MODEL_TRUST_REMOTE_CODE`**: optional trust remote code flag |
| 50 | + |
| 51 | +A minimal example using the Hugging Face SageMaker SDK and the official TGI image URI: |
| 52 | + |
| 53 | +```python |
| 54 | +import json |
| 55 | +import boto3 |
| 56 | +import sagemaker |
| 57 | +from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri |
| 58 | + |
| 59 | +try: |
| 60 | + role = sagemaker.get_execution_role() |
| 61 | +except ValueError: |
| 62 | + iam = boto3.client("iam") |
| 63 | + role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"] |
| 64 | + |
| 65 | +hub = { |
| 66 | + "HF_MODEL_ID": "HuggingFaceH4/zephyr-7b-beta", |
| 67 | + # SageMaker expects SM_NUM_GPUS to be a JSON-encoded int |
| 68 | + "SM_NUM_GPUS": json.dumps(1), |
| 69 | +} |
| 70 | + |
| 71 | +huggingface_model = HuggingFaceModel( |
| 72 | + image_uri=get_huggingface_llm_image_uri("huggingface", version="3.3.5"), |
| 73 | + env=hub, |
| 74 | + role=role, |
| 75 | +) |
| 76 | + |
| 77 | +predictor = huggingface_model.deploy( |
| 78 | + initial_instance_count=1, |
| 79 | + instance_type="ml.g5.2xlarge", |
| 80 | + container_startup_health_check_timeout=300, |
| 81 | +) |
| 82 | + |
| 83 | +predictor.predict( |
| 84 | + { |
| 85 | + "messages": [ |
| 86 | + {"role": "system", "content": "You are a helpful assistant."}, |
| 87 | + {"role": "user", "content": "What is deep learning?"}, |
| 88 | + ] |
| 89 | + } |
| 90 | +) |
| 91 | +``` |
| 92 | + |
| 93 | +## Benchmarking (what to measure, and how) |
| 94 | + |
| 95 | +For meaningful benchmarks, measure both: |
| 96 | + |
| 97 | +- **Client-visible latency** (end-to-end): p50/p95, time-to-first-token (TTFT), tokens/sec |
| 98 | +- **Server-side performance metrics** (to attribute bottlenecks): see [metrics](../reference/metrics) |
| 99 | + |
| 100 | +### End-to-end HTTP benchmark (recommended for EC2/SageMaker) |
| 101 | + |
| 102 | +Use a load generator from *outside* the instance/endpoint VPC when possible (to include network overhead), and run a warmup phase before measuring. |
| 103 | + |
| 104 | +You can use [inference-benchmarker](https://github.com/huggingface/inference-benchmarker) for end-to-end HTTP benchmarking. |
| 105 | + |
| 106 | +Example approach: |
| 107 | + |
| 108 | +1. Warm up with a small number of requests. |
| 109 | +2. Run a fixed-duration load test at a target concurrency. |
| 110 | +3. Record p50/p95 latency, error rate, and generated tokens/sec. |
| 111 | + |
| 112 | +### Microbenchmark (model server only) |
| 113 | + |
| 114 | +TGI also provides `text-generation-benchmark` (see the [benchmarking tool README](https://github.com/huggingface/text-generation-inference/tree/main/benchmark#readme)). This tool connects directly to the model server over a Unix socket and bypasses the router, so it’s useful for low-level profiling and batch-size sweeps, but it is **not** an end-to-end benchmark for SageMaker/HTTP. |
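
For the end-to-end approach described earlier (warm up, run a fixed-duration load test, then record p50/p95 latency, error rate, and generated tokens/sec), the recording step could be summarized with a small helper like the one below. This is only a sketch, not part of TGI or inference-benchmarker: `RequestResult`, `percentile`, and `summarize` are hypothetical names, and it assumes your load generator already collects per-request latency, generated token counts, and a success flag for the measured (post-warmup) window.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    """One request from the measured window (hypothetical record shape)."""
    latency_s: float       # end-to-end latency in seconds
    generated_tokens: int  # tokens generated in the response
    ok: bool               # False if the request errored or timed out


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an int in 1..100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(results, duration_s):
    """Aggregate results from one fixed-duration run (warmup excluded)."""
    if not results:
        raise ValueError("no results to summarize")
    ok = [r for r in results if r.ok]
    latencies = [r.latency_s for r in ok]
    return {
        # Latency percentiles are computed over successful requests only.
        "p50_s": percentile(latencies, 50) if latencies else None,
        "p95_s": percentile(latencies, 95) if latencies else None,
        "error_rate": 1 - len(ok) / len(results),
        # Throughput counts tokens from successful requests over wall time.
        "tokens_per_s": sum(r.generated_tokens for r in ok) / duration_s,
    }
```

Computing percentiles over successful requests only keeps fast failures from flattering the latency numbers; the error rate is reported separately so both can be read together.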