This document outlines the steps to serve and benchmark various Large Language Models (LLMs) using the NVIDIA TensorRT-LLM framework on a single A4x GKE Node pool.
This guide walks you through setting up the necessary cloud infrastructure, configuring your environment, and deploying a high-performance LLM for inference.
- 1. Test Environment
- 2. High-Level Architecture
- 3. Environment Setup (One-Time)
- 4. Run the Recipe
- 5. Monitoring and Troubleshooting
- 6. Cleanup
The recipe uses the following setup:
- Orchestration: Google Kubernetes Engine (GKE)
- Deployment Configuration: A Helm chart is used to configure and deploy a Kubernetes Deployment. This deployment encapsulates the inference of the target LLM using the TensorRT-LLM framework.
This recipe has been optimized for and tested with the following configuration:
- GKE Cluster:
  - A regional Standard cluster, version `1.33.4-gke.1036000` or later.
  - A GPU node pool with one `a4x-highgpu-4g` machine.
  - Workload Identity Federation for GKE enabled.
  - Cloud Storage FUSE CSI driver for GKE enabled.
  - DCGM metrics enabled.
  - Kueue and JobSet APIs installed.
  - Kueue configured to support Topology Aware Scheduling.
- A regional Google Cloud Storage (GCS) bucket to store logs generated by the recipe runs.
Important
To prepare the required environment, see the GKE environment setup guide. Provisioning a new GKE cluster is a long-running operation and can take 20-30 minutes.
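Once provisioning completes, you can confirm the cluster is up before continuing. A quick check using a standard gcloud command (substitute your cluster name):

```bash
# Verify the cluster exists and reports RUNNING status
gcloud container clusters list --filter="name=<YOUR_GKE_CLUSTER_NAME>"
```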
Here is a simplified diagram of the flow that we follow in this recipe:
```mermaid
---
config:
  layout: dagre
---
flowchart TD
  subgraph workstation["Client Workstation"]
    T["Cluster Toolkit"]
    B("Kubernetes API")
    A["helm install"]
  end
  subgraph huggingface["Hugging Face Hub"]
    I["Model Weights"]
  end
  subgraph gke["GKE Cluster (A4x)"]
    C["Deployment"]
    D["Pod"]
    E["TensorRT-LLM container"]
    F["Service"]
  end
  subgraph storage["Cloud Storage"]
    J["Bucket"]
  end

  %% Logical/actual flow
  T -- Create Cluster --> gke
  A --> B
  B --> C & F
  C --> D
  D --> E
  F --> C
  E -- Downloads at runtime --> I
  E -- Write logs --> J

  %% Layout control
  gke
```
- helm: A package manager for Kubernetes to define, install, and upgrade applications. It's used here to configure and deploy the Kubernetes Deployment.
- Deployment: Manages the lifecycle of your model server pod, ensuring it stays running.
- Service: Provides a stable network endpoint (a DNS name and IP address) to access your model server.
- Pod: The smallest deployable unit in Kubernetes. The Triton server container with TensorRT-LLM runs inside this pod on a GPU-enabled node.
- Cloud Storage: A Cloud Storage bucket to store benchmark logs and other artifacts.
First, you'll configure your local environment. These steps are required once before you can deploy any models.
```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=$(pwd)
export RECIPE_ROOT=$REPO_ROOT/inference/a4x/single-host-serving/tensorrt-llm
```

Next, set the environment variables that subsequent commands use to target the correct resources. This is the most critical step.

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<REGION_of_your_cluster>
export CLUSTER_NAME=<YOUR_GKE_CLUSTER_NAME>
export KUEUE_NAME=<YOUR_KUEUE_NAME>
export GCS_BUCKET=<your-gcs-bucket-for-logs>
export TRTLLM_VERSION=1.3.0rc5

# Set the project for gcloud commands
gcloud config set project $PROJECT_ID
```

Replace the following values:
| Variable | Description | Example |
|---|---|---|
| `PROJECT_ID` | Your Google Cloud Project ID. | `gcp-project-12345` |
| `CLUSTER_REGION` | The GCP region where your GKE cluster is located. | `us-central1` |
| `CLUSTER_NAME` | The name of your GKE cluster. | `a4x-cluster` |
| `KUEUE_NAME` | The name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4x`. Verify the name in your cluster. | `a4x` |
| `ARTIFACT_REGISTRY` | Full path to your Artifact Registry repository. | `us-central1-docker.pkg.dev/gcp-project-12345/my-repo` |
| `GCS_BUCKET` | Name of your GCS bucket (do not include `gs://`). | `my-benchmark-logs-bucket` |
| `TRTLLM_VERSION` | The tag/version for the Docker image. Other versions can be found at https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release | `1.2.0rc2` |
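Before continuing, a small sanity check (a sketch, not part of the original recipe) that the required variables are all set:

```bash
# Fail fast if any required variable is empty
for v in PROJECT_ID CLUSTER_REGION CLUSTER_NAME KUEUE_NAME GCS_BUCKET TRTLLM_VERSION; do
  [ -n "${!v}" ] || echo "WARNING: $v is not set"
done
```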
Fetch credentials so that kubectl can communicate with your cluster:

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```

To access models through Hugging Face, you'll need a Hugging Face token.
- Create a Hugging Face account if you don't have one.
- For gated models like Llama 4, ensure you have requested and been granted access on Hugging Face before proceeding.
- Generate an Access Token: Go to Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a Name and a Role of at least `Read`.
- Select Generate a token.
- Copy the generated token to your clipboard. You'll use this later.
Create a Kubernetes Secret with your Hugging Face token to enable the pod to download model checkpoints from Hugging Face.
```bash
# Paste your Hugging Face token here
export HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>

kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -
```

Note

After running the recipe with `helm install`, it can take up to 30 minutes for the deployment to become fully available. This is because the GKE node must first pull the Docker image and then download the model weights from Hugging Face.
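Before moving on, you can confirm the secret was created. A quick sanity check with standard kubectl:

```bash
# Should list hf-secret with one data entry (hf_api_token)
kubectl get secret hf-secret
```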
This recipe supports the following models. You can easily swap between them by changing the environment variables in the next step.
TRT-LLM inference benchmarking on these models has been tested and validated on A4X GKE nodes only for specific combinations of tensor parallelism (TP), pipeline parallelism (PP), expert parallelism (EP), number of GPU chips, input and output sequence lengths, and precision.

The example model configuration YAML files included in this repo cover only a particular combination of parallelism hyperparameters and benchmarking configs. Adjust the input and output lengths in gpu-recipes/inference/a4x/single-host-serving/tensorrt-llm/values.yaml to match the model and its configuration; a sketch of such an override follows.
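For illustration only, a hedged sketch of what an input/output-length override in values.yaml might look like. The key names below are hypothetical; the authoritative schema is the values.yaml shipped in this repo.

```yaml
# Hypothetical keys for illustration -- consult the repo's values.yaml for the real schema.
benchmark:
  input_sequence_length: 128    # ISL in tokens
  output_sequence_length: 128   # OSL in tokens
```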
| Model Name | Hugging Face ID | Configuration File | Release Name Suffix |
|---|---|---|---|
| DeepSeek R1 671B | `nvidia/DeepSeek-R1-NVFP4-v2` | `deepseek-r1-nvfp4.yaml` | `deepseek-r1` |
| Llama 3.1 405B (FP8) | `meta-llama/Llama-3.1-405B-Instruct-FP8` | `llama-3.1-405b.yaml` | `llama-3-1-405b` |
| Llama 3.1 405B (NVFP4) | `nvidia/Llama-3.1-405B-Instruct-NVFP4` | `llama-3.1-405b.yaml` | `llama-3-1-405b` |
| Llama 3.1 70B | `meta-llama/Llama-3.1-70B-Instruct` | `llama-3.1-70b.yaml` | `llama-3-1-70b` |
| Llama 3.1 8B | `meta-llama/Llama-3.1-8B-Instruct` | `llama-3.1-8b.yaml` | `llama-3-1-8b` |
| Qwen 2.5 VL 7B (FP8) | `Qwen/Qwen2.5-VL-7B-Instruct` | `qwen2-5-vl-7b-fp8.yaml` | `qwen2-5-vl-7b` |
| Qwen 2.5 VL 7B (NVFP4) | `nvidia/Qwen2.5-VL-7B-Instruct-NVFP4` | `qwen2-5-vl-7b-nvfp4.yaml` | `qwen2-5-vl-7b` |
| Qwen 3 235B A22B (FP8) | `Qwen/Qwen3-235B-A22B-FP8` | `qwen3-235b-a22b-fp8.yaml` | `qwen3-235b-a22b` |
| Qwen 3 235B A22B (NVFP4) | `nvidia/Qwen3-235B-A22B-NVFP4` | `qwen3-235b-a22b-nvfp4.yaml` | `qwen3-235b-a22b` |
| Qwen 3 32B | `Qwen/Qwen3-32B` | `qwen3-32b.yaml` | `qwen3-32b` |
| Qwen 3 4B | `Qwen/Qwen3-4B` | `qwen3-4b.yaml` | `qwen3-4b` |
Tip
DeepSeek R1 671B uses Nvidia's pre-quantized FP4 checkpoint. For more information, see the Hugging Face model card.
Tip
You can use the NVIDIA Model Optimizer to quantize these models to FP8 or NVFP4 for improved performance.
The recipe uses trtllm-bench, a command-line tool from NVIDIA, to benchmark the performance of the TensorRT-LLM engine.
- Configure model-specific variables. Choose a model from the table above and set the variables:

  ```bash
  # Example for DeepSeek R1 NVFP4
  export HF_MODEL_ID="nvidia/DeepSeek-R1-NVFP4-v2"
  export CONFIG_FILE="deepseek-r1-nvfp4.yaml"
  export RELEASE_NAME="$USER-serving-deepseek-r1"
  ```
- Install the helm chart:

  ```bash
  cd $RECIPE_ROOT
  helm install -f values.yaml \
    --set-file workload_launcher=$REPO_ROOT/src/launchers/trtllm-launcher.sh \
    --set-file serving_config=$REPO_ROOT/src/frameworks/a4x/trtllm-configs/${CONFIG_FILE} \
    --set queue=${KUEUE_NAME} \
    --set "volumes.gcsMounts[0].bucketName=${GCS_BUCKET}" \
    --set workload.model.name=${HF_MODEL_ID} \
    --set workload.image=nvcr.io/nvidia/tensorrt-llm/release:${TRTLLM_VERSION} \
    --set workload.framework=trtllm \
    ${RELEASE_NAME} \
    $REPO_ROOT/src/helm-charts/a4x/inference-templates/deployment
  ```
- Check the deployment status:

  ```bash
  kubectl get deployment/${RELEASE_NAME}
  ```

  Wait until the `READY` column shows `1/1`. See the Monitoring and Troubleshooting section to view the deployment logs.
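Alternatively, to block until the rollout finishes instead of polling, kubectl's built-in rollout command works here; the 30-minute timeout below is an assumption based on the startup note above:

```bash
# Waits until the deployment reports READY 1/1, or gives up after the timeout
kubectl rollout status deployment/${RELEASE_NAME} --timeout=30m
```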
After the model is deployed via Helm as described in the sections above, use the following steps to monitor the deployment and interact with the model. Replace `<deployment-name>` and `<service-name>` with the appropriate names from the model-specific deployment instructions (e.g., `$USER-serving-deepseek-r1` and `$USER-serving-deepseek-r1-svc`).
Check the status of your deployment. Replace the name if you deployed a different model.
```bash
# Example for DeepSeek R1 671B
kubectl get deployment/$USER-serving-deepseek-r1
```

Wait until the `READY` column shows `1/1`. If it shows `0/1`, the pod is still starting up.
Note
In the GKE UI on Cloud Console, you might see a status of "Does not have minimum availability" during startup. This is normal and will resolve once the pod is ready.
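If the deployment stays at `0/1` much longer than the expected startup window, standard kubectl inspection commands can reveal what the pod is waiting on (image pull, scheduling, volume mounts):

```bash
# List the pods behind the deployment and their current phase
kubectl get pods

# Show scheduling and image-pull events for a specific pod
kubectl describe pod <pod-name>
```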
To see the logs from the TRTLLM server (useful for debugging), use the `-f` flag to follow the log stream:

```bash
kubectl logs -f deployment/$USER-serving-deepseek-r1
```

You should see logs showing the model being prepared, followed by the throughput benchmark run, similar to this:

```
Running benchmark for nvidia/DeepSeek-R1-NVFP4-v2 with ISL=128, OSL=128, TP=4, EP=4, PP=1
===========================================================
= PYTORCH BACKEND
===========================================================
Model: nvidia/DeepSeek-R1-NVFP4-v2
Model Path: /ssd/nvidia/DeepSeek-R1-NVFP4-v2
TensorRT LLM Version: 1.2
Dtype: bfloat16
KV Cache Dtype: FP8
Quantization: NVFP4
===========================================================
= REQUEST DETAILS
===========================================================
Number of requests: 1000
Number of concurrent requests: 985.9849
Average Input Length (tokens): 128.0000
Average Output Length (tokens): 128.0000
===========================================================
= WORLD + RUNTIME INFORMATION
===========================================================
TP Size: 4
PP Size: 1
EP Size: 4
Max Runtime Batch Size: 2304
Max Runtime Tokens: 4608
Scheduling Policy: GUARANTEED_NO_EVICT
KV Memory Percentage: 85.00%
Issue Rate (req/sec): 8.3913E+13
===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec): X.XX
Total Output Throughput (tokens/sec): X.XX
Total Token Throughput (tokens/sec): X.XX
Total Latency (ms): X.XX
Average request latency (ms): X.XX
Per User Output Throughput [w/ ctx] (tps/user): X.XX
Per GPU Output Throughput (tps/gpu): X.XX
-- Request Latency Breakdown (ms) -----------------------
[Latency] P50 : X.XX
[Latency] P90 : X.XX
[Latency] P95 : X.XX
[Latency] P99 : X.XX
[Latency] MINIMUM: X.XX
[Latency] MAXIMUM: X.XX
[Latency] AVERAGE: X.XX
===========================================================
= DATASET DETAILS
===========================================================
Dataset Path: /ssd/token-norm-dist_DeepSeek-R1-NVFP4-v2_128_128_tp4.json
Number of Sequences: 1000
-- Percentiles statistics ---------------------------------
Input Output Seq. Length
-----------------------------------------------------------
MIN: 128.0000 128.0000 256.0000
MAX: 128.0000 128.0000 256.0000
AVG: 128.0000 128.0000 256.0000
P50: 128.0000 128.0000 256.0000
P90: 128.0000 128.0000 256.0000
P95: 128.0000 128.0000 256.0000
P99: 128.0000 128.0000 256.0000
===========================================================
```

To avoid incurring further charges, clean up the resources you created.
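Before cleaning up, you may want to copy the benchmark logs out of the GCS bucket. A minimal sketch using standard `gcloud storage` commands; the exact object paths depend on how the chart writes logs, so list the bucket first (`<logs-path>` below is a placeholder):

```bash
# See what the recipe wrote to the bucket
gcloud storage ls gs://${GCS_BUCKET}/

# Copy a run's logs locally; <logs-path> is a placeholder for the actual prefix
gcloud storage cp --recursive gs://${GCS_BUCKET}/<logs-path> ./benchmark-logs/
```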
- Uninstall the Helm release.

  First, list your releases to find the deployed models:

  ```bash
  # List deployed models
  helm list --filter $USER-serving-
  ```

  Then, uninstall the desired release:

  ```bash
  # Uninstall the deployed model
  helm uninstall <release_name>
  ```

  Replace `<release_name>` with one of the helm release names listed.
- Delete the Kubernetes Secret:

  ```bash
  kubectl delete secret hf-secret --ignore-not-found=true
  ```
- (Optional) Delete the built Docker image from Artifact Registry if it's no longer needed.
- (Optional) Delete Cloud Build logs.
- (Optional) Clean up files in your GCS bucket if benchmarking was performed.
- (Optional) Delete the provisioned test environment, including the GKE cluster; see the sketch below.
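Hedged sketches for the optional cleanup steps (destructive; double-check names before running). If the cluster was provisioned with Cluster Toolkit, prefer its own teardown workflow over deleting the cluster directly:

```bash
# Remove benchmark artifacts from the GCS bucket (<logs-path> is a placeholder)
gcloud storage rm --recursive gs://${GCS_BUCKET}/<logs-path>

# Delete the GKE cluster, only if it was created solely for this recipe
gcloud container clusters delete $CLUSTER_NAME --region $CLUSTER_REGION
```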