This repository was archived by the owner on Oct 15, 2025. It is now read-only.
quickstart/README.md (85 additions, 38 deletions)
@@ -4,11 +4,25 @@ Getting Started with llm-d on Kubernetes.
4
4
5
5
## Overview
6
6
7
-
This guide will walk you through the steps to install and deploy llm-d on a Kubernetes cluster. llm-d consists of the following components:
7
+
This guide will walk you through the steps to install and deploy llm-d on a Kubernetes cluster.
8
8
9
-
- Gateway API Inference Extension (GIE) - This extension upgrades an ext-proc-capable proxy or gateway - such as Envoy Gateway, kGateway, or the GKE Gateway - to become an inference gateway - supporting inference platform teams self-hosting large language models on Kubernetes. This integration makes it easy to expose and control access to your local OpenAI-compatible chat completion endpoints to other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers in a higher level AI Gateway like LiteLLM, Solo AI Gateway, or Apigee.
9
+
**What is llm-d?**
10
+
11
+
llm-d is an open source project providing distributed inference for GenAI runtimes on any Kubernetes cluster. Its highly performant, scalable architecture helps reduce costs through a spectrum of hardware efficiency improvements. The project prioritizes ease of deployment and use, as well as SRE needs and day-2 operations associated with running large GPU clusters.
12
+
13
+
It includes:
14
+
15
+
- Prefill/decode disaggregation
16
+
- KV Cache distribution, offloading and storage hierarchy
17
+
- AI-aware router with plug points for customizable scorers
18
+
- Operational telemetry for production via Prometheus and Grafana
19
+
- Kubernetes-based; works on OpenShift (OCP), minikube, and other Kubernetes distributions
20
+
- NIXL inference transfer library
21
+
22
+
**llm-d consists of the following components:**
10
23
11
-
The inference gateway:
24
+
- Gateway API Inference Extension (GIE) - This extension upgrades an ext-proc-capable proxy or gateway - such as Envoy Gateway, kGateway, or the GKE Gateway - to become an inference gateway - supporting inference platform teams self-hosting large language models on Kubernetes. This integration makes it easy to expose and control access to your local OpenAI-compatible chat completion endpoints to other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers in a higher level AI Gateway like LiteLLM, Solo AI Gateway, or Apigee.
25
+
The inference gateway:
12
26
- Improves the tail latency and throughput of LLM completion requests against Kubernetes-hosted model servers using an extendable request scheduling algorithm that is kv-cache and request cost aware, avoiding evictions or queueing as load increases
13
27
- Provides Kubernetes-native declarative APIs to route client model names to use-case specific LoRA adapters and control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green base model and model server upgrades
14
28
- Adds end-to-end observability around service objective attainment
@@ -27,6 +41,7 @@ This guide will walk you through the steps to install and deploy llm-d on a Kube
27
41
- Supports independent scaling of prefill and decode instances
28
42
- Supports independent node affinities for prefill and decode instances
29
43
- Supports model loading from OCI images, HuggingFace public and private registries, and PVCs
44
+
30
45
- Metrics Service (Prometheus)
31
46
32
47
### Architecture
@@ -35,6 +50,17 @@ This guide will walk you through the steps to install and deploy llm-d on a Kube
35
50
36
51
## Hardware Profiles
37
52
53
+
Tested on:
54
+
55
+
- Minikube on AWS
56
+
- single g6e.12xlarge
57
+
- Red Hat OpenShift on AWS
58
+
- 6 x m5.4xlarge
59
+
- 2 x g6e.2xlarge
60
+
- OpenShift 4.17.21
61
+
- NVIDIA GPU Operator 24.9.2
62
+
- OpenShift Data Foundation 4.17.6
63
+
38
64
## Client Configuration
39
65
40
66
### Required tools
@@ -48,9 +74,10 @@ Following prerequisite are required for the installer to work.
48
74
- [Kustomize – official install docs](https://kubectl.docs.kubernetes.io/installation/kustomize/)
- [HuggingFace HF_TOKEN](https://huggingface.co/docs/hub/en/security-tokens) with download access for the model you want to use. By default the sample application will use [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct).
90
+
> ⚠️ You may need to visit Hugging Face [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) and
91
+
> accept the usage terms to pull this with your HF token if you have not already done so.
63
92
64
93
Registry Authentication: The installer looks for an auth file in:
> ⚠️ You may need to visit Hugging Face [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) and
89
-
> accept the usage terms to pull this with your HF token if you have not already done so.
90
-
91
117
### Target Platforms
92
118
93
119
#### Kubernetes
@@ -105,20 +131,31 @@ podman run --rm --security-opt=label=disable --device=nvidia.com/gpu=all ubuntu
105
131
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
106
132
```
107
133
108
-
```bash
109
-
./llmd-installer.sh
110
-
```
111
-
112
134
#### OpenShift
113
135
114
136
- OpenShift - This quickstart was tested on OpenShift 4.18. Older versions may work but have not been tested.
115
-
- NVIDIA GPU Operator and NFD Operator - The installation instructions can be found [here](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html)
137
+
- NVIDIA GPU Operator and NFD Operator - The installation instructions can be found [here](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html).
138
+
- OpenShift Data Foundation (ODF) - The installation instructions can be found [here](https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.17/html/deploying_and_managing_openshift_data_foundation_using_red_hat_openstack_platform/deploying_openshift_data_foundation_on_red_hat_openstack_platform_in_internal_mode). ODF itself is not required, but a ReadWriteMany storage class is.
139
+
- No Service Mesh or Istio installation, as either will conflict with the gateway
116
140
117
141
## llm-d Installation
118
142
119
-
The llm-d-deployer contains all the helm charts necessary to deploy llm-d. To facilitate the installation of the helm charts, the `llmd-installer.sh` script is provided. This script will populate the necessary manifests in the `manifests` directory. After this, it will apply all the manifests in order to bring up the cluster.
143
+
The llm-d-deployer contains all the helm charts necessary to deploy llm-d. To facilitate the installation of the helm charts, the `llmd-installer.sh` script is provided. This script will populate the necessary manifests in the `manifests` directory.
144
+
After this, it will apply all the manifests in order to bring up the cluster.
145
+
146
+
The main function of the `llmd-installer.sh` script is to simplify installing llm-d with the llm-d-deployer. It automates as many of the steps as possible to streamline the installation process. This includes:
120
147
121
-
Before proceeding with the installation, ensure you have installed the required dependencies
148
+
- Installing the GAIE infrastructure
149
+
- Creating the namespace with any special configurations
150
+
- Creating the pull secret to download the images
151
+
- Creating storage and downloading the model
152
+
- Creating the model service CRDs
153
+
- Applying the helm charts
154
+
- Deploying the sample app (model service)
155
+
156
+
It also supports uninstalling the llm-d infrastructure and the sample app.
157
+
158
+
Before proceeding with the installation, ensure you have completed the prerequisites and are able to issue kubectl commands to your cluster by configuring your `~/.kube/config` file or by using the `oc login` command.
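
The prerequisite check described above can be sketched as a small preflight script. This is a minimal sketch, not part of `llmd-installer.sh`: the helper names `check_tools` and `check_cluster` are hypothetical, and the list of tools you pass in is your own assumption about what the installer needs.

```bash
# Preflight sketch (hypothetical helpers, not part of llmd-installer.sh).

# Report each tool that is not on PATH; return non-zero if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Succeed only if kubectl can reach the cluster in the current context.
check_cluster() {
  kubectl cluster-info >/dev/null 2>&1
}
```

For example, `check_tools kubectl helm kustomize && check_cluster && echo "preflight OK"` prints the confirmation only when every listed tool is found and the cluster responds. On OpenShift, run `oc login` first so that `kubectl` has a valid context.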
122
159
123
160
### Usage
124
161
@@ -145,7 +182,7 @@ The installer needs to be run from the `llm-d-deployer/quickstart` directory.
145
182
146
183
### Install llm-d on an Existing Kubernetes Cluster
147
184
148
-
The storage class used is `efs-sc`. Modify [model-storage-rwx-pvc.yaml](../helpers/k8s/model-storage-rwx-pvc.yaml)
185
+
The storage class used for AWS EC2 is `efs-sc`. Modify [model-storage-rwx-pvc.yaml](../helpers/k8s/model-storage-rwx-pvc.yaml)
The installer will create a ReadWriteMany PVC and download the model to it. If you are using ODF (OpenShift Data Foundation), you can pass in the `--storage-class ocs-storagecluster-cephfs` flag.
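
Before passing `--storage-class`, it can help to confirm that the class exists on the cluster. The following is a minimal sketch under that assumption; `has_storage_class` is a hypothetical helper, not something the installer provides, and it checks only that the name exists, not that the class supports ReadWriteMany.

```bash
# Succeed if the storage class named in $1 appears on stdin.
# Expects one class name per line, for example from:
#   kubectl get storageclass -o custom-columns=NAME:.metadata.name --no-headers
has_storage_class() {
  grep -qxF -- "$1"
}

# Example use with the OpenShift Data Foundation class named above:
#   kubectl get storageclass -o custom-columns=NAME:.metadata.name --no-headers \
#     | has_storage_class ocs-storagecluster-cephfs \
#     && ./llmd-installer.sh --storage-class ocs-storagecluster-cephfs
```

The match is exact and whole-line (`-x -F`), so a class whose name merely contains the requested string is not accepted.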