This repository was archived by the owner on Oct 15, 2025. It is now read-only.

Commit 6af7dec

Merge pull request #26 from cfchase/qs-readme-update
quickstart README updates
2 parents b615289 + 743bf35 commit 6af7dec

2 files changed

Lines changed: 128 additions & 38 deletions

quickstart/README.md

Lines changed: 85 additions & 38 deletions
@@ -4,11 +4,25 @@ Getting Started with llm-d on Kubernetes.
44

55
## Overview
66

7-
This guide will walk you through the steps to install and deploy llm-d on a Kubernetes cluster. llm-d consists of the following components:
7+
This guide will walk you through the steps to install and deploy llm-d on a Kubernetes cluster.
88

9-
- Gateway API Inference Extension (GIE) - This extension upgrades an ext-proc-capable proxy or gateway - such as Envoy Gateway, kGateway, or the GKE Gateway - to become an inference gateway - supporting inference platform teams self-hosting large language models on Kubernetes. This integration makes it easy to expose and control access to your local OpenAI-compatible chat completion endpoints to other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers in a higher level AI Gateway like LiteLLM, Solo AI Gateway, or Apigee.
9+
**What is llm-d?**
10+
11+
llm-d is an open source project providing distributed inferencing for GenAI runtimes on any Kubernetes cluster. Its highly performant, scalable architecture helps reduce costs through a spectrum of hardware efficiency improvements. The project prioritizes ease of deployment and use, as well as the SRE and day-2 operational needs of running large GPU clusters.
12+
13+
It includes:
14+
15+
- Prefill/decode disaggregation
16+
- KV Cache distribution, offloading and storage hierarchy
17+
- AI-aware router with plug points for customizable scorers
18+
- Operational telemetry for production through Prometheus and Grafana
19+
- Kubernetes-based, works on OCP, minikube, and other k8s distributions
20+
- NIXL inference transfer library
21+
22+
**llm-d consists of the following components:**
1023

11-
The inference gateway:
24+
- Gateway API Inference Extension (GIE) - This extension upgrades an ext-proc-capable proxy or gateway - such as Envoy Gateway, kGateway, or the GKE Gateway - to become an inference gateway - supporting inference platform teams self-hosting large language models on Kubernetes. This integration makes it easy to expose and control access to your local OpenAI-compatible chat completion endpoints to other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers in a higher level AI Gateway like LiteLLM, Solo AI Gateway, or Apigee.
25+
The inference gateway:
1226
- Improves the tail latency and throughput of LLM completion requests against Kubernetes-hosted model servers using an extendable request scheduling algorithm that is kv-cache and request cost aware, avoiding evictions or queueing as load increases
1327
- Provides Kubernetes-native declarative APIs to route client model names to use-case specific LoRA adapters and control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green base model and model server upgrades
1428
- Adds end to end observability around service objective attainment
@@ -27,6 +41,7 @@ This guide will walk you through the steps to install and deploy llm-d on a Kube
2741
- Supports independent scaling of prefill and decode instances
2842
- Supports independent node affinities for prefill and decode instances
2943
- Supports model loading from OCI images, HuggingFace public and private registries, and PVCs
44+
3045
- Metrics Service (Prometheus)
3146

3247
### Architecture
@@ -35,6 +50,17 @@ This guide will walk you through the steps to install and deploy llm-d on a Kube
3550

3651
## Hardware Profiles
3752

53+
Tested on:
54+
55+
- Minikube on AWS
56+
- single g6e.12xlarge
57+
- Red Hat OpenShift on AWS
58+
- 6 x m5.4xlarge
59+
- 2 x g6e.2xlarge
60+
- OpenShift 4.17.21
61+
- NVIDIA GPU Operator 24.9.2
62+
- OpenShift Data Foundation 4.17.6
63+
3864
## Client Configuration
3965

4066
### Required tools
@@ -48,9 +74,10 @@ Following prerequisite are required for the installer to work.
4874
- [Kustomize – official install docs](https://kubectl.docs.kubernetes.io/installation/kustomize/)
4975
- [kubectl – install & setup](https://kubernetes.io/docs/tasks/tools/install-kubectl/)
5076

51-
You can use the installer script that installs all the required dependencies.
77+
You can use the installer script that installs all the required dependencies. Currently only Linux is supported.
5278

5379
```bash
80+
# Currently Linux only
5481
./install-deps.sh
5582
```
5683

@@ -59,7 +86,9 @@ You can use the installer script that installs all the required dependencies.
5986
- [llm-d-deployer GitHub repo – clone here](https://github.com/neuralmagic/llm-d-deployer.git)
6087
- [Quay.io Registry – sign-up & credentials](https://quay.io/)
6188
- [Red Hat Registry – terms & access](https://access.redhat.com/registry/)
62-
- [HuggingFace HF_TOKEN](https://huggingface.co/docs/hub/en/security-tokens)
89+
- [HuggingFace HF_TOKEN](https://huggingface.co/docs/hub/en/security-tokens) with download access for the model you want to use. By default the sample application will use [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct).
90+
> ⚠️ You may need to visit Hugging Face [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) and
91+
> accept the usage terms to pull this with your HF token if you have not already done so.
6392
6493
Registry Authentication: The installer looks for an auth file in:
6594

@@ -85,9 +114,6 @@ podman login quay.io --authfile ~/.config/containers/auth.json
85114
podman login registry.redhat.io --authfile ~/.config/containers/auth.json
86115
```
87116

88-
> ⚠️ You may need to visit Hugging Face [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) and
89-
> accept the usage terms to pull this with your HF token if you have not already done so.
90-
91117
### Target Platforms
92118

93119
#### Kubernetes
@@ -105,20 +131,31 @@ podman run --rm --security-opt=label=disable --device=nvidia.com/gpu=all ubuntu
105131
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
106132
```
107133

108-
```bash
109-
./llmd-installer.sh
110-
```
111-
112134
#### OpenShift
113135

114136
- OpenShift - This quickstart was tested on OpenShift 4.18. Older versions may work but have not been tested.
115-
- NVIDIA GPU Operator and NFD Operator - The installation instructions can be found [here](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html)
137+
- NVIDIA GPU Operator and NFD Operator - The installation instructions can be found [here](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html).
138+
- OpenShift Data Foundation (ODF) - The installation instructions can be found [here](https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.17/html/deploying_and_managing_openshift_data_foundation_using_red_hat_openstack_platform/deploying_openshift_data_foundation_on_red_hat_openstack_platform_in_internal_mode). ODF is not required, but a ReadWriteMany storage class is.
139+
- No Service Mesh or Istio installation, as either will conflict with the gateway
116140

117141
## llm-d Installation
118142

119-
The llm-d-deployer contains all the helm charts necessary to deploy llm-d. To facilitate the installation of the helm charts, the `llmd-installer.sh` script is provided. This script will populate the necessary manifests in the `manifests` directory. After this, it will apply all the manifests in order to bring up the cluster.
143+
The llm-d-deployer contains all the helm charts necessary to deploy llm-d. To facilitate the installation of the helm charts, the `llmd-installer.sh` script is provided. This script will populate the necessary manifests in the `manifests` directory.
144+
After this, it will apply all the manifests in order to bring up the cluster.
145+
146+
The `llmd-installer.sh` script aims to simplify the installation of llm-d, using the llm-d-deployer as its main function. It scripts as many of the steps as possible to streamline the installation. This includes:
120147

121-
Before proceeding with the installation, ensure you have installed the required dependencies
148+
- Installing the GAIE infrastructure
149+
- Creating the namespace with any special configurations
150+
- Creating the pull secret to download the images
151+
- Creating storage and downloading the model
152+
- Creating the model service CRDs
153+
- Applying the helm charts
154+
- Deploying the sample app (model service)
155+
156+
It also supports uninstalling the llm-d infrastructure and the sample app.
157+
158+
Before proceeding with the installation, ensure you have completed the prerequisites and are able to issue kubectl commands to your cluster by configuring your `~/.kube/config` file or by using the `oc login` command.
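
As a minimal sketch of that prerequisite check, a small shell function could confirm that the client tools are on your `PATH` before you run the installer. `require_tools` is a hypothetical helper and is not part of llm-d-deployer; the tool names you pass to it are up to you.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with llm-d-deployer): report every
# required CLI tool that is missing from PATH.
# Returns 0 when all tools are found, 1 otherwise.
require_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing required tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `require_tools kubectl kustomize || exit 1` at the top of a wrapper script would stop early if either tool is missing. Running `kubectl cluster-info` afterwards is one way to confirm that the cluster itself is reachable.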
122159

123160
### Usage
124161

@@ -145,7 +182,7 @@ The installer needs to be run from the `llm-d-deployer/quickstart` directory.
145182

146183
### Install llm-d on an Existing Kubernetes Cluster
147184

148-
The storage class used is `efs-sc`. Modify [model-storage-rwx-pvc.yaml](../helpers/k8s/model-storage-rwx-pvc.yaml)
185+
The storage class used for AWS EC2 is `efs-sc`. Modify [model-storage-rwx-pvc.yaml](../helpers/k8s/model-storage-rwx-pvc.yaml)
149186
for a different type.
150187

151188
```bash
@@ -155,48 +192,58 @@ export HF_TOKEN="your-token"
155192

156193
### Install on OpenShift with ODF installed
157194

195+
Before running the installer, ensure you have logged into the cluster. For example:
196+
158197
```bash
159-
export HF_TOKEN="your-token"
160-
./llmd-installer.sh --storage-class ocs-storagecluster-cephfs --storage-size 15Gi
198+
oc login --token=sha256~yourtoken --server=https://api.yourcluster.com:6443
161199
```
162200

163-
## Model Service
201+
The installer will create a ReadWriteMany PVC and download the model to it. If you are using ODF, you can pass in the `--storage-class ocs-storagecluster-cephfs` flag.
164202

165-
### Customizing the ModelService
166-
167-
The ModelService looks like:
168-
169-
```yaml
170-
kind: ModelService
171-
metadata:
172-
spec:
203+
```bash
204+
export HF_TOKEN="your-token"
205+
./llmd-installer.sh --storage-class ocs-storagecluster-cephfs --storage-size 15Gi
173206
```
174207

175-
### Creating a New Model Service
208+
### Validation
176209

177-
To create a new model service, you can edit the ModelService custom resource for your needs. Examples have been included.
210+
#### A simple request
211+
212+
For GPU-enabled clusters, you can quickly verify the setup. Once both the prefill and
213+
decode pods are running and ready, simply send a curl request to the gateway to confirm that chat
214+
completions are working. You can execute the `test-request.sh` script to test the chat completions, or run the following on your own.
215+
If everything is working as expected, you should receive a response. You should also see activity in the epp pod.
178216

179217
```bash
180-
kubectl apply -f modelservice.yaml
218+
NAMESPACE=llm-d
219+
MODEL_ID=Llama-32-3B-Instruct
220+
GATEWAY_ADDRESS=$(kubectl get gateway -n ${NAMESPACE} | tail -n 1 | awk '{print $3}')
221+
kubectl run --rm -i curl-temp --image=curlimages/curl --restart=Never -- \
222+
curl -X POST \
223+
"http://${GATEWAY_ADDRESS}/v1/chat/completions" \
224+
-H 'accept: application/json' \
225+
-H 'Content-Type: application/json' \
226+
-d '{
227+
"model": "'${MODEL_ID}'",
228+
"messages": [{"content": "Who are you?", "role": "user"}],
229+
"stream": false
230+
}'
181231
```
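
The `tail -n 1 | awk '{print $3}'` lookup depends on the column order of the `kubectl get gateway` table. A slightly more defensive sketch, assuming the table has an `ADDRESS` header and the address is already populated, locates the column by name instead. `gateway_address_from_table` is a hypothetical helper, not something shipped with the quickstart.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get gateway` table output on stdin and
# print the ADDRESS column of the first data row. The column is located by
# its header name rather than by a fixed position.
gateway_address_from_table() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "ADDRESS") col = i; next }
    col && NR == 2 { print $col; exit }
  '
}
```

`GATEWAY_ADDRESS=$(kubectl get gateway -n "${NAMESPACE}" | gateway_address_from_table)` would be one way to use it. If the gateway has no address yet, the whitespace-separated columns shift and the value will be wrong, so wait until the gateway is programmed before relying on the result.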
182232

183-
### Validation
184-
185-
For GPU-enabled clusters, you can quickly verify the setup. Once both the prefill and
186-
decode pods are running and ready, simply send a curl request to the decode endpoint to confirm that chat
187-
completions are working.
233+
For additional troubleshooting, you can check whether the prefill and decode pods are responding to requests.
188234

189235
```bash
190236
NAMESPACE=llm-d
191-
POD_IP=$(kubectl get pods -n ${NAMESPACE} -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.podIP}{"\n"}{end}' | grep llama-32-3b-instruct-model-service-decode | awk '{print $2}')
237+
MODEL_ID=Llama-32-3B-Instruct
238+
POD_IP=$(kubectl get pods -n ${NAMESPACE} -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.podIP}{"\n"}{end}' | grep decode | awk '{print $2}')
192239
kubectl run --rm -i curl-temp --image=curlimages/curl --restart=Never -- \
193240
curl -X POST \
194241
"http://${POD_IP}:8000/v1/chat/completions" \
195242
-H 'accept: application/json' \
196243
-H 'Content-Type: application/json' \
197244
-d '{
198-
"model": "Llama-3.2-3B-Instruct",
199-
"messages": [{"content": "Who won the World Series in 1986?", "role": "user"}],
245+
"model": "'${MODEL_ID}'",
246+
"messages": [{"content": "Who are you?", "role": "user"}],
200247
"stream": false
201248
}'
202249
```
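
If more than one decode pod is running, `grep decode | awk '{print $2}'` returns several IPs and the resulting URL is no longer valid. A minimal sketch that keeps only the first match, assuming the same `name ip` lines produced by the jsonpath query above; `first_pod_ip` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "pod-name pod-ip" lines on stdin and print the IP
# of the first pod whose name contains the given substring. Pods that have
# not been assigned an IP yet are skipped.
first_pod_ip() {
  awk -v pat="$1" 'index($1, pat) && $2 != "" { print $2; exit }'
}
```

In the command above, the `grep decode | awk '{print $2}'` stage could then be replaced with `first_pod_ip decode`.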

quickstart/test-request.sh

Lines changed: 43 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,43 @@
1+
NAMESPACE=${1:-llm-d}
2+
MODEL_ID=${2:-Llama-32-3B-Instruct}
3+
4+
POD_IP=$(kubectl get pods -n ${NAMESPACE} -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.podIP}{"\n"}{end}' | grep decode | awk '{print $2}')
5+
echo "Testing request to ${NAMESPACE} at Pod IP ${POD_IP}"
6+
7+
kubectl run --rm -i curl-temp --image=curlimages/curl --restart=Never -- \
8+
curl -X GET \
9+
"http://${POD_IP}:8000/v1/models" \
10+
-H 'accept: application/json' \
11+
-H 'Content-Type: application/json'
12+
13+
14+
kubectl run --rm -i curl-temp --image=curlimages/curl --restart=Never -- \
15+
curl -X POST \
16+
"http://${POD_IP}:8000/v1/chat/completions" \
17+
-H 'accept: application/json' \
18+
-H 'Content-Type: application/json' \
19+
-d '{
20+
"model": "'${MODEL_ID}'",
21+
"messages": [{"content": "Who are you?", "role": "user"}],
22+
"stream": false
23+
}'
24+
25+
GATEWAY_ADDRESS=$(kubectl get gateway -n ${NAMESPACE} | tail -n 1 | awk '{print $3}')
26+
echo "Testing request to ${NAMESPACE} at Gateway IP ${GATEWAY_ADDRESS}"
27+
28+
kubectl run --rm -i curl-temp --image=curlimages/curl --restart=Never -- \
29+
curl -X GET \
30+
"http://${GATEWAY_ADDRESS}/v1/models" \
31+
-H 'accept: application/json' \
32+
-H 'Content-Type: application/json'
33+
34+
kubectl run --rm -i curl-temp --image=curlimages/curl --restart=Never -- \
35+
curl -X POST \
36+
"http://${GATEWAY_ADDRESS}/v1/chat/completions" \
37+
-H 'accept: application/json' \
38+
-H 'Content-Type: application/json' \
39+
-d '{
40+
"model": "'${MODEL_ID}'",
41+
"messages": [{"content": "Who are you?", "role": "user"}],
42+
"stream": false
43+
}'
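
The script repeats the same JSON body for the pod request and the gateway request. One possible refactoring, shown only as a sketch, builds the body in one place. `build_chat_payload` is a hypothetical name, and the function assumes the model ID and prompt contain no double quotes or backslashes, since it performs no JSON escaping.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the chat-completions request body used by the
# test requests. $1 is the model ID, $2 is the user prompt.
# Assumption: neither argument needs JSON escaping.
build_chat_payload() {
  printf '{"model": "%s", "messages": [{"content": "%s", "role": "user"}], "stream": false}\n' "$1" "$2"
}
```

Each `curl` call could then pass `-d "$(build_chat_payload "${MODEL_ID}" "Who are you?")"` instead of an inline body.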
