This repository was archived by the owner on Oct 15, 2025. It is now read-only.
Merged
123 changes: 85 additions & 38 deletions quickstart/README.md
@@ -4,11 +4,25 @@ Getting Started with llm-d on Kubernetes.

## Overview

This guide will walk you through the steps to install and deploy llm-d on a Kubernetes cluster. llm-d consists of the following components:
This guide will walk you through the steps to install and deploy llm-d on a Kubernetes cluster.

- Gateway API Inference Extension (GIE) - This extension upgrades an ext-proc-capable proxy or gateway - such as Envoy Gateway, kGateway, or the GKE Gateway - to become an inference gateway - supporting inference platform teams self-hosting large language models on Kubernetes. This integration makes it easy to expose and control access to your local OpenAI-compatible chat completion endpoints to other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers in a higher level AI Gateway like LiteLLM, Solo AI Gateway, or Apigee.
**What is llm-d?**

llm-d is an open source project providing distributed inferencing for GenAI runtimes on any Kubernetes cluster. Its highly performant, scalable architecture helps reduce costs through a spectrum of hardware efficiency improvements. The project prioritizes ease of deployment and use, as well as the SRE needs and day 2 operations associated with running large GPU clusters.

It includes:

- Prefill/decode disaggregation
- KV Cache distribution, offloading and storage hierarchy
- AI-aware router with plug points for customizable scorers
- Operational telemetry for production via Prometheus and Grafana
- Kubernetes-based; works on OpenShift (OCP), minikube, and other Kubernetes distributions
- NIXL inference transfer library

**llm-d consists of the following components:**

The inference gateway:
- Gateway API Inference Extension (GIE) - This extension upgrades an ext-proc-capable proxy or gateway - such as Envoy Gateway, kGateway, or the GKE Gateway - to become an inference gateway - supporting inference platform teams self-hosting large language models on Kubernetes. This integration makes it easy to expose and control access to your local OpenAI-compatible chat completion endpoints to other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers in a higher level AI Gateway like LiteLLM, Solo AI Gateway, or Apigee.
The inference gateway:
- Improves the tail latency and throughput of LLM completion requests against Kubernetes-hosted model servers using an extendable request scheduling algorithm that is kv-cache and request cost aware, avoiding evictions or queueing as load increases
- Provides Kubernetes-native declarative APIs to route client model names to use-case specific LoRA adapters and control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green base model and model server upgrades
- Adds end to end observability around service objective attainment
@@ -27,6 +41,7 @@ This guide will walk you through the steps to install and deploy llm-d on a Kube
- Supports independent scaling of prefill and decode instances
- Supports independent node affinities for prefill and decode instances
- Supports model loading from OCI images, HuggingFace public and private registries, and PVCs

- Metrics Service (Prometheus)

### Architecture
@@ -35,6 +50,17 @@ This guide will walk you through the steps to install and deploy llm-d on a Kube

## Hardware Profiles

Tested on:

- Minikube on AWS
- single g6e.12xlarge
- Red Hat OpenShift on AWS
- 6 x m5.4xlarge
- 2 x g6e.2xlarge
- OpenShift 4.17.21
- NVIDIA GPU Operator 24.9.2
- OpenShift Data Foundation 4.17.6

## Client Configuration

### Required tools
@@ -48,9 +74,10 @@ Following prerequisite are required for the installer to work.
- [Kustomize – official install docs](https://kubectl.docs.kubernetes.io/installation/kustomize/)
- [kubectl – install & setup](https://kubernetes.io/docs/tasks/tools/install-kubectl/)

You can use the installer script that installs all the required dependencies.
You can use the installer script that installs all the required dependencies. Currently only Linux is supported.

```bash
# Currently Linux only
./install-deps.sh
```

@@ -59,7 +86,9 @@ You can use the installer script that installs all the required dependencies.
- [llm-d-deployer GitHub repo – clone here](https://github.com/neuralmagic/llm-d-deployer.git)
- [Quay.io Registry – sign-up & credentials](https://quay.io/)
- [Red Hat Registry – terms & access](https://access.redhat.com/registry/)
- [HuggingFace HF_TOKEN](https://huggingface.co/docs/hub/en/security-tokens)
- [HuggingFace HF_TOKEN](https://huggingface.co/docs/hub/en/security-tokens) with download access for the model you want to use. By default the sample application will use [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct).
> ⚠️ You may need to visit Hugging Face [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) and
> accept the usage terms to pull this with your HF token if you have not already done so.

Registry Authentication: The installer looks for an auth file in:

@@ -85,9 +114,6 @@ podman login quay.io --authfile ~/.config/containers/auth.json
podman login registry.redhat.io --authfile ~/.config/containers/auth.json
```

> ⚠️ You may need to visit Hugging Face [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) and
> accept the usage terms to pull this with your HF token if you have not already done so.

### Target Platforms

#### Kubernetes
@@ -105,20 +131,31 @@ podman run --rm --security-opt=label=disable --device=nvidia.com/gpu=all ubuntu
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```

```bash
./llmd-installer.sh
```

#### OpenShift

- OpenShift - This quickstart was tested on OpenShift 4.18. Older versions may work but have not been tested.
- NVIDIA GPU Operator and NFD Operator - The installation instructions can be found [here](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html)
- NVIDIA GPU Operator and NFD Operator - The installation instructions can be found [here](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html).
- OpenShift Data Foundation - The installation instructions can be found [here](https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.17/html/deploying_and_managing_openshift_data_foundation_using_red_hat_openstack_platform/deploying_openshift_data_foundation_on_red_hat_openstack_platform_in_internal_mode). ODF is not required, but a ReadWriteMany storage class is required.
- No Service Mesh or Istio installation, as either will conflict with the gateway

## llm-d Installation

The llm-d-deployer contains all the helm charts necessary to deploy llm-d. To facilitate the installation of the helm charts, the `llmd-installer.sh` script is provided. This script will populate the necessary manifests in the `manifests` directory. After this, it will apply all the manifests in order to bring up the cluster.
The llm-d-deployer contains all the helm charts necessary to deploy llm-d. To facilitate the installation of the helm charts, the `llmd-installer.sh` script is provided. This script will populate the necessary manifests in the `manifests` directory.
After this, it will apply all the manifests in order to bring up the cluster.

The `llmd-installer.sh` script aims to simplify the installation of llm-d, using the llm-d-deployer as its main function. It scripts as many of the steps as possible to streamline the installation process. This includes:

Before proceeding with the installation, ensure you have installed the required dependencies
- Installing the GAIE infrastructure
- Creating the namespace with any special configurations
- Creating the pull secret to download the images
- Creating storage and downloading the model
- Creating the model service CRDs
- Applying the helm charts
- Deploying the sample app (model service)

It also supports uninstalling the llm-d infrastructure and the sample app.

Before proceeding with the installation, ensure you have completed the prerequisites and are able to issue kubectl commands to your cluster by configuring your `~/.kube/config` file or by using the `oc login` command.
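
One way to confirm cluster access before running the installer is sketched below. The `check_cluster_access` helper is hypothetical and is not part of `llmd-installer.sh`; it only assumes that `kubectl` is on your `PATH` and that the installer needs permission to create namespaces, as the step list above suggests.

```bash
# Hypothetical preflight helper (an assumption, not shipped with the installer).
# Returns 0 when the current kubeconfig context can reach the API server and
# is allowed to create namespaces; returns 1 otherwise.
check_cluster_access() {
  # `kubectl cluster-info` fails when the API server is unreachable
  # or the credentials in the current context are invalid.
  if ! kubectl cluster-info >/dev/null 2>&1; then
    echo "cannot reach the cluster; check ~/.kube/config or run 'oc login'" >&2
    return 1
  fi
  # `kubectl auth can-i` prints "yes" or "no" on stdout.
  if [ "$(kubectl auth can-i create namespaces 2>/dev/null)" != "yes" ]; then
    echo "current user may not create namespaces" >&2
    return 1
  fi
  echo "cluster access OK"
}

# Example: check_cluster_access && ./llmd-installer.sh
```

The function only reads cluster state, so it should be safe to run repeatedly.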

### Usage

@@ -145,7 +182,7 @@ The installer needs to be run from the `llm-d-deployer/quickstart` directory.

### Install llm-d on an Existing Kubernetes Cluster

The storage class used is `efs-sc`. Modify [model-storage-rwx-pvc.yaml](../helpers/k8s/model-storage-rwx-pvc.yaml)
The storage class used for AWS EC2 is `efs-sc`. Modify [model-storage-rwx-pvc.yaml](../helpers/k8s/model-storage-rwx-pvc.yaml)
for a different type.

```bash
@@ -155,48 +192,58 @@ export HF_TOKEN="your-token"

### Install on OpenShift with ODF installed

Before running the installer, ensure you have logged into the cluster. For example:

```bash
export HF_TOKEN="your-token"
./llmd-installer.sh --storage-class ocs-storagecluster-cephfs --storage-size 15Gi
oc login --token=sha256~yourtoken --server=https://api.yourcluster.com:6443
```

## Model Service
The installer will create a ReadWriteMany PVC and download the model to it. If you are using ODF, you can pass in the `--storage-class ocs-storagecluster-cephfs` flag.

### Customizing the ModelService

The ModelService looks like:

```yaml
kind: ModelService
metadata:
spec:
```bash
export HF_TOKEN="your-token"
./llmd-installer.sh --storage-class ocs-storagecluster-cephfs --storage-size 15Gi
```
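
Because the model download depends on `HF_TOKEN`, a small guard can catch a missing token before the installer starts. This is a minimal sketch: `require_hf_token` is a hypothetical helper, and the installer may already perform a similar check of its own.

```bash
# Hypothetical guard (an assumption, not part of llmd-installer.sh).
# Returns 1 with a message on stderr when HF_TOKEN is unset or empty.
require_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; export it before running the installer" >&2
    return 1
  fi
}

# Example:
#   require_hf_token && \
#     ./llmd-installer.sh --storage-class ocs-storagecluster-cephfs --storage-size 15Gi
```

The guard only checks that the variable is non-empty; it does not verify that the token has download access to the model.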

### Creating a New Model Service
### Validation

To create a new model service, you can edit the ModelService custom resource for your needs. Examples have been included.
#### A simple request

For GPU-enabled clusters, you can quickly verify the setup. Once both the prefill and
decode pods are running and ready, simply send a curl request to the gateway to confirm that chat
completions are working. You can execute the `test-request.sh` script to test the chat completions, or run the following on your own.
If everything is working as expected, you should receive a response. You should also see activity in the epp pod.

```bash
kubectl apply -f modelservice.yaml
NAMESPACE=llm-d
MODEL_ID=Llama-32-3B-Instruct
GATEWAY_ADDRESS=$(kubectl get gateway -n ${NAMESPACE} | tail -n 1 | awk '{print $3}')
kubectl run --rm -i curl-temp --image=curlimages/curl --restart=Never -- \
curl -X POST \
"http://${GATEWAY_ADDRESS}/v1/chat/completions" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "'${MODEL_ID}'",
"messages": [{"content": "Who are you?", "role": "user"}],
"stream": false
}'
```
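
The gateway lookup above parses the table printed by `kubectl get gateway`, which depends on the column layout. As a hedged alternative, the sketch below reads the address from the Gateway status with a JSONPath query. The `gateway_address` helper is hypothetical; it assumes the namespace contains at least one Gateway whose `status.addresses` list is populated, and it picks the first Gateway rather than the last row of the table.

```bash
# Hypothetical helper (an assumption, not part of the quickstart scripts).
# Prints the first address reported in the status of the first Gateway
# in the given namespace (default: llm-d).
gateway_address() {
  local namespace="${1:-llm-d}"
  kubectl get gateway -n "${namespace}" \
    -o jsonpath='{.items[0].status.addresses[0].value}'
}

# Example: GATEWAY_ADDRESS=$(gateway_address "${NAMESPACE}")
```

If the namespace has no Gateway, or the Gateway has not been assigned an address yet, the query fails or prints nothing, so check that the result is non-empty before using it.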

### Validation

For GPU-enabled clusters, you can quickly verify the setup. Once both the prefill and
decode pods are running and ready, simply send a curl request to the decode endpoint to confirm that chat
completions are working.
For additional troubleshooting, you can check whether the prefill and decode pods are responding to requests.

```bash
NAMESPACE=llm-d
POD_IP=$(kubectl get pods -n ${NAMESPACE} -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.podIP}{"\n"}{end}' | grep llama-32-3b-instruct-model-service-decode | awk '{print $2}')
MODEL_ID=Llama-32-3B-Instruct
POD_IP=$(kubectl get pods -n ${NAMESPACE} -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.podIP}{"\n"}{end}' | grep decode | awk '{print $2}')
kubectl run --rm -i curl-temp --image=curlimages/curl --restart=Never -- \
curl -X POST \
"http://${POD_IP}:8000/v1/chat/completions" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "Llama-3.2-3B-Instruct",
"messages": [{"content": "Who won the World Series in 1986?", "role": "user"}],
"model": "'${MODEL_ID}'",
"messages": [{"content": "Who are you?", "role": "user"}],
"stream": false
}'
```
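
When more than one decode pod is running, the `grep decode` pipeline above yields several IPs and the resulting URL is malformed. The sketch below shows one way to select a single pod. `first_pod_ip` is a hypothetical helper that assumes the `name ip` line format produced by the JSONPath expression above.

```bash
# Hypothetical helper (an assumption, not part of test-request.sh).
# Reads "name ip" pairs on stdin and prints the IP of the first pod whose
# name contains the given pattern, skipping pods that have no IP yet.
first_pod_ip() {
  local pattern="$1"
  awk -v pat="${pattern}" 'index($1, pat) && $2 != "" { print $2; exit }'
}

# Example:
#   POD_IP=$(kubectl get pods -n "${NAMESPACE}" \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.podIP}{"\n"}{end}' \
#     | first_pod_ip decode)
```

Passing `prefill` instead of `decode` selects a prefill pod in the same way.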
43 changes: 43 additions & 0 deletions quickstart/test-request.sh
@@ -0,0 +1,43 @@
NAMESPACE=${1:-llm-d}
MODEL_ID=${2:-Llama-32-3B-Instruct}

POD_IP=$(kubectl get pods -n ${NAMESPACE} -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.podIP}{"\n"}{end}' | grep decode | awk '{print $2}')
echo "Testing request to ${NAMESPACE} at Pod IP ${POD_IP}"

kubectl run --rm -i curl-temp --image=curlimages/curl --restart=Never -- \
curl -X GET \
"http://${POD_IP}:8000/v1/models" \
-H 'accept: application/json' \
-H 'Content-Type: application/json'


kubectl run --rm -i curl-temp --image=curlimages/curl --restart=Never -- \
curl -X POST \
"http://${POD_IP}:8000/v1/chat/completions" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "'${MODEL_ID}'",
"messages": [{"content": "Who are you?", "role": "user"}],
"stream": false
}'

GATEWAY_ADDRESS=$(kubectl get gateway -n ${NAMESPACE} | tail -n 1 | awk '{print $3}')
echo "Testing request to ${NAMESPACE} at Gateway IP ${GATEWAY_ADDRESS}"

kubectl run --rm -i curl-temp --image=curlimages/curl --restart=Never -- \
curl -X GET \
"http://${GATEWAY_ADDRESS}/v1/models" \
-H 'accept: application/json' \
-H 'Content-Type: application/json'

kubectl run --rm -i curl-temp --image=curlimages/curl --restart=Never -- \
curl -X POST \
"http://${GATEWAY_ADDRESS}/v1/chat/completions" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "'${MODEL_ID}'",
"messages": [{"content": "Who are you?", "role": "user"}],
"stream": false
}'