diff --git a/quickstart/README.md b/quickstart/README.md
index 5f9967b..55c8a23 100644
--- a/quickstart/README.md
+++ b/quickstart/README.md
@@ -4,11 +4,25 @@ Getting Started with llm-d on Kubernetes.
 
 ## Overview
 
-This guide will walk you through the steps to install and deploy llm-d on a Kubernetes cluster. llm-d consists of the following components:
+This guide will walk you through the steps to install and deploy llm-d on a Kubernetes cluster.
 
-- Gateway API Inference Extension (GIE) - This extension upgrades an ext-proc-capable proxy or gateway - such as Envoy Gateway, kGateway, or the GKE Gateway - to become an inference gateway - supporting inference platform teams self-hosting large language models on Kubernetes. This integration makes it easy to expose and control access to your local OpenAI-compatible chat completion endpoints to other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers in a higher level AI Gateway like LiteLLM, Solo AI Gateway, or Apigee.
+**What is llm-d?**
+
+llm-d is an open source project providing distributed inferencing for GenAI runtimes on any Kubernetes cluster. Its highly performant, scalable architecture helps reduce costs through a spectrum of hardware efficiency improvements. The project prioritizes ease of deployment and use, as well as SRE needs and day-2 operations associated with running large GPU clusters.
+
+It includes:
+
+- Prefill/decode disaggregation
+- KV Cache distribution, offloading and storage hierarchy
+- AI-aware router with plug points for customizable scorers
+- Operational telemetry for production, prometheus/grafana
+- Kubernetes-based, works on OCP, minikube, and other k8s distributions
+- NIXL inference transfer library
+
+**llm-d consists of the following components:**
 
-  The inference gateway:
+- Gateway API Inference Extension (GIE) - This extension upgrades an ext-proc-capable proxy or gateway - such as Envoy Gateway, kGateway, or the GKE Gateway - to become an inference gateway - supporting inference platform teams self-hosting large language models on Kubernetes. This integration makes it easy to expose and control access to your local OpenAI-compatible chat completion endpoints to other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers in a higher level AI Gateway like LiteLLM, Solo AI Gateway, or Apigee.
+  The inference gateway:
   - Improves the tail latency and throughput of LLM completion requests against Kubernetes-hosted model servers using an extendable request scheduling algorithm that is kv-cache and request cost aware, avoiding evictions or queueing as load increases
   - Provides Kubernetes-native declarative APIs to route client model names to use-case specific LoRA adapters and control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green base model and model server upgrades
   - Adds end to end observability around service objective attainment
@@ -27,6 +41,7 @@ This guide will walk you through the steps to install and deploy llm-d on a Kube
   - Supports independent scaling of prefill and decode instances
   - Supports independent node affinities for prefill and decode instances
   - Supports model loading from OCI images, HuggingFace public and private registries, and PVCs
+
 - Metrics Service (Prometheus)
 
 ### Architecture
@@ -35,6 +50,17 @@ This guide will walk you through the steps to install and deploy llm-d on a Kube
 
 ## Hardware Profiles
 
+Tested on:
+
+- Minikube on AWS
+  - single g6e.12xlarge
+- Red Hat OpenShift on AWS
+  - 6 x m5.4xlarge
+  - 2 x g6e.2xlarge
+  - OpenShift 4.17.21
+  - NVIDIA GPU Operator 24.9.2
+  - OpenShift Data Foundation 4.17.6
+
 ## Client Configuration
 
 ### Required tools
@@ -48,9 +74,10 @@ Following prerequisite are required for the installer to work.
 - [Kustomize – official install docs](https://kubectl.docs.kubernetes.io/installation/kustomize/)
 - [kubectl – install & setup](https://kubernetes.io/docs/tasks/tools/install-kubectl/)
 
-You can use the installer script that installs all the required dependencies.
+You can use the installer script that installs all the required dependencies. Currently only Linux is supported.
 
 ```bash
+# Currently Linux only
 ./install-deps.sh
 ```
 
@@ -59,7 +86,9 @@ You can use the installer script that installs all the required dependencies.
 - [llm-d-deployer GitHub repo – clone here](https://github.com/neuralmagic/llm-d-deployer.git)
 - [Quay.io Registry – sign-up & credentials](https://quay.io/)
 - [Red Hat Registry – terms & access](https://access.redhat.com/registry/)
-- [HuggingFace HF_TOKEN](https://huggingface.co/docs/hub/en/security-tokens)
+- [HuggingFace HF_TOKEN](https://huggingface.co/docs/hub/en/security-tokens) with download access for the model you want to use. By default the sample application will use [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct).
+  > ⚠️ You may need to visit Hugging Face [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) and
+  > accept the usage terms to pull this with your HF token if you have not already done so.
 
 Registry Authentication: The installer looks for an auth file in:
 
@@ -85,9 +114,6 @@ podman login quay.io --authfile ~/.config/containers/auth.json
 podman login registry.redhat.io --authfile ~/.config/containers/auth.json
 ```
 
-> ⚠️ You may need to visit Hugging Face [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) and
-> accept the usage terms to pull this with your HF token if you have not already done so.
-
 ### Target Platforms
 
 #### Kubernetes
@@ -105,20 +131,31 @@ podman run --rm --security-opt=label=disable --device=nvidia.com/gpu=all ubuntu
 sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
 ```
 
-```bash
-./llmd-installer.sh
-```
-
 #### OpenShift
 
 - OpenShift - This quickstart was tested on OpenShift 4.18. Older versions may work but have not been tested.
-- NVIDIA GPU Operator and NFD Operator - The installation instructions can be found [here](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html)
+- NVIDIA GPU Operator and NFD Operator - The installation instructions can be found [here](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/steps-overview.html).
+- OpenShift Data Foundation - The installation instructions can be found [here](https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.17/html/deploying_and_managing_openshift_data_foundation_using_red_hat_openstack_platform/deploying_openshift_data_foundation_on_red_hat_openstack_platform_in_internal_mode). ODF is not required, but a ReadWriteMany storage class is required.
+- NO Service Mesh or Istio installation, as it will conflict with the gateway
 
 ## llm-d Installation
 
-The llm-d-deployer contains all the helm charts necessary to deploy llm-d. To facilitate the installation of the helm charts, the `llmd-installer.sh` script is provided. This script will populate the necessary manifests in the `manifests` directory. After this, it will apply all the manifests in order to bring up the cluster.
+The llm-d-deployer contains all the helm charts necessary to deploy llm-d. To facilitate the installation of the helm charts, the `llmd-installer.sh` script is provided. This script will populate the necessary manifests in the `manifests` directory.
+After this, it will apply all the manifests in order to bring up the cluster.
+
+The llmd-installer.sh script aims to simplify the installation of llm-d using the llm-d-deployer as its main function. It scripts as many of the steps as possible to make the installation process more streamlined. This includes:
 
-Before proceeding with the installation, ensure you have installed the required dependencies
+- Installing the GAIE infrastructure
+- Creating the namespace with any special configurations
+- Creating the pull secret to download the images
+- Creating storage and downloading the model
+- Creating the model service CRDs
+- Applying the helm charts
+- Deploying the sample app (model service)
+
+It also supports uninstalling the llm-d infrastructure and the sample app.
+
+Before proceeding with the installation, ensure you have completed the prerequisites and are able to issue kubectl commands to your cluster by configuring your `~/.kube/config` file or by using the `oc login` command.
 
 ### Usage
 
@@ -145,7 +182,7 @@ The installer needs to be run from the `llm-d-deployer/quickstart` directory.
 
 ### Install llm-d on an Existing Kubernetes Cluster
 
-The storage class used is `efs-sc`. Modify [model-storage-rwx-pvc.yaml](../helpers/k8s/model-storage-rwx-pvc.yaml)
+The storage class used for AWS EC2 is `efs-sc`. Modify [model-storage-rwx-pvc.yaml](../helpers/k8s/model-storage-rwx-pvc.yaml)
 for a different type.
 
 ```bash
@@ -155,48 +192,58 @@ export HF_TOKEN="your-token"
 
 ### Install on OpenShift with OF installed
 
+Before running the installer, ensure you have logged into the cluster. For example:
+
 ```bash
-export HF_TOKEN="your-token"
-./llmd-installer.sh --storage-class ocs-storagecluster-cephfs --storage-size 15Gi
+oc login --token=sha256~yourtoken --server=https://api.yourcluster.com:6443
 ```
 
-## Model Service
+The installer will create a ReadWriteMany PVC and download the model to it. If you are using ODF, you can pass in the `--storage-class ocs-storagecluster-cephfs` flag.
 
-### Customizing the ModelService
-
-The ModelService looks like:
-
-```yaml
-kind: ModelService
-metadata:
-spec:
+```bash
+export HF_TOKEN="your-token"
+./llmd-installer.sh --storage-class ocs-storagecluster-cephfs --storage-size 15Gi
 ```
 
-### Creating a New Model Service
+### Validation
 
-To create a new model service, you can edit the ModelService custom resource for your needs. Examples have been included.
+#### A simple request
+
+For GPU-enabled clusters, you can quickly verify the setup. Once both the prefill and
+decode pods are running and ready, simply send a curl request to the gateway to confirm that chat
+completions are working. You can execute the `test-request.sh` script to test the chat completions, or run the following on your own.
+If everything is working as expected, you should receive a response. You should also see activity in the epp pod.
 
 ```bash
-kubectl apply -f modelservice.yaml
+NAMESPACE=llm-d
+MODEL_ID=Llama-32-3B-Instruct
+GATEWAY_ADDRESS=$(kubectl get gateway -n ${NAMESPACE} | tail -n 1 | awk '{print $3}')
+kubectl run --rm -i curl-temp --image=curlimages/curl --restart=Never -- \
+  curl -X POST \
+  "http://${GATEWAY_ADDRESS}/v1/chat/completions" \
+  -H 'accept: application/json' \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "model": "'${MODEL_ID}'",
+    "messages": [{"content": "Who are you?", "role": "user"}],
+    "stream": false
+  }'
 ```
 
-### Validation
-
-For GPU-enabled clusters, you can quickly verify the setup. Once both the prefill and
-decode pods are running and ready, simply send a curl request to the decode endpoint to confirm that chat
-completions are working.
+For additional troubleshooting, you can check whether the prefill and decode pods are responding to requests.
 
 ```bash
 NAMESPACE=llm-d
-POD_IP=$(kubectl get pods -n ${NAMESPACE} -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.podIP}{"\n"}{end}' | grep llama-32-3b-instruct-model-service-decode | awk '{print $2}')
+MODEL_ID=Llama-32-3B-Instruct
+POD_IP=$(kubectl get pods -n ${NAMESPACE} -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.podIP}{"\n"}{end}' | grep decode | awk '{print $2}')
 kubectl run --rm -i curl-temp --image=curlimages/curl --restart=Never -- \
   curl -X POST \
   "http://${POD_IP}:8000/v1/chat/completions" \
   -H 'accept: application/json' \
   -H 'Content-Type: application/json' \
   -d '{
-    "model": "Llama-3.2-3B-Instruct",
-    "messages": [{"content": "Who won the World Series in 1986?", "role": "user"}],
+    "model": "'${MODEL_ID}'",
+    "messages": [{"content": "Who are you?", "role": "user"}],
     "stream": false
   }'
 ```
diff --git a/quickstart/test-request.sh b/quickstart/test-request.sh
new file mode 100755
index 0000000..72573d9
--- /dev/null
+++ b/quickstart/test-request.sh
@@ -0,0 +1,43 @@
+NAMESPACE=${1:-llm-d}
+MODEL_ID=${2:-Llama-32-3B-Instruct}
+
+POD_IP=$(kubectl get pods -n ${NAMESPACE} -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.podIP}{"\n"}{end}' | grep decode | awk '{print $2}')
+echo "Testing request to ${NAMESPACE} at Pod IP ${POD_IP}"
+
+kubectl run --rm -i curl-temp --image=curlimages/curl --restart=Never -- \
+  curl -X GET \
+  "http://${POD_IP}:8000/v1/models" \
+  -H 'accept: application/json' \
+  -H 'Content-Type: application/json'
+
+
+kubectl run --rm -i curl-temp --image=curlimages/curl --restart=Never -- \
+  curl -X POST \
+  "http://${POD_IP}:8000/v1/chat/completions" \
+  -H 'accept: application/json' \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "model": "'${MODEL_ID}'",
+    "messages": [{"content": "Who are you?", "role": "user"}],
+    "stream": false
+  }'
+
+GATEWAY_ADDRESS=$(kubectl get gateway -n ${NAMESPACE} | tail -n 1 | awk '{print $3}')
+echo "Testing request to ${NAMESPACE} at Gateway IP ${GATEWAY_ADDRESS}"
+
+kubectl run --rm -i curl-temp --image=curlimages/curl --restart=Never -- \
+  curl -X GET \
+  "http://${GATEWAY_ADDRESS}/v1/models" \
+  -H 'accept: application/json' \
+  -H 'Content-Type: application/json'
+
+kubectl run --rm -i curl-temp --image=curlimages/curl --restart=Never -- \
+  curl -X POST \
+  "http://${GATEWAY_ADDRESS}/v1/chat/completions" \
+  -H 'accept: application/json' \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "model": "'${MODEL_ID}'",
+    "messages": [{"content": "Who are you?", "role": "user"}],
+    "stream": false
+  }'
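
Review note on `test-request.sh`: the script reads the gateway address from the column layout of `kubectl get gateway` and splices `${MODEL_ID}` into the JSON body without quoting the expansion. The sketch below shows one possibly more defensive way to do both. It is not part of this patch. The helper names `gateway_address` and `build_payload` are hypothetical, and the jsonpath expression assumes that the first Gateway in the namespace is the one the installer created and that it reports at least one entry under `status.addresses`.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of llm-d-deployer.

# Print the first address of the first Gateway in the given namespace.
# Assumes a Gateway exists and that its status.addresses list is populated.
gateway_address() {
  kubectl get gateway -n "$1" \
    -o jsonpath='{.items[0].status.addresses[0].value}'
}

# Build the chat-completions request body for a model ID and a prompt.
# Assumes neither argument contains characters that need JSON escaping.
build_payload() {
  printf '{"model": "%s", "messages": [{"content": "%s", "role": "user"}], "stream": false}' \
    "$1" "$2"
}

# Print the payload for the sample model; this step needs no cluster access.
build_payload "Llama-32-3B-Instruct" "Who are you?"
echo
```

With cluster access, the output of `build_payload` could be passed to `curl -d` and the output of `gateway_address llm-d` used as the host, in place of the inline JSON and the `tail`/`awk` pipeline in the script.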