Skip to content

Commit b6c2989

Browse files
authored
398 build a unified llm gateway with litellm on cce (#401)
1 parent c1c1c4a commit b6c2989

33 files changed

Lines changed: 2256 additions & 0 deletions

docs/blueprints/by-use-case/ai/deploy-the-nvidia-gpu-operator-on-cce.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@
22
id: deploy-the-nvidia-gpu-operator-on-cce
33
title: Deploy the NVIDIA GPU Operator on CCE
44
tags: [nvidia,nvidia-operator,gpu, ai]
5+
sidebar_position: 1
56
---
67

78
import Tabs from '@theme/Tabs';

docs/blueprints/by-use-case/ai/deploy-vllm-production-stack-on-cce.mdx

Lines changed: 1115 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
{
2+
"label": "Build a Unified LLM Gateway with LiteLLM on CCE",
3+
"link": {
4+
"type": "doc",
5+
"id": "litellm"
6+
},
7+
"position": 2
8+
}

docs/blueprints/by-use-case/ai/litellm/build-a-unified-llm-gateway-with-litellm-on-cce.md

Lines changed: 203 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 298 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,298 @@
1+
---
2+
id: deploy-litellm-on-cce
3+
title: Deploy LiteLLM on CCE
4+
tags: [cce, llm, litellm, ai]
5+
sidebar_position: 3
6+
---
7+
8+
# Deploy LiteLLM on CCE
9+
10+
[LiteLLM](https://docs.litellm.ai/docs/) is a lightweight gateway that provides a unified interface for interacting with multiple large language model providers. It exposes an OpenAI-compatible API, allowing applications and tools to integrate once while abstracting the differences between various backends. In this role, LiteLLM sits between clients and the underlying inference layer and becomes the central control point for how models are consumed. It can route requests to different backends, such as local runtimes or external providers, without requiring changes on the client side. This enables flexibility in choosing where inference runs based on cost, performance, or data residency requirements.
11+
12+
Beyond simple routing, LiteLLM also introduces a layer for governance. It allows platform teams to control access to models, apply usage limits, and monitor consumption across different users or teams. This makes it possible to expose a curated set of models as a shared service within an organization, while maintaining visibility and control over cost and usage patterns.
13+
14+
Within CCE, LiteLLM is deployed as the central gateway for all LLM traffic. It enables a platform approach where models,whether hosted locally or accessed externally, can be offered to multiple teams through a single, consistent endpoint. This article focuses on deploying LiteLLM on CCE and preparing it to act as the control and access layer in a modular LLM architecture.
15+
16+
## Defining and Applying Configuration
17+
18+
Before proceeding to any deployment and configuration ensure that the necessary namespace is created, by using the following command:
19+
20+
```bash
21+
kubectl create namespace litellm
22+
```
23+
24+
### Creating the Secret
25+
26+
Before deploying LiteLLM, a Kubernetes `Secret` must be created, **litellm-secrets.yaml** to provide the required runtime configuration and credentials:
27+
28+
```yaml title="litellm-secrets.yaml"
29+
apiVersion: v1
30+
kind: Secret
31+
metadata:
32+
name: litellm-secrets
33+
type: Opaque
34+
stringData:
35+
LITELLM_MASTER_KEY: sk-<RANDOM_KEY>
36+
UI_USERNAME: "admin"
37+
UI_PASSWORD: <UI_PASSWORD>
38+
DATABASE_URL: <RDS_LITELLM_POSTGRESQL_DSN>
39+
HF_TOKEN: <HF_TOKEN>
40+
```
41+
42+
:::note
43+
Each key in this secret serves a specific purpose:
44+
45+
- `LITELLM_MASTER_KEY`: This is the primary authentication key used by LiteLLM to secure access to its API. Clients connecting to the gateway must present this key, making it the central mechanism for controlling who can use the service. **Caution**: You need to autogenerate the `RANDOM_KEY` part and retain the `sk-` prefix.
46+
- `UI_USERNAME` & `UI_PASSWORD`: These credentials are used to access the built-in LiteLLM user interface. They provide basic authentication for managing and interacting with the gateway through a browser. **Caution**: You need to autogenerate the `UI_PASSWORD` value.
47+
- `DATABASE_URL`: This defines the connection string to the RDS PostgreSQL cluster that will be used by LiteLLM. You can find the connection string in T Cloud Public Console.
48+
- `HF_TOKEN`: This token is used to authenticate against Hugging Face when accessing models or endpoints that require authorization. It enables LiteLLM to pull or interact with Hugging Face-hosted resources as part of its routing capabilities. It's been created in a previous step.
49+
50+
This secret centralizes all sensitive configuration required by LiteLLM and ensures that credentials are not hardcoded in deployment manifests.
51+
:::
52+
53+
54+
Ensure that the **litellm-secrets.yaml** file has been created and reviewed based on the previous steps. Once the configuration is in place, apply it to the cluster using the following command:
55+
56+
```bash
57+
kubectl apply -f litellm-secrets.yaml -n litellm
58+
```
59+
60+
### Creating the ConfigMap
61+
62+
LiteLLM allows you to define routing behavior, fallback strategies, logging, rate limiting, and access control in a file called **config.yaml**. The exact options depend on the features you want to enable, but the file is essentially the control plane for how LiteLLM behaves. In Kubernetes we provision this file via a `ConfigMap` and we then mount it to the respective path. For this blueprint, the configuration is intentionally kept minimal to focus on the integration with the inference backends. Additional settings can be introduced later once the basic gateway setup is validated.
63+
64+
```yaml title="litellm-config.yaml"
65+
apiVersion: v1
66+
kind: ConfigMap
67+
metadata:
68+
name: litellm-config
69+
data:
70+
config.yaml: |
71+
model_list:
72+
- model_name: llama3_1__8b
73+
litellm_params:
74+
model: ollama_chat/llama3.1:8b
75+
api_base: http://ollama.ollama.svc.cluster.local:11434
76+
keep_alive: "15m"
77+
78+
- model_name: qwen2_5__7b_coder
79+
litellm_params:
80+
model: ollama_chat/qwen2.5-coder:7b
81+
api_base: http://ollama.ollama.svc.cluster.local:11434
82+
keep_alive: "15m"
83+
84+
- model_name: gemma2__9b
85+
litellm_params:
86+
model: ollama_chat/gemma2:9b
87+
api_base: http://ollama.ollama.svc.cluster.local:11434
88+
keep_alive: "15m"
89+
90+
- model_name: deepseek_r1_distill_qwen_1_5b
91+
litellm_params:
92+
model: huggingface/together/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
93+
api_key: os.environ/HF_TOKEN
94+
95+
general_settings:
96+
master_key: os.environ/LITELLM_MASTER_KEY
97+
```
98+
99+
:::note
100+
1️⃣ The key section is `model_list`. Each entry represents a model that LiteLLM will expose to clients:
101+
102+
- `model_name`: This is the name that clients will use when sending requests to LiteLLM. It is an internal alias and **does not need to match the backend model name**
103+
- `litellm_params.model`: This defines the actual model and provider. In this case, `ollama_chat/...` tells LiteLLM to route the request to an Ollama backend using its chat interface
104+
- `api_base`: This is the endpoint of the Ollama service in the CCE cluster that exposes the Ollama API
105+
- `keep_alive`: This controls how long the model remains loaded in memory on the backend. Keeping models warm reduces latency for subsequent requests
106+
107+
2️⃣ All those entries are routing the requests to local inference backends. The last one though, is not served locally, but accessed through an external provider (with additional costs):
108+
109+
```yaml
110+
- model_name: deepseek_r1_distill_qwen_1_5b
111+
litellm_params:
112+
model: huggingface/together/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
113+
api_key: os.environ/HF_TOKEN
114+
```
115+
116+
- `model_name`: This is the alias exposed by LiteLLM. Clients will use this name when sending requests to the gateway.
117+
- `litellm_params.model`: This specifies the provider and model. In this case, the request is routed through Hugging Face (via Together AI) to the `DeepSeek-R1-Distill-Qwen-1.5B` model. Unlike the Ollama examples, this does not point to a local service but to an external inference backend.
118+
- `api_key`: This references the `HF_TOKEN` stored in the Kubernetes `Secret` we created in the previous step. It is used to authenticate requests against the Hugging Face.
119+
120+
3️⃣ If you used vLLM as you inference backend instead, following the blueprint [Deploy vLLM Production Stack on CCE](/docs/blueprints/by-use-case/ai/deploy-vllm-production-stack-on-cce):
121+
122+
- `model_name`: This is the name that clients will use when sending requests to LiteLLM. It is an internal alias and **does not need to match the backend model name**
123+
- `litellm_params.model`: This defines the actual model and provider. In this case, `hosted_vllm/...` tells LiteLLM to route the request to an vLLM backend using its chat interface
124+
- `api_base`: This is the endpoint of the vLLM Router service in the CCE cluster that exposes the vLLM OpenAI API endpoint.
125+
126+
:::
127+
128+
:::tip
129+
You can rely entirely on local inference backends; the Hugging Face example is included for completeness.
130+
:::
131+
132+
Ensure that the **litellm-config.yaml** file has been created and reviewed based on the previous steps. Once the configuration is in place, apply it to the cluster using the following command:
133+
134+
```bash
135+
kubectl apply -f litellm-config.yaml -n litellm
136+
```
137+
138+
## Creating the Deployment
139+
140+
Create the following deployment manifest and save it as **litellm-deployment.yaml**. Replace `LITELLM_PROXY_BASE_URL` with your own external endpoint:
141+
142+
```yaml title="litellm-deployment.yaml"
143+
apiVersion: apps/v1
144+
kind: Deployment
145+
metadata:
146+
name: litellm
147+
spec:
148+
replicas: 2
149+
selector:
150+
matchLabels:
151+
app: litellm
152+
template:
153+
metadata:
154+
labels:
155+
app: litellm
156+
spec:
157+
containers:
158+
- name: litellm
159+
image: ghcr.io/berriai/litellm:v1.83.7.rc.1
160+
imagePullPolicy: IfNotPresent
161+
args:
162+
- "--config"
163+
- "/app/proxy_config.yaml"
164+
ports:
165+
- name: http
166+
containerPort: 4000
167+
env:
168+
- name: LITELLM_MASTER_KEY
169+
valueFrom:
170+
secretKeyRef:
171+
name: litellm-secrets
172+
key: LITELLM_MASTER_KEY
173+
- name: UI_USERNAME
174+
valueFrom:
175+
secretKeyRef:
176+
name: litellm-secrets
177+
key: UI_USERNAME
178+
- name: UI_PASSWORD
179+
valueFrom:
180+
secretKeyRef:
181+
name: litellm-secrets
182+
key: UI_PASSWORD
183+
- name: DATABASE_URL
184+
valueFrom:
185+
secretKeyRef:
186+
name: litellm-secrets
187+
key: DATABASE_URL
188+
- name: PROXY_BASE_URL
189+
value: <LITELLM_PROXY_BASE_URL>
190+
- name: DOCS_URL
191+
value: "/docs"
192+
- name: ROOT_REDIRECT_URL
193+
value: "/ui"
194+
- name: FORCE_HTTPS
195+
value: "true"
196+
- name: STORE_MODEL_IN_DB
197+
value: "true"
198+
volumeMounts:
199+
- name: litellm-config
200+
mountPath: /app/proxy_config.yaml
201+
subPath: config.yaml
202+
readOnly: true
203+
readinessProbe:
204+
httpGet:
205+
path: /health/readiness
206+
port: 4000
207+
initialDelaySeconds: 20
208+
periodSeconds: 10
209+
livenessProbe:
210+
httpGet:
211+
path: /health/liveliness
212+
port: 4000
213+
initialDelaySeconds: 40
214+
periodSeconds: 15
215+
resources:
216+
requests:
217+
cpu: "250m"
218+
memory: "512Mi"
219+
limits:
220+
cpu: "1"
221+
memory: "2Gi"
222+
volumes:
223+
- name: litellm-config
224+
configMap:
225+
name: litellm-config
226+
```
227+
228+
:::warning
229+
To add or manage models through the LiteLLM Admin UI, enable database-backed model storage by setting `STORE_MODEL_IN_DB` to `true`. Without this setting, LiteLLM only loads models from the static configuration and UI-based model creation fails with. This setting requires a configured PostgreSQL database connection for the LiteLLM proxy (in our case we use an RDS PostgreSQL instance).
230+
:::
231+
232+
Ensure that the **litellm-deployment.yaml** file has been created and reviewed based on the previous steps. Once the configuration is in place, apply it to the cluster using the following command:
233+
234+
```bash
235+
kubectl apply -f litellm-deployment.yaml -n litellm
236+
```
237+
238+
## Creating the Service & Ingress
239+
240+
Create the following manifest and save it as **litellm-service-ingress.yaml**. Replace the `host`, `tls.hosts`, `tls.secretName` and `cert-manager.io/cluster-issuer` values with your own:
241+
242+
```yaml title="litellm-service-ingress.yaml"
243+
apiVersion: v1
244+
kind: Service
245+
metadata:
246+
name: litellm
247+
spec:
248+
selector:
249+
app: litellm
250+
ports:
251+
- name: http
252+
port: 4000
253+
targetPort: http
254+
type: ClusterIP
255+
---
256+
apiVersion: networking.k8s.io/v1
257+
kind: Ingress
258+
metadata:
259+
name: litellm
260+
annotations:
261+
cert-manager.io/cluster-issuer: opentelekomcloud-letsencrypt
262+
spec:
263+
ingressClassName: haproxy
264+
rules:
265+
- host: <LITELLM_PROXY_BASE_URL>
266+
http:
267+
paths:
268+
- path: /
269+
pathType: Prefix
270+
backend:
271+
service:
272+
name: litellm
273+
port:
274+
number: 4000
275+
tls:
276+
- hosts:
277+
- <LITELLM_PROXY_BASE_URL>
278+
secretName: litellm-proxy-base-url-tls
279+
```
280+
281+
Ensure that the **litellm-service-ingress.yaml** file has been created and reviewed based on the previous steps. Once the configuration is in place, apply it to the cluster using the following command:
282+
283+
```bash
284+
kubectl apply -f litellm-service-ingress.yaml -n litellm
285+
```
286+
287+
## Validation
288+
289+
Navigate to `LITELLM_PROXY_BASE_URL` address from your browser and login to LiteLLM using the UI credentials we created in **litellm-secrets.yaml**.
290+
291+
On the sidebar click *Models + Endpoints* -> *All Models* and inspect whether the models we configured in **config.yaml** are there:
292+
293+
![image](/img/docs/blueprints/by-use-case/ai/litellm/Screenshot_From_2026-04-29_10-33-05.png)
294+
295+
Change to tab *Health Status* and ensure that all models report back as `healthy`:
296+
297+
![image](/img/docs/blueprints/by-use-case/ai/litellm/Screenshot_From_2026-04-29_10-33-27.png)
298+

0 commit comments

Comments
 (0)