
Commit 7992115

Merge pull request #17 from kerthcet/doc/llmaz
Add post: llmaz-intro
2 parents ddc80d5 + cdb2c99 commit 7992115

10 files changed

Lines changed: 368 additions & 77 deletions


Makefile

Lines changed: 8 additions & 0 deletions
```diff
@@ -0,0 +1,8 @@
+.PHONY: build
+build:
+	hugo --gc
+
+
+.PHONY: launch
+launch: build
+	hugo server
```

assets/images/llmaz/arch.png

139 KB
(binary file added)

config/_default/hugo.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -17,7 +17,7 @@ defaultContentLanguage = "en"
 disableLanguages = ["de", "nl"]
 defaultContentLanguageInSubdir = false
 
-copyRight = "Copyright (c) 2024 InftyAI"
+copyRight = "Copyright (c) 2025 InftyAI"
 
 [build.buildStats]
   enable = true
```

config/_default/menus/menus.en.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -74,7 +74,7 @@
   # weight = 30
 
 [[footer]]
-  name = "Powered by Netlify, Hugo, and Doks. Copyright © 2024 InftyAI."
+  name = "Powered by Hugo and Doks. Copyright © 2025 InftyAI."
   # url = "/privacy/"
   weight = 10
 
```

content/_index.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,6 +1,6 @@
 ---
 title: "Exploring the ∞ possibilities of AI"
-description: "Exploring the ∞ possibilities of AI"
+# description: "Exploring the ∞ possibilities of AI"
 # lead: "Exploring the ∞ possibilities of AI"
 date: 2023-09-07T16:33:54+02:00
 lastmod: 2023-09-07T16:33:54+02:00
```

content/blog/example/index.md

Lines changed: 0 additions & 25 deletions
This file was deleted.

content/blog/llmaz-intro.md

Lines changed: 327 additions & 0 deletions
---
title: "llmaz, a new inference platform for LLMs built for ease of use"
description: ""
summary: "A brief introduction to llmaz and the features published in the first minor release v0.1.0."
date: 2025-01-26T15:00:00+08:00
lastmod: 2025-01-26T15:00:00+08:00
draft: false
weight: 50
categories: ["llmaz"]
tags: ["inference", "release-note"]
contributors: ["kerthcet"]
pinned: false
homepage: false
seo:
  title: "llmaz" # custom title (optional)
  description: "inference platform" # custom description (recommended)
  canonical: "" # custom canonical URL (optional)
  noindex: false # false (default) or true
---

With the GPT series of models taking the world by storm, a new era of AI innovation has begun. Beyond model training, inference is also a challenge due to the large model sizes and high computational cost: not just the cost itself, but also performance and efficiency. Looking back to late 2023, many communities were building inference engines, like vLLM, TGI, LMDeploy, and other less well-known projects. But there was still no platform providing a unified interface to serve LLM workloads in the cloud while working smoothly with these inference engines. That was the initial idea behind llmaz. However, we didn't start the work until mid-2024 due to some unavoidable commitments. Anyway, today we are proud to announce the first minor release of llmaz, v0.1.0.

> 💙 To make sure you won't leave disappointed: v0.1.0 doesn't have a lot of fancy features. We did a lot of groundwork to make sure it's a workable solution, but we promise we will bring more exciting features in the near future.

## Architecture

First of all, let's take a look at the architecture of llmaz: ![llmaz architecture](/images/llmaz/arch.png)

Basically, llmaz works as a platform on top of Kubernetes, providing a unified interface over various kinds of inference engines. It defines four CRDs:

- **OpenModel**: the model specification, which defines the model source, inference configurations, and other metadata. It's a cluster-scoped resource.
- **Playground**: the facade for setting the inference configurations, e.g. the model name, the replicas, and the scaling policies, kept as simple as possible. It's a namespace-scoped resource.
- **Inference Service**: the full configuration for the inference workload when a Playground is not enough. Most of the time you don't need it; a Playground creates a Service automatically. It's a namespace-scoped resource.
- **BackendRuntime**: represents the actual inference engines, their images, resource requirements, and boot configurations. It's a namespace-scoped resource.

With the abstraction of these CRDs, llmaz provides a simple way to deploy and manage inference workloads, offering features like:

- **Ease of Use**: people can quickly deploy an LLM service with minimal configuration.
- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like *vLLM*, *Text-Generation-Inference*, *SGLang*, and *llama.cpp*. Find the full list of supported backends here.
- **Accelerator Fungibility**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
- **SOTA Inference**: llmaz supports the latest cutting-edge research, like Speculative Decoding, on Kubernetes.
- **Various Model Providers**: llmaz supports a wide range of model providers, such as HuggingFace, ModelScope, and object stores. llmaz handles model loading automatically, requiring no effort from users.
- **Multi-Host Support**: llmaz supports both single-host and multi-host scenarios from day 0.
- **Scaling Efficiency**: llmaz supports horizontal scaling with just 2-3 lines of configuration.

With llmaz v0.1.0, all these features are available. Next, I'll show you how to use llmaz.

## Quick Start

### Installation

First, install llmaz with the Helm chart. Note that the Helm chart version differs from the llmaz version: chart version 0.0.6 corresponds to llmaz v0.1.0.

```cmd
helm repo add inftyai https://inftyai.github.io/llmaz
helm repo update
helm install llmaz inftyai/llmaz --namespace llmaz-system --create-namespace --version 0.0.6
```

You can find more installation guides [here](https://github.com/InftyAI/llmaz/blob/main/docs/installation.md), like installing from source code.
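To quickly sanity-check the installation, you can verify that the controller is running and the CRDs are registered. This is a rough sketch against a live cluster; the exact Pod names depend on the chart version:

```shell
# The llmaz controller Pod(s) should reach the Running state.
kubectl get pods -n llmaz-system

# The four CRDs described below should appear in the list.
kubectl get crds | grep llmaz.io
```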

### Deploy a Model

Here's the simplest way to deploy a model with llmaz.

1. First, declare the model with its specifications:

```yaml
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: opt-125m
spec:
  familyName: opt
  source:
    modelHub:
      modelID: facebook/opt-125m
  inferenceConfig:
    flavors:
      - name: default
        requests:
          nvidia.com/gpu: 1
```

2. Then deploy a Playground:

```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: opt-125m
spec:
  replicas: 1
  modelClaim:
    modelName: opt-125m
  # To use elasticConfig, you need to add scaleTriggers to the backendRuntime;
  # if not, comment out the elasticConfig here.
  elasticConfig:
    minReplicas: 1
    maxReplicas: 3
```

That's it! llmaz will launch an *opt-125m* service with replicas ranging from 1 to 3, served by vLLM by default.
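Since vLLM serves an OpenAI-compatible API, you can talk to the model once the Pods are ready. A hedged sketch: the Service name below is an assumption, so check the actual name generated for the Playground with `kubectl get svc` first:

```shell
# Forward the generated Service locally (replace the Service name as needed).
kubectl port-forward svc/opt-125m-lb 8080:8080 &

# Query the OpenAI-compatible completions endpoint exposed by vLLM.
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "opt-125m", "prompt": "Hello, my name is", "max_tokens": 16}'
```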

## Design Philosophy

We believe that the complexity of the system should be hidden from the users. We have two main roles in our system: **the user** and **the platform runner**.

The user, who wants to deploy an LLM, should not need to know many details of Kubernetes (although llmaz itself is deployed on Kubernetes); the only thing they need to provide is the model name, and llmaz takes care of the rest.

That's the reason we have the Playground: it's a facade over the inference workload carrying just the model name and replica configuration, while we shift the complexity to the BackendRuntime instead. If you take a look at the vLLM BackendRuntime, the configuration is really long.

```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: vllm
spec:
  commands:
    - python3
    - -m
    - vllm.entrypoints.openai.api_server
  multiHostCommands:
    leader:
      - sh
      - -c
      - |
        ray start --head --disable-usage-stats --include-dashboard false

        i=0
        while true; do
          active_nodes=`python3 -c 'import ray; ray.init(); print(sum(node["Alive"] for node in ray.nodes()))'`
          if [ $active_nodes -eq $(LWS_GROUP_SIZE) ]; then
            echo "All ray workers are active and the ray cluster is initialized successfully."
            break
          fi
          if [ $i -eq 60 ]; then
            echo "Initialization failed. Exiting..."
            exit 1
          fi
          echo "Wait for $active_nodes/$(LWS_GROUP_SIZE) workers to be active."
          i=$((i+1))
          sleep 5s;
        done

        python3 -m vllm.entrypoints.openai.api_server
    worker:
      - sh
      - -c
      - |
        i=0
        while true; do
          ray start --address=$(LWS_LEADER_ADDRESS):6379 --block

          if [ $? -eq 0 ]; then
            echo "Worker: Ray runtime started with head address $(LWS_LEADER_ADDRESS):6379"
            break
          fi
          if [ $i -eq 60 ]; then
            echo "Initialization failed. Exiting..."
            exit 1
          fi
          echo "Waiting until the ray worker is active..."
          sleep 5s;
        done
  image: vllm/vllm-openai
  version: v0.6.0
  # Do not edit the preset argument name unless you know what you're doing.
  # Free to add more arguments with your requirements.
  args:
    - name: default
      flags:
        - --model
        - "{{ .ModelPath }}"
        - --served-model-name
        - "{{ .ModelName }}"
        - --host
        - "0.0.0.0"
        - --port
        - "8080"
    - name: speculative-decoding
      flags:
        - --model
        - "{{ .ModelPath }}"
        - --served-model-name
        - "{{ .ModelName }}"
        - --speculative_model
        - "{{ .DraftModelPath }}"
        - --host
        - "0.0.0.0"
        - --port
        - "8080"
        - --num_speculative_tokens
        - "5"
        - -tp
        - "1"
    - name: model-parallelism
      flags:
        - --model
        - "{{ .ModelPath }}"
        - --served-model-name
        - "{{ .ModelName }}"
        - --host
        - "0.0.0.0"
        - --port
        - "8080"
        - --tensor-parallel-size
        - "{{ .TP }}"
        - --pipeline-parallel-size
        - "{{ .PP }}"
  resources:
    requests:
      cpu: 4
      memory: 8Gi
    limits:
      cpu: 4
      memory: 8Gi
  startupProbe:
    periodSeconds: 10
    failureThreshold: 30
    httpGet:
      path: /health
      port: 8080
  livenessProbe:
    initialDelaySeconds: 15
    periodSeconds: 10
    failureThreshold: 3
    httpGet:
      path: /health
      port: 8080
  readinessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 3
    httpGet:
      path: /health
      port: 8080
```

Basically, the BackendRuntime configures the boot commands, the resource requirements, and the probes: all the stuff related to the inference engine, which also makes up part of the workload's Pod YAML. We believe this approach works for several reasons:

- Users may not be familiar with inference engines, and the parameters are really verbose and complex; vLLM has 209 parameters in total as of the day we write this blog. A preset configuration template is helpful in this case.
- On the other hand, the platform runner can help optimize the configurations, offering best practices.
- Users can still override the configurations if they want to; llmaz will merge the configurations from the Playground and the BackendRuntime.
- Users can also provide their own BackendRuntime for extensibility and reference its name in the Playground.
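As an illustration of the override path, a Playground can point at a BackendRuntime and tweak parts of the preset. This is a hedged sketch: the `backendRuntimeConfig` fields shown here (`name`, `version`, `resources`) are assumptions based on the v0.1.0-era API, so verify them against the installed CRD before use:

```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: opt-125m
spec:
  replicas: 1
  modelClaim:
    modelName: opt-125m
  backendRuntimeConfig:
    name: vllm      # which BackendRuntime preset to use
    version: v0.6.3 # override the preset image tag
    resources:      # override the preset resource requirements
      requests:
        cpu: 8
        memory: 16Gi
```

The merge semantics mean you only state what differs from the preset, which keeps the Playground short.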

Regarding the OpenModel, we think the model should be a first-class citizen in cloud management; it has lots of properties, like the source address, the inference configurations, and the metadata. We believe it's a good practice to separate the model from the inference workload so the model can be reused across different workloads.

In the long term, we may support model fine-tuning and model training as well, so an OpenModel for serving is a good start.

We would also like to highlight the inference configs of the OpenModel, particularly the inference flavors. In the cloud, we claim an Nvidia GPU with requests like `nvidia.com/gpu: 1`. This is not expressive enough, because GPU chips come in different series, like P4, T4, L40S, A100, H100, and H200, with different memory bandwidth and compute capability. Even the same series may have different variants; the A100 comes in 40GB and 80GB. And we can't tolerate using low-end GPUs like the T4 to serve SOTA models like Llama 3.1 405B or DeepSeek V3, so we need to specify the inference requirements on the model.

Here, I demonstrate how to deploy Llama 3.1 405B with flavors configured. llmaz will first try to schedule the Pods to nodes with the label `gpu.a100-80gb: true`, and if that fails, fall back to nodes with the label `gpu.h100: true` (this requires installing our newly written scheduler plugin, which we'll reveal in a following post).

```yaml
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: llama3-405b-instruct
spec:
  familyName: llama3
  source:
    modelHub:
      modelID: meta-llama/Llama-3.1-405B
  inferenceConfig:
    flavors:
      - name: a100-80gb
        requests:
          nvidia.com/gpu: 8 # single node request
        params:
          TP: "8" # 8 GPUs per node
          PP: "2" # 2 nodes
        nodeSelector:
          gpu.a100-80gb: true
      - name: h100
        requests:
          nvidia.com/gpu: 8 # single node request
        params:
          TP: "8"
          PP: "2"
        nodeSelector:
          gpu.h100: true
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: llama3-405b-instruct
spec:
  replicas: 1
  modelClaim:
    modelName: llama3-405b-instruct
  backendRuntimeConfig:
    resources:
      requests:
        cpu: 4
        memory: 8Gi
      limits:
        cpu: 4
        memory: 16Gi
```

Then llmaz will launch a multi-host inference service spanning 2 nodes, each with 8 A100 80GB or H100 GPUs, with a tensor parallelism of 8 and a pipeline parallelism of 2, running on vLLM.

## Roadmap for v0.2.0

This is our first minor release. As mentioned, we did a lot of groundwork to make llmaz easy to use, but some work remains unfinished, especially model distribution. This is a real pain point; we have some ongoing work, but it wasn't ready for v0.1.0.

So here's the roadmap for v0.2.0:

- **Model Distribution**: advanced model loading like model sharding, model caching, and model pre-fetching.
- **Observability**: we'll provide an out-of-the-box Grafana dashboard for better monitoring.
- **LLM-Focused Capabilities**: we'll provide more LLM-focused improvements, like LoRA awareness, KV-cache-aware load balancing, disaggregated serving, etc.

It would also be great to have features like *scale-to-zero serving* and a *Python SDK* for code integration.

## Finally

We would like to thank all the contributors who helped make this release happen; as a new open-source project, we're really happy and grateful to have you all.

We're looking forward to user feedback as well. If you're interested in llmaz, feel free to give it a try, and if you have any problems or suggestions, don't hesitate to contact us; opening an issue or PR on our [GitHub repository](https://github.com/InftyAI/llmaz) is also welcome.

Last but not least, don't forget to 🌟 our repository if you like it; it's a great encouragement for us.
