InftyAI · InftyAI-Agent · Jun 10, 2025
diff --git a/README.md b/README.md
@@ -40,13 +40,12 @@ Easy, advanced inference platform for large language models on Kubernetes
 
 - **Easy of Use**: People can quick deploy a LLM service with minimal configurations.
 - **Broad Backends Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). Find the full list of supported backends [here](./site/content/en/docs/integrations/support-backends.md).
-- **Heterogeneous Devices Support**: llmaz supports serving the same LLM with heterogeneous devices together with [InftyAI Kube-Scheduler](https://github.com/InftyAI/scheduler-plugins) for the sake of cost and performance.
+- **Heterogeneous Cluster Support**: llmaz supports serving the same LLM with heterogeneous devices together with [InftyAI Scheduler](https://github.com/InftyAI/scheduler-plugins) for the sake of cost and performance.
 - **Various Model Providers**: llmaz supports a wide range of model providers, such as [HuggingFace](https://huggingface.co/), [ModelScope](https://www.modelscope.cn), ObjectStores. llmaz will automatically handle the model loading, requiring no effort from users.
-- **Multi-Host Support**: llmaz supports both single-host and multi-host scenarios with [LWS](https://github.com/kubernetes-sigs/lws) from day 0.
+- **Distributed Inference**: Multi-host and homogeneous xPyD support with [LWS](https://github.com/kubernetes-sigs/lws) from day 0. Heterogeneous xPyD support is planned.
 - **AI Gateway Support**: Offering capabilities like token-based rate limiting, model routing with the integration of [Envoy AI Gateway](https://aigateway.envoyproxy.io/).
+- **Scaling Efficiency**: Horizontal Pod scaling with [HPA](./docs/examples/hpa/README.md) based on LLM-focused metrics, and node (spot instance) autoscaling with [Karpenter](https://github.com/kubernetes-sigs/karpenter).
 - **Build-in ChatUI**: Out-of-the-box chatbot support with the integration of [Open WebUI](https://github.com/open-webui/open-webui), offering capacities like function call, RAG, web search and more, see configurations [here](./site/content/en/docs/integrations/open-webui.md).
-- **Scaling Efficiency**: llmaz supports horizontal scaling with [HPA](./docs/examples/hpa/README.md) by default and will integrate with autoscaling components like [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) for smart scaling across different clouds.
-- **Efficient Model Distribution (WIP)**: Out-of-the-box model cache system support with [Manta](https://github.com/InftyAI/Manta), still under development right now with architecture reframing.
 
 ## Quick Start
 

diff --git a/site/content/en/_index.md b/site/content/en/_index.md
@@ -24,36 +24,35 @@ title: llmaz
 People can quick deploy a LLM service with minimal configurations.
 {{% /blocks/feature %}}
 
-{{% blocks/feature icon="fas fa-cogs" title="Broad Backends Support" %}}
+{{% blocks/feature icon="fas fa-cubes" title="Broad Backends Support" %}}
 llmaz supports a wide range of advanced inference backends for different scenarios, like <a href="https://github.com/vllm-project/vllm">vLLM</a>, <a href="https://github.com/huggingface/text-generation-inference">Text-Generation-Inference</a>, <a href="https://github.com/sgl-project/sglang">SGLang</a>, <a href="https://github.com/ggerganov/llama.cpp">llama.cpp</a>. Find the full list of supported backends <a href="/InftyAI/llmaz/blob/main/docs/support-backends.md">here</a>.
 {{% /blocks/feature %}}
 
-{{% blocks/feature icon="fas fa-exchange-alt" title="Accelerator Fungibility" %}}
+{{% blocks/feature icon="fas fa-random" title="Heterogeneous Cluster Support" %}}
 llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
 {{% /blocks/feature %}}
 
-{{% blocks/feature icon="fas fa-warehouse" title="Various Model Providers" %}}
+{{% blocks/feature icon="fas fa-list-alt" title="Various Model Providers" %}}
 llmaz supports a wide range of model providers, such as <a href="https://huggingface.co/" rel="nofollow">HuggingFace</a>, <a href="https://www.modelscope.cn" rel="nofollow">ModelScope</a>, ObjectStores. llmaz will automatically handle the model loading, requiring no effort from users.
 {{% /blocks/feature %}}
 
-{{% blocks/feature icon="fas fa-network-wired" title="Multi-Host Support" %}}
-llmaz supports both single-host and multi-host scenarios with <a href="https://github.com/kubernetes-sigs/lws">LWS</a> from day 0.
+{{% blocks/feature icon="fas fa-sitemap" title="Distributed Serving" %}}
+Multi-host and homogeneous xPyD distributed serving support with <a href="https://github.com/kubernetes-sigs/lws">LWS</a> from day 0. Heterogeneous xPyD support is planned.
 {{% /blocks/feature %}}
 
 {{% blocks/feature icon="fas fa-door-open" title="AI Gateway Support" %}}
 Offering capabilities like token-based rate limiting, model routing with the integration of <a href="https://aigateway.envoyproxy.io/" rel="nofollow">Envoy AI Gateway</a>.
 {{% /blocks/feature %}}
 
-{{% blocks/feature icon="fas fa-comments" title="Build-in ChatUI" %}}
-Out-of-the-box chatbot support with the integration of <a href="https://github.com/open-webui/open-webui">Open WebUI</a>, offering capacities like function call, RAG, web search and more, see configurations <a href="/InftyAI/llmaz/blob/main/docs/open-webui.md">here</a>.
+{{% blocks/feature icon="fas fa-expand-arrows-alt" title="Scaling Efficiency" %}}
+Horizontal Pod scaling with <a href="/InftyAI/llmaz/blob/main/docs/examples/hpa/README.md">HPA</a> based on LLM-focused metrics, and node (spot instance) autoscaling with <a href="https://github.com/kubernetes-sigs/karpenter">Karpenter</a>.
 {{% /blocks/feature %}}
 
-{{% blocks/feature icon="fas fa-expand-arrows-alt" title="Scaling Efficiency" %}}
-llmaz supports horizontal scaling with <a href="/InftyAI/llmaz/blob/main/docs/examples/hpa/README.md">HPA</a> by default and will integrate with autoscaling components like <a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler">Cluster-Autoscaler</a> or <a href="https://github.com/kubernetes-sigs/karpenter">Karpenter</a> for smart scaling across different clouds.
+{{% blocks/feature icon="fas fa-comments" title="Built-in ChatUI" %}}
+Out-of-the-box chatbot support through integration with <a href="https://github.com/open-webui/open-webui">Open WebUI</a>, offering capabilities like function calling, RAG, web search and more. See the configurations <a href="/InftyAI/llmaz/blob/main/docs/open-webui.md">here</a>.
 {{% /blocks/feature %}}
 
-{{% blocks/feature icon="fas fa-box-open" title="Efficient Model Distribution (WIP)" %}}
-Out-of-the-box model cache system support with <a href="https://github.com/InftyAI/Manta">Manta</a>, still under development right now with architecture reframing.
+{{% blocks/feature icon="fas fa-ellipsis-h" title="More in the future" %}}
 {{% /blocks/feature %}}
 
 {{% /blocks/section %}}
diff --git a/site/content/en/docs/develop.md b/site/content/en/docs/develop.md
@@ -1,6 +1,6 @@
 ---
 title: Develop Guidance
-weight: 3
+weight: 4
 description: >
   This section contains a develop guidance for people who want to learn more about this project.
 ---

diff --git a/.../en/docs/integrations/support-backends.md → ...ontent/en/docs/features/broad-backends.md b/.../en/docs/integrations/support-backends.md → ...ontent/en/docs/features/broad-backends.md
@@ -1,6 +1,6 @@
 ---
-title: Supported Inference Backends
-weight: 5
+title: Broad Inference Backends Support
+weight: 1
 ---
 
 If you want to integrate more backends into llmaz, please refer to this [PR](https://github.com/InftyAI/llmaz/pull/182). It's always welcomed.
@@ -9,6 +9,11 @@ If you want to integrate more backends into llmaz, please refer to this [PR](htt
 
 [llama.cpp](https://github.com/ggerganov/llama.cpp) is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.
 
+## ollama
+
+[ollama](https://github.com/ollama/ollama) runs Llama 3.2, Mistral, Gemma 2, and other large language models. It is based on llama.cpp and aims at local deployment.
+
+
 ## SGLang
 
 [SGLang](https://github.com/sgl-project/sglang) is yet another fast serving framework for large language models and vision language models.
@@ -21,10 +26,6 @@ If you want to integrate more backends into llmaz, please refer to this [PR](htt
 
 [text-generation-inference](https://github.com/huggingface/text-generation-inference) is a Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoint.
 
-## ollama
-
-[ollama](https://github.com/ollama/ollama) is running with Llama 3.2, Mistral, Gemma 2, and other large language models, based on llama.cpp, aims for local deploy.
-
 ## vLLM
 
 [vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs
diff --git a/site/content/en/docs/features/distributed_inference.md b/site/content/en/docs/features/distributed_inference.md
@@ -0,0 +1,6 @@
+---
+title: Distributed Inference
+weight: 3
+---
+
+llmaz supports multi-host and homogeneous xPyD distributed serving with [LWS](https://github.com/kubernetes-sigs/lws) from day 0. Heterogeneous xPyD support is planned.
diff --git a/site/content/en/docs/features/heterogeneous-cluster-support.md b/site/content/en/docs/features/heterogeneous-cluster-support.md
@@ -1,6 +1,6 @@
 ---
 title: Heterogeneous Cluster Support
-weight: 1
+weight: 2
 ---
 
 A `llama2-7B` model can be running on __1xA100__ GPU, also on __1xA10__ GPU, even on __1x4090__ and a variety of other types of GPUs as well, that's what we called resource fungibility. In practical scenarios, we may have a heterogeneous cluster with different GPU types, and high-end GPUs will stock out a lot, to meet the SLOs of the service as well as the cost, we need to schedule the workloads on different GPU types. With the [ResourceFungibility](https://github.com/InftyAI/scheduler-plugins/blob/main/pkg/plugins/resource_fungibility) in the InftyAI scheduler, we can simply achieve this with at most 8 alternative GPU types.
@@ -20,4 +20,4 @@ globalConfig:
     scheduler-name: inftyai-scheduler
 ```
 
-then run `make helm-upgrade` to install or upgrade llmaz.
+Run `make helm-upgrade` to install or upgrade llmaz.
diff --git a/site/content/en/docs/reference/_index.md b/site/content/en/docs/reference/_index.md
@@ -1,6 +1,6 @@
 ---
 title: Reference
-weight: 4
+weight: 5
 description: >
   This section contains the llmaz reference information.
 menu: