Merged
7 changes: 3 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
@@ -40,13 +40,12 @@ Easy, advanced inference platform for large language models on Kubernetes

- **Ease of Use**: People can quickly deploy an LLM service with minimal configuration.
- **Broad Backends Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). Find the full list of supported backends [here](./site/content/en/docs/integrations/support-backends.md).
- **Heterogeneous Devices Support**: llmaz supports serving the same LLM with heterogeneous devices together with [InftyAI Kube-Scheduler](https://github.com/InftyAI/scheduler-plugins) for the sake of cost and performance.
- **Heterogeneous Cluster Support**: llmaz supports serving the same LLM with heterogeneous devices together with [InftyAI Scheduler](https://github.com/InftyAI/scheduler-plugins) for the sake of cost and performance.
- **Various Model Providers**: llmaz supports a wide range of model providers, such as [HuggingFace](https://huggingface.co/), [ModelScope](https://www.modelscope.cn), ObjectStores. llmaz will automatically handle the model loading, requiring no effort from users.
- **Multi-Host Support**: llmaz supports both single-host and multi-host scenarios with [LWS](https://github.com/kubernetes-sigs/lws) from day 0.
- **Distributed Inference**: Multi-host & homogeneous xPyD support with [LWS](https://github.com/kubernetes-sigs/lws) from day 0. Will implement the heterogeneous xPyD in the future.
- **AI Gateway Support**: Offering capabilities like token-based rate limiting, model routing with the integration of [Envoy AI Gateway](https://aigateway.envoyproxy.io/).
- **Scaling Efficiency**: Horizontal Pod scaling with [HPA](./docs/examples/hpa/README.md) based on LLM-based metrics, and node (spot instance) autoscaling with [Karpenter](https://github.com/kubernetes-sigs/karpenter).
- **Built-in ChatUI**: Out-of-the-box chatbot support through integration with [Open WebUI](https://github.com/open-webui/open-webui), offering capabilities like function calling, RAG, web search and more. See the configuration [here](./site/content/en/docs/integrations/open-webui.md).
- **Scaling Efficiency**: llmaz supports horizontal scaling with [HPA](./docs/examples/hpa/README.md) by default and will integrate with autoscaling components like [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) for smart scaling across different clouds.
- **Efficient Model Distribution (WIP)**: Out-of-the-box model cache system support with [Manta](https://github.com/InftyAI/Manta), still under development right now with architecture reframing.

## Quick Start

21 changes: 10 additions & 11 deletions site/content/en/_index.md
@@ -24,36 +24,35 @@ title: llmaz
People can quickly deploy an LLM service with minimal configuration.
{{% /blocks/feature %}}

{{% blocks/feature icon="fas fa-cogs" title="Broad Backends Support" %}}
{{% blocks/feature icon="fas fa-cubes" title="Broad Backends Support" %}}
llmaz supports a wide range of advanced inference backends for different scenarios, like <a href="https://github.com/vllm-project/vllm">vLLM</a>, <a href="https://github.com/huggingface/text-generation-inference">Text-Generation-Inference</a>, <a href="https://github.com/sgl-project/sglang">SGLang</a>, <a href="https://github.com/ggerganov/llama.cpp">llama.cpp</a>. Find the full list of supported backends <a href="/InftyAI/llmaz/blob/main/docs/support-backends.md">here</a>.
{{% /blocks/feature %}}

{{% blocks/feature icon="fas fa-exchange-alt" title="Accelerator Fungibility" %}}
{{% blocks/feature icon="fas fa-random" title="Heterogeneous Cluster Support" %}}
llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
{{% /blocks/feature %}}

{{% blocks/feature icon="fas fa-warehouse" title="Various Model Providers" %}}
{{% blocks/feature icon="fas fa-list-alt" title="Various Model Providers" %}}
llmaz supports a wide range of model providers, such as <a href="https://huggingface.co/" rel="nofollow">HuggingFace</a>, <a href="https://www.modelscope.cn" rel="nofollow">ModelScope</a>, ObjectStores. llmaz will automatically handle the model loading, requiring no effort from users.
{{% /blocks/feature %}}

{{% blocks/feature icon="fas fa-network-wired" title="Multi-Host Support" %}}
llmaz supports both single-host and multi-host scenarios with <a href="https://github.com/kubernetes-sigs/lws">LWS</a> from day 0.
{{% blocks/feature icon="fas fa-sitemap" title="Distributed Serving" %}}
Multi-host & homogeneous xPyD distributed serving support with <a href="https://github.com/kubernetes-sigs/lws">LWS</a> from day 0. Will implement the heterogeneous xPyD in the future.
{{% /blocks/feature %}}

{{% blocks/feature icon="fas fa-door-open" title="AI Gateway Support" %}}
Offering capabilities like token-based rate limiting, model routing with the integration of <a href="https://aigateway.envoyproxy.io/" rel="nofollow">Envoy AI Gateway</a>.
{{% /blocks/feature %}}

{{% blocks/feature icon="fas fa-comments" title="Built-in ChatUI" %}}
Out-of-the-box chatbot support through integration with <a href="https://github.com/open-webui/open-webui">Open WebUI</a>, offering capabilities like function calling, RAG, web search and more. See the configuration <a href="/InftyAI/llmaz/blob/main/docs/open-webui.md">here</a>.
{{% blocks/feature icon="fas fa-expand-arrows-alt" title="Scaling Efficiency" %}}
Horizontal Pod scaling with <a href="/InftyAI/llmaz/blob/main/docs/examples/hpa/README.md">HPA</a> based on LLM-focused metrics, and node (spot instance) autoscaling with <a href="https://github.com/kubernetes-sigs/karpenter">Karpenter</a>.
{{% /blocks/feature %}}

{{% blocks/feature icon="fas fa-expand-arrows-alt" title="Scaling Efficiency" %}}
llmaz supports horizontal scaling with <a href="/InftyAI/llmaz/blob/main/docs/examples/hpa/README.md">HPA</a> by default and will integrate with autoscaling components like <a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler">Cluster-Autoscaler</a> or <a href="https://github.com/kubernetes-sigs/karpenter">Karpenter</a> for smart scaling across different clouds.
{{% blocks/feature icon="fas fa-comments" title="Built-in ChatUI" %}}
Out-of-the-box chatbot support through integration with <a href="https://github.com/open-webui/open-webui">Open WebUI</a>, offering capabilities like function calling, RAG, web search and more. See the configuration <a href="/InftyAI/llmaz/blob/main/docs/open-webui.md">here</a>.
{{% /blocks/feature %}}

{{% blocks/feature icon="fas fa-box-open" title="Efficient Model Distribution (WIP)" %}}
Out-of-the-box model cache system support with <a href="https://github.com/InftyAI/Manta">Manta</a>, still under development right now with architecture reframing.
{{% blocks/feature icon="fas fa-ellipsis-h" title="More in the future" %}}
{{% /blocks/feature %}}

{{% /blocks/section %}}
2 changes: 1 addition & 1 deletion site/content/en/docs/develop.md
@@ -1,6 +1,6 @@
---
title: Develop Guidance
weight: 3
weight: 4
description: >
This section contains development guidance for people who want to learn more about this project.
---
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
title: Supported Inference Backends
weight: 5
title: Broad Inference Backends Support
weight: 1
---

If you want to integrate more backends into llmaz, please refer to this [PR](https://github.com/InftyAI/llmaz/pull/182). Contributions are always welcome.
@@ -9,6 +9,11 @@ If you want to integrate more backends into llmaz, please refer to this [PR](htt

[llama.cpp](https://github.com/ggerganov/llama.cpp) aims to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware, locally and in the cloud.

## ollama

[ollama](https://github.com/ollama/ollama) runs Llama 3.2, Mistral, Gemma 2, and other large language models. It is based on llama.cpp and aims at local deployment.


## SGLang

[SGLang](https://github.com/sgl-project/sglang) is yet another fast serving framework for large language models and vision language models.
@@ -21,10 +26,6 @@ If you want to integrate more backends into llmaz, please refer to this [PR](htt

[text-generation-inference](https://github.com/huggingface/text-generation-inference) is a Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoint.

## ollama

[ollama](https://github.com/ollama/ollama) runs Llama 3.2, Mistral, Gemma 2, and other large language models. It is based on llama.cpp and aims at local deployment.

## vLLM

[vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs
6 changes: 6 additions & 0 deletions site/content/en/docs/features/distributed_inference.md
@@ -0,0 +1,6 @@
---
title: Distributed Inference
weight: 3
---

Support multi-host & homogeneous xPyD distributed serving with [LWS](https://github.com/kubernetes-sigs/lws) from day 0. Will implement the heterogeneous xPyD in the future.
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
title: Heterogeneous Cluster Support
weight: 1
weight: 2
---

A `llama2-7B` model can run on __1xA100__ GPU, on __1xA10__ GPU, even on __1x4090__, and on a variety of other GPU types as well; this is what we call resource fungibility. In practice, we may have a heterogeneous cluster with different GPU types, and high-end GPUs are often out of stock. To meet the SLOs of the service while keeping costs down, we need to schedule the workloads on different GPU types. With the [ResourceFungibility](https://github.com/InftyAI/scheduler-plugins/blob/main/pkg/plugins/resource_fungibility) plugin in the InftyAI scheduler, we can achieve this with at most 8 alternative GPU types.
@@ -20,4 +20,4 @@ globalConfig:
scheduler-name: inftyai-scheduler
```

then run `make helm-upgrade` to install or upgrade llmaz.
Run `make helm-upgrade` to install or upgrade llmaz.
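
As an illustration of the resource fungibility idea described in this file, the fallback across GPU types can be sketched in Python. This is only a sketch under stated assumptions: `pick_gpu_type`, its parameters, and the free-device map are hypothetical names that are not part of llmaz or the InftyAI scheduler, and the real ResourceFungibility plugin may work differently. The only detail taken from the text is the limit of 8 alternative GPU types.

```python
from typing import Mapping, Optional, Sequence

# Limit stated in the doc: at most 8 alternative GPU types.
MAX_ALTERNATIVES = 8


def pick_gpu_type(
    preferred: Sequence[str],
    free_devices: Mapping[str, int],
    needed: int = 1,
) -> Optional[str]:
    """Return the first GPU type, in preference order, with enough free devices.

    Hypothetical helper for illustration only; it is not the scheduler's API.
    """
    if len(preferred) > MAX_ALTERNATIVES:
        raise ValueError(f"at most {MAX_ALTERNATIVES} alternative GPU types are allowed")
    for gpu_type in preferred:
        # A GPU type missing from the map is treated as having no free devices.
        if free_devices.get(gpu_type, 0) >= needed:
            return gpu_type
    return None  # no listed GPU type can host the workload right now


if __name__ == "__main__":
    # A100 is out of stock, so the workload falls back to the next preference.
    choice = pick_gpu_type(["A100", "A10", "4090"], {"A100": 0, "A10": 2, "4090": 4})
    print(choice)
```

The preference list is ordered, so a high-end GPU is used when it is free and a cheaper one is used when it is not, which is the cost and SLO trade-off the paragraph describes.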
2 changes: 1 addition & 1 deletion site/content/en/docs/reference/_index.md
@@ -1,6 +1,6 @@
---
title: Reference
weight: 4
weight: 5
description: >
This section contains the llmaz reference information.
menu: