22 changes: 11 additions & 11 deletions README.md
@@ -92,17 +92,17 @@ Models | GPU Machine Type

| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
| ---------------- | ---------------- | --------- | ------------------- | ------------ | ------------------ |
| **DeepSeek R1 671B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | vLLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/vllm/README.md)
| **Wan2.2 T2V A14B Diffusers** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | SGLang | Inference | GKE | [Link](./inference/a4x/single-host-serving/sglang/README.md)
| **Wan2.2 I2V A14B Diffusers** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | SGLang | Inference | GKE | [Link](./inference/a4x/single-host-serving/sglang/README.md)
| **DeepSeek R1 671B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md) <br> <br> [Link for Using Google Cloud Storage (GCS) as Storage Option](./inference/a4x/single-host-serving/tensorrt-llm-gcs/README.md) <br> <br> [Link for Using Lustre as Storage Option](./inference/a4x/single-host-serving/tensorrt-llm-lustre/README.md)
| **Llama 3.1 405B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
| **Llama 3.1 70B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
| **Llama 3.1 8B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
| **Qwen 2.5 VL 7B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
| **Qwen 3 235B A22B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
| **Qwen 3 32B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
| **Qwen 3 4B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
| **DeepSeek R1 671B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | vLLM (v0.14.0rc1) | Inference | GKE | [Link](./inference/a4x/single-host-serving/vllm/README.md)
| **Wan2.2 T2V A14B Diffusers** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | SGLang (latest) | Inference | GKE | [Link](./inference/a4x/single-host-serving/sglang/README.md)
| **Wan2.2 I2V A14B Diffusers** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | SGLang (latest) | Inference | GKE | [Link](./inference/a4x/single-host-serving/sglang/README.md)
| **DeepSeek R1 671B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md) <br> <br> [Link for Using Google Cloud Storage (GCS) as Storage Option](./inference/a4x/single-host-serving/tensorrt-llm-gcs/README.md) <br> <br> [Link for Using Lustre as Storage Option](./inference/a4x/single-host-serving/tensorrt-llm-lustre/README.md)
| **Llama 3.1 405B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
| **Llama 3.1 70B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
| **Llama 3.1 8B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
| **Qwen 2.5 VL 7B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
| **Qwen 3 235B A22B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
| **Qwen 3 32B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
| **Qwen 3 4B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)


### Inference benchmarks G4