@@ -92,17 +92,17 @@ Models | GPU Machine Type

 | Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
 | ---------------- | ---------------- | --------- | ------------------- | ------------ | ------------------ |
-| **DeepSeek R1 671B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | vLLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/vllm/README.md)
-| **Wan2.2 T2V A14B Diffusers** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | SGLang | Inference | GKE | [Link](./inference/a4x/single-host-serving/sglang/README.md)
-| **Wan2.2 I2V A14B Diffusers** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | SGLang | Inference | GKE | [Link](./inference/a4x/single-host-serving/sglang/README.md)
-| **DeepSeek R1 671B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md) <br><br> [Link for Using Google Cloud Storage (GCS) as Storage Option]((./inference/a4x/single-host-serving/tensorrt-llm-gcs/README.md)) <br><br> [Link for Using Lustre as Storage Option]((./inference/a4x/single-host-serving/tensorrt-llm-lustre/README.md))
-| **Llama 3.1 405B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
-| **Llama 3.1 70B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
-| **Llama 3.1 8B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
-| **Qwen 2.5 VL 7B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
-| **Qwen 3 235B A22B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
-| **Qwen 3 32B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
-| **Qwen 3 4B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
+| **DeepSeek R1 671B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | vLLM (v0.14.0rc1) | Inference | GKE | [Link](./inference/a4x/single-host-serving/vllm/README.md)
+| **Wan2.2 T2V A14B Diffusers** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | SGLang (latest) | Inference | GKE | [Link](./inference/a4x/single-host-serving/sglang/README.md)
+| **Wan2.2 I2V A14B Diffusers** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | SGLang (latest) | Inference | GKE | [Link](./inference/a4x/single-host-serving/sglang/README.md)
+| **DeepSeek R1 671B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md) <br><br> [Link for Using Google Cloud Storage (GCS) as Storage Option]((./inference/a4x/single-host-serving/tensorrt-llm-gcs/README.md)) <br><br> [Link for Using Lustre as Storage Option]((./inference/a4x/single-host-serving/tensorrt-llm-lustre/README.md))
+| **Llama 3.1 405B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
+| **Llama 3.1 70B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
+| **Llama 3.1 8B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
+| **Qwen 2.5 VL 7B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
+| **Qwen 3 235B A22B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
+| **Qwen 3 32B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)
+| **Qwen 3 4B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | TensorRT-LLM (1.3.0rc5) | Inference | GKE | [Link](./inference/a4x/single-host-serving/tensorrt-llm/README.md)


 ### Inference benchmarks G4