xorbitsai
diff --git a/doc/source/models/builtin/llm/gemma-4.rst b/doc/source/models/builtin/llm/gemma-4.rst
Lines changed: 223 additions & 0 deletions
diff --git a/doc/source/models/builtin/llm/index.rst b/doc/source/models/builtin/llm/index.rst
Lines changed: 7 additions & 0 deletions
diff --git a/xinference/core/virtual_env_manager.py b/xinference/core/virtual_env_manager.py
Lines changed: 2 additions & 2 deletions
diff --git a/xinference/model/llm/core.py b/xinference/model/llm/core.py
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,223 @@
+.. _models_llm_gemma-4:
+
+========================================
+gemma-4
+========================================
+
+- **Context Length:** 262144
+- **Model Name:** gemma-4
+- **Languages:** en, zh
+- **Abilities:** generate, chat, reasoning, audio, vision, hybrid
+- **Description:** Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output.
+
+Specifications
+^^^^^^^^^^^^^^
+
+
+Model Spec 1 (pytorch, 2 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** pytorch
+- **Model Size (in billions):** 2
+- **Quantizations:** none
+- **Engines**: Transformers
+- **Model ID:** google/gemma-4-E2B-it
+- **Model Hubs**:  `Hugging Face <https://huggingface.co/google/gemma-4-E2B-it>`__, `ModelScope <https://modelscope.cn/models/google/gemma-4-E2B-it>`__
+
+Execute the following command to launch the model. Replace ``${engine}`` with your chosen engine and
+``${quantization}`` with your chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 2 --model-format pytorch --quantization ${quantization}
+
+
+Model Spec 2 (pytorch, 4 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** pytorch
+- **Model Size (in billions):** 4
+- **Quantizations:** none
+- **Engines**: Transformers
+- **Model ID:** google/gemma-4-E4B-it
+- **Model Hubs**:  `Hugging Face <https://huggingface.co/google/gemma-4-E4B-it>`__, `ModelScope <https://modelscope.cn/models/google/gemma-4-E4B-it>`__
+
+Execute the following command to launch the model. Replace ``${engine}`` with your chosen engine and
+``${quantization}`` with your chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 4 --model-format pytorch --quantization ${quantization}
+
+
+Model Spec 3 (pytorch, 31 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** pytorch
+- **Model Size (in billions):** 31
+- **Quantizations:** none
+- **Engines**: Transformers
+- **Model ID:** google/gemma-4-31B-it
+- **Model Hubs**:  `Hugging Face <https://huggingface.co/google/gemma-4-31B-it>`__, `ModelScope <https://modelscope.cn/models/google/gemma-4-31B-it>`__
+
+Execute the following command to launch the model. Replace ``${engine}`` with your chosen engine and
+``${quantization}`` with your chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 31 --model-format pytorch --quantization ${quantization}
+
+
+Model Spec 4 (fp4, 31 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** fp4
+- **Model Size (in billions):** 31
+- **Quantizations:** FP4
+- **Engines**: 
+- **Model ID:** nvidia/Gemma-4-31B-IT-NVFP4
+- **Model Hubs**:  `Hugging Face <https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4>`__, `ModelScope <https://modelscope.cn/models/nv-community/Gemma-4-31B-IT-NVFP4>`__
+
+Execute the following command to launch the model. Replace ``${engine}`` with your chosen engine and
+``${quantization}`` with your chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 31 --model-format fp4 --quantization ${quantization}
+
+
+Model Spec 5 (pytorch, 26 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** pytorch
+- **Model Size (in billions):** 26
+- **Quantizations:** none
+- **Engines**: Transformers
+- **Model ID:** google/gemma-4-26B-A4B-it
+- **Model Hubs**:  `Hugging Face <https://huggingface.co/google/gemma-4-26B-A4B-it>`__, `ModelScope <https://modelscope.cn/models/google/gemma-4-26B-A4B-it>`__
+
+Execute the following command to launch the model. Replace ``${engine}`` with your chosen engine and
+``${quantization}`` with your chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 26 --model-format pytorch --quantization ${quantization}
+
+
+Model Spec 6 (ggufv2, 2 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** ggufv2
+- **Model Size (in billions):** 2
+- **Quantizations:** BF16, IQ4_NL, IQ4_XS, Q3_K_M, Q3_K_S, Q4_0, Q4_1, Q4_K_M, Q4_K_S, Q5_K_M, Q5_K_S, Q6_K, Q8_0, UD-IQ2_M, UD-IQ3_XXS, UD-Q2_K_XL, UD-Q3_K_XL, UD-Q4_K_XL, UD-Q5_K_XL, UD-Q6_K_XL, UD-Q8_K_XL
+- **Engines**: 
+- **Model ID:** unsloth/gemma-4-E2B-it-GGUF
+- **Model Hubs**:  `Hugging Face <https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF>`__, `ModelScope <https://modelscope.cn/models/unsloth/gemma-4-E2B-it-GGUF>`__
+
+Execute the following command to launch the model. Replace ``${engine}`` with your chosen engine and
+``${quantization}`` with your chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 2 --model-format ggufv2 --quantization ${quantization}
+
+
+Model Spec 7 (ggufv2, 4 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** ggufv2
+- **Model Size (in billions):** 4
+- **Quantizations:** BF16, IQ4_NL, IQ4_XS, Q3_K_M, Q3_K_S, Q4_0, Q4_1, Q4_K_M, Q4_K_S, Q5_K_M, Q5_K_S, Q6_K, Q8_0, UD-IQ2_M, UD-IQ3_XXS, UD-Q2_K_XL, UD-Q3_K_XL, UD-Q4_K_XL, UD-Q5_K_XL, UD-Q6_K_XL, UD-Q8_K_XL
+- **Engines**: 
+- **Model ID:** unsloth/gemma-4-E4B-it-GGUF
+- **Model Hubs**:  `Hugging Face <https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF>`__, `ModelScope <https://modelscope.cn/models/unsloth/gemma-4-E4B-it-GGUF>`__
+
+Execute the following command to launch the model. Replace ``${engine}`` with your chosen engine and
+``${quantization}`` with your chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 4 --model-format ggufv2 --quantization ${quantization}
+
+
+Model Spec 8 (ggufv2, 31 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** ggufv2
+- **Model Size (in billions):** 31
+- **Quantizations:** IQ4_NL, IQ4_XS, Q3_K_M, Q3_K_S, Q4_0, Q4_1, Q4_K_M, Q4_K_S, Q5_K_M, Q5_K_S, Q6_K, Q8_0, UD-IQ2_M, UD-IQ3_XXS, UD-Q2_K_XL, UD-Q3_K_XL, UD-Q4_K_XL, UD-Q5_K_XL, UD-Q6_K_XL, UD-Q8_K_XL
+- **Engines**: 
+- **Model ID:** unsloth/gemma-4-31B-it-GGUF
+- **Model Hubs**:  `Hugging Face <https://huggingface.co/unsloth/gemma-4-31B-it-GGUF>`__
+
+Execute the following command to launch the model. Replace ``${engine}`` with your chosen engine and
+``${quantization}`` with your chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 31 --model-format ggufv2 --quantization ${quantization}
+
+
+Model Spec 9 (ggufv2, 26 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** ggufv2
+- **Model Size (in billions):** 26
+- **Quantizations:** MXFP4_MOE, Q8_0, UD-IQ2_M, UD-IQ3_S, UD-IQ3_XXS, UD-IQ4_NL, UD-IQ4_XS, UD-Q2_K_XL, UD-Q3_K_M, UD-Q3_K_S, UD-Q3_K_XL, UD-Q4_K_M, UD-Q4_K_S, UD-Q4_K_XL, UD-Q5_K_M, UD-Q5_K_S, UD-Q5_K_XL, UD-Q6_K, UD-Q6_K_XL, UD-Q8_K_XL
+- **Engines**: 
+- **Model ID:** unsloth/gemma-4-26B-A4B-it-GGUF
+- **Model Hubs**:  `Hugging Face <https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF>`__, `ModelScope <https://modelscope.cn/models/unsloth/gemma-4-26B-A4B-it-GGUF>`__
+
+Execute the following command to launch the model. Replace ``${engine}`` with your chosen engine and
+``${quantization}`` with your chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 26 --model-format ggufv2 --quantization ${quantization}
+
+
+Model Spec 10 (mlx, 2 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** mlx
+- **Model Size (in billions):** 2
+- **Quantizations:** bf16, 8bit, 6bit, 5bit, 4bit, mxfp8, mxfp4, nvfp4
+- **Engines**: 
+- **Model ID:** mlx-community/gemma-4-e2b-it-{quantization}
+- **Model Hubs**:  `Hugging Face <https://huggingface.co/mlx-community/gemma-4-e2b-it-{quantization}>`__, `ModelScope <https://modelscope.cn/models/mlx-community/gemma-4-e2b-it-{quantization}>`__
+
+Execute the following command to launch the model. Replace ``${engine}`` with your chosen engine and
+``${quantization}`` with your chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 2 --model-format mlx --quantization ${quantization}
+
+
+Model Spec 11 (mlx, 4 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** mlx
+- **Model Size (in billions):** 4
+- **Quantizations:** bf16, 8bit, 6bit, 5bit, 4bit, mxfp8, mxfp4, nvfp4
+- **Engines**: 
+- **Model ID:** mlx-community/gemma-4-e4b-it-{quantization}
+- **Model Hubs**:  `Hugging Face <https://huggingface.co/mlx-community/gemma-4-e4b-it-{quantization}>`__, `ModelScope <https://modelscope.cn/models/mlx-community/gemma-4-e4b-it-{quantization}>`__
+
+Execute the following command to launch the model. Replace ``${engine}`` with your chosen engine and
+``${quantization}`` with your chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 4 --model-format mlx --quantization ${quantization}
+
+
+Model Spec 12 (mlx, 31 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** mlx
+- **Model Size (in billions):** 31
+- **Quantizations:** bf16, 8bit, 6bit, 5bit, 4bit, mxfp8, mxfp4, nvfp4
+- **Engines**: 
+- **Model ID:** mlx-community/gemma-4-31b-it-{quantization}
+- **Model Hubs**:  `Hugging Face <https://huggingface.co/mlx-community/gemma-4-31b-it-{quantization}>`__, `ModelScope <https://modelscope.cn/models/mlx-community/gemma-4-31b-it-{quantization}>`__
+
+Execute the following command to launch the model. Replace ``${engine}`` with your chosen engine and
+``${quantization}`` with your chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 31 --model-format mlx --quantization ${quantization}
+
+
+Model Spec 13 (mlx, 26 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** mlx
+- **Model Size (in billions):** 26
+- **Quantizations:** bf16, 8bit, 6bit, 5bit, 4bit, mxfp8, mxfp4, nvfp4
+- **Engines**: 
+- **Model ID:** mlx-community/gemma-4-26b-a4b-it-{quantization}
+- **Model Hubs**:  `Hugging Face <https://huggingface.co/mlx-community/gemma-4-26b-a4b-it-{quantization}>`__, `ModelScope <https://modelscope.cn/models/mlx-community/gemma-4-26b-a4b-it-{quantization}>`__
+
+Execute the following command to launch the model. Replace ``${engine}`` with your chosen engine and
+``${quantization}`` with your chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 26 --model-format mlx --quantization ${quantization}
+
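Reviewer note: every spec in the new page uses the same launch template, with only the engine, size, format, and quantization changing. The sketch below shows how the placeholders might be filled in for a given spec. The helper name `build_launch_command` is hypothetical and not part of Xinference; the flags are copied from the template in the page, and the example values (`Transformers`, `none`) come from Model Spec 1.

```python
import shlex


def build_launch_command(engine, size_in_billions, model_format, quantization,
                         model_name="gemma-4"):
    """Fill in the launch template from the generated docs (hypothetical helper)."""
    args = [
        "xinference", "launch",
        "--model-engine", engine,
        "--model-name", model_name,
        "--size-in-billions", str(size_in_billions),
        "--model-format", model_format,
        "--quantization", quantization,
    ]
    # shlex.join quotes any argument that would need it in a POSIX shell.
    return shlex.join(args)


# Model Spec 1: pytorch, 2 billion parameters, no quantization.
command = build_launch_command("Transformers", 2, "pytorch", "none")
print(command)
```

For the specs whose **Engines** field is empty in the page, the engine value has to come from whatever engines are available in the local installation.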
@@ -201,6 +201,11 @@ The following is a list of built-in LLM in Xinference:
      - 131072
      - Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.
 
+   * - :ref:`gemma-4 <models_llm_gemma-4>`
+     - generate, chat, reasoning, audio, vision, hybrid
+     - 262144
+     - Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output.
+
    * - :ref:`glm-4.1v-thinking <models_llm_glm-4.1v-thinking>`
      - chat, vision, reasoning, tools
      - 65536
@@ -830,6 +835,8 @@ The following is a list of built-in LLM in Xinference:
 
    gemma-3-it
 
+   gemma-4
+  
    glm-4.1v-thinking
 
    glm-4.5
 
@@ -37,7 +37,7 @@
         'sgl_kernel ; cuda_version < "13.0"',
     ],
     "vllm": [
-        "vllm>=0.11.2,<0.15.0",
+        "vllm>=0.11.2",
     ],
     "transformers": [
         "transformers>=4.53.3",
@@ -64,7 +64,7 @@
 
 ENGINE_VIRTUALENV_EXTRA_INDEX_URLS: Dict[str, List[str]] = {
     "vllm": [
-        "https://wheels.vllm.ai/0.14.1/cu130",
+        "https://wheels.vllm.ai/0.19.0/cu130",
         "https://download.pytorch.org/whl/cu130",
     ],
     "sglang": [
 
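Reviewer note: the two changes in this file appear to go together. The extra wheel index moves from 0.14.1 to 0.19.0, and a 0.19.0 release would not satisfy the old `<0.15.0` upper bound, which is presumably why the bound is dropped. The sketch below illustrates that check with a simplified comparison that only handles plain dotted releases; the helper names are hypothetical, and real tooling would normally rely on `packaging.specifiers.SpecifierSet` instead.

```python
def parse_release(text):
    """Parse a plain dotted release such as '0.14.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, lower, upper=None):
    """Return True if lower <= version, and version < upper when upper is given."""
    parsed = parse_release(version)
    if parsed < parse_release(lower):
        return False
    if upper is not None and parsed >= parse_release(upper):
        return False
    return True


# Old pin: "vllm>=0.11.2,<0.15.0"
old_accepts_new_wheel = satisfies("0.19.0", "0.11.2", "0.15.0")
# New pin: "vllm>=0.11.2"
new_accepts_new_wheel = satisfies("0.19.0", "0.11.2")
print(old_accepts_new_wheel, new_accepts_new_wheel)
```

Removing the upper bound also means that any future vLLM release satisfies the requirement, so compatibility then depends on the wheel index rather than on the version pin.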
@@ -209,12 +209,15 @@ def prepare_parse_reasoning_content(
             warnings.warn(
                 "enable_thinking cannot be disabled for non hybrid model, will be ignored"
             )
+        abilities = self.model_family.model_ability or []
+        auto_insert_start_tag = "hybrid" not in abilities
         # Initialize reasoning parser if model has reasoning ability
         self.reasoning_parser = ReasoningParser(  # type: ignore
             reasoning_content,
             self.model_family.reasoning_start_tag,  # type: ignore
             self.model_family.reasoning_end_tag,  # type: ignore
             enable_thinking=enable_thinking,
+            auto_insert_start_tag=auto_insert_start_tag,
         )
 
     def prepare_parse_tool_calls(self):
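Reviewer note: the new flag is derived purely from the model family's ability list. The standalone sketch below mirrors the two added lines so the behavior can be checked in isolation. The function name `should_auto_insert_start_tag` is hypothetical, and how `ReasoningParser` uses `auto_insert_start_tag` is not shown in this diff; the parameter name suggests it controls whether the reasoning start tag is added when the model output lacks one, but that is an assumption.

```python
from typing import List, Optional


def should_auto_insert_start_tag(model_ability: Optional[List[str]]) -> bool:
    """Mirror the added lines: auto-insert only for models without 'hybrid'."""
    # A missing ability list is treated as empty, as in the diff.
    abilities = model_ability or []
    return "hybrid" not in abilities


# Abilities listed for gemma-4 in the new documentation page.
gemma_4 = ["generate", "chat", "reasoning", "audio", "vision", "hybrid"]
print(should_auto_insert_start_tag(gemma_4))                # hybrid model
print(should_auto_insert_start_tag(["chat", "reasoning"]))  # non-hybrid model
print(should_auto_insert_start_tag(None))                   # no abilities declared
```

With this rule, gemma-4 and every other model that declares `hybrid` get `auto_insert_start_tag=False`, while all other reasoning models get `True`.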