Commit d6d1007

FEAT: [model] gemma-4 support (#4768)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
1 parent 7f9fa69 commit d6d1007

17 files changed

Lines changed: 1376 additions & 151 deletions

Lines changed: 223 additions & 0 deletions
@@ -0,0 +1,223 @@
+.. _models_llm_gemma-4:
+
+========================================
+gemma-4
+========================================
+
+- **Context Length:** 262144
+- **Model Name:** gemma-4
+- **Languages:** en, zh
+- **Abilities:** generate, chat, reasoning, audio, vision, hybrid
+- **Description:** Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output.
+
+Specifications
+^^^^^^^^^^^^^^
+
+
+Model Spec 1 (pytorch, 2 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** pytorch
+- **Model Size (in billions):** 2
+- **Quantizations:** none
+- **Engines**: Transformers
+- **Model ID:** google/gemma-4-E2B-it
+- **Model Hubs**: `Hugging Face <https://huggingface.co/google/gemma-4-E2B-it>`__, `ModelScope <https://modelscope.cn/models/google/gemma-4-E2B-it>`__
+
+Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with your
+chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 2 --model-format pytorch --quantization ${quantization}
+
+
+Model Spec 2 (pytorch, 4 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** pytorch
+- **Model Size (in billions):** 4
+- **Quantizations:** none
+- **Engines**: Transformers
+- **Model ID:** google/gemma-4-E4B-it
+- **Model Hubs**: `Hugging Face <https://huggingface.co/google/gemma-4-E4B-it>`__, `ModelScope <https://modelscope.cn/models/google/gemma-4-E4B-it>`__
+
+Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with your
+chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 4 --model-format pytorch --quantization ${quantization}
+
+
+Model Spec 3 (pytorch, 31 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** pytorch
+- **Model Size (in billions):** 31
+- **Quantizations:** none
+- **Engines**: Transformers
+- **Model ID:** google/gemma-4-31B-it
+- **Model Hubs**: `Hugging Face <https://huggingface.co/google/gemma-4-31B-it>`__, `ModelScope <https://modelscope.cn/models/google/gemma-4-31B-it>`__
+
+Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with your
+chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 31 --model-format pytorch --quantization ${quantization}
+
+
+Model Spec 4 (fp4, 31 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** fp4
+- **Model Size (in billions):** 31
+- **Quantizations:** FP4
+- **Engines**:
+- **Model ID:** nvidia/Gemma-4-31B-IT-NVFP4
+- **Model Hubs**: `Hugging Face <https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4>`__, `ModelScope <https://modelscope.cn/models/nv-community/Gemma-4-31B-IT-NVFP4>`__
+
+Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with your
+chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 31 --model-format fp4 --quantization ${quantization}
+
+
+Model Spec 5 (pytorch, 26 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** pytorch
+- **Model Size (in billions):** 26
+- **Quantizations:** none
+- **Engines**: Transformers
+- **Model ID:** google/gemma-4-26B-A4B-it
+- **Model Hubs**: `Hugging Face <https://huggingface.co/google/gemma-4-26B-A4B-it>`__, `ModelScope <https://modelscope.cn/models/google/gemma-4-26B-A4B-it>`__
+
+Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with your
+chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 26 --model-format pytorch --quantization ${quantization}
+
+
+Model Spec 6 (ggufv2, 2 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** ggufv2
+- **Model Size (in billions):** 2
+- **Quantizations:** BF16, IQ4_NL, IQ4_XS, Q3_K_M, Q3_K_S, Q4_0, Q4_1, Q4_K_M, Q4_K_S, Q5_K_M, Q5_K_S, Q6_K, Q8_0, UD-IQ2_M, UD-IQ3_XXS, UD-Q2_K_XL, UD-Q3_K_XL, UD-Q4_K_XL, UD-Q5_K_XL, UD-Q6_K_XL, UD-Q8_K_XL
+- **Engines**:
+- **Model ID:** unsloth/gemma-4-E2B-it-GGUF
+- **Model Hubs**: `Hugging Face <https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF>`__, `ModelScope <https://modelscope.cn/models/unsloth/gemma-4-E2B-it-GGUF>`__
+
+Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with your
+chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 2 --model-format ggufv2 --quantization ${quantization}
+
+
+Model Spec 7 (ggufv2, 4 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** ggufv2
+- **Model Size (in billions):** 4
+- **Quantizations:** BF16, IQ4_NL, IQ4_XS, Q3_K_M, Q3_K_S, Q4_0, Q4_1, Q4_K_M, Q4_K_S, Q5_K_M, Q5_K_S, Q6_K, Q8_0, UD-IQ2_M, UD-IQ3_XXS, UD-Q2_K_XL, UD-Q3_K_XL, UD-Q4_K_XL, UD-Q5_K_XL, UD-Q6_K_XL, UD-Q8_K_XL
+- **Engines**:
+- **Model ID:** unsloth/gemma-4-E4B-it-GGUF
+- **Model Hubs**: `Hugging Face <https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF>`__, `ModelScope <https://modelscope.cn/models/unsloth/gemma-4-E4B-it-GGUF>`__
+
+Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with your
+chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 4 --model-format ggufv2 --quantization ${quantization}
+
+
+Model Spec 8 (ggufv2, 31 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** ggufv2
+- **Model Size (in billions):** 31
+- **Quantizations:** IQ4_NL, IQ4_XS, Q3_K_M, Q3_K_S, Q4_0, Q4_1, Q4_K_M, Q4_K_S, Q5_K_M, Q5_K_S, Q6_K, Q8_0, UD-IQ2_M, UD-IQ3_XXS, UD-Q2_K_XL, UD-Q3_K_XL, UD-Q4_K_XL, UD-Q5_K_XL, UD-Q6_K_XL, UD-Q8_K_XL
+- **Engines**:
+- **Model ID:** unsloth/gemma-4-31B-it-GGUF
+- **Model Hubs**: `Hugging Face <https://huggingface.co/unsloth/gemma-4-31B-it-GGUF>`__
+
+Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with your
+chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 31 --model-format ggufv2 --quantization ${quantization}
+
+
+Model Spec 9 (ggufv2, 26 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** ggufv2
+- **Model Size (in billions):** 26
+- **Quantizations:** MXFP4_MOE, Q8_0, UD-IQ2_M, UD-IQ3_S, UD-IQ3_XXS, UD-IQ4_NL, UD-IQ4_XS, UD-Q2_K_XL, UD-Q3_K_M, UD-Q3_K_S, UD-Q3_K_XL, UD-Q4_K_M, UD-Q4_K_S, UD-Q4_K_XL, UD-Q5_K_M, UD-Q5_K_S, UD-Q5_K_XL, UD-Q6_K, UD-Q6_K_XL, UD-Q8_K_XL
+- **Engines**:
+- **Model ID:** unsloth/gemma-4-26B-A4B-it-GGUF
+- **Model Hubs**: `Hugging Face <https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF>`__, `ModelScope <https://modelscope.cn/models/unsloth/gemma-4-26B-A4B-it-GGUF>`__
+
+Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with your
+chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 26 --model-format ggufv2 --quantization ${quantization}
+
+
+Model Spec 10 (mlx, 2 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** mlx
+- **Model Size (in billions):** 2
+- **Quantizations:** bf16, 8bit, 6bit, 5bit, 4bit, mxfp8, mxfp4, nvfp4
+- **Engines**:
+- **Model ID:** mlx-community/gemma-4-e2b-it-{quantization}
+- **Model Hubs**: `Hugging Face <https://huggingface.co/mlx-community/gemma-4-e2b-it-{quantization}>`__, `ModelScope <https://modelscope.cn/models/mlx-community/gemma-4-e2b-it-{quantization}>`__
+
+Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with your
+chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 2 --model-format mlx --quantization ${quantization}
+
+
+Model Spec 11 (mlx, 4 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** mlx
+- **Model Size (in billions):** 4
+- **Quantizations:** bf16, 8bit, 6bit, 5bit, 4bit, mxfp8, mxfp4, nvfp4
+- **Engines**:
+- **Model ID:** mlx-community/gemma-4-e4b-it-{quantization}
+- **Model Hubs**: `Hugging Face <https://huggingface.co/mlx-community/gemma-4-e4b-it-{quantization}>`__, `ModelScope <https://modelscope.cn/models/mlx-community/gemma-4-e4b-it-{quantization}>`__
+
+Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with your
+chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 4 --model-format mlx --quantization ${quantization}
+
+
+Model Spec 12 (mlx, 31 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** mlx
+- **Model Size (in billions):** 31
+- **Quantizations:** bf16, 8bit, 6bit, 5bit, 4bit, mxfp8, mxfp4, nvfp4
+- **Engines**:
+- **Model ID:** mlx-community/gemma-4-31b-it-{quantization}
+- **Model Hubs**: `Hugging Face <https://huggingface.co/mlx-community/gemma-4-31b-it-{quantization}>`__, `ModelScope <https://modelscope.cn/models/mlx-community/gemma-4-31b-it-{quantization}>`__
+
+Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with your
+chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 31 --model-format mlx --quantization ${quantization}
+
+
+Model Spec 13 (mlx, 26 Billion)
+++++++++++++++++++++++++++++++++++++++++
+
+- **Model Format:** mlx
+- **Model Size (in billions):** 26
+- **Quantizations:** bf16, 8bit, 6bit, 5bit, 4bit, mxfp8, mxfp4, nvfp4
+- **Engines**:
+- **Model ID:** mlx-community/gemma-4-26b-a4b-it-{quantization}
+- **Model Hubs**: `Hugging Face <https://huggingface.co/mlx-community/gemma-4-26b-a4b-it-{quantization}>`__, `ModelScope <https://modelscope.cn/models/mlx-community/gemma-4-26b-a4b-it-{quantization}>`__
+
+Execute the following command to launch the model. Remember to replace ``${engine}`` with your chosen engine and ``${quantization}`` with your
+chosen quantization method from the options listed above::
+
+   xinference launch --model-engine ${engine} --model-name gemma-4 --size-in-billions 26 --model-format mlx --quantization ${quantization}
+
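
Every spec in the added doc uses the same `xinference launch` invocation and varies only the size, format, engine, and quantization. As a rough illustration, the sketch below assembles that command line from those four values. The helper name `build_launch_command` is hypothetical and not part of Xinference; only the flags are taken from the documented command, and the sample values come from Model Spec 1.

```python
import shlex
from typing import List


def build_launch_command(
    engine: str, size_in_billions: int, model_format: str, quantization: str
) -> List[str]:
    """Hypothetical helper: assemble the documented launch arguments for gemma-4."""
    return [
        "xinference", "launch",
        "--model-engine", engine,
        "--model-name", "gemma-4",
        "--size-in-billions", str(size_in_billions),
        "--model-format", model_format,
        "--quantization", quantization,
    ]


# Values from Model Spec 1: pytorch format, 2 billion, Transformers engine, "none".
cmd = build_launch_command("Transformers", 2, "pytorch", "none")
command_line = shlex.join(cmd)  # quote each argument safely for a POSIX shell
print(command_line)
```

Keeping the arguments as a list until the last step means the same value can be passed to `subprocess.run` without going through a shell.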

doc/source/models/builtin/llm/index.rst

Lines changed: 7 additions & 0 deletions
@@ -201,6 +201,11 @@ The following is a list of built-in LLM in Xinference:
      - 131072
      - Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.
 
+   * - :ref:`gemma-4 <models_llm_gemma-4>`
+     - generate, chat, reasoning, audio, vision, hybrid
+     - 262144
+     - Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output.
+
    * - :ref:`glm-4.1v-thinking <models_llm_glm-4.1v-thinking>`
      - chat, vision, reasoning, tools
      - 65536
@@ -830,6 +835,8 @@ The following is a list of built-in LLM in Xinference:
 
    gemma-3-it
 
+   gemma-4
+
    glm-4.1v-thinking
 
    glm-4.5

xinference/core/virtual_env_manager.py

Lines changed: 2 additions & 2 deletions
@@ -37,7 +37,7 @@
         'sgl_kernel ; cuda_version < "13.0"',
     ],
     "vllm": [
-        "vllm>=0.11.2,<0.15.0",
+        "vllm>=0.11.2",
     ],
     "transformers": [
         "transformers>=4.53.3",
@@ -64,7 +64,7 @@
 
 ENGINE_VIRTUALENV_EXTRA_INDEX_URLS: Dict[str, List[str]] = {
     "vllm": [
-        "https://wheels.vllm.ai/0.14.1/cu130",
+        "https://wheels.vllm.ai/0.19.0/cu130",
         "https://download.pytorch.org/whl/cu130",
     ],
     "sglang": [
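
The two changes in this file appear to go together: the wheel index now points at a `0.19.0` path, which the old `<0.15.0` cap would have excluded. The sketch below checks that reading with a deliberately simplified comparison. It assumes the index path names a vllm release, and it handles only plain dotted numeric versions, not the full PEP 440 rules that pip applies. The function names are hypothetical.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a plain dotted release string such as "0.19.0" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def satisfies_old_pin(version: str) -> bool:
    """Old requirement from the diff: vllm>=0.11.2,<0.15.0."""
    v = parse_version(version)
    return parse_version("0.11.2") <= v < parse_version("0.15.0")


def satisfies_new_pin(version: str) -> bool:
    """New requirement from the diff: vllm>=0.11.2."""
    return parse_version(version) >= parse_version("0.11.2")


old_ok = satisfies_old_pin("0.19.0")  # rejected by the removed upper bound
new_ok = satisfies_new_pin("0.19.0")  # accepted once the cap is gone
print(old_ok, new_ok)
```

Comparing integer tuples rather than raw strings matters here: as strings, `"0.9.0"` would sort after `"0.11.2"`.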

xinference/model/llm/core.py

Lines changed: 3 additions & 0 deletions
@@ -209,12 +209,15 @@ def prepare_parse_reasoning_content(
             warnings.warn(
                 "enable_thinking cannot be disabled for non hybrid model, will be ignored"
             )
+        abilities = self.model_family.model_ability or []
+        auto_insert_start_tag = "hybrid" not in abilities
         # Initialize reasoning parser if model has reasoning ability
         self.reasoning_parser = ReasoningParser(  # type: ignore
             reasoning_content,
             self.model_family.reasoning_start_tag,  # type: ignore
             self.model_family.reasoning_end_tag,  # type: ignore
             enable_thinking=enable_thinking,
+            auto_insert_start_tag=auto_insert_start_tag,
         )
 
     def prepare_parse_tool_calls(self):
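
The two added lines derive `auto_insert_start_tag` from the model's ability list: the flag is off for models that declare `hybrid` and on otherwise, with a missing list treated as empty. The standalone sketch below reproduces only that derivation. The function name `should_auto_insert_start_tag` is hypothetical, and what `ReasoningParser` does with the flag is not shown in this diff, so it is not modelled here.

```python
from typing import List, Optional


def should_auto_insert_start_tag(model_ability: Optional[List[str]]) -> bool:
    """Mirror the added lines: default a missing list to [], then test for "hybrid"."""
    abilities = model_ability or []
    return "hybrid" not in abilities


# Ability list documented for gemma-4 in this commit.
gemma4_abilities = ["generate", "chat", "reasoning", "audio", "vision", "hybrid"]
# Illustrative non-hybrid reasoning model (an assumption, not taken from the diff).
reasoning_only = ["chat", "reasoning"]

hybrid_flag = should_auto_insert_start_tag(gemma4_abilities)
plain_flag = should_auto_insert_start_tag(reasoning_only)
missing_flag = should_auto_insert_start_tag(None)
print(hybrid_flag, plain_flag, missing_flag)
```

The `or []` guard is what keeps the membership test from raising `TypeError` when `model_ability` is `None`.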
