Commit 41c0ab4

Merge branch 'main' into fix_speculative_decoding_demo
2 parents 405b6ee + 865379f commit 41c0ab4

55 files changed

Lines changed: 4552 additions & 511 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -39,3 +39,4 @@ tmp/
3939
*.tar.gz
4040
models
4141
genhtml
42+
.github/skills/

demos/common/export_models/README.md

Lines changed: 2 additions & 2 deletions
@@ -45,7 +45,7 @@ Expected Output:
4545
usage: export_model.py text_generation [-h] [--model_repository_path MODEL_REPOSITORY_PATH] --source_model SOURCE_MODEL [--model_name MODEL_NAME] [--weight-format PRECISION] [--config_file_path CONFIG_FILE_PATH] [--overwrite_models] [--target_device TARGET_DEVICE] [--ov_cache_dir OV_CACHE_DIR]
4646
[--extra_quantization_params EXTRA_QUANTIZATION_PARAMS] [--pipeline_type {LM,LM_CB,VLM,VLM_CB,AUTO}] [--kv_cache_precision {u8}] [--enable_prefix_caching ENABLE_PREFIX_CACHING] [--disable_dynamic_split_fuse] [--max_num_batched_tokens MAX_NUM_BATCHED_TOKENS] [--max_num_seqs MAX_NUM_SEQS]
4747
[--cache_size CACHE_SIZE] [--draft_source_model DRAFT_SOURCE_MODEL] [--draft_model_name DRAFT_MODEL_NAME] [--draft_eagle3_mode] [--max_prompt_len MAX_PROMPT_LEN] [--prompt_lookup_decoding] [--reasoning_parser {qwen3,gptoss}]
48-
[--tool_parser {llama3,phi4,hermes3,mistral,qwen3coder,gptoss,devstral}] [--enable_tool_guided_generation]
48+
[--tool_parser {llama3,phi4,hermes3,mistral,qwen3coder,gptoss,devstral,lfm2}] [--enable_tool_guided_generation]
4949

5050
options:
5151
-h, --help show this help message and exit
@@ -91,7 +91,7 @@ options:
9191
Set pipeline to use prompt lookup decoding
9292
--reasoning_parser {qwen3,gptoss}
9393
Set the type of the reasoning parser for reasoning content extraction
94-
--tool_parser {llama3,phi4,hermes3,mistral,qwen3coder,gptoss,devstral}
94+
--tool_parser {llama3,phi4,hermes3,mistral,qwen3coder,gptoss,devstral,lfm2}
9595
Set the type of the tool parser for tool calls extraction
9696
--enable_tool_guided_generation
9797
Enables enforcing tool schema during generation. Requires setting tool_parser

demos/common/export_models/export_model.py

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ def add_common_arguments(parser):
5454
'Not effective if target device is not NPU', dest='max_prompt_len')
5555
parser_text.add_argument('--prompt_lookup_decoding', action='store_true', help='Set pipeline to use prompt lookup decoding', dest='prompt_lookup_decoding')
5656
parser_text.add_argument('--reasoning_parser', choices=["qwen3", "gptoss"], help='Set the type of the reasoning parser for reasoning content extraction', dest='reasoning_parser')
57-
parser_text.add_argument('--tool_parser', choices=["llama3", "phi4", "hermes3", "mistral", "qwen3coder", "gptoss", "devstral"], help='Set the type of the tool parser for tool calls extraction', dest='tool_parser')
57+
parser_text.add_argument('--tool_parser', choices=["llama3", "phi4", "hermes3", "mistral", "qwen3coder", "gptoss", "devstral", "lfm2"], help='Set the type of the tool parser for tool calls extraction', dest='tool_parser')
5858
parser_text.add_argument('--enable_tool_guided_generation', action='store_true', help='Enables enforcing tool schema during generation. Requires setting tool_parser', dest='enable_tool_guided_generation')
5959

6060
parser_embeddings_ov = subparsers.add_parser('embeddings_ov', help='export model for embeddings endpoint with directory structure aligned with OpenVINO tools')
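
The `--tool_parser` change above only extends the `choices` list, so `argparse` itself rejects any name that is not listed. A minimal sketch of that behavior follows; the trimmed-down parser, the `TOOL_PARSERS` constant, and the `parse_tool_parser` helper are hypothetical illustrations, not part of `export_model.py`.

```python
import argparse

# Parser names copied from the updated `choices` list in this diff.
TOOL_PARSERS = ["llama3", "phi4", "hermes3", "mistral",
                "qwen3coder", "gptoss", "devstral", "lfm2"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical, trimmed-down parser; the real script defines many more options."""
    parser = argparse.ArgumentParser(prog="export_model_sketch")
    parser.add_argument(
        "--tool_parser",
        choices=TOOL_PARSERS,
        dest="tool_parser",
        help="Set the type of the tool parser for tool calls extraction",
    )
    return parser


def parse_tool_parser(argv):
    """Return the selected tool parser name, or None when the flag is omitted."""
    return build_parser().parse_args(argv).tool_parser
```

Passing `["--tool_parser", "lfm2"]` is accepted after this change, while an unlisted name makes `argparse` exit with a usage error.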

demos/common/export_models/requirements.txt

Lines changed: 3 additions & 3 deletions
@@ -1,14 +1,14 @@
11
--extra-index-url "https://download.pytorch.org/whl/cpu"
22
--extra-index-url "https://storage.openvinotoolkit.org/simple/wheels/nightly"
33
--extra-index-url "https://storage.openvinotoolkit.org/simple/wheels/pre-release"
4-
optimum-intel@git+https://github.com/huggingface/optimum-intel.git@39121884e050a32f9ded590f8597760d7e7ac205
4+
optimum-intel@git+https://github.com/huggingface/optimum-intel.git@d4dd21a3aa89c0671d85b704847ac06a378e761c
55
accelerate
66
datasets
77
diffusers # for image generation
88
nncf
99
numpy
10-
openvino-tokenizers==2026.2.0.dev20260501
11-
openvino==2026.2.0.dev20260501
10+
openvino-tokenizers==2026.2.0.dev20260509
11+
openvino==2026.2.0.dev20260509
1212
pillow
1313
sentence_transformers==5.3.0
1414
sentencepiece # Required by: transformers`

demos/python_demos/clip_image_classification/README.md

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ pip3 install -r requirements.txt
2323
## Download and convert model
2424

2525
```bash
26-
pip3 install -r download_model_requirements.txt
26+
pip3 install --pre --extra-index-url "https://download.pytorch.org/whl/cpu" --extra-index-url "https://storage.openvinotoolkit.org/simple/wheels/nightly" "openvino==2026.1.*" "numpy<2.0" "pillow==12.2.0" "torch==2.8.0+cpu" "transformers<=4.53.0"
2727
```
2828

2929
```bash

demos/python_demos/clip_image_classification/download_model_requirements.txt

Lines changed: 0 additions & 8 deletions
This file was deleted.

docs/model_server_rest_api_chat.md

Lines changed: 4 additions & 4 deletions
@@ -221,6 +221,7 @@ Some parameters, especially related to sampling (like `temperature`, `top_p` etc
221221
| tool_choice |||| string or object | Controls which (if any) tool is called by the model. `none` means the model will not call any tool and instead generates a message. `auto` means the model can pick between generating a message or calling one or more tools. `required` means that model should call at least one tool. Specifying a particular tool via `{"type": "function", "function": {"name": "my_function"}}` forces the model to call that tool. See [OpenAI API reference](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tool_choice) for more details. |
222222
| response_format |||| object | An object specifying the format that the model must output. Setting to `{ "type": "json_schema", "json_schema": {...} }` enables Structured Outputs which ensures the model will match your supplied JSON schema according to [OpenAI reference](https://platform.openai.com/docs/api-reference/chat/create#chat-create-response_format). Learn more in the [Structured Outputs demo](../demos/continuous_batching/structured_output/README.md). Additionally, `response_format` can accept [XGrammar structural tags format](https://github.com/mlc-ai/xgrammar/blob/main/docs/tutorials/structural_tag.md#format-types) (not part of OpenAI API). For example: `{ "type": "const_string", "value": "Hello World!" }`. **Note** that if model server fails to process the format, the request will still be processed, but the format will not be imposed. |
223223
| chat_template_kwargs |||| object | Enables passing additional parameters to chat template engine. Example `{"enable_thinking": false}`. Note that values like `messages`, `eos_token`, `bos_token` etc. are provided natively to the template engine, so including them in `chat_template_kwargs` will cause error. |
224+
| skip_special_tokens |||| bool (default: `true`) | Whether to remove special tokens (e.g. `<\|endoftext\|>`, `<\|im_end\|>`) from the generated output. Set to `false` to include them, which is useful when the model uses special tokens to encode structured information (e.g. bounding boxes, reasoning markers). When `false`, any tool or reasoning parser configured on the endpoint is silently disabled for the request, so the raw token stream is returned. This option works with most detokenizers exported with OpenVINO Tokenizers 2024.5 or later, unless they are based on custom ops. |
224225

225226
#### Beam search sampling specific
226227
| Param | OpenVINO Model Server | OpenAI /chat/completions API | vLLM Serving Sampling Params | Type | Description |
@@ -234,11 +235,12 @@ Some parameters, especially related to sampling (like `temperature`, `top_p` etc
234235
|-------|----------|----------|----------|---------|-----|
235236
| temperature |||| float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
236237
| top_p |||| float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
237-
| top_k |||| int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
238+
| min_p |||| float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p` × the top token probability are filtered out. `0.0` (default) disables the filter. Typical values: `0.05` to `0.1`. Must be in `[0.0, 1.0)`. |
239+
| top_k |||| int (default: `40`) | Controls the number of top tokens to consider. When multinomial sampling is active, defaults to `40` if not set. Set to `-1` to consider all tokens. |
238240
| repetition_penalty |||| float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
239241
| frequency_penalty |||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
240242
| presence_penalty |||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
241-
| seed |||| integer (default: `0`) | Random seed to use for the generation. |
243+
| seed |||| integer (default: random) | Random seed for generation in range `[0, 4294967295]`. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: `rng_seed` set in `generation_config.json` is not honoured for multinomial sampling — only a per-request seed is applied. |
242244

243245
#### Speculative decoding specific
244246

@@ -274,14 +276,12 @@ If any of those parameters is not specified and request is made to Prompt Lookup
274276
- functions
275277

276278
#### Unsupported params from vLLM:
277-
- min_p
278279
- use_beam_search (**In OpenVINO Model Server just simply increase _best_of_ param to enable beam search**)
279280
- early_stopping
280281
- stop_token_ids
281282
- min_tokens
282283
- prompt_logprobs
283284
- detokenize
284-
- skip_special_tokens
285285
- spaces_between_special_tokens
286286
- logits_processors
287287
- truncate_prompt_tokens
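
The `min_p` row added in this file describes a threshold relative to the most likely token. The sketch below illustrates that rule on a plain list of probabilities. It is not the server's implementation: the function name `min_p_filter` is hypothetical, and the renormalization step is an assumption about how the surviving tokens would be sampled.

```python
def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p * max(probs).

    Assumes `probs` is a non-empty probability distribution. The surviving
    probabilities are renormalized to sum to 1 (an assumption, see lead-in).
    """
    if not 0.0 <= min_p < 1.0:
        raise ValueError("min_p must be in [0.0, 1.0)")
    threshold = min_p * max(probs)
    # Tokens strictly below the threshold are filtered out (set to 0).
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]
```

With `min_p = 0.0` the threshold is zero, so every token is kept, which matches the documented "disables the filter" behavior.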

docs/model_server_rest_api_completions.md

Lines changed: 4 additions & 4 deletions
@@ -62,6 +62,7 @@ curl http://localhost/v3/completions \
6262
| include_stop_str_in_output |||| bool (default: `false` if `stream=false`, `true` if `stream=true`) | Whether to include matched stop string in output. Setting it to false when `stream=true` is invalid configuration and will result in error. |
6363
| logprobs | ⚠️ ||| integer (optional) | Include the log probabilities on the logprob of the returned output token. **_ in stream mode logprobs are not returned. Only value 1 is accepted which returns logarithm or the chosen token _** |
6464
| echo |||| boolean (optional) | Echo back the prompt in addition to the completion |
65+
| skip_special_tokens |||| bool (default: `true`) | Whether to remove special tokens (e.g. `<\|endoftext\|>`, `<\|im_end\|>`) from the generated output. Set to `false` to include them, which is useful when the model uses special tokens to encode structured information. This option works with most detokenizers exported with OpenVINO Tokenizers 2024.5 or later, unless they are based on custom ops. |
6566

6667
#### Beam search sampling specific
6768
| Param | OpenVINO Model Server | OpenAI /completions API | vLLM Serving Sampling Params | Type | Description |
@@ -75,11 +76,12 @@ curl http://localhost/v3/completions \
7576
|-------|----------|----------|----------|---------|-----|
7677
| temperature |||| float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
7778
| top_p |||| float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
78-
| top_k |||| int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
79+
| min_p |||| float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p` × the top token probability are filtered out. `0.0` (default) disables the filter. Typical values: `0.05` to `0.1`. Must be in `[0.0, 1.0)`. |
80+
| top_k |||| int (default: `40`) | Controls the number of top tokens to consider. When multinomial sampling is active, defaults to `40` if not set. Set to `-1` to consider all tokens. |
7981
| repetition_penalty |||| float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
8082
| frequency_penalty |||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
8183
| presence_penalty |||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
82-
| seed |||| integer (default: `0`) | Random seed to use for the generation. |
84+
| seed |||| integer (default: random) | Random seed for generation in range `[0, 4294967295]`. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: `rng_seed` set in `generation_config.json` is not honoured for multinomial sampling — only a per-request seed is applied. |
8385

8486
#### Speculative decoding specific
8587

@@ -105,14 +107,12 @@ Note that below parameters are valid only for prompt lookup pipeline. Add `"prom
105107

106108

107109
#### Unsupported params from vLLM:
108-
- min_p
109110
- use_beam_search (**In OpenVINO Model Server just simply increase _best_of_ param to enable beam search**)
110111
- early_stopping
111112
- stop_token_ids
112113
- min_tokens
113114
- prompt_logprobs
114115
- detokenize
115-
- skip_special_tokens
116116
- spaces_between_special_tokens
117117
- logits_processors
118118
- truncate_prompt_tokens
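
The completions table now documents `min_p`, `skip_special_tokens`, and a `seed` that is random unless set per request. As a rough sketch of how a client might assemble such a request body, the helper below validates the documented ranges and omits `seed` and `top_k` when they are not given. The function name `build_completion_request` and the model name used in any call are hypothetical; only the field names and ranges come from the table.

```python
import json

MAX_SEED = 4294967295  # upper bound of the seed range stated in the table


def build_completion_request(prompt, model, seed=None, min_p=0.0,
                             top_k=None, skip_special_tokens=True):
    """Return a JSON request body for the completions endpoint.

    Hypothetical client-side helper: it only builds the payload and does
    not send it anywhere.
    """
    if seed is not None and not 0 <= seed <= MAX_SEED:
        raise ValueError("seed must be in [0, 4294967295]")
    if not 0.0 <= min_p < 1.0:
        raise ValueError("min_p must be in [0.0, 1.0)")
    payload = {
        "model": model,
        "prompt": prompt,
        "min_p": min_p,
        "skip_special_tokens": skip_special_tokens,
    }
    # Leaving seed out lets the server pick a random one (non-deterministic).
    if seed is not None:
        payload["seed"] = seed
    # Leaving top_k out keeps the server-side default.
    if top_k is not None:
        payload["top_k"] = top_k
    return json.dumps(payload)
```

Setting `seed` explicitly is what makes sampled output reproducible across requests, since the table notes that `rng_seed` from `generation_config.json` is not applied to multinomial sampling.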

docs/model_server_rest_api_responses.md

Lines changed: 4 additions & 2 deletions
@@ -105,6 +105,7 @@ curl http://localhost/v3/responses \
105105
| tool_choice ||| string or object (optional) | Controls which (if any) tool is called by the model. `none` means the model will not call any tool and instead generates a message. `auto` means the model can pick between generating a message or calling one or more tools. `required` means that model should call at least one tool. Specifying a particular function via `{"type": "function", "function": {"name": "my_function"}}` forces the model to call that tool. |
106106
| reasoning | ⚠️ || object (optional) | Configuration for reasoning/thinking mode. The `effort` field accepts `"low"`, `"medium"`, or `"high"` — any value enables thinking mode (`enable_thinking: true` is injected into chat template kwargs). The `summary` field is accepted but ignored. |
107107
| chat_template_kwargs ||| object (optional) | Additional keyword arguments passed to the chat template. When `reasoning` is also provided, `enable_thinking: true` is merged into these kwargs. |
108+
| skip_special_tokens ||| bool (default: `true`) | Whether to remove special tokens (e.g. `<\|endoftext\|>`, `<\|im_end\|>`) from the generated output. Set to `false` to include them, which is useful when the model uses special tokens to encode structured information (e.g. bounding boxes, reasoning markers). When `false`, any tool or reasoning parser configured on the endpoint is silently disabled for the request, so the raw token stream is returned. This option works with most detokenizers exported with OpenVINO Tokenizers 2024.5 or later, unless they are based on custom ops. |
108109
| stream_options ||| | Not supported in Responses API. Usage statistics are always included in the `response.completed` event. |
109110

110111
#### Beam search sampling specific
@@ -119,11 +120,12 @@ curl http://localhost/v3/responses \
119120
|-------|----------|----------|---------|-----|
120121
| temperature ||| float (default: `1.0`) | The value is used to modulate token probabilities for multinomial sampling. It enables multinomial sampling when set to `> 0.0`. |
121122
| top_p ||| float (default: `1.0`) | Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. |
122-
| top_k ||| int (default: all tokens) | Controls the number of top tokens to consider. Set to empty or -1 to consider all tokens. |
123+
| min_p ||| float (default: `0.0`) | Minimum probability threshold relative to the most likely token. Tokens with probability below `min_p` × the top token probability are filtered out. `0.0` (default) disables the filter. Typical values: `0.05` to `0.1`. Must be in `[0.0, 1.0)`. |
124+
| top_k ||| int (default: `40`) | Controls the number of top tokens to consider. When multinomial sampling is active, defaults to `40` if not set. Set to `-1` to consider all tokens. |
123125
| repetition_penalty ||| float (default: `1.0`) | Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > `1.0` encourage the model to use new tokens, while values < `1.0` encourage the model to repeat tokens. `1.0` means no penalty. |
124126
| frequency_penalty ||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim. |
125127
| presence_penalty ||| float (default: `0.0`) | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. |
126-
| seed ||| integer (default: `0`) | Random seed to use for the generation. |
128+
| seed ||| integer (default: random) | Random seed for generation in range `[0, 4294967295]`. Omit to use a random seed (non-deterministic). Set explicitly to get reproducible output. Note: `rng_seed` set in `generation_config.json` is not honoured for multinomial sampling — only a per-request seed is applied. |
127129

128130
#### Speculative decoding specific
129131

docs/parameters.md

Lines changed: 2 additions & 2 deletions
@@ -138,8 +138,8 @@ Task specific parameters for different tasks (text generation/image generation/e
138138
| `--max_prompt_len` | `integer` | Sets NPU specific property for maximum number of tokens in the prompt. |
139139
| `--kv_cache_precision` | `string` | Reduced kv cache precision to `u8` lowers the cache size consumption. Accepted values: `u8` or empty (default). |
140140
| `--model_distribution_policy` | `string` | TENSOR_PARALLEL distributes tensor to multiple sockets/devices and processes it in parallel. PIPELINE_PARALLEL distributes different tensors to process by each device. Accepted values: `TENSOR_PARALLEL`, `PIPELINE_PARALLEL` or empty (default). |
141-
| `--reasoning_parser` | `string` | Type of parser to use for reasoning content extraction from model output. Currently supported: [qwen3] |
142-
| `--tool_parser` | `string` | Type of parser to use for tool calls extraction from model output. Currently supported: [llama3, hermes3, phi4] |
141+
| `--reasoning_parser` | `string` | Type of parser to use for reasoning content extraction from model output. Currently supported: [qwen3, gptoss] |
142+
| `--tool_parser` | `string` | Type of parser to use for tool calls extraction from model output. Currently supported: [llama3, phi4, hermes3, mistral, qwen3coder, gptoss, devstral, lfm2] |
143143
| `--enable_tool_guided_generation` | `bool` | Enables enforcing tool schema during generation. Requires setting response parser. Default: false. |
144144

145145
### Image generation
