
Commit d98bad4

Merge branch 'main' into fix-deepspeed-ep-init
2 parents ac159e5 + 0323898 commit d98bad4

68 files changed

Lines changed: 2166 additions & 198 deletions


docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -769,6 +769,8 @@
769769
title: OLMoE
770770
- local: model_doc/olmo_hybrid
771771
title: OlmoHybrid
772+
- local: model_doc/openai_privacy_filter
773+
title: OpenAI Privacy Filter
772774
- local: model_doc/opt
773775
title: OPT
774776
- local: model_doc/pegasus

docs/source/en/continuous_batching.md

Lines changed: 15 additions & 0 deletions
@@ -124,6 +124,20 @@ Cancel a request with [`~ContinuousBatchingManager.cancel_request`].
124124
manager.cancel_request(request_id="my_request")
125125
```
126126

127+
### Per-request sampling parameters
128+
129+
Enable `per_request_processors` to apply `temperature`, `top_k`, and `top_p` independently to each request within the same forward pass. A single batch can then mix requests with different sampling parameters, for example creative, high-temperature outputs alongside precise, low-temperature ones.
130+
131+
```py
132+
cb_config = ContinuousBatchingConfig(per_request_processors=True)
133+
134+
# each request gets its own sampling parameters
135+
manager.add_request(input_ids=inputs_a, temperature=0.9, top_p=0.95)
136+
manager.add_request(input_ids=inputs_b, temperature=0.1, top_k=10)
137+
```
138+
139+
The logits processor for a parameter is only created at runtime if that parameter has a non-default value in [`GenerationConfig`]. For example, set `temperature` to a value other than `None` or `1` to support per-request temperature control. Individual requests can still use a temperature of `1` afterwards.
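To make the idea concrete, here is a minimal standard-library sketch of what applying different sampling parameters to each row of one batch means. It is an illustration only, not the continuous batching implementation: `filter_and_normalize` and `sample_batch` are hypothetical helpers, and the logits are made-up values.

```python
import math
import random


def filter_and_normalize(logits, temperature=1.0, top_k=None, top_p=None):
    """Return per-token probabilities after temperature, top-k, and top-p filtering."""
    # Temperature scaling: divide the logits before the softmax (temperature must be > 0).
    scaled = [x / temperature for x in logits]
    # Rank token ids from most to least likely and keep the top-k of them.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    keep = order[:top_k] if top_k is not None else order
    # Softmax over the kept tokens (subtract the max for numerical stability).
    peak = scaled[keep[0]]
    weights = {i: math.exp(scaled[i] - peak) for i in keep}
    total = sum(weights.values())
    probs = {i: w / total for i, w in weights.items()}
    if top_p is not None:
        # Nucleus filtering: keep the smallest prefix whose mass reaches top_p.
        kept, mass = {}, 0.0
        for i in keep:
            kept[i] = probs[i]
            mass += probs[i]
            if mass >= top_p:
                break
        probs = {i: p / mass for i, p in kept.items()}
    return [probs.get(i, 0.0) for i in range(len(logits))]


def sample_batch(batch_logits, batch_params, rng):
    """Sample one token per row, each row using its own sampling parameters."""
    tokens = []
    for logits, params in zip(batch_logits, batch_params):
        probs = filter_and_normalize(logits, **params)
        tokens.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return tokens


# Two requests share a batch (and, here, identical logits) but sample differently.
batch_logits = [[2.0, 1.0, 0.5, -1.0], [2.0, 1.0, 0.5, -1.0]]
batch_params = [
    {"temperature": 0.9, "top_p": 0.95},  # creative request
    {"temperature": 0.1, "top_k": 10},    # precise request
]
print(sample_batch(batch_logits, batch_params, random.Random(0)))
```

The low-temperature row concentrates almost all probability on the top token, while the high-temperature row keeps a wider nucleus, even though both rows pass through the same step.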
140+
127141
### Retrieving results
128142

129143
Iterate over the manager to receive results as they arrive.
@@ -174,6 +188,7 @@ By default, `num_blocks` and `max_batch_tokens` are inferred automatically from
174188
| Prefix caching | ↓ shared KV blocks | ✓ skips redundant prefill | ✓ TTFT |
175189
| Paged attention | ↓ no fragmentation | ✓ dynamic batch membership | |
176190
| Sliding window | ↓ bounded KV per layer | | |
191+
| Per-request processors | | ✓ mixed sampling params per batch | |
177192

178193
```py
179194
from transformers.generation import ContinuousBatchingConfig

docs/source/en/model_doc/olmo.md

Lines changed: 5 additions & 0 deletions
@@ -127,3 +127,8 @@ print(tokenizer.decode(output[0]))
127127

128128
[[autodoc]] OlmoForCausalLM
129129
- forward
130+
131+
## OlmoForSequenceClassification
132+
133+
[[autodoc]] OlmoForSequenceClassification
134+
- forward

docs/source/en/model_doc/olmo2.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -136,3 +136,8 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
136136

137137
[[autodoc]] Olmo2ForCausalLM
138138
- forward
139+
140+
## Olmo2ForSequenceClassification
141+
142+
[[autodoc]] Olmo2ForSequenceClassification
143+
- forward

docs/source/en/model_doc/olmo3.md

Lines changed: 5 additions & 0 deletions
@@ -129,6 +129,11 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
129129

130130
[[autodoc]] Olmo3ForCausalLM
131131

132+
## Olmo3ForSequenceClassification
133+
134+
[[autodoc]] Olmo3ForSequenceClassification
135+
- forward
136+
132137
## Olmo3Model
133138

134139
[[autodoc]] Olmo3Model
Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
1+
<!--Copyright 2026 The HuggingFace Team. All rights reserved.
2+
3+
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4+
the License. You may obtain a copy of the License at
5+
6+
http://www.apache.org/licenses/LICENSE-2.0
7+
8+
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9+
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10+
specific language governing permissions and limitations under the License.
11+
12+
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13+
rendered properly in your Markdown viewer.
14+
15+
-->
16+
*This model was released on 2026-04-22 and added to Hugging Face Transformers on 2026-04-22.*
17+
18+
<div style="float: right;">
19+
<div class="flex flex-wrap space-x-1">
20+
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
21+
</div>
22+
</div>
23+
24+
# OpenAI Privacy Filter
25+
26+
OpenAI Privacy Filter is a bidirectional token-classification model for detecting and masking personally identifiable information (PII) in text. It is intended for high-throughput data sanitization workflows where teams need a fast, context-aware, and tunable model that they can run on-premises.
27+
28+
OpenAI Privacy Filter is first pretrained autoregressively, producing a checkpoint with an architecture similar to gpt-oss but smaller. That checkpoint is then converted into a bidirectional token classifier over a privacy label taxonomy and post-trained with a supervised classification loss. (For architecture details about gpt-oss, see the gpt-oss model card.) Instead of generating text token by token, the model labels an input sequence in a single forward pass, then decodes coherent spans with a constrained Viterbi procedure. For each input token, the model predicts a probability distribution over the label taxonomy, which consists of 8 output categories.
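The sketch below illustrates the general idea of constrained Viterbi decoding over per-token label probabilities. It is not the model's actual decoder: the three-label BIO scheme, the `allowed` transition rule, and the probabilities are all assumptions chosen for illustration, and the real taxonomy and constraints may differ.

```python
import math

# Hypothetical 3-label BIO scheme, used only for this illustration.
LABELS = ["O", "B-NAME", "I-NAME"]


def allowed(prev, curr):
    """BIO constraint: I-X may only follow B-X or I-X."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True


def constrained_viterbi(log_probs, labels=LABELS):
    """Return the highest-scoring label path that respects `allowed`."""
    n = len(labels)
    # A sequence may not start with an I- label.
    score = [
        lp if not labels[j].startswith("I-") else -math.inf
        for j, lp in enumerate(log_probs[0])
    ]
    backpointers = []
    for step in log_probs[1:]:
        new_score, pointers = [], []
        for j in range(n):
            # Best previous label among those allowed to precede label j.
            best, best_i = max(
                (score[i], i) for i in range(n) if allowed(labels[i], labels[j])
            )
            new_score.append(best + step[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path back from the best final label.
    j = max(range(n), key=lambda k: score[k])
    path = [j]
    for pointers in reversed(backpointers):
        j = pointers[j]
        path.append(j)
    return [labels[j] for j in reversed(path)]


# Made-up probabilities for the five tokens of "My name is Alice Smith".
probs = [
    [0.98, 0.01, 0.01],
    [0.98, 0.01, 0.01],
    [0.98, 0.01, 0.01],
    [0.50, 0.40, 0.10],
    [0.10, 0.20, 0.70],
]
log_probs = [[math.log(p) for p in row] for row in probs]

# Per-token argmax can produce an invalid sequence (I-NAME directly after O).
greedy = [LABELS[max(range(len(row)), key=lambda k: row[k])] for row in probs]
print("greedy: ", greedy)
print("viterbi:", constrained_viterbi(log_probs))
```

Per-token argmax, as in the `AutoModelForTokenClassification` example, scores each token in isolation. A constrained decoder instead picks the best label sequence as a whole, which is what keeps the predicted spans coherent.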
29+
30+
Highlights:
31+
32+
- Permissive Apache 2.0 license: ideal for experimentation, customization, and commercial deployment.
33+
- Small size: runs in a web browser or on a laptop, with 1.5B total parameters and 50M active parameters.
34+
- Fine-tunable: adapt the model to specific data distributions through easy, data-efficient fine-tuning.
35+
- Long-context: 128,000-token context window enables processing long text with high throughput and no chunking.
36+
- Runtime control: configure precision/recall tradeoffs and detected span lengths through preset operating points.
37+
38+
The example below demonstrates how to detect privacy-sensitive tokens with [`Pipeline`] or the [`AutoModelForTokenClassification`] class.
39+
40+
<hfoptions id="usage">
41+
<hfoption id="Pipeline">
42+
43+
```py
44+
from transformers import pipeline
45+
46+
classifier = pipeline(
47+
task="token-classification",
48+
model="openai/privacy-filter",
49+
)
50+
classifier("My name is Alice Smith")
51+
```
52+
53+
</hfoption>
54+
<hfoption id="AutoModelForTokenClassification">
55+
56+
```py
57+
import torch
58+
from transformers import AutoModelForTokenClassification, AutoTokenizer
59+
60+
tokenizer = AutoTokenizer.from_pretrained("openai/privacy-filter")
61+
model = AutoModelForTokenClassification.from_pretrained("openai/privacy-filter", device_map="auto")
62+
63+
inputs = tokenizer("My name is Alice Smith", return_tensors="pt").to(model.device)
64+
65+
with torch.no_grad():
66+
outputs = model(**inputs)
67+
68+
predicted_token_class_ids = outputs.logits.argmax(dim=-1)
69+
predicted_token_classes = [model.config.id2label[token_id.item()] for token_id in predicted_token_class_ids[0]]
70+
print(predicted_token_classes)
71+
```
72+
73+
</hfoption>
74+
</hfoptions>
75+
76+
- Developed by: OpenAI
77+
- Funded by: OpenAI
78+
- Shared by: OpenAI
79+
- Model type: Bidirectional token classification model for privacy span detection
80+
- Language(s): Primarily English; selected multilingual robustness evaluation reported
81+
- License: [Apache 2.0](LICENSE)
82+
83+
- Source repository: https://github.com/openai/privacy-filter
84+
- Model weights: https://huggingface.co/openai/privacy-filter
85+
- Demo: https://huggingface.co/spaces/openai/privacy-filter
86+
87+
## Resources
88+
89+
- [Token classification task guide](../tasks/token_classification)
90+
91+
## OpenAIPrivacyFilterConfig
92+
93+
[[autodoc]] OpenAIPrivacyFilterConfig
94+
95+
## OpenAIPrivacyFilterModel
96+
97+
[[autodoc]] OpenAIPrivacyFilterModel
98+
- forward
99+
100+
## OpenAIPrivacyFilterForTokenClassification
101+
102+
[[autodoc]] OpenAIPrivacyFilterForTokenClassification
103+
- forward

docs/source/en/quantization/torchao.md

Lines changed: 5 additions & 7 deletions
@@ -328,11 +328,9 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
328328
import torch
329329
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
330330
from torchao.quantization import Int4WeightOnlyConfig
331-
from torchao.dtypes import Int4XPULayout
332-
from torchao.quantization.quant_primitives import ZeroPointDomain
333331

334332

335-
quant_config = Int4WeightOnlyConfig(group_size=128, layout=Int4XPULayout(), zero_point_domain=ZeroPointDomain.INT, int4_packing_format="plain_int32")
333+
quant_config = Int4WeightOnlyConfig(group_size=128, int4_packing_format="plain_int32")
336334
quantization_config = TorchAoConfig(quant_type=quant_config)
337335

338336
# Load and quantize the model
@@ -345,7 +343,7 @@ quantized_model = AutoModelForCausalLM.from_pretrained(
345343

346344
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
347345
input_text = "What are we having for dinner?"
348-
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device, quantized_model.dtype)
346+
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device).to(quantized_model.dtype)
349347

350348
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
351349
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
@@ -395,9 +393,9 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
395393
```py
396394
import torch
397395
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
398-
from torchao.prototype.int4_opaque_tensor import Int4WeightOnlyOpaqueTensorConfig
396+
from torchao.prototype.quantization.int4 import PrototypeInt4WeightOnlyConfig
399397

400-
quantization_config = TorchAoConfig(Int4WeightOnlyOpaqueTensorConfig(group_size=128))
398+
quantization_config = TorchAoConfig(PrototypeInt4WeightOnlyConfig(group_size=128, int4_choose_qparams_algorithm="tinygemm"))
401399

402400
# Load and quantize the model
403401
quantized_model = AutoModelForCausalLM.from_pretrained(
@@ -409,7 +407,7 @@ quantized_model = AutoModelForCausalLM.from_pretrained(
409407

410408
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
411409
input_text = "What are we having for dinner?"
412-
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device, quantized_model.dtype)
410+
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device).to(quantized_model.dtype)
413411

414412
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
415413
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")

docs/source/en/serve-cli/serving.md

Lines changed: 84 additions & 11 deletions
@@ -456,7 +456,7 @@ data: {"id":"f47ac10b-58cc-4372-a567-0e02b2c3d479","choices":[{"delta":{"content
456456

457457
### Audio-based completions
458458

459-
Multimodal models like [Gemma 4](https://huggingface.co/google/gemma-4-E2B-it) and [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-3B) accept audio input using the OpenAI `input_audio` content type. The audio must be base64-encoded and the format (`mp3` or `wav`) must be specified.
459+
Multimodal models like [Gemma 4](https://huggingface.co/google/gemma-4-E2B-it) and [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-3B) accept audio input through the OpenAI `input_audio` content type. Base64-encode the audio and specify the format (`mp3` or `wav`).
460460

461461
<hfoptions id="audio-completions">
462462
<hfoption id="huggingface_hub">
@@ -695,7 +695,7 @@ data: {"id":"cb997e1d-98b9-414a-be89-1880288610ef","choices":[{"delta":{"content
695695
> [!WARNING]
696696
> The `audio_url` content type is an extension not part of the OpenAI standard and may change in future versions.
697697
698-
As a convenience, audio can also be passed by URL using the `audio_url` content type, avoiding the need for base64 encoding.
698+
You can also pass audio by URL with the `audio_url` content type to skip base64 encoding.
699699

700700
```python
701701
completion = client.chat.completions.create(
@@ -717,7 +717,7 @@ completion = client.chat.completions.create(
717717
> [!WARNING]
718718
> The `video_url` content type is an extension not part of the OpenAI standard and may change in future versions.
719719
720-
Video input is supported using the `video_url` content type. If the model supports audio (e.g. Gemma 4, Qwen2.5-Omni), the audio track is automatically extracted from the video and processed alongside the visual frames.
720+
Use the `video_url` content type for video input. If the model supports audio (e.g. Gemma 4, Qwen2.5-Omni), the server extracts the audio track from the video and processes it with the visual frames.
721721

722722
> [!TIP]
723723
> Video processing requires [torchcodec](https://github.com/pytorch/torchcodec). Install it with `pip install torchcodec`.
@@ -934,7 +934,7 @@ data: {"id":"cb997e1d-98b9-414a-be89-1880288610ef","choices":[{"delta":{"content
934934
</hfoption>
935935
</hfoptions>
936936

937-
### Multi-turn conversations
937+
### Multi-turn conversations[[completions]]
938938

939939
To have a multi-turn conversation, include the full conversation history in the `messages` list with alternating `user` and `assistant` roles. Like all OpenAI-compatible servers, the API is stateless, so every request must contain the complete conversation history.
940940

@@ -954,7 +954,7 @@ completion = client.chat.completions.create(
954954
print(completion.choices[0].message.content)
955955
```
956956

957-
The follow-up question "How many people live there?" relies on the prior context, and the model answers about Paris accordingly.
957+
The follow-up question "How many people live there?" relies on the prior context, so the model answers about Paris.
958958

959959
```
960960
As of 2021, the population of Paris is approximately 2.2 million people.
@@ -1466,7 +1466,7 @@ data: {"content_index":0,"delta":"This ","item_id":"msg_a1b2c3d4","output_index"
14661466
> [!WARNING]
14671467
> The `audio_url` content type is an extension not part of the OpenAI standard and may change in future versions.
14681468
1469-
As a convenience, audio can also be passed by URL using the `audio_url` content type, avoiding the need for base64 encoding.
1469+
You can also pass audio by URL with the `audio_url` content type to skip base64 encoding.
14701470

14711471
```python
14721472
response = client.responses.create(
@@ -1621,7 +1621,7 @@ data: {"content_index":0,"delta":"Based ","item_id":"msg_b2c3d4e5","output_index
16211621
</hfoption>
16221622
</hfoptions>
16231623

1624-
### Multi-turn conversations
1624+
### Multi-turn conversations[[responses]]
16251625

16261626
For multi-turn conversations, pass a list of messages with `role` keys in the `input` field. Like all OpenAI-compatible servers, the API is stateless, so every request must contain the complete conversation history.
16271627

@@ -1643,7 +1643,7 @@ response = client.responses.create(
16431643
print(response.output[0].content[0].text)
16441644
```
16451645

1646-
The follow-up question "How many people live there?" relies on the prior context, and the model answers about Paris accordingly.
1646+
The follow-up question "How many people live there?" relies on the prior context, so the model answers about Paris.
16471647

16481648
```
16491649
As of 2021, Paris has a population of approximately 2.8 million people.
@@ -1734,15 +1734,15 @@ The stream ends with exactly one terminal event, `ready` (success) or `error` (f
17341734

17351735
## Timeout
17361736

1737-
`transformers serve` supports different requests by different models. Each model loads on demand and stays in GPU memory. Models unload automatically after 300 seconds of inactivity to free up GPU memory. Set `--model-timeout` to a different value in seconds, or `-1` to disable unloading entirely.
1737+
`transformers serve` handles requests for any model. Each model loads on demand and stays in GPU memory. Models unload automatically after 300 seconds of inactivity to free GPU memory. Set `--model-timeout` to a different value in seconds, or `-1` to disable unloading.
17381738

17391739
```shell
17401740
transformers serve --model-timeout 400
17411741
```
17421742

17431743
### Loading examples
17441744

1745-
See the example responses below for a freshly downloaded model, a model loaded from your local cache (skips the download stage), and a model that already exists in memory.
1745+
The examples below show responses for a freshly downloaded model, a model loaded from your local cache (skips the download stage), and a model already in memory.
17461746

17471747
<hfoptions id="load-model-examples">
17481748
<hfoption id="fresh load">
@@ -1784,7 +1784,7 @@ data: {"status": "ready", "model": "org/model@main", "cached": true}
17841784
The `transformers serve` server supports OpenAI-style function calling. Models trained for tool-use generate structured function calls that your application executes.
17851785

17861786
> [!NOTE]
1787-
> Tool calling is currently limited to the Qwen model family.
1787+
> Tool calling works with any model whose tokenizer declares tool call tokens. Qwen and Gemma 4 work out of the box. Open an [issue](https://github.com/huggingface/transformers/issues/new/choose) to request support for a specific model.
17881788
17891789
Define tools as a list of function specifications following the OpenAI format.
17901790

@@ -1846,6 +1846,79 @@ for event in response:
18461846
print(event)
18471847
```
18481848

1849+
### Multi-turn tool calling
1850+
1851+
After the model returns a tool call, execute the function locally, then send the result back in a follow-up request to get the model's final answer. The pattern differs slightly between the two APIs. See the [OpenAI function calling guide](https://developers.openai.com/api/docs/guides/function-calling?api-mode=chat) for the full spec.
1852+
1853+
The examples below reuse the `tools` list defined above.
1854+
1855+
<hfoptions id="multi-turn-tool-calling">
1856+
<hfoption id="v1/chat/completions">
1857+
1858+
Pass the tool result as a `role: "tool"` message with the matching `tool_call_id`.
1859+
1860+
```py
1861+
# Model returns a tool call
1862+
messages = [{"role": "user", "content": "What's the weather like in San Francisco?"}]
1863+
response = client.chat.completions.create(
1864+
model="Qwen/Qwen2.5-7B-Instruct",
1865+
messages=messages,
1866+
tools=tools,
1867+
)
1868+
assistant_message = response.choices[0].message
1869+
1870+
# Execute the tool locally
1871+
tool_call = assistant_message.tool_calls[0]
1872+
result = {"temperature": 22, "condition": "sunny"} # your actual function call here
1873+
1874+
# Send the tool result back
1875+
messages.append(assistant_message)
1876+
messages.append({
1877+
"role": "tool",
1878+
"tool_call_id": tool_call.id,
1879+
"content": json.dumps(result),
1880+
})
1881+
final_response = client.chat.completions.create(
1882+
model="Qwen/Qwen2.5-7B-Instruct",
1883+
messages=messages,
1884+
tools=tools,
1885+
)
1886+
print(final_response.choices[0].message.content)
1887+
```
1888+
1889+
</hfoption>
1890+
<hfoption id="v1/responses">
1891+
1892+
Pass the tool result as a `function_call_output` item in the `input` list of the follow-up request.
1893+
1894+
```py
1895+
import json  # needed for json.dumps below

user_message = {"role": "user", "content": "What's the weather like in San Francisco?"}
1896+
response = client.responses.create(
1897+
model="Qwen/Qwen2.5-7B-Instruct",
1898+
input=[user_message],
1899+
tools=tools,
1900+
stream=False,
1901+
)
1902+
tool_call = next(item for item in response.output if item.type == "function_call")
1903+
1904+
result = {"temperature": 22, "condition": "sunny"}
1905+
1906+
final_response = client.responses.create(
1907+
model="Qwen/Qwen2.5-7B-Instruct",
1908+
input=[
1909+
user_message,
1910+
tool_call.model_dump(exclude_none=True),
1911+
{"type": "function_call_output", "call_id": tool_call.call_id, "output": json.dumps(result)},
1912+
],
1913+
tools=tools,
1914+
stream=False,
1915+
)
1916+
print(final_response.output_text)
1917+
```
1918+
1919+
</hfoption>
1920+
</hfoptions>
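Both examples above stub out the tool result. As a rough sketch of the "execute the function locally" step for the chat completions shape, the helper below looks up the requested function by name and parses its JSON arguments. `get_weather`, its `location` parameter, `TOOL_REGISTRY`, and `execute_tool_call` are hypothetical names for illustration, and `fake_call` only mimics the shape of an OpenAI SDK tool call so the snippet runs without a server.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> dict:
    """Hypothetical tool implementation; returns canned data for illustration."""
    return {"location": location, "temperature": 22, "condition": "sunny"}


# Hypothetical registry mapping tool names to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def execute_tool_call(tool_call) -> str:
    """Run the function named in a tool call and return its result as a JSON string."""
    name = tool_call.function.name
    try:
        # The arguments arrive as a JSON-encoded string.
        arguments = json.loads(tool_call.function.arguments or "{}")
    except json.JSONDecodeError as error:
        return json.dumps({"error": f"invalid arguments: {error}"})
    function = TOOL_REGISTRY.get(name)
    if function is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    return json.dumps(function(**arguments))


# Stand-in for `assistant_message.tool_calls[0]` from a real response.
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(
        name="get_weather",
        arguments='{"location": "San Francisco"}',
    ),
)

tool_message = {
    "role": "tool",
    "tool_call_id": fake_call.id,
    "content": execute_tool_call(fake_call),
}
print(tool_message)
```

Returning errors as JSON, instead of raising, lets the model see what went wrong in the follow-up request. For the Responses API, the same result string would go in the `output` field of a `function_call_output` item.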
1921+
18491922
## Port forwarding
18501923

18511924
Port forwarding lets you serve models from a remote server. Make sure you have SSH access to the server, then run this command on your local machine.
