evalstate
diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
Lines changed: 2 additions & 0 deletions
diff --git a/docs/source/en/continuous_batching.md b/docs/source/en/continuous_batching.md
Lines changed: 15 additions & 0 deletions
diff --git a/docs/source/en/model_doc/olmo.md b/docs/source/en/model_doc/olmo.md
Lines changed: 5 additions & 0 deletions
diff --git a/docs/source/en/model_doc/olmo2.md b/docs/source/en/model_doc/olmo2.md
Lines changed: 5 additions & 0 deletions
diff --git a/docs/source/en/model_doc/olmo3.md b/docs/source/en/model_doc/olmo3.md
Lines changed: 5 additions & 0 deletions
diff --git a/docs/source/en/model_doc/openai_privacy_filter.md b/docs/source/en/model_doc/openai_privacy_filter.md
Lines changed: 103 additions & 0 deletions
diff --git a/docs/source/en/quantization/torchao.md b/docs/source/en/quantization/torchao.md
Lines changed: 5 additions & 7 deletions
diff --git a/docs/source/en/serve-cli/serving.md b/docs/source/en/serve-cli/serving.md
Lines changed: 84 additions & 11 deletions
@@ -769,6 +769,8 @@
         title: OLMoE
       - local: model_doc/olmo_hybrid
         title: OlmoHybrid
+      - local: model_doc/openai_privacy_filter
+        title: OpenAI Privacy Filter
       - local: model_doc/opt
         title: OPT
       - local: model_doc/pegasus
 
@@ -124,6 +124,20 @@ Cancel a request with [`~ContinuousBatchingManager.cancel_request`].
 manager.cancel_request(request_id="my_request")
 ```
 
+### Per-request sampling parameters
+
+Enable `per_request_processors` to apply `temperature`, `top_k`, and `top_p` independently to each request within the same forward pass. This lets one batch mix different sampling settings, for example creative, high-temperature outputs alongside precise, low-temperature ones.
+
+```py
+cb_config = ContinuousBatchingConfig(per_request_processors=True)
+
+# each request gets its own sampling parameters
+manager.add_request(input_ids=inputs_a, temperature=0.9, top_p=0.95)
+manager.add_request(input_ids=inputs_b, temperature=0.1, top_k=10)
+```
+
+A logits processor is only created at runtime when its parameter in [`GenerationConfig`] is set to a non-default value. For example, set `temperature` to a value other than `None` or `1` to enable per-request temperature control. Individual requests can still use a temperature of `1` afterwards.
+
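As a rough illustration of what "per-request" means in the section added above, the sketch below applies a different temperature and top-k to each row of a toy batch of logits before the softmax. It is pure Python and is not the `transformers` implementation; `apply_per_request` and the toy logits are hypothetical names introduced only for this example.

```python
import math


def apply_per_request(logits_batch, temperatures, top_ks):
    """Return one probability distribution per request.

    Hypothetical helper for illustration only (not part of transformers).
    Row i is divided by temperatures[i]; if top_ks[i] is not None, only the
    top_ks[i] highest logits keep a non-zero probability (ties at the
    threshold are all kept).
    """
    probs_batch = []
    for logits, temperature, top_k in zip(logits_batch, temperatures, top_ks):
        scaled = [x / temperature for x in logits]
        if top_k is not None:
            # Mask everything below the top_k-th largest logit with -inf.
            threshold = sorted(scaled, reverse=True)[top_k - 1]
            scaled = [x if x >= threshold else -math.inf for x in scaled]
        # Numerically stable softmax over the processed logits.
        peak = max(scaled)
        exps = [math.exp(x - peak) for x in scaled]
        total = sum(exps)
        probs_batch.append([e / total for e in exps])
    return probs_batch


# Two requests share identical logits but ask for different sampling settings.
logits_batch = [[2.0, 1.0, 0.5, 0.1], [2.0, 1.0, 0.5, 0.1]]
probs = apply_per_request(logits_batch, temperatures=[0.9, 0.1], top_ks=[None, 2])
```

The low-temperature request ends up with a much sharper distribution than the high-temperature one, even though both rows were processed in the same call.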
 ### Retrieving results
 
 Iterate over the manager to receive results as they arrive.
@@ -174,6 +188,7 @@ By default, `num_blocks` and `max_batch_tokens` are inferred automatically from
 | Prefix caching | ↓ shared KV blocks | ✓ skips redundant prefill | ✓ TTFT |
 | Paged attention | ↓ no fragmentation | ✓ dynamic batch membership | |
 | Sliding window | ↓ bounded KV per layer | | |
+| Per-request processors | | ✓ mixed sampling params per batch | |
 
 ```py
 from transformers.generation import ContinuousBatchingConfig
 
@@ -127,3 +127,8 @@ print(tokenizer.decode(output[0]))
 
 [[autodoc]] OlmoForCausalLM
     - forward
+
+## OlmoForSequenceClassification
+
+[[autodoc]] OlmoForSequenceClassification
+    - forward
@@ -136,3 +136,8 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
 
 [[autodoc]] Olmo2ForCausalLM
     - forward
+
+## Olmo2ForSequenceClassification
+
+[[autodoc]] Olmo2ForSequenceClassification
+    - forward
@@ -129,6 +129,11 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
 
 [[autodoc]] Olmo3ForCausalLM
 
+## Olmo3ForSequenceClassification
+
+[[autodoc]] Olmo3ForSequenceClassification
+    - forward
+
 ## Olmo3Model
 
 [[autodoc]] Olmo3Model
 
@@ -0,0 +1,103 @@
+<!--Copyright 2026 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+*This model was released on 2026-04-22 and added to Hugging Face Transformers on 2026-04-22.*
+
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+    </div>
+</div>
+
+# OpenAI Privacy Filter
+
+OpenAI Privacy Filter is a bidirectional token-classification model for personally identifiable information (PII) detection and masking in text. It is intended for high-throughput data sanitization workflows where teams need a fast, context-aware, and tunable model that they can run on-premises.
+
+OpenAI Privacy Filter is first pretrained autoregressively, producing a checkpoint with an architecture similar to gpt-oss but smaller. That checkpoint is then converted into a bidirectional token classifier over a privacy label taxonomy and post-trained with a supervised classification loss (see the gpt-oss model card for architecture details). Instead of generating text token by token, the model labels an input sequence in a single forward pass, then decodes coherent spans with a constrained Viterbi procedure. For each input token, it predicts a probability distribution over the label taxonomy, which consists of 8 output categories.
+
+Highlights:
+
+- Permissive Apache 2.0 license: ideal for experimentation, customization, and commercial deployment.
+- Small size: runs in a web browser or on a laptop, with 1.5B total parameters and 50M active parameters.
+- Fine-tunable: adapt the model to specific data distributions through easy, data-efficient fine-tuning.
+- Long context: a 128,000-token context window enables processing long text with high throughput and no chunking.
+- Runtime control: configure precision/recall tradeoffs and detected span lengths through preset operating points.
+
+The example below demonstrates how to detect privacy-sensitive tokens with [`Pipeline`] or the [`AutoModelForTokenClassification`] class.
+
+<hfoptions id="usage">
+<hfoption id="Pipeline">
+
+```py
+from transformers import pipeline
+
+classifier = pipeline(
+    task="token-classification",
+    model="openai/privacy-filter",
+)
+classifier("My name is Alice Smith")
+```
+
+</hfoption>
+<hfoption id="AutoModelForTokenClassification">
+
+```py
+import torch
+from transformers import AutoModelForTokenClassification, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("openai/privacy-filter")
+model = AutoModelForTokenClassification.from_pretrained("openai/privacy-filter", device_map="auto")
+
+inputs = tokenizer("My name is Alice Smith", return_tensors="pt").to(model.device)
+
+with torch.no_grad():
+    outputs = model(**inputs)
+
+predicted_token_class_ids = outputs.logits.argmax(dim=-1)
+predicted_token_classes = [model.config.id2label[token_id.item()] for token_id in predicted_token_class_ids[0]]
+print(predicted_token_classes)
+```
+
+</hfoption>
+</hfoptions>
+
+- Developed by: OpenAI
+- Funded by: OpenAI
+- Shared by: OpenAI
+- Model type: Bidirectional token classification model for privacy span detection
+- Language(s): Primarily English; selected multilingual robustness evaluation reported
+- License: [Apache 2.0](LICENSE)
+
+- Source repository: https://github.com/openai/privacy-filter
+- Model weights: https://huggingface.co/openai/privacy-filter
+- Demo: https://huggingface.co/spaces/openai/privacy-filter
+
+## Resources
+
+- [Token classification task guide](../tasks/token_classification)
+
+## OpenAIPrivacyFilterConfig
+
+[[autodoc]] OpenAIPrivacyFilterConfig
+
+## OpenAIPrivacyFilterModel
+
+[[autodoc]] OpenAIPrivacyFilterModel
+    - forward
+
+## OpenAIPrivacyFilterForTokenClassification
+
+[[autodoc]] OpenAIPrivacyFilterForTokenClassification
+    - forward
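The model doc added above mentions decoding coherent spans with a constrained Viterbi procedure. The sketch below shows the general idea on a toy example, under loudly stated assumptions: a BIO tagging scheme with a single hypothetical `NAME` type, and the only constraint being that `I-NAME` must follow `B-NAME` or `I-NAME`. The model's actual label taxonomy, constraints, and decoder are not described in this diff and may differ.

```python
import math

# Hypothetical toy taxonomy; the real model uses its own label set.
LABELS = ["O", "B-NAME", "I-NAME"]


def allowed(prev, curr):
    """Assumed BIO constraint: I-X may only follow B-X or I-X of the same type."""
    if curr.startswith("I-"):
        entity = curr[2:]
        return prev in (f"B-{entity}", f"I-{entity}")
    return True


def viterbi(log_probs):
    """Return the best valid label path; log_probs[t][k] scores LABELS[k] at token t."""
    n = len(LABELS)
    # A sequence may not start with an I- label (prev is None).
    best = [lp if allowed(None, LABELS[k]) else -math.inf
            for k, lp in enumerate(log_probs[0])]
    backpointers = []
    for row in log_probs[1:]:
        new_best, pointers = [], []
        for k in range(n):
            # Best predecessor among labels allowed to precede LABELS[k].
            score, j = max((best[j], j) for j in range(n)
                           if allowed(LABELS[j], LABELS[k]))
            new_best.append(score + row[k])
            pointers.append(j)
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final label.
    k = max(range(n), key=lambda i: best[i])
    path = [k]
    for pointers in reversed(backpointers):
        k = pointers[k]
        path.append(k)
    return [LABELS[k] for k in reversed(path)]


# Toy per-token probabilities for "My name is Alice Smith" (made-up numbers).
token_probs = [
    [0.90, 0.05, 0.05],
    [0.90, 0.05, 0.05],
    [0.90, 0.05, 0.05],
    [0.10, 0.40, 0.50],
    [0.10, 0.20, 0.70],
]
log_probs = [[math.log(p) for p in row] for row in token_probs]

greedy = [LABELS[max(range(len(LABELS)), key=lambda k: row[k])] for row in token_probs]
decoded = viterbi(log_probs)
```

Here the per-token argmax (`greedy`) yields an `I-NAME` directly after `O`, which is not a valid span, while the constrained decoder opens the span with `B-NAME`.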
@@ -328,11 +328,9 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
 import torch
 from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
 from torchao.quantization import Int4WeightOnlyConfig
-from torchao.dtypes import Int4XPULayout
-from torchao.quantization.quant_primitives import ZeroPointDomain
 
 
-quant_config = Int4WeightOnlyConfig(group_size=128, layout=Int4XPULayout(), zero_point_domain=ZeroPointDomain.INT, int4_packing_format="plain_int32")
+quant_config = Int4WeightOnlyConfig(group_size=128, int4_packing_format="plain_int32")
 quantization_config = TorchAoConfig(quant_type=quant_config)
 
 # Load and quantize the model
@@ -345,7 +343,7 @@ quantized_model = AutoModelForCausalLM.from_pretrained(
 
 tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
 input_text = "What are we having for dinner?"
-input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device, quantized_model.dtype)
+input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device).to(quantized_model.dtype)
 
 # auto-compile the quantized model with `cache_implementation="static"` to get speed up
 output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
@@ -395,9 +393,9 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```py
 import torch
 from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
-from torchao.prototype.int4_opaque_tensor import Int4WeightOnlyOpaqueTensorConfig
+from torchao.prototype.quantization.int4 import PrototypeInt4WeightOnlyConfig
 
-quantization_config = TorchAoConfig(Int4WeightOnlyOpaqueTensorConfig(group_size=128))
+quantization_config = TorchAoConfig(PrototypeInt4WeightOnlyConfig(group_size=128, int4_choose_qparams_algorithm="tinygemm"))
 
 # Load and quantize the model
 quantized_model = AutoModelForCausalLM.from_pretrained(
@@ -409,7 +407,7 @@ quantized_model = AutoModelForCausalLM.from_pretrained(
 
 tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
 input_text = "What are we having for dinner?"
-input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device, quantized_model.dtype)
+input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device).to(quantized_model.dtype)
 
 # auto-compile the quantized model with `cache_implementation="static"` to get speed up
 output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
 
@@ -456,7 +456,7 @@ data: {"id":"f47ac10b-58cc-4372-a567-0e02b2c3d479","choices":[{"delta":{"content
 
 ### Audio-based completions
 
-Multimodal models like [Gemma 4](https://huggingface.co/google/gemma-4-E2B-it) and [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-3B) accept audio input using the OpenAI `input_audio` content type. The audio must be base64-encoded and the format (`mp3` or `wav`) must be specified.
+Multimodal models like [Gemma 4](https://huggingface.co/google/gemma-4-E2B-it) and [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-3B) accept audio input through the OpenAI `input_audio` content type. Base64-encode the audio and specify the format (`mp3` or `wav`).
 
 <hfoptions id="audio-completions">
 <hfoption id="huggingface_hub">
@@ -695,7 +695,7 @@ data: {"id":"cb997e1d-98b9-414a-be89-1880288610ef","choices":[{"delta":{"content
 > [!WARNING]
 > The `audio_url` content type is an extension not part of the OpenAI standard and may change in future versions.
 
-As a convenience, audio can also be passed by URL using the `audio_url` content type, avoiding the need for base64 encoding.
+You can also pass audio by URL with the `audio_url` content type to skip base64 encoding.
 
 ```python
 completion = client.chat.completions.create(
@@ -717,7 +717,7 @@ completion = client.chat.completions.create(
 > [!WARNING]
 > The `video_url` content type is an extension not part of the OpenAI standard and may change in future versions.
 
-Video input is supported using the `video_url` content type. If the model supports audio (e.g. Gemma 4, Qwen2.5-Omni), the audio track is automatically extracted from the video and processed alongside the visual frames.
+Use the `video_url` content type for video input. If the model supports audio (e.g. Gemma 4, Qwen2.5-Omni), the server extracts the audio track from the video and processes it with the visual frames.
 
 > [!TIP]
 > Video processing requires [torchcodec](https://github.com/pytorch/torchcodec). Install it with `pip install torchcodec`.
@@ -934,7 +934,7 @@ data: {"id":"cb997e1d-98b9-414a-be89-1880288610ef","choices":[{"delta":{"content
 </hfoption>
 </hfoptions>
 
-### Multi-turn conversations
+### Multi-turn conversations[[completions]]
 
 To have a multi-turn conversation, include the full conversation history in the `messages` list with alternating `user` and `assistant` roles. Like all OpenAI-compatible servers, the API is stateless, so every request must contain the complete conversation history.
 
@@ -954,7 +954,7 @@ completion = client.chat.completions.create(
 print(completion.choices[0].message.content)
 ```
 
-The follow-up question "How many people live there?" relies on the prior context, and the model answers about Paris accordingly.
+The follow-up question "How many people live there?" relies on the prior context, so the model answers about Paris.
 
 ```
 As of 2021, the population of Paris is approximately 2.2 million people.
@@ -1466,7 +1466,7 @@ data: {"content_index":0,"delta":"This ","item_id":"msg_a1b2c3d4","output_index"
 > [!WARNING]
 > The `audio_url` content type is an extension not part of the OpenAI standard and may change in future versions.
 
-As a convenience, audio can also be passed by URL using the `audio_url` content type, avoiding the need for base64 encoding.
+You can also pass audio by URL with the `audio_url` content type to skip base64 encoding.
 
 ```python
 response = client.responses.create(
@@ -1621,7 +1621,7 @@ data: {"content_index":0,"delta":"Based ","item_id":"msg_b2c3d4e5","output_index
 </hfoption>
 </hfoptions>
 
-### Multi-turn conversations
+### Multi-turn conversations[[responses]]
 
 For multi-turn conversations, pass a list of messages with `role` keys in the `input` field. Like all OpenAI-compatible servers, the API is stateless, so every request must contain the complete conversation history.
 
@@ -1643,7 +1643,7 @@ response = client.responses.create(
 print(response.output[0].content[0].text)
 ```
 
-The follow-up question "How many people live there?" relies on the prior context, and the model answers about Paris accordingly.
+The follow-up question "How many people live there?" relies on the prior context, so the model answers about Paris.
 
 ```
 As of 2021, Paris has a population of approximately 2.8 million people.
@@ -1734,15 +1734,15 @@ The stream ends with exactly one terminal event, `ready` (success) or `error` (f
 
 ## Timeout
 
-`transformers serve` supports different requests by different models. Each model loads on demand and stays in GPU memory. Models unload automatically after 300 seconds of inactivity to free up GPU memory. Set `--model-timeout` to a different value in seconds, or `-1` to disable unloading entirely.
+`transformers serve` handles requests for any model. Each model loads on demand and stays in GPU memory. Models unload automatically after 300 seconds of inactivity to free GPU memory. Set `--model-timeout` to a different value in seconds, or `-1` to disable unloading.
 
 ```shell
 transformers serve --model-timeout 400
 ```
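The load-on-demand and idle-unload behavior described above can be pictured with a small sketch. This is a hypothetical illustration, not the server's code: `IdleModelCache`, its methods, and the injected clock are names made up for this example, and the only behavior taken from the docs is the timeout in seconds with `-1` disabling unloading.

```python
import time


class IdleModelCache:
    """Hypothetical sketch of idle-timeout unloading (not the server's code)."""

    def __init__(self, timeout=300, clock=time.monotonic):
        self.timeout = timeout  # seconds of inactivity; -1 disables unloading
        self.clock = clock      # injectable so the example needs no sleeping
        self._models = {}       # name -> (model, last_used_timestamp)

    def get(self, name, loader):
        """Return the model, loading it on demand and refreshing its timestamp."""
        self.evict_idle()
        if name in self._models:
            model = self._models[name][0]
        else:
            model = loader(name)
        self._models[name] = (model, self.clock())
        return model

    def evict_idle(self):
        """Drop models idle for longer than the timeout; return their names."""
        if self.timeout == -1:
            return []
        now = self.clock()
        stale = [name for name, (_, last_used) in self._models.items()
                 if now - last_used > self.timeout]
        for name in stale:
            del self._models[name]
        return stale


# Simulate time with a fake clock instead of waiting 400 seconds.
now = [0.0]
cache = IdleModelCache(timeout=400, clock=lambda: now[0])
cache.get("org/model", loader=lambda name: object())
now[0] = 401.0
evicted = cache.evict_idle()

# With timeout=-1 nothing is ever unloaded.
pinned = IdleModelCache(timeout=-1, clock=lambda: now[0])
pinned.get("org/model", loader=lambda name: object())
now[0] = 10_000.0
still_loaded = pinned.evict_idle()
```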
 
 ### Loading examples
 
-See the example responses below for a freshly downloaded model, a model loaded from your local cache (skips the download stage), and a model that already exists in memory.
+The examples below show responses for a freshly downloaded model, a model loaded from your local cache (skips the download stage), and a model already in memory.
 
 <hfoptions id="load-model-examples">
 <hfoption id="fresh load">
@@ -1784,7 +1784,7 @@ data: {"status": "ready", "model": "org/model@main", "cached": true}
 The `transformers serve` server supports OpenAI-style function calling. Models trained for tool-use generate structured function calls that your application executes.
 
 > [!NOTE]
-> Tool calling is currently limited to the Qwen model family.
+> Tool calling works with any model whose tokenizer declares tool call tokens. Qwen and Gemma 4 work out of the box. Open an [issue](https://github.com/huggingface/transformers/issues/new/choose) to request support for a specific model.
 
 Define tools as a list of function specifications following the OpenAI format.
 
@@ -1846,6 +1846,79 @@ for event in response:
   print(event)
 ```
 
+### Multi-turn tool calling
+
+After the model returns a tool call, execute the function locally, then send the result back in a follow-up request to get the model's final answer. The pattern differs slightly between the two APIs. See the [OpenAI function calling guide](https://developers.openai.com/api/docs/guides/function-calling?api-mode=chat) for the full spec.
+
+The examples below reuse the `tools` list defined above.
+
+<hfoptions id="multi-turn-tool-calling">
+<hfoption id="v1/chat/completions">
+
+Pass the tool result as a `role: "tool"` message with the matching `tool_call_id`.
+
+```py
+# Model returns a tool call
+messages = [{"role": "user", "content": "What's the weather like in San Francisco?"}]
+response = client.chat.completions.create(
+    model="Qwen/Qwen2.5-7B-Instruct",
+    messages=messages,
+    tools=tools,
+)
+assistant_message = response.choices[0].message
+
+# Execute the tool locally
+tool_call = assistant_message.tool_calls[0]
+result = {"temperature": 22, "condition": "sunny"}  # your actual function call here
+
+# Send the tool result back
+messages.append(assistant_message)
+messages.append({
+    "role": "tool",
+    "tool_call_id": tool_call.id,
+    "content": json.dumps(result),
+})
+final_response = client.chat.completions.create(
+    model="Qwen/Qwen2.5-7B-Instruct",
+    messages=messages,
+    tools=tools,
+)
+print(final_response.choices[0].message.content)
+```
+
+</hfoption>
+<hfoption id="v1/responses">
+
+Pass the tool result as a `function_call_output` item in the `input` list of the follow-up request.
+
+```py
+user_message = {"role": "user", "content": "What's the weather like in San Francisco?"}
+response = client.responses.create(
+    model="Qwen/Qwen2.5-7B-Instruct",
+    input=[user_message],
+    tools=tools,
+    stream=False,
+)
+tool_call = next(item for item in response.output if item.type == "function_call")
+
+result = {"temperature": 22, "condition": "sunny"}
+
+final_response = client.responses.create(
+    model="Qwen/Qwen2.5-7B-Instruct",
+    input=[
+        user_message,
+        tool_call.model_dump(exclude_none=True),
+        {"type": "function_call_output", "call_id": tool_call.call_id, "output": json.dumps(result)},
+    ],
+    tools=tools,
+    stream=False,
+)
+print(final_response.output_text)
+```
+
+</hfoption>
+</hfoptions>
+
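The examples above stub the tool result with a fixed dictionary. One possible way to execute the tool for real on the Chat Completions side is sketched below: look up the function by name, parse the JSON-encoded arguments, and build the `role: "tool"` message. The `get_weather` function, its `location` parameter, and `TOOL_REGISTRY` are assumptions made for this sketch, and a `SimpleNamespace` stands in for `assistant_message.tool_calls[0]` so the snippet runs without a server.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> dict:
    """Hypothetical local tool; returns canned data instead of calling an API."""
    return {"location": location, "temperature": 22, "condition": "sunny"}


# Maps the function name the model emits to a local callable (assumed names).
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call):
    """Execute one Chat Completions tool call and build the follow-up message."""
    function = TOOL_REGISTRY[tool_call.function.name]
    # Arguments arrive as a JSON string in the OpenAI Chat Completions format.
    arguments = json.loads(tool_call.function.arguments)
    result = function(**arguments)
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result),
    }


# Stand-in for `assistant_message.tool_calls[0]` from a real response.
fake_tool_call = SimpleNamespace(
    id="call_123",
    function=SimpleNamespace(
        name="get_weather",
        arguments='{"location": "San Francisco"}',
    ),
)
tool_message = run_tool_call(fake_tool_call)
```

The returned dictionary can be appended to `messages` in place of the hand-written `role: "tool"` entry.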
 ## Port forwarding
 
 Port forwarding lets you serve models from a remote server. Make sure you have SSH access to the server, then run this command on your local machine.