docs/source/en/continuous_batching.md
15 additions & 0 deletions
@@ -124,6 +124,20 @@ Cancel a request with [`~ContinuousBatchingManager.cancel_request`].
manager.cancel_request(request_id="my_request")
```

+### Per-request sampling parameters
+
+Enable `per_request_processors` to apply `temperature`, `top_k`, and `top_p` independently to each request within the same forward pass. Different requests can then use different sampling parameters, for example high-temperature creative outputs alongside low-temperature precise ones.
+Set each parameter in [`GenerationConfig`] to a non-default value so that the associated logits processor is created at runtime. For example, set `temperature` to a value other than `None` or `1` to support per-request temperature control. Individual requests can still use a temperature of `1` afterwards.
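The sketch below illustrates the idea of per-request temperature inside one batch. It is plain Python rather than the library's logits processors, and the helper names `softmax` and `apply_per_request_temperature` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_per_request_temperature(batch_logits, temperatures):
    """Scale each request's logits by its own temperature, then normalize."""
    if len(batch_logits) != len(temperatures):
        raise ValueError("one temperature is required per request")
    return [
        softmax([x / t for x in row])
        for row, t in zip(batch_logits, temperatures)
    ]


# Two requests share a batch and happen to have identical logits.
batch_logits = [[2.0, 1.0, 0.0], [2.0, 1.0, 0.0]]
precise, creative = apply_per_request_temperature(batch_logits, [0.5, 1.5])
print(round(precise[0], 3), round(creative[0], 3))
```

The low-temperature request concentrates probability on the top token, while the high-temperature request spreads it out, even though both rows are processed together.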
+
### Retrieving results

Iterate over the manager to receive results as they arrive.
@@ -174,6 +188,7 @@ By default, `num_blocks` and `max_batch_tokens` are inferred automatically from
+OpenAI Privacy Filter is a bidirectional token-classification model for detecting and masking personally identifiable information (PII) in text. It is intended for high-throughput data sanitization workflows where teams need a fast, context-aware, tunable model that they can run on-premises.
+
+OpenAI Privacy Filter was first pretrained autoregressively, producing a checkpoint with an architecture similar to gpt-oss but smaller. That checkpoint was then converted into a bidirectional token classifier over a privacy label taxonomy and post-trained with a supervised classification loss. (For architecture details about gpt-oss, see the gpt-oss model card.) Instead of generating text token by token, the model labels an input sequence in a single forward pass, then decodes coherent spans with a constrained Viterbi procedure. For each input token, the model predicts a probability distribution over the label taxonomy, which consists of the 8 output categories described below.
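As a rough illustration of constrained Viterbi decoding, the sketch below assumes a BIO tagging scheme and a single hypothetical `NAME` category. The text does not specify the model's tagging scheme or transition rules, so treat the labels and the `allowed` constraint as assumptions rather than the model's actual decoder.

```python
import math

LABELS = ["O", "B-NAME", "I-NAME"]  # hypothetical labels, not the model's taxonomy


def allowed(prev, curr):
    """BIO constraint: I-X may only follow B-X or I-X."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], curr)
    return True


def constrained_viterbi(log_probs):
    """Return the highest-scoring label path that respects `allowed`.

    log_probs[t][j] is the log-probability of LABELS[j] at token t.
    """
    n = len(LABELS)
    # A sequence may not start with an I- label.
    scores = [
        lp if not LABELS[j].startswith("I-") else -math.inf
        for j, lp in enumerate(log_probs[0])
    ]
    backpointers = []
    for row in log_probs[1:]:
        new_scores, pointers = [], []
        for j in range(n):
            # Score of reaching label j from each previous label i.
            candidates = [
                scores[i] if allowed(LABELS[i], LABELS[j]) else -math.inf
                for i in range(n)
            ]
            best = max(range(n), key=candidates.__getitem__)
            new_scores.append(candidates[best] + row[j])
            pointers.append(best)
        scores = new_scores
        backpointers.append(pointers)
    # Walk the backpointers from the best final label.
    path = [max(range(n), key=scores.__getitem__)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return [LABELS[j] for j in reversed(path)]


probs = [
    [0.6, 0.3, 0.1],  # token 0: "O" is most likely on its own
    [0.1, 0.2, 0.7],  # token 1: "I-NAME" is most likely on its own
]
log_probs = [[math.log(p) for p in row] for row in probs]
print(constrained_viterbi(log_probs))  # ['B-NAME', 'I-NAME']
```

Picking the best label per token independently would yield `O` followed by `I-NAME`, which is not a valid span. The constrained search instead returns the best path that forms a coherent span.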
+
+Highlights:
+
+- Permissive Apache 2.0 license: ideal for experimentation, customization, and commercial deployment.
+- Small size: runs in a web browser or on a laptop, with 1.5B total parameters and 50M active parameters.
+- Fine-tunable: adapt the model to specific data distributions through easy, data-efficient fine-tuning.
+- Long context: a 128,000-token context window enables processing long text with high throughput and no chunking.
+- Runtime control: configure precision/recall tradeoffs and detected span lengths through preset operating points.
+
+The example below demonstrates how to detect privacy-sensitive tokens with [`Pipeline`] or the [`AutoModelForTokenClassification`] class.
+
+<hfoptions id="usage">
+<hfoption id="Pipeline">
+
+```py
+from transformers import pipeline
+
+classifier = pipeline(
+    task="token-classification",
+    model="openai/privacy-filter",
+)
+classifier("My name is Alice Smith")
+```
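To turn detections into masked text, one option is to replace each detected character span with a placeholder. The sketch below assumes entity dictionaries with `entity_group`, `start`, and `end` keys, the shape the token-classification pipeline returns when an `aggregation_strategy` is set. The `mask_spans` helper, the `NAME` label, and the sample entity are hypothetical stand-ins, not output from this model.

```python
def mask_spans(text, entities, mask="[{label}]"):
    """Replace each detected character span with a label placeholder."""
    pieces, cursor = [], 0
    for entity in sorted(entities, key=lambda e: e["start"]):
        if entity["start"] < cursor:
            continue  # skip spans that overlap one already masked
        pieces.append(text[cursor:entity["start"]])
        pieces.append(mask.format(label=entity["entity_group"]))
        cursor = entity["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "My name is Alice Smith"
# Hand-written stand-in for pipeline output; the label name is hypothetical.
entities = [{"entity_group": "NAME", "start": 11, "end": 22, "score": 0.99}]
print(mask_spans(text, entities))  # My name is [NAME]
```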
+
+</hfoption>
+<hfoption id="AutoModelForTokenClassification">
+
+```py
+import torch
+from transformers import AutoModelForTokenClassification, AutoTokenizer
-Multimodal models like [Gemma 4](https://huggingface.co/google/gemma-4-E2B-it) and [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-3B) accept audio input using the OpenAI `input_audio` content type. The audio must be base64-encoded and the format (`mp3` or `wav`) must be specified.
+Multimodal models like [Gemma 4](https://huggingface.co/google/gemma-4-E2B-it) and [Qwen2.5-Omni](https://huggingface.co/Qwen/Qwen2.5-Omni-3B) accept audio input through the OpenAI `input_audio` content type. Base64-encode the audio and specify the format (`mp3` or `wav`).
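A minimal sketch of building such a message follows. The `audio_message` helper and the placeholder bytes are hypothetical. The content-part layout follows the OpenAI chat completions `input_audio` format.

```python
import base64


def audio_message(prompt, audio_bytes, audio_format="wav"):
    """Build a user message that carries text plus base64-encoded audio."""
    if audio_format not in ("mp3", "wav"):
        raise ValueError("format must be 'mp3' or 'wav'")
    encoded = base64.b64encode(audio_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "input_audio",
                "input_audio": {"data": encoded, "format": audio_format},
            },
        ],
    }


# Placeholder bytes; in practice, read them from a real file opened in "rb" mode.
message = audio_message("Transcribe this clip.", b"placeholder-audio-bytes")
print(message["content"][1]["input_audio"]["format"])
```

Pass the resulting dictionary in the `messages` list of a chat completions request.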
> The `video_url` content type is an extension not part of the OpenAI standard and may change in future versions.

-Video input is supported using the `video_url` content type. If the model supports audio (e.g. Gemma 4, Qwen2.5-Omni), the audio track is automatically extracted from the video and processed alongside the visual frames.
+Use the `video_url` content type for video input. If the model supports audio (e.g. Gemma 4, Qwen2.5-Omni), the server extracts the audio track from the video and processes it with the visual frames.

> [!TIP]
> Video processing requires [torchcodec](https://github.com/pytorch/torchcodec). Install it with `pip install torchcodec`.
To have a multi-turn conversation, include the full conversation history in the `messages` list with alternating `user` and `assistant` roles. Like all OpenAI-compatible servers, the API is stateless, so every request must contain the complete conversation history.
For multi-turn conversations, pass a list of messages with `role` keys in the `input` field. Like all OpenAI-compatible servers, the API is stateless, so every request must contain the complete conversation history.
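Because the server keeps no state, the client owns the conversation history. The sketch below shows one way to accumulate it. The `add_turn` helper and the sample turns are hypothetical, and in a real exchange the assistant text would come from the previous response.

```python
def add_turn(history, role, content):
    """Return a new history with one more message; the input list is not mutated."""
    if role not in ("system", "user", "assistant"):
        raise ValueError(f"unexpected role: {role}")
    return history + [{"role": role, "content": content}]


history = []
history = add_turn(history, "user", "What is the capital of France?")
# Stand-in for the model's reply to the first request.
history = add_turn(history, "assistant", "The capital of France is Paris.")
history = add_turn(history, "user", "How many people live there?")

# The next request sends the whole list, for example as `messages` or `input`.
for message in history:
    print(message["role"], "->", message["content"])
```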
-The follow-up question "How many people live there?" relies on the prior context, and the model answers about Paris accordingly.
+The follow-up question "How many people live there?" relies on the prior context, so the model answers about Paris.

```
As of 2021, Paris has a population of approximately 2.8 million people.
@@ -1734,15 +1734,15 @@ The stream ends with exactly one terminal event, `ready` (success) or `error` (f

## Timeout

-`transformers serve` supports different requests by different models. Each model loads on demand and stays in GPU memory. Models unload automatically after 300 seconds of inactivity to free up GPU memory. Set `--model-timeout` to a different value in seconds, or `-1` to disable unloading entirely.
+`transformers serve` handles requests for any model. Each model loads on demand and stays in GPU memory. Models unload automatically after 300 seconds of inactivity to free GPU memory. Set `--model-timeout` to a different value in seconds, or `-1` to disable unloading.

```shell
transformers serve --model-timeout 400
```
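The unloading behavior can be pictured as an idle-timeout cache. The sketch below is a toy model of that idea under stated assumptions, not the server's implementation: the `IdleModelCache` class, its `loader` callback, and the injected clock are all hypothetical.

```python
import time


class IdleModelCache:
    """Toy cache that unloads entries left unused for `timeout` seconds."""

    def __init__(self, loader, timeout=300, clock=time.monotonic):
        self.loader = loader
        self.timeout = timeout  # -1 disables unloading
        self.clock = clock
        self._entries = {}  # model name -> (model, time of last use)

    def get(self, name):
        """Return the model, loading it on first use, and refresh its timer."""
        self.evict_idle()
        if name in self._entries:
            model = self._entries[name][0]
        else:
            model = self.loader(name)
        self._entries[name] = (model, self.clock())
        return model

    def evict_idle(self):
        """Drop every model whose idle time has reached the timeout."""
        if self.timeout < 0:
            return
        now = self.clock()
        idle = [n for n, (_, used) in self._entries.items() if now - used >= self.timeout]
        for name in idle:
            del self._entries[name]

    def loaded(self):
        return sorted(self._entries)


# A fake clock keeps the example deterministic.
fake_now = [0.0]
cache = IdleModelCache(loader=lambda name: f"<{name}>", timeout=400, clock=lambda: fake_now[0])
cache.get("model-a")
fake_now[0] = 399.0
cache.evict_idle()
print(cache.loaded())  # ['model-a']
fake_now[0] = 400.0
cache.evict_idle()
print(cache.loaded())  # []
```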

### Loading examples

-See the example responses below for a freshly downloaded model, a model loaded from your local cache (skips the download stage), and a model that already exists in memory.
+The examples below show responses for a freshly downloaded model, a model loaded from your local cache (skips the download stage), and a model already in memory.
The `transformers serve` server supports OpenAI-style function calling. Models trained for tool-use generate structured function calls that your application executes.

> [!NOTE]
-> Tool calling is currently limited to the Qwen model family.
+> Tool calling works with any model whose tokenizer declares tool call tokens. Qwen and Gemma 4 work out of the box. Open an [issue](https://github.com/huggingface/transformers/issues/new/choose) to request support for a specific model.

Define tools as a list of function specifications following the OpenAI format.
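For illustration, a tool specification in the OpenAI chat completions format might look like the sketch below. The `get_weather` function and its `location` parameter are hypothetical examples, not names taken from the documentation.

```python
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool name
            "description": "Get the current weather for a city.",
            # JSON Schema describing the arguments the model should produce.
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g. San Francisco",
                    },
                },
                "required": ["location"],
            },
        },
    }
]

# The specification must be JSON-serializable to be sent in a request.
print(json.dumps(tools[0]["function"]["parameters"]["required"]))
```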
@@ -1846,6 +1846,79 @@ for event in response:
    print(event)
```

+### Multi-turn tool calling
+
+After the model returns a tool call, execute the function locally, then send the result back in a follow-up request to get the model's final answer. The pattern differs slightly between the two APIs. See the [OpenAI function calling guide](https://developers.openai.com/api/docs/guides/function-calling?api-mode=chat) for the full spec.
+
+The examples below reuse the `tools` list defined above.
+
+<hfoptions id="multi-turn-tool-calling">
+<hfoption id="v1/chat/completions">
+
+Pass the tool result as a `role: "tool"` message with the matching `tool_call_id`.
+
+```py
+import json  # needed for json.dumps below
+
+# Model returns a tool call
+messages = [{"role": "user", "content": "What's the weather like in San Francisco?"}]
+response = client.chat.completions.create(
+    model="Qwen/Qwen2.5-7B-Instruct",
+    messages=messages,
+    tools=tools,
+)
+assistant_message = response.choices[0].message
+
+# Execute the tool locally
+tool_call = assistant_message.tool_calls[0]
+result = {"temperature": 22, "condition": "sunny"}  # your actual function call here
+
+# Send the tool result back
+messages.append(assistant_message)
+messages.append({
+    "role": "tool",
+    "tool_call_id": tool_call.id,
+    "content": json.dumps(result),
+})
+final_response = client.chat.completions.create(
+    model="Qwen/Qwen2.5-7B-Instruct",
+    messages=messages,
+    tools=tools,
+)
+print(final_response.choices[0].message.content)
+```
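The example above hard-codes `result`. One way to execute the requested function is to dispatch on the tool name and parse the JSON arguments, as in the sketch below. `TOOL_REGISTRY`, `run_tool_call`, and `get_weather` are hypothetical names, and `fake_call` is a stand-in that mimics the attributes of a chat completions tool call object.

```python
import json
from types import SimpleNamespace


def get_weather(location):
    """Hypothetical local tool; a real one would query a weather service."""
    return {"location": location, "temperature": 22, "condition": "sunny"}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call):
    """Look up the requested function, call it, and build the `tool` message."""
    name = tool_call.function.name
    if name not in TOOL_REGISTRY:
        raise KeyError(f"model requested unknown tool: {name}")
    # The arguments arrive as a JSON string, not a dictionary.
    arguments = json.loads(tool_call.function.arguments)
    result = TOOL_REGISTRY[name](**arguments)
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result),
    }


# Stand-in with the same attributes as a chat completions tool call object.
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(
        name="get_weather",
        arguments='{"location": "San Francisco"}',
    ),
)
tool_message = run_tool_call(fake_call)
print(tool_message["content"])
```

Append the returned message to `messages` in place of the hand-built dictionary.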
+
+</hfoption>
+<hfoption id="v1/responses">
+
+Pass the tool result as a `function_call_output` item in the `input` list of the follow-up request.
+
+```py
+user_message = {"role": "user", "content": "What's the weather like in San Francisco?"}
+response = client.responses.create(
+    model="Qwen/Qwen2.5-7B-Instruct",
+    input=[user_message],
+    tools=tools,
+    stream=False,
+)
+tool_call = next(item for item in response.output if item.type == "function_call")
+
+result = {"temperature": 22, "condition": "sunny"}