
Commit e902b1b

Merge branch 'main' into clip_tokenizer_max_model_length
2 parents 6637bac + 74a2a4d commit e902b1b

105 files changed

Lines changed: 5331 additions & 327 deletions


docs/source/en/_toctree.yml

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -661,6 +661,8 @@
661661
title: HunYuanDenseV1
662662
- local: model_doc/hunyuan_v1_moe
663663
title: HunYuanMoEV1
664+
- local: model_doc/hy_v3
665+
title: HYV3
664666
- local: model_doc/ibert
665667
title: I-BERT
666668
- local: model_doc/jais2
@@ -767,6 +769,8 @@
767769
title: OLMoE
768770
- local: model_doc/olmo_hybrid
769771
title: OlmoHybrid
772+
- local: model_doc/openai_privacy_filter
773+
title: OpenAI Privacy Filter
770774
- local: model_doc/opt
771775
title: OPT
772776
- local: model_doc/pegasus
@@ -1365,6 +1369,8 @@
13651369
title: SigLIP
13661370
- local: model_doc/siglip2
13671371
title: SigLIP2
1372+
- local: model_doc/slanet
1373+
title: SLANet
13681374
- local: model_doc/slanext
13691375
title: SLANeXt
13701376
- local: model_doc/smollm3

docs/source/en/continuous_batching.md

Lines changed: 15 additions & 0 deletions
@@ -124,6 +124,20 @@ Cancel a request with [`~ContinuousBatchingManager.cancel_request`].
124124
manager.cancel_request(request_id="my_request")
125125
```
126126

127+
### Per-request sampling parameters
128+
129+
Enable `per_request_processors` to apply `temperature`, `top_k`, and `top_p` independently to each request within the same forward pass. This lets requests with different sampling needs share a batch, for example creative, high-temperature outputs alongside precise, low-temperature ones.
130+
131+
```py
132+
cb_config = ContinuousBatchingConfig(per_request_processors=True)
133+
134+
# each request gets its own sampling parameters
135+
manager.add_request(input_ids=inputs_a, temperature=0.9, top_p=0.95)
136+
manager.add_request(input_ids=inputs_b, temperature=0.1, top_k=10)
137+
```
138+
139+
Each parameter in [`GenerationConfig`] must be set to a non-default value for the associated logits processor to be created at runtime. For example, set `temperature` to a value other than `None` or `1` to support per-request temperature control. Requests with a temperature of `1` can still be added afterwards.
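To make the idea of per-request sampling concrete, here is a minimal, library-free sketch of applying a different `temperature`, `top_k`, and `top_p` to each row of a batch of logits. It is not the `ContinuousBatchingManager` implementation; the helper names `sample_token` and `sample_batch` are hypothetical and exist only for illustration.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token id from `logits` using one request's own parameters."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Token ids sorted from most to least likely.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]  # keep only the k best candidates
    # Numerically stable softmax over the surviving candidates.
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        kept, cumulative = 0, 0.0
        for p in probs:
            kept += 1
            cumulative += p
            if cumulative >= top_p:
                break
        order, probs = order[:kept], probs[:kept]
    # `choices` renormalizes the weights, so truncated probs are fine.
    return rng.choices(order, weights=probs, k=1)[0]


def sample_batch(batch_logits, request_params, rng=random):
    """Apply each request's parameters to its own row of the batch."""
    return [
        sample_token(row, rng=rng, **params)
        for row, params in zip(batch_logits, request_params)
    ]
```

Calling `sample_batch([logits_a, logits_b], [{"temperature": 0.9, "top_p": 0.95}, {"temperature": 0.1, "top_k": 10}])` mirrors the two `add_request` calls above: both rows are processed in one call, but each is filtered with its own settings.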
140+
127141
### Retrieving results
128142

129143
Iterate over the manager to receive results as they arrive.
@@ -174,6 +188,7 @@ By default, `num_blocks` and `max_batch_tokens` are inferred automatically from
174188
| Prefix caching | ↓ shared KV blocks | ✓ skips redundant prefill | ✓ TTFT |
175189
| Paged attention | ↓ no fragmentation | ✓ dynamic batch membership | |
176190
| Sliding window | ↓ bounded KV per layer | | |
191+
| Per-request processors | | ✓ mixed sampling params per batch | |
177192

178193
```py
179194
from transformers.generation import ContinuousBatchingConfig

docs/source/en/model_doc/glm_moe_dsa.md

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ limitations under the License.
1616
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.
1717
1818
-->
19-
*This model was released on {release_date} and added to Hugging Face Transformers on 2026-02-08.*
19+
*This model was released on 2026-02-17 and added to Hugging Face Transformers on 2026-02-09.*
2020

2121
<div style="float: right;">
2222
<div class="flex flex-wrap space-x-1">

docs/source/en/model_doc/hy_v3.md

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
1+
<!--Copyright 2026 THL A29 Limited, a Tencent company and The HuggingFace Inc. team. All rights reserved.
2+
3+
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4+
the License. You may obtain a copy of the License at
5+
6+
http://www.apache.org/licenses/LICENSE-2.0
7+
8+
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9+
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10+
specific language governing permissions and limitations under the License.
11+
12+
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13+
rendered properly in your Markdown viewer.
14+
15+
-->
16+
*This model was released on {release_date} and added to Hugging Face Transformers on 2026-04-22.*
17+
18+
# Hy3-preview
19+
20+
## Overview
21+
22+
Hy3-preview is a large-scale Mixture-of-Experts (MoE) language model developed by the Tencent HunYuan team. It features a dense-MoE hybrid architecture with 192 routed experts and 1 always-active shared expert per MoE layer, achieving strong performance with efficient inference via sparse expert activation.
23+
24+
Key architectural features:
25+
26+
- **Dense-MoE hybrid**: The first layer uses a dense FFN; all subsequent layers use MoE with top-k routing (default k=8).
27+
- **Shared experts**: Each MoE layer includes 1 shared expert that processes all tokens alongside the routed experts.
28+
- **Sigmoid routing with expert-bias correction**: Tokens are routed via sigmoid scoring (not softmax) with a learned per-expert bias for load balancing.
29+
- **QK-Norm**: Per-head RMSNorm applied to query and key projections before attention for improved training stability.
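The routing bullet above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the `HYV3` implementation: the source only says that routing uses sigmoid scores with a learned per-expert bias. Applying the bias to expert selection only, and renormalizing the unbiased scores of the chosen experts, are assumptions made here, and `route_token` is a hypothetical helper.

```python
import math


def route_token(router_logits, expert_bias, top_k=8):
    """Return (expert_ids, weights) for a single token."""
    # Sigmoid scoring: each expert is scored independently (no softmax coupling).
    scores = [1.0 / (1.0 + math.exp(-z)) for z in router_logits]
    # Assumption: the per-expert bias only influences which experts are picked.
    biased = [s + b for s, b in zip(scores, expert_bias)]
    chosen = sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)[:top_k]
    # Assumption: mixing weights come from the unbiased scores, renormalized to sum to 1.
    total = sum(scores[e] for e in chosen)
    weights = [scores[e] / total for e in chosen]
    return chosen, weights
```

With four experts, `top_k=2`, and a positive bias on an otherwise low-scoring expert, that expert gets selected but still receives a small mixing weight, which is how a bias of this kind can rebalance load without dominating the output.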
30+
31+
## Usage tips
32+
33+
- Load with `AutoModelForCausalLM`. The model requires multiple GPUs due to its size.
34+
- Set `output_router_logits=True` in the config or forward call to collect per-layer MoE router logits. Note that this model does not compute an auxiliary load-balancing loss; `aux_loss` is always `None`.
35+
- The model supports `gradient_checkpointing` to reduce memory during fine-tuning.
36+
37+
```python
38+
from transformers import AutoTokenizer, AutoModelForCausalLM
39+
40+
model_id = "tencent/Hy3-preview"
41+
tokenizer = AutoTokenizer.from_pretrained(model_id)
42+
model = AutoModelForCausalLM.from_pretrained(
43+
model_id,
44+
device_map="auto",
45+
)
46+
47+
inputs = tokenizer("The future of artificial intelligence is", return_tensors="pt").to(model.device)
48+
outputs = model.generate(**inputs, max_new_tokens=64)
49+
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
50+
```
51+
52+
## HYV3Config
53+
54+
[[autodoc]] HYV3Config
55+
56+
## HYV3Model
57+
58+
[[autodoc]] HYV3Model
59+
- forward
60+
61+
## HYV3ForCausalLM
62+
63+
[[autodoc]] HYV3ForCausalLM
64+
- forward

docs/source/en/model_doc/olmo.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -127,3 +127,8 @@ print(tokenizer.decode(output[0]))
127127

128128
[[autodoc]] OlmoForCausalLM
129129
- forward
130+
131+
## OlmoForSequenceClassification
132+
133+
[[autodoc]] OlmoForSequenceClassification
134+
- forward

docs/source/en/model_doc/olmo2.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -136,3 +136,8 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
136136

137137
[[autodoc]] Olmo2ForCausalLM
138138
- forward
139+
140+
## Olmo2ForSequenceClassification
141+
142+
[[autodoc]] Olmo2ForSequenceClassification
143+
- forward

docs/source/en/model_doc/olmo3.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -129,6 +129,11 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
129129

130130
[[autodoc]] Olmo3ForCausalLM
131131

132+
## Olmo3ForSequenceClassification
133+
134+
[[autodoc]] Olmo3ForSequenceClassification
135+
- forward
136+
132137
## Olmo3Model
133138

134139
[[autodoc]] Olmo3Model
Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
1+
<!--Copyright 2026 The HuggingFace Team. All rights reserved.
2+
3+
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4+
the License. You may obtain a copy of the License at
5+
6+
http://www.apache.org/licenses/LICENSE-2.0
7+
8+
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9+
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10+
specific language governing permissions and limitations under the License.
11+
12+
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13+
rendered properly in your Markdown viewer.
14+
15+
-->
16+
*This model was released on 2026-04-22 and added to Hugging Face Transformers on 2026-04-22.*
17+
18+
<div style="float: right;">
19+
<div class="flex flex-wrap space-x-1">
20+
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
21+
</div>
22+
</div>
23+
24+
# OpenAI Privacy Filter
25+
26+
OpenAI Privacy Filter is a bidirectional token-classification model for personally identifiable information (PII) detection and masking in text. It is intended for high-throughput data sanitization workflows where teams need a model that they can run on-premises that is fast, context-aware, and tunable.
27+
28+
OpenAI Privacy Filter was pretrained autoregressively to arrive at a checkpoint with an architecture similar to gpt-oss, albeit smaller. That checkpoint was then converted into a bidirectional token classifier over a privacy label taxonomy and post-trained with a supervised classification loss. (For architecture details about gpt-oss, please see the gpt-oss model card.) Instead of generating text token by token, this model labels an input sequence in a single forward pass, then decodes coherent spans with a constrained Viterbi procedure. For each input token, the model predicts a probability distribution over the label taxonomy, which consists of 8 output categories.
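As a rough illustration of what a constrained Viterbi decode does, the sketch below picks the highest-scoring label sequence while forbidding transitions that would produce an incoherent span. The BIO tag set, the `allowed` rule, and the function names are hypothetical and chosen only for this example; the model's actual taxonomy and constraints are not specified here.

```python
import math

LABELS = ["O", "B-NAME", "I-NAME"]  # hypothetical tag set, for illustration only


def allowed(prev, curr):
    """BIO constraint: an I-X tag may only follow B-X or I-X."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True


def constrained_viterbi(log_probs, labels=LABELS):
    """log_probs[t][k] is the log-probability of labels[k] at token t."""
    n = len(labels)
    # A sequence may not start inside a span.
    score = [
        log_probs[0][k] if not labels[k].startswith("I-") else -math.inf
        for k in range(n)
    ]
    backpointers = []
    for step in log_probs[1:]:
        new_score, pointers = [], []
        for k in range(n):
            # Best previous label among those allowed to precede labels[k].
            candidates = [
                score[j] if allowed(labels[j], labels[k]) else -math.inf
                for j in range(n)
            ]
            best = max(range(n), key=lambda j: candidates[j])
            new_score.append(candidates[best] + step[k])
            pointers.append(best)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    k = max(range(n), key=lambda i: score[i])
    path = [k]
    for pointers in reversed(backpointers):
        k = pointers[k]
        path.append(k)
    return [labels[k] for k in reversed(path)]
```

The difference from a per-token `argmax` shows up when the most likely label at some position is an `I-` tag with no preceding `B-` tag: greedy decoding would emit a span with no start, whereas the constrained decode returns the best sequence that is still well-formed.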
29+
30+
Highlights:
31+
32+
- Permissive Apache 2.0 license: ideal for experimentation, customization, and commercial deployment.
33+
- Small size: Runs in a web browser or on a laptop – 1.5B parameters total and 50M active parameters.
34+
- Fine-tunable: Adapt the model to specific data distributions through easy, data-efficient fine-tuning.
35+
- Long-context: 128,000-token context window enables processing long text with high throughput and no chunking.
36+
- Runtime control: configure precision/recall tradeoffs and detected span lengths through preset operating points.
37+
38+
The example below demonstrates how to detect privacy-sensitive tokens with [`Pipeline`] or the [`AutoModelForTokenClassification`] class.
39+
40+
<hfoptions id="usage">
41+
<hfoption id="Pipeline">
42+
43+
```py
44+
from transformers import pipeline
45+
46+
classifier = pipeline(
47+
task="token-classification",
48+
model="openai/privacy-filter",
49+
)
50+
classifier("My name is Alice Smith")
51+
```
52+
53+
</hfoption>
54+
<hfoption id="AutoModelForTokenClassification">
55+
56+
```py
57+
import torch
58+
from transformers import AutoModelForTokenClassification, AutoTokenizer
59+
60+
tokenizer = AutoTokenizer.from_pretrained("openai/privacy-filter")
61+
model = AutoModelForTokenClassification.from_pretrained("openai/privacy-filter", device_map="auto")
62+
63+
inputs = tokenizer("My name is Alice Smith", return_tensors="pt").to(model.device)
64+
65+
with torch.no_grad():
66+
outputs = model(**inputs)
67+
68+
predicted_token_class_ids = outputs.logits.argmax(dim=-1)
69+
predicted_token_classes = [model.config.id2label[token_id.item()] for token_id in predicted_token_class_ids[0]]
70+
print(predicted_token_classes)
71+
```
72+
73+
</hfoption>
74+
</hfoptions>
75+
76+
- Developed by: OpenAI
77+
- Funded by: OpenAI
78+
- Shared by: OpenAI
79+
- Model type: Bidirectional token classification model for privacy span detection
80+
- Language(s): Primarily English; selected multilingual robustness evaluation reported
81+
- License: [Apache 2.0](LICENSE)
82+
83+
- Source repository: https://github.com/openai/privacy-filter
84+
- Model weights: https://huggingface.co/openai/privacy-filter
85+
- Demo: https://huggingface.co/spaces/openai/privacy-filter
86+
87+
## Resources
88+
89+
- [Token classification task guide](../tasks/token_classification)
90+
91+
## OpenAIPrivacyFilterConfig
92+
93+
[[autodoc]] OpenAIPrivacyFilterConfig
94+
95+
## OpenAIPrivacyFilterModel
96+
97+
[[autodoc]] OpenAIPrivacyFilterModel
98+
- forward
99+
100+
## OpenAIPrivacyFilterForTokenClassification
101+
102+
[[autodoc]] OpenAIPrivacyFilterForTokenClassification
103+
- forward

docs/source/en/model_doc/slanet.md

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
1+
<!--Copyright 2026 The HuggingFace Team. All rights reserved.
2+
3+
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4+
the License. You may obtain a copy of the License at
5+
6+
http://www.apache.org/licenses/LICENSE-2.0
7+
8+
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9+
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10+
specific language governing permissions and limitations under the License.
11+
12+
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13+
rendered properly in your Markdown viewer.
14+
15+
-->
16+
*This model was released on 2025-03-07 and added to Hugging Face Transformers on 2026-04-22.*
17+
18+
# SLANet
19+
20+
<div class="flex flex-wrap space-x-1">
21+
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
22+
</div>
23+
24+
## Overview
25+
26+
**SLANet** and **SLANet_plus** are part of a series of dedicated lightweight models for table structure recognition, focusing on accurately recognizing table structures in documents and natural scenes. For more details about the SLANet series models, please refer to the [official documentation](https://www.paddleocr.ai/latest/en/version3.x/module_usage/table_structure_recognition.html).
27+
28+
## Model Architecture
29+
30+
SLANet is a table structure recognition model developed by the Baidu PaddlePaddle Vision Team. It significantly improves both the accuracy and the inference speed of table structure recognition by combining three components: PP-LCNet, a CPU-friendly lightweight backbone network; CSP-PAN, a module that fuses high-level and low-level features; and SLA Head, a feature decoding module that aligns structural and positional information.
31+
32+
## Usage
33+
34+
### Single input inference
35+
36+
The example below demonstrates how to recognize table structure with SLANet using [`AutoModelForTableRecognition`].
37+
38+
<hfoptions id="usage">
39+
<hfoption id="AutoModel">
40+
41+
```py
42+
from io import BytesIO
43+
44+
import httpx
45+
from PIL import Image
46+
from transformers import AutoImageProcessor, AutoModelForTableRecognition
47+
48+
model_path = "PaddlePaddle/SLANet_plus_safetensors"
49+
model = AutoModelForTableRecognition.from_pretrained(model_path, device_map="auto")
50+
image_processor = AutoImageProcessor.from_pretrained(model_path)
51+
52+
image = Image.open(BytesIO(httpx.get(image_url).content))  # set `image_url` to the URL of a table image first
53+
inputs = image_processor(images=image, return_tensors="pt").to(model.device)
54+
outputs = model(**inputs)
55+
56+
result = image_processor.post_process_table_recognition(outputs)
57+
58+
print(result['structure'])
59+
print(result['structure_score'])
60+
```
61+
62+
</hfoption>
63+
</hfoptions>
64+
65+
## SLANetConfig
66+
67+
[[autodoc]] SLANetConfig
68+
69+
## SLANetForTableRecognition
70+
71+
[[autodoc]] SLANetForTableRecognition
72+
73+
## SLANetBackbone
74+
75+
[[autodoc]] SLANetBackbone
76+
77+
## SLANetSLAHead
78+
79+
[[autodoc]] SLANetSLAHead
80+

docs/source/en/quantization/torchao.md

Lines changed: 5 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -328,11 +328,9 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
328328
import torch
329329
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
330330
from torchao.quantization import Int4WeightOnlyConfig
331-
from torchao.dtypes import Int4XPULayout
332-
from torchao.quantization.quant_primitives import ZeroPointDomain
333331

334332

335-
quant_config = Int4WeightOnlyConfig(group_size=128, layout=Int4XPULayout(), zero_point_domain=ZeroPointDomain.INT, int4_packing_format="plain_int32")
333+
quant_config = Int4WeightOnlyConfig(group_size=128, int4_packing_format="plain_int32")
336334
quantization_config = TorchAoConfig(quant_type=quant_config)
337335

338336
# Load and quantize the model
@@ -345,7 +343,7 @@ quantized_model = AutoModelForCausalLM.from_pretrained(
345343

346344
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
347345
input_text = "What are we having for dinner?"
348-
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device, quantized_model.dtype)
346+
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device).to(quantized_model.dtype)
349347

350348
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
351349
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
@@ -395,9 +393,9 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
395393
```py
396394
import torch
397395
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
398-
from torchao.prototype.int4_opaque_tensor import Int4WeightOnlyOpaqueTensorConfig
396+
from torchao.prototype.quantization.int4 import PrototypeInt4WeightOnlyConfig
399397

400-
quantization_config = TorchAoConfig(Int4WeightOnlyOpaqueTensorConfig(group_size=128))
398+
quantization_config = TorchAoConfig(PrototypeInt4WeightOnlyConfig(group_size=128, int4_choose_qparams_algorithm="tinygemm"))
401399

402400
# Load and quantize the model
403401
quantized_model = AutoModelForCausalLM.from_pretrained(
@@ -409,7 +407,7 @@ quantized_model = AutoModelForCausalLM.from_pretrained(
409407

410408
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
411409
input_text = "What are we having for dinner?"
412-
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device, quantized_model.dtype)
410+
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device).to(quantized_model.dtype)
413411

414412
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
415413
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
