Commit 3cee1ac

casually dropping the most capable open weights on the planet (#1627)
1 parent 364ebd4 commit 3cee1ac

17 files changed

Lines changed: 840 additions & 14 deletions

README.md

Lines changed: 5 additions & 4 deletions
@@ -294,10 +294,11 @@ To find compatible models on the Hub, select the "transformers.js" library tag i
1. **FastViT** (from Apple) released with the paper [FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization](https://huggingface.co/papers/2303.14189) by Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel and Anurag Ranjan.
1. **[FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei
1. **Florence2** (from Microsoft) released with the paper [Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks](https://huggingface.co/papers/2311.06242) by Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan.
-1. **[Gemma](https://huggingface.co/docs/transformers/main/model_doc/gemma)** (from Google) released with the paper [Gemma: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/gemma-open-models/) by the Gemma Google team.
-1. **[Gemma2](https://huggingface.co/docs/transformers/main/model_doc/gemma2)** (from Google) released with the paper [Gemma2: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/google-gemma-2/) by the Gemma Google team.
-1. **[Gemma3](https://huggingface.co/docs/transformers/main/model_doc/gemma3)** (from Google) released with the paper [Introducing Gemma 3: The most capable model you can run on a single GPU or TPU](https://blog.google/technology/developers/gemma-3/) by the Gemma Google team.
-1. **[Gemma3n](https://huggingface.co/docs/transformers/main/model_doc/gemma3n)** (from Google) released with the paper [Announcing Gemma 3n preview: powerful, efficient, mobile-first AI](https://developers.googleblog.com/en/introducing-gemma-3n/) by the Gemma Google team.
+1. **[Gemma](https://huggingface.co/docs/transformers/main/model_doc/gemma)** (from Google) released with the blog post [Gemma: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/gemma-open-models/) by the Gemma Google team.
+1. **[Gemma2](https://huggingface.co/docs/transformers/main/model_doc/gemma2)** (from Google) released with the blog post [Gemma2: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/google-gemma-2/) by the Gemma Google team.
+1. **[Gemma3](https://huggingface.co/docs/transformers/main/model_doc/gemma3)** (from Google) released with the blog post [Introducing Gemma 3: The most capable model you can run on a single GPU or TPU](https://blog.google/technology/developers/gemma-3/) by the Gemma Google team.
+1. **[Gemma3n](https://huggingface.co/docs/transformers/main/model_doc/gemma3n)** (from Google) released with the blog post [Announcing Gemma 3n preview: powerful, efficient, mobile-first AI](https://developers.googleblog.com/en/introducing-gemma-3n/) by the Gemma Google team.
+1. **[Gemma4](https://huggingface.co/docs/transformers/main/model_doc/gemma4)** (from Google) released with the blog post [Gemma 4](https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/) by the Gemma Google team.
1. **[GLM](https://huggingface.co/docs/transformers/main/model_doc/glm)** (from the GLM Team, THUDM & ZhipuAI) released with the paper [ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools](https://huggingface.co/papers/2406.12793v2) by Team GLM: Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Jingyu Sun, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, Zihan Wang.
1. **[GLM-MoE-DSA](https://huggingface.co/docs/transformers/main/model_doc/glm_moe_dsa)** (from the GLM Team, ZhipuAI & Tsinghua University) released with the paper [GLM-5: from Vibe Coding to Agentic Engineering](https://huggingface.co/papers/2602.15763) by Team GLM.
1. **[GLM-OCR](https://huggingface.co/docs/transformers/main/model_doc/glm_ocr)** (from the GLM Team, ZhipuAI & Tsinghua University) released with the paper [GLM-OCR Technical Report](https://huggingface.co/papers/2603.10910) by Team GLM: Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu, Sheng Yang, Guobing Gan, Guo Wang, Zihan Wang, Shengdong Yan, Dexin Jin, Yuxuan Zhang, Guohong Wen, Yanfeng Wang, Yutao Zhang, Xiaohan Zhang, Wenyi Hong, Yukuo Cen, Da Yin, Bin Chen, Wenmeng Yu, Xiaotao Gu, Jie Tang.

packages/transformers/docs/snippets/5_supported-models.snippet

Lines changed: 5 additions & 4 deletions
@@ -55,10 +55,11 @@
1. **FastViT** (from Apple) released with the paper [FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization](https://huggingface.co/papers/2303.14189) by Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel and Anurag Ranjan.
1. **[FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei
1. **Florence2** (from Microsoft) released with the paper [Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks](https://huggingface.co/papers/2311.06242) by Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan.
-1. **[Gemma](https://huggingface.co/docs/transformers/main/model_doc/gemma)** (from Google) released with the paper [Gemma: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/gemma-open-models/) by the Gemma Google team.
-1. **[Gemma2](https://huggingface.co/docs/transformers/main/model_doc/gemma2)** (from Google) released with the paper [Gemma2: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/google-gemma-2/) by the Gemma Google team.
-1. **[Gemma3](https://huggingface.co/docs/transformers/main/model_doc/gemma3)** (from Google) released with the paper [Introducing Gemma 3: The most capable model you can run on a single GPU or TPU](https://blog.google/technology/developers/gemma-3/) by the Gemma Google team.
-1. **[Gemma3n](https://huggingface.co/docs/transformers/main/model_doc/gemma3n)** (from Google) released with the paper [Announcing Gemma 3n preview: powerful, efficient, mobile-first AI](https://developers.googleblog.com/en/introducing-gemma-3n/) by the Gemma Google team.
+1. **[Gemma](https://huggingface.co/docs/transformers/main/model_doc/gemma)** (from Google) released with the blog post [Gemma: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/gemma-open-models/) by the Gemma Google team.
+1. **[Gemma2](https://huggingface.co/docs/transformers/main/model_doc/gemma2)** (from Google) released with the blog post [Gemma2: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/google-gemma-2/) by the Gemma Google team.
+1. **[Gemma3](https://huggingface.co/docs/transformers/main/model_doc/gemma3)** (from Google) released with the blog post [Introducing Gemma 3: The most capable model you can run on a single GPU or TPU](https://blog.google/technology/developers/gemma-3/) by the Gemma Google team.
+1. **[Gemma3n](https://huggingface.co/docs/transformers/main/model_doc/gemma3n)** (from Google) released with the blog post [Announcing Gemma 3n preview: powerful, efficient, mobile-first AI](https://developers.googleblog.com/en/introducing-gemma-3n/) by the Gemma Google team.
+1. **[Gemma4](https://huggingface.co/docs/transformers/main/model_doc/gemma4)** (from Google) released with the blog post [Gemma 4](https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/) by the Gemma Google team.
1. **[GLM](https://huggingface.co/docs/transformers/main/model_doc/glm)** (from the GLM Team, THUDM & ZhipuAI) released with the paper [ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools](https://huggingface.co/papers/2406.12793v2) by Team GLM: Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Jingyu Sun, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, Zihan Wang.
1. **[GLM-MoE-DSA](https://huggingface.co/docs/transformers/main/model_doc/glm_moe_dsa)** (from the GLM Team, ZhipuAI & Tsinghua University) released with the paper [GLM-5: from Vibe Coding to Agentic Engineering](https://huggingface.co/papers/2602.15763) by Team GLM.
1. **[GLM-OCR](https://huggingface.co/docs/transformers/main/model_doc/glm_ocr)** (from the GLM Team, ZhipuAI & Tsinghua University) released with the paper [GLM-OCR Technical Report](https://huggingface.co/papers/2603.10910) by Team GLM: Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu, Sheng Yang, Guobing Gan, Guo Wang, Zihan Wang, Shengdong Yan, Dexin Jin, Yuxuan Zhang, Guohong Wen, Yanfeng Wang, Yutao Zhang, Xiaohan Zhang, Wenyi Hong, Yukuo Cen, Da Yin, Bin Chen, Wenmeng Yu, Xiaotao Gu, Jie Tang.

packages/transformers/src/configs.js

Lines changed: 28 additions & 0 deletions
@@ -74,6 +74,7 @@ function getNormalizedConfig(config) {
         case 'voxtral_realtime':
         case 'smolvlm':
         case 'gemma3n':
+        case 'gemma4':
         case 'lfm2_vl':
         case 'chatterbox':
         case 'lighton_ocr':
@@ -165,6 +166,7 @@ function getNormalizedConfig(config) {
         case 'vaultgemma':
         case 'gemma3_text':
         case 'gemma3n_text':
+        case 'gemma4_text':
         case 'glm':
         case 'helium':
         case 'ernie4_5':
@@ -434,6 +436,32 @@ export function getCacheShapes(config, options) {
             }
         }
         return cache_values;
+    } else if (['gemma4', 'gemma4_text'].includes(config.model_type)) {
+        const c = /** @type {any} */ (
+            config.model_type === 'gemma4' ? /** @type {any} */ (config).text_config : config
+        );
+        const pkv_prefix = options?.prefix ?? 'past_key_values';
+
+        /** @type {Record<string, number[]>} */
+        const cache_values = {};
+        const num_hidden_layers = c.num_hidden_layers;
+        const num_kv_shared_layers = c.num_kv_shared_layers ?? 0;
+        const num_kv_layers = num_hidden_layers - num_kv_shared_layers;
+        const num_key_value_heads = c.num_key_value_heads;
+        const head_dim = c.head_dim;
+        const global_head_dim = c.global_head_dim ?? head_dim;
+        const layer_types = c.layer_types ?? [];
+
+        // Create `num_kv_layers` unique KV entries, corresponding to the first `num_kv_layers`
+        // model layers (the remaining layers share caches with earlier ones).
+        // Full attention layers use global_head_dim, sliding attention layers use head_dim.
+        for (let i = 0; i < num_kv_layers; ++i) {
+            const dim = layer_types[i] === 'full_attention' ? global_head_dim : head_dim;
+            for (const kv of ['key', 'value']) {
+                cache_values[`${pkv_prefix}.${i}.${kv}`] = [batch_size, num_key_value_heads, 0, dim];
+            }
+        }
+        return cache_values;
     } else if (['lfm2_vl', 'qwen3_5', 'qwen3_5_moe', 'voxtral_realtime'].includes(config.model_type)) {
         let subConfig;
         if (config.model_type === 'voxtral_realtime' && options?.session_name === 'audio_encoder') {
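
To make the effect of the new `gemma4` branch concrete, here is a minimal standalone sketch of the same loop. The function name `sketchGemma4CacheShapes` and the sample config values are invented for illustration; in the library the values come from the model's text config and `batch_size` is defined earlier in `getCacheShapes`.

```javascript
// Standalone sketch of the Gemma4 KV-cache shape logic from the hunk above.
// `sketchGemma4CacheShapes` and the sample config below are hypothetical.
function sketchGemma4CacheShapes(textConfig, { prefix = 'past_key_values', batch_size = 1 } = {}) {
    const num_kv_shared_layers = textConfig.num_kv_shared_layers ?? 0;
    // Only the first `num_kv_layers` layers get their own cache entries;
    // the remaining layers share caches with earlier ones.
    const num_kv_layers = textConfig.num_hidden_layers - num_kv_shared_layers;
    const head_dim = textConfig.head_dim;
    const global_head_dim = textConfig.global_head_dim ?? head_dim;
    const layer_types = textConfig.layer_types ?? [];

    const cache_values = {};
    for (let i = 0; i < num_kv_layers; ++i) {
        // Full attention layers use global_head_dim, sliding attention layers use head_dim.
        const dim = layer_types[i] === 'full_attention' ? global_head_dim : head_dim;
        for (const kv of ['key', 'value']) {
            // Third dimension is 0, as in the original code.
            cache_values[`${prefix}.${i}.${kv}`] = [batch_size, textConfig.num_key_value_heads, 0, dim];
        }
    }
    return cache_values;
}

// Made-up config: 6 layers, the last 2 share caches, every third layer is full attention.
const shapes = sketchGemma4CacheShapes({
    num_hidden_layers: 6,
    num_kv_shared_layers: 2,
    num_key_value_heads: 2,
    head_dim: 256,
    global_head_dim: 512,
    layer_types: [
        'sliding_attention', 'sliding_attention', 'full_attention',
        'sliding_attention', 'sliding_attention', 'full_attention',
    ],
});
console.log(shapes);
```

With this made-up config the result has eight entries (four layers, each with a `key` and a `value`). Layer 2 uses the 512-wide head and layers 0, 1 and 3 use 256; layers 4 and 5 get no entries of their own.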

packages/transformers/src/models/feature_extractors.js

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ export * from './clap/feature_extraction_clap.js';
 export * from './cohere_asr/feature_extraction_cohere_asr.js';
 export * from './dac/feature_extraction_dac.js';
 export * from './gemma3n/feature_extraction_gemma3n.js';
+export * from './gemma4/feature_extraction_gemma4.js';
 export * from './granite_speech/feature_extraction_granite_speech.js';
 export * from './moonshine/feature_extraction_moonshine.js';
 export * from './parakeet/feature_extraction_parakeet.js';

packages/transformers/src/models/gemma3n/modeling_gemma3n.js

Lines changed: 5 additions & 4 deletions
@@ -49,10 +49,7 @@ export class Gemma3nForConditionalGeneration extends Gemma3nPreTrainedModel {
         }));
         if (input_ids.dims[1] !== 1) {
             if (pixel_values) {
-                // Encode the image
-                const { image_features } = await sessionRun(this.sessions['vision_encoder'], {
-                    pixel_values,
-                });
+                const { image_features } = await this._encode_vision({ pixel_values, ...kwargs });
                 ({ inputs_embeds, attention_mask } = this._merge_input_ids_with_image_features({
                     image_features,
                     inputs_embeds,
@@ -93,6 +90,10 @@ export class Gemma3nForConditionalGeneration extends Gemma3nPreTrainedModel {
         return outputs;
     }
 
+    _encode_vision(kwargs) {
+        return sessionRun(this.sessions['vision_encoder'], { pixel_values: kwargs.pixel_values });
+    }
+
     _merge_input_ids_with_image_features(kwargs) {
         const vision_hidden_size = kwargs.image_features.dims.at(-1);
         const reshaped_image_hidden_states = kwargs.image_features.view(-1, vision_hidden_size);
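
The hunks above move the vision-encoder call into an overridable `_encode_vision(kwargs)` method that receives the remaining `kwargs` as well as `pixel_values`. A plausible reason, though the visible diff does not state it, is to let a subclass forward additional inputs to the encoder. The sketch below illustrates that pattern with stand-in classes only: `BaseVisionModel`, `ExtendedVisionModel`, `fakeSessionRun`, and the extra input name `pixel_position_ids` are all hypothetical and are not taken from the library.

```javascript
// Hypothetical stand-in for sessionRun: records which inputs were passed
// and resolves with a dummy output. Not the library's implementation.
const calls = [];
function fakeSessionRun(sessionName, inputs) {
    calls.push({ sessionName, keys: Object.keys(inputs) });
    return Promise.resolve({ image_features: 'dummy-features' });
}

class BaseVisionModel {
    // Default hook: forward only pixel_values, ignoring any other kwargs.
    _encode_vision(kwargs) {
        return fakeSessionRun('vision_encoder', { pixel_values: kwargs.pixel_values });
    }
}

class ExtendedVisionModel extends BaseVisionModel {
    // Overriding hook: forward one additional (hypothetical) input.
    _encode_vision(kwargs) {
        return fakeSessionRun('vision_encoder', {
            pixel_values: kwargs.pixel_values,
            pixel_position_ids: kwargs.pixel_position_ids,
        });
    }
}

const kwargs = { pixel_values: 'pixels', pixel_position_ids: 'positions' };
new BaseVisionModel()._encode_vision(kwargs);
new ExtendedVisionModel()._encode_vision(kwargs);
console.log(calls);
```

Because the caller passes `{ pixel_values, ...kwargs }` to the hook, a subclass can pick up extra inputs without changing the shared forward logic.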

packages/transformers/src/models/gemma4/feature_extraction_gemma4.js

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+import { validate_audio_inputs } from '../../feature_extraction_utils.js';
+import { Tensor } from '../../utils/tensor.js';
+import { spectrogram } from '../../utils/audio.js';
+import { Gemma3nAudioFeatureExtractor } from '../gemma3n/feature_extraction_gemma3n.js';
+
+export class Gemma4AudioFeatureExtractor extends Gemma3nAudioFeatureExtractor {
+    /**
+     * @override
+     * Gemma4 uses semicausal padding, unfold(frame_length+1) framing, and
+     * additive mel_floor — all controlled via flags on the shared spectrogram().
+     */
+    async _extract_fbank_features(waveform, max_length) {
+        const { frame_length, hop_length, fft_length } = this.config;
+
+        // Compute frame count matching Python's unfold(size=frame_length+1, step=hop_length)
+        const pad_left = Math.floor(frame_length / 2);
+        const num_frames = Math.floor((waveform.length + pad_left - (frame_length + 1)) / hop_length) + 1;
+
+        return spectrogram(waveform, this.window, frame_length, hop_length, {
+            fft_length,
+            center: true,
+            pad_mode: 'semicausal',
+            onesided: true,
+            preemphasis: this.config.preemphasis,
+            preemphasis_htk_flavor: this.config.preemphasis_htk_flavor,
+            mel_filters: this.mel_filters,
+            log_mel: 'log',
+            mel_floor: this.config.mel_floor,
+            mel_floor_mode: 'add',
+            remove_dc_offset: false,
+            transpose: true,
+            max_num_frames: num_frames,
+        });
+    }
+
+    /**
+     * @override
+     * Wraps the base class result with a frame-aware attention mask
+     * and zeros out features for invalid (padded) frames.
+     */
+    async _call(audio, options = {}) {
+        validate_audio_inputs(audio, 'Gemma4AudioFeatureExtractor');
+
+        const original_length = audio.length;
+        const result = await super._call(audio, options);
+
+        const { input_features } = result;
+        const [, num_frames, num_features] = input_features.dims;
+
+        // Build frame-aware mask: a frame is valid only when all its samples are real audio.
+        const { frame_length, hop_length } = this.config;
+        const pad_left = Math.floor(frame_length / 2);
+        const frame_size_for_unfold = frame_length + 1;
+
+        const sample_mask = new Uint8Array(original_length + pad_left + (options.pad_to_multiple_of ?? 128));
+        sample_mask.fill(1, pad_left, pad_left + original_length);
+
+        const frame_mask = new Uint8Array(num_frames);
+        for (let i = 0; i < num_frames; ++i) {
+            frame_mask[i] = sample_mask[i * hop_length + frame_size_for_unfold - 1] ? 1 : 0;
+        }
+
+        // Zero out features for invalid frames (matching Python's speech * mask[..., None])
+        const feat_data = /** @type {Float32Array} */ (input_features.data);
+        for (let i = 0; i < num_frames; ++i) {
+            if (!frame_mask[i]) {
+                feat_data.fill(0, i * num_features, (i + 1) * num_features);
+            }
+        }
+
+        result.input_features_mask = new Tensor('bool', frame_mask, [1, num_frames]);
+        return result;
+    }
+}
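
The frame arithmetic in `Gemma4AudioFeatureExtractor` is easier to follow on small numbers. The sketch below reproduces the frame-count formula and the frame-mask loop as standalone functions. The names `sketchFrameCount` and `sketchFrameMask` and the toy sizes are invented for illustration, and the sample mask is sized from an assumed padded length rather than from `pad_to_multiple_of` as in the diff. As in the committed loop, a frame is marked valid when the last sample of its `frame_length + 1` window falls on real audio.

```javascript
// Frame count for windows of size frameLength + 1 stepped by hopLength,
// after left-padding the signal by floor(frameLength / 2) samples.
// `sketchFrameCount` and `sketchFrameMask` are hypothetical helper names.
function sketchFrameCount(numSamples, frameLength, hopLength) {
    const padLeft = Math.floor(frameLength / 2);
    return Math.floor((numSamples + padLeft - (frameLength + 1)) / hopLength) + 1;
}

function sketchFrameMask(originalLength, paddedLength, frameLength, hopLength) {
    const padLeft = Math.floor(frameLength / 2);
    const numFrames = sketchFrameCount(paddedLength, frameLength, hopLength);

    // 1 marks real audio samples; left padding and right padding stay 0.
    const sampleMask = new Uint8Array(padLeft + paddedLength);
    sampleMask.fill(1, padLeft, padLeft + originalLength);

    const frameMask = new Uint8Array(numFrames);
    for (let i = 0; i < numFrames; ++i) {
        // Index of the last sample covered by frame i.
        frameMask[i] = sampleMask[i * hopLength + frameLength] ? 1 : 0;
    }
    return frameMask;
}

// Toy sizes: 6 real samples padded to 10, frame_length 4, hop_length 2.
const numFrames = sketchFrameCount(10, 4, 2);
const frameMask = sketchFrameMask(6, 10, 4, 2);
console.log(numFrames, Array.from(frameMask));
```

With these toy sizes there are four frames, whose last samples sit at indices 4, 6, 8 and 10 of the left-padded signal. Real audio occupies indices 2 through 7, so only the first two frames are marked valid.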
