diff --git a/README.md b/README.md
Lines changed: 5 additions & 4 deletions
diff --git a/packages/transformers/docs/snippets/5_supported-models.snippet b/packages/transformers/docs/snippets/5_supported-models.snippet
Lines changed: 5 additions & 4 deletions
diff --git a/packages/transformers/src/configs.js b/packages/transformers/src/configs.js
Lines changed: 28 additions & 0 deletions
diff --git a/packages/transformers/src/models/feature_extractors.js b/packages/transformers/src/models/feature_extractors.js
Lines changed: 1 addition & 0 deletions
diff --git a/packages/transformers/src/models/gemma3n/modeling_gemma3n.js b/packages/transformers/src/models/gemma3n/modeling_gemma3n.js
Lines changed: 5 additions & 4 deletions
diff --git a/packages/transformers/src/models/gemma4/feature_extraction_gemma4.js b/packages/transformers/src/models/gemma4/feature_extraction_gemma4.js
Lines changed: 74 additions & 0 deletions
@@ -294,10 +294,11 @@ To find compatible models on the Hub, select the "transformers.js" library tag i
 1. **FastViT** (from Apple) released with the paper [FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization](https://huggingface.co/papers/2303.14189) by Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel and Anurag Ranjan.
 1. **[FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei
 1. **Florence2** (from Microsoft) released with the paper [Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks](https://huggingface.co/papers/2311.06242) by Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan.
-1. **[Gemma](https://huggingface.co/docs/transformers/main/model_doc/gemma)** (from Google) released with the paper [Gemma: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/gemma-open-models/) by the Gemma Google team.
-1. **[Gemma2](https://huggingface.co/docs/transformers/main/model_doc/gemma2)** (from Google) released with the paper [Gemma2: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/google-gemma-2/) by the Gemma Google team.
-1. **[Gemma3](https://huggingface.co/docs/transformers/main/model_doc/gemma3)** (from Google) released with the paper [Introducing Gemma 3: The most capable model you can run on a single GPU or TPU](https://blog.google/technology/developers/gemma-3/) by the Gemma Google team.
-1. **[Gemma3n](https://huggingface.co/docs/transformers/main/model_doc/gemma3n)** (from Google) released with the paper [Announcing Gemma 3n preview: powerful, efficient, mobile-first AI](https://developers.googleblog.com/en/introducing-gemma-3n/) by the Gemma Google team.
+1. **[Gemma](https://huggingface.co/docs/transformers/main/model_doc/gemma)** (from Google) released with the blog post [Gemma: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/gemma-open-models/) by the Gemma Google team.
+1. **[Gemma2](https://huggingface.co/docs/transformers/main/model_doc/gemma2)** (from Google) released with the blog post [Gemma2: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/google-gemma-2/) by the Gemma Google team.
+1. **[Gemma3](https://huggingface.co/docs/transformers/main/model_doc/gemma3)** (from Google) released with the blog post [Introducing Gemma 3: The most capable model you can run on a single GPU or TPU](https://blog.google/technology/developers/gemma-3/) by the Gemma Google team.
+1. **[Gemma3n](https://huggingface.co/docs/transformers/main/model_doc/gemma3n)** (from Google) released with the blog post [Announcing Gemma 3n preview: powerful, efficient, mobile-first AI](https://developers.googleblog.com/en/introducing-gemma-3n/) by the Gemma Google team.
+1. **[Gemma4](https://huggingface.co/docs/transformers/main/model_doc/gemma4)** (from Google) released with the blog post [Gemma 4](https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/) by the Gemma Google team.
 1. **[GLM](https://huggingface.co/docs/transformers/main/model_doc/glm)** (from the GLM Team, THUDM & ZhipuAI) released with the paper [ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools](https://huggingface.co/papers/2406.12793v2) by Team GLM: Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Jingyu Sun, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, Zihan Wang.
 1. **[GLM-MoE-DSA](https://huggingface.co/docs/transformers/main/model_doc/glm_moe_dsa)** (from the GLM Team, ZhipuAI & Tsinghua University) released with the paper [GLM-5: from Vibe Coding to Agentic Engineering](https://huggingface.co/papers/2602.15763) by Team GLM.
 1. **[GLM-OCR](https://huggingface.co/docs/transformers/main/model_doc/glm_ocr)** (from the GLM Team, ZhipuAI & Tsinghua University) released with the paper [GLM-OCR Technical Report](https://huggingface.co/papers/2603.10910) by Team GLM: Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu, Sheng Yang, Guobing Gan, Guo Wang, Zihan Wang, Shengdong Yan, Dexin Jin, Yuxuan Zhang, Guohong Wen, Yanfeng Wang, Yutao Zhang, Xiaohan Zhang, Wenyi Hong, Yukuo Cen, Da Yin, Bin Chen, Wenmeng Yu, Xiaotao Gu, Jie Tang.
 
@@ -55,10 +55,11 @@
 1. **FastViT** (from Apple) released with the paper [FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization](https://huggingface.co/papers/2303.14189) by Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel and Anurag Ranjan.
 1. **[FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei
 1. **Florence2** (from Microsoft) released with the paper [Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks](https://huggingface.co/papers/2311.06242) by Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan.
-1. **[Gemma](https://huggingface.co/docs/transformers/main/model_doc/gemma)** (from Google) released with the paper [Gemma: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/gemma-open-models/) by the Gemma Google team.
-1. **[Gemma2](https://huggingface.co/docs/transformers/main/model_doc/gemma2)** (from Google) released with the paper [Gemma2: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/google-gemma-2/) by the Gemma Google team.
-1. **[Gemma3](https://huggingface.co/docs/transformers/main/model_doc/gemma3)** (from Google) released with the paper [Introducing Gemma 3: The most capable model you can run on a single GPU or TPU](https://blog.google/technology/developers/gemma-3/) by the Gemma Google team.
-1. **[Gemma3n](https://huggingface.co/docs/transformers/main/model_doc/gemma3n)** (from Google) released with the paper [Announcing Gemma 3n preview: powerful, efficient, mobile-first AI](https://developers.googleblog.com/en/introducing-gemma-3n/) by the Gemma Google team.
+1. **[Gemma](https://huggingface.co/docs/transformers/main/model_doc/gemma)** (from Google) released with the blog post [Gemma: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/gemma-open-models/) by the Gemma Google team.
+1. **[Gemma2](https://huggingface.co/docs/transformers/main/model_doc/gemma2)** (from Google) released with the blog post [Gemma2: Open Models Based on Gemini Technology and Research](https://blog.google/technology/developers/google-gemma-2/) by the Gemma Google team.
+1. **[Gemma3](https://huggingface.co/docs/transformers/main/model_doc/gemma3)** (from Google) released with the blog post [Introducing Gemma 3: The most capable model you can run on a single GPU or TPU](https://blog.google/technology/developers/gemma-3/) by the Gemma Google team.
+1. **[Gemma3n](https://huggingface.co/docs/transformers/main/model_doc/gemma3n)** (from Google) released with the blog post [Announcing Gemma 3n preview: powerful, efficient, mobile-first AI](https://developers.googleblog.com/en/introducing-gemma-3n/) by the Gemma Google team.
+1. **[Gemma4](https://huggingface.co/docs/transformers/main/model_doc/gemma4)** (from Google) released with the blog post [Gemma 4](https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/) by the Gemma Google team.
 1. **[GLM](https://huggingface.co/docs/transformers/main/model_doc/glm)** (from the GLM Team, THUDM & ZhipuAI) released with the paper [ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools](https://huggingface.co/papers/2406.12793v2) by Team GLM: Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Jingyu Sun, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, Zihan Wang.
 1. **[GLM-MoE-DSA](https://huggingface.co/docs/transformers/main/model_doc/glm_moe_dsa)** (from the GLM Team, ZhipuAI & Tsinghua University) released with the paper [GLM-5: from Vibe Coding to Agentic Engineering](https://huggingface.co/papers/2602.15763) by Team GLM.
 1. **[GLM-OCR](https://huggingface.co/docs/transformers/main/model_doc/glm_ocr)** (from the GLM Team, ZhipuAI & Tsinghua University) released with the paper [GLM-OCR Technical Report](https://huggingface.co/papers/2603.10910) by Team GLM: Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu, Sheng Yang, Guobing Gan, Guo Wang, Zihan Wang, Shengdong Yan, Dexin Jin, Yuxuan Zhang, Guohong Wen, Yanfeng Wang, Yutao Zhang, Xiaohan Zhang, Wenyi Hong, Yukuo Cen, Da Yin, Bin Chen, Wenmeng Yu, Xiaotao Gu, Jie Tang.
 
@@ -74,6 +74,7 @@ function getNormalizedConfig(config) {
         case 'voxtral_realtime':
         case 'smolvlm':
         case 'gemma3n':
+        case 'gemma4':
         case 'lfm2_vl':
         case 'chatterbox':
         case 'lighton_ocr':
@@ -165,6 +166,7 @@ function getNormalizedConfig(config) {
         case 'vaultgemma':
         case 'gemma3_text':
         case 'gemma3n_text':
+        case 'gemma4_text':
         case 'glm':
         case 'helium':
         case 'ernie4_5':
@@ -434,6 +436,32 @@ export function getCacheShapes(config, options) {
             }
         }
         return cache_values;
+    } else if (['gemma4', 'gemma4_text'].includes(config.model_type)) {
+        const c = /** @type {any} */ (
+            config.model_type === 'gemma4' ? /** @type {any} */ (config).text_config : config
+        );
+        const pkv_prefix = options?.prefix ?? 'past_key_values';
+
+        /** @type {Record<string, number[]>} */
+        const cache_values = {};
+        const num_hidden_layers = c.num_hidden_layers;
+        const num_kv_shared_layers = c.num_kv_shared_layers ?? 0;
+        const num_kv_layers = num_hidden_layers - num_kv_shared_layers;
+        const num_key_value_heads = c.num_key_value_heads;
+        const head_dim = c.head_dim;
+        const global_head_dim = c.global_head_dim ?? head_dim;
+        const layer_types = c.layer_types ?? [];
+
+        // Create `num_kv_layers` unique KV entries, corresponding to the first `num_kv_layers`
+        // model layers (the remaining layers share caches with earlier ones).
+        // Full attention layers use global_head_dim, sliding attention layers use head_dim.
+        for (let i = 0; i < num_kv_layers; ++i) {
+            const dim = layer_types[i] === 'full_attention' ? global_head_dim : head_dim;
+            for (const kv of ['key', 'value']) {
+                cache_values[`${pkv_prefix}.${i}.${kv}`] = [batch_size, num_key_value_heads, 0, dim];
+            }
+        }
+        return cache_values;
     } else if (['lfm2_vl', 'qwen3_5', 'qwen3_5_moe', 'voxtral_realtime'].includes(config.model_type)) {
         let subConfig;
         if (config.model_type === 'voxtral_realtime' && options?.session_name === 'audio_encoder') {
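
Note on the `gemma4` branch of `getCacheShapes` above: the shape logic can be exercised in isolation with a minimal sketch like the one below. The function name `gemma4CacheShapes`, the `batchSize` option, and the config values are hypothetical and chosen only for illustration; they are not part of the library or taken from a real checkpoint.

```javascript
// Sketch only: a standalone restatement of the gemma4 branch of getCacheShapes.
// `gemma4CacheShapes` and `batchSize` are hypothetical names, not library API.
function gemma4CacheShapes(textConfig, { prefix = 'past_key_values', batchSize = 1 } = {}) {
    // Layers beyond the first `numKvLayers` reuse earlier caches, so they get no entry.
    const numKvLayers = textConfig.num_hidden_layers - (textConfig.num_kv_shared_layers ?? 0);
    const headDim = textConfig.head_dim;
    const globalHeadDim = textConfig.global_head_dim ?? headDim;
    const layerTypes = textConfig.layer_types ?? [];

    const shapes = {};
    for (let i = 0; i < numKvLayers; ++i) {
        // Full attention layers use the global head dim, all others use head_dim.
        const dim = layerTypes[i] === 'full_attention' ? globalHeadDim : headDim;
        for (const kv of ['key', 'value']) {
            // The third dimension is 0 because the cache starts with no past tokens.
            shapes[`${prefix}.${i}.${kv}`] = [batchSize, textConfig.num_key_value_heads, 0, dim];
        }
    }
    return shapes;
}

// Illustrative config values (assumed, not from a real model).
const shapes = gemma4CacheShapes({
    num_hidden_layers: 6,
    num_kv_shared_layers: 2,
    num_key_value_heads: 2,
    head_dim: 256,
    global_head_dim: 512,
    layer_types: [
        'sliding_attention', 'sliding_attention', 'full_attention',
        'sliding_attention', 'sliding_attention', 'full_attention',
    ],
});
console.log(Object.keys(shapes).length); // 8 entries: 4 KV layers x (key, value)
console.log(shapes['past_key_values.2.key']); // [ 1, 2, 0, 512 ]
```

With these values, layers 4 and 5 are KV-shared and therefore produce no cache entries, even though layer 5 is a full attention layer.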
 
@@ -5,6 +5,7 @@ export * from './clap/feature_extraction_clap.js';
 export * from './cohere_asr/feature_extraction_cohere_asr.js';
 export * from './dac/feature_extraction_dac.js';
 export * from './gemma3n/feature_extraction_gemma3n.js';
+export * from './gemma4/feature_extraction_gemma4.js';
 export * from './granite_speech/feature_extraction_granite_speech.js';
 export * from './moonshine/feature_extraction_moonshine.js';
 export * from './parakeet/feature_extraction_parakeet.js';
 
@@ -49,10 +49,7 @@ export class Gemma3nForConditionalGeneration extends Gemma3nPreTrainedModel {
             }));
             if (input_ids.dims[1] !== 1) {
                 if (pixel_values) {
-                    // Encode the image
-                    const { image_features } = await sessionRun(this.sessions['vision_encoder'], {
-                        pixel_values,
-                    });
+                    const { image_features } = await this._encode_vision({ pixel_values, ...kwargs });
                     ({ inputs_embeds, attention_mask } = this._merge_input_ids_with_image_features({
                         image_features,
                         inputs_embeds,
@@ -93,6 +90,10 @@ export class Gemma3nForConditionalGeneration extends Gemma3nPreTrainedModel {
         return outputs;
     }
 
+    _encode_vision(kwargs) {
+        return sessionRun(this.sessions['vision_encoder'], { pixel_values: kwargs.pixel_values });
+    }
+
     _merge_input_ids_with_image_features(kwargs) {
         const vision_hidden_size = kwargs.image_features.dims.at(-1);
         const reshaped_image_hidden_states = kwargs.image_features.view(-1, vision_hidden_size);
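
Note on the `_encode_vision` extraction above: moving the vision session call into its own method makes it possible for a subclass to change which inputs reach the encoder without copying `forward`. The sketch below shows that pattern with stand-in classes. `fakeSessionRun`, `BaseVisionModel`, `ExtendedVisionModel`, and the `pixel_position_ids` input are all hypothetical and exist only to make the example runnable; whether any real subclass overrides the hook this way is an assumption.

```javascript
// Hypothetical stand-in for the library's asynchronous sessionRun helper.
// It just reports which session was called with which input names.
async function fakeSessionRun(session, inputs) {
    return { image_features: `${session}(${Object.keys(inputs).join(',')})` };
}

class BaseVisionModel {
    sessions = { vision_encoder: 'vision_encoder' };

    // Mirrors the call site in the diff: every extra kwarg is forwarded to the hook.
    async encode({ pixel_values, ...kwargs }) {
        const { image_features } = await this._encode_vision({ pixel_values, ...kwargs });
        return image_features;
    }

    // Default hook: only pixel_values reaches the encoder, as in the diff.
    _encode_vision(kwargs) {
        return fakeSessionRun(this.sessions['vision_encoder'], { pixel_values: kwargs.pixel_values });
    }
}

class ExtendedVisionModel extends BaseVisionModel {
    // Overridden hook: forwards one additional (hypothetical) input.
    _encode_vision(kwargs) {
        return fakeSessionRun(this.sessions['vision_encoder'], {
            pixel_values: kwargs.pixel_values,
            pixel_position_ids: kwargs.pixel_position_ids,
        });
    }
}
```

Calling `encode({ pixel_values, pixel_position_ids })` on the base class ignores the extra input, while the extended class passes it through to the session.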
 
@@ -0,0 +1,74 @@
+import { validate_audio_inputs } from '../../feature_extraction_utils.js';
+import { Tensor } from '../../utils/tensor.js';
+import { spectrogram } from '../../utils/audio.js';
+import { Gemma3nAudioFeatureExtractor } from '../gemma3n/feature_extraction_gemma3n.js';
+
+export class Gemma4AudioFeatureExtractor extends Gemma3nAudioFeatureExtractor {
+    /**
+     * Gemma4 uses semicausal padding, unfold(frame_length+1) framing, and an
+     * additive mel_floor, all controlled via flags on the shared spectrogram().
+     * @override
+     */
+    async _extract_fbank_features(waveform, max_length) {
+        const { frame_length, hop_length, fft_length } = this.config;
+
+        // Compute frame count matching Python's unfold(size=frame_length+1, step=hop_length)
+        const pad_left = Math.floor(frame_length / 2);
+        const num_frames = Math.floor((waveform.length + pad_left - (frame_length + 1)) / hop_length) + 1;
+
+        return spectrogram(waveform, this.window, frame_length, hop_length, {
+            fft_length,
+            center: true,
+            pad_mode: 'semicausal',
+            onesided: true,
+            preemphasis: this.config.preemphasis,
+            preemphasis_htk_flavor: this.config.preemphasis_htk_flavor,
+            mel_filters: this.mel_filters,
+            log_mel: 'log',
+            mel_floor: this.config.mel_floor,
+            mel_floor_mode: 'add',
+            remove_dc_offset: false,
+            transpose: true,
+            max_num_frames: num_frames,
+        });
+    }
+
+    /**
+     * Wraps the base class result with a frame-aware attention mask
+     * and zeros out features for invalid (padded) frames.
+     * @override
+     */
+    async _call(audio, options = {}) {
+        validate_audio_inputs(audio, 'Gemma4AudioFeatureExtractor');
+
+        const original_length = audio.length;
+        const result = await super._call(audio, options);
+
+        const { input_features } = result;
+        const [, num_frames, num_features] = input_features.dims;
+
+        // Build frame-aware mask: a frame is valid when its last sample is real audio.
+        const { frame_length, hop_length } = this.config;
+        const pad_left = Math.floor(frame_length / 2);
+        const frame_size_for_unfold = frame_length + 1;
+
+        const sample_mask = new Uint8Array(original_length + pad_left + (options.pad_to_multiple_of ?? 128));
+        sample_mask.fill(1, pad_left, pad_left + original_length);
+
+        const frame_mask = new Uint8Array(num_frames);
+        for (let i = 0; i < num_frames; ++i) {
+            frame_mask[i] = sample_mask[i * hop_length + frame_size_for_unfold - 1] ? 1 : 0;
+        }
+
+        // Zero out features for invalid frames (matching Python's speech * mask[..., None])
+        const feat_data = /** @type {Float32Array} */ (input_features.data);
+        for (let i = 0; i < num_frames; ++i) {
+            if (!frame_mask[i]) {
+                feat_data.fill(0, i * num_features, (i + 1) * num_features);
+            }
+        }
+
+        result.input_features_mask = new Tensor('bool', frame_mask, [1, num_frames]);
+        return result;
+    }
+}
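
Note on the frame mask in `Gemma4AudioFeatureExtractor._call` above: the frame count and validity rule can be checked by hand with a small sketch. The helper name `gemma4FrameMask`, its `paddedLength` parameter, and the numeric values are hypothetical; `frame_length = 320` and `hop_length = 160` are illustrative and not read from a real config.

```javascript
// Sketch only: recomputes the frame count and frame mask from the diff for a
// waveform of `originalLength` real samples padded on the right to `paddedLength`.
function gemma4FrameMask(originalLength, paddedLength, frameLength, hopLength) {
    const padLeft = Math.floor(frameLength / 2); // semicausal left padding
    const frameSize = frameLength + 1; // matches unfold(size=frame_length+1)
    const numFrames = Math.floor((paddedLength + padLeft - frameSize) / hopLength) + 1;

    // 1 marks real audio; left padding and right padding stay 0.
    const sampleMask = new Uint8Array(paddedLength + padLeft);
    sampleMask.fill(1, padLeft, padLeft + originalLength);

    // A frame counts as valid when the last sample it covers is real audio.
    const frameMask = new Uint8Array(numFrames);
    for (let i = 0; i < numFrames; ++i) {
        frameMask[i] = sampleMask[i * hopLength + frameSize - 1] ? 1 : 0;
    }
    return frameMask;
}

// 700 real samples padded to 1024, with illustrative frame and hop sizes.
console.log(Array.from(gemma4FrameMask(700, 1024, 320, 160))); // [ 1, 1, 1, 1, 0, 0 ]
```

Here the frames end at padded positions 320, 480, 640, 800, 960 and 1120, and real audio occupies positions 160 to 859, so the last two frames are masked out.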