
[models] Support for IBM Granite Models 3.2, 3.3 & 4.0 with FP16 and Q8 #92

Merged
mikepapadim merged 26 commits into main from feat/models/ibm-granite on Dec 18, 2025
Conversation

Member

@mikepapadim mikepapadim commented Dec 17, 2025

This PR adds first-class support for IBM Granite transformer models (3.2, 3.3, and 4.0) in GPULlama3.java, enabling both CPU inference and GPU-accelerated execution with FP16 (and preliminary Q8) support. The implementation fully integrates Granite into the existing inference pipeline, tokenizer, and TornadoVM execution path.

Key Additions

  • Granite Model Integration
  • Introduced Granite model type in the core model type registry.
  • Extended model loader logic to detect and handle Granite models via metadata/name.
  • Implemented µP (maximal update parameterization) with custom scaling for embeddings, attention, residuals, and logits during inference.
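The µP scaling points listed above can be sketched as follows. This is a minimal illustration: the class name, field names, and the numeric values in the usage note are assumptions, not the PR's actual code.

```java
// Minimal sketch of µP-style scaling (names and structure are illustrative,
// not the PR's actual classes). Granite applies fixed multipliers at four
// points of the forward pass: embedding lookup, attention scores, residual
// additions, and the final logits.
final class MuPScaling {
    final float embeddingScale; // multiplies token embeddings after lookup
    final float attentionScale; // replaces the usual 1/sqrt(headDim) factor
    final float residualScale;  // scales each residual branch before adding
    final float logitScale;     // divides the final logits before sampling

    MuPScaling(float embeddingScale, float attentionScale, float residualScale, float logitScale) {
        this.embeddingScale = embeddingScale;
        this.attentionScale = attentionScale;
        this.residualScale = residualScale;
        this.logitScale = logitScale;
    }

    // x = embeddingScale * embed(token)
    float[] scaleEmbedding(float[] embedding) {
        float[] out = new float[embedding.length];
        for (int i = 0; i < embedding.length; i++) {
            out[i] = embedding[i] * embeddingScale;
        }
        return out;
    }

    // Logits are divided by logitScale before softmax/sampling.
    float scaleLogit(float rawLogit) {
        return rawLogit / logitScale;
    }
}
```

For example, with an embedding scale of 12 and a logit scale of 8 (illustrative values only), a raw embedding entry of 1.0 becomes 12.0 and a raw logit of 16.0 becomes 2.0.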

Tested with: https://huggingface.co/ibm-research/granite-3.2-2b-instruct-GGUF/tree/1fffc80d42091ac8652479fcceefa82c02937187

Example output:

```
./llama-tornado --gpu --opencl --model granite-3.2-2b-instruct-Q8_0.gguf \
    --prompt "If a GPU kernel runs in 12 ms and is launched 80 times, how many seconds is the total runtime? Show the calculation."
WARNING: Using incubator modules: jdk.incubator.vector
12 milliseconds (ms) is equal to 0.012 seconds (s).

To find the total runtime in seconds, you multiply the runtime in milliseconds by the number of launches:

Total runtime (s) = Runtime per launch (s) × Number of launches

Total runtime (s) = 0.012 s/launch × 80 launches

Total runtime (s) = 0.96 s

So, the total runtime for the GPU kernel is approximately 0.96 seconds.

achieved tok/s: 24.93. Tokens: 158, seconds: 6.34
```

- Introduced `Granite` model type in `ModelType`.
- Added `forwardGranite` method with µP scaling in `InferenceCore`.
- Implemented token generation methods (`generateTokensGranite`, `generateTokensGPUGranite`) for Granite models.
- Updated `ModelLoader` to detect Granite models via metadata or name.
- Enhanced tokenizer and chat format compatibility with Granite.
- Introduced `GraniteStandardWeights` and `GraniteTornadoWeights` for CPU and TornadoVM GPU weight handling.
- Added `GraniteState` to manage model-specific state during inference.
- Implemented `GraniteTokenizer` with GPT-2-style BPE support adapted for Granite.
- Added `GraniteLoader` to handle model loading, configuration, and weight initialization.
- Created `GraniteConfiguration` to define model-specific parameters and scaling factors (µP parameterization).
- Introduced `GraniteKernels` for optimized kernel operations with FP16 support.
- Implemented `GraniteFP16FFNLayers` and `GraniteFP16LayerPlanner` for Transformer-based inference with TornadoVM.
- Added `LogitsGraniteFP16Layer` to support Granite logits layer.
- Enabled model-specific task graph creation and worker grid configuration tailored for Granite.
- Updated `QuantizationPlannerFactory` to integrate `GraniteFP16LayerPlanner`.
- Extended support for attention scaling, residual connections, and model-specific configurations (e.g., µP scaling).
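The `ModelLoader` detection bullet above (detect Granite via metadata or name) can be sketched like this. The method and the returned type strings are assumptions; only the GGUF metadata key `general.architecture` and its `granite` architecture value are standard.

```java
import java.util.Map;

// Hypothetical sketch of Granite detection in a model loader. Only the GGUF
// key "general.architecture" and the "granite" architecture string are
// standard; the method name and returned values are illustrative.
final class ModelTypeDetection {
    static String detect(Map<String, Object> ggufMetadata, String modelName) {
        // Prefer explicit architecture metadata when the GGUF file carries it.
        Object arch = ggufMetadata.get("general.architecture");
        if ("granite".equals(arch)) {
            return "GRANITE";
        }
        // Fall back to matching on the model/file name.
        if (modelName != null && modelName.toLowerCase().contains("granite")) {
            return "GRANITE";
        }
        return "LLAMA"; // default family in this sketch
    }
}
```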
@mikepapadim mikepapadim requested a review from Copilot December 17, 2025 16:51
@mikepapadim mikepapadim changed the title [modes] Support IBM Granite Models [models] Support IBM Granite Models Dec 17, 2025
@mikepapadim mikepapadim self-assigned this Dec 17, 2025
@mikepapadim mikepapadim requested a review from stratika December 17, 2025 16:52
@mikepapadim mikepapadim changed the title [models] Support IBM Granite Models [models] Support for IBM Granite Models 3.3+ with FP16 and Q8 Dec 17, 2025
Contributor

Copilot AI left a comment


Pull request overview

This pull request adds support for IBM Granite transformer models, which use maximal update parameterization (µP) with custom scaling factors for embeddings, attention, residuals, and logits. The implementation includes CPU inference, FP16 GPU acceleration via TornadoVM, and a GPT-2-style BPE tokenizer with Granite-specific special tokens.

Key Changes:

  • Implemented Granite model architecture with µP scaling factors applied at embedding lookup, attention computation, residual connections, and logit output
  • Added GraniteTokenizer with BPE encoding/decoding using token ID 0 (<|end_of_text|>) as both BOS and EOS
  • Created TornadoVM GPU kernels and layers for FP16 inference with Granite-specific scaling operations

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| `src/main/java/org/beehive/gpullama3/model/granite/GraniteConfiguration.java` | Configuration record with µP scaling parameters (embedding, residual, attention, logit scales) |
| `src/main/java/org/beehive/gpullama3/model/granite/Granite.java` | Main model class implementing Granite-specific forward passes |
| `src/main/java/org/beehive/gpullama3/inference/state/GraniteState.java` | Inference state with identical tensor shapes to LlamaState |
| `src/main/java/org/beehive/gpullama3/inference/weights/standard/GraniteStandardWeights.java` | CPU weights wrapper extending StandardWeights |
| `src/main/java/org/beehive/gpullama3/inference/weights/tornado/GraniteTornadoWeights.java` | GPU weights wrapper extending TornadoWeights |
| `src/main/java/org/beehive/gpullama3/model/loader/GraniteLoader.java` | Model loader with Granite-specific metadata extraction |
| `src/main/java/org/beehive/gpullama3/inference/InferenceCore.java` | Added forwardGranite method with µP scaling applied at key points |
| `src/main/java/org/beehive/gpullama3/inference/InferenceEngine.java` | Token generation methods for Granite (CPU and GPU variants) |
| `src/main/java/org/beehive/gpullama3/tornadovm/kernels/GraniteKernels.java` | GPU kernels for embedding conversion, logits projection, attention with Granite scaling |
| `src/main/java/org/beehive/gpullama3/tornadovm/layers/type/fp16/LogitsGraniteFP16Layer.java` | Logits layer using Granite-specific scaling kernel |
| `src/main/java/org/beehive/gpullama3/tornadovm/layers/type/fp16/GraniteFP16FFNLayers.java` | FFN layers with Granite residual scaling |
| `src/main/java/org/beehive/gpullama3/tornadovm/layers/ActivationGranite.java` | Activation layer applying embedding scale factor for FP16 |
| `src/main/java/org/beehive/gpullama3/tornadovm/layerplanner/model/fp16/GraniteFP16LayerPlanner.java` | Layer planner coordinating Granite FP16 execution |
| `src/main/java/org/beehive/gpullama3/tokenizer/GraniteTokenizer.java` | GPT-2-style BPE tokenizer with Granite special tokens |
| `src/main/java/org/beehive/gpullama3/model/format/GraniteChatFormat.java` | Chat format using `<\|start_of_role\|>` and `<\|end_of_role\|>` delimiters |
| `src/main/java/org/beehive/gpullama3/model/ModelType.java` | Added GRANITE enum variant with loader integration |
| `src/main/java/org/beehive/gpullama3/model/loader/ModelLoader.java` | Model type detection for Granite by name and metadata |
| `src/main/java/org/beehive/gpullama3/tornadovm/layerplanner/base/QuantizationPlannerFactory.java` | Factory integration for GraniteFP16LayerPlanner |
| `src/main/java/org/beehive/gpullama3/model/format/ChatFormat.java` | Added GraniteTokenizer case to factory method |
| `src/main/java/org/beehive/gpullama3/model/format/LlamaChatFormat.java` | Generalized from LlamaTokenizer to Tokenizer for reusability |
| `Makefile` | Fixed indentation (spaces to tab) on install target |
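The `<|start_of_role|>` / `<|end_of_role|>` delimiters that `GraniteChatFormat` uses can be illustrated with a small sketch. The exact template details (the trailing `<|end_of_text|>` and newline placement, and the open assistant header) are assumptions based on the review summary, not the PR's code.

```java
// Hypothetical sketch of the Granite chat-delimiter scheme described in the
// review summary; the precise template is an assumption, not the PR's code.
final class GraniteChatSketch {
    static String formatTurn(String role, String content) {
        return "<|start_of_role|>" + role + "<|end_of_role|>" + content + "<|end_of_text|>\n";
    }

    static String promptForAssistant(String userMessage) {
        // End with an open assistant role header so generation continues from it.
        return formatTurn("user", userMessage) + "<|start_of_role|>assistant<|end_of_role|>";
    }
}
```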



// @formatter:off
/**
* Transformer Layer Task Flow (LlamaFP16FFNLayers)

Copilot AI Dec 17, 2025


The documentation header incorrectly references LlamaFP16FFNLayers instead of GraniteFP16FFNLayers. This should be updated to accurately reflect the class being documented.

Suggested change
* Transformer Layer Task Flow (LlamaFP16FFNLayers)
* Transformer Layer Task Flow (GraniteFP16FFNLayers)

Comment on lines +46 to +58
Map<String, Integer> specialTokens = new HashMap<>();
for (int i = 0; i < allTokens; i++) {
String token = vocabulary.get(i);
// Identify special tokens by their format: start with <| and end with |>
if (token.startsWith("<|") && token.endsWith("|>")) {
specialTokens.put(token, i);
}
}

// init tokenizer object fields
this.vocabulary = vocabulary;
this.compiledPattern = Pattern.compile(GRANITE_PATTERN);
this.specialTokens = new HashMap<>(specialTokens);

Copilot AI Dec 17, 2025


Potentially confusing name: in `GraniteTokenizer`, the local variable `specialTokens` shadows the field of the same name (referenced as `this.specialTokens`).

Suggested change
Map<String, Integer> specialTokens = new HashMap<>();
for (int i = 0; i < allTokens; i++) {
String token = vocabulary.get(i);
// Identify special tokens by their format: start with <| and end with |>
if (token.startsWith("<|") && token.endsWith("|>")) {
specialTokens.put(token, i);
}
}
// init tokenizer object fields
this.vocabulary = vocabulary;
this.compiledPattern = Pattern.compile(GRANITE_PATTERN);
this.specialTokens = new HashMap<>(specialTokens);
Map<String, Integer> detectedSpecialTokens = new HashMap<>();
for (int i = 0; i < allTokens; i++) {
String token = vocabulary.get(i);
// Identify special tokens by their format: start with <| and end with |>
if (token.startsWith("<|") && token.endsWith("|>")) {
detectedSpecialTokens.put(token, i);
}
}
// init tokenizer object fields
this.vocabulary = vocabulary;
this.compiledPattern = Pattern.compile(GRANITE_PATTERN);
this.specialTokens = new HashMap<>(detectedSpecialTokens);


/**
* Encode text handling special tokens.
*/

Copilot AI Dec 17, 2025


This method overrides Tokenizer.encode; it is advisable to add an Override annotation.

Suggested change
*/
*/
@Override


return IntStream.range(0, bs.size()).boxed().collect(Collectors.toMap(bs::get, cs::get));
}


Copilot AI Dec 17, 2025


This method overrides Tokenizer.regexPattern; it is advisable to add an Override annotation.

Suggested change
@Override


return unifiedLayer;
}


Copilot AI Dec 17, 2025


This method overrides AbstractLayer.configureLayerDataTransfers; it is advisable to add an Override annotation.

Suggested change
@Override

public ImmutableTaskGraph getImmutableTaskGraph() {
return null;
}


Copilot AI Dec 17, 2025


This method overrides AbstractFFNLayers.getFfnLayerTaskGraphs; it is advisable to add an Override annotation.

Suggested change
@Override

- Added `processHeadsParallelGranite` in `GraniteKernels` for efficient multi-head attention processing with parallelism using TornadoVM.
- Updated `GraniteFP16FFNLayers` to utilize the new kernel.
- Extended support for configurable attention scaling.
…ate constructor for enhanced configurability
- Introduced `GraniteKernels` enhancements, adding Q8_0 kernel operations such as `convertQ8_0toFP32withGraniteScale` and fused matrix-vector computation.
- Implemented `GraniteQ8_0FFNLayers` and `Granite8_0LayerPlanner` for layered Q8_0 inference.
- Added `LogitsGraniteQ8_0Layer` to support Granite logits processing with Q8_0 quantization.
- Updated `QuantizationPlannerFactory` to include Q8_0 planning for Granite models.
- Enhanced `ActivationGranite` to handle Q8_0 embedding dequantization using Granite-specific scales.
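The Q8_0 path above can be sketched as a plain-Java dequantization that folds the Granite embedding multiplier into the same pass, loosely modelled on the `convertQ8_0toFP32withGraniteScale` kernel named in the list. The block size of 32 matches GGUF's Q8_0 layout; the class, method name, and signature are illustrative assumptions.

```java
// Illustrative Q8_0 dequantization with the Granite µP multiplier applied in
// the same pass. Q8_0 stores 32 int8 quants per block with one float scale
// per block; names and signature here are assumptions, not the PR's kernel.
final class Q8_0Dequant {
    static final int BLOCK_SIZE = 32; // GGUF Q8_0 block size

    static float[] dequantizeWithGraniteScale(byte[] quants, float[] blockScales, float graniteScale) {
        float[] out = new float[quants.length];
        for (int i = 0; i < quants.length; i++) {
            // value = quant * per-block scale, then the Granite multiplier
            out[i] = quants[i] * blockScales[i / BLOCK_SIZE] * graniteScale;
        }
        return out;
    }
}
```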
@mikepapadim mikepapadim marked this pull request as ready for review December 17, 2025 20:23
@mikepapadim
Member Author

mikepapadim commented Dec 17, 2025

The only thing left is to extend CI with model testing:

  • fp16 - granite3.3
  • fp16 - granite4.0
  • Q8 - granite3.3
  • Q8 - granite4.0

… call with `GraniteKernels::ropeRotationWithCacheCopy` in `GraniteFP16FFNLayers` and add `ropeTheta` to arguments.
…adata handling, and improve pretokenizer logic.
… with `GraniteKernels::ropeRotationWithCacheCopy` in `GraniteQ8_0FFNLayers` and add `ropeTheta` to kernel arguments.
@mikepapadim mikepapadim changed the title [models] Support for IBM Granite Models 3.3+ with FP16 and Q8 [models] Support for IBM Granite Models 3.2, 3.3 & 4.0 with FP16 and Q8 Dec 17, 2025
Collaborator

@orionpapadakis orionpapadakis left a comment


Nice! Just a typo in file/class name

…istency with naming conventions across Q8_0 models.
@mikepapadim mikepapadim merged commit a73a5ff into main Dec 18, 2025
4 checks passed
