
[models] Support for IBM Granite Models 3.2, 3.3 & 4.0 with FP16 and Q8 #92

Merged
mikepapadim merged 26 commits into main from feat/models/ibm-granite on Dec 18, 2025
Conversation

Member

@mikepapadim mikepapadim commented Dec 17, 2025

This PR adds first-class support for IBM Granite transformer models (3.2, 3.3, and 4.0) in GPULlama3.java, enabling both CPU inference and GPU-accelerated execution with FP16 (and preliminary Q8) support. The implementation fully integrates Granite into the existing inference pipeline, tokenizer, and TornadoVM execution path.

Key Additions

  • Granite Model Integration
  • Introduced Granite model type in the core model type registry.
  • Extended model loader logic to detect and handle Granite models via metadata/name.
  • Implemented µP (maximal update parameterization) with custom scaling for embeddings, attention, residuals, and logits during inference.
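The µP scaling points listed above can be sketched as follows. This is a minimal illustration: the class name, field names, and the numeric values in the usage note are assumptions, not the PR's actual code.

```java
// Minimal sketch of µP-style scaling (names and structure are illustrative,
// not the PR's actual classes). Granite applies fixed multipliers at four
// points of the forward pass: embedding lookup, attention scores, residual
// additions, and the final logits.
final class MuPScaling {
    final float embeddingScale; // multiplies token embeddings after lookup
    final float attentionScale; // replaces the usual 1/sqrt(headDim) factor
    final float residualScale;  // scales each residual branch before adding
    final float logitScale;     // divides the final logits before sampling

    MuPScaling(float embeddingScale, float attentionScale, float residualScale, float logitScale) {
        this.embeddingScale = embeddingScale;
        this.attentionScale = attentionScale;
        this.residualScale = residualScale;
        this.logitScale = logitScale;
    }

    // x = embeddingScale * embed(token)
    float[] scaleEmbedding(float[] embedding) {
        float[] out = new float[embedding.length];
        for (int i = 0; i < embedding.length; i++) {
            out[i] = embedding[i] * embeddingScale;
        }
        return out;
    }

    // Logits are divided by logitScale before softmax/sampling.
    float scaleLogit(float rawLogit) {
        return rawLogit / logitScale;
    }
}
```

For example, with an embedding scale of 12 and a logit scale of 8 (illustrative values only), a raw embedding entry of 1.0 becomes 12.0 and a raw logit of 16.0 becomes 2.0.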

Tested with: https://huggingface.co/ibm-research/granite-3.2-2b-instruct-GGUF/tree/1fffc80d42091ac8652479fcceefa82c02937187

Example output:

```
./llama-tornado --gpu --opencl --model granite-3.2-2b-instruct-Q8_0.gguf \
    --prompt "If a GPU kernel runs in 12 ms and is launched 80 times, how many seconds is the total runtime? Show the calculation."
WARNING: Using incubator modules: jdk.incubator.vector
12 milliseconds (ms) is equal to 0.012 seconds (s).

To find the total runtime in seconds, you multiply the runtime in milliseconds by the number of launches:

Total runtime (s) = Runtime per launch (s) × Number of launches

Total runtime (s) = 0.012 s/launch × 80 launches

Total runtime (s) = 0.96 s

So, the total runtime for the GPU kernel is approximately 0.96 seconds.

achieved tok/s: 24.93. Tokens: 158, seconds: 6.34
```

- Introduced `Granite` model type in `ModelType`.
- Added `forwardGranite` method with µP scaling in `InferenceCore`.
- Implemented token generation methods (`generateTokensGranite`, `generateTokensGPUGranite`) for Granite models.
- Updated `ModelLoader` to detect Granite models via metadata or name.
- Enhanced tokenizer and chat format compatibility with Granite.
- Introduced `GraniteStandardWeights` and `GraniteTornadoWeights` for CPU and TornadoVM GPU weight handling.
- Added `GraniteState` to manage model-specific state during inference.
- Implemented `GraniteTokenizer` with GPT-2-style BPE support adapted for Granite.
- Added `GraniteLoader` to handle model loading, configuration, and weight initialization.
- Created `GraniteConfiguration` to define model-specific parameters and scaling factors (µP parameterization).
- Introduced `GraniteKernels` for optimized kernel operations with FP16 support.
- Implemented `GraniteFP16FFNLayers` and `GraniteFP16LayerPlanner` for Transformer-based inference with TornadoVM.
- Added `LogitsGraniteFP16Layer` to support Granite logits layer.
- Enabled model-specific task graph creation and worker grid configuration tailored for Granite.
- Updated `QuantizationPlannerFactory` to integrate `GraniteFP16LayerPlanner`.
- Extended support for attention scaling, residual connections, and model-specific configurations (e.g., µP scaling).
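The `ModelLoader` detection bullet above (detect Granite via metadata or name) can be sketched like this. The method and the returned type strings are assumptions; only the GGUF metadata key `general.architecture` and its `granite` architecture value are standard.

```java
import java.util.Map;

// Hypothetical sketch of Granite detection in a model loader. Only the GGUF
// key "general.architecture" and the "granite" architecture string are
// standard; the method name and returned values are illustrative.
final class ModelTypeDetection {
    static String detect(Map<String, Object> ggufMetadata, String modelName) {
        // Prefer explicit architecture metadata when the GGUF file carries it.
        Object arch = ggufMetadata.get("general.architecture");
        if ("granite".equals(arch)) {
            return "GRANITE";
        }
        // Fall back to matching on the model/file name.
        if (modelName != null && modelName.toLowerCase().contains("granite")) {
            return "GRANITE";
        }
        return "LLAMA"; // default family in this sketch
    }
}
```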
@mikepapadim mikepapadim requested a review from Copilot December 17, 2025 16:51
@mikepapadim mikepapadim changed the title [modes] Support IBM Granite Models [models] Support IBM Granite Models Dec 17, 2025
@mikepapadim mikepapadim self-assigned this Dec 17, 2025
@mikepapadim mikepapadim requested a review from stratika December 17, 2025 16:52
@mikepapadim mikepapadim changed the title [models] Support IBM Granite Models [models] Support for IBM Granite Models 3.3+ with FP16 and Q8 Dec 17, 2025
Contributor

Copilot AI left a comment


Pull request overview

This pull request adds support for IBM Granite transformer models, which use maximal update parameterization (µP) with custom scaling factors for embeddings, attention, residuals, and logits. The implementation includes CPU inference, FP16 GPU acceleration via TornadoVM, and a GPT-2-style BPE tokenizer with Granite-specific special tokens.

Key Changes:

  • Implemented Granite model architecture with µP scaling factors applied at embedding lookup, attention computation, residual connections, and logit output
  • Added GraniteTokenizer with BPE encoding/decoding using token ID 0 (<|end_of_text|>) as both BOS and EOS
  • Created TornadoVM GPU kernels and layers for FP16 inference with Granite-specific scaling operations

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| `src/main/java/org/beehive/gpullama3/model/granite/GraniteConfiguration.java` | Configuration record with µP scaling parameters (embedding, residual, attention, logit scales) |
| `src/main/java/org/beehive/gpullama3/model/granite/Granite.java` | Main model class implementing Granite-specific forward passes |
| `src/main/java/org/beehive/gpullama3/inference/state/GraniteState.java` | Inference state with identical tensor shapes to LlamaState |
| `src/main/java/org/beehive/gpullama3/inference/weights/standard/GraniteStandardWeights.java` | CPU weights wrapper extending StandardWeights |
| `src/main/java/org/beehive/gpullama3/inference/weights/tornado/GraniteTornadoWeights.java` | GPU weights wrapper extending TornadoWeights |
| `src/main/java/org/beehive/gpullama3/model/loader/GraniteLoader.java` | Model loader with Granite-specific metadata extraction |
| `src/main/java/org/beehive/gpullama3/inference/InferenceCore.java` | Added forwardGranite method with µP scaling applied at key points |
| `src/main/java/org/beehive/gpullama3/inference/InferenceEngine.java` | Token generation methods for Granite (CPU and GPU variants) |
| `src/main/java/org/beehive/gpullama3/tornadovm/kernels/GraniteKernels.java` | GPU kernels for embedding conversion, logits projection, attention with Granite scaling |
| `src/main/java/org/beehive/gpullama3/tornadovm/layers/type/fp16/LogitsGraniteFP16Layer.java` | Logits layer using Granite-specific scaling kernel |
| `src/main/java/org/beehive/gpullama3/tornadovm/layers/type/fp16/GraniteFP16FFNLayers.java` | FFN layers with Granite residual scaling |
| `src/main/java/org/beehive/gpullama3/tornadovm/layers/ActivationGranite.java` | Activation layer applying embedding scale factor for FP16 |
| `src/main/java/org/beehive/gpullama3/tornadovm/layerplanner/model/fp16/GraniteFP16LayerPlanner.java` | Layer planner coordinating Granite FP16 execution |
| `src/main/java/org/beehive/gpullama3/tokenizer/GraniteTokenizer.java` | GPT-2-style BPE tokenizer with Granite special tokens |
| `src/main/java/org/beehive/gpullama3/model/format/GraniteChatFormat.java` | Chat format using `<\|start_of_role\|>` and `<\|end_of_role\|>` delimiters |
| `src/main/java/org/beehive/gpullama3/model/ModelType.java` | Added GRANITE enum variant with loader integration |
| `src/main/java/org/beehive/gpullama3/model/loader/ModelLoader.java` | Model type detection for Granite by name and metadata |
| `src/main/java/org/beehive/gpullama3/tornadovm/layerplanner/base/QuantizationPlannerFactory.java` | Factory integration for GraniteFP16LayerPlanner |
| `src/main/java/org/beehive/gpullama3/model/format/ChatFormat.java` | Added GraniteTokenizer case to factory method |
| `src/main/java/org/beehive/gpullama3/model/format/LlamaChatFormat.java` | Generalized from LlamaTokenizer to Tokenizer for reusability |
| `Makefile` | Fixed indentation (spaces to tab) on install target |
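The `<|start_of_role|>` / `<|end_of_role|>` delimiters that `GraniteChatFormat` uses can be illustrated with a small sketch. The exact template details (the trailing `<|end_of_text|>` and newline placement, and the open assistant header) are assumptions based on the review summary, not the PR's code.

```java
// Hypothetical sketch of the Granite chat-delimiter scheme described in the
// review summary; the precise template is an assumption, not the PR's code.
final class GraniteChatSketch {
    static String formatTurn(String role, String content) {
        return "<|start_of_role|>" + role + "<|end_of_role|>" + content + "<|end_of_text|>\n";
    }

    static String promptForAssistant(String userMessage) {
        // End with an open assistant role header so generation continues from it.
        return formatTurn("user", userMessage) + "<|start_of_role|>assistant<|end_of_role|>";
    }
}
```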



// @formatter:off
/**
* Transformer Layer Task Flow (LlamaFP16FFNLayers)

Copilot AI Dec 17, 2025


The documentation header incorrectly references LlamaFP16FFNLayers instead of GraniteFP16FFNLayers. This should be updated to accurately reflect the class being documented.

Suggested change
* Transformer Layer Task Flow (LlamaFP16FFNLayers)
* Transformer Layer Task Flow (GraniteFP16FFNLayers)

Comment on lines +46 to +58
Map<String, Integer> specialTokens = new HashMap<>();
for (int i = 0; i < allTokens; i++) {
String token = vocabulary.get(i);
// Identify special tokens by their format: start with <| and end with |>
if (token.startsWith("<|") && token.endsWith("|>")) {
specialTokens.put(token, i);
}
}

// init tokenizer object fields
this.vocabulary = vocabulary;
this.compiledPattern = Pattern.compile(GRANITE_PATTERN);
this.specialTokens = new HashMap<>(specialTokens);

Copilot AI Dec 17, 2025


Potentially confusing name: in `GraniteTokenizer`, the local variable `specialTokens` shadows the field of the same name (referenced as `this.specialTokens`).

Suggested change
Map<String, Integer> specialTokens = new HashMap<>();
for (int i = 0; i < allTokens; i++) {
String token = vocabulary.get(i);
// Identify special tokens by their format: start with <| and end with |>
if (token.startsWith("<|") && token.endsWith("|>")) {
specialTokens.put(token, i);
}
}
// init tokenizer object fields
this.vocabulary = vocabulary;
this.compiledPattern = Pattern.compile(GRANITE_PATTERN);
this.specialTokens = new HashMap<>(specialTokens);
Map<String, Integer> detectedSpecialTokens = new HashMap<>();
for (int i = 0; i < allTokens; i++) {
String token = vocabulary.get(i);
// Identify special tokens by their format: start with <| and end with |>
if (token.startsWith("<|") && token.endsWith("|>")) {
detectedSpecialTokens.put(token, i);
}
}
// init tokenizer object fields
this.vocabulary = vocabulary;
this.compiledPattern = Pattern.compile(GRANITE_PATTERN);
this.specialTokens = new HashMap<>(detectedSpecialTokens);


/**
* Encode text handling special tokens.
*/

Copilot AI Dec 17, 2025


This method overrides Tokenizer.encode; it is advisable to add an Override annotation.

Suggested change
*/
*/
@Override


return IntStream.range(0, bs.size()).boxed().collect(Collectors.toMap(bs::get, cs::get));
}


Copilot AI Dec 17, 2025


This method overrides Tokenizer.regexPattern; it is advisable to add an Override annotation.

Suggested change
@Override


return unifiedLayer;
}


Copilot AI Dec 17, 2025


This method overrides AbstractLayer.configureLayerDataTransfers; it is advisable to add an Override annotation.

Suggested change
@Override

public ImmutableTaskGraph getImmutableTaskGraph() {
return null;
}


Copilot AI Dec 17, 2025


This method overrides AbstractFFNLayers.getFfnLayerTaskGraphs; it is advisable to add an Override annotation.

Suggested change
@Override

- Added `processHeadsParallelGranite` in `GraniteKernels` for efficient multi-head attention processing with parallelism using TornadoVM.
- Updated `GraniteFP16FFNLayers` to utilize the new kernel.
- Extended support for configurable attention scaling.
…ate constructor for enhanced configurability
- Introduced `GraniteKernels` enhancements, adding Q8_0 kernel operations such as `convertQ8_0toFP32withGraniteScale` and fused matrix-vector computation.
- Implemented `GraniteQ8_0FFNLayers` and `Granite8_0LayerPlanner` for layered Q8_0 inference.
- Added `LogitsGraniteQ8_0Layer` to support Granite logits processing with Q8_0 quantization.
- Updated `QuantizationPlannerFactory` to include Q8_0 planning for Granite models.
- Enhanced `ActivationGranite` to handle Q8_0 embedding dequantization using Granite-specific scales.
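The Q8_0 path above can be sketched as a plain-Java dequantization that folds the Granite embedding multiplier into the same pass, loosely modelled on the `convertQ8_0toFP32withGraniteScale` kernel named in the list. The block size of 32 matches GGUF's Q8_0 layout; the class, method name, and signature are illustrative assumptions.

```java
// Illustrative Q8_0 dequantization with the Granite µP multiplier applied in
// the same pass. Q8_0 stores 32 int8 quants per block with one float scale
// per block; names and signature here are assumptions, not the PR's kernel.
final class Q8_0Dequant {
    static final int BLOCK_SIZE = 32; // GGUF Q8_0 block size

    static float[] dequantizeWithGraniteScale(byte[] quants, float[] blockScales, float graniteScale) {
        float[] out = new float[quants.length];
        for (int i = 0; i < quants.length; i++) {
            // value = quant * per-block scale, then the Granite multiplier
            out[i] = quants[i] * blockScales[i / BLOCK_SIZE] * graniteScale;
        }
        return out;
    }
}
```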
@mikepapadim mikepapadim marked this pull request as ready for review December 17, 2025 20:23
@mikepapadim
Member Author

mikepapadim commented Dec 17, 2025

The only thing left is to extend CI with model testing:

  • fp16 - granite3.3
  • fp16 - granite4.0
  • Q8 - granite3.3
  • Q8 - granite4.0

… call with `GraniteKernels::ropeRotationWithCacheCopy` in `GraniteFP16FFNLayers` and add `ropeTheta` to arguments.
…adata handling, and improve pretokenizer logic.
… with `GraniteKernels::ropeRotationWithCacheCopy` in `GraniteQ8_0FFNLayers` and add `ropeTheta` to kernel arguments.
@mikepapadim mikepapadim changed the title [models] Support for IBM Granite Models 3.3+ with FP16 and Q8 [models] Support for IBM Granite Models 3.2, 3.3 & 4.0 with FP16 and Q8 Dec 17, 2025
Collaborator

@orionpapadakis orionpapadakis left a comment


Nice! Just a typo in file/class name

…istency with naming conventions across Q8_0 models.
@mikepapadim mikepapadim merged commit a73a5ff into main Dec 18, 2025
4 checks passed
