[models] Support for IBM Granite Models 3.2, 3.3 & 4.0 with FP16 and Q8 (#92)

mikepapadim merged 26 commits into main
Conversation
- Introduced `Granite` model type in `ModelType`.
- Added `forwardGranite` method with µP scaling in `InferenceCore`.
- Implemented token generation methods (`generateTokensGranite`, `generateTokensGPUGranite`) for Granite models.
- Updated `ModelLoader` to detect Granite models via metadata or name.
- Enhanced tokenizer and chat format compatibility with Granite.
- Introduced `GraniteStandardWeights` and `GraniteTornadoWeights` for CPU and TornadoVM GPU weight handling.
- Added `GraniteState` to manage model-specific state during inference.
- Implemented `GraniteTokenizer` with GPT-2-style BPE support adapted for Granite.
- Added `GraniteLoader` to handle model loading, configuration, and weight initialization.
- Created `GraniteConfiguration` to define model-specific parameters and scaling factors (µP parameterization).
…move unused methods.
- Introduced `GraniteKernels` for optimized kernel operations with FP16 support.
- Implemented `GraniteFP16FFNLayers` and `GraniteFP16LayerPlanner` for Transformer-based inference with TornadoVM.
- Added `LogitsGraniteFP16Layer` to support the Granite logits layer.
- Enabled model-specific task graph creation and worker grid configuration tailored for Granite.
- Updated `QuantizationPlannerFactory` to integrate `GraniteFP16LayerPlanner`.
- Extended support for attention scaling, residual connections, and model-specific configurations (e.g., µP scaling).
Pull request overview
This pull request adds support for IBM Granite transformer models, which use maximal update parameterization (µP) with custom scaling factors for embeddings, attention, residuals, and logits. The implementation includes CPU inference, FP16 GPU acceleration via TornadoVM, and a GPT-2-style BPE tokenizer with Granite-specific special tokens.
Key Changes:
- Implemented Granite model architecture with µP scaling factors applied at embedding lookup, attention computation, residual connections, and logit output
- Added `GraniteTokenizer` with BPE encoding/decoding using token ID 0 (`<|end_of_text|>`) as both BOS and EOS
- Created TornadoVM GPU kernels and layers for FP16 inference with Granite-specific scaling operations
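For intuition, the µP scaling points listed above can be sketched in plain Java. This is an illustrative toy, not the PR's implementation: the scale names and values below are placeholders for what `GraniteConfiguration` reads from GGUF metadata, and the "block output" stands in for the real attention/FFN computation.

```java
// Illustrative sketch of where µP scaling enters a Granite forward pass.
// All names and values are placeholders; the real factors come from GGUF
// metadata via GraniteConfiguration.
final class MuPScalingSketch {

    // Multiply a vector by a scalar (used for embedding and logit scaling).
    static float[] scale(float[] v, float s) {
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] * s;
        }
        return out;
    }

    public static void main(String[] args) {
        float embeddingScale = 12.0f;  // applied at embedding lookup
        float residualScale = 0.22f;   // applied to each block's residual branch
        float logitScale = 8.0f;       // applied at the output projection

        float[] embedding = { 0.5f, -0.25f };
        float[] x = scale(embedding, embeddingScale);   // scaled embedding

        float[] blockOut = { 0.1f, 0.2f };              // stand-in attn/FFN output
        for (int i = 0; i < x.length; i++) {
            x[i] += residualScale * blockOut[i];        // scaled residual add
        }

        float[] logits = scale(x, 1.0f / logitScale);   // µP logit scaling
        System.out.printf("%.4f %.4f%n", logits[0], logits[1]);
    }
}
```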
Reviewed changes
Copilot reviewed 20 out of 21 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| src/main/java/org/beehive/gpullama3/model/granite/GraniteConfiguration.java | Configuration record with µP scaling parameters (embedding, residual, attention, logit scales) |
| src/main/java/org/beehive/gpullama3/model/granite/Granite.java | Main model class implementing Granite-specific forward passes |
| src/main/java/org/beehive/gpullama3/inference/state/GraniteState.java | Inference state with identical tensor shapes to LlamaState |
| src/main/java/org/beehive/gpullama3/inference/weights/standard/GraniteStandardWeights.java | CPU weights wrapper extending StandardWeights |
| src/main/java/org/beehive/gpullama3/inference/weights/tornado/GraniteTornadoWeights.java | GPU weights wrapper extending TornadoWeights |
| src/main/java/org/beehive/gpullama3/model/loader/GraniteLoader.java | Model loader with Granite-specific metadata extraction |
| src/main/java/org/beehive/gpullama3/inference/InferenceCore.java | Added forwardGranite method with µP scaling applied at key points |
| src/main/java/org/beehive/gpullama3/inference/InferenceEngine.java | Token generation methods for Granite (CPU and GPU variants) |
| src/main/java/org/beehive/gpullama3/tornadovm/kernels/GraniteKernels.java | GPU kernels for embedding conversion, logits projection, attention with Granite scaling |
| src/main/java/org/beehive/gpullama3/tornadovm/layers/type/fp16/LogitsGraniteFP16Layer.java | Logits layer using Granite-specific scaling kernel |
| src/main/java/org/beehive/gpullama3/tornadovm/layers/type/fp16/GraniteFP16FFNLayers.java | FFN layers with Granite residual scaling |
| src/main/java/org/beehive/gpullama3/tornadovm/layers/ActivationGranite.java | Activation layer applying embedding scale factor for FP16 |
| src/main/java/org/beehive/gpullama3/tornadovm/layerplanner/model/fp16/GraniteFP16LayerPlanner.java | Layer planner coordinating Granite FP16 execution |
| src/main/java/org/beehive/gpullama3/tokenizer/GraniteTokenizer.java | GPT-2-style BPE tokenizer with Granite special tokens |
| src/main/java/org/beehive/gpullama3/model/format/GraniteChatFormat.java | Chat format using <|start_of_role|> and <|end_of_role|> delimiters |
| src/main/java/org/beehive/gpullama3/model/ModelType.java | Added GRANITE enum variant with loader integration |
| src/main/java/org/beehive/gpullama3/model/loader/ModelLoader.java | Model type detection for Granite by name and metadata |
| src/main/java/org/beehive/gpullama3/tornadovm/layerplanner/base/QuantizationPlannerFactory.java | Factory integration for GraniteFP16LayerPlanner |
| src/main/java/org/beehive/gpullama3/model/format/ChatFormat.java | Added GraniteTokenizer case to factory method |
| src/main/java/org/beehive/gpullama3/model/format/LlamaChatFormat.java | Generalized from LlamaTokenizer to Tokenizer for reusability |
| Makefile | Fixed indentation (spaces to tab) on install target |
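As a rough illustration of the `GraniteChatFormat` delimiters noted in the table above, a chat prompt turn can be assembled as below. The precise template (newline and end-of-text placement) is an assumption, not copied from the PR's implementation.

```java
// Illustrative Granite chat prompt assembly using the role delimiters the PR
// describes; the exact template layout is an assumption.
final class GraniteChatSketch {

    static String turn(String role, String content) {
        return "<|start_of_role|>" + role + "<|end_of_role|>" + content + "<|end_of_text|>\n";
    }

    public static void main(String[] args) {
        String prompt = turn("user", "What does µP scaling do?")
                + "<|start_of_role|>assistant<|end_of_role|>"; // model continues here
        System.out.print(prompt);
    }
}
```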
In `GraniteFP16FFNLayers`:

```java
// @formatter:off
/**
 * Transformer Layer Task Flow (LlamaFP16FFNLayers)
```

The documentation header incorrectly references `LlamaFP16FFNLayers` instead of `GraniteFP16FFNLayers`. This should be updated to accurately reflect the class being documented.

Suggested change:

```diff
- * Transformer Layer Task Flow (LlamaFP16FFNLayers)
+ * Transformer Layer Task Flow (GraniteFP16FFNLayers)
```
In `GraniteTokenizer`:

```java
Map<String, Integer> specialTokens = new HashMap<>();
for (int i = 0; i < allTokens; i++) {
    String token = vocabulary.get(i);
    // Identify special tokens by their format: start with <| and end with |>
    if (token.startsWith("<|") && token.endsWith("|>")) {
        specialTokens.put(token, i);
    }
}

// init tokenizer object fields
this.vocabulary = vocabulary;
this.compiledPattern = Pattern.compile(GRANITE_PATTERN);
this.specialTokens = new HashMap<>(specialTokens);
```

Potentially confusing name: the local variable `specialTokens` shadows the `GraniteTokenizer` field of the same name (assigned as `this.specialTokens`).

Suggested change (rename the local to `detectedSpecialTokens`):

```diff
- Map<String, Integer> specialTokens = new HashMap<>();
+ Map<String, Integer> detectedSpecialTokens = new HashMap<>();
  for (int i = 0; i < allTokens; i++) {
      String token = vocabulary.get(i);
      // Identify special tokens by their format: start with <| and end with |>
      if (token.startsWith("<|") && token.endsWith("|>")) {
-         specialTokens.put(token, i);
+         detectedSpecialTokens.put(token, i);
      }
  }
  // init tokenizer object fields
  this.vocabulary = vocabulary;
  this.compiledPattern = Pattern.compile(GRANITE_PATTERN);
- this.specialTokens = new HashMap<>(specialTokens);
+ this.specialTokens = new HashMap<>(detectedSpecialTokens);
```
```java
/**
 * Encode text handling special tokens.
 */
```

This method overrides `Tokenizer.encode`; it is advisable to add an `@Override` annotation.

Suggested change:

```diff
  /**
   * Encode text handling special tokens.
   */
+ @Override
```
```java
return IntStream.range(0, bs.size()).boxed().collect(Collectors.toMap(bs::get, cs::get));
}
```

The following method overrides `Tokenizer.regexPattern`; it is advisable to add an `@Override` annotation.

Suggested change:

```diff
+ @Override
```
```java
return unifiedLayer;
}
```

The following method overrides `AbstractLayer.configureLayerDataTransfers`; it is advisable to add an `@Override` annotation.

Suggested change:

```diff
+ @Override
```
```java
public ImmutableTaskGraph getImmutableTaskGraph() {
    return null;
}
```

The following method overrides `AbstractFFNLayers.getFfnLayerTaskGraphs`; it is advisable to add an `@Override` annotation.

Suggested change:

```diff
+ @Override
```
- Added `processHeadsParallelGranite` in `GraniteKernels` for efficient multi-head attention processing with parallelism using TornadoVM.
- Updated `GraniteFP16FFNLayers` to utilize the new kernel.
- Extended support for configurable attention scaling.
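The configurable attention scaling mentioned above amounts to replacing the usual 1/sqrt(headDim) softmax scaling with a model-supplied constant. The sketch below is a plain-Java, single-head illustration of that idea only; it is not the TornadoVM kernel, and `attentionScale` is a placeholder for the configured value.

```java
// Single-head toy: configurable attention scaling applied to query-key
// scores before softmax. attentionScale stands in for the model's configured
// value (standard attention would use 1/sqrt(headDim) instead).
final class AttentionScaleSketch {

    // Numerically stable softmax over a score vector.
    static float[] softmax(float[] scores) {
        float max = Float.NEGATIVE_INFINITY;
        for (float s : scores) {
            max = Math.max(max, s);
        }
        float[] out = new float[scores.length];
        float sum = 0f;
        for (int i = 0; i < scores.length; i++) {
            out[i] = (float) Math.exp(scores[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    // Attention weights of one query over cached keys, with a configurable scale.
    static float[] headWeights(float[] q, float[][] keys, float attentionScale) {
        float[] scores = new float[keys.length];
        for (int t = 0; t < keys.length; t++) {
            float dot = 0f;
            for (int d = 0; d < q.length; d++) {
                dot += q[d] * keys[t][d];
            }
            scores[t] = dot * attentionScale; // configurable scale, not 1/sqrt(dim)
        }
        return softmax(scores);
    }

    public static void main(String[] args) {
        float[] w = headWeights(new float[]{1f, 0f}, new float[][]{{1f, 0f}, {0f, 1f}}, 0.5f);
        System.out.printf("%.4f %.4f%n", w[0], w[1]);
    }
}
```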
… `GraniteFP16FFNLayers`
…ranite 3.3 collection
…istency in environment variable naming
…logits initialization
…ate constructor for enhanced configurability
- Introduced `GraniteKernels` enhancements, adding Q8_0 kernel operations such as `convertQ8_0toFP32withGraniteScale` and fused matrix-vector computation.
- Implemented `GraniteQ8_0FFNLayers` and `GraniteQ8_0LayerPlanner` for layered Q8_0 inference.
- Added `LogitsGraniteQ8_0Layer` to support Granite logits processing with Q8_0 quantization.
- Updated `QuantizationPlannerFactory` to include Q8_0 planning for Granite models.
- Enhanced `ActivationGranite` to handle Q8_0 embedding dequantization using Granite-specific scales.
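Conceptually, a kernel like `convertQ8_0toFP32withGraniteScale` combines standard Q8_0 block dequantization with the Granite µP scale in a single pass. The following is a hedged plain-Java sketch of that idea, not the actual kernel: the real Q8_0 format stores an fp16 scale per 32-value block, while a float scale is used here for simplicity.

```java
// Toy Q8_0 dequantization fused with a Granite scale, sketching what a kernel
// like convertQ8_0toFP32withGraniteScale conceptually does. Real Q8_0 stores
// an fp16 scale per 32-value block; float is used here for simplicity.
final class Q8_0DequantSketch {
    static final int BLOCK = 32;

    static float[] dequantize(byte[] q, float[] blockScales, float graniteScale) {
        float[] out = new float[q.length];
        for (int i = 0; i < q.length; i++) {
            float blockScale = blockScales[i / BLOCK];
            out[i] = q[i] * blockScale * graniteScale; // dequant + µP scale in one pass
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] q = new byte[]{ 2, -4, 64 };
        float[] scales = new float[]{ 0.5f };
        float[] fp32 = dequantize(q, scales, 12.0f);
        System.out.printf("%.1f %.1f %.1f%n", fp32[0], fp32[1], fp32[2]);
    }
}
```

Fusing the Granite scale into the dequantization loop avoids a second pass over the activations, which is the likely motivation for a combined kernel.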
The only thing missing is to extend CI with model testing.
… to eliminate hardcoded values.
…omputation and fused cache writes
…ng, and streamline tensor-loading logic.
… call with `GraniteKernels::ropeRotationWithCacheCopy` in `GraniteFP16FFNLayers` and add `ropeTheta` to arguments.
…adata handling, and improve pretokenizer logic.
… with `GraniteKernels::ropeRotationWithCacheCopy` in `GraniteQ8_0FFNLayers` and add `ropeTheta` to kernel arguments.
… Granite 4.0 models
…ow step to run Granite-3.2-2b-instruct-f16.gguf during CI
…Q8_0.gguf during CI
…` and `output` writes.
orionpapadakis left a comment:

Nice! Just a typo in the file/class name.
...in/java/org/beehive/gpullama3/tornadovm/layerplanner/model/q8_0/GraniteQ8_0LayerPlanner.java
…istency with naming conventions across Q8_0 models.
This PR adds first-class support for IBM Granite transformer models (v3.2 and above) in GPULlama3.java, enabling both CPU inference and GPU-accelerated execution with FP16 (and preliminary Q8) support. The implementation fully integrates Granite into the existing inference pipeline, tokenizer, and TornadoVM execution path.
Key Additions
Tested with: https://huggingface.co/ibm-research/granite-3.2-2b-instruct-GGUF/tree/1fffc80d42091ac8652479fcceefa82c02937187
Example output: