Commit fff273f

Merge pull request #85 from beehive-lab/docs-upd-read
Update README to enhance TornadoVM performance section and clarify GP…
2 parents 8422058 + f6d8137

File tree: 1 file changed — README.md (33 additions, 36 deletions)
```diff
@@ -51,42 +51,6 @@ GPULlama3ChatModel model = GPULlama3ChatModel.builder()
 #### **[Interactive-mode]** Running on a RTX 5090 with nvtop on bottom to track GPU utilization and memory usage.
 
 ![Demo](docs/inter-output.gif)
------------
-#### **[Instruct-mode]** Running on a RTX 5090
-
-![Demo](docs/intruct-output.gif)
----------
-
-### TornadoVM-Accelerated Inference Performance and Optimization Status
-
-We are at the early stages of Java entering the AI world with features added to the JVM that enable faster execution such as GPU acceleration, Vector acceleration, high-performance access to off-heap memory and others.
-<br><br>This repository provides the first Java-native implementation of Llama3 that automatically compiles and executes Java code on GPUs via TornadoVM.
-The baseline numbers presented below provide a solid starting point for achieving more competitive performance compared to llama.cpp or native CUDA implementations.
-[Our roadmap](https://github.com/beehive-lab/GPULlama3.java/blob/main/docs/GPULlama3_ROADMAP.md) provides the upcoming set of features that will dramatically improve the numbers below with the clear target being to achieve performance parity with the fastest implementations.
-<br><br>
-If you achieve additional performance data points (e.g. new hardware or platforms) please let us know to add them below.
-<br><br>
-In addition, if you are interested to learn more about the challenges of managed programming languages and GPU acceleration, you can read [our book](https://link.springer.com/book/10.1007/978-3-031-49559-5) or consult the [TornadoVM educational pages](https://www.tornadovm.org/resources).
-
-
-| Vendor / Backend             | Hardware  | Llama-3.2-1B-Instruct | Llama-3.2-3B-Instruct | Optimizations |
-|:----------------------------:|:---------:|:---------------------:|:---------------------:|:-------------:|
-|                              |           | **FP16**              | **FP16**              | **Support**   |
-| **NVIDIA / OpenCL-PTX**      | RTX 3070  | 52 tokens/s           | 22.96 tokens/s        ||
-|                              | RTX 4090  | 66.07 tokens/s        | 35.51 tokens/s        ||
-|                              | RTX 5090  | 96.65 tokens/s        | 47.68 tokens/s        ||
-|                              | L4 Tensor | 52.96 tokens/s        | 22.68 tokens/s        ||
-| **Intel / OpenCL**           | Arc A770  | 15.65 tokens/s        | 7.02 tokens/s         | (WIP) |
-| **Apple Silicon / OpenCL**   | M3 Pro    | 14.04 tokens/s        | 6.78 tokens/s         | (WIP) |
-|                              | M4 Pro    | 16.77 tokens/s        | 8.56 tokens/s         | (WIP) |
-| **AMD / OpenCL**             | Radeon RX | (WIP)                 | (WIP)                 | (WIP) |
-
-##### ⚠️ Note on Apple Silicon Performance
-
-TornadoVM currently runs on Apple Silicon via [OpenCL](https://developer.apple.com/opencl/), which has been officially deprecated since macOS 10.14.
-
-Despite being deprecated, OpenCL can still run on Apple Silicon; albeit, with older drivers which do not support all optimizations of TornadoVM. Therefore, the performance is not optimal since TornadoVM does not have a Metal backend yet (it currently has OpenCL, PTX, and SPIR-V backends). We recommend using Apple silicon for development and for performance testing to use OpenCL/PTX compatible Nvidia GPUs for the time being (until we add a Metal backend to TornadoVM and start optimizing it).
-
 
 -----------
 
```
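The tables in this diff report decode throughput in tokens/s. For context, such a figure is simply the number of generated tokens divided by the wall-clock generation time; the sketch below is illustrative only and is not code from this repository (the class and method names are made up):

```java
public class ThroughputExample {
    /** Decode throughput: generated tokens divided by elapsed wall-clock seconds. */
    static double tokensPerSecond(int generatedTokens, long elapsedNanos) {
        return generatedTokens / (elapsedNanos / 1_000_000_000.0);
    }

    public static void main(String[] args) {
        // Hypothetical run: 256 tokens generated in 2.5 s of decoding.
        System.out.println(tokensPerSecond(256, 2_500_000_000L)); // 102.4
    }
}
```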

````diff
@@ -159,6 +123,39 @@ make
 ./llama-tornado --gpu --verbose-init --opencl --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "tell me a joke"
 ```
 
+
+----------
+
+### TornadoVM-Accelerated Inference Performance and Optimization Status
+
+We are at the early stages of Java entering the AI world with features added to the JVM that enable faster execution such as GPU acceleration, Vector acceleration, high-performance access to off-heap memory and others.
+<br><br>This repository provides the first Java-native implementation of Llama3 that automatically compiles and executes Java code on GPUs via TornadoVM.
+The baseline numbers presented below provide a solid starting point for achieving more competitive performance compared to llama.cpp or native CUDA implementations.
+[Our roadmap](https://github.com/beehive-lab/GPULlama3.java/blob/main/docs/GPULlama3_ROADMAP.md) provides the upcoming set of features that will dramatically improve the numbers below with the clear target being to achieve performance parity with the fastest implementations.
+<br><br>
+If you achieve additional performance data points (e.g. new hardware or platforms) please let us know to add them below.
+<br><br>
+In addition, if you are interested to learn more about the challenges of managed programming languages and GPU acceleration, you can read [our book](https://link.springer.com/book/10.1007/978-3-031-49559-5) or consult the [TornadoVM educational pages](https://www.tornadovm.org/resources).
+
+
+| Vendor / Backend             | Hardware  | Llama-3.2-1B-Instruct | Llama-3.2-3B-Instruct | Optimizations |
+|:----------------------------:|:---------:|:---------------------:|:---------------------:|:-------------:|
+|                              |           | **FP16**              | **FP16**              | **Support**   |
+| **NVIDIA / OpenCL-PTX**      | RTX 3070  | 66 tokens/s           | 55.46 tokens/s        ||
+|                              | RTX 4090  | 86.11 tokens/s        | 75.32 tokens/s        ||
+|                              | RTX 5090  | 117.65 tokens/s       | 112.68 tokens/s       ||
+|                              | L4 Tensor | 52.96 tokens/s        | 22.68 tokens/s        ||
+| **Intel / OpenCL**           | Arc A770  | 15.65 tokens/s        | 7.02 tokens/s         | (WIP) |
+| **Apple Silicon / OpenCL**   | M3 Pro    | 14.04 tokens/s        | 6.78 tokens/s         | (WIP) |
+|                              | M4 Pro    | 16.77 tokens/s        | 8.56 tokens/s         | (WIP) |
+| **AMD / OpenCL**             | Radeon RX | (WIP)                 | (WIP)                 | (WIP) |
+
+##### ⚠️ Note on Apple Silicon Performance
+
+TornadoVM currently runs on Apple Silicon via [OpenCL](https://developer.apple.com/opencl/), which has been officially deprecated since macOS 10.14.
+
+Despite being deprecated, OpenCL can still run on Apple Silicon; albeit, with older drivers which do not support all optimizations of TornadoVM. Therefore, the performance is not optimal since TornadoVM does not have a Metal backend yet (it currently has OpenCL, PTX, and SPIR-V backends). We recommend using Apple silicon for development and for performance testing to use OpenCL/PTX compatible Nvidia GPUs for the time being (until we add a Metal backend to TornadoVM and start optimizing it).
+
 -----------
 ## 📦 Maven Dependency
 
````
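Beyond relocating the section, the substance of this commit is the refreshed NVIDIA rows (the L4, Intel, and Apple figures are unchanged between the removed and added tables). A quick arithmetic check of the improvements, as an illustrative sketch using the before/after values from the two hunks (the class is mine, not from the repository):

```java
public class SpeedupCheck {
    /** Relative improvement between two throughput measurements, as a multiplier. */
    static double speedup(double beforeTokensPerSec, double afterTokensPerSec) {
        return afterTokensPerSec / beforeTokensPerSec;
    }

    public static void main(String[] args) {
        // Values taken from the removed vs. added tables in this diff.
        System.out.printf("RTX 5090, 1B FP16: %.2fx%n", speedup(96.65, 117.65)); // ≈ 1.22x
        System.out.printf("RTX 3070, 3B FP16: %.2fx%n", speedup(22.96, 55.46)); // ≈ 2.42x
    }
}
```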