docs/backend/OPENVINO.md
Lines changed: 4 additions & 3 deletions
@@ -73,16 +73,17 @@ The OpenVINO backend can be configured using the following environment variables
 
 | Variable | Description |
 |--------|-------------|
-| `GGML_OPENVINO_DEVICE` | Specify the target device (`CPU`, `GPU`, `NPU`). If not set, the backend automatically selects the first available device in priority order: **GPU → CPU → NPU**. When set to `NPU`, static compilation mode is enabled for optimal performance. |
+| `GGML_OPENVINO_DEVICE` | Specify the target device (`CPU`, `GPU`, `NPU`). When set to `NPU`, static compilation mode is enabled for optimal performance. |
 | `GGML_OPENVINO_CACHE_DIR` | Directory for OpenVINO model caching (recommended: `/tmp/ov_cache`). Enables model caching when set. **Not supported on NPU devices.** |
-| *`GGML_OPENVINO_STATEFUL_EXECUTION` | Enable stateful execution for better performance |
+| `GGML_OPENVINO_STATEFUL_EXECUTION` | Enable stateful execution for better performance |
 
-*`GGML_OPENVINO_STATEFUL_EXECUTION` is an **Experimental** feature to allow stateful execution for managing the KV cache internally inside the OpenVINO model, improving performance on CPUs and GPUs. Stateful execution is not effective on NPUs, and not all models currently support this feature. This feature is experimental and has been validated only with the llama-simple, llama-cli, llama-bench, and llama-run applications and is recommended to enable for the best performance. Other applications, such as llama-server and llama-perplexity, are not yet supported.
+> [!NOTE]
+> `GGML_OPENVINO_STATEFUL_EXECUTION` is an **Experimental** feature to allow stateful execution for managing the KV cache internally inside the OpenVINO model, improving performance on CPUs and GPUs. Stateful execution is not effective on NPUs, and not all models currently support this feature. This feature is experimental and has been validated only with the llama-simple, llama-cli, llama-bench, and llama-run applications and is recommended to enable for the best performance. Other applications, such as llama-server and llama-perplexity, are not yet supported.
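The constraints in the note (CPU or GPU only, and only the validated applications) could be encoded in a small wrapper. The following is a minimal sketch and not part of this change. The function names `want_stateful` and `configure_stateful` are hypothetical, and the value `1` is an assumption, since the documentation does not state which value enables the feature.

```shell
#!/bin/sh
# Sketch only: decide whether to enable stateful execution for a given
# device and application, following the constraints in the note above.
# ASSUMPTION: setting the variable to "1" enables the feature; the
# documentation does not state the expected value.

# Applications the note lists as validated with stateful execution.
VALIDATED_APPS="llama-simple llama-cli llama-bench llama-run"

# want_stateful DEVICE APP
# Returns 0 when stateful execution is expected to help, 1 otherwise.
want_stateful() {
    device=$1
    app=$2
    # The note says stateful execution is not effective on NPUs.
    case "$device" in
        CPU|GPU) ;;
        *) return 1 ;;
    esac
    # Only the validated applications are treated as supported.
    for candidate in $VALIDATED_APPS; do
        [ "$candidate" = "$app" ] && return 0
    done
    return 1
}

# configure_stateful DEVICE APP
# Exports or unsets the variable according to want_stateful.
configure_stateful() {
    if want_stateful "$1" "$2"; then
        export GGML_OPENVINO_STATEFUL_EXECUTION=1
    else
        unset GGML_OPENVINO_STATEFUL_EXECUTION
    fi
}
```

Calling `configure_stateful GPU llama-cli` before launching the binary would export the variable, while `configure_stateful NPU llama-cli` or `configure_stateful CPU llama-server` would leave it unset.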
docs/build.md
Lines changed: 18 additions & 16 deletions
@@ -735,7 +735,7 @@ To read documentation for how to build on IBM Z & LinuxONE, [click here](./build
 
 ## OpenVINO
 
-[OpenVINO](https://docs.openvino.ai/2025/index.html) is an open-source toolkit for optimizing and deploying high-performance AI inference, specifically designed for Intel hardware, including CPUs, GPUs, and NPUs, in the cloud, on-premises, and on the edge.
+[OpenVINO](https://docs.openvino.ai/) is an open-source toolkit for optimizing and deploying high-performance AI inference, specifically designed for Intel hardware, including CPUs, GPUs, and NPUs, in the cloud, on-premises, and on the edge.
 The OpenVINO backend enhances performance by leveraging hardware-specific optimizations and can be enabled for use with llama.cpp.
 
 Follow the instructions below to install OpenVINO runtime and build llama.cpp with OpenVINO support. For more detailed information on OpenVINO backend, refer to [OPENVINO.md](backend/OPENVINO.md)
@@ -753,12 +753,11 @@ Follow the instructions below to install OpenVINO runtime and build llama.cpp wi
-Select "Desktop development with C++" under workloads
+- Download Microsoft.VisualStudio.2022.BuildTools: [Visual_Studio_Build_Tools](https://aka.ms/vs/17/release/vs_BuildTools.exe) and select "Desktop development with C++" under workloads
 - Install git
 - Install OpenCL with vcpkg
 ```powershell
@@ -768,7 +767,8 @@ Follow the instructions below to install OpenVINO runtime and build llama.cpp wi
 bootstrap-vcpkg.bat
 vcpkg install opencl
 ```
-- Use "x64 Native Tools Command Prompt" for Build
+> [!NOTE]
+> Use `x64 Native Tools Command Prompt` for Windows build.
 When using the OpenVINO backend, the first inference token may have slightly higher latency due to on-the-fly conversion to the OpenVINO graph. Subsequent tokens and runs will be faster.
 
-```bash
-# If device is unset or unavailable, default to CPU.
+```
+# Linux
+# If device is unset or unavailable, defaults to CPU.
 export GGML_OPENVINO_DEVICE=GPU
 ./build/ReleaseOV/bin/llama-simple -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -n 50 "The story of AI is "
+
+# Windows Command Line
+set GGML_OPENVINO_DEVICE=GPU
+# Windows PowerShell
+$env:GGML_OPENVINO_DEVICE = "GPU"
+
+build\ReleaseOV\bin\llama-simple.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf" -n 50 "The story of AI is "
 ```
 
 To run in chat mode:
@@ -846,15 +854,9 @@ To run in chat mode:
 
 Control OpenVINO behavior using these environment variables:
 
-- **`GGML_OPENVINO_DEVICE`**: Specify the target device for OpenVINO inference. If not set, automatically selects the first available device in priority order: GPU, CPU, NPU. When set to `NPU` to use Intel NPUs, it enables static compilation mode for optimal performance.
-- **`GGML_OPENVINO_CACHE_DIR`**: Directory for model caching (recommended: `/tmp/ov_cache`). If set, enables model caching in OpenVINO. Note: Not supported when using NPU devices yet.
-- **`GGML_OPENVINO_PROFILING`**: Enable execution time profiling.
-- **`GGML_OPENVINO_DUMP_CGRAPH`**: Save compute graph to `cgraph.txt`.
-- **`GGML_OPENVINO_DUMP_IR`**: Export OpenVINO IR files with timestamps.
-
 | Variable | Description |
 |--------|-------------|
-| `GGML_OPENVINO_DEVICE` | Specify the target device for OpenVINO inference. If not set, automatically selects the first available device in priority order: GPU, CPU, NPU. When set to `NPU` to use Intel NPUs, it enables |
+| `GGML_OPENVINO_DEVICE` | Specify the target device for OpenVINO inference. When set to `NPU`, static compilation mode is enabled for optimal performance. |
 | `GGML_OPENVINO_CACHE_DIR` | Directory for OpenVINO model caching (recommended: `/tmp/ov_cache`). If set, enables model caching in OpenVINO. Note: Not supported when using NPU devices yet. |
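Since the table documents model caching as unsupported on NPU devices, a launcher script might set the two variables together. This is a minimal sketch and not part of this change. The helper name `configure_cache` is hypothetical, and creating the directory up front with `mkdir -p` is an assumption, because the documentation does not say whether the backend creates it.

```shell
#!/bin/sh
# Sketch only: select the device and enable OpenVINO model caching,
# skipping the cache when the target device is NPU.

# configure_cache DEVICE [DIR]
# DIR defaults to /tmp/ov_cache, the location recommended in the table.
configure_cache() {
    device=$1
    dir=${2:-/tmp/ov_cache}
    export GGML_OPENVINO_DEVICE="$device"
    if [ "$device" = "NPU" ]; then
        # Model caching is documented as not supported on NPU devices.
        unset GGML_OPENVINO_CACHE_DIR
        return 0
    fi
    # ASSUMPTION: the directory is created here because the docs do not
    # state whether the backend creates it on demand.
    mkdir -p "$dir" || return 1
    export GGML_OPENVINO_CACHE_DIR="$dir"
}
```

Running `configure_cache GPU` before the `llama-simple` command would export both variables, while `configure_cache NPU` would export only the device.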