Commit af2036a

Correct typos in README & add threadpoolctl for PT benchmark

1 parent 32bae2e commit af2036a

1 file changed: README.md (12 additions & 14 deletions)
@@ -5,7 +5,7 @@ Inference Vision Transformer (ViT) in plain C/C++ using ggml without any extra d
 ## Description
 
 
-This project presents a standalone implementation of the well known Vision Transformer (ViT) model family, used in a broad spectrum of applications and SOTA models like Large Multimodal Modls(LMM). The primary goal is to develop a C/C++ inference engine tailored for ViT models, utilizing [ggml](https://github.com/ggerganov/ggml) to enhance performance, particularly on edge devices. Designed to be both lightweight and self-contained, this implementation can be run across diverse platforms.
+This project presents a standalone implementation of the well known Vision Transformer (ViT) model family, used in a broad spectrum of applications and SOTA models like Large Multimodal Models(LMM). The primary goal is to develop a C/C++ inference engine tailored for ViT models, utilizing [ggml](https://github.com/ggerganov/ggml) to enhance performance, particularly on edge devices. Designed to be both lightweight and self-contained, this implementation can be run across diverse platforms.
 
 <details>
 <summary>Table of Contents</summary>
@@ -35,8 +35,7 @@ This project presents a standalone implementation of the well known Vision Trans
 - 4-bit, 5-bit and 8-bit quantization support.
 - Support for timm ViTs with different variants out of the box.
 
-`vit.cpp` also has a short startup time compared to large ML frameworks, which makes it suitable for serverless deployments where the cold start is an issue.
-
+An important aspect of using `vit.cpp` is that it has short startup times compared to common DL frameworks, which makes it suitable for serverless deployments where the cold start is an issue.
 
 ## Vision Transformer architecture
 
@@ -106,7 +105,7 @@ The implemented architecture is based on the original Vision Transformer from:
 # install torch and timm
 pip install torch timm
 
-# list available models if needed, note that not all models are supported
+# list available models if needed; note that not all models are supported
 python convert-pth-to-ggml.py --list
 
 # convert the weights to gguf : vit tiny with patch size of 16 and an image
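
The `--list` option above enumerates the timm checkpoints the convert script knows about. For orientation, the underlying inventory can also be pulled directly from timm; this is only an illustrative sketch assuming a recent timm release, not part of the repo:

```python
# Sketch: list pretrained timm ViT variants matching the naming scheme used
# in this README (illustrative only; the convert script supports a subset).
import timm

names = timm.list_models("vit_*", pretrained=True)
print(f"{len(names)} pretrained ViT variants in timm")

# e.g. the benchmark models below follow this pattern:
for n in names:
    if "patch16_224.augreg_in21k_ft_in1k" in n:
        print(" ", n)
```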
@@ -186,12 +185,12 @@ You can efficiently run ViT inference on the CPU.
 Memory requirements and inference speed on AMD Ryzen 7 3700U(4 cores, 8 threads) for both native PyTorch and `vit.cpp`.
 Using 4 threads gives better results for my machine. The reported results of inference speed correspond to 10 runs averages for both PyTorch and `vit.cpp`.
 
-| Model | Mem(PyTorch) | Mem | Speed(PyTorch) | Speed |
+| Model | Max Mem(PyTorch) | Max Mem | Speed(PyTorch) | Speed |
 | :----: | :-----------: | :------------: | :------------: | :------------: |
 | tiny | ~780 MB | **~20 MB** | 431 ms | **120 ms** |
 | small | ~965 MB | **~52 MB** | 780 ms | **463 ms** |
-| base | ~1609 MB | **~179 MB** | 2393 ms | **1441 ms** |
-| large | ~3865 MB | **~597 MB** | 8151 ms | **4892 ms** |
+| base | ~1.61 GB | **~179 MB** | 2393 ms | **1441 ms** |
+| large | ~3.86 GB | **~597 MB** | 8151 ms | **4892 ms** |
 
 > **Note:** The models used are of the form `vit_{size}_patch16_224.augreg_in21k_ft_in1k`.
 
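
The hunk above relabels the memory columns as peak usage. A minimal sketch of how such numbers can be measured on the PyTorch side (10-run mean latency, peak RSS via `memory_profiler`); this is a stand-in under stated assumptions, not the repo's `scripts/benchmark.py`:

```python
# Sketch: 10-run mean latency and peak memory for a timm ViT on CPU.
# Assumes torch, timm and memory_profiler are installed.
import time

import timm
import torch
from memory_profiler import memory_usage

model = timm.create_model("vit_tiny_patch16_224.augreg_in21k_ft_in1k",
                          pretrained=False).eval()
x = torch.randn(1, 3, 224, 224)

def run():
    with torch.inference_mode():
        model(x)

run()  # warm-up run, excluded from the average
t0 = time.perf_counter()
for _ in range(10):
    run()
print(f"mean latency: {(time.perf_counter() - t0) / 10 * 1000:.0f} ms")

# memory_usage samples process RSS (MiB) while run() executes
print(f"peak memory: {max(memory_usage((run, (), {}))):.0f} MiB")
```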
@@ -202,9 +201,10 @@ In order to test the inference speed on your machine, you can run the following
 
 chmod +x scripts/benchmark.*
 
+# install memory_profiler & threadpoolctl
+pip install memory_profiler threadpoolctl
+
 # run the benchmark of PyTorch
-# install memory_profiler first
-pip install memory_profiler
 python scripts/benchmark.py
 
 # run the benchmark of vit.cpp
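
This is what the commit title adds: `threadpoolctl` lets the PyTorch benchmark actually hold to 4 threads rather than whatever the OpenMP/BLAS pools default to. A minimal sketch of that control (the real `scripts/benchmark.py` may differ):

```python
# Sketch: cap the PyTorch benchmark at 4 threads.
import torch
from threadpoolctl import threadpool_limits

torch.set_num_threads(4)  # torch's own intra-op thread pool

# threadpool_limits additionally caps native OpenMP/BLAS pools in-process
with threadpool_limits(limits=4):
    with torch.inference_mode():
        torch.mm(torch.randn(1024, 1024), torch.randn(1024, 1024))
```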
@@ -215,10 +215,8 @@ Both scripts use 4 threads by default. In Python, the `threadpoolctl` library is
 ## Quantization
 
 
-`vit.cpp` supports q4_0, q4_1, q5_0, q5_1 and q8_0 quantization types.
-You can quantize a model in f32 (recommended) or f16 to one of these types by using the `./bin/quantize` binary.
-
-
+`vit.cpp` supports many quantization strategies from ggml such as q4_0, q4_1, q5_0, q5_1 and q8_0 types.
+You can quantize a model in F32 (the patch embedding is in F16) to one of these types by using the `./bin/quantize` binary.
 ```
 usage: ./bin/quantize /path/to/ggml-model-f32.gguf /path/to/ggml-model-quantized.gguf type
 type = 2 - q4_0
@@ -234,7 +232,7 @@ For example, you can run the following to convert the model to q5_1:
 ./bin/quantize ../tiny-ggml-model-f16.gguf ../tiny-ggml-model-f16-quant.gguf 7
 ```
 
-Now you can use `tiny-ggml-model-f16-quant.gguf` just like the model in F16.
+Then you can use `tiny-ggml-model-f16-quant.gguf` just like the model in F16.
 
 
 ## To-Do List
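
For batch use, the quantize binary is easy to drive from Python. In this sketch the id-to-name table extrapolates from the two ids the README confirms (2 = q4_0, 7 = q5_1) following ggml's `ggml_type` enum; treat the other entries as assumptions and check them against the `./bin/quantize` usage output:

```python
# Sketch: quantize one model to several ggml types via the CLI shown above.
# Only ids 2 (q4_0) and 7 (q5_1) are confirmed by the README; 3, 6 and 8 are
# assumed from ggml's ggml_type enum.
import subprocess

QUANT_TYPES = {2: "q4_0", 3: "q4_1", 6: "q5_0", 7: "q5_1", 8: "q8_0"}

src = "../tiny-ggml-model-f16.gguf"  # illustrative path
for type_id, name in QUANT_TYPES.items():
    dst = f"../tiny-ggml-model-{name}.gguf"
    subprocess.run(["./bin/quantize", src, dst, str(type_id)], check=True)
```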
