@@ -207,9 +207,12 @@ In order to test the inference speed on your machine, you can run the following
 # run the benchmark of PyTorch
 python scripts/benchmark.py
 
-# run the benchmark of vit.cpp
+# run the benchmark of vit.cpp for the non-quantized model
 ./scripts/benchmark.sh
 
+# run the benchmark for quantized models: 4 threads and the quantize flag enabled
+./scripts/benchmark.sh 4 1
+
 Both scripts use 4 threads by default. In Python, the `threadpoolctl` library is used to limit the number of threads used by PyTorch.
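
For reference, here is a minimal, self-contained sketch of that setup. It is an illustration under assumptions, not the repo's actual `scripts/benchmark.py`; the linear layer and random input are placeholders for the real model and image:

```python
# Hedged sketch: cap the native (OpenMP/BLAS) thread pools at 4 threads
# with threadpoolctl while timing a PyTorch forward pass.
import time

import torch
from threadpoolctl import threadpool_limits

model = torch.nn.Linear(768, 1000)  # placeholder for the ViT model
x = torch.randn(1, 768)             # placeholder for a preprocessed image

with threadpool_limits(limits=4):   # limit the backend pools to 4 threads
    with torch.no_grad():
        model(x)                    # warm-up pass
        start = time.perf_counter()
        for _ in range(10):
            model(x)
        avg_ms = (time.perf_counter() - start) / 10 * 1e3

print(f"average forward pass: {avg_ms:.1f} ms")
```

`threadpool_limits` also works as a decorator; the context-manager form shown here restores the previous thread limits on exit.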
 
 ## Quantization
@@ -234,6 +237,32 @@ For example, you can run the following to convert the model to q5_1:
 
 Then you can use `tiny-ggml-model-f16-quant.gguf` just like the model in F16.
 
+### Results
+
+Here are benchmark results for each model size and quantization type on my machine:
+
+| Model | Quantization | Speed (ms) | Mem (MB) |
+| :---: | :----------: | :--------: | :------: |
+| tiny | q4_0 | 100 | 12 |
+| tiny | q4_1 | 102 | 12 |
+| tiny | q5_0 | 116 | 13 |
+| tiny | q5_1 | 112 | 13 |
+| tiny | q8_0 | 92 | 15 |
+| small | q4_0 | 261 | 23 |
+| small | q4_1 | 229 | 24 |
+| small | q5_0 | 291 | 25 |
+| small | q5_1 | 276 | 27 |
+| small | q8_0 | 232 | 33 |
+| base | q4_0 | 714 | 61 |
+| base | q4_1 | 657 | 66 |
+| base | q5_0 | 879 | 71 |
+| base | q5_1 | 838 | 76 |
+| base | q8_0 | 658 | 102 |
+| large | q4_0 | 2189 | 181 |
+| large | q4_1 | 1935 | 199 |
+| large | q5_0 | 2708 | 217 |
+| large | q5_1 | 2560 | 235 |
+| large | q8_0 | 2042 | 325 |
 
 ## To-Do List
 