Replies: 1 comment
-
I've run several tests using unified memory, and it is so slow that I don't think it makes sense to investigate this issue further. It was faster just to move some of the layers to the CPU.
-
My laptop has a 7700HQ CPU and an NVIDIA GTX 1050 GPU, with 32GB of main RAM and 4GB of VRAM.
I'm loading the Mistral-Nemo-Instruct-2407-Q4_0 model using the following params.
I'm analyzing the total memory usage (RAM + VRAM) for different combinations of -ngl and --no-warmup:
-ngl = 0: after warm-up, ~10GB RAM + ~0GB VRAM = ~10GB total.
-ngl = 0 and --no-warmup: ~10GB RAM + ~0GB VRAM = ~10GB total.
-ngl = 128: after warm-up, ~10GB RAM + ~4GB VRAM = ~14GB total.
-ngl = 19: after warm-up, ~10GB RAM + ~4GB VRAM = ~14GB total.
-ngl = 18: after warm-up, ~6GB RAM + ~4GB VRAM = ~10GB total.
-ngl = 128 and --no-warmup: before warming up, ~6GB RAM + ~4GB VRAM = ~10GB total; after warming up, the result is again ~10GB RAM + ~4GB VRAM = ~14GB total.
NB: -ngl = 14 is the maximum number of layers that fit entirely into the 4GB of VRAM. Everything above 14 needs UMA.
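To make the comparison easier to follow, here is a minimal Python sketch that tabulates the approximate figures listed above and reports which configurations exceed the CPU-only total. The numbers are copied from my measurements; the names (`MEASUREMENTS`, `totals`, `excess_over_baseline`, `BASELINE_GB`) are made up for this illustration and are not part of llama.cpp.

```python
# Approximate measurements from the list above, in GB.
# Each row: (ngl, warmed_up, system_ram_gb, vram_gb)
MEASUREMENTS = [
    (0, True, 10, 0),
    (0, False, 10, 0),     # --no-warmup
    (18, True, 6, 4),
    (19, True, 10, 4),
    (128, False, 6, 4),    # --no-warmup, before warming up
    (128, True, 10, 4),
]

# Assumption: the CPU-only run (-ngl = 0) is used as the reference total.
BASELINE_GB = 10


def totals(measurements):
    """Return (ngl, warmed_up, ram + vram) for every row."""
    return [(ngl, warmed, ram + vram) for ngl, warmed, ram, vram in measurements]


def excess_over_baseline(measurements, baseline_gb):
    """Map (ngl, warmed_up) to the GB above the baseline, for rows that exceed it."""
    return {
        (ngl, warmed): total - baseline_gb
        for ngl, warmed, total in totals(measurements)
        if total > baseline_gb
    }


for ngl, warmed, total in totals(MEASUREMENTS):
    state = "warmed up" if warmed else "not warmed up"
    print(f"-ngl {ngl:>3} ({state}): ~{total}GB total")

print(excess_over_baseline(MEASUREMENTS, BASELINE_GB))
```

With these figures, only the warmed-up runs at -ngl 19 and -ngl 128 exceed the baseline, each by about 4GB, which happens to equal the card's VRAM size. The sketch only restates the arithmetic; it does not explain where the extra memory comes from.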
Why is the total memory usage 14GB in some cases (when trying to allocate significantly more VRAM than is actually available), while in the rest of the cases it is 10GB? Shouldn't it always be roughly constant?
Why is there a sudden spike in RAM usage (from 6GB to 10GB) between -ngl 18 and -ngl 19 after warming up, given that the logs show a difference of only a few hundred MB? In other words, this one layer is not dramatically larger than the others. Also, from -ngl 19 to -ngl 128 (this model has 41 layers) the total usage is always 14GB.
Any ideas what I'm doing wrong?
Update: A few observations from today's testing.