From f58e9ea432639fce0622d5b27a33885d944e975f Mon Sep 17 00:00:00 2001
From: Dhruvizzle
Date: Fri, 26 Jun 2026 14:57:48 -0400
Subject: [PATCH 1/2] docs: add Apple A-chipsets model size estimation guide

---
 docs/apple-chipsets-model-size-guide.md | 105 ++++++++++++++++++++++++
 1 file changed, 105 insertions(+)
 create mode 100644 docs/apple-chipsets-model-size-guide.md

diff --git a/docs/apple-chipsets-model-size-guide.md b/docs/apple-chipsets-model-size-guide.md
new file mode 100644
index 000000000000..706927d9fe0d
--- /dev/null
+++ b/docs/apple-chipsets-model-size-guide.md
@@ -0,0 +1,105 @@
+# Apple A-chipsets: How to Estimate a Suitable Model Size for llama.cpp
+
+## Unified Memory Architecture
+
+On a standard PC, the CPU and GPU have separate memory pools. There may be 16GB
+of RAM and an NVIDIA GPU with 8GB of VRAM. For a Large Language Model to run
+entirely on the GPU, it must fit in that 8GB of VRAM. This is the hard limit for
+full GPU inference: the 8GB GPU can only use 8GB, no matter how much RAM
+is installed.
+
+Apple Silicon, on the other hand, uses Unified Memory Architecture (UMA), where
+the CPU and GPU share the same memory pool. On a 16GB MacBook Air, the OS
+typically uses around 3-4 GB, leaving roughly 12 GB for your model. llama.cpp
+uses Apple's Metal framework to run inference on the GPU, taking advantage of
+the full unified memory pool. This means a Mac user can effectively use more
+memory for AI than someone with a PC running a cheap GPU with only 4-8 GB of
+dedicated VRAM.
+
+## Model Size Formula
+
+The size of a model in memory depends on two things: the number of parameters
+and how many bytes each one uses. The formula is simple:
+
+Model size in GB = (parameters x bytes per weight) / 1,000,000,000
+
+For example, a 7 billion parameter model at F16 uses 2 bytes per weight, so it
+requires roughly 14 GB. The same model at Q4 uses only 0.5 bytes per weight,
+bringing it down to 3.5 GB. This is why quantization matters so much on Apple
+Silicon. It is the main way to fit larger, smarter models into your available
+memory.
+
+## Quantization
+
+Quantization is the process of compressing continuous or precise information
+into smaller, simpler values. In the context of AI and machine learning, it
+refers to reducing the precision of a model's weights, for example from 32-bit
+down to 4 or 8-bit. This reduction in memory usage allows larger models to run
+faster on devices with limited memory, like a MacBook Air.
+
+- Q4 is good for most Mac users because it only takes up 3.5GB for a 7B model, leaving plenty of room for the OS.
+- Q8 has better quality than Q4 but needs 7GB for a 7B model, so it works best if you have 16GB or more.
+- F16 is the best quality but needs 14GB for a 7B model, so only use it if you have 32GB or more.
+- Q2 is the smallest option but quality suffers noticeably, so only use it if your Mac has 8GB and nothing else fits.
+
+## How Much Can My Mac Run?
+
+Use your Mac's unified memory size to find the right model for you.
+
+**8 GB Mac**
+Your effective memory budget is around 4 GB after the OS takes its share.
+Best options: 7B Q2 or 3B Q4
+
+**16 GB Mac**
+Your effective memory budget is around 12 GB.
+Best options: 7B Q8 or 13B Q4
+
+**24 GB Mac**
+Your effective memory budget is around 20 GB.
+Best options: 13B Q8 or 30B Q4
+
+**32 GB Mac**
+Your effective memory budget is around 28 GB.
+Best options: 13B Q8 or 30B Q4 (30B Q8 and 70B Q4 need 30-35 GB)
+
+**64 GB Mac**
+Your effective memory budget is around 60 GB.
+Best options: 30B Q8 or 70B Q4 (70B Q8 needs about 70 GB)
+
+**96 GB Mac**
+Your effective memory budget is around 90 GB.
+Best options: 70B Q8 (70B F16 needs about 140 GB)
+
+## Practical llama.cpp Commands
+
+Once a model is downloaded, you can check how much memory it needs by
+loading it and generating a single token:
+
+./build/bin/llama-cli -m your-model.gguf -n 1 --verbose
+
+This will show you how much memory the model is using before you commit
+to a full run. If the number is close to your available memory, consider
+downloading a more compressed version instead.
+
+To run a model with Metal GPU acceleration on your Mac:
+
+./build/bin/llama-cli -m your-model.gguf -ngl 99 -p "Hello"
+
+The -ngl 99 flag tells llama.cpp to offload as many layers as possible
+to the GPU, which is what you want on Apple Silicon to take full advantage
+of unified memory.
+
+## Recommendations by Mac Memory Size
+
+If you have a 16GB Mac, you should be aware that the OS will occupy around
+3-4GB leaving you with 12GB. In this case, you should look for a model with
+7 billion parameters at Q8 or 13 billion at Q4. For 7B, download the Q8
+version as this will ensure you have the greatest quality and be able to
+comfortably fit within your remaining memory.
+
+If you have an 8GB Mac, the OS will take around 3-4GB leaving you with about
+4GB free. Look for models with 3 or 7 billion parameters and download the Q4
+version as anything larger will not fit comfortably.
+
+If you have a 32GB Mac, you have around 28GB available after the OS. You can
+comfortably run a 30 billion parameter model at Q4; 70B at Q4 needs about 35GB.
From 3d0fe1b2943c5b21dbcef63a411055298980bfd3 Mon Sep 17 00:00:00 2001
From: Dhruvizzle
Date: Fri, 26 Jun 2026 19:19:03 -0400
Subject: [PATCH 2/2] docs: add Apple A-chipsets model size estimation guide

---
 docs/apple-chipsets-model-size-guide.md | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/docs/apple-chipsets-model-size-guide.md b/docs/apple-chipsets-model-size-guide.md
index 706927d9fe0d..41b246b0df66 100644
--- a/docs/apple-chipsets-model-size-guide.md
+++ b/docs/apple-chipsets-model-size-guide.md
@@ -33,14 +33,14 @@ memory.
 
 Quantization is the process of compressing continuous or precise information
 into smaller, simpler values. In the context of AI and machine learning, it
-refers to reducing the precision of a model's weights, for example from 32-bit
+refers to reducing the precision of a model's weights, for example, from 32-bit
 down to 4 or 8-bit. This reduction in memory usage allows larger models to run
 faster on devices with limited memory, like a MacBook Air.
 
 - Q4 is good for most Mac users because it only takes up 3.5GB for a 7B model, leaving plenty of room for the OS.
 - Q8 has better quality than Q4 but needs 7GB for a 7B model, so it works best if you have 16GB or more.
 - F16 is the best quality but needs 14GB for a 7B model, so only use it if you have 32GB or more.
-- Q2 is the smallest option but quality suffers noticeably, so only use it if your Mac has 8GB and nothing else fits.
+- Q2 is the smallest option, but quality suffers noticeably, so only use it if your Mac has 8GB and nothing else fits.
 
 ## How Much Can My Mac Run?
 
@@ -86,18 +86,17 @@ To run a model with Metal GPU acceleration on your Mac:
 ./build/bin/llama-cli -m your-model.gguf -ngl 99 -p "Hello"
 
 The -ngl 99 flag tells llama.cpp to offload as many layers as possible
-to the GPU, which is what you want on Apple Silicon to take full advantage
+to the GPU, which is what you want on Apple Silicon, to take full advantage
 of unified memory.
 
 ## Recommendations by Mac Memory Size
 
 If you have a 16GB Mac, you should be aware that the OS will occupy around
-3-4GB leaving you with 12GB. In this case, you should look for a model with
+3-4GB, leaving you with 12GB. In this case, you should look for a model with
 7 billion parameters at Q8 or 13 billion at Q4. For 7B, download the Q8
-version as this will ensure you have the greatest quality and be able to
-comfortably fit within your remaining memory.
+version, as this will ensure you have the highest quality and can fit comfortably within your remaining memory.
 
-If you have an 8GB Mac, the OS will take around 3-4GB leaving you with about
+If you have an 8GB Mac, the OS will take around 3-4GB, leaving you with about
 4GB free. Look for models with 3 or 7 billion parameters and download the Q4
 version as anything larger will not fit comfortably.
 
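
A minimal sketch of the guide's "Model Size Formula" in Python, useful for sanity-checking a model and quantization pairing against a memory budget. The bytes-per-weight values and the per-Mac budgets are the ones quoted in the guide; the names `estimate_model_size_gb` and `fits` are hypothetical helpers, not part of llama.cpp. The estimate covers weights only, so it is best read as a rough lower bound: the context cache and any extra data a quantized file stores are not counted.

```python
# Bytes per weight as quoted in the guide (weights only, no runtime overhead).
BYTES_PER_WEIGHT = {"F16": 2.0, "Q8": 1.0, "Q4": 0.5}

# Effective memory budget in GB for each Mac memory size, as listed in the guide.
BUDGET_GB = {8: 4, 16: 12, 24: 20, 32: 28, 64: 60, 96: 90}


def estimate_model_size_gb(parameters: int, quant: str) -> float:
    """Guide formula: (parameters x bytes per weight) / 1,000,000,000."""
    return parameters * BYTES_PER_WEIGHT[quant] / 1_000_000_000


def fits(parameters: int, quant: str, mac_gb: int) -> bool:
    """True if the weight-only estimate is within the guide's budget for that Mac."""
    return estimate_model_size_gb(parameters, quant) <= BUDGET_GB[mac_gb]


if __name__ == "__main__":
    for billions in (7, 13, 30, 70):
        for quant in ("Q4", "Q8", "F16"):
            params = billions * 1_000_000_000
            size = estimate_model_size_gb(params, quant)
            # Collect every Mac size whose budget covers this estimate.
            macs = [gb for gb in BUDGET_GB if fits(params, quant, gb)]
            print(f"{billions}B {quant}: ~{size:.1f} GB, fits budgets for {macs}")
```

Because the check is weight-only, a pairing that only just fits (for example 7B at Q4 against a 4 GB budget) leaves very little room for context, so a smaller model or lower precision may be the safer pick.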