---
tags:
- gguf
- llama.cpp
- text-generation
- moe
- quantized
- bailing
license: apache-2.0
language:
- en
- zh
pipeline_tag: text-generation
base_model: inclusionAI/Ling-2.6-flash
base_model_relation: quantized
---

# Ling-2.6-flash GGUF

Quantized GGUF of [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash) — a 104B parameter MoE model (7.4B active) with hybrid MLA/GLA architecture.

## Files

| File | Size | Format |
|------|------|--------|
| `Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf` | ~57 GB | IQ4_NL |
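
To fetch the file, one option is the Hugging Face CLI; this is a generic sketch, and `<this-repo-id>` is a placeholder for this repository's id:

```bash
# Download the IQ4_NL GGUF into the current directory (repo id is a placeholder)
huggingface-cli download <this-repo-id> \
  Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  --local-dir .
```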

## Running in llama.cpp

**This model requires a custom llama.cpp branch with Bailing Hybrid architecture support:**

*https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flash-r2*

While MTP works in this branch (llama-server accepts `--spec-type mtp`), it currently slows down decoding, so the speed tests below are run *without* MTP. It is not clear why MTP does not help; possible reasons: the MTP implementation is poor or buggy, Ling-2.6 has only 1 extra head (so only 1 draft token, which may not suffice), or the quantization is detrimental.

### Build

```bash
git clone https://github.com/ljubomirj/llama.cpp.git
cd llama.cpp
git checkout LJ-Ling-2.6-flash-r2
mkdir -p build && cd build
cmake .. -DLLAMA_METAL=ON
make -j llama-cli llama-server llama-batched-bench
```
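
A quick sanity check that the binaries built correctly (assuming the default `build/bin` output directory used by the commands above):

```bash
# Print version/build info; a clean exit means the binary loads and runs
./bin/llama-cli --version
```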

### CLI

```bash
./bin/llama-cli \
  -m Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -st -p "The capital of France is"
```

```bash
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.013 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0 (Apple M2 Max)
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 92274.69 MB

Loading model...

> The capital of France is

The capital of France is Paris.

[ Prompt: 96.1 t/s | Generation: 33.3 t/s ]

Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
common_memory_breakdown_print: | - MTL0 (Apple M2 Max) | 88000 = 27848 + (59447 = 58324 + 632 + 490) + 704 |
common_memory_breakdown_print: | - Host | 653 = 345 + 0 + 308 |
ggml_metal_free: deallocating
```

### Server

```bash
./bin/llama-server \
  -m Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -c 4096 -fa -ngl 99
```
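
Once the server is up you can hit the OpenAI-compatible API that llama-server exposes (default port 8080 assumed here):

```bash
# Minimal chat-completion request against the local llama-server instance
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "The capital of France is"}],
    "max_tokens": 32
  }'
```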

## Performance (MacBook Pro M2 Max, 96 GB)

- Prefill: ~250-400 tok/s
- Generation: ~30-45 tok/s

```bash
./bin/llama-batched-bench \
  -m ~/llama.cpp/models/Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -npp 512,1024,2048,4096,8192,16384,32768 \
  -ntg 128 -npl 1 -c 36000
```

```bash
main: n_kv_max = 36096, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 8, n_threads_batch = 8

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    1.169 |   437.96 |    2.739 |    46.73 |    3.908 |   163.75 |
|  1024 |    128 |    1 |   1152 |    2.855 |   358.72 |    3.534 |    36.22 |    6.389 |   180.32 |
|  2048 |    128 |    1 |   2176 |    6.073 |   337.25 |    3.535 |    36.20 |    9.608 |   226.48 |
|  4096 |    128 |    1 |   4224 |   12.564 |   326.00 |    3.753 |    34.10 |   16.318 |   258.86 |
|  8192 |    128 |    1 |   8320 |   26.474 |   309.43 |    3.938 |    32.50 |   30.412 |   273.57 |
| 16384 |    128 |    1 |  16512 |   57.800 |   283.46 |    4.252 |    30.10 |   62.052 |   266.10 |
| 32768 |    128 |    1 |  32896 |  131.884 |   248.46 |    4.631 |    27.64 |  136.515 |   240.97 |

llama_perf_context_print: load time = 7196.80 ms
llama_perf_context_print: prompt eval time = 239042.77 ms / 65040 tokens ( 3.68 ms per token, 272.09 tokens per second)
llama_perf_context_print: eval time = 26374.75 ms / 896 runs ( 29.44 ms per token, 33.97 tokens per second)
llama_perf_context_print: total time = 272401.59 ms / 65936 tokens
llama_perf_context_print: graphs reused = 889
```

## Implementation Notes

### Reference: `bailing_hybrid.py`

The [`docs/bailing_hybrid.py`](https://github.com/ljubomirj/llama.cpp/blob/LJ-Ling-2.6-flash-r2/docs/bailing_hybrid.py) file in the llama.cpp fork is the original MLX model implementation from [mlx-lm PR #1227](https://github.com/ml-explore/mlx-lm/pull/1227). It was the primary reference for porting the Bailing Hybrid architecture to llama.cpp — covering MLA attention, GLA (Gated Linear Attention) with the recurrent state kernel, MoE expert routing, and the MTP speculative decoding head.

### GLA Slope Fix

The upstream model had an [off-by-one bug in the GLA decay slope](https://huggingface.co/inclusionAI/Ling-2.6-flash/commit/7c60792051a885a3f14a75576f01f7f5cb6a08ff): `(self.layer_idx - 1)` was used instead of `self.layer_idx` in the layer-dependent decay scaling. This caused incorrect decay rates for GLA layers, with the most severe effect on layer 0 (which got a negative slope). Our llama.cpp implementation used the correct formula from the start: `layer_factor = 1.0 - il / (n_layer - 1) + 1e-5`.

### MTP (Multi-Token Prediction)

The MTP speculative decoding head works (100% draft acceptance with greedy sampling) but provides no speedup for this model. Ling-2.6 has only 1 MTP head (`nextn_predict_layers=1`), limiting speculative decoding to 1 draft per trunk verification pass. With the MTP head on CPU, the extra draft overhead exceeds any trunk pass savings. Models with multiple MTP heads (e.g. DeepSeek-V3 with 3 heads) would benefit more.
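
To reproduce the comparison, the MTP path can be enabled on the server via the flag mentioned in the Running section; this is a sketch with the other options copied from the Server example (expect slower decode):

```bash
# Enable MTP speculative decoding in this fork (slower decode expected for Ling-2.6)
./bin/llama-server \
  -m Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -c 4096 -fa -ngl 99 \
  --spec-type mtp
```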

## Quantization Method

This GGUF quantization was developed entirely by AI coding agents reading the [`bailing_hybrid.py`](https://github.com/ml-explore/mlx-lm/pull/1227) implementation from [mlx-lm#1227](https://github.com/ml-explore/mlx-lm/pull/1227) and adapting it for llama.cpp compatibility.

Agents / LLMs used to make this run on my M2 Max:
- **Claude / GLM-5.1**
- **OpenCode / Kimi-K2.6**
- **OpenCode / DeepSeek-V4-Pro**

## Credits

- The OG [llama.cpp](https://github.com/ggml-org/llama.cpp) for making all this possible!
- Original model: [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash)
- The original [`bailing_hybrid.py`](https://github.com/ml-explore/mlx-lm/pull/1227) implementation from [mlx-lm#1227](https://github.com/ml-explore/mlx-lm/pull/1227)
- MLX reference implementation: [mlx-community/Ling-2.6-flash-mlx-4bit-DWQ](https://huggingface.co/mlx-community/Ling-2.6-flash-mlx-4bit-DWQ)
- Custom llama.cpp fork: [ljubomirj/llama.cpp @ LJ-Ling-2.6-flash-r2](https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flash-r2)