
Commit ff0dcd6
Ljubomir Josifovski committed: HF README
1 parent 6682088 commit ff0dcd6
2 files changed: 913 additions & 0 deletions

File tree: README.HF (155 additions & 0 deletions)
---
tags:
- gguf
- llama.cpp
- text-generation
- moe
- quantized
- bailing
license: apache-2.0
language:
- en
- zh
pipeline_tag: text-generation
base_model: inclusionAI/Ling-2.6-flash
base_model_relation: quantized
---

# Ling-2.6-flash GGUF

Quantized GGUF of [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash) — a 104B parameter MoE model (7.4B active) with hybrid MLA/GLA architecture.

## Files

| File | Size | Format |
|------|------|--------|
| `Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf` | ~57 GB | IQ4_NL |
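
To grab just the GGUF without cloning the repo, something like this works with the Hugging Face CLI (the repo id below is a placeholder for this model page):

```bash
# <repo-id> is a placeholder -- use the id of the repo hosting this file
huggingface-cli download <repo-id> \
  Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf --local-dir .
```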

## Running in llama.cpp

**This model requires a custom llama.cpp branch with Bailing Hybrid architecture support:**

*https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flash-r2*

While MTP works (llama-server accepts `--spec-type mtp`), at the moment it actually slows down decoding, so the speed tests below are *without* MTP. I don't know why MTP does not help; I can think of a few reasons: the MTP implementation is poor or buggy, Ling-2.6 has only 1 extra head (giving only 1 extra draft token, which may not suffice), or the quantization is detrimental.

### Build

```bash
git clone https://github.com/ljubomirj/llama.cpp.git
cd llama.cpp
git checkout LJ-Ling-2.6-flash-r2
mkdir -p build && cd build
cmake .. -DLLAMA_METAL=ON
make -j llama-cli llama-server llama-batched-bench
```

### CLI

```bash
./bin/llama-cli \
  -m Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -st -p "The capital of France is"
```

```bash
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.013 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0 (Apple M2 Max)
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 92274.69 MB

Loading model...

> The capital of France is

The capital of France is Paris.

[ Prompt: 96.1 t/s | Generation: 33.3 t/s ]

Exiting...
common_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
common_memory_breakdown_print: | - MTL0 (Apple M2 Max) | 88000 = 27848 + (59447 = 58324 + 632 + 490) + 704 |
common_memory_breakdown_print: | - Host | 653 = 345 + 0 + 308 |
ggml_metal_free: deallocating
```

### Server

```bash
./bin/llama-server \
  -m Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -c 4096 -fa -ngl 99
```
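
Once the server is up it speaks the OpenAI-compatible API, so a quick smoke test looks like this (default `127.0.0.1:8080` assumed):

```bash
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 64
      }'
```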

## Performance (MacBook Pro M2 Max, 96 GB)

- Prefill: ~250-400 tok/s
- Generation: ~30-45 tok/s

```bash
./bin/llama-batched-bench \
  -m ~/llama.cpp/models/Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -npp 512,1024,2048,4096,8192,16384,32768 -ntg 128 -npl 1 -c 36000
```

```bash
main: n_kv_max = 36096, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 8, n_threads_batch = 8

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    1.169 |   437.96 |    2.739 |    46.73 |    3.908 |   163.75 |
|  1024 |    128 |    1 |   1152 |    2.855 |   358.72 |    3.534 |    36.22 |    6.389 |   180.32 |
|  2048 |    128 |    1 |   2176 |    6.073 |   337.25 |    3.535 |    36.20 |    9.608 |   226.48 |
|  4096 |    128 |    1 |   4224 |   12.564 |   326.00 |    3.753 |    34.10 |   16.318 |   258.86 |
|  8192 |    128 |    1 |   8320 |   26.474 |   309.43 |    3.938 |    32.50 |   30.412 |   273.57 |
| 16384 |    128 |    1 |  16512 |   57.800 |   283.46 |    4.252 |    30.10 |   62.052 |   266.10 |
| 32768 |    128 |    1 |  32896 |  131.884 |   248.46 |    4.631 |    27.64 |  136.515 |   240.97 |

llama_perf_context_print: load time = 7196.80 ms
llama_perf_context_print: prompt eval time = 239042.77 ms / 65040 tokens ( 3.68 ms per token, 272.09 tokens per second)
llama_perf_context_print: eval time = 26374.75 ms / 896 runs ( 29.44 ms per token, 33.97 tokens per second)
llama_perf_context_print: total time = 272401.59 ms / 65936 tokens
llama_perf_context_print: graphs reused = 889
```

## Implementation Notes

### Reference: `bailing_hybrid.py`

The [`docs/bailing_hybrid.py`](https://github.com/ljubomirj/llama.cpp/blob/LJ-Ling-2.6-flash-r2/docs/bailing_hybrid.py) in the llama.cpp fork is the original MLX model implementation from [mlx-lm PR #1227](https://github.com/ml-explore/mlx-lm/pull/1227). It was the primary reference for porting the Bailing Hybrid architecture to llama.cpp — covering MLA attention, GLA (Gated Linear Attention) with the recurrent state kernel, MoE expert routing, and the MTP speculative decoding head.

### GLA Slope Fix

The upstream model had an [off-by-one bug in the GLA decay slope](https://huggingface.co/inclusionAI/Ling-2.6-flash/commit/7c60792051a885a3f14a75576f01f7f5cb6a08ff): `(self.layer_idx - 1)` was used instead of `self.layer_idx` in the layer-dependent decay scaling. This caused incorrect decay rates for GLA layers, with the most severe effect on layer 0 (which got a negative slope). Our llama.cpp implementation used the correct formula from the start: `layer_factor = 1.0 - il / (n_layer - 1) + 1e-5`.
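
A quick check of the two variants (layers indexed from 0, so `il = 0, ..., n_layer - 1`):

$$
f_\text{correct}(i) = 1 - \frac{i}{n_\text{layer}-1} + 10^{-5},
\qquad
f_\text{buggy}(i) = 1 - \frac{i-1}{n_\text{layer}-1} + 10^{-5}
$$

The buggy variant shifts every layer's factor up by $1/(n_\text{layer}-1)$, so layer 0 gets $f(0) \approx 1 + 1/(n_\text{layer}-1) > 1$ instead of $\approx 1.00001$; that out-of-range factor is presumably what produced the negative decay slope at layer 0 mentioned above.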

### MTP (Multi-Token Prediction)

The MTP speculative decoding head works (100% draft acceptance with greedy sampling) but provides no speedup for this model. Ling-2.6 has only 1 MTP head (`nextn_predict_layers=1`), limiting speculative decoding to 1 draft per trunk verification pass. With the MTP head on CPU, the extra draft overhead exceeds any trunk pass savings. Models with multiple MTP heads (e.g. DeepSeek-V3 with 3 heads) would benefit more.
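
If you want to check this on your own machine, a rough comparison is to start the server twice, once plainly and once with `--spec-type mtp` (the fork's flag), and read the decode speed from the `timings` field of a `/completion` response. A minimal sketch, not a rigorous benchmark:

```bash
# Baseline (no MTP), as used for the numbers above; rerun with --spec-type mtp appended to compare
./bin/llama-server -m Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf -c 4096 -fa -ngl 99 &

# timings.predicted_per_second is the decode rate for this request
curl -s http://localhost:8080/completion \
  -d '{"prompt": "The capital of France is", "n_predict": 256}' | jq .timings
```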

## Quantization Method

This GGUF quantization was developed entirely by AI coding agents reading the [`bailing_hybrid.py`](https://github.com/ml-explore/mlx-lm/pull/1227) implementation from [mlx-lm#1227](https://github.com/ml-explore/mlx-lm/pull/1227) and adapting it for llama.cpp compatibility.
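
To reproduce the file rather than download it, the usual llama.cpp conversion pipeline should apply, with this fork's converter providing the Bailing Hybrid support. A sketch with placeholder paths (not a verified recipe):

```bash
# Convert the original HF checkpoint to a high-precision GGUF (path is a placeholder)
python convert_hf_to_gguf.py /path/to/Ling-2.6-flash --outtype bf16 \
  --outfile Ling-2.6-flash-bf16.gguf

# Quantize to IQ4_NL
./bin/llama-quantize Ling-2.6-flash-bf16.gguf \
  Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf IQ4_NL
```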

Agents / LLMs used to make this run on my M2 Max:

- **Claude / GLM-5.1**
- **OpenCode / Kimi-K2.6**
- **OpenCode / DeepSeek-V4-Pro**

## Credits

- The OG [llama.cpp](https://github.com/ggml-org/llama.cpp) for making all this possible!
- Original model: [inclusionAI/Ling-2.6-flash](https://huggingface.co/inclusionAI/Ling-2.6-flash)
- The original [`bailing_hybrid.py`](https://github.com/ml-explore/mlx-lm/pull/1227) implementation from [mlx-lm#1227](https://github.com/ml-explore/mlx-lm/pull/1227)
- MLX reference implementation: [mlx-community/Ling-2.6-flash-mlx-4bit-DWQ](https://huggingface.co/mlx-community/Ling-2.6-flash-mlx-4bit-DWQ)
- Custom llama.cpp fork: [ljubomirj/llama.cpp @ LJ-Ling-2.6-flash-r2](https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flash-r2)
