
Commit f270185: squash for rebase
Signed-off-by: Ceng23333 <441651826@qq.com>
1 parent: 5e61074

69 files changed: 8762 additions & 326 deletions


.gitignore

Lines changed: 4 additions & 1 deletion
```diff
@@ -30,4 +30,7 @@ __pycache__/
 
 *.http
 
-*.nsys-rep
+**/*.nsys-rep
+**/*.jsonl
+*.jsonl
+**/*.mem
```

MINICPM_SALA_BUILD_AND_CHANGES.md

Lines changed: 244 additions & 0 deletions
# MiniCPM-SALA on InfiniLM: Build Guide and Change Summary

This document describes the changes in **InfiniCore** and **InfiniLM** from their baseline commits to support MiniCPM-SALA with InfLLM-v2, the **prerequisites**, and a **step-by-step build and run guide**. With these changes, `InfiniLM/examples/jiuge.py` produces **reasonable MiniCPM-SALA generation output** when run with the correct environment.

**Baseline commits (for reference):**

- **InfiniLM:** `main`
- **InfiniCore:** `5fc85c8b1e6728839993f1b743a525a066da585f`

To see the exact diff from baseline:
`git diff 5fc85c8b1e6728839993f1b743a525a066da585f -- InfiniCore` and
`git diff main -- InfiniLM`.

---

## 1. Changes in InfiniCore (from `5fc85c8b1e6728839993f1b743a525a066da585f`)

InfiniCore was extended to **wire in InfLLM-v2** (Stage-2 sparse attention) so that, when built with `--infllmv2=y`, the C++ API calls `mha_varlen_fwd` and `mha_fwd_kvcache` from the `infllmv2_cuda_impl` shared object.

### 1.1 New or modified files (summary)

| Area | Path | Purpose |
|------|------|---------|
| API (decl) | `include/infinicore/ops/infllmv2_api.hpp` | Declares `mha_varlen_fwd`, `mha_fwd_kvcache` (must be provided by the infllmv2 .so at link/runtime). |
| API (decl) | `include/infinicore/ops/infllmv2_attention.hpp` | Public op header for infllmv2 attention. |
| Ops impl | `src/infinicore/ops/infllmv2_attention/infllmv2_attention.cc` | Implements `infllmv2_varlen` and `infllmv2_kvcache` by calling the above APIs when `ENABLE_INFLLMV2` and `ENABLE_ATEN` are set. |
| Pybind | `src/infinicore/pybind11/ops/infllmv2_attention.hpp` | Exposes infllmv2 ops to Python. |
| Pybind | `src/infinicore/pybind11/ops.hpp` | Includes the infllmv2 op bindings. |
| Python | `python/infinicore/ops/infllmv2_attention.py` | Python wrapper for `infllmv2_varlen` / `infllmv2_kvcache`. |
| Python | `python/infinicore/__init__.py` | Exports `infllmv2_varlen`, `infllmv2_kvcache`. |
| Build | `xmake.lua` | New option `--infllmv2=y`; when set with `--aten=y`, defines `ENABLE_INFLLMV2` and links/rpaths to the auto-detected .so. |
| Test | `test/infinicore/ops/test_infllmv2_attention.py` | Unit tests for infllmv2 varlen/kvcache (skipped if not built or no CUDA). |
| Example | `examples/infllmv2_sanity.py` | Sanity script for InfLLM-v2 (skips if the .so is absent or no CUDA). |

### 1.2 Build option

- **Option:** `infllmv2` (enable InfLLM-v2; xmake auto-detects `infllm_v2/*.so` under `InfiniCore/third_party/infllmv2_cuda_impl/build/...`).
- **Requires:** `aten=y` (InfiniCore must be built with PyTorch/ATen).
- **Effect:** Defines `ENABLE_INFLLMV2` and adds link and rpath entries for the auto-detected infllmv2 .so. At runtime, `libinfinicore_cpp_api.so` resolves `mha_varlen_fwd` / `mha_fwd_kvcache` from that .so (via `LD_LIBRARY_PATH` or `LD_PRELOAD`).
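
A quick way to confirm that the .so really exports these entry points (before wiring up `LD_PRELOAD`) is to scan its dynamic symbol table. A minimal sketch, not part of the repo, assuming `nm` from binutils is on `PATH`; the substring match also catches C++-mangled names:

```python
import subprocess

def has_symbol(nm_output: str, name: str) -> bool:
    """True if any defined text symbol in `nm -D` output contains `name`
    (substring match, so C++-mangled names such as _Z..mha_varlen_fwd.. hit too)."""
    return any(name in line and " T " in line
               for line in nm_output.splitlines())

def so_exports(so_path: str, name: str) -> bool:
    """Run `nm -D` on a shared object and check for a defined symbol."""
    out = subprocess.run(["nm", "-D", so_path],
                         capture_output=True, text=True, check=True).stdout
    return has_symbol(out, name)

# Example (path is a placeholder for your INFLLMV2_SO_DIR build output):
# so_exports(".../infllm_v2/C.cpython-312-x86_64-linux-gnu.so", "mha_varlen_fwd")
```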

---

## 2. Changes in InfiniLM (from `main`)

InfiniLM was extended to support the **MiniCPM-SALA** model (embedding, layers, attention, MLP, LM head) and to use InfiniCore (including InfLLM-v2 when available) for inference.

### 2.1 New or modified files (summary)

| Area | Path | Purpose |
|------|------|---------|
| C++ model | `csrc/models/minicpm_sala/*.cpp`, `*.hpp` | MiniCPM-SALA model: `minicpm_sala_attention`, `minicpm_sala_decoder_layer`, `minicpm_sala_model`, `minicpm_sala_for_causal_lm`, `minicpm_sala_mlp`. Per-layer dense KV cache; lightning (GLA) and optional InfLLM-v2 (minicpm4) attention paths. |
| C++ factory | `csrc/models/model_factory.cpp` | Registers the MiniCPM-SALA model type. |
| Config | `python/infinilm/auto_config.py` | MiniCPM-SALA config handling. |
| Weights | `python/infinilm/modeling_utils.py` | MiniCPM-SALA weight loading (MuP scaling, etc.). |
| Examples | `examples/jiuge.py` | Generic InferEngine generation script; docstring updated with the env (PYTHONPATH, LD_LIBRARY_PATH, LD_PRELOAD) for MiniCPM-SALA. |
| Examples | `examples/minicpm_sala_logits_sanity.py` | HF vs InfiniLM logits sanity check (prefill/decode1/decodeN); single-token decode for a correct KV cache; one-prompt output comparison. |
| Examples | `examples/modeling_minicpm_sala.py` | HF-side MiniCPM-SALA modeling (reference). |
| Docs | `MiniCPM_SALA_alignment_progress.md` | Alignment and debugging notes. |

### 2.2 Behaviour notes

- **Attention:** Layer 0 (minicpm4) can use compiled InfLLM-v2 when InfiniCore is built with `--infllmv2=y` and the .so is preloaded; the other layers use the lightning (GLA) path.
- **Attention overhead optimizations:** In `minicpm_sala_attention.cpp`: (1) sequence lengths are read in one place when both `past_sequence_lengths` and `total_sequence_lengths` are present (`has_cache_meta`), avoiding duplicate logic; (2) Q/K/V use a single `contiguous()->view` chain after the projections; (3) the lightning path builds `q_bthd` via one `permute->contiguous` from `q_perm`; (4) the sparse path uses `q_perm` directly (already contiguous) and only calls `contiguous()` on K/V when repeating heads. Semantics and logits are unchanged.
- **KV cache:** Decode must use **single-token input** per step; passing the full sequence each step would misalign the per-layer KV cache (see the sanity script).
- **Engine / KV cache config:** MiniCPM-SALA uses a per-layer dense KV cache in C++; the engine's `cache_config` is used only for scheduling (e.g. `past_sequence_lengths` / `total_sequence_lengths`). **Static cache** is recommended (the default in `jiuge.py` when not passing `--enable-paged-attn`). For static cache, `jiuge.py` sets `max_cache_len = max(initial_capacity, max_position_embeddings)` when `model_type == "minicpm_sala"` so long contexts are supported without reallocation.
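
The decode constraint can be illustrated with a toy append-only cache; this is an illustration of the invariant, not the InfiniLM API (`ToyKVCache`, `generate`, and `static_cache_len` are hypothetical names). The cache must hold exactly one entry per processed position, which only holds when decode feeds one new token per step; the sizing helper mirrors the `max_cache_len` formula above:

```python
class ToyKVCache:
    """Append-only per-layer cache: one slot per processed position."""
    def __init__(self):
        self.keys = []

    def append(self, tokens):
        self.keys.extend(tokens)

def generate(prompt, steps):
    """Prefill with the whole prompt, then decode one token per step."""
    cache = ToyKVCache()
    cache.append(prompt)            # prefill: all prompt positions at once
    last = prompt[-1]
    out = []
    for _ in range(steps):
        cache.append([last])        # decode: exactly ONE new token per step
        last = last + 1             # stand-in for sampling the next token
        out.append(last)
    return cache, out               # len(cache.keys) == len(prompt) + steps

def static_cache_len(initial_capacity, max_position_embeddings):
    """Static-cache sizing rule used by jiuge.py for minicpm_sala."""
    return max(initial_capacity, max_position_embeddings)
```

Feeding the full sequence every step would append already-cached positions again, so the cache length would no longer equal the number of processed positions.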

---

## 3. Prerequisites

### 3.1 System and toolchain

- **OS:** Linux.
- **Python:** 3.12 recommended (must match the infllmv2 .so and the InfiniCore pybind ABI).
- **CUDA:** 11.6+ (e.g. 12.x); `nvcc` on `PATH` (e.g. via `CUDA_HOME=/usr/local/cuda` and `PATH=$CUDA_HOME/bin:$PATH`).
- **C++:** GCC (e.g. `CC=gcc CXX=g++`) for infllmv2_cuda_impl and InfiniCore.
- **xmake:** For building InfiniCore (install from https://xmake.io or use a project-provided path).
- **PyTorch:** Installed in the same Python env used to build infllmv2 and to run InfiniLM (InfiniCore with `aten=y` links against this PyTorch's libs).
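
The checks above can be scripted as a preflight before starting a long build; a minimal sketch (`missing_tools` / `preflight` are hypothetical helpers; the tool list comes from this section):

```python
import os
import shutil

def missing_tools(tools):
    """Return the subset of `tools` not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

def preflight():
    """Collect toolchain problems; empty list means ready to build."""
    problems = missing_tools(["gcc", "g++", "nvcc", "xmake"])
    cuda_home = os.environ.get("CUDA_HOME")
    if not cuda_home or not os.path.isdir(cuda_home):
        problems.append("CUDA_HOME not set or not a directory")
    return problems

# print(preflight())  # expect [] on a correctly configured machine
```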

### 3.2 Python environment

Use a **single venv** (or env) that has:

- `torch`
- `transformers`
- `triton` (e.g. 3.2.0, used by the MiniCPM-SALA HF path; on CUDA 12.8 a small patch to Triton's `ptx_get_version` may be needed, or use a Triton version that supports 12.8)
- `flash-linear-attention` (or the HF deps for MiniCPM-SALA)
- other InfiniLM/InfiniCore runtime deps

Build **infllmv2_cuda_impl** and **InfiniCore** with this same Python (and thus the same PyTorch ABI).

### 3.3 Repo layout

- **minicpm-sala-support** (repo root) contains:
  - **InfiniCore/** — InfiniCore with InfLLM-v2 wiring.
  - **InfiniLM/** — InfiniLM with MiniCPM-SALA.
  - **InfiniCore/third_party/infllmv2_cuda_impl/** — InfLLM-v2 CUDA kernel implementation (provides `mha_varlen_fwd`, `mha_fwd_kvcache`).

---

## 4. Build Guide

### 4.1 Build InfLLM-v2 (infllmv2_cuda_impl)

This produces the `.so` that provides `mha_varlen_fwd` and `mha_fwd_kvcache`. InfiniCore must be built in a PyTorch/ABI-compatible env (the same Python/torch as used here).

1. **From repo root:**
   ```bash
   cd InfiniCore/third_party/infllmv2_cuda_impl
   ```
2. **Submodules:**
   ```bash
   git submodule update --init --recursive
   ```
3. **Env (recommended):**
   ```bash
   export CC=gcc CXX=g++
   export CUDA_HOME=/usr/local/cuda   # or your CUDA path
   export PATH=$CUDA_HOME/bin:$PATH
   ```
4. **Build/install** (use the Python that has torch and that you will use for InfiniLM):
   ```bash
   python setup.py install
   ```
   Or: `pip install -e .`
5. **Locate the .so:**
   Typically under `build/lib.linux-x86_64-cpython-312/infllm_v2/` (named like `C.cpython-312-x86_64-linux-gnu.so`). Set:
   ```bash
   INFLLMV2_SO_DIR="<repo>/InfiniCore/third_party/infllmv2_cuda_impl/build/lib.linux-x86_64-cpython-312/infllm_v2"
   ```
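
Since the `build/lib.linux-*` directory name depends on platform and Python version, the directory can also be located programmatically instead of hard-coding it; a sketch assuming the typical setuptools layout described in step 5 (`find_infllmv2_so` is a hypothetical helper):

```python
import glob
import os

def find_infllmv2_so(repo_root):
    """Find the infllm_v2 extension .so under the setuptools build tree.
    Returns the directory to use as INFLLMV2_SO_DIR, or None if not built."""
    pattern = os.path.join(
        repo_root, "InfiniCore", "third_party", "infllmv2_cuda_impl",
        "build", "lib.*", "infllm_v2", "*.so")
    hits = sorted(glob.glob(pattern))
    return os.path.dirname(hits[0]) if hits else None

# Usage: INFLLMV2_SO_DIR = find_infllmv2_so("/path/to/minicpm-sala-support")
```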

### 4.2 Build InfiniCore (with InfLLM-v2)

InfiniCore must be built with **aten** and, for MiniCPM-SALA with InfLLM-v2, with **infllmv2=y** enabled (xmake auto-detects the .so).

1. **Install Infini dependencies** (if not already done):
   Build and install the Infini libs so they are under `$INFINI_ROOT` (default `~/.infini`). InfiniCore's xmake expects `include/` and `lib/` there (e.g. `libinfinicore_cpp_api.so`, `libinfiniop.so`, etc.).
2. **From repo root:**
   ```bash
   cd InfiniCore
   ```
3. **Configure** (use the same Python/torch as for infllmv2):
   ```bash
   xmake config -y --root --nv-gpu=y --aten=y --infllmv2=y
   ```
   Omit `--infllmv2=y` for a build without InfLLM-v2 (the MiniCPM-SALA layer-0 infllmv2 path is then unavailable).
4. **Build the Python extension:**
   ```bash
   xmake --root _infinicore
   ```
5. **Optional: install to `~/.infini`:**
   ```bash
   xmake install
   ```
   The loadable Python module is also copied under `InfiniCore/python/infinicore/lib/` by the build.

### 4.3 Run jiuge.py (MiniCPM-SALA)

Use the **same venv** that has `torch`, `transformers`, etc., and set the environment so that InfiniCore and the infllmv2 .so are found and their symbols resolve.

**Required:**

- `PYTHONPATH`: the InfiniLM and InfiniCore Python packages.
- `LD_LIBRARY_PATH`: the torch lib dir, the Infini lib dir (`/root/.infini/lib` or your `$INFINI_ROOT/lib`), and optionally `INFLLMV2_SO_DIR` (if not using `LD_PRELOAD`).
- If InfiniCore was built with InfLLM-v2: **`LD_PRELOAD`** the infllmv2 .so so that `libinfinicore_cpp_api.so` resolves `mha_varlen_fwd` (and `mha_fwd_kvcache`).

**Example (from repo root):**

```bash
INFLLMV2_SO_DIR="$(pwd)/InfiniCore/third_party/infllmv2_cuda_impl/build/lib.linux-x86_64-cpython-312/infllm_v2"

PYTHONPATH="$(pwd)/InfiniLM/python:$(pwd)/InfiniCore/python:$PYTHONPATH" \
LD_LIBRARY_PATH="$(python -c 'import torch; print(torch.__path__[0])')/lib:/root/.infini/lib:${INFLLMV2_SO_DIR}:$LD_LIBRARY_PATH" \
LD_PRELOAD="${INFLLMV2_SO_DIR}/C.cpython-312-x86_64-linux-gnu.so" \
python InfiniLM/examples/jiuge.py --nvidia --model_path /root/.cache/modelscope/hub/models/OpenBMB/MiniCPM-SALA
```

Use the **venv** Python explicitly if needed, e.g.:

```bash
/path/to/venv/bin/python InfiniLM/examples/jiuge.py ...
```

For Triton (HF path) on CUDA 12.8 you may need:

```bash
TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
```

---
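
The same environment can also be composed from Python, e.g. to drive `jiuge.py` from a wrapper script; a sketch under the same path assumptions as the example above (`jiuge_env` is a hypothetical helper):

```python
import os
import subprocess

def jiuge_env(repo_root, torch_lib, infini_lib, infllmv2_so_dir, so_name):
    """Compose PYTHONPATH / LD_LIBRARY_PATH / LD_PRELOAD as in the
    shell example above, layered on the current environment."""
    env = dict(os.environ)
    env["PYTHONPATH"] = os.pathsep.join(
        [f"{repo_root}/InfiniLM/python", f"{repo_root}/InfiniCore/python",
         env.get("PYTHONPATH", "")])
    env["LD_LIBRARY_PATH"] = os.pathsep.join(
        [torch_lib, infini_lib, infllmv2_so_dir,
         env.get("LD_LIBRARY_PATH", "")])
    env["LD_PRELOAD"] = os.path.join(infllmv2_so_dir, so_name)
    return env

# subprocess.run(["python", "InfiniLM/examples/jiuge.py", "--nvidia",
#                 "--model_path", "<model>"], env=jiuge_env(...), check=True)
```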

## 5. Verification

- **InfiniCore InfLLM-v2 ops:**
  `PYTHONPATH=InfiniCore/python:InfiniCore/test/infinicore LD_LIBRARY_PATH=<torch_lib>:${INFLLMV2_SO_DIR}:/root/.infini/lib LD_PRELOAD=${INFLLMV2_SO_DIR}/C.cpython-312-x86_64-linux-gnu.so python InfiniCore/test/infinicore/ops/test_infllmv2_attention.py --nvidia`

- **HF vs InfiniLM logits (one-prompt decode):**
  Same env + `LD_PRELOAD` and (if needed) `TRITON_PTXAS_PATH`:
  `python InfiniLM/examples/minicpm_sala_logits_sanity.py --model_path <path> --mode decodeN --decode_steps 64`

- **Generation:**
  `jiuge.py` with the same env should produce **reasonable MiniCPM-SALA output** (e.g. for the prompt "How are you").

---

## 6. Related docs

- **CURRENT_PROGRESS.md** — Local progress, InfLLM-v2 plan, and run commands.
- **InfiniLM/MiniCPM_SALA_alignment_progress.md** — Alignment and debugging details.
- **InfiniCore/third_party/infllmv2_cuda_impl/README.md** — InfLLM-v2 kernel design and install.
- **InfiniLM/examples/jiuge.py** — Docstring at the top with an env summary.

---

## 7. TODO

- **Remove temporary log and dump code** — Strip or gate debug logging, `INFINI_DEBUG_*`, and temporary dump paths (e.g. `/tmp/` tensor dumps, `dump_tensor_to_bin_if_enabled`, `log_tensor_stats_if_enabled`) from InfiniLM/InfiniCore once alignment and bring-up are stable.
- **Adapt inference_server.py** — Wire MiniCPM-SALA (and the InfiniLM InferEngine) into the inference server (e.g. `inference_server.py` or its equivalent in the workspace) so that the server can load and serve MiniCPM-SALA with the same env (PYTHONPATH, LD_LIBRARY_PATH, LD_PRELOAD) and run generation endpoints.

### 7.1 Debug and sanity env and code (for future removal)

When removing the temporary log and dump code, use this as the reference for **env parsing** and **locations to erase or gate**.

**Environment variables (debug / sanity):**

| Env var | Parsing / behavior | Purpose |
|---------|--------------------|---------|
| `INFINI_DEBUG_LOG` | Set to a file path (e.g. `/tmp/minicpm_sala_sanity_debug.log`). When set, C++ and Python append JSON/text lines to this file. | Text log for alignment debugging. |
| `INFINI_DEBUG_ATTN_DUMP` | Presence = enabled (e.g. `"1"` or any value). When set, tensors are written to the fixed `/tmp/` paths below. | Enable binary tensor dumps and per-layer stats. |
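
The two parsing rules can be mirrored in a small reference helper while erasing (illustrative only; the real code reads the variables directly via `std::getenv` / `os.environ`):

```python
import os

def debug_config(environ=os.environ):
    """Mirror the parsing rules in the table above:
    INFINI_DEBUG_LOG       -> file path if set (text log target), else None
    INFINI_DEBUG_ATTN_DUMP -> enabled by mere presence, any value"""
    return {
        "log_path": environ.get("INFINI_DEBUG_LOG") or None,
        "attn_dump": "INFINI_DEBUG_ATTN_DUMP" in environ,
    }
```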

**Where they are read:**

- **InfiniLM C++:** `std::getenv("INFINI_DEBUG_LOG")`, `std::getenv("INFINI_DEBUG_ATTN_DUMP")` in:
  - `InfiniLM/csrc/models/minicpm_sala/minicpm_sala_attention.cpp` (`dump_tensor_f32`; layer q/k/v/g_gamma and attn out dumps)
  - `InfiniLM/csrc/models/minicpm_sala/minicpm_sala_decoder_layer.cpp` (`log_tensor_stats_if_enabled`, `tensor_to_f32_and_dump`; layer input/out dumps)
  - `InfiniLM/csrc/models/minicpm_sala/minicpm_sala_model.cpp` (`dump_tensor_to_bin_if_enabled`, `log_tensor_stats_if_enabled`; embed and final hidden dumps)
- **InfiniLM Python (sanity script):** `os.environ["INFINI_DEBUG_LOG"]`, `os.environ["INFINI_DEBUG_ATTN_DUMP"]` set in `InfiniLM/examples/minicpm_sala_logits_sanity.py` before runs; `os.getenv("INFINI_DEBUG_*")` in `InfiniLM/examples/modeling_minicpm_sala.py` (HF-side hooks that write `/tmp/hf_*.pt` and log to `INFINI_DEBUG_LOG`).

**Temporary paths to remove or stop writing:**

- **C++ dumps (binary):** `/tmp/inf_embed_out.bin`, `/tmp/inf_final_hidden.bin`, `/tmp/inf_layer0_q.bin`, `/tmp/inf_layer0_k.bin`, `/tmp/inf_layer0_v.bin`, `/tmp/inf_layer0_g_gamma.bin`, `/tmp/inf_layer1_q.bin`, `/tmp/inf_layer1_k.bin`, `/tmp/inf_layer1_v.bin`, `/tmp/inf_layer1_g_gamma.bin`, `/tmp/inf_layer0_attn_input.bin`, `/tmp/inf_attn_out_layer0.bin`, `/tmp/inf_attn_out_layer1.bin`, `/tmp/inf_layer_out_<N>.bin`.
- **Python (sanity) writes:** `DEBUG_LOG_PATH` (e.g. `/tmp/minicpm_sala_sanity_debug.log`); `/tmp/hf_embed_out.pt`, `/tmp/hf_final_hidden.pt`, `/tmp/hf_layer0_attn_input.pt`, `/tmp/hf_layer_out_<idx>.pt`, `/tmp/hf_layer0_q.pt`, `/tmp/hf_layer0_k.pt`, `/tmp/hf_layer0_v.pt`, `/tmp/hf_attn_out_layer0.pt`, `/tmp/hf_layer1_q.pt`, `/tmp/hf_layer1_k.pt`, `/tmp/hf_layer1_v.pt`, `/tmp/hf_attn_out_layer1.pt`.
- **Helpers to remove or gate:** `dump_tensor_f32`, `dump_tensor_to_bin_if_enabled`, `log_tensor_stats_if_enabled`, `tensor_to_f32_and_dump`; the sanity script's `_append_debug_log`, and all `torch.save(..., "/tmp/...")` / `np.fromfile("/tmp/...")` / `os.path.isfile("/tmp/...")` blocks that exist only for alignment comparison.
