Commit f2bb3cb

Merge branch 'main' of github.com:SlugLab/CXLMemSim
2 parents 004b638 + f6958fc commit f2bb3cb

17 files changed

Lines changed: 2139 additions & 32 deletions

README.md

Lines changed: 21 additions & 1 deletion
@@ -248,7 +248,7 @@ The command protocol includes:
 - kernel launch,
 - stream and event operations,
 - bulk transfer commands,
-- cache flush, invalidate, and writeback commands,
+- cache flush, invalidate, writeback, and prefetch commands,
 - Type 2 to Type 3 peer-to-peer DMA discovery and transfer commands,
 - coherent shared-memory pool commands,
 - host/device bias commands,
@@ -707,6 +707,26 @@ For an existing CUDA Driver API program:
 LD_PRELOAD=./libcuda.so.1 ./your_cuda_program
 ```
 
+For `../llama.cpp-cxl`, build with ggml-cxl enabled and place only the KV cache on
+the CXL Type 2 device:
+
+```bash
+cmake -S ../llama.cpp-cxl -B ../llama.cpp-cxl/build-cxl -DGGML_CXL=ON
+cmake --build ../llama.cpp-cxl/build-cxl -j --target llama-cli
+
+../llama.cpp-cxl/build-cxl/bin/llama-cli \
+  -m /path/to/model.gguf \
+  --cxl-kv \
+  --cxl-kv-device CXL0 \
+  --cxl-kv-prefetch \
+  -p "prompt"
+```
+
+`--cxl-kv` sets the internal `LLAMA_CXL_KV=1` path, allocates K/V cache tensors
+with the ggml-cxl buffer type, initializes the selected CXL backend for KV-only
+offload, and issues Type 2 cache-prefetch commands for K/V attention views unless
+`--no-cxl-kv-prefetch` is used.
+
 Build all guest-side tests:
 
 ```bash

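The README.md hunk above says that `--cxl-kv` issues prefetch commands unless `--no-cxl-kv-prefetch` is passed, so a prefetch on/off comparison should only need one flag to change. A minimal sketch of a wrapper that assembles the KV-offload flags follows. The function name `cxl_kv_args` and its `on`/`off` argument are hypothetical; only flags that appear in the diff are emitted.

```bash
# Hypothetical helper (not part of the commit): print the llama-cli
# KV-offload flags, one per line.
#   $1 = CXL device name, e.g. CXL0
#   $2 = "on" (default) or "off" to toggle KV prefetch
cxl_kv_args() {
    local device="$1" prefetch="${2:-on}"
    if [ -z "$device" ]; then
        echo "cxl_kv_args: device name required" >&2
        return 1
    fi
    printf '%s\n' "--cxl-kv" "--cxl-kv-device" "$device"
    case "$prefetch" in
        on)  printf '%s\n' "--cxl-kv-prefetch" ;;
        off) printf '%s\n' "--no-cxl-kv-prefetch" ;;
        *)   echo "cxl_kv_args: prefetch must be 'on' or 'off'" >&2
             return 1 ;;
    esac
}

# Possible use (left commented out, since it needs a built llama-cli):
#   ../llama.cpp-cxl/build-cxl/bin/llama-cli -m /path/to/model.gguf \
#       $(cxl_kv_args CXL0 off) -p "prompt"
```

Keeping the flag list in one place makes it harder for two benchmark runs to differ in anything other than the prefetch setting.
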
lib/qemu

Submodule qemu updated from 96691a3 to c63f34b

qemu_integration/README.md

Lines changed: 86 additions & 0 deletions
@@ -199,6 +199,92 @@ Query counters again from the host with:
 python3 ./zettai_host_dcd_gfam_test.py --query
 ```
 
+### Zettai Type2 tmatmul and CXL.mem ioctl test
+
+The Zettai switch CCI device (`7a74:a123`) creates a guest char device such as
+`/dev/zettai_cxl0d003`. The current Linux driver ABI for this device is
+`ioctl()`, not `io_uring_cmd`; `/tmp/zettai-qmp.sock` remains a host-side QMP
+socket used for bind/add/query orchestration.
+
+Build the guest helper:
+
+```bash
+gcc -O2 -Wall -Wextra -o zettai_tmatmul_ctl zettai_tmatmul_ctl.c
+```
+
+Check whether QEMU exposed the tmatmul CSR block:
+
+```bash
+./zettai_tmatmul_ctl --dev /dev/zettai_cxl0d003 --info
+```
+
+If dmesg reports `tmatmul=0` or the tool prints `tmatmul_present=no`, QEMU only
+exposed the switch CCI BAR and tmatmul smoke runs will return `ENODEV`. CXL.mem
+read/write can still be tested by passing a real nonzero HPA base from a CXL
+region or decoder resource:
+
+```bash
+cxl list -R -u
+./zettai_tmatmul_ctl --dev /dev/zettai_cxl0d003 \
+  --mem-write --hpa-base 0xYOUR_REGION_RESOURCE --hpa-size 0x10000000 \
+  --offset 0 --size 4096 --pattern 0x5a
+./zettai_tmatmul_ctl --dev /dev/zettai_cxl0d003 \
+  --mem-read --hpa-base 0xYOUR_REGION_RESOURCE --hpa-size 0x10000000 \
+  --offset 0 --size 64
+```
+
+Once the QEMU Zettai device exposes a BAR large enough for the tmatmul CSR window
+at `BAR0 + 0x1c0000`, run:
+
+```bash
+./zettai_tmatmul_ctl --dev /dev/zettai_cxl0d003 \
+  --smoke --hpa-base 0xYOUR_REGION_RESOURCE --hpa-size 0x10000000
+```
+
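The `--mem-write` and `--mem-read` invocations above take four numbers: an HPA base, a window size (`0x10000000`, i.e. 256 MiB), an offset, and a length. A pre-flight check along the following lines could catch a leftover `0xYOUR_REGION_RESOURCE` placeholder or an access that runs past the window before the ioctl is issued. This is a sketch only: `is_uint` and `hpa_window_ok` are hypothetical helpers, and the rule that `offset + size` must fit inside `--hpa-size` is an assumption about the tool, not something the diff states.

```bash
# Hypothetical helpers (not part of zettai_tmatmul_ctl).

# Accept plain decimal or 0x-prefixed hex; reject anything else.
is_uint() {
    case $1 in
        0[xX]*)
            case ${1#0[xX]} in
                ''|*[!0-9a-fA-F]*) return 1 ;;
            esac ;;
        0) ;;
        ''|0*|*[!0-9]*) return 1 ;;
    esac
    return 0
}

# hpa_window_ok BASE WINDOW OFFSET SIZE
# Returns 0 when BASE is nonzero and [OFFSET, OFFSET+SIZE) lies inside WINDOW.
# Values are evaluated with signed 64-bit shell arithmetic.
hpa_window_ok() {
    local arg base win off len
    [ "$#" -eq 4 ] || { echo "usage: hpa_window_ok BASE WINDOW OFFSET SIZE" >&2; return 1; }
    for arg in "$@"; do
        is_uint "$arg" || { echo "not a number: $arg" >&2; return 1; }
    done
    base=$(( $1 )); win=$(( $2 )); off=$(( $3 )); len=$(( $4 ))
    [ "$base" -ne 0 ] || { echo "HPA base must be nonzero" >&2; return 1; }
    [ "$len" -gt 0 ]  || { echo "size must be positive" >&2; return 1; }
    [ "$off" -le "$win" ] && [ "$len" -le $(( win - off )) ] || {
        echo "offset + size exceeds the HPA window" >&2
        return 1
    }
}
```

A guard such as `hpa_window_ok "$BASE" 0x10000000 0 4096 && ./zettai_tmatmul_ctl ...` would then run the helper only for arguments that pass the check.
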
+### Zettai benchmark harness
+
+For a repeatable host-side smoke benchmark, use:
+
+```bash
+QEMU_NET_MODE=none \
+KERNEL_IMAGE=/path/to/bzImage \
+DISK_IMAGE=/path/to/rootfs.img \
+./zettai_benchmark.sh --launch --keep-qemu
+```
+
+The harness launches QEMU with a QMP socket, binds `cxl-dcd0`, adds a 256 MiB
+DCD extent, queries CXLMemSim DCD/GFAM counters, and writes logs under
+`build/zettai-bench/`. If QEMU is already running, omit `--launch` and keep the
+same `ZETTAI_QMP_SOCKET` value used by `QEMU_EXTRA_ARGS`.
+
+To include the in-guest DCD region setup and Type2 fabric-memory BAR benchmark,
+provide SSH access to the guest:
+
+```bash
+ZETTAI_GUEST_SSH="ssh root@192.168.122.10" \
+ZETTAI_GUEST_DIR=/root/CXLMemSim/qemu_integration \
+./zettai_benchmark.sh --guest --run-type2-bench
+```
+
+The Type2 benchmark is `guest_libcuda/cxl_bar_benchmark.c`. It discovers the
+`cxl-type2` endpoint (`8086:0d92`), reports BAR register and data-region
+latency/bandwidth, then exercises the Zettai fabric-memory controls exposed by
+QEMU: `DCD_GET_INFO`, optional DCD add/release when free capacity exists,
+`GFAM_GET_INFO`, and `MHSLD_GET_INFO/SET_HEAD`.
+
+For a bounded CXL.cache command-path check, build the optional static binary
+and run only the prefetch section:
+
+```bash
+make -C guest_libcuda static
+sudo ./guest_libcuda/cxl_bar_benchmark.static \
+  --prefetch-only --prefetch-iters 5
+```
+
+This mode is useful when the guest is reached through a serial shell because it
+avoids the full BAR bandwidth suite while still exercising read- and
+write-intent `CACHE_PREFETCH`.
+
 ## Features
 
 - **Cacheline-granular access**: All memory operations are performed at 64-byte cacheline granularity
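
The hunk above names two PCI IDs: the Zettai switch CCI device (`7a74:a123`) and the `cxl-type2` endpoint (`8086:0d92`). Before running either tool, it can help to confirm from inside the guest that QEMU actually exposed them. The sketch below does this by reading the standard `vendor` and `device` attributes under `/sys/bus/pci/devices`. The function name `find_pci_by_id` and its optional third argument (an alternate sysfs root, handy for dry runs) are hypothetical, and this is not necessarily how `cxl_bar_benchmark.c` performs its own discovery.

```bash
# Hypothetical helper: print the PCI address (e.g. 0000:0d:00.0) of every
# device whose vendor and device IDs match.
#   $1 = vendor ID, lowercase hex without 0x (e.g. 8086)
#   $2 = device ID, lowercase hex without 0x (e.g. 0d92)
#   $3 = sysfs device directory (default: /sys/bus/pci/devices)
find_pci_by_id() {
    local want_vendor="0x$1" want_device="0x$2"
    local root="${3:-/sys/bus/pci/devices}"
    local dev vendor device
    for dev in "$root"/*; do
        # Skip entries without readable ID attributes (or an unmatched glob).
        [ -r "$dev/vendor" ] && [ -r "$dev/device" ] || continue
        vendor=$(cat "$dev/vendor")
        device=$(cat "$dev/device")
        if [ "$vendor" = "$want_vendor" ] && [ "$device" = "$want_device" ]; then
            basename "$dev"
        fi
    done
}

# Possible use inside the guest:
#   find_pci_by_id 8086 0d92    # cxl-type2 endpoint
#   find_pci_by_id 7a74 a123    # Zettai switch CCI device
```

An empty result for `8086:0d92` would suggest that the Type2 benchmark has no endpoint to discover.
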
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+CC ?= gcc
+CFLAGS ?= -O2 -Wall -Wextra
+LDFLAGS ?=
+LDLIBS ?=
+
+CUDA_SHIM := libcuda.so.1
+CUDA_LINK := libcuda.so
+
+CUDA_TESTS := cuda_test cuda_advanced_test
+BAR_TESTS := cxl_bar_benchmark
+BAR_STATIC := cxl_bar_benchmark.static
+TARGETS := $(CUDA_SHIM) $(CUDA_LINK) $(CUDA_TESTS) $(BAR_TESTS)
+
+.PHONY: all clean static test-binaries
+
+all: $(TARGETS)
+
+test-binaries: $(CUDA_TESTS) $(BAR_TESTS)
+
+$(CUDA_SHIM): libcuda.c cxl_gpu_cmd.h
+	$(CC) $(CFLAGS) -shared -fPIC -o $@ libcuda.c -ldl $(LDFLAGS) $(LDLIBS)
+
+$(CUDA_LINK): $(CUDA_SHIM)
+	ln -sf $(CUDA_SHIM) $@
+
+cuda_test: cuda_test.c $(CUDA_LINK)
+	$(CC) $(CFLAGS) -o $@ cuda_test.c -L. -lcuda -ldl -lrt -Wl,-rpath,. $(LDFLAGS) $(LDLIBS)
+
+cuda_advanced_test: cuda_advanced_test.c $(CUDA_LINK)
+	$(CC) $(CFLAGS) -o $@ cuda_advanced_test.c -L. -lcuda -ldl -lrt -Wl,-rpath,. $(LDFLAGS) $(LDLIBS)
+
+cxl_bar_benchmark: cxl_bar_benchmark.c cxl_gpu_cmd.h
+	$(CC) $(CFLAGS) -o $@ cxl_bar_benchmark.c -lrt -lpthread $(LDFLAGS) $(LDLIBS)
+
+static: $(BAR_STATIC)
+
+$(BAR_STATIC): cxl_bar_benchmark.c cxl_gpu_cmd.h
+	$(CC) $(CFLAGS) -static -o $@ cxl_bar_benchmark.c -lrt -lpthread $(LDFLAGS) $(LDLIBS)
+
+clean:
+	rm -f $(TARGETS) $(BAR_STATIC)
