Replace cachegen kernels with more performant coder kernels by colinreyn · Pull Request #209 · LMCache/LMCache-Ascend

colinreyn · 2026-04-16T10:35:44Z

#143 wired up a series of cachegen kernels. However, they were limited by the interface imposed by LMCache, which invoked an expensive RepeatInterleave in encode. This PR replaces those kernels with more performant ones that remove the need for the work done in collect_bytes, which was previously a bottleneck. An incisive change is injected into the path of LMCache's cachegen encode/decode to properly invoke the new kernels.

These kernels are in themselves faster (~4x) and, on the encode side, remove other expensive operations, leading to a substantial performance improvement.

Here are some representative timings showing a few key results:

Cachegen decode is comparable to a naive serde with a remote backend (model and bandwidth dependent)
Cachegen encode is slower by 30 - 50% compared to naive
The compression ratio ranges from 3.5x - 6x depending strongly on chunk size

A minimal drop in accuracy is measured against the gsm8k benchmark, although it should be noted that these results have been observed to depend on chunk size and model.

Qwen3-8B - No Cache
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8817|±  |0.0089|
|     |       |strict-match    |     5|exact_match|↑  |0.8772|±  |0.0090|

Qwen3-8B - Pure cachegen cache
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8560|±  |0.0097|
|     |       |strict-match    |     5|exact_match|↑  |0.8605|±  |0.0095|

Qwen3-30B-A3B - No Cache
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8423|±  |0.0100|
|     |       |strict-match    |     5|exact_match|↑  |0.8825|±  |0.0089|

Qwen3-30B-A3B - Pure cachegen cache
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8400|±  |0.0101|
|     |       |strict-match    |     5|exact_match|↑  |0.8431|±  |0.0100|

Qwen2.5-7B-Instruct - No Cache
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8196|±  |0.0106|
|     |       |strict-match    |     5|exact_match|↑  |0.7877|±  |0.0113|

Qwen2.5-7B-Instruct - Cachegen cache
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.6710|±  |0.0129|
|     |       |strict-match    |     5|exact_match|↑  |0.6399|±  |0.0132|
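
To put the tables above in perspective, here is a minimal sketch that compares each drop with the reported standard errors. It uses only the flexible-extract rows, and it assumes the baseline and cachegen runs have independent errors, which is a simplification since both runs score the same gsm8k questions. The helper name `drop_and_z` is made up for this illustration.

```python
import math

# (baseline, cachegen) pairs of (value, stderr) for the flexible-extract filter,
# copied from the tables above.
results = {
    "Qwen3-8B": ((0.8817, 0.0089), (0.8560, 0.0097)),
    "Qwen3-30B-A3B": ((0.8423, 0.0100), (0.8400, 0.0101)),
    "Qwen2.5-7B-Instruct": ((0.8196, 0.0106), (0.6710, 0.0129)),
}


def drop_and_z(baseline, cached):
    """Return (absolute drop, drop / combined stderr).

    Assumption: the two stderr values are independent, so they add in quadrature.
    """
    (b_val, b_err), (c_val, c_err) = baseline, cached
    drop = b_val - c_val
    combined = math.sqrt(b_err ** 2 + c_err ** 2)
    return drop, drop / combined


summary = {model: drop_and_z(*pair) for model, pair in results.items()}
for model, (drop, z) in summary.items():
    print(f"{model}: drop={drop:.4f}, drop/stderr={z:.1f}")
```

Under that assumption, the Qwen3-30B-A3B flexible-extract drop is well within one combined standard error, while the Qwen2.5-7B-Instruct drop is several combined standard errors wide, which is consistent with the note that the results depend on the model.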

gemini-code-assist

Code Review

This pull request replaces the existing cachegen implementation with a new PAC (Prepare, Encode, Decode) kernel suite, stubbing out the previous kernels and introducing new C++ and Python logic. The review feedback identifies several performance improvement opportunities, such as reducing redundant tensor clones and allocations in the encoding loop, avoiding inefficient device transfers by allocating directly on the NPU, and removing blocking synchronizations to allow for better task overlap. Additionally, a typo was identified in the naming of the metadata preparation function.

gemini-code-assist · 2026-04-16T10:37:57Z

+        local_out_buf = output_buffer.clone()
+        local_output_lengths = output_lengths.clone()
+        tmp_in = encode_input[:, start:end, :].clone()
+
+        lmc_ops.pac_encode(tmp_in, meta_data, local_out_buf, local_output_lengths)
+        max_len = local_output_lengths[-1, -1]
+
+        data_chunks.append(
+            CacheGenGPUBytestream(
+                bytestream=local_out_buf.flatten()[0:max_len],
+                bytestream_lengths=local_output_lengths,
+                ntokens=end - start,
+            )
+        )


The current loop implementation is inefficient due to multiple redundant tensor clones and allocations.

local_out_buf and local_output_lengths are cloned from the full-sized buffers on every iteration.

tmp_in is cloned unnecessarily; slicing returns a view and the C++ kernel takes a const reference.

Instead, reuse the pre-allocated buffers for the kernel call and only clone the resulting slice to ensure the data is preserved for the CacheGenGPUBytestream objects.

Suggested change

-        local_out_buf = output_buffer.clone()
-        local_output_lengths = output_lengths.clone()
-        tmp_in = encode_input[:, start:end, :].clone()
-        lmc_ops.pac_encode(tmp_in, meta_data, local_out_buf, local_output_lengths)
-        max_len = local_output_lengths[-1, -1]
-        data_chunks.append(
-            CacheGenGPUBytestream(
-                bytestream=local_out_buf.flatten()[0:max_len],
-                bytestream_lengths=local_output_lengths,
-                ntokens=end - start,
-            )
-        )
+        tmp_in = encode_input[:, start:end, :]
+        output_buffer.zero_()
+        output_lengths.zero_()
+        lmc_ops.pac_encode(tmp_in, meta_data, output_buffer, output_lengths)
+        max_len = output_lengths[-1, -1]
+        data_chunks.append(
+            CacheGenGPUBytestream(
+                bytestream=output_buffer.flatten()[0:max_len].clone(),
+                bytestream_lengths=output_lengths.clone(),
+                ntokens=end - start,
+            )
+        )

Gemini is wrong here about cloning the slice for tmp_in. input_sym.sizes() will return different results depending on whether the clone is present. The number of tokens is detected from this, so removing the clone would require broader changes. As the copy is not actually a performance bottleneck, I'm ignoring that part of this suggestion. (Moving the other clones is done.)
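
The part of the suggestion that was adopted (reuse one pre-allocated buffer, copy only the result slice) can be sketched without any NPU code. The example below is plain Python: `fake_encode` is a hypothetical stand-in for the kernel call and a `memoryview` plays the role of the shared output buffer. It only illustrates why the result slice has to be copied when the buffer is reused; it says nothing about the real `pac_encode` behaviour.

```python
def fake_encode(chunk: bytes, out: memoryview) -> int:
    """Hypothetical stand-in for a kernel: write `chunk` into `out`, return bytes written."""
    n = len(chunk)
    out[:n] = chunk
    return n


def encode_chunks(chunks, scratch_size=8):
    # One scratch buffer, allocated once and reused for every chunk.
    scratch = memoryview(bytearray(scratch_size))
    zeros = bytes(scratch_size)
    aliased, copied = [], []
    for chunk in chunks:
        scratch[:] = zeros                 # reset the shared buffer, like zero_()
        n = fake_encode(chunk, scratch)
        aliased.append(scratch[:n])        # a view: still backed by the shared buffer
        copied.append(bytes(scratch[:n]))  # a snapshot: safe to keep after the next iteration
    return aliased, copied


aliased, copied = encode_chunks([b"abc", b"xy"])
print([bytes(v) for v in aliased])  # views reflect whatever was written last
print(copied)                       # snapshots keep each chunk's own bytes
```

The views collected in `aliased` all alias the same storage, so the first entry no longer holds the first chunk once the second chunk has been encoded. The snapshots in `copied` are what the `.clone()` on the result slice provides in the suggested change.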

gemini-code-assist · 2026-04-16T10:37:57Z

+
+static constexpr uint32_t AIV_MAX = 20;
+
+void pac_perpare_enc_metadata(const at::Tensor &input_sym,


There is a typo in the function name: pac_perpare_enc_metadata should be pac_prepare_enc_metadata. This typo is also present in the header file, the pybind registration, and the Python caller.

Suggested change

-void pac_perpare_enc_metadata(const at::Tensor &input_sym,
+void pac_prepare_enc_metadata(const at::Tensor &input_sym,

gemini-code-assist · 2026-04-16T10:37:57Z

+  auto output_lengths_data_ptr =
+      static_cast<uint8_t *>(output_lengths.data_ptr());
+
+  auto workGM = torch::zeros({40 * 32}, input_sym.options().dtype(torch::kI32));


Allocating a workspace tensor (workGM) on every call to pac_encode introduces unnecessary overhead. Since the size is constant (1280 elements), consider using a pre-allocated buffer or a more efficient memory management strategy to avoid repeated NPU allocations during the encoding process.

Perhaps, but the allocation looks to be around 0.5% of the overhead (30us vs 6ms) and, implemented as it is, removes any concerns around concurrency or generality. I'm inclined to keep this simple and pay the small overhead cost.

As discussed, worth changing this to Int8 instead of Int32.
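
The figures quoted in this thread can be checked with a few lines of arithmetic. This sketch assumes that "30us vs 6ms" means allocation time versus total encode time, and that the workspace has the 40 * 32 elements shown in the diff; the constant names are invented for the illustration.

```python
# Timings quoted in the thread (assumed: allocation time vs. total encode time).
ALLOC_US = 30.0
ENCODE_US = 6000.0  # 6 ms expressed in microseconds

# Workspace shape from the diff: torch::zeros({40 * 32}, ... kI32)
WORKSPACE_ELEMS = 40 * 32

overhead_fraction = ALLOC_US / ENCODE_US
bytes_int32 = WORKSPACE_ELEMS * 4  # 4 bytes per 32-bit element
bytes_int8 = WORKSPACE_ELEMS * 1   # 1 byte per 8-bit element

print(f"allocation share of encode time: {overhead_fraction:.1%}")
print(f"workspace size as int32: {bytes_int32} bytes")
print(f"workspace size as int8:  {bytes_int8} bytes")
```

On these numbers the per-call allocation is a small fraction of the encode time, and switching the element type from 32-bit to 8-bit would shrink the zero-filled workspace by a factor of four.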

gemini-code-assist · 2026-04-16T10:37:57Z

+        key, value = out.float()
+
+    e = decode_stream.record_event(e)
+    e.synchronize()


The use of e.synchronize() blocks the host CPU until the NPU finishes the decoding process. This prevents the caller from overlapping other tasks with the NPU work. Consider removing this synchronization and letting the caller handle it, or only synchronize when the result is actually needed.

The AI-proposed fix isn't workable, but on revisiting the problem that the sync is fixing, it might actually be addressable by adding some (seemingly) redundant copies instead. First indication is that this works and is more efficient.

First indication was wrong: the correctness issues come back. In principle I agree that this isn't perfect, but practically speaking it works.

gemini-code-assist · 2026-04-16T10:37:57Z

+    meta_data = torch.zeros((nlayers, nchannels, n_bins), dtype=torch.int16).to(
+        device="npu"
+    )


Allocating a tensor on the CPU and then moving it to the NPU using .to(device="npu") is less efficient than direct allocation on the device. Use the device argument in the factory function.

meta_data = torch.zeros((nlayers, nchannels, n_bins), dtype=torch.int16, device="npu")

gemini-code-assist · 2026-04-16T10:37:57Z

+    output_buffer = torch.zeros((nlayers, nchannels, chunk_size), dtype=torch.uint8).to(
+        device="npu"
+    )
+    output_lengths = torch.zeros((nlayers, nchannels), dtype=torch.int32).to(
+        device="npu"
+    )


Directly allocate these tensors on the NPU to avoid the overhead of CPU allocation followed by a device transfer.

output_buffer = torch.zeros((nlayers, nchannels, chunk_size), dtype=torch.uint8, device="npu")
output_lengths = torch.zeros((nlayers, nchannels), dtype=torch.int32, device="npu")

matthewygf · 2026-04-28T10:56:46Z

+              "Number of channels exceeds that supported be encode, contact "
+              "LMCache Ascend about changing this limitation");
+  TORCH_CHECK(nbins <= 32,
+              "Number of bins exceeds that supported be encode, contact."


nit: contact. -> contact

matthewygf · 2026-04-28T10:57:38Z

+      "of 32, contact LMCache Ascend about changing this limitation");
+
+  const c10::OptionalDeviceGuard device_guard(device_of(input_sym));
+  const aclrtStream stream = c10_npu::getCurrentNPUStream().stream();


nit: replace stream() -> stream(false) if you do not require synchronization of the task queue.

matthewygf · 2026-04-28T10:59:35Z

+              "chunking the input."
+              "Contact LMCache Ascend about changing this limitation");
+  TORCH_CHECK(nbins <= 32,
+              "Number of bins exceeds that supported be encode, contact."


nit: contact. -> contact

matthewygf · 2026-04-28T11:05:48Z

+  auto output_lengths_data_ptr =
+      static_cast<uint8_t *>(output_lengths.data_ptr());
+
+  auto workGM = torch::zeros({40 * 32}, input_sym.options().dtype(torch::kI32));


As discussed, worth changing this to Int8 instead of Int32.

matthewygf · 2026-04-28T11:15:46Z

+    ) -> List[Optional[MemoryObj]]:
+        source_bufs = old_batched_get_blocking(self, keys)
+
+        allocator = self.get_allocator_backend()


As discussed, please only target cachegen backend for now.

colinreyn requested review from chloroethylene and matthewygf as code owners April 16, 2026 10:35

gemini-code-assist Bot reviewed Apr 16, 2026

Invoke more performant coder kernel for cachegen serde

83313d6

colinreyn force-pushed the cr1/pac_coder branch from 8633b30 to 83313d6 Compare April 21, 2026 16:38

matthewygf reviewed Apr 28, 2026

matthewygf requested changes Apr 28, 2026
