
Add Sync Loading Support to AscendP2PBackend#236

Draft
matthewygf wants to merge 11 commits into LMCache:main from matthewygf:p2p_sync

Collaborator

@matthewygf matthewygf commented May 15, 2026

Summary

This PR addresses #229 and does the following:

  1. Adds sync loading support to AscendP2PBackend.
  2. Updates the ProxyMemObj ref_count_down behavior to prevent leaks when a proxy is never consumed.
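
The second item can be illustrated with a minimal sketch. The class below is hypothetical and is not LMCache's actual ProxyMemObj: the names `ProxyMemObjSketch`, `release_cb`, and `consume` are assumptions made for illustration. It only shows the idea that dropping the last reference to a proxy that was never consumed should still release the remote resource.

```python
from typing import Callable


class ProxyMemObjSketch:
    """Hypothetical stand-in for a proxy to a remote memory buffer."""

    def __init__(self, release_cb: Callable[[], None]) -> None:
        self._release_cb = release_cb  # assumed hook that frees the remote side
        self._ref_count = 1
        self._consumed = False
        self._released = False

    def consume(self) -> None:
        # Simulates fetching the remote data; the remote buffer is released after.
        self._consumed = True
        self._release_once()

    def ref_count_up(self) -> None:
        self._ref_count += 1

    def ref_count_down(self) -> None:
        self._ref_count -= 1
        # If the proxy is discarded without ever being consumed, release anyway
        # so that the remote buffer is not leaked.
        if self._ref_count <= 0 and not self._consumed:
            self._release_once()

    def _release_once(self) -> None:
        # Guard so the release hook runs at most once per proxy.
        if not self._released:
            self._released = True
            self._release_cb()


released = []
proxy = ProxyMemObjSketch(lambda: released.append("freed"))
proxy.ref_count_down()  # never consumed, last reference dropped
```

In this sketch the release hook is guarded so that a consumed proxy and a discarded proxy both free the remote side exactly once.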

Why

  1. We have observed that when async_loading is turned on, the extra scheduling steps in vLLM-Ascend can increase request latency.
  2. We have also observed that, because we use RDMA for transfer, prefetching brings no benefit; combined with the extra scheduling steps, it adds further latency.
  3. Similar to other backends such as Mooncake, where lookup + contains can be used within sync loading, we extend P2P to allow such usage.
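
The third point can be sketched roughly as follows. This is a toy, in-memory illustration of a lookup followed by a blocking get, not the backend's real code: `ToyBackend` and its data are hypothetical, the method names `batched_contains` and `batched_get_blocking` are taken from the review summary in this thread, and their signatures and prefix-hit semantics here are assumptions.

```python
from typing import Dict, List, Optional


class ToyBackend:
    """Hypothetical in-memory backend; a real P2P backend would query a peer."""

    def __init__(self, store: Dict[str, bytes]) -> None:
        self._store = store

    def batched_contains(self, keys: List[str]) -> int:
        # Count how many leading keys are present (prefix-hit semantics, assumed).
        hits = 0
        for key in keys:
            if key not in self._store:
                break
            hits += 1
        return hits

    def batched_get_blocking(self, keys: List[str]) -> List[Optional[bytes]]:
        # Returns only once every value is resolved (None marks a missing key).
        return [self._store.get(key) for key in keys]


def sync_load(backend: ToyBackend, keys: List[str]) -> List[bytes]:
    """Look up first, then fetch only the hit prefix, all within the caller's step."""
    num_hits = backend.batched_contains(keys)
    chunks = backend.batched_get_blocking(keys[:num_hits])
    return [c for c in chunks if c is not None]


backend = ToyBackend({"chunk0": b"a", "chunk1": b"b"})
loaded = sync_load(backend, ["chunk0", "chunk1", "chunk2"])
```

Because the lookup and the fetch both complete inside one call, the caller needs no extra scheduling step to wait for a prefetch to finish.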

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces synchronous P2P lookup capabilities and enhances the robustness of asynchronous operations within the Ascend P2P backend. Key implementations include a dedicated ZMQ DEALER for synchronous queries with a TTL-based lookup cache, and a patch for the cache controller worker to ensure ZMQ operations are marshaled to the correct event loop. The changes also improve proxy memory management by ensuring resources are released if a proxy is discarded before consumption, and add support for blocking operations like batched_contains and batched_get_blocking. I have no feedback to provide as there were no review comments.
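
The TTL-based lookup cache mentioned in the review can be pictured with a minimal sketch. Everything here is an assumption for illustration: the class name `TTLLookupCache`, the choice of caching a hit count per key, and the injectable `clock` are not taken from the PR's code.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLLookupCache:
    """Hypothetical cache that remembers lookup results for a short time."""

    def __init__(
        self, ttl_s: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, int]] = {}

    def put(self, key: str, num_hit_chunks: int) -> None:
        # Store the value together with its absolute expiry time.
        self._entries[key] = (self._clock() + self._ttl_s, num_hit_chunks)

    def get(self, key: str) -> Optional[int]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries are dropped so a stale answer is never served.
            del self._entries[key]
            return None
        return value


now = [0.0]  # fake clock controlled by the example
cache = TTLLookupCache(ttl_s=2.0, clock=lambda: now[0])
cache.put("req-1", 3)
fresh = cache.get("req-1")  # within the TTL
now[0] = 2.5
stale = cache.get("req-1")  # past the TTL
```

A short TTL of this kind would let a repeated lookup for the same request skip the round trip to the peer while still bounding how long a stale answer can survive.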

@matthewygf
Collaborator Author

# Run 1: Qwen3-30B-A3B
===
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8605|±  |0.0095|
|     |       |strict-match    |     5|exact_match|↑  |0.8340|±  |0.0102|

# Run 2
===
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8590|±  |0.0096|
|     |       |strict-match    |     5|exact_match|↑  |0.8378|±  |0.0102|

@matthewygf
Collaborator Author

# Sync Loading P2P
# Qwen3-30B-A3B DP2 TP4 EP8
============ Serving Benchmark Result ============
Successful requests:                     30        
Failed requests:                         0         
Request rate configured (RPS):           1.00      
Benchmark duration (s):                  33.18     
Total input tokens:                      676140    
Total generated tokens:                  3000      
Request throughput (req/s):              0.90      
Output token throughput (tok/s):         90.41     
Peak output token throughput (tok/s):    459.00    
Peak concurrent requests:                21.00     
Total token throughput (tok/s):          20466.72  
---------------Time to First Token----------------
Mean TTFT (ms):                          2397.34   
Median TTFT (ms):                        2352.54   
P99 TTFT (ms):                           5544.75   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          76.85
Median TPOT (ms):                        65.81
P99 TPOT (ms):                           155.93    
---------------Inter-token Latency----------------
Mean ITL (ms):                           76.85     
Median ITL (ms):                         36.98     
P99 ITL (ms):                            831.06    
==================================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
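
The throughput rows follow from the request and token counts and the benchmark duration reported above. A quick check, assuming each throughput is simply a count divided by the benchmark duration; the printed duration is rounded to two decimals, so the recomputed values match the report only approximately:

```python
import math

# Values copied from the benchmark output above.
successful_requests = 30
duration_s = 33.18
input_tokens = 676140
generated_tokens = 3000

request_throughput = successful_requests / duration_s
output_tok_throughput = generated_tokens / duration_s
total_tok_throughput = (input_tokens + generated_tokens) / duration_s
mean_input_len = input_tokens / successful_requests  # tokens per request

# Compare against the reported figures with a small relative tolerance.
matches = (
    math.isclose(request_throughput, 0.90, rel_tol=1e-2)
    and math.isclose(output_tok_throughput, 90.41, rel_tol=1e-3)
    and math.isclose(total_tok_throughput, 20466.72, rel_tol=1e-3)
)
print(matches, mean_input_len)
```

The same counts give an average prompt length of about 22.5k input tokens per request, which is the context in which the TTFT figures should be read.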
