diff --git a/AI/AGENTS.md b/AI/AGENTS.md
Lines changed: 6 additions & 6 deletions
diff --git a/AI/developer_guides/developer_guide.md b/AI/developer_guides/developer_guide.md
Lines changed: 2 additions & 2 deletions
diff --git a/AI/generate_claude_folder.py b/AI/generate_claude_folder.py
Lines changed: 12 additions & 12 deletions
diff --git a/docker/Dockerfile.pd b/docker/Dockerfile.pd
Lines changed: 106 additions & 0 deletions
diff --git a/examples/algo3.sh b/examples/algo3.sh
Lines changed: 5 additions & 5 deletions
diff --git a/third_party/sglang/v0.5.9/sglang/python/sglang/srt/layers/attention/attention_registry.py b/third_party/sglang/v0.5.9/sglang/python/sglang/srt/layers/attention/attention_registry.py
Lines changed: 17 additions & 0 deletions
diff --git a/third_party/sglang/v0.5.9/sglang/python/sglang/srt/model_executor/model_runner.py b/third_party/sglang/v0.5.9/sglang/python/sglang/srt/model_executor/model_runner.py
Lines changed: 18 additions & 7 deletions
diff --git a/third_party/sglang/v0.5.9/sglang/python/sglang/srt/model_executor/model_runner_kv_cache_mixin.py b/third_party/sglang/v0.5.9/sglang/python/sglang/srt/model_executor/model_runner_kv_cache_mixin.py
Lines changed: 24 additions & 2 deletions
diff --git a/third_party/sglang/v0.5.9/sglang/python/sglang/srt/server_args.py b/third_party/sglang/v0.5.9/sglang/python/sglang/srt/server_args.py
Lines changed: 18 additions & 0 deletions
@@ -35,23 +35,23 @@ Example existing flows live at the top level (`submissions/example_*`)
 and are not under any agent tag — those are framework reference
 materials, not work product.
 
-### Environment — activate the `vortex_v04` conda env first
+### Environment — activate the `vortex_v1` conda env first
 
 Every python invocation in this contract (`check_engine_config`,
 `run_submission_aime24.py`, the pre-flight loops in §5/§5c/§5f,
-the iterate driver, etc.) expects the **`vortex_v04`** conda
+the iterate driver, etc.) expects the **`vortex_v1`** conda
 environment. Activate it once at session start, before running
 any bash snippet below:
 
 ```bash
 source "$(conda info --base)/etc/profile.d/conda.sh"
-conda activate vortex_v04
-python -c "import sys; print(sys.executable)"   # expect .../envs/vortex_v04/...
+conda activate vortex_v1
+python -c "import sys; print(sys.executable)"   # expect .../envs/vortex_v1/...
 ```
 
 If `conda activate` is unavailable in the current shell, fall
-back to `conda run -n vortex_v04 python ...` per call. Either
-way, the running interpreter must be the one inside `vortex_v04`
+back to `conda run -n vortex_v1 python ...` per call. Either
+way, the running interpreter must be the one inside `vortex_v1`
 — a system / base / wrong-env python will fail to import
 `vortex_torch`'s C extension and the framework's Triton kernels.
 
 
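The interpreter check in the hunk above can also be done from inside Python. This is a minimal sketch, assuming the active interpreter path contains `envs/<env-name>` as the snippet's comment expects; `in_conda_env` is a hypothetical helper for illustration and is not part of this repository.

```python
import sys
from pathlib import PurePosixPath


def in_conda_env(executable: str, env_name: str) -> bool:
    """Return True if `executable` sits under .../envs/<env_name>/ (hypothetical helper)."""
    parts = PurePosixPath(executable).parts
    # Look for the adjacent path components ("envs", env_name) anywhere in the path.
    return any(a == "envs" and b == env_name for a, b in zip(parts, parts[1:]))


if __name__ == "__main__":
    ok = in_conda_env(sys.executable, "vortex_v1")
    print(f"{sys.executable}: {'ok' if ok else 'not inside vortex_v1'}")
```

Matching on whole path components rather than on a substring keeps a stale `vortex_v04` interpreter from passing the check by accident.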
@@ -128,7 +128,7 @@ vortex_torch/
 │   └── compiler/               # mirror of indexer/compiler
 ├── engine/
 │   └── sgl.py                  # get_engine_from_json + check_engine_config
-└── third_party/sglang/v0.4.9/sglang/...  # patched sglang with the VTX backend
+└── third_party/sglang/v0.5.9/sglang/...  # patched sglang with the VTX backend
 ```
 
 Rule of thumb: **every op has one class and one codegen function**;
@@ -1363,7 +1363,7 @@ class, and returns it.
 
 ## 13. Runtime integration with sglang
 
-Patched sglang ships in `third_party/sglang/v0.4.9/sglang/`. Two glue files:
+Patched sglang ships in `third_party/sglang/v0.5.9/sglang/`. Two glue files:
 
 ### 13.1 `VTXGraphAttnBackend` (sglang/srt/layers/attention/vtx_graph_backend.py)
 
 
@@ -146,25 +146,25 @@
       across requests with matching prompt prefixes, corrupting
       Save/Load values. `check_engine_config` rejects the violation.
 
-    ## Environment — activate the `vortex_v04` conda env first
+    ## Environment — activate the `vortex_v1` conda env first
 
     Every python invocation in this project (`check_engine_config`,
     `run_submission_aime24.py`, the pre-flight loops in the slash
-    commands, etc.) expects the **`vortex_v04`** conda environment.
+    commands, etc.) expects the **`vortex_v1`** conda environment.
     **Activate it once at session start** before running any of the
     bash snippets below:
 
     ```bash
     source "$(conda info --base)/etc/profile.d/conda.sh"
-    conda activate vortex_v04
-    python -c "import sys; print(sys.executable)"   # expect a path under .../envs/vortex_v04/
+    conda activate vortex_v1
+    python -c "import sys; print(sys.executable)"   # expect a path under .../envs/vortex_v1/
     ```
 
     If `conda activate` isn't available in the current shell (e.g. a
     non-interactive sub-shell that didn't source the conda profile),
-    fall back to `conda run -n vortex_v04 python ...` for every
+    fall back to `conda run -n vortex_v1 python ...` for every
     python call. Either form is acceptable; what matters is that the
-    running interpreter is the one inside `vortex_v04`.
+    running interpreter is the one inside `vortex_v1`.
 
     ## Running the benchmark — policy
 
@@ -413,19 +413,19 @@
     yet exist, create it; otherwise resume into it. Confirm the tag
     with the user only if you cannot determine your model name.
 
-    ### Second action — activate the `vortex_v04` conda env
+    ### Second action — activate the `vortex_v1` conda env
 
     Every python call in this workflow must run inside the
-    **`vortex_v04`** conda env. Activate once at session start:
+    **`vortex_v1`** conda env. Activate once at session start:
 
     ```bash
     source "$(conda info --base)/etc/profile.d/conda.sh"
-    conda activate vortex_v04
-    python -c "import sys; print(sys.executable)"   # must be .../envs/vortex_v04/...
+    conda activate vortex_v1
+    python -c "import sys; print(sys.executable)"   # must be .../envs/vortex_v1/...
     ```
 
     If `conda activate` isn't usable in the current shell, prefix
-    each python invocation with `conda run -n vortex_v04` instead.
+    each python invocation with `conda run -n vortex_v1` instead.
     A wrong-env python will fail to import the framework's C
     extension and every pre-flight / benchmark call below will error.
 
@@ -1074,7 +1074,7 @@
     ```bash
     CONDA_BASE=$(conda info --base 2>/dev/null || echo /root/anaconda3)
     source "$CONDA_BASE/etc/profile.d/conda.sh"
-    conda activate vortex_v04
+    conda activate vortex_v1
     python -c "import sys; print(sys.executable)"
     ```
 
 
@@ -0,0 +1,106 @@
+# vortex_torch + sglang v0.5.9, with PD (prefill/decode) disaggregation.
+#
+# Builds the v0.5 branch (vendored sglang lives at
+# third_party/sglang/v0.5.9/sglang) and adds the RDMA/InfiniBand userspace
+# stack + the Mooncake transfer engine that sglang's disaggregation backend
+# needs to move KV cache between prefill and decode workers.
+#
+# Build:
+#   DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.pd \
+#     --build-arg VORTEX_TORCH_REF=v0.5 \
+#     --build-arg TORCH_CUDA_ARCH_LIST="9.0" \
+#     -t vortex-torch:pd-0.5.9 .
+#   # Blackwell (B200/sm100): pass TORCH_CUDA_ARCH_LIST="10.0".
+#
+# Run (needs RDMA devices + IPC/host net for the transfer engine), e.g.:
+#   docker run --gpus all --ipc=host --network=host \
+#     --device=/dev/infiniband --cap-add=IPC_LOCK \
+#     -v /raid/catalyst/models:/models -e HF_HOME=/models \
+#     -it vortex-torch:pd-0.5.9
+#   # then: bash marks/pd/run_p1d1.sh   (mooncake backend, --disaggregation-ib-device mlx5_0)
+
+# CUDA 12.9 matches sglang 0.5.9's pins (cuda-python==12.9, torch==2.9.1).
+ARG CUDA_VERSION=12.9.1
+FROM nvidia/cuda:${CUDA_VERSION}-cudnn-devel-ubuntu24.04
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+# Hopper=9.0, Blackwell=10.0. Override at build time as needed.
+ARG TORCH_CUDA_ARCH_LIST="9.0"
+ENV TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST}"
+
+SHELL ["/bin/bash", "-c"]
+
+# --- system deps: toolchain + libnuma + RDMA/InfiniBand userspace (for Mooncake) ---
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    python3 \
+    python3-dev \
+    python3-pip \
+    python3-venv \
+    build-essential \
+    ca-certificates \
+    cmake \
+    curl \
+    git \
+    ninja-build \
+    wget \
+    libnuma1 \
+    libnuma-dev \
+    # InfiniBand / RDMA userspace — required by the Mooncake transfer engine
+    rdma-core \
+    libibverbs-dev \
+    libibverbs1 \
+    libibumad3 \
+    librdmacm1 \
+    ibverbs-providers \
+    infiniband-diags \
+    perftest \
+    && ln -sf /usr/bin/python3 /usr/bin/python \
+    && rm -rf /var/lib/apt/lists/*
+
+# Isolated venv (Ubuntu 24.04 is PEP-668 externally-managed).
+RUN python3 -m venv /opt/venv
+ENV PATH="/opt/venv/bin:${PATH}"
+ENV VIRTUAL_ENV="/opt/venv"
+RUN python -m pip install --upgrade pip setuptools wheel
+
+ARG VORTEX_TORCH_REF=v0.5
+ARG MOONCAKE_VERSION=0.3.9
+
+WORKDIR /workspace
+
+# The PD-disaggregation support (server-args overlap-schedule force, the
+# decode-side rebuild_aux hook, the page-major get_contiguous_buf_infos) lives
+# in the vendored sglang on this branch, so push VORTEX_TORCH_REF before building.
+RUN git clone -b "${VORTEX_TORCH_REF}" --recursive \
+    https://github.com/Infini-AI-Lab/vortex_torch.git
+
+# --- sglang v0.5.9 (vendored): editable install of its python package ---
+# (v0.5.9 has no install.sh; pulls torch==2.9.1, flashinfer==0.6.3,
+#  sgl-kernel==0.3.21 as wheels.)
+WORKDIR /workspace/vortex_torch/third_party/sglang/v0.5.9/sglang
+RUN pip install --no-cache-dir -e "python"
+
+# --- vortex_torch (pure Python + Triton JIT; no compiled C extension) ---
+WORKDIR /workspace/vortex_torch
+RUN pip install --no-cache-dir -e .
+
+# --- Mooncake transfer engine (KV transport for PD disaggregation) ---
+# CUDA 12.x → pip wheel. (CUDA>=13 would need a from-source build.)
+RUN pip install --no-cache-dir "mooncake-transfer-engine==${MOONCAKE_VERSION}"
+
+# --- sanity checks ---
+RUN which python && python --version && which pip && pip --version
+RUN python - <<'PY'
+import ctypes
+for lib in ("libnuma.so.1", "libibverbs.so.1", "librdmacm.so.1"):
+    ctypes.CDLL(lib)
+    print(f"OK: {lib} loaded")
+import sglang, vortex_torch
+print("OK: import sglang", getattr(sglang, "__version__", "?"))
+print("OK: import vortex_torch")
+from mooncake.engine import TransferEngine  # mooncake transport entrypoint
+print("OK: mooncake TransferEngine importable")
+PY
+
+CMD ["/bin/bash"]
@@ -10,10 +10,10 @@ models=(
 MiniMaxAI/MiniMax-M2.7
 )
 trials=(
-32
+16
 )
 topk_val=(
-125
+61
 )
 for algo in "${sparse_algos[@]}"; do
   for model in "${models[@]}"; do
@@ -23,9 +23,9 @@ for algo in "${sparse_algos[@]}"; do
         python examples/verify_algo.py \
             --trials ${trial} \
             --topk-val ${k_val} \
-            --page-size 16 \
+            --page-size 32 \
             --workload-chunk-size 64 \
-            --block-size 16 \
+            --block-size 32 \
             --topk-ratio 0.00 \
             --vortex-module-name "${algo}" \
             --model-name  "${model}" \
@@ -37,7 +37,7 @@ for algo in "${sparse_algos[@]}"; do
             --vortex-attention-backend trtllm \
             --vortex-impl-backend triton \
             --vortex-use-tensor-core \
-            --vortex-layers-skip 0 \
+            --vortex-layers-skip \
             --summary-dir summary-MiniMax-M2.7-sglang-trtllm \
             --skip-already-finished-check
       done
 
@@ -50,6 +50,9 @@ def create_flashinfer_backend(runner):
                 runner, init_new_workspace=runner.init_new_workspace
             )
     else:
+        # MLA + vortex sparsity is wired on the trtllm_mla backend (see
+        # create_trtllm_mla_backend) so prefill dispatches through the
+        # materialized MHA path. The flashinfer MLA path stays dense-only.
         from sglang.srt.layers.attention.flashinfer_mla_backend import (
             FlashInferMLAAttnBackend,
         )
@@ -61,6 +64,13 @@ def create_flashinfer_backend(runner):
 def create_trtllm_mla_backend(runner):
     if not runner.use_mla_backend:
         raise ValueError("trtllm_mla backend can only be used with MLA models.")
+    if runner.server_args.enable_vortex_sparsity:
+        # MLA + vortex sparsity. Use trtllm_mla (not flashinfer) so the model
+        # dispatches PREFILL through the materialized MHA path (192/128) — same
+        # as the dense baseline — and DECODE through absorb (the sparse path).
+        from vortex_torch.engine.sgl.attention_backend import VortexTRTLLMMLABackend
+
+        return VortexTRTLLMMLABackend(runner)
     from sglang.srt.layers.attention.trtllm_mla_backend import TRTLLMMLABackend
 
     return TRTLLMMLABackend(runner)
@@ -108,6 +118,13 @@ def create_triton_backend(runner):
         )
 
         return DoubleSparseAttnBackend(runner)
+    elif runner.use_mla_backend and runner.server_args.enable_vortex_sparsity:
+        # MLA + vortex on the Triton decode kernel (not geometry-locked like
+        # trtllm_mla; handles GLM-4.7-Flash's qk_nope=192 / v_head=256). Prefill
+        # + skipped-layer decode delegate to the dense TritonAttnBackend.
+        from vortex_torch.engine.sgl.attention_backend import VortexTritonMLABackend
+
+        return VortexTritonMLABackend(runner)
     else:
         from sglang.srt.layers.attention.triton_backend import TritonAttnBackend
 
 
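The dispatch added to `attention_registry.py` above can be summarized as a small decision function. This is a sketch only: `RunnerFlags` and `pick_backend` are hypothetical stand-ins, the function returns class names as strings instead of importing the real backends, and branches that are not visible in these hunks (for example the double-sparsity branch that precedes the new `elif` in `create_triton_backend`) are left out.

```python
from dataclasses import dataclass


@dataclass
class RunnerFlags:
    """Stand-in for the two runner attributes the hunks branch on (hypothetical)."""

    use_mla_backend: bool
    enable_vortex_sparsity: bool


def pick_backend(name: str, flags: RunnerFlags) -> str:
    """Return the backend class name the patched factories appear to choose."""
    if name == "trtllm_mla":
        if not flags.use_mla_backend:
            raise ValueError("trtllm_mla backend can only be used with MLA models.")
        if flags.enable_vortex_sparsity:
            return "VortexTRTLLMMLABackend"
        return "TRTLLMMLABackend"
    if name == "triton":
        # The double-sparsity branch that precedes this check is omitted here.
        if flags.use_mla_backend and flags.enable_vortex_sparsity:
            return "VortexTritonMLABackend"
        return "TritonAttnBackend"
    raise KeyError(f"backend {name!r} is not modeled in this sketch")
```

As the hunks are written, a non-MLA model with vortex sparsity enabled still falls through to the dense `TritonAttnBackend` in the `triton` factory.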
@@ -594,13 +594,24 @@ def initialize(self, min_per_gpu_memory: float):
                 self.server_args.vortex_module_name,
                 user_file=self.server_args.vortex_module_path
             )
-            self.sparse_attention.initialize(
-                block_size=self.block_size,
-                head_dim=self.model_config.head_dim,
-                kv_cache_dtype=self.kv_cache_dtype,
-                q_data_type=self.dtype,
-                intermediate_dtype=self.server_args.vortex_dtype,
-            )
+            if isinstance(self.sparse_attention, vortex_torch.flow.vFlowMLA):
+                # MLA flow: latent geometry instead of a single head_dim.
+                self.sparse_attention.initialize(
+                    block_size=self.block_size,
+                    kv_lora_rank=self.model_config.kv_lora_rank,
+                    qk_rope_head_dim=self.model_config.qk_rope_head_dim,
+                    kv_cache_dtype=self.kv_cache_dtype,
+                    q_data_type=self.dtype,
+                    intermediate_dtype=self.server_args.vortex_dtype,
+                )
+            else:
+                self.sparse_attention.initialize(
+                    block_size=self.block_size,
+                    head_dim=self.model_config.head_dim,
+                    kv_cache_dtype=self.kv_cache_dtype,
+                    q_data_type=self.dtype,
+                    intermediate_dtype=self.server_args.vortex_dtype,
+                )
 
         # Init memory pool and attention backends
         self.init_memory_pool(min_per_gpu_memory)
 
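The two `initialize(...)` calls in the hunk above differ only in their geometry arguments. One way to make that explicit is to build the shared keyword arguments once and add the geometry afterwards. This is a sketch under stated assumptions: `build_initialize_kwargs` is a hypothetical helper, and the numeric values in the usage lines are illustrative, not taken from any model config in this PR.

```python
from typing import Any, Dict, Optional


def build_initialize_kwargs(
    *,
    is_mla_flow: bool,
    block_size: int,
    kv_cache_dtype: Any,
    q_data_type: Any,
    intermediate_dtype: Any,
    head_dim: Optional[int] = None,
    kv_lora_rank: Optional[int] = None,
    qk_rope_head_dim: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble kwargs for sparse_attention.initialize (hypothetical helper)."""
    # Arguments both branches of the hunk pass unchanged.
    kwargs: Dict[str, Any] = dict(
        block_size=block_size,
        kv_cache_dtype=kv_cache_dtype,
        q_data_type=q_data_type,
        intermediate_dtype=intermediate_dtype,
    )
    if is_mla_flow:
        # MLA flow: latent geometry instead of a single head_dim.
        if kv_lora_rank is None or qk_rope_head_dim is None:
            raise ValueError("MLA flow needs kv_lora_rank and qk_rope_head_dim")
        kwargs.update(kv_lora_rank=kv_lora_rank, qk_rope_head_dim=qk_rope_head_dim)
    else:
        if head_dim is None:
            raise ValueError("non-MLA flow needs head_dim")
        kwargs["head_dim"] = head_dim
    return kwargs


# Illustrative values only.
mla_kwargs = build_initialize_kwargs(
    is_mla_flow=True, block_size=32, kv_cache_dtype="bf16", q_data_type="bf16",
    intermediate_dtype="fp32", kv_lora_rank=512, qk_rope_head_dim=64,
)
```

The call site would then reduce to `self.sparse_attention.initialize(**kwargs)`, assuming the two flow classes keep the keyword names shown in the diff.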
@@ -546,7 +546,12 @@ def init_memory_pool(self: ModelRunner, total_gpu_memory: int):
                 end_layer=self.end_layer,
                 index_head_dim=get_nsa_index_head_dim(self.model_config.hf_config),
             )
-        elif self.use_mla_backend and not self.mambaish_config:
+        elif (
+            self.use_mla_backend
+            and not self.mambaish_config
+            and not self.server_args.enable_vortex_sparsity
+        ):
+            # vortex+MLA falls through to the VortexMLACachePool branch below.
             assert not is_nsa_model
             if is_float4_e2m1fn_x2(self.kv_cache_dtype):
                 self.token_to_kv_pool = MLATokenToKVPoolFP4(
@@ -588,8 +593,25 @@ def init_memory_pool(self: ModelRunner, total_gpu_memory: int):
                 start_layer=self.start_layer,
                 end_layer=self.end_layer,
             )
+        elif self.server_args.enable_vortex_sparsity and self.use_mla_backend:
+            # MLA: single fused latent pool (kv_c | k_pe), no per-head K/V.
+            from vortex_torch.engine.sgl.memory_pool_mla import VortexMLACachePool
+            self.token_to_kv_pool = VortexMLACachePool(
+                self.max_total_num_tokens,
+                page_size=self.page_size,
+                dtype=self.kv_cache_dtype,
+                kv_lora_rank=self.model_config.kv_lora_rank,
+                qk_rope_head_dim=self.model_config.qk_rope_head_dim,
+                layer_num=self.num_effective_layers,
+                device=self.device,
+                enable_memory_saver=self.server_args.enable_memory_saver,
+                sparse_attention=self.sparse_attention,
+                model_runner=self,
+                start_layer=self.start_layer,
+                end_layer=self.end_layer,
+            )
         elif self.server_args.enable_vortex_sparsity:
-                    
+
             from vortex_torch.engine.sgl.memory_pool import VortexCachePool
             self.token_to_kv_pool = VortexCachePool(
                 self.max_total_num_tokens,
 
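The `and not self.server_args.enable_vortex_sparsity` guard added above matters because the `elif` chain is evaluated in order: without it, the dense MLA branch would claim a vortex+MLA run before the new `VortexMLACachePool` branch is reached. The sketch below models only the three branches visible in these hunks. `pick_kv_pool` and its `guarded` switch are hypothetical, and the real chain contains further branches (NSA and others) that are not modeled.

```python
def pick_kv_pool(
    use_mla_backend: bool,
    mambaish: bool,
    enable_vortex_sparsity: bool,
    guarded: bool = True,
) -> str:
    """Return which pool branch is taken, in the order the hunks list them."""
    dense_mla = use_mla_backend and not mambaish
    if guarded:
        # The condition added by this PR: vortex runs skip the dense MLA branch.
        dense_mla = dense_mla and not enable_vortex_sparsity
    if dense_mla:
        return "dense MLA pool"
    if enable_vortex_sparsity and use_mla_backend:
        return "VortexMLACachePool"
    if enable_vortex_sparsity:
        return "VortexCachePool"
    return "not modeled"


print(pick_kv_pool(True, False, True))                 # guard in place
print(pick_kv_pool(True, False, True, guarded=False))  # guard removed
```

With the guard in place the first call reaches the vortex MLA branch; with `guarded=False` the same inputs are captured by the dense MLA branch.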
@@ -2565,6 +2565,24 @@ def _handle_load_format(self):
             )
 
     def _handle_pd_disaggregation(self):
+        # Vortex (PD disagg Option B) rebuilds its auxiliary cache on the decode
+        # side from the transferred K/V. That rebuild shares the cache-side
+        # compiled scratch with the live decode forward; the overlap scheduler
+        # would run them on different streams concurrently and corrupt it. Force
+        # overlap off on any vortex PD server — mirrors the non-PD engine helper
+        # in vortex_torch/engine/sgl/api.py, which hardcodes the same.
+        if self.enable_vortex_sparsity and self.disaggregation_mode in (
+            "prefill",
+            "decode",
+        ):
+            if not self.disable_overlap_schedule:
+                self.disable_overlap_schedule = True
+                logger.warning(
+                    "Overlap schedule is disabled for vortex sparsity under PD "
+                    "disaggregation (the decode-side aux rebuild shares cache "
+                    "scratch with the decode forward)."
+                )
+
         if self.disaggregation_mode == "decode":
             assert (
                 self.disaggregation_decode_tp is None