[Bugfix][MoE] Fix hardcoded SharedExperts output buffer size for DBO ubatches by Gregory-Pereira · Pull Request #39033 · vllm-project/vllm

Gregory-Pereira · 2026-04-05T17:12:49Z

Summary

The SharedExperts class hardcodes its output buffer to [None, None] regardless of the actual number of ubatches configured. This mirrors the bug that was fixed in WorkspaceManager in #38853 (cc @yewentao256 and @LucasWilkinson, based on the similar fix).
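
A minimal sketch of the sizing problem, assuming a simplified stand-in for the class. `SharedExpertsSketch`, `store_output`, and the `hardcoded` switch are hypothetical names used only for illustration; this is not the vLLM implementation. It shows why a fixed two-slot list fails once a third ubatch index is used, while a list sized from `num_ubatches` does not.

```python
from typing import Any, Optional


class SharedExpertsSketch:
    """Hypothetical stand-in; not the real vLLM SharedExperts class."""

    def __init__(self, num_ubatches: int = 1, hardcoded: bool = False):
        if num_ubatches < 1:
            raise ValueError("num_ubatches must be >= 1")
        self.num_ubatches = num_ubatches
        # Buggy pattern: always two slots, whatever the configuration says.
        # Fixed pattern: one slot per configured ubatch.
        self._output: list[Optional[Any]] = (
            [None, None] if hardcoded else [None] * num_ubatches
        )

    def store_output(self, ubatch_id: int, value: Any) -> None:
        # Each ubatch writes its result into its own slot.
        self._output[ubatch_id] = value


# Sized from the configuration: index 2 is valid with three ubatches.
fixed = SharedExpertsSketch(num_ubatches=3)
fixed.store_output(2, "out-2")

# Hardcoded pair: index 2 falls outside the two-slot list.
buggy = SharedExpertsSketch(num_ubatches=3, hardcoded=True)
try:
    buggy.store_output(2, "out-2")
    overflowed = False
except IndexError:
    overflowed = True
```

With the default `num_ubatches=1`, the sized list has a single slot, so the non-DBO path only ever touches index 0.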

Test plan

No new tests: this is a one-line buffer sizing fix matching an already-landed pattern ([Bug] Fix workspace manager _current_workspaces size #38853).
I will, however, test the default non-DBO path to ensure it's not affected.

test_moe-smoke-505passed.log
test_moe-full-3537passed.log

These were run against a tiny qwen3-0.6B with DBO disabled. If needed, I can test the DBO path too.

gemini-code-assist

Code Review

This pull request replaces the enable_dbo boolean flag with a num_ubatches integer across the Fused MoE layers and runners to support dynamic micro-batching and remove hardcoded buffer sizes. While SharedExperts was updated to use this new parameter for buffer allocation, the DefaultMoERunner currently receives the parameter without storing it or updating its internal indexing and buffer logic, which remains hardcoded or reliant on the old flag.

gemini-code-assist · 2026-04-05T17:14:41Z

+        num_ubatches: int = 1,
    ):
        super().__init__()
        self.moe_config = moe_config


The num_ubatches parameter is introduced here but not stored in the instance. This makes the fix incomplete because DefaultMoERunner itself manages internal buffers in _maybe_init_dp_chunking (lines 270-271) and indexing logic in _slice_and_copy_input (lines 671-674) that are still hardcoded to size 2 or rely solely on enable_dbo.

To fully address the hardcoding issue (mirroring the fix in SharedExperts), please store num_ubatches and update the aforementioned methods to use it for buffer allocation and indexing when num_ubatches > 1.

Suggested change

-        num_ubatches: int = 1,
-    ):
-        super().__init__()
-        self.moe_config = moe_config
+        num_ubatches: int = 1,
+    ):
+        super().__init__()
+        self.moe_config = moe_config
+        self.num_ubatches = num_ubatches

I had originally intended to scope this change to just SharedExperts / DBO, but I'll do it for DefaultMoERunner too.
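
The request in this thread (store `num_ubatches`, then use it for buffer allocation and indexing) can be sketched roughly as follows. `MoERunnerSketch`, its constructor arguments, and the method body are assumptions made for illustration; the method name only echoes `_slice_and_copy_input` from the review comment, and plain Python lists stand in for tensors. This is not the actual DefaultMoERunner code.

```python
class MoERunnerSketch:
    """Hypothetical stand-in; not the real vLLM DefaultMoERunner."""

    def __init__(self, hidden_size: int, max_tokens: int, num_ubatches: int = 1):
        if num_ubatches < 1:
            raise ValueError("num_ubatches must be >= 1")
        # Store the value so later methods can size and index buffers with it.
        self.num_ubatches = num_ubatches
        # One staging buffer per ubatch instead of a fixed pair.
        self.staging = [
            [[0.0] * hidden_size for _ in range(max_tokens)]
            for _ in range(num_ubatches)
        ]

    def slice_and_copy_input(self, ubatch_id, tokens):
        # Reject indices outside the configured range instead of failing later.
        if not 0 <= ubatch_id < self.num_ubatches:
            raise IndexError(f"ubatch_id {ubatch_id} out of range")
        buf = self.staging[ubatch_id]
        if len(tokens) > len(buf):
            raise ValueError("too many tokens for the staging buffer")
        # Copy each token row into this ubatch's own buffer.
        for row, token in zip(buf, tokens):
            row[:] = token
        return buf[: len(tokens)]


runner = MoERunnerSketch(hidden_size=2, max_tokens=4, num_ubatches=3)
copied = runner.slice_and_copy_input(2, [[1.0, 2.0], [3.0, 4.0]])
```

Keeping the default at `num_ubatches=1` means callers that never enable DBO allocate a single buffer and see no behavior change.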

Gregory-Pereira · 2026-04-05T18:52:40Z

logs:

=== Ready ===
vLLM 0.19.1rc1.dev41+gc5b071d48 from /tmp/vllm-fix/vllm/__init__.py

=== Running tests: tests/kernels/moe/test_moe.py -k "test_mixtral or test_qwen" ===
============================= test session starts ==============================
platform linux -- Python 3.12.13, pytest-9.0.2, pluggy-1.6.0 -- /usr/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default'
rootdir: /tmp/vllm-fix
configfile: pyproject.toml
plugins: typeguard-4.5.1, hypothesis-6.151.11, timeout-2.4.0, shard-0.1.2, rerunfailures-16.1, mock-3.15.1, forked-1.6.0, asyncio-1.3.0, hydra-core-1.3.2, buildkite-test-collector-0.1.9, cov-7.1.0, schemathesis-4.14.3, anyio-4.13.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collecting ... collected 3553 items / 3551 deselected / 2 selected
Running 2 items in this shard: tests/kernels/moe/test_moe.py::test_mixtral_moe[False-True-dtype0], tests/kernels/moe/test_moe.py::test_mixtral_moe[False-False-dtype0]

tests/kernels/moe/test_moe.py::test_mixtral_moe[False-True-dtype0] PASSED [ 50%]
tests/kernels/moe/test_moe.py::test_mixtral_moe[False-False-dtype0] PASSED [100%]

=============================== warnings summary ===============================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

../../usr/local/lib/python3.12/dist-packages/torch/jit/_script.py:362: 14 warnings
  /usr/local/lib/python3.12/dist-packages/torch/jit/_script.py:362: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=============== 2 passed, 3551 deselected, 16 warnings in 18.83s ===============

=== Tests PASSED ===
=== Starting vLLM server: Qwen/Qwen2.5-VL-3B-Instruct ===
(APIServer pid=1) INFO 04-05 18:41:00 [utils.py:299]
(APIServer pid=1) INFO 04-05 18:41:00 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 04-05 18:41:00 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1rc1.dev41+gc5b071d48
(APIServer pid=1) INFO 04-05 18:41:00 [utils.py:299]   █▄█▀ █     █     █     █  model   Qwen/Qwen2.5-VL-3B-Instruct
(APIServer pid=1) INFO 04-05 18:41:00 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 04-05 18:41:00 [utils.py:299]
(APIServer pid=1) INFO 04-05 18:41:00 [utils.py:233] non-default args: {'model': 'Qwen/Qwen2.5-VL-3B-Instruct'}
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_TEST_UBATCH_SERVICE_HOST
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_FORK_URL
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_TEST_UBATCH_PORT_8000_TCP_ADDR
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_TEST_UBATCH_PORT
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_TEST_PATH
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_TEST_UBATCH_PORT_8000_TCP_PROTO
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_SERVE_MODEL
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_TEST_UBATCH_PORT_8000_TCP
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_SERVE_ARGS
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_TEST_UBATCH_PORT_8000_TCP_PORT
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_TEST_UBATCH_SERVICE_PORT
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_BRANCH
(APIServer pid=1) INFO 04-05 18:41:07 [model.py:554] Resolved architecture: Qwen2_5_VLForConditionalGeneration
(APIServer pid=1) INFO 04-05 18:41:07 [model.py:1684] Using max model len 128000
(APIServer pid=1) INFO 04-05 18:41:07 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) INFO 04-05 18:41:07 [vllm.py:799] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 04-05 18:41:07 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore pid=1894) INFO 04-05 18:41:17 [core.py:105] Initializing a V1 LLM engine (v0.19.1rc1.dev41+gc5b071d48) with config: model='Qwen/Qwen2.5-VL-3B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-VL-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-VL-3B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 
'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=1894) INFO 04-05 18:41:19 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.0.2.150:39689 backend=nccl
(EngineCore pid=1894) INFO 04-05 18:41:19 [parallel_state.py:1712] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=1894) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore pid=1894) INFO 04-05 18:41:22 [gpu_model_runner.py:4735] Starting to load model Qwen/Qwen2.5-VL-3B-Instruct...
(EngineCore pid=1894) INFO 04-05 18:41:23 [cuda.py:418] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=1894) INFO 04-05 18:41:23 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=1894) INFO 04-05 18:41:23 [vllm.py:799] Asynchronous scheduling is enabled.
(EngineCore pid=1894) INFO 04-05 18:41:23 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=1894) INFO 04-05 18:41:23 [cuda.py:362] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=1894) INFO 04-05 18:41:23 [flash_attn.py:622] Using FlashAttention version 3
(EngineCore pid=1894) INFO 04-05 18:41:32 [weight_utils.py:615] Time spent downloading weights for Qwen/Qwen2.5-VL-3B-Instruct: 8.600505 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.99it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.13it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.11it/s]
(EngineCore pid=1894)
(EngineCore pid=1894) INFO 04-05 18:41:33 [default_loader.py:384] Loading weights took 0.99 seconds
(EngineCore pid=1894) INFO 04-05 18:41:33 [gpu_model_runner.py:4820] Model loading took 7.16 GiB memory and 10.353895 seconds
(EngineCore pid=1894) INFO 04-05 18:41:34 [gpu_model_runner.py:5760] Encoder cache will be initialized with a budget of 114688 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore pid=1894) WARNING 04-05 18:41:35 [op.py:236] Priority not set for op rms_norm, using native implementation.
(EngineCore pid=1894) INFO 04-05 18:41:45 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/65eaa184d7/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=1894) INFO 04-05 18:41:45 [backends.py:1111] Dynamo bytecode transform time: 4.50 s
(EngineCore pid=1894) INFO 04-05 18:41:47 [backends.py:372] Cache the graph of compile range (1, 8192) for later use
(EngineCore pid=1894) INFO 04-05 18:41:51 [backends.py:390] Compiling a graph for compile range (1, 8192) takes 6.06 s
(EngineCore pid=1894) INFO 04-05 18:41:52 [decorators.py:655] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/5e4e91cd0c1134ac47b1d40baef611a5e6acb22a6a561fce88340b9f3bb46a01/rank_0_0/model
(EngineCore pid=1894) INFO 04-05 18:41:52 [monitor.py:48] torch.compile took 11.85 s in total
(EngineCore pid=1894) INFO 04-05 18:41:52 [monitor.py:76] Initial profiling/warmup run took 0.11 s
(EngineCore pid=1894) INFO 04-05 18:41:57 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=1894) INFO 04-05 18:41:57 [gpu_model_runner.py:5883] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=51 (largest=512)
(EngineCore pid=1894) INFO 04-05 18:41:58 [gpu_model_runner.py:5962] Estimated CUDA graph memory: 0.37 GiB total
(EngineCore pid=1894) INFO 04-05 18:41:59 [gpu_worker.py:436] Available KV cache memory: 103.01 GiB
(EngineCore pid=1894) INFO 04-05 18:41:59 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9027 to maintain the same effective KV cache size.
(EngineCore pid=1894) INFO 04-05 18:41:59 [kv_cache_utils.py:1319] GPU KV cache size: 3,000,464 tokens
(EngineCore pid=1894) INFO 04-05 18:41:59 [kv_cache_utils.py:1324] Maximum concurrency for 128,000 tokens per request: 23.44x
(EngineCore pid=1894) 2026-04-05 18:41:59,170 - INFO - autotuner.py:446 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore pid=1894) 2026-04-05 18:41:59,178 - INFO - autotuner.py:455 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:01<00:00, 28.09it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 51/51 [00:00<00:00, 54.24it/s]
(EngineCore pid=1894) INFO 04-05 18:42:02 [gpu_model_runner.py:6053] Graph capturing finished in 3 secs, took 0.54 GiB
(EngineCore pid=1894) INFO 04-05 18:42:02 [gpu_worker.py:597] CUDA graph pool memory: 0.54 GiB (actual), 0.37 GiB (estimated), difference: 0.17 GiB (31.4%).
(EngineCore pid=1894) INFO 04-05 18:42:02 [core.py:283] init engine (profile, create kv cache, warmup model) took 28.71 seconds
(EngineCore pid=1894) INFO 04-05 18:42:03 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1) INFO 04-05 18:42:03 [api_server.py:604] Supported tasks: ['generate']
(APIServer pid=1) WARNING 04-05 18:42:03 [model.py:1441] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'repetition_penalty': 1.05, 'temperature': 1e-06}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 04-05 18:42:04 [hf.py:314] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 04-05 18:42:07 [base.py:245] Multi-modal warmup completed in 3.063s
(APIServer pid=1) INFO 04-05 18:42:07 [api_server.py:608] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:37] Available routes are:
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.

robertgshaw2-redhat · 2026-04-06T13:28:09Z

This feature is only triggered with MoE models; Qwen/Qwen2.5-VL-3B-Instruct and qwen3-0.6B won't trigger it.

Did you run into a concrete issue? I don't think we support anything besides 2 ubatches.

mergify · 2026-04-06T17:44:16Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Gregory-Pereira.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-05-13T22:11:47Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Gregory-Pereira.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

yewentao256

Is this issue still present in main? Please resolve the conflicts and take another look.

Gregory-Pereira added 2 commits April 5, 2026 08:33

[Bugfix][MoE] Fix hardcoded SharedExperts output buffer size for DBO …

ff7f3b5

…ubatches Signed-off-by: greg pereira <grpereir@redhat.com>

refactor to more thoroughly solve by passing num_ubatches instead of …

744060d

…a boolean Signed-off-by: greg pereira <grpereir@redhat.com>

Gregory-Pereira requested review from mgoin and pavanimajety as code owners April 5, 2026 17:12

mergify Bot added the bug Something isn't working label Apr 5, 2026

gemini-code-assist Bot reviewed Apr 5, 2026

extend variable ubatch size refactor to defaultMoERunner

c5b071d

Signed-off-by: greg pereira <grpereir@redhat.com>

mergify Bot added the needs-rebase label Apr 6, 2026

mergify Bot removed the needs-rebase label May 13, 2026

mergify Bot added the needs-rebase label May 13, 2026

yewentao256 reviewed May 14, 2026

Gregory-Pereira wants to merge 3 commits into vllm-project:main from Gregory-Pereira:fix/shared-experts-dbo-output-size
