[Bugfix][MoE] Fix hardcoded SharedExperts output buffer size for DBO ubatches #39033

Open
Gregory-Pereira wants to merge 3 commits into vllm-project:main from Gregory-Pereira:fix/shared-experts-dbo-output-size

Conversation

Contributor

@Gregory-Pereira Gregory-Pereira commented Apr 5, 2026

Summary

The SharedExperts class hardcodes its output buffer to [None, None], regardless of the number of ubatches actually configured. This is the same bug that was fixed in WorkspaceManager in #38853 (cc @yewentao256 and @LucasWilkinson, since this follows that similar fix).
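
A minimal sketch of the bug pattern described above, not the actual vLLM code. The class names `HardcodedBuffer` and `SizedBuffer`, the `_output` attribute, and the `store` method are hypothetical and only illustrate why a fixed two-slot list stops working once more than two ubatches are configured:

```python
from typing import Any


class HardcodedBuffer:
    """Illustrative only: always allocates two slots, like [None, None]."""

    def __init__(self) -> None:
        self._output = [None, None]

    def store(self, ubatch_id: int, value: Any) -> None:
        # Raises IndexError as soon as ubatch_id >= 2.
        self._output[ubatch_id] = value


class SizedBuffer:
    """Illustrative only: allocates one slot per configured ubatch."""

    def __init__(self, num_ubatches: int = 1) -> None:
        if num_ubatches < 1:
            raise ValueError("num_ubatches must be >= 1")
        self._output = [None] * num_ubatches

    def store(self, ubatch_id: int, value: Any) -> None:
        self._output[ubatch_id] = value
```

With `num_ubatches=2` both variants behave the same, which would explain why the hardcoded version goes unnoticed as long as only two ubatches are ever used.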

Test plan

test_moe-smoke-505passed.log
test_moe-full-3537passed.log

These were run against a tiny qwen3-0.6B with DBO disabled. If needed, I can test the DBO path too.

…ubatches

Signed-off-by: greg pereira <grpereir@redhat.com>
…a boolean

Signed-off-by: greg pereira <grpereir@redhat.com>
@mergify mergify Bot added the bug label Apr 5, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request replaces the enable_dbo boolean flag with a num_ubatches integer across the Fused MoE layers and runners to support dynamic micro-batching and remove hardcoded buffer sizes. While SharedExperts was updated to use this new parameter for buffer allocation, the DefaultMoERunner currently receives the parameter without storing it or updating its internal indexing and buffer logic, which remains hardcoded or reliant on the old flag.

Comment on lines +189 to 192
num_ubatches: int = 1,
):
super().__init__()
self.moe_config = moe_config
Contributor

high

The num_ubatches parameter is introduced here but not stored in the instance. This makes the fix incomplete because DefaultMoERunner itself manages internal buffers in _maybe_init_dp_chunking (lines 270-271) and indexing logic in _slice_and_copy_input (lines 671-674) that are still hardcoded to size 2 or rely solely on enable_dbo.

To fully address the hardcoding issue (mirroring the fix in SharedExperts), please store num_ubatches and update the aforementioned methods to use it for buffer allocation and indexing when num_ubatches > 1.

Suggested change
  num_ubatches: int = 1,
  ):
  super().__init__()
  self.moe_config = moe_config
+ self.num_ubatches = num_ubatches
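
As a hedged illustration of what the review comment asks for: once `num_ubatches` is stored on the instance, per-ubatch buffers can be sized and indexed from it instead of assuming two. `RunnerSketch`, its `staging` attribute, and this simplified `slice_and_copy_input` are hypothetical stand-ins built on plain Python lists; they are not `DefaultMoERunner`'s real API or buffer layout.

```python
class RunnerSketch:
    """Hypothetical runner that sizes its buffers from num_ubatches."""

    def __init__(self, hidden_size: int, max_tokens: int,
                 num_ubatches: int = 1) -> None:
        if num_ubatches < 1:
            raise ValueError("num_ubatches must be >= 1")
        # Store the count so later methods can index by it.
        self.num_ubatches = num_ubatches
        # One staging buffer of shape [max_tokens][hidden_size] per ubatch.
        self.staging = [
            [[0.0] * hidden_size for _ in range(max_tokens)]
            for _ in range(num_ubatches)
        ]

    def slice_and_copy_input(self, ubatch_id: int, tokens: list) -> list:
        """Copy `tokens` into the buffer owned by `ubatch_id`."""
        if not 0 <= ubatch_id < self.num_ubatches:
            raise IndexError(f"ubatch_id {ubatch_id} out of range")
        buf = self.staging[ubatch_id]
        n = len(tokens)
        if n > len(buf):
            raise ValueError("too many tokens for the staging buffer")
        # Same-length slice assignment keeps the buffer size unchanged.
        buf[:n] = [list(row) for row in tokens]
        return buf[:n]
```

The explicit range check makes an unsupported ubatch index fail loudly rather than silently reusing another ubatch's buffer.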

Contributor Author

I had originally intended to scope this change to just shared experts / DBO, but I'll do it for DefaultMoERunner too.

Signed-off-by: greg pereira <grpereir@redhat.com>
@Gregory-Pereira
Contributor Author

logs:

=== Ready ===
vLLM 0.19.1rc1.dev41+gc5b071d48 from /tmp/vllm-fix/vllm/__init__.py

=== Running tests: tests/kernels/moe/test_moe.py -k "test_mixtral or test_qwen" ===
============================= test session starts ==============================
platform linux -- Python 3.12.13, pytest-9.0.2, pluggy-1.6.0 -- /usr/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default'
rootdir: /tmp/vllm-fix
configfile: pyproject.toml
plugins: typeguard-4.5.1, hypothesis-6.151.11, timeout-2.4.0, shard-0.1.2, rerunfailures-16.1, mock-3.15.1, forked-1.6.0, asyncio-1.3.0, hydra-core-1.3.2, buildkite-test-collector-0.1.9, cov-7.1.0, schemathesis-4.14.3, anyio-4.13.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collecting ... collected 3553 items / 3551 deselected / 2 selected
Running 2 items in this shard: tests/kernels/moe/test_moe.py::test_mixtral_moe[False-True-dtype0], tests/kernels/moe/test_moe.py::test_mixtral_moe[False-False-dtype0]

tests/kernels/moe/test_moe.py::test_mixtral_moe[False-True-dtype0] PASSED [ 50%]
tests/kernels/moe/test_moe.py::test_mixtral_moe[False-False-dtype0] PASSED [100%]

=============================== warnings summary ===============================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

../../usr/local/lib/python3.12/dist-packages/torch/jit/_script.py:362: 14 warnings
  /usr/local/lib/python3.12/dist-packages/torch/jit/_script.py:362: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=============== 2 passed, 3551 deselected, 16 warnings in 18.83s ===============

=== Tests PASSED ===
=== Starting vLLM server: Qwen/Qwen2.5-VL-3B-Instruct ===
(APIServer pid=1) INFO 04-05 18:41:00 [utils.py:299]
(APIServer pid=1) INFO 04-05 18:41:00 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 04-05 18:41:00 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1rc1.dev41+gc5b071d48
(APIServer pid=1) INFO 04-05 18:41:00 [utils.py:299]   █▄█▀ █     █     █     █  model   Qwen/Qwen2.5-VL-3B-Instruct
(APIServer pid=1) INFO 04-05 18:41:00 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 04-05 18:41:00 [utils.py:299]
(APIServer pid=1) INFO 04-05 18:41:00 [utils.py:233] non-default args: {'model': 'Qwen/Qwen2.5-VL-3B-Instruct'}
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_TEST_UBATCH_SERVICE_HOST
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_FORK_URL
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_TEST_UBATCH_PORT_8000_TCP_ADDR
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_TEST_UBATCH_PORT
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_TEST_PATH
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_TEST_UBATCH_PORT_8000_TCP_PROTO
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_SERVE_MODEL
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_TEST_UBATCH_PORT_8000_TCP
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_SERVE_ARGS
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_TEST_UBATCH_PORT_8000_TCP_PORT
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_TEST_UBATCH_SERVICE_PORT
(APIServer pid=1) WARNING 04-05 18:41:00 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_BRANCH
(APIServer pid=1) INFO 04-05 18:41:07 [model.py:554] Resolved architecture: Qwen2_5_VLForConditionalGeneration
(APIServer pid=1) INFO 04-05 18:41:07 [model.py:1684] Using max model len 128000
(APIServer pid=1) INFO 04-05 18:41:07 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) INFO 04-05 18:41:07 [vllm.py:799] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 04-05 18:41:07 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore pid=1894) INFO 04-05 18:41:17 [core.py:105] Initializing a V1 LLM engine (v0.19.1rc1.dev41+gc5b071d48) with config: model='Qwen/Qwen2.5-VL-3B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-VL-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-VL-3B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 
'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=1894) INFO 04-05 18:41:19 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.0.2.150:39689 backend=nccl
(EngineCore pid=1894) INFO 04-05 18:41:19 [parallel_state.py:1712] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=1894) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore pid=1894) INFO 04-05 18:41:22 [gpu_model_runner.py:4735] Starting to load model Qwen/Qwen2.5-VL-3B-Instruct...
(EngineCore pid=1894) INFO 04-05 18:41:23 [cuda.py:418] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=1894) INFO 04-05 18:41:23 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=1894) INFO 04-05 18:41:23 [vllm.py:799] Asynchronous scheduling is enabled.
(EngineCore pid=1894) INFO 04-05 18:41:23 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=1894) INFO 04-05 18:41:23 [cuda.py:362] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=1894) INFO 04-05 18:41:23 [flash_attn.py:622] Using FlashAttention version 3
(EngineCore pid=1894) INFO 04-05 18:41:32 [weight_utils.py:615] Time spent downloading weights for Qwen/Qwen2.5-VL-3B-Instruct: 8.600505 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.99it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.13it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.11it/s]
(EngineCore pid=1894)
(EngineCore pid=1894) INFO 04-05 18:41:33 [default_loader.py:384] Loading weights took 0.99 seconds
(EngineCore pid=1894) INFO 04-05 18:41:33 [gpu_model_runner.py:4820] Model loading took 7.16 GiB memory and 10.353895 seconds
(EngineCore pid=1894) INFO 04-05 18:41:34 [gpu_model_runner.py:5760] Encoder cache will be initialized with a budget of 114688 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore pid=1894) WARNING 04-05 18:41:35 [op.py:236] Priority not set for op rms_norm, using native implementation.
(EngineCore pid=1894) INFO 04-05 18:41:45 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/65eaa184d7/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=1894) INFO 04-05 18:41:45 [backends.py:1111] Dynamo bytecode transform time: 4.50 s
(EngineCore pid=1894) INFO 04-05 18:41:47 [backends.py:372] Cache the graph of compile range (1, 8192) for later use
(EngineCore pid=1894) INFO 04-05 18:41:51 [backends.py:390] Compiling a graph for compile range (1, 8192) takes 6.06 s
(EngineCore pid=1894) INFO 04-05 18:41:52 [decorators.py:655] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/5e4e91cd0c1134ac47b1d40baef611a5e6acb22a6a561fce88340b9f3bb46a01/rank_0_0/model
(EngineCore pid=1894) INFO 04-05 18:41:52 [monitor.py:48] torch.compile took 11.85 s in total
(EngineCore pid=1894) INFO 04-05 18:41:52 [monitor.py:76] Initial profiling/warmup run took 0.11 s
(EngineCore pid=1894) INFO 04-05 18:41:57 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=1894) INFO 04-05 18:41:57 [gpu_model_runner.py:5883] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=51 (largest=512)
(EngineCore pid=1894) INFO 04-05 18:41:58 [gpu_model_runner.py:5962] Estimated CUDA graph memory: 0.37 GiB total
(EngineCore pid=1894) INFO 04-05 18:41:59 [gpu_worker.py:436] Available KV cache memory: 103.01 GiB
(EngineCore pid=1894) INFO 04-05 18:41:59 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9027 to maintain the same effective KV cache size.
(EngineCore pid=1894) INFO 04-05 18:41:59 [kv_cache_utils.py:1319] GPU KV cache size: 3,000,464 tokens
(EngineCore pid=1894) INFO 04-05 18:41:59 [kv_cache_utils.py:1324] Maximum concurrency for 128,000 tokens per request: 23.44x
(EngineCore pid=1894) 2026-04-05 18:41:59,170 - INFO - autotuner.py:446 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore pid=1894) 2026-04-05 18:41:59,178 - INFO - autotuner.py:455 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:01<00:00, 28.09it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 51/51 [00:00<00:00, 54.24it/s]
(EngineCore pid=1894) INFO 04-05 18:42:02 [gpu_model_runner.py:6053] Graph capturing finished in 3 secs, took 0.54 GiB
(EngineCore pid=1894) INFO 04-05 18:42:02 [gpu_worker.py:597] CUDA graph pool memory: 0.54 GiB (actual), 0.37 GiB (estimated), difference: 0.17 GiB (31.4%).
(EngineCore pid=1894) INFO 04-05 18:42:02 [core.py:283] init engine (profile, create kv cache, warmup model) took 28.71 seconds
(EngineCore pid=1894) INFO 04-05 18:42:03 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1) INFO 04-05 18:42:03 [api_server.py:604] Supported tasks: ['generate']
(APIServer pid=1) WARNING 04-05 18:42:03 [model.py:1441] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'repetition_penalty': 1.05, 'temperature': 1e-06}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 04-05 18:42:04 [hf.py:314] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 04-05 18:42:07 [base.py:245] Multi-modal warmup completed in 3.063s
(APIServer pid=1) INFO 04-05 18:42:07 [api_server.py:608] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:37] Available routes are:
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO 04-05 18:42:07 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.

@robertgshaw2-redhat
Collaborator

This feature will only be triggered with MoE models; Qwen/Qwen2.5-VL-3B-Instruct and qwen3-0.6B won't trigger it.

Did you run into a concrete issue? I don't think we support anything besides 2 ubatches.

Contributor

mergify Bot commented Apr 6, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Gregory-Pereira.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Apr 6, 2026
@mergify mergify Bot removed the needs-rebase label May 13, 2026
Contributor

mergify Bot commented May 13, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Gregory-Pereira.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 13, 2026
Member

@yewentao256 yewentao256 left a comment

Is this issue still present in main? Please resolve the conflicts and take another look.

Labels

bug, needs-rebase
