This document provides detailed information about all startup parameters and their usage for LightLLM APIServer.
.. option:: --run_mode
Set the running mode, optional values:
* ``normal``: Single server mode (default)
* ``prefill``: Prefill mode (for pd disaggregation running mode)
* ``decode``: Decode mode (for pd disaggregation running mode)
* ``pd_master``: pd master node mode (for pd disaggregation running mode)
* ``config_server``: Configuration server mode (for pd disaggregation mode, used to register pd_master nodes and get pd_master node list), specifically designed for large-scale, high-concurrency scenarios, used when `pd_master` encounters significant CPU bottlenecks.
.. option:: --performance_mode, --p_mode
Performance mode for different scenarios, optional values:
* ``None``: No performance mode applied (default)
* ``personal``: Private personal running mode, automatically sets:
- ``running_max_req_size`` to 3
- ``batch_max_tokens`` to 2048 (2k)
- ``chunked_prefill_size`` to 1024 (1k)
- ``mem_fraction`` to 0.85
.. option:: --host
Server listening address, default is ``127.0.0.1``
.. option:: --port
Server listening port, default is ``8000``
.. option:: --httpserver_workers
HTTP server worker process count, default is ``1``
.. option:: --zmq_mode
ZMQ communication mode, optional values:
* ``tcp://``: TCP mode
* ``ipc:///tmp/``: IPC mode (default)
Can only choose from ``['tcp://', 'ipc:///tmp/']``
.. option:: --pd_master_ip
PD master node IP address, default is ``0.0.0.0``
This parameter needs to be set when run_mode is set to prefill or decode
.. option:: --pd_master_port
PD master node port, default is ``1212``
This parameter needs to be set when run_mode is set to prefill or decode
.. option:: --pd_decode_rpyc_port
Port used by decode nodes for kv move manager rpyc server in PD mode, default is ``42000``
.. option:: --config_server_host
Host address in configuration server mode
.. option:: --config_server_port
Port number in configuration server mode
.. option:: --model_name
Model name, used to distinguish internal model names, default is ``default_model_name``
Can be obtained via ``host:port/get_model_name``
.. option:: --model_dir
Model weight directory path, the application will load configuration, weights, and tokenizer from this directory
.. option:: --tokenizer_mode
Tokenizer loading mode, optional values:
* ``slow``: Slow mode, loads fast but runs slow, suitable for debugging and testing
* ``fast``: Fast mode (default), achieves best performance
* ``auto``: Auto mode, tries to use fast mode, falls back to slow mode if it fails
.. option:: --load_way
Model weight loading method, default is ``HF`` (Huggingface format)
Llama models also support ``DS`` (Deepspeed) format
.. option:: --trust_remote_code
Whether to allow using custom model definition files on Hub
.. option:: --max_total_token_num
Total token number of kv cache.
If not specified, will be automatically calculated based on mem_fraction
.. option:: --mem_fraction
Memory usage ratio, default is ``0.9``
If OOM occurs during runtime, you can specify a smaller value
.. option:: --batch_max_tokens
Maximum token count for new batches, controls prefill batch size to prevent OOM
.. option:: --running_max_req_size
Maximum number of requests for simultaneous forward inference, default is ``1000``
.. option:: --max_req_total_len
Maximum value of request input length + request output length, default is ``16384``
.. option:: --eos_id
End stop token ID, can specify multiple values. If None, will be loaded from config.json
.. option:: --tool_call_parser
OpenAI interface tool call parser type, optional values:
* ``qwen25``
* ``llama3``
* ``mistral``
.. option:: --nnodes
Number of nodes, default is ``1``
.. option:: --node_rank
Current node rank, default is ``0``
.. option:: --multinode_httpmanager_port
Multi-node HTTP manager port, default is ``12345``
.. option:: --multinode_router_gloo_port
Multi-node router gloo port, default is ``20001``
.. option:: --tp
Model tensor parallelism size, default is ``1``
.. option:: --dp
Data parallelism size, default is ``1``
This is a useful parameter for deepseekv2. When using deepseekv2 model, set dp equal to the tp parameter.
In other cases, please do not set it, keep the default value of 1.
.. option:: --nccl_host
nccl_host used to build PyTorch distributed environment, default is ``127.0.0.1``
For multi-node deployment, should be set to the master node's IP
.. option:: --nccl_port
nccl_port used to build PyTorch distributed environment, default is ``28765``
.. option:: --use_config_server_to_init_nccl
Use tcp store server started by config_server to initialize nccl, default is False
When set to True, --nccl_host must equal config_server_host, --nccl_port must be unique for config_server,
do not use the same nccl_port for different inference nodes, this will be a serious error
.. option:: --router_token_ratio
Threshold for determining if the service is busy, default is ``0.0``. Once the kv cache usage exceeds this value, it will directly switch to conservative scheduling.
.. option:: --router_max_wait_tokens
Trigger scheduling of new requests every router_max_wait_tokens decoding steps, default is ``6``
.. option:: --disable_aggressive_schedule
Disable aggressive scheduling
Aggressive scheduling may cause frequent prefill interruptions during decoding. Disabling it can make the router_max_wait_tokens parameter work more effectively.
.. option:: --disable_dynamic_prompt_cache
Disable kv cache caching
.. option:: --chunked_prefill_size
Chunked prefill size, default is ``4096``
.. option:: --disable_chunked_prefill
Whether to disable chunked prefill
.. option:: --diverse_mode
Multi-result output mode
.. option:: --schedule_time_interval
Schedule time interval, default is ``0.03``, unit is seconds
.. option:: --token_healing_mode
.. option:: --output_constraint_mode
Set the output constraint backend, optional values:
* ``outlines``: Use outlines backend
* ``xgrammar``: Use xgrammar backend
* ``none``: No output constraint (default)
.. option:: --first_token_constraint_mode
Constrain the allowed range of the first token
Use environment variable FIRST_ALLOWED_TOKENS to set the range, e.g., FIRST_ALLOWED_TOKENS=1,2
.. option:: --disable_vision
If the model is a multimodal model, set this to not load the vision part model (default is None, auto-detected based on model)
.. option:: --disable_audio
If the model is a multimodal model, set this to not load the audio part model (default is None, auto-detected based on model)
.. option:: --enable_mps
Whether to enable nvidia mps for multimodal services
.. option:: --cache_capacity
Cache server capacity for multimodal resources, default is ``200``
.. option:: --visual_infer_batch_size
Number of images processed in each inference batch, default is ``1``
.. option:: --visual_batch_max_tokens
Per-step ViT admission budget, measured in image output tokens (post
spatial_merge). The multimodal analogue of ``--batch_max_tokens``: the
ViT scheduler stops adding images to the current batch once their
cumulative ``token_num`` would exceed this value. Useful for bounding
peak ViT memory on dynamic-resolution models (Qwen2.5/3/3.5-VL, etc.)
where one 4K image or long video can contain more patches than many
small images combined. One image is always admitted per step to avoid
deadlock when a single request is larger than the budget — to make that
safe, the same value also drives the per-image budget: oversized images
are auto-resized by the Qwen-VL processor ``max_pixels`` clamp, and any
image that still exceeds the budget is rejected with a ``ValueError``
before reaching the ViT.
**Default behavior with** ``--enable_multimodal``: auto-derived from
``--batch_max_tokens`` so multimodal deployments get OOM protection
without explicit opt-in. Pass an explicit positive integer to override.
Pass ``0`` to opt out and restore the pre-budget behavior (only
``--visual_infer_batch_size`` applies).
.. option:: --visual_gpu_ids
List of GPU IDs to use, e.g., 0 1 2
.. option:: --visual_tp
Number of tensor parallel instances for ViT, default is ``1``
.. option:: --visual_dp
Number of data parallel instances for ViT, default is ``1``
.. option:: --visual_nccl_ports
List of NCCL ports for ViT, e.g., 29500 29501 29502, default is [29500]
.. option:: --vit_att_backend
Set the attention backend for ViT. Available options:
* ``auto``: Automatically select the best backend (default), with priority fa3 > xformers > sdpa > triton
* ``fa3``: Use Flash-Attention 3 backend
* ``xformers``: Use xformers backend
* ``sdpa``: Use sdpa backend
* ``triton``: Use Triton backend
.. option:: --disable_custom_allreduce
Whether to disable custom allreduce
.. option:: --enable_custom_allgather
Whether to enable custom allgather
.. option:: --enable_tpsp_mix_mode
The inference backend will use TP SP mixed running mode
Currently only supports llama and deepseek series models
.. option:: --enable_prefill_microbatch_overlap
The inference backend will use microbatch overlap mode for prefill
Currently only supports deepseek series.
.. option:: --enable_decode_microbatch_overlap
The inference backend will use microbatch overlap mode for decoding
.. option:: --llm_prefill_att_backend
Set the attention backend for the prefill phase. Available options:
* ``auto``: Automatically select the best backend (default), with priority fa3 > flashinfer > triton
* ``fa3``: Use Flash-Attention 3 backend
* ``flashinfer``: Use FlashInfer backend
* ``triton``: Use Triton backend
.. option:: --llm_decode_att_backend
Set the attention backend for the decode phase. Available options:
* ``auto``: Automatically select the best backend (default), with priority fa3 > flashinfer > triton
* ``fa3``: Use Flash-Attention 3 backend
* ``flashinfer``: Use FlashInfer backend
* ``triton``: Use Triton backend
.. option:: --llm_kv_type
Set the KV cache data type for inference. Available options:
* ``None``: Use the dtype from model's config.json
* ``int8kv``: INT8 KV quantization
* ``int4kv``: INT4 KV quantization
* ``fp8kv_sph``: FP8 static per-head quantization, uses fa3 backend
* ``fp8kv_spt``: FP8 static per-tensor quantization, uses flashinfer backend
.. option:: --linear_att_hash_page_size
Hash page size for linear attention, default is ``512``.
This controls the number of tokens per hash bucket, which can affect radix cache reuse.
.. option:: --linear_att_page_block_num
Number of blocks used for linear-attention state storage, default is ``10000000``.
This controls the available pages for attention state data, which can affect memory usage and multi-turn chat performance.
In current behavior, block size can be approximated as
``linear_att_page_block_num * linear_att_hash_page_size``.
When ``linear_att_page_block_num * linear_att_hash_page_size > max_req_total_len``,
block-level matching in radix cache is effectively disabled, and request-level small-page matching
(small page size is ``linear_att_hash_page_size``) becomes dominant.
Under high load, limited small-page capacity plus internal LRU eviction can reduce cache hit rate.
When ``--enable_cpu_cache`` is enabled, CPU cache page size is forced to
``linear_att_page_block_num * linear_att_hash_page_size`` to satisfy internal reuse constraints.
.. option:: --linear_att_cache_size
Size of linear-attention cache.
If not specified, it will be automatically derived from cache-related settings.
If small-page cache hits are poor under high load (for example, due to limited small-page count and LRU eviction),
increasing this value can improve cache hit rate, at the cost of more memory usage.
.. option:: --linear_att_ssm_data_type
Data type of linear-attention SSM state, optional values:
* ``bfloat16``
* ``float32`` (default)
.. option:: --disable_cudagraph
Disable cudagraph in the decoding phase
.. option:: --graph_max_batch_size
Maximum batch size that can be captured by cuda graph in the decoding phase, default is ``256``
.. option:: --graph_split_batch_size
Controls the interval for generating CUDA graphs during decoding, default is ``32``
For values from 1 to the specified graph_split_batch_size, CUDA graphs will be generated continuously.
For values from graph_split_batch_size to graph_max_batch_size,
a new CUDA graph will be generated for every increase of graph_grow_step_size.
Properly configuring this parameter can help optimize the performance of CUDA graph execution.
.. option:: --graph_grow_step_size
For batch_size values from graph_split_batch_size to graph_max_batch_size,
a new CUDA graph will be generated for every increase of graph_grow_step_size, default is ``16``
.. option:: --graph_max_len_in_batch
Maximum sequence length that can be captured by cuda graph in the decoding phase, default is ``max_req_total_len``
.. option:: --quant_type
Quantization method, optional values:
* ``vllm-w8a8``
* ``vllm-fp8w8a8``
* ``vllm-fp8w8a8-b128``
* ``deepgemm-fp8w8a8-b128``
* ``triton-fp8w8a8-block128``
* ``triton-fp8w8a8g128``: weight per-channel quant and activation per-group 128 quant
* ``triton-fp8w8a8g64``: weight per-channel quantization with group size 64
* ``awq``
* ``awq_marlin``
* ``none`` (default)
.. option:: --quant_cfg
Path to quantization configuration file. Can be used for mixed quantization.
Examples can be found in test/advanced_config/mixed_quantization/llamacls-mix-down.yaml.
.. option:: --vit_quant_type
ViT quantization method, optional values:
* ``vllm-w8a8``
* ``vllm-fp8w8a8``
* ``none`` (default)
.. option:: --vit_quant_cfg
Path to ViT quantization configuration file. Can be used for mixed quantization.
Examples can be found in lightllm/common/quantization/configs.
.. option:: --sampling_backend
Implementation used for sampling, optional values:
* ``triton``: Use torch and triton kernel (default)
* ``sglang_kernel``: Use sglang_kernel implementation
.. option:: --return_all_prompt_logprobs
Return logprobs for all prompt tokens
.. option:: --use_reward_model
Use reward model
.. option:: --long_truncation_mode
How to handle when input_token_len + max_new_tokens > max_req_total_len, optional values:
* ``None``: Throw exception (default)
* ``head``: Remove some head tokens to make input_token_len + max_new_tokens <= max_req_total_len
* ``center``: Remove some tokens at the center position to make input_token_len + max_new_tokens <= max_req_total_len
.. option:: --use_tgi_api
Use tgi input and output format
.. option:: --mtp_mode
Supported mtp modes, it is recommended to use eagle_with_att for better performance, optional values:
* ``vanilla_with_att``
* ``eagle_with_att``
* ``vanilla_no_att``
* ``eagle_no_att``
* ``None``: Do not enable mtp (default)
.. option:: --mtp_draft_model_dir
Path to the draft model for MTP multi-prediction functionality
Used to load the MTP multi-output token model.
.. option:: --mtp_step
Specify the number of additional tokens predicted by the draft model, default is ``0``
Currently this feature only supports DeepSeekV3/R1 models.
Increasing this value allows more predictions, but ensure the model is compatible with the specified number of steps.
Currently deepseekv3/r1 models only support 1 step
.. option:: --ep_redundancy_expert_config_path
Path to redundant expert configuration. Can be used for deepseekv3 models.
.. option:: --auto_update_redundancy_expert
Whether to update redundant experts for deepseekv3 models through online expert usage counters.
.. option:: --disable_log_stats
Disable throughput statistics logging
.. option:: --log_stats_interval
Interval for recording statistics (seconds), default is ``10``
.. option:: --health_monitor
Check service health status and restart on error
.. option:: --metric_gateway
Address for collecting monitoring metrics
.. option:: --job_name
Job name for monitoring, default is ``lightllm``
.. option:: --grouping_key
Grouping key for monitoring, format is key=value, can specify multiple
.. option:: --push_interval
Interval for pushing monitoring metrics (seconds), default is ``10``
.. option:: --enable_monitor_auth
Whether to enable authentication for push_gateway