
Commit b4e3069

feat(backends): add sglang (#9359)
* feat(backends): add sglang

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(sglang): force AVX-512 CXXFLAGS and disable CI e2e job

sgl-kernel's shm.cpp uses __m512 AVX-512 intrinsics unconditionally;
-march=native fails on CI runners without AVX-512 in /proc/cpuinfo.
Force -march=sapphirerapids so the build always succeeds, matching
sglang upstream's docker/xeon.Dockerfile recipe.

The resulting binary still requires an AVX-512-capable CPU at runtime,
so disable tests-sglang-grpc in test-extra.yml for the same reason
tests-vllm-grpc is disabled. Local runs with
make test-extra-backend-sglang still work on hosts with the right SIMD
baseline.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(sglang): patch CMakeLists.txt instead of CXXFLAGS for AVX-512

CXXFLAGS with -march=sapphirerapids was being overridden by
add_compile_options(-march=native) in sglang's CPU CMakeLists.txt,
since CMake appends those flags after CXXFLAGS. Sed-patch the
CMakeLists.txt directly after cloning to replace -march=native.

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
1 parent 61d34cc commit b4e3069
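The CMakeLists.txt fix in the last commit message can be sketched as a one-line sed after cloning. This is an illustrative sketch: the file below is a stand-in created locally, and the real path and contents inside the sglang checkout may differ.

```shell
# Simulate the relevant line from sgl-kernel's CPU CMakeLists.txt
# (stand-in content; the real file is much larger).
printf 'add_compile_options(-march=native)\n' > CMakeLists.txt

# Replace -march=native with -march=sapphirerapids so CMake no longer
# appends a flag that overrides the CXXFLAGS baseline set in install.sh.
sed -i 's/-march=native/-march=sapphirerapids/g' CMakeLists.txt

cat CMakeLists.txt
```

The key point is ordering: because CMake emits `add_compile_options` flags after CXXFLAGS, patching the source file is the only reliable place to change the `-march` baseline.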

21 files changed: +926 −2 lines

.github/workflows/backend.yml

Lines changed: 52 additions & 0 deletions
```diff
@@ -66,6 +66,19 @@ jobs:
           dockerfile: "./backend/Dockerfile.python"
           context: "./"
           ubuntu-version: '2404'
+        - build-type: ''
+          cuda-major-version: ""
+          cuda-minor-version: ""
+          platforms: 'linux/amd64'
+          tag-latest: 'auto'
+          tag-suffix: '-cpu-sglang'
+          runs-on: 'ubuntu-latest'
+          base-image: "ubuntu:24.04"
+          skip-drivers: 'true'
+          backend: "sglang"
+          dockerfile: "./backend/Dockerfile.python"
+          context: "./"
+          ubuntu-version: '2404'
         - build-type: ''
           cuda-major-version: ""
           cuda-minor-version: ""
@@ -411,6 +424,19 @@ jobs:
           dockerfile: "./backend/Dockerfile.python"
           context: "./"
           ubuntu-version: '2404'
+        - build-type: 'cublas'
+          cuda-major-version: "12"
+          cuda-minor-version: "8"
+          platforms: 'linux/amd64'
+          tag-latest: 'auto'
+          tag-suffix: '-gpu-nvidia-cuda-12-sglang'
+          runs-on: 'arc-runner-set'
+          base-image: "ubuntu:24.04"
+          skip-drivers: 'false'
+          backend: "sglang"
+          dockerfile: "./backend/Dockerfile.python"
+          context: "./"
+          ubuntu-version: '2404'
         - build-type: 'cublas'
           cuda-major-version: "12"
           cuda-minor-version: "8"
@@ -1427,6 +1453,19 @@ jobs:
           dockerfile: "./backend/Dockerfile.python"
           context: "./"
           ubuntu-version: '2404'
+        - build-type: 'hipblas'
+          cuda-major-version: ""
+          cuda-minor-version: ""
+          platforms: 'linux/amd64'
+          tag-latest: 'auto'
+          tag-suffix: '-gpu-rocm-hipblas-sglang'
+          runs-on: 'arc-runner-set'
+          base-image: "rocm/dev-ubuntu-24.04:7.2.1"
+          skip-drivers: 'false'
+          backend: "sglang"
+          dockerfile: "./backend/Dockerfile.python"
+          context: "./"
+          ubuntu-version: '2404'
         - build-type: 'hipblas'
           cuda-major-version: ""
           cuda-minor-version: ""
@@ -1689,6 +1728,19 @@ jobs:
           dockerfile: "./backend/Dockerfile.python"
           context: "./"
           ubuntu-version: '2404'
+        - build-type: 'intel'
+          cuda-major-version: ""
+          cuda-minor-version: ""
+          platforms: 'linux/amd64'
+          tag-latest: 'auto'
+          tag-suffix: '-gpu-intel-sglang'
+          runs-on: 'arc-runner-set'
+          base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
+          skip-drivers: 'false'
+          backend: "sglang"
+          dockerfile: "./backend/Dockerfile.python"
+          context: "./"
+          ubuntu-version: '2404'
         - build-type: 'intel'
           cuda-major-version: ""
           cuda-minor-version: ""
```

.github/workflows/test-extra.yml

Lines changed: 43 additions & 0 deletions
```diff
@@ -33,6 +33,7 @@ jobs:
       ik-llama-cpp: ${{ steps.detect.outputs.ik-llama-cpp }}
       turboquant: ${{ steps.detect.outputs.turboquant }}
       vllm: ${{ steps.detect.outputs.vllm }}
+      sglang: ${{ steps.detect.outputs.sglang }}
       acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
       qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }}
       voxtral: ${{ steps.detect.outputs.voxtral }}
@@ -589,6 +590,48 @@ jobs:
   #   - name: Build vllm (cpu) backend image and run gRPC e2e tests
   #     run: |
   #       make test-extra-backend-vllm
+  # tests-sglang-grpc is currently disabled in CI for the same reason as
+  # tests-vllm-grpc: sglang's CPU kernel (sgl-kernel) uses __m512 AVX-512
+  # intrinsics unconditionally in shm.cpp, so the from-source build
+  # requires `-march=sapphirerapids` (already set in install.sh) and the
+  # resulting binary SIGILLs at import on CPUs without AVX-512 VNNI/BF16.
+  # The ubuntu-latest runner pool does not guarantee that ISA baseline.
+  #
+  # The test itself (tests/e2e-backends + make test-extra-backend-sglang)
+  # is fully working and validated locally on a host with the right
+  # SIMD baseline. Run it manually with:
+  #
+  #   make test-extra-backend-sglang
+  #
+  # Re-enable this job once we have a self-hosted runner label with
+  # guaranteed AVX-512 VNNI/BF16 support.
+  #
+  # tests-sglang-grpc:
+  #   needs: detect-changes
+  #   if: needs.detect-changes.outputs.sglang == 'true' || needs.detect-changes.outputs.run-all == 'true'
+  #   runs-on: bigger-runner
+  #   timeout-minutes: 90
+  #   steps:
+  #     - name: Clone
+  #       uses: actions/checkout@v6
+  #       with:
+  #         submodules: true
+  #     - name: Dependencies
+  #       run: |
+  #         sudo apt-get update
+  #         sudo apt-get install -y --no-install-recommends \
+  #           make build-essential curl unzip ca-certificates git tar
+  #     - name: Setup Go
+  #       uses: actions/setup-go@v5
+  #       with:
+  #         go-version: '1.25.4'
+  #     - name: Free disk space
+  #       run: |
+  #         sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
+  #         df -h
+  #     - name: Build sglang (cpu) backend image and run gRPC e2e tests
+  #       run: |
+  #         make test-extra-backend-sglang
   tests-acestep-cpp:
     needs: detect-changes
     if: needs.detect-changes.outputs.acestep-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
```
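The runtime gate that keeps tests-sglang-grpc disabled can be checked on any Linux host by probing the flags line of /proc/cpuinfo. A minimal sketch (flag names follow the kernel's standard cpuinfo spellings; the exact feature set sgl-kernel needs is inferred from the commit message):

```python
# Check whether a cpuinfo "flags" line advertises the AVX-512 features
# that a -march=sapphirerapids build of sgl-kernel needs at import time.
def has_avx512_baseline(flags_line: str) -> bool:
    flags = set(flags_line.split())
    return {"avx512f", "avx512_vnni", "avx512_bf16"} <= flags

# Abridged example flag lines: Sapphire Rapids-class vs. an older CPU.
print(has_avx512_baseline("avx2 avx512f avx512_vnni avx512_bf16"))  # True
print(has_avx512_baseline("avx2 sse4_2"))                           # False
```

On a real host you would feed it the `flags` line from `/proc/cpuinfo`; a `False` result predicts the SIGILL-at-import failure described above.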

Makefile

Lines changed: 16 additions & 2 deletions
```diff
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad
 
 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -419,6 +419,7 @@ prepare-test-extra: protogen-python
 	$(MAKE) -C backend/python/chatterbox
 	$(MAKE) -C backend/python/vllm
 	$(MAKE) -C backend/python/vllm-omni
+	$(MAKE) -C backend/python/sglang
 	$(MAKE) -C backend/python/vibevoice
 	$(MAKE) -C backend/python/moonshine
 	$(MAKE) -C backend/python/pocket-tts
@@ -602,6 +603,17 @@ test-extra-backend-tinygrad-all: \
 	test-extra-backend-tinygrad-sd \
 	test-extra-backend-tinygrad-whisper
 
+## sglang mirrors the vllm setup: HuggingFace model id, same tiny Qwen,
+## tool-call extraction via sglang's native qwen parser. CPU builds use
+## sglang's upstream pyproject_cpu.toml recipe (see backend/python/sglang/install.sh).
+test-extra-backend-sglang: docker-build-sglang
+	BACKEND_IMAGE=local-ai-backend:sglang \
+	BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct \
+	BACKEND_TEST_CAPS=health,load,predict,stream,tools \
+	BACKEND_TEST_OPTIONS=tool_parser:qwen \
+	$(MAKE) test-extra-backend
+
+
 ## mlx is Apple-Silicon-first — the MLX backend auto-detects the right tool
 ## parser from the chat template, so no tool_parser: option is needed (it
 ## would be ignored at runtime). Run this on macOS / arm64 with Metal; the
@@ -741,6 +753,7 @@ BACKEND_NEUTTS = neutts|python|.|false|true
 BACKEND_KOKORO = kokoro|python|.|false|true
 BACKEND_VLLM = vllm|python|.|false|true
 BACKEND_VLLM_OMNI = vllm-omni|python|.|false|true
+BACKEND_SGLANG = sglang|python|.|false|true
 BACKEND_DIFFUSERS = diffusers|python|.|--progress=plain|true
 BACKEND_CHATTERBOX = chatterbox|python|.|false|true
 BACKEND_VIBEVOICE = vibevoice|python|.|--progress=plain|true
@@ -811,6 +824,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_NEUTTS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_KOKORO)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VLLM)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VLLM_OMNI)))
+$(eval $(call generate-docker-build-target,$(BACKEND_SGLANG)))
 $(eval $(call generate-docker-build-target,$(BACKEND_DIFFUSERS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_CHATTERBOX)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE)))
@@ -839,7 +853,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar
 
-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp
 
 ########################################################
 ### Mock Backend for E2E Tests
```
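The BACKEND_SGLANG line added in this Makefile follows the same pipe-delimited shape as the other BACKEND_* specs consumed by generate-docker-build-target. A small sketch of how such a spec splits into fields; the field names here are assumptions inferred from context, not the Makefile's own terminology:

```python
# Decompose a pipe-delimited backend spec like BACKEND_SGLANG.
# Field names are guesses from context: the other specs vary only in
# the first field (backend name) and the fourth (extra build flags).
spec = "sglang|python|.|false|true"
name, language, build_context, build_flags, enabled = spec.split("|")
print(name, language, build_context)  # sglang python .
```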

backend/index.yaml

Lines changed: 70 additions & 0 deletions
```diff
@@ -227,6 +227,28 @@
     intel: "intel-vllm"
     nvidia-cuda-12: "cuda12-vllm"
     cpu: "cpu-vllm"
+- &sglang
+  name: "sglang"
+  license: apache-2.0
+  urls:
+  - https://github.com/sgl-project/sglang
+  tags:
+  - text-to-text
+  - multimodal
+  icon: https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png
+  description: |
+    SGLang is a fast serving framework for large language models and vision language models.
+    It co-designs the backend runtime (RadixAttention, continuous batching, structured
+    decoding) and the frontend language to make interaction with models faster and more
+    controllable. Features include fast backend runtime, flexible frontend language,
+    extensive model support, and an active community.
+  alias: "sglang"
+  capabilities:
+    nvidia: "cuda12-sglang"
+    amd: "rocm-sglang"
+    intel: "intel-sglang"
+    nvidia-cuda-12: "cuda12-sglang"
+    cpu: "cpu-sglang"
 - &vllm-omni
   name: "vllm-omni"
   license: apache-2.0
@@ -1766,6 +1788,54 @@
   uri: "quay.io/go-skynet/local-ai-backends:master-cpu-vllm"
   mirrors:
   - localai/localai-backends:master-cpu-vllm
+# sglang
+- !!merge <<: *sglang
+  name: "sglang-development"
+  capabilities:
+    nvidia: "cuda12-sglang-development"
+    amd: "rocm-sglang-development"
+    intel: "intel-sglang-development"
+    cpu: "cpu-sglang-development"
+- !!merge <<: *sglang
+  name: "cuda12-sglang"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-sglang"
+  mirrors:
+  - localai/localai-backends:latest-gpu-nvidia-cuda-12-sglang
+- !!merge <<: *sglang
+  name: "rocm-sglang"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-sglang"
+  mirrors:
+  - localai/localai-backends:latest-gpu-rocm-hipblas-sglang
+- !!merge <<: *sglang
+  name: "intel-sglang"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sglang"
+  mirrors:
+  - localai/localai-backends:latest-gpu-intel-sglang
+- !!merge <<: *sglang
+  name: "cpu-sglang"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-sglang"
+  mirrors:
+  - localai/localai-backends:latest-cpu-sglang
+- !!merge <<: *sglang
+  name: "cuda12-sglang-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sglang"
+  mirrors:
+  - localai/localai-backends:master-gpu-nvidia-cuda-12-sglang
+- !!merge <<: *sglang
+  name: "rocm-sglang-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-sglang"
+  mirrors:
+  - localai/localai-backends:master-gpu-rocm-hipblas-sglang
+- !!merge <<: *sglang
+  name: "intel-sglang-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sglang"
+  mirrors:
+  - localai/localai-backends:master-gpu-intel-sglang
+- !!merge <<: *sglang
+  name: "cpu-sglang-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-sglang"
+  mirrors:
+  - localai/localai-backends:master-cpu-sglang
 # vllm-omni
 - !!merge <<: *vllm-omni
   name: "vllm-omni-development"
```
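The `!!merge <<: *sglang` entries in this file reuse the mapping anchored at `&sglang` and override only a few keys. In dict terms the merge behaves roughly like this; a sketch of YAML merge-key semantics, not the actual loader, with the base mapping abridged:

```python
# Base mapping from the &sglang anchor (abridged to a few keys).
sglang_base = {
    "name": "sglang",
    "license": "apache-2.0",
    "tags": ["text-to-text", "multimodal"],
}

# `!!merge <<: *sglang` starts from the anchor's keys and overrides
# whichever keys the new entry lists explicitly.
cpu_dev = {
    **sglang_base,
    "name": "cpu-sglang-development",
    "uri": "quay.io/go-skynet/local-ai-backends:master-cpu-sglang",
}

print(cpu_dev["name"], cpu_dev["license"])
```

This is why each variant entry only needs a name, an image uri, and mirrors: license, tags, description, and icon all flow in from the anchor.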

backend/python/sglang/Makefile

Lines changed: 17 additions & 0 deletions
```diff
@@ -0,0 +1,17 @@
+.PHONY: sglang
+sglang:
+	bash install.sh
+
+.PHONY: run
+run: sglang
+	@echo "Running sglang..."
+	bash run.sh
+	@echo "sglang run."
+
+.PHONY: protogen-clean
+protogen-clean:
+	$(RM) backend_pb2_grpc.py backend_pb2.py
+
+.PHONY: clean
+clean: protogen-clean
+	rm -rf venv __pycache__
```
