SGLang is a fast serving framework for large language models. It features RadixAttention for efficient prefix caching, optimized CUDA kernels, and continuous batching for high-throughput, low-latency inference.
Use SGLang for ultra-low latency, high-throughput production serving with many concurrent requests.
SGLang requires a CUDA-compatible GPU. For CPU-only environments, consider using llama.cpp instead.
MoE model support has been merged into SGLang but is not yet included in a stable release; install from main to use MoE models now. Vision models are not yet supported in SGLang; use Transformers for vision workloads.
Dense LFM models work with any recent SGLang Docker tag. For MoE models (e.g., LFM2-8B-A1B), use the dev tag, as MoE support is not yet in a stable release.
For CUDA 13 environments (B300/GB300), use lmsysorg/sglang:dev-cu13.
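In Docker terms, that tag choice maps to pulls like the following; the latest tag below is only a stand-in for any recent stable release:

```bash
# Stable tag: fine for dense LFM models
docker pull lmsysorg/sglang:latest

# Dev tag: needed for MoE models (e.g., LFM2-8B-A1B) until MoE support lands in a stable release
docker pull lmsysorg/sglang:dev

# CUDA 13 environments (B300/GB300)
docker pull lmsysorg/sglang:dev-cu13
```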
The HF_TOKEN env var is optional but can speed up downloads and reduce retry errors. We recommend using a read-only Hugging Face token for reliability.
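A minimal launch sketch that puts these pieces together, assuming the MoE checkpoint lives at LiquidAI/LFM2-8B-A1B on Hugging Face (substitute the actual repository id). Mounting the host Hugging Face cache avoids re-downloading weights across container restarts, and -e HF_TOKEN passes the token through only if it is set on the host:

```bash
# Optional: a read-only Hugging Face token speeds up downloads and reduces retries
export HF_TOKEN=<your-read-only-token>

docker run --gpus all --shm-size 32g -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN \
  lmsysorg/sglang:dev \
  python3 -m sglang.launch_server \
    --model-path LiquidAI/LFM2-8B-A1B \
    --host 0.0.0.0 --port 30000
```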
Running a 1.2B model on a B300 may sound counterintuitive, but combining --enable-torch-compile with Blackwell's architecture unlocks extremely low latency, ideal for latency-sensitive workloads like RAG, search, and real-time chat.
We recommend --enable-torch-compile for workloads with concurrency under 256. For pure throughput batch processing at very high concurrency, skip this flag.
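A minimal launch sketch for this latency-sensitive setup; the model path LiquidAI/LFM2-1.2B is an assumed example, so substitute the checkpoint you are actually serving. Expect extra startup time while the model compiles:

```bash
# Low-latency serving on a single GPU (expected concurrency under ~256)
python3 -m sglang.launch_server \
  --model-path LiquidAI/LFM2-1.2B \
  --host 0.0.0.0 --port 30000 \
  --enable-torch-compile
```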
Key flags for low latency:
- --enable-torch-compile: Compiles the model with torch.compile for faster execution. Adds startup time but significantly reduces per-token latency.
- --chunked-prefill-size -1: Disables chunked prefill so the full prompt is processed in one pass. This lowers time-to-first-token (TTFT) at the cost of slightly reduced throughput under high concurrency.
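Combining both flags gives a low-latency configuration like the sketch below (again assuming the LiquidAI/LFM2-1.2B path as an example):

```bash
python3 -m sglang.launch_server \
  --model-path LiquidAI/LFM2-1.2B \
  --host 0.0.0.0 --port 30000 \
  --enable-torch-compile \
  --chunked-prefill-size -1
```

Once the server reports it is ready, a streaming request against the OpenAI-compatible endpoint is a quick way to eyeball time-to-first-token from the client side:

```bash
curl -N http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "LiquidAI/LFM2-1.2B",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "stream": true,
        "max_tokens": 32
      }'
```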