SGLang is a fast serving framework for large language models. It features RadixAttention for efficient prefix caching, optimized CUDA kernels, and continuous batching for high-throughput, low-latency inference.
Use SGLang for ultra-low latency, high-throughput production serving with many concurrent requests.
SGLang requires a CUDA-compatible GPU. For CPU-only environments, consider using llama.cpp instead.
MoE model support has been merged into SGLang but is not yet included in a stable release; install from main to use MoE models now. Vision models are not yet supported in SGLang; use Transformers for vision workloads.
Dense LFM models work with any recent SGLang Docker tag. For MoE models (e.g., LFM2-8B-A1B), use the dev tag, as MoE support is not yet in a stable release.
For CUDA 13 environments (B300/GB300), use lmsysorg/sglang:dev-cu13.
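In Docker terms, that tag choice maps to pulls like the following; the latest tag below is only a stand-in for any recent stable release:

```bash
# Stable tag: fine for dense LFM models
docker pull lmsysorg/sglang:latest

# Dev tag: needed for MoE models (e.g., LFM2-8B-A1B) until MoE support lands in a stable release
docker pull lmsysorg/sglang:dev

# CUDA 13 environments (B300/GB300)
docker pull lmsysorg/sglang:dev-cu13
```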
The HF_TOKEN env var is optional but can speed up downloads and reduce retry errors. We recommend using a read-only Hugging Face token for reliability.
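A minimal launch sketch that puts these pieces together, assuming the MoE checkpoint lives at LiquidAI/LFM2-8B-A1B on Hugging Face (substitute the actual repository id). Mounting the host Hugging Face cache avoids re-downloading weights across container restarts, and -e HF_TOKEN passes the token through only if it is set on the host:

```bash
# Optional: a read-only Hugging Face token speeds up downloads and reduces retries
export HF_TOKEN=<your-read-only-token>

docker run --gpus all --shm-size 32g -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN \
  lmsysorg/sglang:dev \
  python3 -m sglang.launch_server \
    --model-path LiquidAI/LFM2-8B-A1B \
    --host 0.0.0.0 --port 30000
```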
Running a 1.2B model on a B300 may sound counterintuitive, but combining --enable-torch-compile with Blackwell's architecture unlocks extremely low latency, ideal for latency-sensitive workloads like RAG, search, and real-time chat.
We recommend --enable-torch-compile for workloads with concurrency under 256. For pure throughput batch processing at very high concurrency, skip this flag.
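A minimal launch sketch for this latency-sensitive setup; the model path LiquidAI/LFM2-1.2B is an assumed example, so substitute the checkpoint you are actually serving. Expect extra startup time while the model compiles:

```bash
# Low-latency serving on a single GPU (expected concurrency under ~256)
python3 -m sglang.launch_server \
  --model-path LiquidAI/LFM2-1.2B \
  --host 0.0.0.0 --port 30000 \
  --enable-torch-compile
```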
Key flags for low latency:
- --enable-torch-compile: Compiles the model with torch.compile for faster execution. Adds startup time but significantly reduces per-token latency.
- --chunked-prefill-size -1: Disables chunked prefill so the full prompt is processed in one pass. This lowers time-to-first-token (TTFT) at the cost of slightly reduced throughput under high concurrency.
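Combining both flags gives a low-latency configuration like the sketch below (again assuming the LiquidAI/LFM2-1.2B path as an example):

```bash
python3 -m sglang.launch_server \
  --model-path LiquidAI/LFM2-1.2B \
  --host 0.0.0.0 --port 30000 \
  --enable-torch-compile \
  --chunked-prefill-size -1
```

Once the server reports it is ready, a streaming request against the OpenAI-compatible endpoint is a quick way to eyeball time-to-first-token from the client side:

```bash
curl -N http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "LiquidAI/LFM2-1.2B",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "stream": true,
        "max_tokens": 32
      }'
```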