Use SGLang for ultra-low latency, high-throughput production serving with many concurrent requests.
SGLang requires a CUDA-compatible GPU. For CPU-only environments, consider using llama.cpp instead.

Supported Models

| Model Type | Status | Examples |
| --- | --- | --- |
| Dense text models | Supported | LFM2-350M, LFM2.5-1.2B-Instruct, LFM2-2.6B |
| MoE text models | Coming in 0.5.9 | LFM2-8B-A1B |
| Vision models | Not yet supported | LFM2-VL |
MoE model support has been merged into SGLang but is not yet included in a stable release; install from main to use MoE models now. Vision models are not yet supported in SGLang; use Transformers for vision workloads.

Installation

Install SGLang following the official installation guide. The recommended method (requires sglang>=0.5.8) is:
pip install --upgrade pip
pip install uv
uv pip install "sglang>=0.5.8"
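To confirm the installed version satisfies the requirement, a quick check (assuming a standard install; the package exposes its version string):
python3 -c "import sglang; print(sglang.__version__)"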

Install from Main (MoE Support)

To use MoE models (e.g., LFM2-8B-A1B) before the 0.5.9 release, install SGLang from the main branch:
uv pip install "sglang @ git+https://github.com/sgl-project/sglang.git@main#subdirectory=python"

Launching the Server

By default, the model runs in bfloat16. To use float16 instead, add --dtype float16 and set export SGLANG_MAMBA_CONV_DTYPE=float16 before launching; a float16 example follows the default command below.
python3 -m sglang.launch_server \
    --model-path LiquidAI/LFM2.5-1.2B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    --tool-call-parser lfm2
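For example, launching the same model in float16 only changes the dtype settings (same model and port as above):
export SGLANG_MAMBA_CONV_DTYPE=float16
python3 -m sglang.launch_server \
    --model-path LiquidAI/LFM2.5-1.2B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    --tool-call-parser lfm2 \
    --dtype float16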

Usage

SGLang exposes an OpenAI-compatible API, so you can point the official OpenAI Python client at the local server. The example below issues a chat completion with tool calling enabled:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="None"  # placeholder; the local server does not require a real key by default
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[
        {"role": "user", "content": "What's the weather in San Francisco?"}
    ],
    tools=tools,
    tool_choice="auto",
    temperature=0
)

print(response.choices[0].message)
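The message returned above contains the parsed tool call rather than a final answer. A minimal sketch of the second round trip, reusing the client and tools defined above with a hypothetical local get_weather result, might look like this:
import json

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)
    # Hypothetical stand-in for a real weather lookup
    tool_result = {"location": args["location"], "temperature_c": 18, "condition": "sunny"}
    followup = client.chat.completions.create(
        model="LiquidAI/LFM2.5-1.2B-Instruct",
        messages=[
            {"role": "user", "content": "What's the weather in San Francisco?"},
            message,  # assistant turn that requested the tool call
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(tool_result)},
        ],
        tools=tools,
        temperature=0,
    )
    print(followup.choices[0].message.content)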
For more details on tool use with LFM models, see Tool Use.
You can also query the server directly with curl:
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "LiquidAI/LFM2.5-1.2B-Instruct",
    "messages": [
      {"role": "user", "content": "What is AI?"}
    ],
    "temperature": 0
  }'
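The endpoint also supports streamed responses through the standard stream parameter of the chat completions API. A minimal sketch, assuming the same server address and model as above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="None")

# Stream tokens as they are generated instead of waiting for the full reply
stream = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[{"role": "user", "content": "What is AI?"}],
    temperature=0,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()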

Low Latency on Blackwell (B300)

Running a 1.2B model on a B300 may sound counterintuitive, but combining --enable-torch-compile with Blackwell's architecture unlocks extremely low latency, which is ideal for latency-sensitive workloads like RAG, search, and real-time chat.
We recommend --enable-torch-compile for workloads with concurrency under 256. For pure throughput batch processing at very high concurrency, skip this flag.
Key flags for low latency:
  • --enable-torch-compile: Compiles the model with Torch for faster execution. Adds startup time but significantly reduces per-token latency.
  • --chunked-prefill-size -1: Disables chunked prefill, processing the full prompt in one pass. This lowers TTFT at the cost of slightly reduced throughput under high concurrency.
python3 -m sglang.launch_server \
    --model-path LiquidAI/LFM2.5-1.2B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    --tool-call-parser lfm2 \
    --enable-torch-compile \
    --chunked-prefill-size -1
On B300/CUDA 13, use the dedicated Docker image:
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:dev-cu13 \
    python3 -m sglang.launch_server \
        --model-path LiquidAI/LFM2.5-1.2B-Instruct \
        --host 0.0.0.0 \
        --port 30000 \
        --tool-call-parser lfm2 \
        --enable-torch-compile \
        --chunked-prefill-size -1
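Once the server is up (with either launch method), you can verify it is reachable and serving the expected model before benchmarking; the OpenAI-compatible route below lists the loaded model:
curl http://localhost:30000/v1/models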
Example benchmark on a B300 GPU with CUDA 13 (256 prompts, 1024 input tokens, 128 output tokens, max concurrency 1):
| Metric | Value |
| --- | --- |
| Mean TTFT (ms) | 8.79 |
| Mean TPOT (ms) | 0.86 |
| Output token throughput (tok/s) | 1100.92 |

The benchmark was run with:
python3 -m sglang.bench_serving \
    --backend sglang-oai-chat \
    --num-prompts 256 \
    --max-concurrency 1 \
    --random-input-len 1024 \
    --random-output-len 128 \
    --warmup-requests 128