# SGLang

> SGLang is a fast serving framework for large language models. It features RadixAttention for efficient prefix caching, optimized CUDA kernels, and continuous batching for high-throughput, low-latency inference.

<Tip>
  Use SGLang for ultra-low latency, high-throughput production serving with many concurrent requests.
</Tip>

SGLang requires a CUDA-compatible GPU. For CPU-only environments, consider using [llama.cpp](/deployment/on-device/llama-cpp) instead.

## Supported Models

| Model Type        | Status    | Examples                                   |
| ----------------- | --------- | ------------------------------------------ |
| Dense text models | Supported | LFM2-350M, LFM2.5-1.2B-Instruct, LFM2-2.6B |
| MoE text models   | Supported | LFM2-8B-A1B, LFM2-24B-A2B                  |
| Vision models     | Supported | LFM2-VL-450M, LFM2-VL-3B, LFM2.5-VL-1.6B   |

<Note>
  All LFM model types are supported as of SGLang v0.5.10.
</Note>

## Installation

Install SGLang by following the [official installation guide](https://docs.sglang.io/get_started/install_sglang.html). LFM models require `sglang>=0.5.10`. The recommended method is:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
pip install --upgrade pip
pip install uv
uv pip install "sglang>=0.5.10"
```
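
After installing, it can help to confirm that the environment actually satisfies the `sglang>=0.5.10` requirement. The following is a minimal sketch using only the standard library; the helper names `parse_version` and `meets_minimum` are illustrative, and pre-release or post-release suffixes are simply ignored.

```python
import re
from importlib import metadata

MINIMUM = (0, 5, 10)  # minimum SGLang version stated on this page


def parse_version(version: str) -> tuple:
    """Return the leading numeric components of a version string as ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version: str, minimum: tuple = MINIMUM) -> bool:
    """True if `version` is at least `minimum` (suffixes such as `.post1` are ignored)."""
    return parse_version(version) >= minimum


try:
    installed = metadata.version("sglang")
    print(f"sglang {installed}: meets minimum = {meets_minimum(installed)}")
except metadata.PackageNotFoundError:
    print("sglang is not installed in this environment")
```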

## Launching the Server

<Note>
  By default, the model runs in bfloat16. To use float16 instead, pass `--dtype float16` and run `export SGLANG_MAMBA_CONV_DTYPE=float16` before launching.
</Note>

<Tabs>
  <Tab title="Python">
    ```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
    sglang serve \
        --model-path LiquidAI/LFM2.5-1.2B-Instruct \
        --host 0.0.0.0 \
        --port 30000 \
        --tool-call-parser lfm2
    ```
  </Tab>

  <Tab title="Docker">
    All LFM model types (dense, MoE, vision) are supported in the `v0.5.10` tag and later.

    * For CUDA 13 environments (B300/GB300), use `lmsysorg/sglang:v0.5.10-cu13`.
    * The `HF_TOKEN` environment variable is optional, but setting it can speed up downloads and reduce retry errors. We recommend a read-only Hugging Face token.

    ```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
    docker run --gpus all \
        --shm-size 32g \
        -p 30000:30000 \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=<secret>" \
        --ipc=host \
        lmsysorg/sglang:v0.5.10 \
        sglang serve \
            --model-path LiquidAI/LFM2.5-1.2B-Instruct \
            --host 0.0.0.0 \
            --port 30000 \
            --tool-call-parser lfm2
    ```
  </Tab>
</Tabs>
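
The server needs time to download and load the model before it accepts requests. Scripts that launch it can poll until it responds. The sketch below assumes the server answers `GET /health` with HTTP 200 once it is ready; the helper names, attempt count, and delay are illustrative choices, not SGLang defaults.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
        return False


def wait_until_ready(url, probe=http_ok, attempts=60, delay=2.0, sleep=time.sleep):
    """Poll `probe(url)` until it succeeds; return the number of probes used."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay)
    raise TimeoutError(f"{url} not ready after {attempts} attempts")
```

With the launch commands above, a call such as `wait_until_ready("http://localhost:30000/health")` would block until the server is up or raise `TimeoutError` after about two minutes.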

## Usage

SGLang exposes an OpenAI-compatible API.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="None"
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"]
        }
    }
}]

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[
        {"role": "user", "content": "What's the weather in San Francisco?"}
    ],
    tools=tools,
    tool_choice="auto",
    temperature=0
)

print(response.choices[0].message)
```
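
When the model decides to call a tool, the returned message carries a `tool_calls` list instead of a final answer, and the application has to run the tool itself. The sketch below shows one way to dispatch those calls. It works on plain dictionaries, assuming the message has been converted with `response.choices[0].message.model_dump()`; `get_weather`, `TOOL_REGISTRY`, and the sample data are hypothetical stand-ins.

```python
import json


def get_weather(location: str) -> str:
    """Hypothetical stand-in for a real weather lookup."""
    return json.dumps({"location": location, "forecast": "sunny", "temp_c": 18})


# Maps the tool names declared in `tools` to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls: list, registry: dict = TOOL_REGISTRY) -> list:
    """Execute each tool call and return OpenAI-style `tool` messages."""
    results = []
    for call in tool_calls:
        name = call["function"]["name"]
        # Arguments arrive as a JSON-encoded string, not a dict.
        args = json.loads(call["function"]["arguments"] or "{}")
        if name in registry:
            content = registry[name](**args)
        else:
            content = json.dumps({"error": f"unknown tool: {name}"})
        results.append({"role": "tool", "tool_call_id": call["id"], "content": content})
    return results


# Sample data in the shape of one `tool_calls` entry (not real model output).
sample_calls = [{
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"location": "San Francisco"}'},
}]
print(run_tool_calls(sample_calls))
```

To continue the conversation, append the assistant message and the returned tool messages to `messages`, then call `client.chat.completions.create` again so the model can produce its final answer.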

For more details on tool use with LFM models, see [Tool Use](/lfm/key-concepts/tool-use).

### Vision Models

Launch a vision-language model:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
sglang serve \
    --model-path LiquidAI/LFM2.5-VL-1.6B \
    --host 0.0.0.0 \
    --port 30000
```

Query with an image:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-VL-1.6B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/val2017/000000039769.jpg"}},
            {"type": "text", "text": "Describe what you see in this image."},
        ],
    }],
    temperature=0.0,
    max_tokens=256,
)

print(response.choices[0].message.content)
```
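
The request above references a remote image by URL. For local files, a common approach with OpenAI-compatible APIs is to inline the image as a base64 `data:` URL. The sketch below assumes the server accepts such URLs in the `image_url` field; `image_to_data_url` and `image_message` are illustrative helper names.

```python
import base64
import mimetypes


def image_to_data_url(data: bytes, filename: str = "image.jpg") -> str:
    """Encode raw image bytes as a base64 `data:` URL."""
    # The MIME type is guessed from the file extension only.
    mime, _ = mimetypes.guess_type(filename)
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"


def image_message(data: bytes, question: str, filename: str = "image.jpg") -> dict:
    """Build a user message with one inline image followed by a text question."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_url(data, filename)}},
            {"type": "text", "text": question},
        ],
    }
```

The result can be passed as `messages=[image_message(open("photo.jpg", "rb").read(), "Describe what you see in this image.", "photo.jpg")]`, where `photo.jpg` is a placeholder path.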

<Accordion title="Curl request example">
  ```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
  curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "LiquidAI/LFM2.5-1.2B-Instruct",
      "messages": [
        {"role": "user", "content": "What is AI?"}
      ],
      "temperature": 0
    }'
  ```
</Accordion>

## Offline Inference

SGLang's `Engine` class provides a simple interface for offline inference without launching a server. This is useful for scripts, notebooks, and batch processing.

<Note>
  In Jupyter notebooks, you must apply `nest_asyncio` before creating the engine, because SGLang uses an async event loop internally.
</Note>

### Text Generation

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
# Required in Jupyter notebooks:
# import nest_asyncio
# nest_asyncio.apply()

import sglang as sgl

llm = sgl.Engine(model_path="LiquidAI/LFM2-8B-A1B")

tokenizer = llm.tokenizer_manager.tokenizer
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

output = llm.generate(prompt=prompt, sampling_params={"max_new_tokens": 128, "temperature": 0})
print(output["text"])

llm.shutdown()
```
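
For batch processing, several prompts can be submitted in a single call. The sketch below assumes that `Engine.generate` accepts a list for `prompt` and returns one output dictionary per prompt, in the same order; `build_prompts` and `generate_batch` are illustrative helpers that take the engine as an argument.

```python
def build_prompts(tokenizer, questions):
    """Render one chat-templated prompt per user question."""
    return [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": question}],
            tokenize=False,
            add_generation_prompt=True,
        )
        for question in questions
    ]


def generate_batch(engine, questions, sampling_params):
    """Send all prompts in one `generate` call and return the texts in order."""
    prompts = build_prompts(engine.tokenizer_manager.tokenizer, questions)
    # Assumption: a list prompt yields a list of {"text": ...} dicts.
    outputs = engine.generate(prompt=prompts, sampling_params=sampling_params)
    return [output["text"] for output in outputs]
```

With the `llm` engine from the example above, `generate_batch(llm, ["What is the capital of France?", "What is the capital of Japan?"], {"max_new_tokens": 128, "temperature": 0})` would return a list of two answers.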

### Vision Models

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
# Required in Jupyter notebooks:
# import nest_asyncio
# nest_asyncio.apply()

import sglang as sgl

vlm = sgl.Engine(model_path="LiquidAI/LFM2.5-VL-1.6B")

processor = vlm.tokenizer_manager.processor
messages = [{"role": "user", "content": [
    {"type": "image", "image": "placeholder"},
    {"type": "text", "text": "Describe what you see in this image."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

output = vlm.generate(
    prompt=prompt,
    image_data="http://images.cocodataset.org/val2017/000000039769.jpg",
    sampling_params={"max_new_tokens": 256, "temperature": 0},
)
print(output["text"])

vlm.shutdown()
```

## Low Latency on Blackwell (B300)

Running a 1.2B model on a B300 may sound counterintuitive, but combining `--enable-torch-compile` with the Blackwell architecture yields very low latency (under 1 ms per output token in the benchmark below). This suits latency-sensitive workloads such as RAG, search, and real-time chat.

<Tip>
  We recommend `--enable-torch-compile` for workloads with concurrency under 256. For throughput-oriented batch processing at very high concurrency, omit this flag.
</Tip>

Key flags for low latency:

* `--enable-torch-compile`: Compiles the model with `torch.compile` for faster execution. This adds startup time but significantly reduces per-token latency.
* `--chunked-prefill-size -1`: Disables chunked prefill, so the full prompt is processed in one pass. This lowers time to first token (TTFT) at the cost of slightly reduced throughput under high concurrency.

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
sglang serve \
    --model-path LiquidAI/LFM2.5-1.2B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    --tool-call-parser lfm2 \
    --enable-torch-compile \
    --chunked-prefill-size -1
```
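
When deployments are scripted, the flag choice can follow the concurrency guidance above. The sketch below builds the argument list for `sglang serve` and adds the low-latency flags only below a concurrency threshold. The `serve_args` helper is hypothetical, and toggling `--chunked-prefill-size -1` together with `--enable-torch-compile` is an assumption of this sketch (the 256 threshold on this page is stated only for the latter).

```python
import shlex


def serve_args(model_path, expected_concurrency, port=30000, threshold=256):
    """Return the `sglang serve` argument list for the expected concurrency."""
    args = [
        "sglang", "serve",
        "--model-path", model_path,
        "--host", "0.0.0.0",
        "--port", str(port),
    ]
    if expected_concurrency < threshold:
        # Latency-oriented settings; omitted for high-concurrency batch jobs.
        args += ["--enable-torch-compile", "--chunked-prefill-size", "-1"]
    return args


# Print the command for a low-concurrency, latency-sensitive deployment.
print(shlex.join(serve_args("LiquidAI/LFM2.5-1.2B-Instruct", expected_concurrency=8)))
```

The returned list can be passed directly to `subprocess.Popen` if the server is started from Python.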

On B300/CUDA 13, use the dedicated Docker image:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:v0.5.10-cu13 \
    sglang serve \
        --model-path LiquidAI/LFM2.5-1.2B-Instruct \
        --host 0.0.0.0 \
        --port 30000 \
        --tool-call-parser lfm2 \
        --enable-torch-compile \
        --chunked-prefill-size -1
```

Example benchmark on a B300 GPU with CUDA 13 (256 prompts, 1024 input tokens, 128 output tokens, max concurrency 1). TPOT is the time per output token after the first:

| Metric                          | Value   |
| ------------------------------- | ------- |
| Mean TTFT (ms)                  | 8.79    |
| Mean TPOT (ms)                  | 0.86    |
| Output token throughput (tok/s) | 1100.92 |

<Accordion title="Benchmark command">
  ```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
  python3 -m sglang.bench_serving \
      --backend sglang-oai-chat \
      --num-prompts 256 \
      --max-concurrency 1 \
      --random-input-len 1024 \
      --random-output-len 128 \
      --warmup-requests 128
  ```
</Accordion>
