SGLang serves LFM2.5 at low latency under high concurrency and exposes an OpenAI-compatible API. The dense, MoE, and vision-language models are supported natively, along with the lfm2 tool-call parser and <think> reasoning, so a single sglang serve puts any LFM2.5 checkpoint behind a production endpoint.
Reach for SGLang when you want the lowest latency at high concurrency. It requires a CUDA GPU. On CPU-only machines, use llama.cpp instead.
The SGLang cookbook generates the launch command for each model and GPU, including parsers and Blackwell attention backends. This page gets a server running.

LFM2.5 SGLang Cookbook

An interactive command generator for LFM2.5 on SGLang, verified on real hardware and updated with each release.

What’s in the cookbook

  • A command for every model and GPU. An interactive generator that outputs the verified sglang serve line for your hardware.
  • Reasoning and tool calling. The right --reasoning-parser per model and --tool-call-parser lfm2, included in each command.
  • Blackwell attention backends. The --attention-backend and --mm-attention-backend choices that matter on B200/B300.
  • Vision tuning. Backends and memory flags for the VL models, with throughput numbers.
  • Recommended sampling. Per-checkpoint presets from the model cards.
  • Benchmarks. Measured TTFT, TPOT, and throughput per model and GPU.

Quickstart

Install SGLang from the official guide:
pip install --upgrade pip
pip install uv
uv pip install sglang
Recent SGLang releases and the lmsysorg/sglang dev image include LFM2.5 support: the dense, MoE, and VL model classes and the lfm2 tool-call parser. On an older release, install from source or use the dev image. The cookbook lists the minimum version.
Serve an LFM2.5 model behind an OpenAI-compatible API:
sglang serve --model-path LiquidAI/LFM2.5-1.2B-Instruct --tool-call-parser lfm2
This command covers the dense Instruct model. For everything else, use the cookbook: it generates the verified command for every model and GPU, including the right --reasoning-parser for the thinking models and the vision and Blackwell attention backends. Prefer the cookbook for any flag that changes across releases.
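Loading the model takes some time, so a client that fires right after launch can hit a connection error. The following is a minimal sketch of a readiness check, assuming the server listens on SGLang's default port 30000 and answers the OpenAI-compatible /v1/models route. The helper name wait_for_server and the timeout values are illustrative and not part of SGLang.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str = "http://localhost:30000",
                    timeout: float = 300.0,
                    interval: float = 2.0) -> bool:
    """Poll GET {base_url}/v1/models until it returns 200 or the timeout elapses."""
    url = base_url.rstrip("/") + "/v1/models"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused or an HTTP error: the server is not ready yet.
            pass
        time.sleep(interval)
    return False


# Usage, once `sglang serve` has been started in another terminal:
# if not wait_for_server():
#     raise SystemExit("server did not come up in time")
```

The function returns False instead of raising, so a launch script can decide for itself whether to retry or abort.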
Send your first request with the OpenAI client. The base URL below assumes SGLang's default port, 30000:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[{"role": "user", "content": "What is C. elegans? Answer in one sentence."}],
    temperature=0.1,
    extra_body={"top_k": 50, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
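Because the server was launched with --tool-call-parser lfm2, tool calls should come back in the standard OpenAI tool_calls structure. The sketch below shows the request and response shapes under a few assumptions: the get_weather tool is hypothetical, the helper names are ours, and the request goes through the standard library rather than the OpenAI client so that the raw JSON is visible. The same tools field can be passed to client.chat.completions.create.

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000/v1"  # SGLang's default port, as in the quickstart
MODEL = "LiquidAI/LFM2.5-1.2B-Instruct"

# Hypothetical tool definition, for illustration only.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_payload(user_message: str) -> dict:
    """Build an OpenAI-style chat request that offers the tool to the model."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [WEATHER_TOOL],
        "temperature": 0.1,
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (function name, parsed arguments) for each tool call in the reply."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # Arguments arrive as a JSON-encoded string in the OpenAI schema.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


def post_chat(payload: dict) -> dict:
    """POST the payload to the chat completions route and decode the JSON reply."""
    req = urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


# With a server running:
# reply = post_chat(build_payload("What's the weather in Boston?"))
# print(extract_tool_calls(reply))
```

An empty list from extract_tool_calls means the model answered in plain text, in which case the reply is in the message's content field as usual.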

Sampling parameters

LFM2.5 uses per-model sampling presets, and the cookbook’s configuration tips list the exact values for each checkpoint. Pass them on every request: some checkpoints ship no defaults in generation_config.json, so the server has nothing to apply on your behalf. temperature is a regular argument of the OpenAI client, as in the quickstart. top_k, min_p, and repetition_penalty are not part of the OpenAI request schema, so put them in extra_body. Leave max_tokens unset unless you mean to cap the output length.
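One way to avoid forgetting a preset is to keep it in a dictionary and split it into client arguments and extra_body in one place. This is a sketch, not an official helper: the function name sampling_kwargs is ours, and the preset values simply mirror the quickstart request, so treat them as placeholders and copy the real values for your checkpoint from the cookbook or the model card.

```python
# Sampling keys the OpenAI Python client accepts as named arguments.
# Anything else has to travel in extra_body to reach the server.
OPENAI_NATIVE = {"temperature", "top_p", "max_tokens"}

# Placeholder values taken from the quickstart request, not a verified preset.
EXAMPLE_PRESET = {"temperature": 0.1, "top_k": 50, "repetition_penalty": 1.05}


def sampling_kwargs(preset: dict) -> dict:
    """Split a preset into keyword arguments for chat.completions.create."""
    kwargs = {k: v for k, v in preset.items() if k in OPENAI_NATIVE}
    extra = {k: v for k, v in preset.items() if k not in OPENAI_NATIVE}
    if extra:
        kwargs["extra_body"] = extra
    return kwargs


# Usage with the client from the quickstart:
# response = client.chat.completions.create(
#     model="LiquidAI/LFM2.5-1.2B-Instruct",
#     messages=[{"role": "user", "content": "Hello"}],
#     **sampling_kwargs(EXAMPLE_PRESET),
# )
```

Since max_tokens is not in the preset, it stays unset and the output is not capped.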