vLLM is the highest-throughput way to serve LFM2.5 in production. Point vLLM at an LFM2.5 checkpoint to start an OpenAI-compatible endpoint with one command. The dense, MoE, and vision-language architectures are all native to vLLM (Lfm2ForCausalLM, Lfm2MoeForCausalLM, and Lfm2VlForConditionalGeneration), so you don’t need --trust-remote-code or any patches.
Reach for vLLM when you need high-throughput serving, batch processing, or an OpenAI-compatible API. It requires a CUDA GPU. On CPU-only machines, use llama.cpp instead.
The vLLM cookbook keeps the serve flags, sampling presets, and tuning notes current. This page gets you from install to first request.

LFM2.5 vLLM Cookbook

Serving commands, sampling settings, and hardware notes for LFM2.5 on vLLM, verified on real hardware and updated with each release.

What’s in the cookbook

  • Commands for every model. The full LFM2.5 matrix (dense, MoE, reasoning, Japanese, base, and vision), with the exact command for each.
  • Reasoning and tool calling. When to pass --reasoning-parser qwen3 and --tool-call-parser lfm2, and how the parsed output comes back.
  • Recommended sampling. Per-checkpoint presets from the model cards.
  • Vision. Serving the VL models and sending image-and-text turns.
  • Blackwell tuning and benchmarks. Throughput-impacting flags, with benchmark numbers.
  • One-command Modal deploy. A ready-to-run script for cloud GPUs.

Quickstart

Install vLLM. Native support for LFM2.5’s dense, MoE, and VL architectures ships in vLLM v0.23.0 and later, so upgrade to the latest release:
uv pip install -U vllm
Serve any LFM2.5 model behind an OpenAI-compatible API:
vllm serve LiquidAI/LFM2.5-1.2B-Instruct
This is the minimal path. The per-model and per-hardware flags (--tool-call-parser lfm2, --reasoning-parser qwen3, vision options, and Blackwell tuning) live in the cookbook, where they stay in sync with each release.
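Loading a checkpoint can take a while, so it may help to confirm the server is answering before you send a request. The sketch below polls the OpenAI-compatible model listing route (GET /v1/models) until it responds. The helper name wait_for_models and its timeout defaults are illustrative choices, not part of vLLM, and the URL assumes the default host and port used elsewhere on this page.

```python
import json
import time
import urllib.error
import urllib.request


def wait_for_models(base_url: str, timeout_s: float = 120.0, poll_s: float = 2.0) -> list[str]:
    """Poll GET {base_url}/models until the server answers, then return the model ids.

    Hypothetical helper for illustration; the timeout values are arbitrary defaults.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
                payload = json.load(resp)
            # The OpenAI-style listing wraps the models in a "data" array.
            return [model["id"] for model in payload["data"]]
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            # Server not reachable yet: retry until the deadline, then re-raise.
            if time.monotonic() >= deadline:
                raise
            time.sleep(poll_s)


# Example, assuming the default vLLM address:
#   wait_for_models("http://localhost:8000/v1")
```

If the call returns a list containing the checkpoint you passed to vllm serve, the endpoint is ready for the request below.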
Send your first request with the OpenAI client:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[{"role": "user", "content": "What is C. elegans? Answer in one sentence."}],
    temperature=0.1,
    extra_body={"top_k": 50, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)

Sampling parameters

LFM2.5 ships per-model sampling presets on its Hugging Face model cards, and the cookbook’s sampling table lists the exact values for every checkpoint. Two details matter. First, top_k, min_p, and repetition_penalty are vLLM extras rather than standard OpenAI parameters, so pass them through extra_body instead of as top-level arguments. Second, leave max_tokens unset unless you mean to cap output, because a low cap cuts off the reasoning models mid-thought.
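One way to keep that split consistent is to store a preset as a flat dictionary and route the vLLM extras into extra_body just before the call. This is a minimal sketch: split_sampling and VLLM_EXTRAS are hypothetical names, and the preset values are the ones from the quickstart request above, not a recommendation for any particular checkpoint.

```python
from typing import Any

# The vLLM-specific sampling keys named above; everything else stays top-level.
VLLM_EXTRAS = {"top_k", "min_p", "repetition_penalty"}


def split_sampling(params: dict[str, Any]) -> dict[str, Any]:
    """Return keyword arguments for chat.completions.create, with vLLM extras under extra_body."""
    kwargs = {k: v for k, v in params.items() if k not in VLLM_EXTRAS}
    extras = {k: v for k, v in params.items() if k in VLLM_EXTRAS}
    if extras:
        kwargs["extra_body"] = extras
    # max_tokens is never added here, so output stays uncapped unless the preset sets it.
    return kwargs


# Values taken from the quickstart request on this page.
preset = {"temperature": 0.1, "top_k": 50, "repetition_penalty": 1.05}
request_kwargs = split_sampling(preset)
```

The result can be unpacked into the client call, for example client.chat.completions.create(model=..., messages=..., **request_kwargs).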