Use vLLM for high-throughput production deployments, batch processing, or serving models via an API.
vLLM offers significantly higher throughput than Transformers, making it ideal for serving many concurrent requests. However, it requires a CUDA-compatible GPU. For CPU-only environments, consider using llama.cpp instead.
Installation
You need to install vLLM v0.10.2 or a more recent version. The command below uses uv, since the --torch-backend=auto option is handled by uv rather than pip:
uv pip install vllm==0.10.2 --extra-index-url https://wheels.vllm.ai/0.10.2/ --torch-backend=auto
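You can verify the installation by printing the package version, which should report 0.10.2 or later:
python -c "import vllm; print(vllm.__version__)"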
Basic Usage
The LLM class provides a simple interface for offline inference. Use the chat() method to automatically apply the chat template and generate text:
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="LiquidAI/LFM2.5-1.2B-Instruct")

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.3,
    min_p=0.15,
    repetition_penalty=1.05,
    max_tokens=512
)

# Generate an answer (chat() expects a list of message dicts)
conversation = [{"role": "user", "content": "What is C. elegans?"}]
output = llm.chat(conversation, sampling_params)
print(output[0].outputs[0].text)
Sampling Parameters
Control text generation behavior using SamplingParams. Key parameters:
temperature (float, default 1.0): Controls randomness (0.0 = deterministic, higher = more random). Typical range: 0.1-2.0
top_p (float, default 1.0): Nucleus sampling - limits to tokens with cumulative probability ≤ top_p. Typical range: 0.1-1.0
top_k (int, default -1): Limits to top-k most probable tokens (-1 = disabled). Typical range: 1-100
min_p (float): Minimum token probability threshold. Typical range: 0.01-0.2
max_tokens (int): Maximum number of tokens to generate
repetition_penalty (float, default 1.0): Penalty for repeating tokens (>1.0 = discourage repetition). Typical range: 1.0-1.5
stop (str or list[str]): Strings that terminate generation when encountered
Create a SamplingParams object:
from vllm import SamplingParams
sampling_params = SamplingParams(
temperature = 0.3 ,
min_p = 0.15 ,
repetition_penalty = 1.05 ,
max_tokens = 512 ,
)
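As a further illustration, the nucleus, top-k, and stop parameters described above can be combined; the values below are only examples, not recommended defaults for LFM models:
from vllm import SamplingParams

# Illustrative values: nucleus + top-k sampling with a custom stop string
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    stop=["\n\n"],
    max_tokens=256,
)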
For a complete list of parameters, see the vLLM Sampling Parameters documentation.
Batched Generation
vLLM automatically batches multiple prompts for efficient processing, so you can generate responses for large datasets in a single call; batching behavior can also be tuned through engine arguments (see the sketch after the example below):
from vllm import LLM, SamplingParams

llm = LLM(model="LiquidAI/LFM2.5-1.2B-Instruct")

sampling_params = SamplingParams(
    temperature=0.3,
    min_p=0.15,
    repetition_penalty=1.05,
    max_tokens=512
)

# Large batch of prompts
prompts = [
    "Explain quantum computing in one sentence.",
    "What are the benefits of exercise?",
    "Write a haiku about programming.",
    # ... many more prompts
]

# chat() expects one conversation (a list of message dicts) per prompt
conversations = [[{"role": "user", "content": p}] for p in prompts]

# Generate the list of answers in a single batched call
outputs = llm.chat(conversations, sampling_params)

for i, output in enumerate(outputs):
    print(f"Prompt {i}: {prompts[i]}")
    print(f"Generated: {output.outputs[0].text}\n")
OpenAI-Compatible Server
vLLM can serve models through an OpenAI-compatible API, allowing you to use existing OpenAI client libraries:
vllm serve LiquidAI/LFM2.5-1.2B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype auto
Optional parameters:
--max-model-len <n>: Set the maximum context length in tokens
--gpu-memory-utilization <fraction>: Set the GPU memory usage (0.0-1.0, default 0.9)
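For example, a launch command that sets both optional flags might look like this (the values are illustrative):
vllm serve LiquidAI/LFM2.5-1.2B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --dtype auto \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.85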
Chat Completions
Once running, you can use the OpenAI Python client or any OpenAI-compatible tool:
from openai import OpenAI

# Point to your vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # vLLM doesn't require authentication by default
)

# Chat completion (vLLM-specific sampling options are passed via extra_body)
response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.3,
    max_tokens=512,
    extra_body={"min_p": 0.15, "repetition_penalty": 1.05}
)
print(response.choices[0].message.content)

# Streaming response
stream = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[
        {"role": "user", "content": "Tell me a story."}
    ],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
You can also query the server directly with curl:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "LiquidAI/LFM2.5-1.2B-Instruct",
    "messages": [
      {"role": "user", "content": "What is AI?"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'
Vision Models
vLLM support for LFM Vision Models requires a specific version that includes changes not yet merged upstream. You must install vLLM from a custom source to use vision models. See installation instructions below.
Installation for Vision Models
To use LFM Vision Models with vLLM, install the precompiled wheel along with the required transformers version:
VLLM_PRECOMPILED_WHEEL_COMMIT=72506c98349d6bcd32b4e33eec7b5513453c1502 VLLM_USE_PRECOMPILED=1 pip install git+https://github.com/vllm-project/vllm.git
pip install git+https://github.com/huggingface/transformers.git@3c2517727ce28a30f5044e01663ee204deb1cdbe pillow
This installs vLLM with the necessary changes for LFM Vision Model support. Once these changes are merged upstream, you’ll be able to use the standard vLLM installation.
Basic Usage
Initialize a vision model and process text and image inputs:
from vllm import LLM, SamplingParams

def build_messages(parts):
    content = []
    for item in parts:
        if item["type"] == "text":
            content.append({"type": "text", "text": item["value"]})
        elif item["type"] == "image":
            content.append({"type": "image_url", "image_url": {"url": item["value"]}})
        else:
            raise ValueError(f"Unknown item type: {item['type']}")
    return [{"role": "user", "content": content}]

IMAGE_URL = "http://images.cocodataset.org/val2017/000000039769.jpg"

llm = LLM(
    model="LiquidAI/LFM2.5-VL-1.6B",
    max_model_len=1024,
)

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=1024,
)

# Batch multiple prompts - text-only and multimodal
prompts = [
    [{"type": "text", "value": "What is C. elegans?"}],
    [{"type": "text", "value": "Say hi in JSON format"}],
    [{"type": "text", "value": "Define AI in Spanish"}],
    [
        {"type": "image", "value": IMAGE_URL},
        {"type": "text", "value": "Describe what you see in this image."},
    ],
]

conversations = [build_messages(p) for p in prompts]
outputs = llm.chat(conversations, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
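Local images can also be used offline by encoding them as base64 data URLs, which the chat interface accepts in the same image_url format. This is a minimal sketch that reuses the llm, sampling_params, and build_messages definitions from the example above; the file path is a placeholder:
import base64

# Placeholder path; llm, sampling_params, and build_messages come from the example above
with open("path/to/local_image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

conversation = build_messages([
    {"type": "image", "value": f"data:image/jpeg;base64,{image_b64}"},
    {"type": "text", "value": "Describe what you see in this image."},
])
outputs = llm.chat(conversation, sampling_params)
print(outputs[0].outputs[0].text)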
OpenAI-Compatible API
You can also serve vision models through the OpenAI-compatible API:
vllm serve LiquidAI/LFM2.5-VL-1.6B \
--host 0.0.0.0 \
--port 8000 \
--dtype auto
Then use the OpenAI client with image content:
from openai import OpenAI
from PIL import Image
import base64
from io import BytesIO

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

# Load and encode image
image = Image.open("path/to/image.jpg")
buffered = BytesIO()
image.save(buffered, format="JPEG")
image_base64 = base64.b64encode(buffered.getvalue()).decode()

# Chat completion with image (vLLM-specific sampling options via extra_body)
response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-VL-1.6B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}}
            ]
        }
    ],
    temperature=0.3,
    max_tokens=512,
    extra_body={"min_p": 0.15, "repetition_penalty": 1.05}
)
print(response.choices[0].message.content)