vLLM

vLLM is a high-throughput and memory-efficient inference engine for LLMs. It supports efficient serving with PagedAttention, continuous batching, and optimized CUDA kernels.

Use vLLM for:
  • High-throughput inference and production deployments with GPU acceleration
  • Batch processing of many prompts
  • Serving models via an API

vLLM offers significantly higher throughput than Transformers, making it ideal for serving many concurrent requests. However, it requires a CUDA-compatible GPU. For CPU-only environments, consider using llama.cpp instead.
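
If you're unsure whether a compatible GPU is visible in your environment, a quick check with PyTorch (a minimal sketch, assuming PyTorch is already installed) can tell you before installing vLLM:

import torch

# Report whether a CUDA-capable GPU is visible to PyTorch
if torch.cuda.is_available():
    print(f"CUDA GPU detected: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA GPU detected; consider llama.cpp for CPU-only inference")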

Installation

You need to install vLLM v0.10.2 or a more recent version:

pip install vllm==0.10.2 --extra-index-url https://wheels.vllm.ai/0.10.2/ --torch-backend=auto
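
After installation, you can confirm that the version requirement is met directly from Python:

import vllm

# Should print 0.10.2 or newer
print(vllm.__version__)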

Basic Usage

The LLM class provides a simple interface for offline inference. Use the chat() method to apply the model's chat template to a list of messages and generate text:

from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="LiquidAI/LFM2-1.2B")

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.3,
    min_p=0.15,
    repetition_penalty=1.05,
    max_tokens=512,
)

# Generate an answer; chat() expects a list of messages
messages = [{"role": "user", "content": "What is C. elegans?"}]
output = llm.chat(messages, sampling_params)
print(output[0].outputs[0].text)

Sampling Parameters

Control text generation behavior using SamplingParams. Key parameters:

  • temperature (float, default 1.0): Controls randomness (0.0 = deterministic, higher = more random). Typical range: 0.1-2.0
  • top_p (float, default 1.0): Nucleus sampling - limits to tokens with cumulative probability ≤ top_p. Typical range: 0.1-1.0
  • top_k (int, default -1): Limits to top-k most probable tokens (-1 = disabled). Typical range: 1-100
  • min_p (float, default 0.0): Minimum token probability threshold, relative to the most likely token (0.0 = disabled). Typical range: 0.01-0.2
  • max_tokens (int, default 16): Maximum number of tokens to generate
  • repetition_penalty (float, default 1.0): Penalty for repeating tokens (>1.0 = discourage repetition). Typical range: 1.0-1.5
  • stop (str or list[str]): Strings that terminate generation when encountered

Create a SamplingParams object:

from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.3,
    min_p=0.15,
    repetition_penalty=1.05,
    max_tokens=512,
)

For a complete list of parameters, see the vLLM Sampling Parameters documentation.
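
As an illustration of the remaining parameters, the sketch below (values are illustrative) configures deterministic decoding that halts at a delimiter:

from vllm import SamplingParams

# Deterministic decoding that stops at the first delimiter
greedy_params = SamplingParams(
    temperature=0.0,        # greedy: always pick the most probable token
    max_tokens=256,
    stop=["\n\n", "###"],   # generation halts when either string appears
)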

Batched Generation

vLLM automatically batches multiple requests for efficient processing, so you can generate responses for an entire dataset of prompts in a single call:

from vllm import LLM, SamplingParams

llm = LLM(model="LiquidAI/LFM2-1.2B")

sampling_params = SamplingParams(
    temperature=0.3,
    min_p=0.15,
    repetition_penalty=1.05,
    max_tokens=512,
)

# Large batch of prompts
prompts = [
    "Explain quantum computing in one sentence.",
    "What are the benefits of exercise?",
    "Write a haiku about programming.",
    # ... many more prompts
]

# chat() accepts a list of conversations, one per prompt
conversations = [[{"role": "user", "content": p}] for p in prompts]

# Generate a list of answers
outputs = llm.chat(conversations, sampling_params)
for i, output in enumerate(outputs):
    print(f"Prompt {i}: {prompts[i]}")
    print(f"Generated: {output.outputs[0].text}\n")

OpenAI-Compatible Server

vLLM can serve models through an OpenAI-compatible API, allowing you to use existing OpenAI client libraries:

vllm serve LiquidAI/LFM2-1.2B \
--host 0.0.0.0 \
--port 8000 \
--dtype auto

Optional parameters:

  • --max-model-len <length>: Set the maximum context length in tokens
  • --gpu-memory-utilization 0.9: Fraction of GPU memory to use (0.0-1.0, default 0.9)
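
Once the server starts, a quick way to confirm it is reachable is to list the models it serves (a short sketch, assuming the command above is running on localhost:8000):

from openai import OpenAI

# List the models served by the local vLLM instance
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
for model in client.models.list():
    print(model.id)  # should include LiquidAI/LFM2-1.2B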

Chat Completions

Once running, you can use the OpenAI Python client or any OpenAI-compatible tool:

from openai import OpenAI

# Point to your vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # vLLM doesn't require authentication by default
)

# Chat completion
response = client.chat.completions.create(
    model="LiquidAI/LFM2-1.2B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.3,
    max_tokens=512,
    # vLLM-specific sampling parameters go through extra_body
    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
)

print(response.choices[0].message.content)

# Streaming response
stream = client.chat.completions.create(
    model="LiquidAI/LFM2-1.2B",
    messages=[
        {"role": "user", "content": "Tell me a story."}
    ],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

Curl request example

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "LiquidAI/LFM2-1.2B",
    "messages": [
      {"role": "user", "content": "What is AI?"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'

Vision Models

vLLM supports LFM2-VL multimodal models for both offline inference and serving.

Offline Inference

import base64

from vllm import LLM, SamplingParams

# Initialize vision model
llm = LLM(model="LiquidAI/LFM2-VL-1.6B")

sampling_params = SamplingParams(
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
    max_tokens=512,
)

# Load a local image and encode it as a base64 data URL
with open("path/to/image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Create messages containing the image and a text prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": "What's in this image?"},
        ],
    }
]

# Generate
output = llm.chat(messages, sampling_params)
print(output[0].outputs[0].text)

OpenAI-Compatible Server

Start a server with a vision model:

vllm serve LiquidAI/LFM2-VL-1.6B --port 8000

Use the OpenAI client with image URLs or base64-encoded images:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

response = client.chat.completions.create(
    model="LiquidAI/LFM2-VL-1.6B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ],
    temperature=0.1,
    max_tokens=512,
    # vLLM-specific sampling parameters go through extra_body
    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
)

print(response.choices[0].message.content)
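
To send a local image instead of a URL, you can base64-encode it into a data URL (a short sketch; the file path is illustrative):

import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Encode a local image as a base64 data URL
with open("path/to/image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="LiquidAI/LFM2-VL-1.6B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)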