Use vLLM for high-throughput production deployments, batch processing, or serving models via an API.
vLLM offers significantly higher throughput than Transformers, making it ideal for serving many concurrent requests. However, it requires a CUDA-compatible GPU. For CPU-only environments, consider using llama.cpp instead.
Installation
Install vLLM v0.14 or a more recent version:
uv pip install vllm==0.14
vLLM provides a prebuilt Docker image that serves an OpenAI-compatible API:
docker pull vllm/vllm-openai:latest
This image requires NVIDIA GPU access. See the OpenAI-Compatible Server section below for the full docker run command.
Basic Usage
The LLM class provides a simple interface for offline inference. Use the chat() method to automatically apply the chat template and generate text:
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="LiquidAI/LFM2.5-1.2B-Instruct")

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_tokens=512,
)

# Generate an answer; chat() applies the chat template automatically
messages = [{"role": "user", "content": "What is C. elegans?"}]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
Sampling Parameters
Control text generation behavior using SamplingParams. Key parameters:
temperature (float, default 1.0): Controls randomness (0.0 = deterministic, higher = more random). Typical range: 0.1-2.0
top_p (float, default 1.0): Nucleus sampling - limits to tokens with cumulative probability ≤ top_p. Typical range: 0.1-1.0
top_k (int, default -1): Limits to top-k most probable tokens (-1 = disabled). Typical range: 1-100
min_p (float): Minimum token probability threshold. Typical range: 0.01-0.2
max_tokens (int): Maximum number of tokens to generate
repetition_penalty (float, default 1.0): Penalty for repeating tokens (>1.0 = discourage repetition). Typical range: 1.0-1.5
stop (str or list[str]): Strings that terminate generation when encountered
Create a SamplingParams object:
from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_tokens=512,
)
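As a further illustration, the sketch below combines nucleus sampling with a stop string; the specific values are assumptions chosen for demonstration, not tuned recommendations:
from vllm import SamplingParams

# Illustrative values only: nucleus sampling plus an explicit stop string
sampling_params = SamplingParams(
    temperature=0.3,
    top_p=0.9,          # keep only tokens within the top 90% cumulative probability
    max_tokens=256,
    stop=["\n\n"],      # generation halts as soon as this string is produced
)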
For a complete list of parameters, see the vLLM Sampling Parameters documentation.
Batched Generation
vLLM automatically batches multiple prompts for efficient processing. You can control batch behavior and generate responses for large datasets:
from vllm import LLM, SamplingParams

llm = LLM(model="LiquidAI/LFM2.5-1.2B-Instruct")

sampling_params = SamplingParams(
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_tokens=512,
)

# Large batch of prompts
prompts = [
    "Explain quantum computing in one sentence.",
    "What are the benefits of exercise?",
    "Write a haiku about programming.",
    # ... many more prompts
]

# Wrap each prompt in the chat message format, then generate answers in one batch
conversations = [[{"role": "user", "content": p}] for p in prompts]
outputs = llm.chat(conversations, sampling_params)

for i, output in enumerate(outputs):
    print(f"Prompt {i}: {prompts[i]}")
    print(f"Generated: {output.outputs[0].text}\n")
OpenAI-Compatible Server
vLLM can serve models through an OpenAI-compatible API, allowing you to use existing OpenAI client libraries.
vllm serve LiquidAI/LFM2.5-1.2B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype auto
Optional parameters:
--max-model-len <length>: Set maximum context length
--gpu-memory-utilization 0.9: Set GPU memory usage (0.0-1.0)
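For example, a serve command that caps the context window and GPU memory usage might look like this (the values are illustrative assumptions, not recommendations):
vllm serve LiquidAI/LFM2.5-1.2B-Instruct \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.9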
Alternatively, run the prebuilt Docker image:
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model LiquidAI/LFM2.5-1.2B-Instruct
Key flags:
--runtime nvidia --gpus all: GPU access (required)
--ipc=host: Shared memory for tensor parallelism
-v ~/.cache/huggingface:/root/.cache/huggingface: Cache models on host
HF_TOKEN: Set this env var if using gated models
Note: The Docker image does not include optional dependencies. If you need them, build a custom image from the vLLM Dockerfile.
Chat Completions
Once running, you can use the OpenAI Python client or any OpenAI-compatible tool:
from openai import OpenAI

# Point to your vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # vLLM doesn't require authentication by default
)

# Chat completion
response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.1,
    max_tokens=512,
    extra_body={"top_k": 50, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)

# Streaming response
stream = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[
        {"role": "user", "content": "Tell me a story."}
    ],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
You can also send requests directly with curl:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "LiquidAI/LFM2.5-1.2B-Instruct",
"messages": [
{"role": "user", "content": "What is AI?"}
],
"temperature": 0.1,
"top_k": 50,
"repetition_penalty": 1.05,
"max_tokens": 256
}'
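To verify that the server is up, you can list the models it is serving via the standard OpenAI-compatible /v1/models endpoint:
curl http://localhost:8000/v1/models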
Vision Models
Installation for Vision Models
To use LFM Vision Models with vLLM, install the required versions:
uv pip install vllm==0.19.0
uv pip install transformers==5.5.0 pillow
Basic Usage
Initialize a vision model and process text and image inputs:
from vllm import LLM, SamplingParams

def build_messages(parts):
    content = []
    for item in parts:
        if item["type"] == "text":
            content.append({"type": "text", "text": item["value"]})
        elif item["type"] == "image":
            content.append({"type": "image_url", "image_url": {"url": item["value"]}})
        else:
            raise ValueError(f"Unknown item type: {item['type']}")
    return [{"role": "user", "content": content}]

IMAGE_URL = "http://images.cocodataset.org/val2017/000000039769.jpg"

llm = LLM(
    model="LiquidAI/LFM2.5-VL-1.6B",
    max_model_len=1024,
)

sampling_params = SamplingParams(
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
    max_tokens=1024,
)

# Batch multiple prompts - text-only and multimodal
prompts = [
    [{"type": "text", "value": "What is C. elegans?"}],
    [{"type": "text", "value": "Say hi in JSON format"}],
    [{"type": "text", "value": "Define AI in Spanish"}],
    [
        {"type": "image", "value": IMAGE_URL},
        {"type": "text", "value": "Describe what you see in this image."},
    ],
]

conversations = [build_messages(p) for p in prompts]
outputs = llm.chat(conversations, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
OpenAI-Compatible API
You can also serve vision models through the OpenAI-compatible API:
vllm serve LiquidAI/LFM2.5-VL-1.6B \
--host 0.0.0.0 \
--port 8000 \
--dtype auto
Then use the OpenAI client with image content:
from openai import OpenAI
from PIL import Image
import base64
from io import BytesIO

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",
)

# Load and encode image
image = Image.open("path/to/image.jpg")
buffered = BytesIO()
image.save(buffered, format="JPEG")
image_base64 = base64.b64encode(buffered.getvalue()).decode()

# Chat completion with image
response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-VL-1.6B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}},
            ],
        }
    ],
    temperature=0.1,
    max_tokens=512,
    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
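If the image is already hosted at a URL, a minimal sketch of the same request passes the URL directly instead of a base64 data URI. This assumes the server can fetch remote images; the URL below is the COCO sample used earlier, and the client is the one configured above:
# Reusing the `client` configured above; remote image URL instead of an inline base64 payload
response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-VL-1.6B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what you see in this image."},
                {"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/val2017/000000039769.jpg"}},
            ],
        }
    ],
    temperature=0.1,
    max_tokens=512,
    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)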