# vLLM

> vLLM is a high-throughput and memory-efficient inference engine for LLMs. It supports efficient serving with PagedAttention, continuous batching, and optimized CUDA kernels.

<Tip>
  Use vLLM for high-throughput production deployments, batch processing, or serving models via an API.
</Tip>

vLLM offers significantly higher throughput than [Transformers](/deployment/gpu-inference/transformers), making it ideal for serving many concurrent requests. However, it requires a CUDA-compatible GPU. For CPU-only environments, consider using [llama.cpp](/deployment/on-device/llama-cpp) instead.

<div className="colab-link">
  <a href="https://colab.research.google.com/github/Liquid4All/docs/blob/main/notebooks/LFM2_Inference_with_vLLM.ipynb" target="_blank">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" />
  </a>
</div>

## Installation

<Tabs>
  <Tab title="pip">
    Install [`vLLM`](https://github.com/vllm-project/vllm) v0.14 or a more recent version:

    ```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
    uv pip install "vllm>=0.14"
    ```
  </Tab>

  <Tab title="Docker">
    vLLM provides a prebuilt Docker image that serves an OpenAI-compatible API:

    ```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
    docker pull vllm/vllm-openai:latest
    ```

    This image requires NVIDIA GPU access. See the [OpenAI-Compatible Server](#openai-compatible-server) section below for the full `docker run` command.
  </Tab>
</Tabs>

## Basic Usage

The `LLM` class provides a simple interface for offline inference. Use the [`chat()`](https://docs.vllm.ai/en/v0.6.0/dev/offline_inference/llm.html) method to automatically apply the chat template and generate text:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="LiquidAI/LFM2.5-1.2B-Instruct")

# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_tokens=512
)

# Generate an answer; chat() expects a list of role/content messages
messages = [{"role": "user", "content": "What is C. elegans?"}]
output = llm.chat(messages, sampling_params)
print(output[0].outputs[0].text)
```

### Sampling Parameters

Control text generation behavior using [`SamplingParams`](https://docs.vllm.ai/en/v0.4.1/dev/sampling_params.html). Key parameters:

* **`temperature`** (`float`, default 1.0): Controls randomness (0.0 = deterministic, higher = more random). Typical range: 0.1-2.0
* **`top_p`** (`float`, default 1.0): Nucleus sampling - limits to tokens with cumulative probability ≤ top\_p. Typical range: 0.1-1.0
* **`top_k`** (`int`, default -1): Limits to top-k most probable tokens (-1 = disabled). Typical range: 1-100
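* **`min_p`** (`float`, default 0.0): Minimum token probability, measured relative to the probability of the most likely token (0.0 = disabled). Typical range: 0.01-0.2
* **`max_tokens`** (`int`, default 16): Maximum number of tokens to generate per output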
* **`repetition_penalty`** (`float`, default 1.0): Penalty for repeating tokens (>1.0 = discourage repetition). Typical range: 1.0-1.5
* **`stop`** (`str` or `list[str]`): Strings that terminate generation when encountered

Create a `SamplingParams` object:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_tokens=512,
)
```

For a complete list of parameters, see the [vLLM Sampling Parameters documentation](https://docs.vllm.ai/en/v0.4.1/dev/sampling_params.html).
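
To make the effect of `top_k`, `top_p`, and `min_p` concrete, here is a minimal pure-Python sketch that applies the three filters to a toy distribution. It is an illustration only, not vLLM's implementation: the function name `filter_distribution` is hypothetical, and the order of the filters and the single renormalization at the end are simplifying assumptions.

```python
def filter_distribution(probs, top_k=-1, top_p=1.0, min_p=0.0):
    """Return the renormalized distribution left after top-k, top-p, and min-p filtering.

    `probs` maps token -> probability. Illustrative sketch, not vLLM's code.
    """
    if not probs:
        raise ValueError("probs must not be empty")

    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

    # top_k: keep only the k most probable tokens (-1 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    # min_p: drop tokens below min_p times the highest remaining probability.
    if min_p > 0.0:
        threshold = min_p * ranked[0][1]
        ranked = [(t, p) for t, p in ranked if p >= threshold]

    # Renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in ranked)
    return {t: p / total for t, p in ranked}
```

With the toy distribution `{"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}`, `top_k=2` and `top_p=0.7` both keep only `a` and `b`, while `min_p=0.2` keeps `a`, `b`, and `c` because the cutoff is 0.2 × 0.5 = 0.1.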

## Batched Generation

vLLM automatically batches multiple inputs for efficient processing. Pass all of them in a single `chat()` call and the engine schedules them together, which is how you generate responses for large datasets:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from vllm import LLM, SamplingParams

llm = LLM(model="LiquidAI/LFM2.5-1.2B-Instruct")

sampling_params = SamplingParams(
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_tokens=512
)

# Large batch of prompts
prompts = [
    "Explain quantum computing in one sentence.",
    "What are the benefits of exercise?",
    "Write a haiku about programming.",
    # ... many more prompts
]

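# chat() expects conversations, so wrap each prompt as a single-turn message list
conversations = [[{"role": "user", "content": p}] for p in prompts]
outputs = llm.chat(conversations, sampling_params)

# Outputs are returned in the same order as the inputs
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}")
    print(f"Generated: {output.outputs[0].text}\n")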
```
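
When a dataset is too large to submit in one call, one option is to feed the engine fixed-size chunks and write out results as each chunk finishes. The sketch below is a suggestion rather than part of the vLLM API: `to_conversation` and `chunked` are hypothetical helper names, and the chunk size in the usage comment is arbitrary.

```python
from itertools import islice

def to_conversation(prompt):
    """Wrap a plain string as a single-turn chat conversation."""
    return [{"role": "user", "content": prompt}]

def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk

# Usage sketch (assumes `llm`, `prompts`, and `sampling_params` exist):
# for chunk in chunked(prompts, 1000):
#     outputs = llm.chat([to_conversation(p) for p in chunk], sampling_params)
#     ...  # persist this chunk's outputs before starting the next one
```

Each chunk is still batched internally by vLLM, so chunking mainly bounds how many outputs are held in memory at once.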

## OpenAI-Compatible Server

vLLM can serve models through an OpenAI-compatible API, allowing you to use existing OpenAI client libraries.

<Tabs>
  <Tab title="vllm serve">
    ```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
    vllm serve LiquidAI/LFM2.5-1.2B-Instruct \
        --host 0.0.0.0 \
        --port 8000 \
        --dtype auto
    ```

    Optional parameters:

    * `--max-model-len L`: Set maximum context length
    * `--gpu-memory-utilization 0.9`: Set GPU memory usage (0.0-1.0)
  </Tab>

  <Tab title="Docker">
    ```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
    docker run --runtime nvidia --gpus all \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=$HF_TOKEN" \
        -p 8000:8000 \
        --ipc=host \
        vllm/vllm-openai:latest \
        --model LiquidAI/LFM2.5-1.2B-Instruct
    ```

    Key flags:

    * `--runtime nvidia --gpus all`: GPU access (required)
    * `--ipc=host`: Shared memory for tensor parallelism
    * `-v ~/.cache/huggingface:/root/.cache/huggingface`: Cache models on host
    * `HF_TOKEN`: Set this env var if using gated models

    **Note:** The Docker image does not include optional dependencies. If you need them, build a custom image from the [vLLM Dockerfile](https://docs.vllm.ai/en/stable/deployment/docker/).
  </Tab>
</Tabs>
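
The server needs time to download and load the model before it accepts requests, so scripts that start it may want to wait for readiness first. A minimal sketch using only the standard library, assuming the server's `GET /health` endpoint is reachable at the base URL; the function name `wait_until_ready`, the timeout values, and the injectable `probe` argument are illustrative choices.

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(base_url, timeout_s=300.0, interval_s=2.0, probe=None):
    """Poll `<base_url>/health` until it answers 200 or the timeout elapses.

    `probe` is an optional callable(url) -> bool, mainly useful for testing.
    """
    def http_probe(url):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or a non-2xx status: not ready yet.
            return False

    probe = probe or http_probe
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(f"{base_url}/health"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready("http://localhost:8000")` returns `True` once the server responds and `False` if it never does within the timeout.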

### Chat Completions

Once the server is running, you can use the OpenAI Python client or any OpenAI-compatible tool:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from openai import OpenAI

# Point to your vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # vLLM doesn't require authentication by default
)

# Chat completion
response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.1,
    max_tokens=512,
    extra_body={"top_k": 50, "repetition_penalty": 1.05}
)
print(response.choices[0].message.content)

# Streaming response
stream = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[
        {"role": "user", "content": "Tell me a story."}
    ],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```

<Accordion title="Curl request example">
  ```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
  curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "LiquidAI/LFM2.5-1.2B-Instruct",
      "messages": [
        {"role": "user", "content": "What is AI?"}
      ],
      "temperature": 0.1,
      "top_k": 50,
      "repetition_penalty": 1.05,
      "max_tokens": 256
    }'
  ```
</Accordion>
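
The Python example passes `top_k` and `repetition_penalty` through `extra_body`, while the curl request places them at the top level of the JSON. Both produce the same request body, because the OpenAI client merges `extra_body` into the payload it sends. The helper below is a hypothetical convenience for keeping one settings dict and splitting it automatically; `STANDARD_PARAMS` is a partial, illustrative list rather than an authoritative one.

```python
# Sampling parameters the OpenAI client accepts as named arguments
# (partial list, assumed for illustration).
STANDARD_PARAMS = {
    "temperature", "top_p", "max_tokens", "stop", "seed",
    "presence_penalty", "frequency_penalty",
}

def split_sampling_kwargs(params):
    """Split sampling settings into OpenAI-client kwargs plus an `extra_body` dict."""
    kwargs = {k: v for k, v in params.items() if k in STANDARD_PARAMS}
    extra = {k: v for k, v in params.items() if k not in STANDARD_PARAMS}
    if extra:
        # Anything the client does not know about travels in extra_body.
        kwargs["extra_body"] = extra
    return kwargs
```

The result can be unpacked into the call, for example `client.chat.completions.create(model=..., messages=..., **split_sampling_kwargs(settings))`.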

## Vision Models

### Installation for Vision Models

To use LFM Vision Models with vLLM, install the required versions:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
uv pip install vllm==0.19.0
```

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
uv pip install transformers==5.5.0 pillow
```

### Basic Usage

Initialize a vision model and process text and image inputs:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from vllm import LLM, SamplingParams

def build_messages(parts):
    content = []
    for item in parts:
        if item["type"] == "text":
            content.append({"type": "text", "text": item["value"]})
        elif item["type"] == "image":
            content.append({"type": "image_url", "image_url": {"url": item["value"]}})
        else:
            raise ValueError(f"Unknown item type: {item['type']}")
    return [{"role": "user", "content": content}]

IMAGE_URL = "http://images.cocodataset.org/val2017/000000039769.jpg"

llm = LLM(
    model="LiquidAI/LFM2.5-VL-1.6B",
    max_model_len=1024,
)

sampling_params = SamplingParams(
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
    max_tokens=1024,
)

# Batch multiple prompts - text-only and multimodal
prompts = [
    [{"type": "text", "value": "What is C. elegans?"}],
    [{"type": "text", "value": "Say hi in JSON format"}],
    [{"type": "text", "value": "Define AI in Spanish"}],
    [
        {"type": "image", "value": IMAGE_URL},
        {"type": "text", "value": "Describe what you see in this image."},
    ],
]

conversations = [build_messages(p) for p in prompts]
outputs = llm.chat(conversations, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

### OpenAI-Compatible API

You can also serve vision models through the OpenAI-compatible API:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
vllm serve LiquidAI/LFM2.5-VL-1.6B \
    --host 0.0.0.0 \
    --port 8000 \
    --dtype auto
```

Then use the OpenAI client with image content:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from openai import OpenAI
from PIL import Image
import base64
from io import BytesIO

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

# Load the image and encode it as base64 JPEG
# (JPEG has no alpha channel, so convert to RGB first)
image = Image.open("path/to/image.jpg").convert("RGB")
buffered = BytesIO()
image.save(buffered, format="JPEG")
image_base64 = base64.b64encode(buffered.getvalue()).decode()

# Chat completion with image
response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-VL-1.6B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}}
            ]
        }
    ],
    temperature=0.1,
    max_tokens=512,
    extra_body={"min_p": 0.15, "repetition_penalty": 1.05}
)

print(response.choices[0].message.content)
```
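
If the image file is already in a suitable format, re-encoding it through Pillow is optional: the raw bytes can be base64-encoded directly. A minimal sketch, assuming the MIME type can be inferred from the file name; `to_data_url` is a hypothetical helper, not part of vLLM or the OpenAI client.

```python
import base64
import mimetypes

def to_data_url(data, filename):
    """Encode raw image bytes as a `data:` URL, guessing the MIME type from the file name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {filename!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be used as the `url` value of an `image_url` content part, for example after reading the file with `pathlib.Path(path).read_bytes()`.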

<Info>
  For a complete working example, see the [vLLM Vision Model Colab notebook](https://colab.research.google.com/drive/1sUfQlqAvuAVB4bZ6akYVQPGmHtTDUNpF#scrollTo=C14m2ZWdmZWb).
</Info>
