Transformers is a library for inference and training of pretrained models.
Use Transformers for simple inference without extra dependencies, research and experimentation, or integration with the Hugging Face ecosystem.
Transformers provides the most flexibility for model development and is ideal for users who want direct access to model internals. For production deployments with high throughput, consider using vLLM.
The Transformers library provides two interfaces for text generation: generate() for fine-grained control and pipeline() for simplicity. We use generate() here for direct access to model internals and explicit control over the generation process:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_id = "LiquidAI/LFM2.5-1.2B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="bfloat16",
    # attn_implementation="flash_attention_2"  # <- uncomment on compatible GPU
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate answer
prompt = "What is C. elegans?"
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512)

# Decode only the newly generated tokens (excluding the input prompt)
response = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
print(response)
# C. elegans, also known as Caenorhabditis elegans, is a small, free-living
# nematode worm (roundworm) that belongs to the phylum Nematoda.
```
Model loading notes:
model_id: Can be a Hugging Face model ID (e.g., "LiquidAI/LFM2.5-1.2B-Instruct") or a local path
device_map="auto": Automatically distributes across available GPUs/CPU (requires accelerate). Use device="cuda" for single GPU or device="cpu" for CPU only
torch_dtype="bfloat16": Recommended for modern GPUs. Use "auto" for automatic selection, or "float32" (slower, more memory)
The pipeline() interface provides a simpler API for text generation with automatic chat template handling. It wraps model loading and tokenization, making it ideal for quick prototyping.
```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    "LiquidAI/LFM2.5-1.2B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
messages = generator(messages, max_new_tokens=512)[0]["generated_text"]

messages.append({"role": "user", "content": "In a single sentence."})
messages = generator(messages, max_new_tokens=512)[0]["generated_text"]
```
Key parameters:
"text-generation": Task type for the pipeline
model_name_or_path: Model ID (e.g., "LiquidAI/LFM2.5-1.2B-Instruct") or local path (download locally with hf download --local-dir ./LFM2.5-1.2B-Instruct LiquidAI/LFM2.5-1.2B-Instruct)
torch_dtype="auto": Automatically selects optimal dtype (bfloat16 on modern devices). Can use "bfloat16" explicitly or "float32" (slower, more memory)
device_map="auto": Automatically distributes across available GPUs/CPU (requires accelerate). Alternative: device="cuda" for single GPU, device="cpu" for CPU only. Don’t mix device_map and device
The pipeline automatically handles chat templates and tokenization, returning structured output with the generated text.
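For example, with the `messages` list from the snippet above, the full conversation (including the new assistant turn) is returned, so the latest reply is simply the last message:

```python
# `generated_text` for chat-style input is the full message list,
# so the newest assistant turn is the last entry
assistant_reply = messages[-1]["content"]
print(assistant_reply)
```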
generate() also accepts stop_strings (str or list[str]): strings that terminate generation when encountered. Pass the tokenizer to generate() so the stop strings can be matched against the new tokens, as in the sketch below.
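A minimal sketch, reusing the model, tokenizer, and input_ids from the basic example above (the stop strings shown are arbitrary examples):

```python
# Stop as soon as the model emits any of these strings;
# generate() needs the tokenizer to detect them in the decoded output
output = model.generate(
    input_ids,
    max_new_tokens=512,
    stop_strings=["\n\n", "###"],
    tokenizer=tokenizer,
)
```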
Use GenerationConfig to organize parameters:
```python
from transformers import GenerationConfig

# Create a generation config
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.3,
    min_p=0.15,
    repetition_penalty=1.05,
    max_new_tokens=512,
)

# Use it in generate()
output = model.generate(input_ids, generation_config=generation_config)
```
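If the same settings should apply to every call, one option is to attach the config to the model or persist it; a short sketch (the save path is just an example):

```python
# Make the config the model's default so plain generate() calls pick it up
model.generation_config = generation_config

# Optionally save and reload it; "my-generation-config" is a hypothetical directory
generation_config.save_pretrained("my-generation-config")
reloaded_config = GenerationConfig.from_pretrained("my-generation-config")
```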
Stream responses as they’re generated using TextStreamer:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Use the model and tokenizer setup from Basic Usage above
prompt = "Tell me a story about space exploration."
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
).to(model.device)

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
output = model.generate(input_ids, streamer=streamer, max_new_tokens=512)
```
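TextStreamer prints directly to stdout. If you need to consume the text programmatically (for example, in a web server), a common pattern is TextIteratorStreamer with generation running in a background thread; a sketch reusing the setup above:

```python
from threading import Thread

from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Run generation in a background thread so we can iterate over text as it arrives
thread = Thread(
    target=model.generate,
    kwargs={"inputs": input_ids, "streamer": streamer, "max_new_tokens": 512},
)
thread.start()

for text_chunk in streamer:
    print(text_chunk, end="", flush=True)

thread.join()
```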
Process multiple prompts in a single batch for efficiency. See the batching documentation for more details:
Batching is not automatically a win for performance. For high-performance batching with optimized throughput, consider using vLLM.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use the model and tokenizer setup from Basic Usage above
# Decoder-only models should be left-padded for batched generation
tokenizer.padding_side = "left"

# Prepare multiple prompts
prompts = [
    [{"role": "user", "content": "Give me a short introduction to large language models."}],
    [{"role": "user", "content": "Give me a detailed introduction to large language models."}],
]

# Apply chat templates and tokenize (return_dict=True keeps the attention mask)
batch = tokenizer.apply_chat_template(
    prompts,
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
    padding=True,
    return_dict=True,
).to(model.device)

# Generate for all prompts in the batch
outputs = model.generate(**batch, max_new_tokens=512)

# Decode outputs
for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))
```
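Because the prompts are left-padded to a common length, the newly generated tokens of every sequence start right after the padded prompt. A small sketch of trimming them before decoding, assuming the `batch` and `outputs` from above:

```python
# The (padded) prompt occupies the first `prompt_len` positions of every row,
# so everything after that is newly generated text
prompt_len = batch["input_ids"].shape[1]
for output in outputs:
    new_tokens = output[prompt_len:]
    print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```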
You may find that distributed inference with Transformers is not as fast as you would expect. With device_map="auto", Transformers does not apply tensor parallelism: the model is split across devices and only one GPU is active at a time. For tensor parallelism with Transformers, refer to its tensor parallelism documentation; a sketch follows below.
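As a hedged sketch of that route (it assumes a recent Transformers release with built-in tensor-parallel support, a model architecture that ships a TP plan, and a multi-GPU node; the script name and GPU count are examples):

```python
# Launch with, e.g.: torchrun --nproc-per-node 4 tp_inference.py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2.5-1.2B-Instruct"

# tp_plan="auto" asks Transformers to shard supported layers across the visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    tp_plan="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "What is C. elegans?"
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True))
```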