Transformers provides the most flexibility for model development and is ideal for users who want direct access to model internals. For production deployments with high throughput, consider using vLLM.

Documentation Index
Fetch the complete documentation index at: https://docs.liquid.ai/llms.txt
Use this file to discover all available pages before exploring further.
Installation
We use uv to manage packages across all our code examples. It’s backwards compatible with pip.
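A minimal install sketch, assuming the standard PyPI package names (transformers for the models, accelerate for device_map="auto"; uv's pip interface mirrors pip):

```bash
# Assumed packages: transformers for models/tokenizers, accelerate for device_map="auto"
uv pip install transformers accelerate
```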
Basic Usage
The Transformers library provides two interfaces for text generation: generate() for fine-grained control and pipeline() for simplicity. We use generate() here for direct access to model internals and explicit control over the generation process:
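A minimal sketch of the generate() path, using the model ID from the parameter notes below; the prompt and generation length are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2.5-1.2B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # requires accelerate
    dtype="bfloat16",   # older Transformers versions call this torch_dtype
)

# Build a chat-formatted prompt and generate a reply
messages = [{"role": "user", "content": "What is the capital of France?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```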
Key parameters:
- model_id: Can be a Hugging Face model ID (e.g., "LiquidAI/LFM2.5-1.2B-Instruct") or a local path
- device_map="auto": Automatically distributes across available GPUs/CPU (requires accelerate). Use device="cuda" for a single GPU or device="cpu" for CPU only
- dtype="bfloat16": Recommended for modern GPUs. Use "auto" for automatic selection, or "float32" (slower, more memory)
The pipeline() interface provides a simpler API for text generation with automatic chat template handling. It wraps model loading and tokenization, making it ideal for quick prototyping. Key parameters, used in the sketch after this list:
- "text-generation": Task type for the pipeline
- model_name_or_path: Model ID (e.g., "LiquidAI/LFM2.5-1.2B-Instruct") or local path (download locally with hf download --local-dir ./LFM2.5-1.2B-Instruct LiquidAI/LFM2.5-1.2B-Instruct)
- dtype="auto": Automatically selects the optimal dtype (bfloat16 on modern devices). Can use "bfloat16" explicitly or "float32" (slower, more memory)
- device_map="auto": Automatically distributes across available GPUs/CPU (requires accelerate). Alternative: device="cuda" for a single GPU, device="cpu" for CPU only. Don't mix device_map and device
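A minimal pipeline() sketch along these lines; the prompt is illustrative, and the final line assumes the chat-style output format (a list of messages under "generated_text") used by recent Transformers releases:

```python
from transformers import pipeline

model_name_or_path = "LiquidAI/LFM2.5-1.2B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_name_or_path,
    dtype="auto",
    device_map="auto",  # requires accelerate; don't also pass device=
)

messages = [{"role": "user", "content": "What is the capital of France?"}]
result = pipe(messages, max_new_tokens=256)
print(result[0]["generated_text"][-1]["content"])  # the assistant's reply
```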
Generation Parameters
Control text generation behavior using GenerationConfig. Key parameters:
- do_sample (bool): Enable sampling (True) or greedy decoding (False, default)
- temperature (float, default 1.0): Controls randomness (0.0 = deterministic, higher = more random). Typical range: 0.1-2.0
- top_p (float, default 1.0): Nucleus sampling; limits to tokens with cumulative probability ≤ top_p. Typical range: 0.1-1.0
- top_k (int, default 50): Limits to the top-k most probable tokens. Typical range: 1-100
- min_p (float): Minimum token probability threshold. Typical range: 0.01-0.2
- max_new_tokens (int): Maximum number of tokens to generate (preferred over max_length)
- repetition_penalty (float, default 1.0): Penalty for repeating tokens (>1.0 discourages repetition). Typical range: 1.0-1.5
- stop_strings (str or list[str]): Strings that terminate generation when encountered
Use GenerationConfig to organize these parameters:
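A minimal sketch, reusing the model, tokenizer, and input_ids from the Basic Usage example above; the specific parameter values are illustrative:

```python
from transformers import GenerationConfig

# Bundle sampling settings in one object instead of passing them individually
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
    max_new_tokens=256,
)

output = model.generate(input_ids, generation_config=generation_config)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```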
Streaming Generation
Stream responses as they're generated using TextStreamer:
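A minimal sketch, reusing the model, tokenizer, and input_ids from the Basic Usage example; TextStreamer prints tokens to stdout as they are produced:

```python
from transformers import TextStreamer

# skip_prompt avoids re-printing the input; skip_special_tokens is a decode kwarg
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(input_ids, max_new_tokens=256, streamer=streamer)
```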
Batch Generation
Process multiple prompts in a single batch for efficiency. See the batching documentation for more details. Note that batching is not automatically a win for performance; for high-performance batching with optimized throughput, consider using vLLM.
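A minimal batching sketch, reusing the model and model_id from the Basic Usage example; left padding and the sample prompts are assumptions for illustration:

```python
from transformers import AutoTokenizer

# Left padding keeps all prompts aligned at the generation boundary.
# Assumes the tokenizer defines a pad token; if not, set
# tokenizer.pad_token = tokenizer.eos_token before batching.
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")

conversations = [
    [{"role": "user", "content": "What is the capital of France?"}],
    [{"role": "user", "content": "Write a haiku about GPUs."}],
]
batch = tokenizer.apply_chat_template(
    conversations,
    add_generation_prompt=True,
    padding=True,
    return_tensors="pt",
    return_dict=True,  # include attention_mask for the padded batch
).to(model.device)

outputs = model.generate(**batch, max_new_tokens=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```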
Vision Models
LFM2-VL models support both text and images as input. Use generate() with the vision model and processor:
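A minimal sketch using the standard image-text-to-text pattern; "LiquidAI/LFM2-VL-1.6B" and the local image path are assumptions, so substitute the LFM2-VL checkpoint and image you actually use:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "LiquidAI/LFM2-VL-1.6B"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", dtype="bfloat16"
)

image = Image.open("example.jpg")  # hypothetical local image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```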
Multiple Images Example
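A minimal sketch reusing the processor and model from the example above; the image paths and question are hypothetical. Several images go into a single user turn as separate content entries:

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open("first.jpg")},   # hypothetical paths
            {"type": "image", "image": Image.open("second.jpg")},
            {"type": "text", "text": "What differs between these two images?"},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```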
FAQ
You may find that distributed inference with Transformers is slower than you would expect. Transformers with device_map="auto" does not apply tensor parallelism; it places layers across devices and uses only one GPU at a time. For running Transformers with tensor parallelism, please refer to its documentation.