Transformers is a library for inference and training of pretrained models.
Use Transformers for simple inference without extra dependencies, research and experimentation, or integration with the Hugging Face ecosystem.
Transformers provides the most flexibility for model development and is ideal for users who want direct access to model internals. For production deployments with high throughput, consider using vLLM.
The Transformers library provides two interfaces for text generation: generate() for fine-grained control and pipeline() for simplicity. We use generate() here for direct access to model internals and explicit control over the generation process:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_id = "LiquidAI/LFM2.5-1.2B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="bfloat16",
    # attn_implementation="flash_attention_2"  # <- uncomment on compatible GPU
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate answer
prompt = "What is C. elegans?"
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512)

# Decode only the newly generated tokens (excluding the input prompt)
response = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
print(response)
# C. elegans, also known as Caenorhabditis elegans, is a small, free-living
# nematode worm (roundworm) that belongs to the phylum Nematoda.
```
Model loading notes:
model_id: Can be a Hugging Face model ID (e.g., "LiquidAI/LFM2.5-1.2B-Instruct") or a local path
device_map="auto": Automatically distributes across available GPUs/CPU (requires accelerate). Use device="cuda" for single GPU or device="cpu" for CPU only
torch_dtype="bfloat16": Recommended for modern GPUs. Use "auto" for automatic selection, or "float32" (slower, more memory)
The pipeline() interface provides a simpler API for text generation with automatic chat template handling. It wraps model loading and tokenization, making it ideal for quick prototyping.
```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    "LiquidAI/LFM2.5-1.2B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
messages = generator(messages, max_new_tokens=512)[0]["generated_text"]

messages.append({"role": "user", "content": "In a single sentence."})
messages = generator(messages, max_new_tokens=512)[0]["generated_text"]
```
Key parameters:
"text-generation": Task type for the pipeline
model_name_or_path: Model ID (e.g., "LiquidAI/LFM2.5-1.2B-Instruct") or local path (download locally with hf download --local-dir ./LFM2.5-1.2B-Instruct LiquidAI/LFM2.5-1.2B-Instruct)
torch_dtype="auto": Automatically selects optimal dtype (bfloat16 on modern devices). Can use "bfloat16" explicitly or "float32" (slower, more memory)
device_map="auto": Automatically distributes across available GPUs/CPU (requires accelerate). Alternative: device="cuda" for single GPU, device="cpu" for CPU only. Don’t mix device_map and device
The pipeline automatically handles chat templates and tokenization, returning structured output with the generated text.
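For example, with the `messages` list from the snippet above, the full conversation (including the new assistant turn) is returned, so the latest reply is simply the last message:

```python
# `generated_text` for chat-style input is the full message list,
# so the newest assistant turn is the last entry
assistant_reply = messages[-1]["content"]
print(assistant_reply)
```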
generate() also accepts stop_strings (str or list[str]): strings that terminate generation when encountered. Pass the tokenizer to generate() so the stop strings can be matched against the new tokens, as in the sketch below.
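A minimal sketch, reusing the model, tokenizer, and input_ids from the basic example above (the stop strings shown are arbitrary examples):

```python
# Stop as soon as the model emits any of these strings;
# generate() needs the tokenizer to detect them in the decoded output
output = model.generate(
    input_ids,
    max_new_tokens=512,
    stop_strings=["\n\n", "###"],
    tokenizer=tokenizer,
)
```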
Use GenerationConfig to organize parameters:
```python
from transformers import GenerationConfig

# Create a generation config
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.3,
    min_p=0.15,
    repetition_penalty=1.05,
    max_new_tokens=512,
)

# Use it in generate()
output = model.generate(input_ids, generation_config=generation_config)
```
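If the same settings should apply to every call, one option is to attach the config to the model or persist it; a short sketch (the save path is just an example):

```python
# Make the config the model's default so plain generate() calls pick it up
model.generation_config = generation_config

# Optionally save and reload it; "my-generation-config" is a hypothetical directory
generation_config.save_pretrained("my-generation-config")
reloaded_config = GenerationConfig.from_pretrained("my-generation-config")
```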
Stream responses as they’re generated using TextStreamer:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Use the model and tokenizer setup from Basic Usage above
prompt = "Tell me a story about space exploration."
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
).to(model.device)

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
output = model.generate(input_ids, streamer=streamer, max_new_tokens=512)
```
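TextStreamer prints directly to stdout. If you need to consume the text programmatically (for example, in a web server), a common pattern is TextIteratorStreamer with generation running in a background thread; a sketch reusing the setup above:

```python
from threading import Thread

from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Run generation in a background thread so we can iterate over text as it arrives
thread = Thread(
    target=model.generate,
    kwargs={"inputs": input_ids, "streamer": streamer, "max_new_tokens": 512},
)
thread.start()

for text_chunk in streamer:
    print(text_chunk, end="", flush=True)

thread.join()
```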
Process multiple prompts in a single batch for efficiency. See the batching documentation for more details:
Batching is not automatically a win for performance. For high-performance batching with optimized throughput, consider using vLLM.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use the model and tokenizer setup from Basic Usage above
# Decoder-only models should be left-padded for batched generation
tokenizer.padding_side = "left"

# Prepare multiple prompts
prompts = [
    [{"role": "user", "content": "Give me a short introduction to large language models."}],
    [{"role": "user", "content": "Give me a detailed introduction to large language models."}],
]

# Apply chat templates and tokenize (return_dict=True keeps the attention mask)
batch = tokenizer.apply_chat_template(
    prompts,
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
    padding=True,
    return_dict=True,
).to(model.device)

# Generate for all prompts in the batch
outputs = model.generate(**batch, max_new_tokens=512)

# Decode outputs
for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))
```
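Because the prompts are left-padded to a common length, the newly generated tokens of every sequence start right after the padded prompt. A small sketch of trimming them before decoding, assuming the `batch` and `outputs` from above:

```python
# The (padded) prompt occupies the first `prompt_len` positions of every row,
# so everything after that is newly generated text
prompt_len = batch["input_ids"].shape[1]
for output in outputs:
    new_tokens = output[prompt_len:]
    print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```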
You may find that distributed inference with Transformers is not as fast as you would expect. With device_map="auto", Transformers does not apply tensor parallelism: the model is split across devices and only one GPU is active at a time. For tensor parallelism with Transformers, refer to its tensor parallelism documentation; a sketch follows below.
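As a hedged sketch of that route (it assumes a recent Transformers release with built-in tensor-parallel support, a model architecture that ships a TP plan, and a multi-GPU node; the script name and GPU count are examples):

```python
# Launch with, e.g.: torchrun --nproc-per-node 4 tp_inference.py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2.5-1.2B-Instruct"

# tp_plan="auto" asks Transformers to shard supported layers across the visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    tp_plan="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "What is C. elegans?"
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True))
```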