> ## Documentation Index > Fetch the complete documentation index at: https://docs.liquid.ai/llms.txt > Use this file to discover all available pages before exploring further. # Transformers > Transformers is a library for inference and training of pretrained models. Use Transformers for simple inference without extra dependencies, research and experimentation, or integration with the Hugging Face ecosystem. Transformers provides the most flexibility for model development and is ideal for users who want direct access to model internals. For production deployments with high throughput, consider using [vLLM](/deployment/gpu-inference/vllm).

## Installation We use [uv](https://docs.astral.sh/uv/#installation) to manage packages across all our code examples. It's backwards compatible with pip. Install the required dependencies: ```bash theme={"theme":{"light":"github-light","dark":"github-dark"}} uv pip install "transformers>=5.2.0" torch accelerate ``` GPU is recommended for faster inference. ## Basic Usage The Transformers library provides two interfaces for text generation: [`generate()`](https://huggingface.co/docs/transformers/v4.57.1/en/main_classes/text_generation#transformers.GenerationMixin.generate) for fine-grained control and [`pipeline()`](https://huggingface.co/docs/transformers/en/main_classes/pipelines) for simplicity. We use `generate()` here for direct access to model internals and explicit control over the generation process: ```python theme={"theme":{"light":"github-light","dark":"github-dark"}} from transformers import AutoModelForCausalLM, AutoTokenizer # Load model and tokenizer model_id = "LiquidAI/LFM2.5-1.2B-Instruct" model = AutoModelForCausalLM.from_pretrained( model_id, device_map="auto", dtype="bfloat16", # attn_implementation="flash_attention_2" <- uncomment on compatible GPU ) tokenizer = AutoTokenizer.from_pretrained(model_id) # Generate answer prompt = "What is C. elegans?" inputs = tokenizer.apply_chat_template( [{"role": "user", "content": prompt}], add_generation_prompt=True, return_tensors="pt", tokenize=True, return_dict=True, ).to(model.device) output = model.generate(**inputs, do_sample=True, temperature=0.1, top_k=50, repetition_penalty=1.05, max_new_tokens=512) # Decode only the newly generated tokens (excluding the input prompt) input_length = inputs["input_ids"].shape[1] response = tokenizer.decode(output[0][input_length:], skip_special_tokens=True) print(response) # C. elegans, also known as Caenorhabditis elegans, is a small, free-living # nematode worm (roundworm) that belongs to the phylum Nematoda. ``` **Model loading notes:** * **`model_id`**: Can be a Hugging Face model ID (e.g., `"LiquidAI/LFM2.5-1.2B-Instruct"`) or a local path * **`device_map="auto"`**: Automatically distributes across available GPUs/CPU (requires `accelerate`). Use `device="cuda"` for single GPU or `device="cpu"` for CPU only * **`dtype="bfloat16"`**: Recommended for modern GPUs. Use `"auto"` for automatic selection, or `"float32"` (slower, more memory) The [`pipeline()`](https://huggingface.co/docs/transformers/en/main_classes/pipelines) interface provides a simpler API for text generation with automatic chat template handling. It wraps model loading and tokenization, making it ideal for quick prototyping. ```python theme={"theme":{"light":"github-light","dark":"github-dark"}} from transformers import pipeline generator = pipeline( "text-generation", "LiquidAI/LFM2.5-1.2B-Instruct", dtype="auto", device_map="auto", ) messages = [ {"role": "user", "content": "Give me a short introduction to large language models."}, ] messages = generator(messages, max_new_tokens=512)[0]["generated_text"] messages.append({"role": "user", "content": "In a single sentence."}) messages = generator(messages, max_new_tokens=512)[0]["generated_text"] ``` **Key parameters:** * **`"text-generation"`**: Task type for the pipeline * **`model_name_or_path`**: Model ID (e.g., `"LiquidAI/LFM2.5-1.2B-Instruct"`) or local path (download locally with `hf download --local-dir ./LFM2.5-1.2B-Instruct LiquidAI/LFM2.5-1.2B-Instruct`) * **`dtype="auto"`**: Automatically selects optimal dtype (`bfloat16` on modern devices). Can use `"bfloat16"` explicitly or `"float32"` (slower, more memory) * **`device_map="auto"`**: Automatically distributes across available GPUs/CPU (requires `accelerate`). Alternative: `device="cuda"` for single GPU, `device="cpu"` for CPU only. Don't mix `device_map` and `device` The pipeline automatically handles chat templates and tokenization, returning structured output with the generated text. ### Generation Parameters Control text generation behavior using [`GenerationConfig`](https://huggingface.co/docs/transformers/v4.57.1/en/main_classes/text_generation#transformers.GenerationConfig). Key parameters: * **`do_sample`** (`bool`): Enable sampling (`True`) or greedy decoding (`False`, default) * **`temperature`** (`float`, default 1.0): Controls randomness (0.0 = deterministic, higher = more random). Typical range: 0.1-2.0 * **`top_p`** (`float`, default 1.0): Nucleus sampling - limits to tokens with cumulative probability ≤ top\_p. Typical range: 0.1-1.0 * **`top_k`** (`int`, default 50): Limits to top-k most probable tokens. Typical range: 1-100 * **`min_p`** (`float`): Minimum token probability threshold. Typical range: 0.01-0.2 * **`max_new_tokens`** (`int`): Maximum number of tokens to generate (preferred over `max_length`) * **`repetition_penalty`** (`float`, default 1.0): Penalty for repeating tokens (>1.0 = discourage repetition). Typical range: 1.0-1.5 * **`stop_strings`** (`str` or `list[str]`): Strings that terminate generation when encountered Use `GenerationConfig` to organize parameters: ```python theme={"theme":{"light":"github-light","dark":"github-dark"}} from transformers import GenerationConfig # Create a generation config generation_config = GenerationConfig( do_sample=True, temperature=0.1, top_k=50, repetition_penalty=1.05, max_new_tokens=512, ) # Use it in generate() output = model.generate(**inputs, generation_config=generation_config) ``` For a complete list of parameters, see the [GenerationConfig documentation](https://huggingface.co/docs/transformers/v4.57.1/en/main_classes/text_generation#transformers.GenerationConfig). ## Streaming Generation Stream responses as they're generated using `TextStreamer`: ```python theme={"theme":{"light":"github-light","dark":"github-dark"}} from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer # Use the model and tokenizer setup from Basic Usage above prompt = "Tell me a story about space exploration." inputs = tokenizer.apply_chat_template( [{"role": "user", "content": prompt}], add_generation_prompt=True, return_tensors="pt", tokenize=True, return_dict=True, ).to(model.device) streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True) output = model.generate(**inputs, do_sample=True, temperature=0.1, top_k=50, repetition_penalty=1.05, streamer=streamer, max_new_tokens=512) ``` ## Batch Generation Process multiple prompts in a single batch for efficiency. See the [batching documentation](https://huggingface.co/docs/transformers/en/main_classes/text_generation#batch-generation) for more details: Batching is not automatically a win for performance. For high-performance batching with optimized throughput, consider using [vLLM](/deployment/gpu-inference/vllm). ```python theme={"theme":{"light":"github-light","dark":"github-dark"}} from transformers import AutoModelForCausalLM, AutoTokenizer # Use the model and tokenizer setup from Basic Usage above # Prepare multiple prompts prompts = [ [{"role": "user", "content": "Give me a short introduction to large language models."}], [{"role": "user", "content": "Give me a detailed introduction to large language models."}], ] # Apply chat templates and tokenize inputs = tokenizer.apply_chat_template( prompts, add_generation_prompt=True, return_tensors="pt", tokenize=True, padding=True, return_dict=True, ).to(model.device) # Generate for all prompts in batch outputs = model.generate(**inputs, do_sample=True, temperature=0.1, top_k=50, repetition_penalty=1.05, max_new_tokens=512) # Decode outputs for output in outputs: print(tokenizer.decode(output, skip_special_tokens=True)) ``` ## Vision Models LFM2-VL models support both text and images as input. Use `generate()` with the vision model and processor: ```python theme={"theme":{"light":"github-light","dark":"github-dark"}} from transformers import AutoProcessor, AutoModelForImageTextToText from transformers.image_utils import load_image # Load model and processor model_id = "LiquidAI/LFM2.5-VL-1.6B" model = AutoModelForImageTextToText.from_pretrained( model_id, device_map="auto", dtype="bfloat16" ) processor = AutoProcessor.from_pretrained(model_id) # Load image and create conversation url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" image = load_image(url) # or use PIL Image: Image.open("path/to/image.jpg") conversation = [ { "role": "user", "content": [ {"type": "image", "image": image}, {"type": "text", "text": "What is in this image?"}, ], }, ] # Apply chat template and generate inputs = processor.apply_chat_template( conversation, add_generation_prompt=True, return_tensors="pt", return_dict=True, tokenize=True, ).to(model.device) outputs = model.generate(**inputs, do_sample=True, temperature=0.1, min_p=0.15, repetition_penalty=1.05, max_new_tokens=256) response = processor.batch_decode(outputs, skip_special_tokens=True)[0] print(response) ``` ```python theme={"theme":{"light":"github-light","dark":"github-dark"}} from transformers import AutoProcessor, AutoModelForImageTextToText from transformers.image_utils import load_image # Use the model and processor setup from above image1 = load_image("path/to/first.jpg") image2 = load_image("path/to/second.jpg") conversation = [ { "role": "user", "content": [ {"type": "image", "image": image1}, {"type": "image", "image": image2}, {"type": "text", "text": "What are the differences between these two images?"} ], }, ] inputs = processor.apply_chat_template( conversation, add_generation_prompt=True, return_tensors="pt", return_dict=True, tokenize=True, ).to(model.device) outputs = model.generate(**inputs, do_sample=True, temperature=0.1, min_p=0.15, repetition_penalty=1.05, max_new_tokens=256) response = processor.batch_decode(outputs, skip_special_tokens=True)[0] print(response) ``` ## FAQ You may find distributed inference with Transformers is not as fast as you would imagine. Transformers with `device_map="auto"` does not apply tensor parallelism, and it only uses one GPU at a time. For Transformers with tensor parallelism, please refer to [its documentation](https://huggingface.co/docs/transformers/v4.51.3/en/perf_infer_gpu_multi).