> ## Documentation Index
> Fetch the complete documentation index at: https://docs.liquid.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Transformers

> Transformers is a library for inference and training of pretrained models.

<Tip>
  Use Transformers for simple inference without extra dependencies, research and experimentation, or integration with the Hugging Face ecosystem.
</Tip>

Transformers provides the most flexibility for model development and is ideal for users who want direct access to model internals. For production deployments with high throughput, consider using [vLLM](/deployment/gpu-inference/vllm).

<div className="colab-link">
  <a href="https://colab.research.google.com/github/Liquid4All/docs/blob/main/notebooks/LFM2_Inference_with_Transformers.ipynb" target="_blank">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" />
  </a>
</div>

## Installation

<Note>
  We use [uv](https://docs.astral.sh/uv/#installation) to manage packages across all our code examples. It's backwards compatible with pip.
</Note>

Install the required dependencies:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
uv pip install "transformers>=5.2.0" torch accelerate
```

GPU is recommended for faster inference.

## Basic Usage

The Transformers library provides two interfaces for text generation: [`generate()`](https://huggingface.co/docs/transformers/v4.57.1/en/main_classes/text_generation#transformers.GenerationMixin.generate) for fine-grained control and [`pipeline()`](https://huggingface.co/docs/transformers/en/main_classes/pipelines) for simplicity. We use `generate()` here for direct access to model internals and explicit control over the generation process:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_id = "LiquidAI/LFM2.5-1.2B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
#    attn_implementation="flash_attention_2" <- uncomment on compatible GPU
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate answer
prompt = "What is C. elegans?"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
    return_dict=True,
).to(model.device)

output = model.generate(**inputs, do_sample=True, temperature=0.1, top_k=50, repetition_penalty=1.05, max_new_tokens=512)

# Decode only the newly generated tokens (excluding the input prompt)
input_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(output[0][input_length:], skip_special_tokens=True)
print(response)
# C. elegans, also known as Caenorhabditis elegans, is a small, free-living
# nematode worm (roundworm) that belongs to the phylum Nematoda.
```

**Model loading notes:**

* **`model_id`**: Can be a Hugging Face model ID (e.g., `"LiquidAI/LFM2.5-1.2B-Instruct"`) or a local path
* **`device_map="auto"`**: Automatically distributes across available GPUs/CPU (requires `accelerate`). Use `device="cuda"` for single GPU or `device="cpu"` for CPU only
* **`dtype="bfloat16"`**: Recommended for modern GPUs. Use `"auto"` for automatic selection, or `"float32"` (slower, more memory)

<Accordion title="Click to see a pipeline() example">
  The [`pipeline()`](https://huggingface.co/docs/transformers/en/main_classes/pipelines) interface provides a simpler API for text generation with automatic chat template handling. It wraps model loading and tokenization, making it ideal for quick prototyping.

  ```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
  from transformers import pipeline

  generator = pipeline(
      "text-generation",
      "LiquidAI/LFM2.5-1.2B-Instruct",
      dtype="auto",
      device_map="auto",
  )

  messages = [
      {"role": "user", "content": "Give me a short introduction to large language models."},
  ]

  messages = generator(messages, max_new_tokens=512)[0]["generated_text"]
  messages.append({"role": "user", "content": "In a single sentence."})
  messages = generator(messages, max_new_tokens=512)[0]["generated_text"]
  ```

  **Key parameters:**

  * **`"text-generation"`**: Task type for the pipeline
  * **`model_name_or_path`**: Model ID (e.g., `"LiquidAI/LFM2.5-1.2B-Instruct"`) or local path (download locally with `hf download --local-dir ./LFM2.5-1.2B-Instruct LiquidAI/LFM2.5-1.2B-Instruct`)
  * **`dtype="auto"`**: Automatically selects optimal dtype (`bfloat16` on modern devices). Can use `"bfloat16"` explicitly or `"float32"` (slower, more memory)
  * **`device_map="auto"`**: Automatically distributes across available GPUs/CPU (requires `accelerate`). Alternative: `device="cuda"` for single GPU, `device="cpu"` for CPU only. Don't mix `device_map` and `device`

  The pipeline automatically handles chat templates and tokenization, returning structured output with the generated text.
</Accordion>

### Generation Parameters

Control text generation behavior using [`GenerationConfig`](https://huggingface.co/docs/transformers/v4.57.1/en/main_classes/text_generation#transformers.GenerationConfig). Key parameters:

* **`do_sample`** (`bool`): Enable sampling (`True`) or greedy decoding (`False`, default)
* **`temperature`** (`float`, default 1.0): Controls randomness (0.0 = deterministic, higher = more random). Typical range: 0.1-2.0
* **`top_p`** (`float`, default 1.0): Nucleus sampling - limits to tokens with cumulative probability ≤ top\_p. Typical range: 0.1-1.0
* **`top_k`** (`int`, default 50): Limits to top-k most probable tokens. Typical range: 1-100
* **`min_p`** (`float`): Minimum token probability threshold. Typical range: 0.01-0.2
* **`max_new_tokens`** (`int`): Maximum number of tokens to generate (preferred over `max_length`)
* **`repetition_penalty`** (`float`, default 1.0): Penalty for repeating tokens (>1.0 = discourage repetition). Typical range: 1.0-1.5
* **`stop_strings`** (`str` or `list[str]`): Strings that terminate generation when encountered

Use `GenerationConfig` to organize parameters:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from transformers import GenerationConfig

# Create a generation config
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_new_tokens=512,
)

# Use it in generate()
output = model.generate(**inputs, generation_config=generation_config)
```

For a complete list of parameters, see the [GenerationConfig documentation](https://huggingface.co/docs/transformers/v4.57.1/en/main_classes/text_generation#transformers.GenerationConfig).

## Streaming Generation

Stream responses as they're generated using `TextStreamer`:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Use the model and tokenizer setup from Basic Usage above
prompt = "Tell me a story about space exploration."
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
    return_dict=True,
).to(model.device)

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
output = model.generate(**inputs, do_sample=True, temperature=0.1, top_k=50, repetition_penalty=1.05, streamer=streamer, max_new_tokens=512)
```

## Batch Generation

Process multiple prompts in a single batch for efficiency. See the [batching documentation](https://huggingface.co/docs/transformers/en/main_classes/text_generation#batch-generation) for more details:

<Note>
  Batching is not automatically a win for performance. For high-performance batching with optimized throughput, consider using [vLLM](/deployment/gpu-inference/vllm).
</Note>

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use the model and tokenizer setup from Basic Usage above

# Prepare multiple prompts
prompts = [
    [{"role": "user", "content": "Give me a short introduction to large language models."}],
    [{"role": "user", "content": "Give me a detailed introduction to large language models."}],
]

# Apply chat templates and tokenize
inputs = tokenizer.apply_chat_template(
    prompts,
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
    padding=True,
    return_dict=True,
).to(model.device)

# Generate for all prompts in batch
outputs = model.generate(**inputs, do_sample=True, temperature=0.1, top_k=50, repetition_penalty=1.05, max_new_tokens=512)

# Decode outputs
for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))
```

## Vision Models

LFM2-VL models support both text and images as input. Use `generate()` with the vision model and processor:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

# Load model and processor
model_id = "LiquidAI/LFM2.5-VL-1.6B"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load image and create conversation
url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
image = load_image(url)  # or use PIL Image: Image.open("path/to/image.jpg")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is in this image?"},
        ],
    },
]

# Apply chat template and generate
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

outputs = model.generate(**inputs, do_sample=True, temperature=0.1, min_p=0.15, repetition_penalty=1.05, max_new_tokens=256)
response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```

<Accordion title="Multiple Images Example">
  ```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
  from transformers import AutoProcessor, AutoModelForImageTextToText
  from transformers.image_utils import load_image

  # Use the model and processor setup from above
  image1 = load_image("path/to/first.jpg")
  image2 = load_image("path/to/second.jpg")

  conversation = [
      {
          "role": "user",
          "content": [
              {"type": "image", "image": image1},
              {"type": "image", "image": image2},
              {"type": "text", "text": "What are the differences between these two images?"}
          ],
      },
  ]

  inputs = processor.apply_chat_template(
      conversation,
      add_generation_prompt=True,
      return_tensors="pt",
      return_dict=True,
      tokenize=True,
  ).to(model.device)

  outputs = model.generate(**inputs, do_sample=True, temperature=0.1, min_p=0.15, repetition_penalty=1.05, max_new_tokens=256)
  response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
  print(response)
  ```
</Accordion>

## FAQ

You may find distributed inference with Transformers is not as fast as you would imagine. Transformers with `device_map="auto"` does not apply tensor parallelism, and it only uses one GPU at a time. For Transformers with tensor parallelism, please refer to [its documentation](https://huggingface.co/docs/transformers/v4.51.3/en/perf_infer_gpu_multi).
