
# TRL

> TRL (Transformer Reinforcement Learning) is a library for fine-tuning and aligning language models using methods like Supervised Fine-Tuning (SFT), Reward Modeling, and Direct Preference Optimization (DPO).

<Tip>
  Use TRL for fine-tuning with native Hugging Face integration, support for SFT, DPO, PPO, and GRPO, and access to the most recent training algorithms and techniques.
</Tip>

<Tip>
  [LEAP Finetune](/lfm/fine-tuning/leap-finetune) builds on TRL and wraps the repetitive setup for LFM training: dataset validation, config defaults, evals, distributed launch, checkpointing, and optimized paths for VLM and MoE runs.
</Tip>

Different training methods require specific dataset formats. See [Datasets](/lfm/fine-tuning/datasets) for format requirements.

## Installation

Install TRL and required dependencies:

```
pip install "trl>=0.9.0" "transformers>=4.55.0" "torch>=2.6" peft accelerate
```

* **`trl`**: Core training library
* **`peft`**: LoRA/QLoRA support
* **`accelerate`**: Multi-GPU and distributed training
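
After installing, a quick check that the packages are actually present can save a failed training launch. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical and not part of TRL.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the packages from the install command above.
for name, ver in installed_versions(
    ["trl", "transformers", "torch", "peft", "accelerate"]
).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```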

## Supervised Fine-Tuning (SFT)

[<img src="https://mintcdn.com/liquidai/DopNhNlw8MHIKIfv/images/lfm/fine-tuning/production/uploads/61b8e2ba285851687028d395/vlOyMEjwHa_b_LXysEu2E.png?fit=max&auto=format&n=DopNhNlw8MHIKIfv&q=85&s=1af5c2b712fd6c30fd310ecf956e9591" alt="Colab link" width="366" height="63" data-path="images/lfm/fine-tuning/production/uploads/61b8e2ba285851687028d395/vlOyMEjwHa_b_LXysEu2E.png" />](https://colab.research.google.com/github/Liquid4All/docs/blob/main/notebooks/💧_LFM2_5_SFT_with_TRL.ipynb)

The `SFTTrainer` makes it easy to fine-tune LFM models on instruction-following or conversational datasets. It handles chat templates, packing, and dataset formatting automatically. SFT training requires [Instruction datasets](/lfm/fine-tuning/datasets#instruction-datasets-sft).
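
As a rough illustration of a conversational SFT record, the sketch below builds one sample in the `messages` format (a list of `role`/`content` dictionaries) and checks its shape. The `validate_sft_record` helper, the allowed role set, and the example text are hypothetical. The Datasets page remains the authoritative reference for format requirements.

```python
VALID_ROLES = {"system", "user", "assistant"}  # assumed role set for this sketch


def validate_sft_record(record):
    """Return a list of problems found in one conversational record (empty if none)."""
    problems = []
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["record needs a non-empty 'messages' list"]
    for i, message in enumerate(messages):
        if message.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i}: 'content' should be a string")
    if messages[-1].get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems


record = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}
print(validate_sft_record(record))  # []
```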

### LoRA Fine-Tuning (Recommended)

LoRA (Low-Rank Adaptation) is the recommended approach for fine-tuning LFM2 models with TRL. It offers several key advantages:

* **Memory efficient**: Trains only small adapter weights (\~1-2% of model size) instead of full model parameters
* **Data efficient**: Achieves strong task performance improvements with less training data than full fine-tuning
* **Fast training**: Reduced parameter count enables faster iteration and larger effective batch sizes
* **Flexible**: Easy to switch between different task adapters without retraining the base model
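
To see where the small adapter size comes from: for a linear layer with `d_in` inputs and `d_out` outputs, LoRA trains two low-rank matrices with `r * (d_in + d_out)` parameters in total, instead of the full `d_in * d_out`. The sketch below works this out for a single square projection. The hidden size of 2048 is an illustrative assumption, not the actual LFM2 dimension, and the whole-model ratio also depends on how many layers you target.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def full_param_count(d_in, d_out):
    """Parameters in the dense weight matrix (bias ignored)."""
    return d_in * d_out


hidden = 2048  # assumed hidden size, for illustration only
rank = 16

adapter = lora_param_count(hidden, hidden, rank)
dense = full_param_count(hidden, hidden)
print(adapter, dense, f"{adapter / dense:.1%}")  # 65536 4194304 1.6%
```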

For memory-efficient fine-tuning, use LoRA with the `SFTTrainer`:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained(
    "LiquidAI/LFM2.5-1.2B-Instruct",
    dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2.5-1.2B-Instruct")

# Configure LoRA
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="./lfm2-sft-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
)

dataset = load_dataset("HuggingFaceTB/smoltalk", "all")

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
    peft_config=peft_config,
)

trainer.train()
```

<Accordion title="Full Fine-Tuning">
  Full fine-tuning updates all model parameters. Use this only when you have sufficient GPU memory and need maximum adaptation for your task.

  ```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from trl import SFTTrainer, SFTConfig
  from datasets import load_dataset

  # Load model and tokenizer
  model = AutoModelForCausalLM.from_pretrained(
      "LiquidAI/LFM2.5-1.2B-Instruct",
      dtype="auto",
      device_map="auto"
  )
  tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2.5-1.2B-Instruct")

  # Load your dataset
  dataset = load_dataset("HuggingFaceTB/smoltalk", "all")

  # Configure training
  training_args = SFTConfig(
      output_dir="./lfm2-sft",
      num_train_epochs=3,
      per_device_train_batch_size=4,
      gradient_accumulation_steps=4,
      learning_rate=2e-5,
      logging_steps=10,
      save_strategy="epoch",
      bf16=True,
  )

  # Create trainer
  trainer = SFTTrainer(
      model=model,
      args=training_args,
      train_dataset=dataset["train"],
      processing_class=tokenizer,  # named `tokenizer` in older TRL releases
  )

  # Train
  trainer.train()
  ```
</Accordion>

## Vision Language Model Fine-Tuning (VLM-SFT)

[<img src="https://mintcdn.com/liquidai/DopNhNlw8MHIKIfv/images/lfm/fine-tuning/production/uploads/61b8e2ba285851687028d395/vlOyMEjwHa_b_LXysEu2E.png?fit=max&auto=format&n=DopNhNlw8MHIKIfv&q=85&s=1af5c2b712fd6c30fd310ecf956e9591" alt="Colab link" width="366" height="63" data-path="images/lfm/fine-tuning/production/uploads/61b8e2ba285851687028d395/vlOyMEjwHa_b_LXysEu2E.png" />](https://colab.research.google.com/github/Liquid4All/docs/blob/main/notebooks/💧_LFM2_5_VL_SFT_with_TRL.ipynb)

The `SFTTrainer` also supports fine-tuning Vision Language Models like `LFM2.5-VL-1.6B` on image-text datasets. VLM fine-tuning requires [Vision datasets](/lfm/fine-tuning/datasets#vision-datasets-vlm-sft) and differs from text-only SFT in a few key ways:

* Uses `AutoModelForImageTextToText` instead of `AutoModelForCausalLM`
* Uses `AutoProcessor` instead of just a tokenizer
* Requires dataset formatting with image content types
* Needs a custom `collate_fn` for multimodal batching

### VLM LoRA Fine-Tuning (Recommended)

LoRA is recommended for VLM fine-tuning due to the larger model size and multimodal complexity:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from transformers import AutoModelForImageTextToText, AutoProcessor
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

model_id = "LiquidAI/LFM2.5-VL-1.6B"

processor = AutoProcessor.from_pretrained(
    model_id,
    max_image_tokens=256,
    trust_remote_code=True
)

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True
)

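# Format one sample as a VLM conversation (image + text).
# Assumes "image", "question", and "answer" columns; adapt the keys to your dataset's schema.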
def format_vlm_sample(sample):
    return [
        {"role": "system", "content": [{"type": "text", "text": "You are a vision assistant."}]},
        {"role": "user", "content": [
            {"type": "image", "image": sample["image"]},
            {"type": "text", "text": sample["question"]},
        ]},
        {"role": "assistant", "content": [{"type": "text", "text": sample["answer"]}]},
    ]

# Custom collate function for multimodal batching
def collate_fn(samples):
    batch = processor.apply_chat_template(samples, tokenize=True, return_dict=True, return_tensors="pt")
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    batch["labels"] = labels
    return batch

# Configure LoRA
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=8,
    bias="none",
    target_modules=["q_proj", "v_proj", "fc1", "fc2", "linear", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, peft_config)

sft_config = SFTConfig(
    output_dir="./lfm2-vl-sft-lora",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-4,
    gradient_checkpointing=True,
    max_length=512,
    dataset_kwargs={"skip_prepare_dataset": True},
)

# Load and format your dataset
dataset = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft")
train_dataset = [format_vlm_sample(s) for s in dataset["train"]]

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=train_dataset,
    data_collator=collate_fn,
    processing_class=processor.tokenizer,
)

trainer.train()
```
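
The label-masking step in `collate_fn` can be illustrated without a model. The sketch below uses plain Python lists in place of tensors and an assumed pad token ID of 0. Positions holding the pad ID get the label `-100`, which is the index PyTorch's cross-entropy loss ignores by default, so padding does not contribute to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
PAD_TOKEN_ID = 0     # assumed value; read processor.tokenizer.pad_token_id in practice


def pad_batch(sequences, pad_id=PAD_TOKEN_ID):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (width - len(seq)) for seq in sequences]


def mask_labels(input_ids, pad_id=PAD_TOKEN_ID):
    """Copy input_ids into labels, replacing pad positions with IGNORE_INDEX."""
    return [[IGNORE_INDEX if tok == pad_id else tok for tok in row] for row in input_ids]


input_ids = pad_batch([[5, 8, 13], [7, 2]])
labels = mask_labels(input_ids)
print(input_ids)  # [[5, 8, 13], [7, 2, 0]]
print(labels)     # [[5, 8, 13], [7, 2, -100]]
```

Like the `collate_fn` above, this masks only padding, so the loss is still computed over every non-pad token in the conversation, prompt included.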

<Accordion title="Full VLM Fine-Tuning">
  Full VLM fine-tuning updates all model parameters. Use this only when you have sufficient GPU memory.

  ```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
  from transformers import AutoModelForImageTextToText, AutoProcessor
  from trl import SFTTrainer, SFTConfig
  from datasets import load_dataset

  model_id = "LiquidAI/LFM2.5-VL-1.6B"

  processor = AutoProcessor.from_pretrained(
      model_id,
      max_image_tokens=256,
      trust_remote_code=True
  )

  model = AutoModelForImageTextToText.from_pretrained(
      model_id,
      dtype="bfloat16",
      device_map="auto",
      trust_remote_code=True
  )

  def format_vlm_sample(sample):
      return [
          {"role": "user", "content": [
              {"type": "image", "image": sample["image"]},
              {"type": "text", "text": sample["question"]},
          ]},
          {"role": "assistant", "content": [{"type": "text", "text": sample["answer"]}]},
      ]

  def collate_fn(samples):
      batch = processor.apply_chat_template(samples, tokenize=True, return_dict=True, return_tensors="pt")
      labels = batch["input_ids"].clone()
      labels[labels == processor.tokenizer.pad_token_id] = -100
      batch["labels"] = labels
      return batch

  sft_config = SFTConfig(
      output_dir="./lfm2-vl-sft",
      num_train_epochs=1,
      per_device_train_batch_size=1,
      gradient_accumulation_steps=16,
      learning_rate=2e-5,
      gradient_checkpointing=True,
      max_length=512,
      dataset_kwargs={"skip_prepare_dataset": True},
  )

  dataset = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft")
  train_dataset = [format_vlm_sample(s) for s in dataset["train"]]

  trainer = SFTTrainer(
      model=model,
      args=sft_config,
      train_dataset=train_dataset,
      data_collator=collate_fn,
      processing_class=processor.tokenizer,
  )

  trainer.train()
  ```
</Accordion>

## Direct Preference Optimization (DPO)

[<img src="https://mintcdn.com/liquidai/DopNhNlw8MHIKIfv/images/lfm/fine-tuning/production/uploads/61b8e2ba285851687028d395/vlOyMEjwHa_b_LXysEu2E.png?fit=max&auto=format&n=DopNhNlw8MHIKIfv&q=85&s=1af5c2b712fd6c30fd310ecf956e9591" alt="Colab link" width="366" height="63" data-path="images/lfm/fine-tuning/production/uploads/61b8e2ba285851687028d395/vlOyMEjwHa_b_LXysEu2E.png" />](https://colab.research.google.com/github/Liquid4All/docs/blob/main/notebooks/💧_LFM2_DPO_with_TRL.ipynb)

The `DPOTrainer` implements Direct Preference Optimization, a method to align models with human preferences without requiring a separate reward model. DPO training requires [Preference datasets](/lfm/fine-tuning/datasets#preference-datasets-dpo) with chosen and rejected response pairs.
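
For intuition about how DPO works without a separate reward model, the per-pair loss can be written out directly. The sketch below computes the standard sigmoid DPO loss from sequence log-probabilities under the policy and the frozen reference model. The log-probability values are made up for illustration, and this is not TRL's implementation.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Sigmoid DPO loss for one preference pair, given summed sequence log-probs."""
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the margin is 0 and the loss is log(2).
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 4))  # 0.6931

# Policy favors the chosen answer more than the reference does: the loss drops.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0) < math.log(2))  # True
```

The implicit reward here is the log-ratio between policy and reference, scaled by `beta`, which is why no explicit reward model appears anywhere in the computation.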

### DPO with LoRA (Recommended)

LoRA is highly recommended for DPO training, as it significantly reduces memory requirements while maintaining strong alignment performance.

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig
from peft import LoraConfig
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained(
    "LiquidAI/LFM2.5-1.2B-Instruct",
    dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2.5-1.2B-Instruct")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    output_dir="./lfm2-dpo-lora",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    beta=0.1,
    bf16=True,
)

dataset = load_dataset("mlabonne/orpo-dpo-mix-40k")

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
    peft_config=peft_config,
)

trainer.train()
```

<Accordion title="Full DPO Training">
  Full DPO training updates all model parameters. Use this only when you have sufficient GPU memory.

  ```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from trl import DPOTrainer, DPOConfig
  from datasets import load_dataset

  # Load model and tokenizer
  model = AutoModelForCausalLM.from_pretrained(
      "LiquidAI/LFM2.5-1.2B-Instruct",
      dtype="auto",
      device_map="auto"
  )
  tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2.5-1.2B-Instruct")

  # Load preference dataset
  # Dataset should have "prompt", "chosen", and "rejected" columns
  dataset = load_dataset("mlabonne/orpo-dpo-mix-40k")

  # Configure DPO training
  training_args = DPOConfig(
      output_dir="./lfm2-dpo",
      num_train_epochs=3,
      per_device_train_batch_size=2,
      gradient_accumulation_steps=8,
      learning_rate=5e-7,
      beta=0.1,  # DPO temperature parameter
      logging_steps=10,
      bf16=True,
  )

  # Create trainer
  trainer = DPOTrainer(
      model=model,
      args=training_args,
      train_dataset=dataset["train"],
      processing_class=tokenizer,  # named `tokenizer` in older TRL releases
  )

  # Train
  trainer.train()
  ```
</Accordion>

## Tips

* **Learning Rates**: Full fine-tuning SFT typically uses higher learning rates (1e-5 to 5e-5) than DPO (1e-7 to 1e-6). LoRA SFT runs usually go higher still; the SFT LoRA examples above use 2e-4 to 5e-4
* **Batch Size**: DPO generally benefits from larger effective batch sizes; increase `gradient_accumulation_steps` if GPU memory is limited
* **LoRA Ranks**: Start with `r=16`. Higher ranks increase adapter memory and parameter count. Set `lora_alpha` to `2 * r`
* **DPO Beta**: The `beta` parameter controls how far the trained model may deviate from the reference model; higher values keep it closer. Start with `0.1`
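
The effective batch size mentioned in the tips is the product of the per-device batch size, the gradient accumulation steps, and the number of devices. A quick sketch, using the values from the example configs on this page and assuming a single GPU:

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Values taken from the example configs; num_devices=1 is an assumption.
print(effective_batch_size(4, 4))   # SFT: 16
print(effective_batch_size(2, 8))   # DPO: 16
print(effective_batch_size(1, 16))  # VLM SFT: 16

# Halving the per-device batch and doubling accumulation keeps the same effective size.
print(effective_batch_size(1, 16) == effective_batch_size(2, 8))  # True
```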

***

For more end-to-end examples, visit the [Liquid AI Cookbook](https://github.com/Liquid4All/cookbook).