TRL
TRL (Transformer Reinforcement Learning) is a library for fine-tuning and aligning language models using methods like Supervised Fine-Tuning (SFT), Reward Modeling, and Direct Preference Optimization (DPO).
- Native integration in the Hugging Face ecosystem
- Trainers for many methods, including SFT, DPO, PPO, and GRPO
- Most recent training algorithms and techniques
LFM models work out-of-the-box with TRL without requiring any custom integration.
Installation
Install TRL and required dependencies:
pip install "trl>=0.9.0" "transformers>=4.55.0" "torch>=2.6" peft accelerate
- trl: Core training library
- peft: LoRA/QLoRA support
- accelerate: Multi-GPU and distributed training
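A quick way to confirm the installation (a minimal check; it simply prints whichever versions ended up installed):
import torch, transformers, trl, peft, accelerate  # all five should import without error
print("trl", trl.__version__, "| transformers", transformers.__version__, "| torch", torch.__version__)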
Supervised Fine-Tuning (SFT)
The SFTTrainer makes it easy to fine-tune LFM models on instruction-following or conversational datasets. It handles chat templates, packing, and dataset formatting automatically.
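For reference, a conversational dataset only needs a "messages" column whose entries are role/content turns; the tokenizer's chat template is applied for you during training. The record below is illustrative:
# One record in the conversational format accepted by SFTTrainer
example = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}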
LoRA Fine-Tuning (Recommended)
LoRA (Low-Rank Adaptation) is the recommended approach for fine-tuning LFM2 models with TRL. It offers several key advantages:
- Memory efficient: Trains only small adapter weights (~1-2% of model size) instead of full model parameters (see the sketch after this list)
- Data efficient: Achieves strong task performance improvements with less training data than full fine-tuning
- Fast training: Reduced parameter count enables faster iteration and larger effective batch sizes
- Flexible: Easy to switch between different task adapters without retraining the base model
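A quick way to see the trainable fraction for a given LoRA configuration (a sketch using peft's get_peft_model; the exact percentage depends on the rank and target modules you choose):
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("LiquidAI/LFM2-1.2B")
lora_model = get_peft_model(
    base,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], task_type="CAUSAL_LM"),
)
# Prints trainable params, total params, and the trainable percentage
lora_model.print_trainable_parameters()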
For memory-efficient fine-tuning, use LoRA with the SFTTrainer:
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig
from datasets import load_dataset
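# Load model and tokenizer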
model = AutoModelForCausalLM.from_pretrained(
"LiquidAI/LFM2-1.2B",
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-1.2B")
# Configure LoRA
peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
task_type="CAUSAL_LM",
)
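# Configure training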
training_args = SFTConfig(
output_dir="./lfm2-sft-lora",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
bf16=True,
)
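# Load your dataset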
dataset = load_dataset("your-dataset")
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
    processing_class=tokenizer,
peft_config=peft_config,
)
trainer.train()
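After training, the adapter can be saved on its own and, if a standalone checkpoint is preferred for deployment, merged back into the base model. A sketch that continues the script above (the output paths are placeholders):
# Save only the LoRA adapter weights
trainer.save_model("./lfm2-sft-lora/adapter")

# Optionally merge the adapter into the base model for standalone use
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained("LiquidAI/LFM2-1.2B", torch_dtype="auto")
merged = PeftModel.from_pretrained(base, "./lfm2-sft-lora/adapter").merge_and_unload()
merged.save_pretrained("./lfm2-sft-lora/merged")
tokenizer.save_pretrained("./lfm2-sft-lora/merged")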
Full Fine-Tuning
Full fine-tuning updates all model parameters. Use this only when you have sufficient GPU memory and need maximum adaptation for your task.
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
"LiquidAI/LFM2-1.2B",
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-1.2B")
# Load your dataset
dataset = load_dataset("your-dataset")
# Configure training
training_args = SFTConfig(
output_dir="./lfm2-sft",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-5,
logging_steps=10,
save_strategy="epoch",
bf16=True,
)
# Create trainer
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
    processing_class=tokenizer,
)
# Train
trainer.train()
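As a quick sanity check after training, generation can be run against the saved checkpoint; the pipeline applies the model's chat template to message-format input. A sketch that continues the script above (the prompt is illustrative):
# Save the final model and tokenizer, then generate from the checkpoint
trainer.save_model("./lfm2-sft")
tokenizer.save_pretrained("./lfm2-sft")

from transformers import pipeline
pipe = pipeline("text-generation", model="./lfm2-sft", torch_dtype="auto", device_map="auto")
messages = [{"role": "user", "content": "Summarize what you learned in one sentence."}]
print(pipe(messages, max_new_tokens=64)[0]["generated_text"])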
Direct Preference Optimization (DPO)
The DPOTrainer implements Direct Preference Optimization, a method to align models with human preferences without requiring a separate reward model. DPO works with preference pairs (chosen vs. rejected responses).
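For reference, each record pairs one prompt with a preferred and a dispreferred completion. The record below is illustrative:
# One preference-pair record in the "prompt"/"chosen"/"rejected" format used by DPOTrainer
example = {
    "prompt": "Explain overfitting in one sentence.",
    "chosen": "Overfitting is when a model fits noise in the training data and generalizes poorly to new data.",
    "rejected": "Overfitting means the model is training too quickly.",
}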
DPO with LoRA (Recommended)
LoRA is highly recommended for DPO training, as it significantly reduces memory requirements while maintaining strong alignment performance. When a peft_config is provided, the trainer can also use the base model with the adapters disabled as the implicit reference model, so no separate reference copy has to be loaded.
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig
from peft import LoraConfig
from datasets import load_dataset
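# Load model and tokenizer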
model = AutoModelForCausalLM.from_pretrained(
"LiquidAI/LFM2-1.2B",
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-1.2B")
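# Configure LoRA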
peft_config = LoraConfig(
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
task_type="CAUSAL_LM",
)
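# Configure DPO training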
training_args = DPOConfig(
output_dir="./lfm2-dpo-lora",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
learning_rate=5e-7,
beta=0.1,
bf16=True,
)
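# Load preference dataset with "prompt", "chosen", and "rejected" columns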
dataset = load_dataset("your-preference-dataset")
trainer = DPOTrainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
    processing_class=tokenizer,
peft_config=peft_config,
)
trainer.train()
Full DPO Training
Full DPO training updates all model parameters. Use this only when you have sufficient GPU memory.
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
"LiquidAI/LFM2-1.2B",
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-1.2B")
# Load preference dataset
# Dataset should have "prompt", "chosen", and "rejected" columns
dataset = load_dataset("your-preference-dataset")
# Configure DPO training
training_args = DPOConfig(
output_dir="./lfm2-dpo",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
learning_rate=5e-7,
beta=0.1, # DPO temperature parameter
logging_steps=10,
bf16=True,
)
# Create trainer
trainer = DPOTrainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
    processing_class=tokenizer,
)
# Train
trainer.train()
Other Training Methods
TRL also provides additional trainers that work seamlessly with LFM models:
- RewardTrainer: Train reward models for RLHF
- PPOTrainer: Proximal Policy Optimization for reinforcement learning from human feedback
- ORPOTrainer: Odds Ratio Preference Optimization, an alternative to DPO
- KTOTrainer: Kahneman-Tversky Optimization for alignment
Refer to the TRL documentation for detailed guides on these methods.
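As an illustration of how little the setup changes across these trainers, below is a minimal ORPOTrainer sketch. It reuses the placeholder preference dataset and LoRA settings from the DPO example, and the hyperparameters are starting points rather than tuned values; see the TRL docs for the full set of ORPOConfig options.
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOTrainer, ORPOConfig
from peft import LoraConfig
from datasets import load_dataset

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("LiquidAI/LFM2-1.2B", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-1.2B")

# ORPO consumes the same prompt/chosen/rejected preference pairs as DPO,
# but needs no reference model at all
dataset = load_dataset("your-preference-dataset")

training_args = ORPOConfig(
    output_dir="./lfm2-orpo-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=8e-6,
    beta=0.1,  # weight of the odds-ratio preference term
    bf16=True,
)

trainer = ORPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    processing_class=tokenizer,
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], task_type="CAUSAL_LM"),
)
trainer.train()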
Tips
- Learning Rates: SFT typically uses higher learning rates (1e-5 to 5e-5 for full fine-tuning; LoRA runs tolerate higher rates, such as the 2e-4 used above) than DPO (1e-7 to 1e-6)
- Batch Size: DPO requires larger effective batch sizes; increase gradient_accumulation_steps if GPU memory is limited
- LoRA Ranks: Start with r=16 for experimentation; increase to r=64 or higher for better quality
- DPO Beta: The beta parameter controls the deviation from the reference model; typical values range from 0.1 to 0.5
For more end-to-end examples, visit the Liquid AI Cookbook.