TRL
TRL (Transformer Reinforcement Learning) is a library for fine-tuning and aligning language models using methods like Supervised Fine-Tuning (SFT), Reward Modeling, and Direct Preference Optimization (DPO).
Use TRL for:
- Native integration in the Hugging Face ecosystem
- Many training methods, including SFT, DPO, PPO, and GRPO
- Access to the most recent training algorithms and techniques
LFM models work out-of-the-box with TRL without requiring any custom integration.
Installation
Install TRL and required dependencies:
pip install "trl>=0.9.0" "transformers>=4.55.0" "torch>=2.6" peft accelerate
- trl: Core training library
- peft: LoRA/QLoRA support
- accelerate: Multi-GPU and distributed training
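After installation, a quick sanity check (a minimal sketch) is to print the installed versions and render the LFM2 chat template, which is what the TRL trainers rely on:
import torch
import transformers
import trl
from transformers import AutoTokenizer

# Confirm the installed versions meet the minimums above
print(trl.__version__, transformers.__version__, torch.__version__)

# LFM2 ships with a chat template, so TRL can format conversations out of the box
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-1.2B")
messages = [{"role": "user", "content": "Hello!"}]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))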
Supervised Fine-Tuning (SFT)
The SFTTrainer makes it easy to fine-tune LFM models on instruction-following or conversational datasets. It handles chat templates, packing, and dataset formatting automatically.
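For reference, a conversational dataset in the "messages" format that SFTTrainer understands might look like the following (an illustrative in-memory sketch; real training data would typically come from load_dataset):
from datasets import Dataset

# Each example is a list of chat messages; SFTTrainer applies the chat template automatically
train_data = Dataset.from_list([
    {
        "messages": [
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant", "content": "The capital of France is Paris."},
        ]
    },
])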
Full Fine-Tuning
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
"LiquidAI/LFM2-1.2B",
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-1.2B")
# Load your dataset
dataset = load_dataset("your-dataset")
# Configure training
training_args = SFTConfig(
    output_dir="./lfm2-sft",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
)
# Create trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    processing_class=tokenizer,  # newer TRL versions expect processing_class rather than tokenizer
)
# Train
trainer.train()
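After training finishes, you can save the fine-tuned model and reload it for inference with the standard transformers APIs (a short sketch; the paths simply reuse the output_dir from above):
# Persist the fine-tuned model and tokenizer
trainer.save_model("./lfm2-sft")
tokenizer.save_pretrained("./lfm2-sft")

# Reload for a quick generation test
from transformers import pipeline
pipe = pipeline("text-generation", model="./lfm2-sft", device_map="auto")
print(pipe("Hello, how are you?", max_new_tokens=50))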
LoRA Fine-Tuning
For memory-efficient fine-tuning, use LoRA with the SFTTrainer:
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig
from datasets import load_dataset
model = AutoModelForCausalLM.from_pretrained(
"LiquidAI/LFM2-1.2B",
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-1.2B")
# Configure LoRA
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
training_args = SFTConfig(
    output_dir="./lfm2-sft-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
)
dataset = load_dataset("your-dataset")
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
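Once LoRA training completes, the adapter can be merged back into the base model for standalone deployment (a sketch using peft's AutoPeftModelForCausalLM; paths reuse the output_dir from above):
from peft import AutoPeftModelForCausalLM

# Save the adapter, then reload it together with the base model and merge the weights
trainer.save_model("./lfm2-sft-lora")
merged_model = AutoPeftModelForCausalLM.from_pretrained("./lfm2-sft-lora", torch_dtype="auto")
merged_model = merged_model.merge_and_unload()
merged_model.save_pretrained("./lfm2-sft-lora-merged")
tokenizer.save_pretrained("./lfm2-sft-lora-merged")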
Direct Preference Optimization (DPO)
The DPOTrainer implements Direct Preference Optimization, a method to align models with human preferences without requiring a separate reward model. DPO works with preference pairs (chosen vs. rejected responses).
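For reference, a preference dataset in the standard (non-conversational) format looks like the following (an illustrative in-memory sketch; real data would typically come from load_dataset):
from datasets import Dataset

# Each example pairs a prompt with a preferred ("chosen") and a dispreferred ("rejected") response
preference_data = Dataset.from_list([
    {
        "prompt": "Summarize the benefits of on-device inference.",
        "chosen": "On-device inference reduces latency, keeps data private, and works offline.",
        "rejected": "It is good.",
    },
])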
Full DPO Training
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
"LiquidAI/LFM2-1.2B",
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-1.2B")
# Load preference dataset
# Dataset should have "prompt", "chosen", and "rejected" columns
dataset = load_dataset("your-preference-dataset")
# Configure DPO training
training_args = DPOConfig(
    output_dir="./lfm2-dpo",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    beta=0.1,  # DPO temperature parameter
    logging_steps=10,
    bf16=True,
)
# Create trainer
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    processing_class=tokenizer,
)
# Train
trainer.train()
DPO with LoRA
For memory-efficient DPO training:
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig
from peft import LoraConfig
from datasets import load_dataset
model = AutoModelForCausalLM.from_pretrained(
"LiquidAI/LFM2-1.2B",
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-1.2B")
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
training_args = DPOConfig(
    output_dir="./lfm2-dpo-lora",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    beta=0.1,
    bf16=True,
)
dataset = load_dataset("your-preference-dataset")
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
Other Training Methods
TRL also provides additional trainers that work seamlessly with LFM models:
- RewardTrainer: Train reward models for RLHF
- PPOTrainer: Proximal Policy Optimization for reinforcement learning from human feedback
- ORPOTrainer: Odds Ratio Preference Optimization, an alternative to DPO
- KTOTrainer: Kahneman-Tversky Optimization for alignment
Refer to the TRL documentation for detailed guides on these methods.
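As a hedged sketch of how these trainers plug in, the ORPOTrainer follows the same pattern as the DPO example above and reuses the same prompt/chosen/rejected preference data (hyperparameter values here are illustrative, not tuned recommendations):
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOTrainer, ORPOConfig
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained(
    "LiquidAI/LFM2-1.2B",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-1.2B")

# Preference dataset with "prompt", "chosen", and "rejected" columns
dataset = load_dataset("your-preference-dataset")

training_args = ORPOConfig(
    output_dir="./lfm2-orpo",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    bf16=True,
)

trainer = ORPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    processing_class=tokenizer,
)
trainer.train()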
Tips
- Learning Rates: SFT typically uses higher learning rates (1e-5 to 5e-5) than DPO (1e-7 to 1e-6)
- Batch Size: DPO requires larger effective batch sizes; increase gradient_accumulation_steps if GPU memory is limited
- LoRA Ranks: Start with r=16 for experimentation; increase to r=64 or higher for better quality
- DPO Beta: The beta parameter controls the deviation from the reference model; typical values range from 0.1 to 0.5
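To make the batch-size tip concrete, the effective batch size is the product of the per-device batch size, the gradient accumulation steps, and the number of GPUs; when memory is tight, you can lower the per-device batch size and raise accumulation to keep it constant (a tiny illustrative calculation):
# Illustrative numbers matching the DPO configs above, on a single GPU
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_gpus = 1

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 16

# Halving the per-device batch size and doubling accumulation keeps the same effective batch size
print(1 * 16 * num_gpus)  # 16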
For more end-to-end examples, visit the Liquid AI Cookbook.