Unsloth
Unsloth is a library for fast, memory-efficient LLM fine-tuning built on optimized kernels and careful memory management. Key benefits:
- 2-5x faster training with optimized kernels
- 70% less memory usage through efficient management
- Utilities for quantization and quick inference
Installation
Install Unsloth and required dependencies:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
For a stable release:
pip install unsloth "transformers>=4.55.0" "torch>=2.6"
Optional: Install xformers for additional memory optimizations.
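For example (choose the build that matches your PyTorch and CUDA versions; see the xformers documentation):
pip install xformers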
Supervised Fine-Tuning (SFT)
Unsloth provides the FastLanguageModel wrapper that automatically applies optimizations to your model and integrates seamlessly with TRL's SFTTrainer.
LoRA Fine-Tuning (Recommended)
LoRA (Low-Rank Adaptation) is the recommended approach for fine-tuning LFM2 models with Unsloth. Combined with Unsloth's optimizations, LoRA offers:
- Memory efficient: Trains only small adapter weights (~1-2% of model size) instead of full model parameters
- Data efficient: Achieves strong task performance improvements with less training data than full fine-tuning
- Fast training: Unsloth's optimized kernels combined with reduced parameters enable 2-5x faster training
- Flexible: Easy to switch between different task adapters without retraining the base model
Unsloth provides optimized LoRA support through FastLanguageModel.get_peft_model():
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# Load model with Unsloth optimizations
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="LiquidAI/LFM2-1.2B",
max_seq_length=2048,
dtype=None,
load_in_4bit=False,
)
# Apply LoRA with Unsloth
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
lora_alpha=32,
lora_dropout=0.05,
bias="none",
use_gradient_checkpointing="unsloth", # Unsloth's optimized gradient checkpointing
random_state=42,
)
training_args = SFTConfig(
output_dir="./lfm2-unsloth-lora",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
logging_steps=10,
bf16=True,
)
dataset = load_dataset("your-dataset")
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
tokenizer=tokenizer,
)
trainer.train()
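SFTTrainer needs a text column (or a formatting function) to train on. As a minimal sketch, assuming your dataset stores conversations in a hypothetical messages column of role/content dicts, you can render each conversation with the tokenizer's chat template and point dataset_text_field in SFTConfig at the result:
# Hypothetical preprocessing: render chat messages into a single "text" column
def format_example(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

dataset = dataset.map(format_example)
# Then set dataset_text_field="text" in SFTConfig so the trainer knows which column to use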
QLoRA (4-Bit Quantization)
For maximum memory efficiency on resource-constrained hardware, use QLoRA with 4-bit quantization. This reduces memory usage by ~4x while maintaining strong performance.
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# Load model in 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="LiquidAI/LFM2-1.2B",
max_seq_length=2048,
dtype=None,
load_in_4bit=True, # Enable 4-bit quantization
)
# Apply LoRA
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
lora_alpha=32,
lora_dropout=0.05,
bias="none",
use_gradient_checkpointing="unsloth",
random_state=42,
)
training_args = SFTConfig(
output_dir="./lfm2-unsloth-qlora",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
learning_rate=2e-4,
logging_steps=10,
bf16=True,
)
dataset = load_dataset("your-dataset")
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
tokenizer=tokenizer,
)
trainer.train()
Full Fine-Tuning
Full fine-tuning updates all model parameters. Use this only when you have sufficient GPU memory and need maximum adaptation for your task.
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# Load model with Unsloth optimizations
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="LiquidAI/LFM2-1.2B",
max_seq_length=2048,
dtype=None, # Auto-detect
load_in_4bit=False,
)
# Load your dataset
dataset = load_dataset("your-dataset")
# Configure training
training_args = SFTConfig(
output_dir="./lfm2-unsloth-sft",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-5,
logging_steps=10,
save_strategy="epoch",
bf16=True,
)
# Create trainer
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
tokenizer=tokenizer,
)
# Train
trainer.train()
Saving Models
After training, save your model:
# Save LoRA adapters only (lightweight)
model.save_pretrained("./lfm2-lora-adapters")
tokenizer.save_pretrained("./lfm2-lora-adapters")
# Or save merged model (full weights)
model.save_pretrained_merged("./lfm2-merged", tokenizer)
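Unsloth can also export the merged weights to GGUF for llama.cpp-compatible runtimes. A minimal sketch; the quantization method shown is one common choice among those Unsloth supports:
# Export merged weights to GGUF (quantization_method selects the GGUF quantization level)
model.save_pretrained_gguf("./lfm2-gguf", tokenizer, quantization_method="q4_k_m")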
Inference
Load and run inference with your fine-tuned model:
from unsloth import FastLanguageModel
# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="./lfm2-lora-adapters",
max_seq_length=2048,
dtype=None,
load_in_4bit=False,
)
# Enable inference mode for faster generation
FastLanguageModel.for_inference(model)
# Generate
inputs = tokenizer("Your prompt here", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
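For chat-style prompts, apply the model's chat template before generating. A minimal sketch using the standard transformers tokenizer API:
messages = [{"role": "user", "content": "Your prompt here"}]
# Build input IDs with the model's chat template and a generation prompt
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")
outputs = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))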
Direct Preference Optimization (DPO)
Unsloth also supports DPO training with the DPOTrainer:
from unsloth import FastLanguageModel, PatchDPOTrainer
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
# Patch DPO for Unsloth optimizations
PatchDPOTrainer()
# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="LiquidAI/LFM2-1.2B",
max_seq_length=2048,
dtype=None,
load_in_4bit=False,
)
# Apply LoRA
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
lora_alpha=32,
lora_dropout=0.05,
bias="none",
use_gradient_checkpointing="unsloth",
random_state=42,
)
# Load preference dataset
dataset = load_dataset("your-preference-dataset")
training_args = DPOConfig(
output_dir="./lfm2-unsloth-dpo",
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
learning_rate=5e-7,
beta=0.1,
logging_steps=10,
bf16=True,
)
trainer = DPOTrainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
tokenizer=tokenizer,
)
trainer.train()
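DPOTrainer expects each example to provide prompt, chosen, and rejected fields. A minimal sketch of that structure, with hypothetical contents:
from datasets import Dataset

# Hypothetical preference examples: "chosen" is preferred over "rejected" for the same prompt
preference_dataset = Dataset.from_list([
    {
        "prompt": "Summarize the plot of Hamlet in one sentence.",
        "chosen": "Prince Hamlet seeks revenge for his father's murder, and the pursuit ends in tragedy for the Danish court.",
        "rejected": "Hamlet is a play by Shakespeare.",
    },
])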
Multi-GPU support exists in Unsloth but is less thoroughly documented and tested than in frameworks such as TRL or Axolotl. For production multi-GPU training, consider using TRL or Axolotl, applying Unsloth optimizations where available.
Tips
- max_seq_length: Set to your expected maximum sequence length; Unsloth pre-allocates memory for efficiency
- load_in_4bit: Enables QLoRA, reducing memory by ~4x with minimal quality loss
- use_gradient_checkpointing: Use "unsloth" for faster checkpointing than the default
- Target modules: Include MLP layers (gate_proj, up_proj, down_proj) for better quality, especially on smaller models
- Batch size: Unsloth's optimizations allow larger batch sizes; experiment to maximize GPU utilization (see the worked example below)
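For reference, the effective batch size is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs; the LoRA configuration above works out to 4 × 4 = 16 samples per optimizer step on a single GPU.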
For more end-to-end examples, visit the Liquid AI Cookbook.