Axolotl
Axolotl is a YAML-driven fine-tuning toolkit that simplifies training, with support for SFT, LoRA, QLoRA, DPO, and multi-GPU setups.
- YAML-based configuration for reproducible training
- Multi-GPU training with DeepSpeed or FSDP
- Production-ready training pipelines
Quick Start:
- Install Axolotl
- Create a YAML config file
- Run `axolotl train config.yml`
- Run `axolotl inference config.yml` or `axolotl merge-lora config.yml`
Installation
Requirements
- NVIDIA Ampere+ GPU (or AMD ROCm GPU)
- Python ≥ 3.10
- PyTorch compatible with your CUDA/ROCm version
Install via Pip
pip install --no-build-isolation "axolotl[flash-attn,deepspeed]"
Install from Source
git clone https://github.com/axolotl-ai-cloud/axolotl
cd axolotl
pip install -e '.[flash-attn,deepspeed]'
For detailed installation instructions, see the official Axolotl installation guide.
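After installing, an optional one-line check confirms that Axolotl imports cleanly and that PyTorch sees your GPU:

python -c "import axolotl, torch; print(torch.__version__, torch.cuda.is_available())"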
Dataset Formats
Axolotl supports multiple dataset formats for supervised fine-tuning. We recommend either OpenAI Messages or Input/Output.
OpenAI Messages Format
data/lfm2_sft.jsonl:
{"messages":[{"role":"user","content":"Write a short haiku about LFM2."},{"role":"assistant","content":"Silent layers learn\nTokens drift like falling leaves\nLFM2 speaks."}]}
{"messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Summarize Axolotl in one sentence."},{"role":"assistant","content":"Axolotl is a YAML‑driven LLM finetuning toolkit."}]}
YAML (snippet):
datasets:
  - path: data/lfm2_sft.jsonl
    type: chat_template       # use tokenizer chat template
    field_messages: messages  # key in your JSON objects
Tip: the LFM2 tokenizer includes a chat template in `tokenizer_config.json`, which you can use directly.
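To inspect exactly what that template produces, you can render a sample conversation offline. A minimal sketch (not required for training; assumes a transformers version with LFM2 support):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("LiquidAI/LFM2-2.6B")
messages = [
    {"role": "user", "content": "Write a short haiku about LFM2."},
    {"role": "assistant", "content": "Silent layers learn..."},
]
# Render without tokenizing to see the exact training-time formatting
print(tok.apply_chat_template(messages, tokenize=False))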
Input/Output Format
data/lfm2_io.jsonl:
{"input": "User question...", "output": "Assistant answer..."}
YAML (snippet):
datasets:
  - path: data/lfm2_io.jsonl
    type: input_output
train_on_inputs: false  # top-level option: mask inputs when computing loss
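If your pairs live in another format (CSV, a database, etc.), a few lines of Python can emit this JSONL. A minimal sketch; the pairs list is a placeholder for your own data:

import json

pairs = [
    ("User question...", "Assistant answer..."),  # replace with real pairs
]
with open("data/lfm2_io.jsonl", "w", encoding="utf-8") as f:
    for inp, out in pairs:
        f.write(json.dumps({"input": inp, "output": out}, ensure_ascii=False) + "\n")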
Training Configurations
LoRA Fine-Tuning (Recommended)
LoRA (Low-Rank Adaptation) is the recommended approach for fine-tuning LFM2 models. It offers several key advantages:
- Memory efficient: Trains only small adapter weights (~1-2% of model size) instead of full model parameters (see the size sketch after this list)
- Data efficient: Achieves strong task performance improvements with less training data than full fine-tuning
- Fast training: Reduced parameter count enables faster iteration and larger effective batch sizes
- Flexible: Easy to switch between different task adapters without retraining the base model
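To make the size claim concrete, here is a back-of-envelope estimate of adapter parameters. The hidden size and layer shapes below are illustrative assumptions, not LFM2's actual architecture:

# LoRA adds two low-rank factors per adapted weight: for W (d_out x d_in),
# A is (r x d_in) and B is (d_out x r), i.e. r * (d_in + d_out) new params.
def lora_params(d_in, d_out, r):
    return r * (d_in + d_out)

r, hidden = 16, 2048                          # assumed values for illustration
attn = 4 * lora_params(hidden, hidden, r)     # q/k/v/o projections
mlp = 3 * lora_params(hidden, 4 * hidden, r)  # gate/up/down projections
print(attn + mlp, "adapter params per transformer block")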
Create configs/lfm2-2.6b-lora.yml:
# ---- Model ----
base_model: LiquidAI/LFM2-2.6B
adapter: lora # LoRA; omit for full fine‑tune
# ---- Data ----
datasets:
  - path: data/lfm2_sft.jsonl
    type: chat_template
    field_messages: messages
# Optional: preprocess & cache
# dataset_prepared_path: .cache/lfm2_sft_prepared
# ---- Training ----
output_dir: ./outputs/lfm2-2.6b-sft-lora
num_epochs: 2
learning_rate: 2e-4
lr_scheduler: cosine
warmup_steps: 50
micro_batch_size: 2
gradient_accumulation_steps: 16
seed: 42
# ---- Sequence & packing ----
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
# ---- Precision & memory ----
bf16: true
flash_attention: true
gradient_checkpointing: true
# ---- LoRA hyperparams ----
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
# lora_target_modules: [q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj] # override only if needed
# ---- Logging ----
# wandb_project: lfm2-sft
# wandb_run_name: lfm2-2.6b-lora
QLoRA (4-Bit Quantization)
For maximum memory efficiency on resource-constrained hardware, use QLoRA with 4-bit quantization. This cuts weight memory by roughly 4x relative to bf16 while maintaining strong task performance.
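The 4x figure refers to weight storage. A back-of-envelope comparison for a 2.6B-parameter model:

# Weights only; activations, LoRA adapters, and optimizer states come on top.
params = 2.6e9
print(f"bf16: {params * 2 / 1e9:.1f} GB")    # 2 bytes per weight -> ~5.2 GB
print(f"nf4:  {params * 0.5 / 1e9:.1f} GB")  # ~0.5 bytes per weight -> ~1.3 GB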
configs/lfm2-2.6b-qlora.yml:
base_model: LiquidAI/LFM2-2.6B
adapter: qlora
# 4‑bit loading for QLoRA
load_in_4bit: true
bnb_4bit_quant_type: nf4
bnb_4bit_use_double_quant: true
bnb_4bit_compute_dtype: bfloat16
# Data & training same as LoRA example
sequence_len: 4096
sample_packing: true
bf16: true
flash_attention: true
gradient_checkpointing: true
lora_r: 64
lora_alpha: 32
lora_dropout: 0.05
output_dir: ./outputs/lfm2-2.6b-sft-qlora
num_epochs: 2
learning_rate: 2e-4
micro_batch_size: 2
gradient_accumulation_steps: 16
Full Fine-Tuning
Full fine-tuning updates all model parameters. Use this only when you have sufficient GPU memory and need maximum adaptation for your task. Requires significantly more memory and training time than LoRA.
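As a rough rule of thumb, mixed-precision training with Adam costs about 16 bytes per parameter before activations, which is the main reason full fine-tuning needs so much more memory than LoRA:

# bf16 weights (2 B) + bf16 grads (2 B) + fp32 Adam moments (8 B)
# + fp32 master weights (4 B) ~= 16 bytes per parameter
params = 2.6e9
print(f"~{params * 16 / 1e9:.0f} GB before activations")  # ~42 GB for LFM2-2.6B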
configs/lfm2-2.6b-full.yml:
base_model: LiquidAI/LFM2-2.6B
# adapter: null # no PEFT adapters -> full FT
sequence_len: 4096
sample_packing: true
bf16: true
flash_attention: true
gradient_checkpointing: true
output_dir: ./outputs/lfm2-2.6b-sft-full
num_epochs: 2
learning_rate: 1e-5
lr_scheduler: cosine
warmup_steps: 50
micro_batch_size: 1
gradient_accumulation_steps: 32
Notes
- Tune `sequence_len`, `micro_batch_size`, and `gradient_accumulation_steps` to your GPU budget.
- If you hit OOM with long contexts, consider sequence parallelism (multi-GPU) and keep `flash_attention: true`.
Training
Single GPU
Run training with your YAML config:
axolotl train configs/lfm2-2.6b-lora.yml
Debug preprocessing to inspect tokens:
axolotl preprocess configs/lfm2-2.6b-lora.yml --debug --debug-num-examples 5
Multi-GPU with DeepSpeed
# Fetch DeepSpeed configs (one-time setup)
axolotl fetch deepspeed_configs
# Train with DeepSpeed ZeRO-2
axolotl train configs/lfm2-2.6b-lora.yml --deepspeed deepspeed_configs/zero2.json
# Or use torchrun launcher
axolotl train configs/lfm2-2.6b-lora.yml --launcher torchrun -- --nproc_per_node=4
Multi-GPU with FSDP
For FSDP2, set `fsdp_version: 2` and configure `fsdp_config` in your YAML file.
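A minimal sketch of what that can look like, based on Axolotl's FSDP documentation. Key names vary across Axolotl versions and the decoder-layer class below is an assumption, so check the docs for your release:

fsdp_version: 2
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Lfm2DecoderLayer  # assumption: your model's block class
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true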
Inference
LoRA Inference
axolotl inference configs/lfm2-2.6b-lora.yml \
--lora-model-dir ./outputs/lfm2-2.6b-sft-lora
Full Model Inference
axolotl inference configs/lfm2-2.6b-full.yml \
--base-model ./outputs/lfm2-2.6b-sft-full/completed
Merging LoRA Adapters
Merge LoRA adapters into the base model:
axolotl merge-lora configs/lfm2-2.6b-lora.yml \
--lora-model-dir ./outputs/lfm2-2.6b-sft-lora
For CPU-only merging (if VRAM is limited):
CUDA_VISIBLE_DEVICES="" axolotl merge-lora configs/lfm2-2.6b-lora.yml
Pushing to Hugging Face
Manual Upload
hf login
hf upload <your-org>/<repo-name> ./outputs/lfm2-2.6b-sft-lora/merged
Automatic Upload
Set `hub_model_id: <your-org>/<repo>` in your YAML config to auto-push during training (requires `hf login` first).
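For reference, the corresponding YAML lines. `hub_model_id` is the key named above; `hub_strategy` mirrors the underlying HF Trainer option and is an assumption for your Axolotl version:

hub_model_id: <your-org>/<repo-name>
# hub_strategy: every_save  # assumption: push frequency, as in the HF Trainer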
Tips
- Out of memory: Reduce `micro_batch_size`, increase `gradient_accumulation_steps`, lower `sequence_len`, or use QLoRA
- Slow training: Enable `flash_attention`, `sample_packing`, and `gradient_checkpointing`
- LoRA optimizations: Add `lora_mlp_kernel: true`, `lora_qkv_kernel: true`, `lora_o_kernel: true` for faster training
- Merge errors: Use CPU merge with `CUDA_VISIBLE_DEVICES=""` or set `lora_on_cpu: true`
- Multi-GPU: Start with DeepSpeed ZeRO-2, upgrade to ZeRO-3 for larger models
For more end-to-end examples, visit the Liquid AI Cookbook.