Dataset Sources & File Types
Hugging Face Datasets is a great place to find pre-built datasets for fine-tuning. Most dataset loaders also support local files in these formats:- JSONL - One JSON object per line, easiest to create
- CSV - Tabular format, good for simple datasets
- Parquet/Arrow - More efficient for larger datasets
Text Datasets
Instruction Datasets (SFT)
Conversational format with amessages array:
system (optional), user, assistant. Multi-turn conversations are supported by alternating user/assistant messages.
Example: HuggingFaceTB/smoltalk
You may encounter datasets in other formats (standard prompt-completion or conversational prompt-completion). Convert these to the messages format with role and content fields to ensure reliable generations.
Preference Datasets (DPO)
Chosen and rejected completions for the same prompt. Use the explicit format which separates the prompt from each answer:DPOTrainer will automatically convert implicit to explicit if needed.
Prompt-Only Datasets (GRPO)
For reinforcement learning methods like GRPO, only prompts are provided. Completions are generated during training and evaluated by reward functions:Vision Datasets
Vision Datasets (VLM-SFT)
For vision-language models, content uses typed arrays with a separateimages column:
{"type": "image"} placeholder indicates where the image appears in the conversation.
Example: HuggingFaceH4/llava-instruct-mix-vsft
Loading Images with PIL
Loading Images with PIL
You can map a preprocessing function like this to your dataset to load and prepare images for training: