Installation Issues

Ensure you have the latest version of transformers installed:
pip install "transformers>=4.55.0"
If you’re using an older version, the LFM model classes may not be available.
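You can confirm the installed version from Python:
import transformers
print(transformers.__version__)  # should be 4.55.0 or newer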
If you run out of memory when loading or running the model, try these solutions in order:
  1. Use a smaller model: Try LFM2-350M instead of LFM2-1.2B
  2. Enable quantization (recent transformers versions deprecate the load_in_4bit shortcut in favor of an explicit BitsAndBytesConfig):
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto"
)
  3. Reduce batch size or sequence length
  4. Use gradient checkpointing for training:
model.gradient_checkpointing_enable()
If the model download fails or stalls:
  • Check your internet connection
  • Try using huggingface-cli login if the model requires authentication
  • Set a longer timeout: HF_HUB_DOWNLOAD_TIMEOUT=600
  • Try downloading with snapshot_download:
from huggingface_hub import snapshot_download
snapshot_download("LiquidAI/LFM2.5-1.2B-Instruct")
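For example, the timeout can be combined with the download (a minimal sketch; set the variable before importing huggingface_hub so it takes effect):
import os
os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "600"  # give slow connections more time

from huggingface_hub import snapshot_download

# Downloads the full repository into the local Hugging Face cache
local_path = snapshot_download("LiquidAI/LFM2.5-1.2B-Instruct")
print(local_path)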

Inference Issues

If output quality is poor or repetitive, adjust generation parameters:
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.1
)
Key parameters to tune:
  • temperature: Lower (0.3-0.5) for factual, higher (0.7-1.0) for creative
  • top_p: 0.9 is a good default
  • repetition_penalty: 1.1-1.2 helps avoid loops
If inference is slow, try these optimization strategies:
  1. Use GGUF models with llama.cpp for CPU inference
  2. Use MLX models on Apple Silicon
  3. Enable Flash Attention (if available):
import torch

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Flash Attention 2 requires fp16 or bf16
    attn_implementation="flash_attention_2"
)
  4. Use vLLM for high-throughput serving (see the sketch after this list)
  5. Use smaller quantization levels (Q4 vs Q8)
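A minimal vLLM sketch (assuming your vLLM version supports the LFM architecture; the model ID and sampling values are illustrative):
from vllm import LLM, SamplingParams

# vLLM batches and schedules requests automatically for high throughput
llm = LLM(model="LiquidAI/LFM2.5-1.2B-Instruct")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)

outputs = llm.generate(["Explain liquid neural networks."], params)
print(outputs[0].outputs[0].text)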
If output is cut off mid-sentence, increase max_new_tokens:
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,  # Increase this value
)
Check that your input isn’t too long - LFM models support 32k context, but very long inputs leave less room for output.
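To see how much of the window your prompt consumes (a small sketch, assuming inputs comes from the tokenizer as in the examples above):
n_input_tokens = inputs["input_ids"].shape[-1]
print(f"Prompt uses {n_input_tokens} of the 32k-token context window")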

Fine-tuning Issues

If fine-tuning produces poor results, check these common causes and solutions:
  1. Learning rate too high/low: Try 2e-4 for LoRA, 2e-5 for full fine-tuning
  2. Dataset format issues: Verify your data matches the expected chat template (see the snippet after this list)
  3. Insufficient data: Ensure you have enough training examples
  4. Check for data leakage: Make sure eval data isn’t in training set
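One quick way to inspect the expected format (a minimal sketch; the example messages are placeholders):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]
# Render the conversation exactly as the model expects to see it
print(tokenizer.apply_chat_template(messages, tokenize=False))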
Memory optimization strategies:
  1. Use QLoRA instead of full fine-tuning (a sketch follows the TrainingArguments example below)
  2. Reduce batch size and increase gradient accumulation
  3. Enable gradient checkpointing
  4. Use a smaller model (LFM2-350M for experiments)
For example:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="lfm2-finetune",  # required by older transformers versions
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
)
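A minimal QLoRA sketch using peft (the rank, alpha, and target_modules values here are assumptions to adapt to your setup):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base weights in 4-bit, then train only small LoRA adapter matrices
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16,                        # adapter rank (assumption)
    lora_alpha=32,               # scaling factor (assumption)
    target_modules="all-linear", # attach adapters to every linear layer
    task_type="CAUSAL_LM",
))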

llama.cpp / GGUF Issues

If the model fails to load:
  • Ensure you’re using a compatible llama.cpp version
  • Check that the GGUF file downloaded completely
  • Try a different quantization level (e.g., Q4_K_M)
If inference is slow:
  • Ensure you compiled with GPU support if available
  • Use an appropriate thread count: -t $(nproc)
  • Try a more aggressive quantization (Q4_0)

Still Stuck?