
llama.cpp

llama.cpp is a C/C++ library for efficient LLM inference with minimal dependencies. It's designed for CPU-first inference with cross-platform support.

tip

Use llama.cpp for:

  • CPU-only environments (no GPU required)
  • Local development and testing
  • Edge deployment and on-device inference

For GPU-accelerated inference at scale, consider using vLLM instead.

Installation

Install llama.cpp for your target operating system. On macOS or Linux with Homebrew:

brew install llama.cpp
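
On other platforms, you can install a prebuilt release from the llama.cpp GitHub releases page or build from source. A minimal from-source sketch, assuming git and CMake are available:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release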

Downloading GGUF Models

llama.cpp uses the GGUF format, which stores quantized model weights for efficient inference. All LFM models are available in GGUF format on Hugging Face; see the Models page for the full list.

You can download LFM models in GGUF format from Hugging Face as follows:

pip install huggingface-hub
hf download LiquidAI/LFM2-1.2B-GGUF lfm2-1.2b-q4_k_m.gguf --local-dir .
Available quantization levels
  • Q4_0: 4-bit quantization, smallest size
  • Q4_K_M: 4-bit quantization, good balance of quality and size (recommended)
  • Q5_K_M: 5-bit quantization, better quality with moderate size increase
  • Q6_K: 6-bit quantization, excellent quality closer to original
  • Q8_0: 8-bit quantization, near-original quality
  • F16: 16-bit float, full precision
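
To use a different quantization level, substitute the corresponding filename in the download command. A sketch, assuming the repository follows the same naming pattern (check the repo's file listing for the exact filenames):

hf download LiquidAI/LFM2-1.2B-GGUF lfm2-1.2b-q8_0.gguf --local-dir .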

Basic Usage

llama.cpp offers three main interfaces for running inference: llama-cpp-python (Python bindings), llama-server (OpenAI-compatible server), and llama-cli (interactive CLI).
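
As a quick start, both command-line tools can run a downloaded GGUF file directly. A minimal sketch, assuming the model file downloaded above:

# OpenAI-compatible HTTP server on http://localhost:8080
llama-server -m lfm2-1.2b-q4_k_m.gguf -c 4096 --port 8080

# Interactive chat in the terminal
llama-cli -m lfm2-1.2b-q4_k_m.gguf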

For Python applications, use the llama-cpp-python package.

Installation:

pip install llama-cpp-python

For GPU support, build with the CUDA backend enabled (recent llama.cpp releases use the GGML_CUDA flag; older ones used LLAMA_CUDA):

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

Model Setup:

from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="lfm2-1.2b-q4_k_m.gguf",
    n_ctx=4096,
    n_threads=8
)

# Generate text
output = llm(
    "What is artificial intelligence?",
    max_tokens=512,
    temperature=0.7,
    top_p=0.9
)

print(output["choices"][0]["text"])
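
If you installed the CUDA build, you can offload model layers to the GPU when loading. A minimal sketch, assuming a CUDA-enabled build of llama-cpp-python:

from llama_cpp import Llama

llm = Llama(
    model_path="lfm2-1.2b-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=-1  # -1 offloads all layers; use a smaller value for partial offload
)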

Chat Completions:

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing."}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response["choices"][0]["message"]["content"])
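
To stream tokens as they are generated, pass stream=True; create_chat_completion then returns an iterator of OpenAI-style chunks. A brief sketch, reusing the llm instance from above:

stream = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantum computing."}],
    max_tokens=512,
    temperature=0.7,
    stream=True
)

for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    # The first chunk carries the role only; later chunks carry content
    print(delta.get("content", ""), end="", flush=True)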

Generation Parameters

Control text generation behavior using parameters in the OpenAI-compatible API or command-line flags. Key parameters:

  • temperature (float, default 1.0): Controls randomness (0.0 = deterministic, higher = more random). Typical range: 0.1-2.0
  • top_p (float, default 1.0): Nucleus sampling - limits to tokens with cumulative probability ≤ top_p. Typical range: 0.1-1.0
  • top_k (int, default 40): Limits to top-k most probable tokens. Typical range: 1-100
  • max_tokens / --n-predict (int): Maximum number of tokens to generate
  • repetition_penalty / --repeat-penalty (float, default 1.1): Penalty for repeating tokens (>1.0 = discourage repetition). Typical range: 1.0-1.5
  • stop (str or list[str]): Strings that terminate generation when encountered
llama-cpp-python example
from llama_cpp import Llama

llm = Llama(
    model_path="lfm2-1.2b-q4_k_m.gguf",
    n_ctx=4096,
    n_threads=8
)

# Text generation with sampling parameters
output = llm(
    "What is machine learning?",
    max_tokens=512,
    temperature=0.7,
    top_p=0.9,
    top_k=40,
    repeat_penalty=1.1,
    stop=["<|im_end|>", "<|endoftext|>"]
)

print(output["choices"][0]["text"])

# Chat completion with sampling parameters
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Explain quantum computing."}
    ],
    temperature=0.7,
    top_p=0.9,
    top_k=40,
    max_tokens=512,
    repeat_penalty=1.1
)

print(response["choices"][0]["message"]["content"])
llama-server (OpenAI-compatible API) example
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="lfm2-1.2b",
    messages=[{"role": "user", "content": "What is machine learning?"}],
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
    # Non-standard sampling parameters are passed through to llama-server via extra_body
    extra_body={"top_k": 40, "repeat_penalty": 1.1},
)

print(response.choices[0].message.content)

For command-line tools (llama-cli), use flags like --temp, --top-p, --top-k, --repeat-penalty, and --n-predict.
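
For example, a one-shot prompt with explicit sampling settings (a sketch assuming the GGUF file downloaded earlier):

llama-cli -m lfm2-1.2b-q4_k_m.gguf -p "What is machine learning?" -n 512 --temp 0.7 --top-p 0.9 --top-k 40 --repeat-penalty 1.1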

Vision Models

LFM2-VL GGUF models can also be used for multimodal inference with llama.cpp.

Using llama-server

Start a vision model server. Vision models also require the multimodal projector (mmproj) file distributed alongside the GGUF weights; pass it with --mmproj:

llama-server -m lfm2-vl-1.6b-q4_k_m.gguf --mmproj mmproj-model.gguf -c 4096 --port 8080

Use with the OpenAI Python client:

from openai import OpenAI
import base64

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

# Encode image to base64
with open("image.jpg", "rb") as image_file:
    image_data = base64.b64encode(image_file.read()).decode("utf-8")

response = client.chat.completions.create(
    model="lfm2-vl",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}},
                {"type": "text", "text": "What's in this image?"}
            ]
        }
    ],
    max_tokens=256
)

print(response.choices[0].message.content)
Using llama-cpp-python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Initialize with vision support
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model.gguf")
llm = Llama(
    model_path="lfm2-vl-1.6b-q4_k_m.gguf",
    chat_handler=chat_handler,
    n_ctx=4096
)

# Generate with image
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ]
)

print(response["choices"][0]["message"]["content"])

Converting Custom Models

If you have a finetuned model or need to create a GGUF from a Hugging Face model:

# Clone llama.cpp if you haven't already
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt

# Convert the Hugging Face model to GGUF at half precision
python convert_hf_to_gguf.py /path/to/your/model --outfile model-f16.gguf --outtype f16

# Quantize to the desired level with the llama-quantize tool
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

Use --outtype to select the conversion precision (e.g., f32, f16, bf16, q8_0); k-quant levels such as q4_k_m, q5_k_m, and q6_k are produced with llama-quantize after conversion.
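
To sanity-check the result, run the quantized file directly with llama-cli (filenames here are the hypothetical ones from the commands above):

llama-cli -m model-q4_k_m.gguf -p "Hello" -n 64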