llama.cpp
llama.cpp is a C++ library for efficient LLM inference with minimal dependencies. It's designed for CPU-first inference with cross-platform support.
Use llama.cpp for:
- CPU-only environments (no GPU required)
- Local development and testing
- Edge deployment and on-device inference
For GPU-accelerated inference at scale, consider using vLLM instead.
Installation
Install llama.cpp for your target operating system using Homebrew, pre-built binaries, or a build from source.

macOS and Linux (Homebrew):

brew install llama.cpp

Pre-built Binaries:

Download from the llama.cpp releases page.
File naming: llama-<version>-bin-<os>-<feature>-<arch>.zip
Quick selection guide:
- Windows (CPU): llama-*-bin-win-avx2-x64.zip for Intel/AMD CPUs
- Windows (NVIDIA GPU): llama-*-bin-win-cu12-x64.zip (requires CUDA drivers)
- macOS (Intel): llama-*-bin-macos-x64.zip
- macOS (Apple Silicon): llama-*-bin-macos-arm64.zip
- Linux: llama-*-bin-linux-x64.zip
After downloading, unzip and run from that directory.
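To confirm the binaries work, you can print the version from the unzipped directory. This is a minimal check that assumes the archive for your platform includes llama-cli:

# run from the directory you unzipped into
./llama-cli --version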
Build from Source:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
The compiled programs will be in ./build/bin/.
For detailed build instructions including GPU support, see the llama.cpp documentation.
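As one example, a CUDA-enabled build on a machine with the CUDA toolkit installed typically adds the GGML_CUDA CMake flag; this is a sketch, and other GPU backends use different flags:

# configure and build with the CUDA backend enabled
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release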
Downloading GGUF Models
llama.cpp uses the GGUF format, which stores quantized model weights for efficient inference. All LFM models are available in GGUF format on Hugging Face; see the Models page for the full list.
You can download LFM models in GGUF format from Hugging Face as follows:
pip install huggingface-hub
hf download LiquidAI/LFM2-1.2B-GGUF lfm2-1.2b-q4_k_m.gguf --local-dir .
Available quantization levels:
- Q4_0: 4-bit quantization, smallest size
- Q4_K_M: 4-bit quantization, good balance of quality and size (recommended)
- Q5_K_M: 5-bit quantization, better quality with a moderate size increase
- Q6_K: 6-bit quantization, excellent quality, close to the original
- Q8_0: 8-bit quantization, near-original quality
- F16: 16-bit float, full precision
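To fetch a different quantization level, point hf download at the corresponding file. The sketch below assumes the repository follows the same lowercase naming pattern as the Q4_K_M file above:

# assumed file name, following the naming pattern of the Q4_K_M download
hf download LiquidAI/LFM2-1.2B-GGUF lfm2-1.2b-q8_0.gguf --local-dir .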
Basic Usage
llama.cpp offers three main interfaces for running inference: llama-cpp-python (Python bindings), llama-server (OpenAI-compatible server), and llama-cli (interactive CLI).
For Python applications, use the llama-cpp-python package.
Installation:
pip install llama-cpp-python
For GPU support:
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
Model Setup:
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="lfm2-1.2b-q4_k_m.gguf",
    n_ctx=4096,
    n_threads=8
)

# Generate text
output = llm(
    "What is artificial intelligence?",
    max_tokens=512,
    temperature=0.7,
    top_p=0.9
)
print(output["choices"][0]["text"])
Chat Completions:
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing."}
    ],
    temperature=0.7,
    max_tokens=512
)
print(response["choices"][0]["message"]["content"])
llama-server provides an OpenAI-compatible API for serving models locally.
Starting the Server:
llama-server -m lfm2-1.2b-q4_k_m.gguf -c 4096 --port 8080
Key parameters:
- -m: Path to GGUF model file
- -c: Context length (default: 4096)
- --port: Server port (default: 8080)
- -ngl 99: Offload layers to GPU (if available)
Using the Server:
Once running at http://localhost:8080, use the OpenAI Python client:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="lfm2-1.2b",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.7,
    max_tokens=512
)
print(response.choices[0].message.content)
Using curl:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lfm2-1.2b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'
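The server also supports streaming via the standard OpenAI stream field. A minimal curl sketch, using the same endpoint and placeholder model name as above (-N disables output buffering so chunks print as they arrive):

curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lfm2-1.2b",
    "messages": [{"role": "user", "content": "Write a haiku about the sea."}],
    "stream": true
  }'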
llama-cli provides an interactive terminal interface for chatting with models.
llama-cli -m lfm2-1.2b-q4_k_m.gguf -c 4096 --color -i
Key parameters:
- -m: Path to GGUF model file
- -c: Context length
- --color: Colored output
- -i: Interactive mode
- -ngl 99: Offload layers to GPU (if available)
Press Ctrl+C to exit.
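For scripted, non-interactive runs you can drop -i and pass a prompt directly. A rough sketch (the prompt text is illustrative, and depending on your llama.cpp version you may also need -no-cnv to disable chat mode):

# one-shot generation: illustrative prompt, capped at 128 tokens
llama-cli -m lfm2-1.2b-q4_k_m.gguf -p "Summarize what a GGUF file is in one sentence." -n 128 --temp 0.7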
Generation Parameters
Control text generation behavior using parameters in the OpenAI-compatible API or command-line flags. Key parameters:
- temperature (float, default 1.0): Controls randomness (0.0 = deterministic, higher = more random). Typical range: 0.1-2.0
- top_p (float, default 1.0): Nucleus sampling; limits sampling to tokens with cumulative probability ≤ top_p. Typical range: 0.1-1.0
- top_k (int, default 40): Limits sampling to the k most probable tokens. Typical range: 1-100
- max_tokens / --n-predict (int): Maximum number of tokens to generate
- repetition_penalty / --repeat-penalty (float, default 1.1): Penalty for repeating tokens (>1.0 discourages repetition). Typical range: 1.0-1.5
- stop (str or list[str]): Strings that terminate generation when encountered
llama-cpp-python example
from llama_cpp import Llama

llm = Llama(
    model_path="lfm2-1.2b-q4_k_m.gguf",
    n_ctx=4096,
    n_threads=8
)

# Text generation with sampling parameters
output = llm(
    "What is machine learning?",
    max_tokens=512,
    temperature=0.7,
    top_p=0.9,
    top_k=40,
    repeat_penalty=1.1,
    stop=["<|im_end|>", "<|endoftext|>"]
)
print(output["choices"][0]["text"])

# Chat completion with sampling parameters
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Explain quantum computing."}
    ],
    temperature=0.7,
    top_p=0.9,
    top_k=40,
    max_tokens=512,
    repeat_penalty=1.1
)
print(response["choices"][0]["message"]["content"])
llama-server (OpenAI-compatible API) example
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="lfm2-1.2b",
    messages=[{"role": "user", "content": "What is machine learning?"}],
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
    # top_k and repeat_penalty are llama-server extensions, not standard OpenAI
    # parameters, so pass them through extra_body
    extra_body={"top_k": 40, "repeat_penalty": 1.1},
)
print(response.choices[0].message.content)
For command-line tools (llama-cli), use flags like --temp, --top-p, --top-k, --repeat-penalty, and --n-predict, as in the example below.
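A rough llama-cli invocation combining these sampling flags (the prompt and parameter values are illustrative):

# sample with temperature, nucleus/top-k filtering, and a repeat penalty
llama-cli -m lfm2-1.2b-q4_k_m.gguf -p "Explain overfitting briefly." \
  --temp 0.7 --top-p 0.9 --top-k 40 --repeat-penalty 1.1 --n-predict 256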
Vision Models
LFM2-VL GGUF models can also be used for multimodal inference with llama.cpp.
Using llama-server
Start a vision model server, passing the multimodal projector (mmproj) file distributed alongside the GGUF weights:

llama-server -m lfm2-vl-1.6b-q4_k_m.gguf --mmproj mmproj-model.gguf -c 4096 --port 8080
Use with the OpenAI Python client:
from openai import OpenAI
import base64

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

# Encode image to base64
with open("image.jpg", "rb") as image_file:
    image_data = base64.b64encode(image_file.read()).decode("utf-8")

response = client.chat.completions.create(
    model="lfm2-vl",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}},
                {"type": "text", "text": "What's in this image?"}
            ]
        }
    ],
    max_tokens=256
)
print(response.choices[0].message.content)
Using llama-cpp-python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Initialize with vision support
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model.gguf")
llm = Llama(
    model_path="lfm2-vl-1.6b-q4_k_m.gguf",
    chat_handler=chat_handler,
    n_ctx=4096
)

# Generate with image
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ]
)
print(response["choices"][0]["message"]["content"])
Converting Custom Models
If you have a fine-tuned model or need to create a GGUF file from a Hugging Face model:

# Clone llama.cpp if you haven't already
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Install the conversion script's dependencies, then convert to an F16 GGUF
pip install -r requirements.txt
python convert_hf_to_gguf.py /path/to/your/model --outfile model-f16.gguf --outtype f16
# Quantize the F16 GGUF to a smaller format such as Q4_K_M
./build/bin/llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

convert_hf_to_gguf.py's --outtype accepts higher-precision formats such as f16, bf16, and q8_0; lower-bit K-quants (e.g., Q4_K_M, Q5_K_M, Q6_K) are produced in the separate llama-quantize step, which takes the target quantization type as its final argument.
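As a quick sanity check, the quantized file (file names carried over from the sketch above) can be loaded directly with llama-cli:

# load the freshly quantized model and generate a few tokens
./build/bin/llama-cli -m model-q4_k_m.gguf -p "Hello" -n 32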