Use MLX to run models on Apple Silicon Macs with Metal GPU acceleration.
MLX leverages the unified memory architecture of Apple Silicon, allowing seamless data sharing between the CPU and GPU. The mlx-lm package provides a simple interface for loading and serving LLMs.

Installation

Install the MLX language model package:
pip install mlx-lm
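
To confirm that MLX is installed and running on the GPU, a quick sanity check like the following can help (a minimal sketch; mx.default_device() typically reports the Metal GPU on Apple Silicon):
import mlx.core as mx

# On Apple Silicon this typically prints Device(gpu, 0), confirming Metal is in use.
print(mx.default_device())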

Basic Usage

The mlx-lm package provides a simple interface for text generation with MLX models. See the Models page for all available MLX models, or browse the mlx-community LFM2 models on Hugging Face.
from mlx_lm import load, generate

# Load model and tokenizer
model, tokenizer = load("mlx-community/LFM2-1.2B-8bit")

# Generate text
prompt = "What is machine learning?"

# Apply chat template
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, verbose=True)
print(response)

Generation Parameters

Control text generation behavior with parameters passed to the generate() function. Key parameters:
  • temperature (float, default 1.0): Controls randomness (0.0 = deterministic, higher = more random). Typical range: 0.1-2.0
  • top_p (float, default 1.0): Nucleus sampling - limits to tokens with cumulative probability ≤ top_p. Typical range: 0.1-1.0
  • top_k (int, default 50): Limits to top-k most probable tokens. Typical range: 1-100
  • max_tokens (int): Maximum number of tokens to generate
  • repetition_penalty (float, default 1.0): Penalty for repeating tokens (>1.0 = discourage repetition). Typical range: 1.0-1.5
Example with custom parameters:
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    temperature=0.3,
    top_p=0.9,
    max_tokens=512,
    repetition_penalty=1.05
)
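
Note that recent mlx-lm releases move sampling options off generate() and onto a sampler object; if your version rejects the keyword arguments above, a sketch along these lines (assuming mlx_lm.sample_utils.make_sampler is available) may apply:
from mlx_lm.sample_utils import make_sampler

# Bundle the sampling settings into a sampler and pass it to generate().
sampler = make_sampler(temp=0.3, top_p=0.9)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler)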

Streaming Generation

Stream responses with stream_generate():
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/LFM2-1.2B-8bit")

messages = [{"role": "user", "content": "Tell me a story about space exploration."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

for token in stream_generate(model, tokenizer, prompt=prompt, max_tokens=512):
    print(token, end="", flush=True)
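
Depending on the mlx-lm version, stream_generate yields plain text chunks or response objects with a text field; a hedged sketch that collects the full response either way:
parts = []
for chunk in stream_generate(model, tokenizer, prompt=prompt, max_tokens=512):
    # Newer releases yield response objects with a .text attribute;
    # older ones yield the text directly (assumption).
    text = chunk.text if hasattr(chunk, "text") else chunk
    print(text, end="", flush=True)
    parts.append(text)

full_response = "".join(parts)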

Serving with mlx-lm

mlx-lm can serve models through an OpenAI-compatible API. Start a server with:
mlx_lm.server --model mlx-community/LFM2-1.2B-8bit --port 8080

Using the Server

Once running, use the OpenAI Python client:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="mlx-community/LFM2-1.2B-8bit",
    messages=[
        {"role": "user", "content": "Explain quantum computing."}
    ],
    temperature=0.7,
    max_tokens=512
)
print(response.choices[0].message.content)
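Streaming also works through the OpenAI client, assuming the server supports the standard stream option of the chat completions API:
stream = client.chat.completions.create(
    model="mlx-community/LFM2-1.2B-8bit",
    messages=[
        {"role": "user", "content": "Explain quantum computing."}
    ],
    temperature=0.7,
    max_tokens=512,
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental delta; content may be None on the final chunk.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)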
You can also use curl to interact with the server:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/LFM2-1.2B-8bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'

Vision Models

LFM2-VL models support both text and image inputs for multimodal inference. Use the mlx-vlm package (pip install mlx-vlm) to load and generate with vision models:
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from PIL import Image

# Load vision model
model, processor = load("mlx-community/LFM2-VL-1.6B-8bit")

# Load image
image = Image.open("path/to/image.jpg")

# Create prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What's in this image?"}
        ]
    }
]

# Apply chat template
prompt = apply_chat_template(processor, messages)

# Generate
output = generate(model, processor, image, prompt, verbose=False)
print(output)
To run inference over multiple images, pass a list of images and include one image entry per image in the message content:
images = [
    Image.open("path/to/first.jpg"),
    Image.open("path/to/second.jpg")
]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What are the differences between these images?"}
        ]
    }
]

prompt = apply_chat_template(processor, messages)
output = generate(model, processor, images, prompt, verbose=False)
print(output)