MLX

MLX is Apple's machine learning framework optimized for Apple Silicon. It provides efficient inference on Mac devices with M-series chips (M1, M2, M3, M4) using Metal acceleration for GPU computing.

Use MLX for:
  • Running models on Apple Silicon Macs
  • Efficient on-device inference with Metal GPU acceleration
  • Local development on macOS

MLX leverages Apple Silicon's unified memory architecture, which lets the CPU and GPU share data without explicit copies. The mlx-lm package provides a simple interface for loading and serving LLMs.
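To make the unified-memory point concrete, the same arrays can be evaluated on either device without explicit transfers; a minimal sketch using the mlx core package (a dependency of mlx-lm, installed below):

import mlx.core as mx

# Arrays live in unified memory; there is no .to(device) copy step.
a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))

# The same buffers can be consumed by either device via the stream argument.
gpu_out = mx.matmul(a, b, stream=mx.gpu)
cpu_out = mx.matmul(a, b, stream=mx.cpu)

mx.eval(gpu_out, cpu_out)  # MLX is lazy; eval forces the computation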

Installation

Install the MLX language model package:

pip install mlx-lm
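To confirm the install and check that Metal (GPU) is the default device, you can run a quick sanity check; a minimal sketch, assuming the mlx core package was pulled in as a dependency of mlx-lm:

from importlib.metadata import version

import mlx.core as mx

print("mlx-lm version:", version("mlx-lm"))
print("default device:", mx.default_device())  # expect the GPU on Apple Silicon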

Basic Usage

The mlx-lm package provides a simple interface for text generation with MLX models.

See the Models page for all available MLX models, or browse the LFM2 models published by the mlx-community organization on Hugging Face.

from mlx_lm import load, generate

# Load model and tokenizer
model, tokenizer = load("mlx-community/LFM2-1.2B-8bit")

# Generate text
prompt = "What is machine learning?"

# Apply chat template
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, verbose=True)
print(response)

Generation Parameters

Control text generation behavior using parameters in the generate() function. Key parameters:

  • temperature (float, default 1.0): Controls randomness (0.0 = deterministic, higher = more random). Typical range: 0.1-2.0
  • top_p (float, default 1.0): Nucleus sampling; restricts sampling to the smallest set of tokens whose cumulative probability is ≤ top_p. Typical range: 0.1-1.0
  • top_k (int, default 50): Restricts sampling to the k most probable tokens. Typical range: 1-100
  • max_tokens (int): Maximum number of tokens to generate
  • repetition_penalty (float, default 1.0): Penalizes repeated tokens (values > 1.0 discourage repetition). Typical range: 1.0-1.5

Example with custom parameters:

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    temperature=0.3,
    top_p=0.9,
    max_tokens=512,
    repetition_penalty=1.05
)
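How these parameters are passed depends on your mlx-lm version: recent releases move sampling options out of generate() and into a sampler object. A hedged sketch, assuming make_sampler and make_logits_processors are available in mlx_lm.sample_utils:

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler

model, tokenizer = load("mlx-community/LFM2-1.2B-8bit")

# prompt prepared with the chat template as in the Basic Usage example above
sampler = make_sampler(temp=0.3, top_p=0.9)
logits_processors = make_logits_processors(repetition_penalty=1.05)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    sampler=sampler,
    logits_processors=logits_processors,
    max_tokens=512,
)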

Streaming Generation

Stream responses with stream_generate():

from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/LFM2-1.2B-8bit")

messages = [{"role": "user", "content": "Tell me a story about space exploration."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

for token in stream_generate(model, tokenizer, prompt=prompt, max_tokens=512):
    print(token, end="", flush=True)

Serving with mlx-lm

MLX can serve models through an OpenAI-compatible API. Start a server with:

mlx_lm.server --model mlx-community/LFM2-1.2B-8bit --port 8080

Using the Server

Once running, use the OpenAI Python client:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="mlx-community/LFM2-1.2B-8bit",
    messages=[
        {"role": "user", "content": "Explain quantum computing."}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response.choices[0].message.content)
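The server also exposes the standard OpenAI streaming flag (verify support against your mlx-lm version); a minimal sketch that prints tokens as they arrive:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# stream=True yields incremental chunks instead of a single response
stream = client.chat.completions.create(
    model="mlx-community/LFM2-1.2B-8bit",
    messages=[{"role": "user", "content": "Explain quantum computing."}],
    temperature=0.7,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)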

You can also use curl to interact with the server:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/LFM2-1.2B-8bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'

Vision Models

LFM2-VL models support both text and image inputs for multimodal inference. Use the mlx_vlm package to load and run vision models.
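The mlx_vlm package is distributed separately from mlx-lm; assuming the PyPI package name mlx-vlm, install it first:

pip install mlx-vlm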

Single Image Example
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from PIL import Image

# Load vision model
model, processor = load("mlx-community/LFM2-VL-1.6B-8bit")

# Load image
image = Image.open("path/to/image.jpg")

# Create prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What's in this image?"}
        ]
    }
]

# Apply chat template
prompt = apply_chat_template(processor, messages)

# Generate
output = generate(model, processor, image, prompt, verbose=False)
print(output)

Multiple Images Example
images = [
    Image.open("path/to/first.jpg"),
    Image.open("path/to/second.jpg")
]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What are the differences between these images?"}
        ]
    }
]

prompt = apply_chat_template(processor, messages)
output = generate(model, processor, images, prompt, verbose=False)
print(output)