# MLX

> MLX is Apple's machine learning framework optimized for Apple Silicon. It provides efficient inference on Mac devices with M-series chips (M1, M2, M3, M4) using Metal acceleration for GPU computing.

<Tip>
  Use MLX for running models on Apple Silicon Macs with Metal GPU acceleration.
</Tip>

MLX uses the unified memory architecture of Apple Silicon, so the CPU and GPU operate on the same data without copying it between them. The `mlx-lm` package provides a simple interface for loading and serving LLMs.

## Installation

Install the MLX language model package:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
pip install mlx-lm
```
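To confirm the install from Python, one option is to look up the installed distribution's version with the standard library. This is a minimal sketch; `installed_version` is a hypothetical helper written for this page, not part of `mlx-lm`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a distribution, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# "mlx-lm" is the distribution name used in the pip command above
print("mlx-lm version:", installed_version("mlx-lm") or "not installed")
```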

## Basic Usage

Load a model and its tokenizer with `load()`, format the prompt with the tokenizer's chat template, then call `generate()`.

See the [Models page](/lfm/models/complete-library) for all available MLX models, or browse MLX community models at [mlx-community LFM2 models](https://huggingface.co/models?sort=created\&search=mlx-communityLFM2).

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from mlx_lm import load, generate

# Load model and tokenizer
model, tokenizer = load("mlx-community/LFM2-1.2B-8bit")

# Define the user prompt
prompt = "What is machine learning?"

# Apply the chat template; tokenize=False returns the formatted prompt as a string
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Generate text; verbose=True also prints the output and timing stats as it runs
response = generate(model, tokenizer, prompt=prompt, verbose=True)
print(response)
```
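The `messages` list passed to the chat template can also carry a system prompt and earlier turns. The sketch below builds such a list in plain Python; `build_messages` is a hypothetical helper, and whether a given model's chat template accepts a `system` role is an assumption to check against its model card.

```python
def build_messages(user_prompt, system_prompt=None, history=None):
    """Build a chat-format message list for tokenizer.apply_chat_template().

    history is an optional list of (user_text, assistant_text) pairs from earlier turns.
    """
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    for user_text, assistant_text in history or []:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The newest user message always goes last
    messages.append({"role": "user", "content": user_prompt})
    return messages


messages = build_messages(
    "And what is deep learning?",
    system_prompt="You are a concise assistant.",
    history=[("What is machine learning?", "Learning patterns from data.")],
)
for message in messages:
    print(message["role"], "->", message["content"])
```

The resulting list can be passed to `tokenizer.apply_chat_template(...)` in place of the single-message list used above.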

### Generation Parameters

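In recent `mlx-lm` releases, sampling settings are not passed to `generate()` directly. Build a sampler with `make_sampler()` and, for repetition control, logits processors with `make_logits_processors()` (both in `mlx_lm.sample_utils`), then pass them to `generate()`. Older releases accepted some of these settings as direct keyword arguments, so check the signatures in your installed version. Key parameters:

* **`temp`** (`float`): Sampling temperature (0.0 = greedy, deterministic decoding; higher = more random). Typical range: 0.1-2.0
* **`top_p`** (`float`): Nucleus sampling; samples only from the most probable tokens whose cumulative probability reaches `top_p`. Typical range: 0.1-1.0
* **`top_k`** (`int`): Limits sampling to the k most probable tokens. Typical range: 1-100
* **`min_p`** (`float`): Drops tokens whose probability is below `min_p` times that of the most probable token
* **`max_tokens`** (`int`): Maximum number of tokens to generate (passed to `generate()`)
* **`repetition_penalty`** (`float`): Penalty for repeating tokens (>1.0 = discourage repetition), passed to `make_logits_processors()`. Typical range: 1.0-1.5

Example with custom parameters:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from mlx_lm.sample_utils import make_logits_processors, make_sampler

# Sampling settings go into a sampler; repetition control into logits processors
sampler = make_sampler(temp=0.3, min_p=0.15)
logits_processors = make_logits_processors(repetition_penalty=1.05)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    sampler=sampler,
    logits_processors=logits_processors,
    max_tokens=512,
)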
```
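To build intuition for what these sampling settings do, the following pure-Python sketch applies temperature, top-k, top-p, and min-p to a toy four-token distribution. It is an illustration only, not `mlx-lm`'s implementation; the function names are hypothetical, and the top-p variant shown (keep the smallest set of top tokens whose cumulative probability reaches `top_p`) is one common convention.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out tokens not in `keep` and rescale the rest to sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(ranked[:k]))


def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in ranked:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return renormalize(probs, keep)


def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs)
    keep = {i for i, p in enumerate(probs) if p >= threshold}
    return renormalize(probs, keep)


logits = [2.0, 1.0, 0.0, -1.0]  # toy scores for a four-token vocabulary
probs = softmax(logits)
sharper = softmax(logits, temperature=0.5)

print("temperature 1.0:", [round(p, 3) for p in probs])
print("temperature 0.5:", [round(p, 3) for p in sharper])
print("top_k=2:        ", [round(p, 3) for p in top_k_filter(probs, 2)])
print("top_p=0.8:      ", [round(p, 3) for p in top_p_filter(probs, 0.8)])
print("min_p=0.15:     ", [round(p, 3) for p in min_p_filter(probs, 0.15)])
```

On this toy input, all three filters keep only the two most probable tokens, and lowering the temperature moves more probability onto the top token.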

## Streaming Generation

Stream responses with `stream_generate()`:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/LFM2-1.2B-8bit")

messages = [{"role": "user", "content": "Tell me a story about space exploration."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenizer=False, add_generation_prompt=True
)

# Each yielded response carries the newly generated text in its .text attribute
for response in stream_generate(model, tokenizer, prompt=prompt, max_tokens=512):
    print(response.text, end="", flush=True)
```
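When the full text is needed after streaming (for logging or post-processing), the chunks can be accumulated while they are printed. The sketch below uses stand-in chunks so it runs without a model; `collect_stream` is a hypothetical helper, and it assumes each streamed item is either a plain string or an object with a `.text` attribute.

```python
from types import SimpleNamespace


def collect_stream(chunks, echo=True):
    """Print streamed chunks as they arrive and return the concatenated text."""
    pieces = []
    for chunk in chunks:
        # Assumption: a chunk is a plain string or exposes its text as .text
        text = chunk if isinstance(chunk, str) else chunk.text
        if echo:
            print(text, end="", flush=True)
        pieces.append(text)
    if echo:
        print()
    return "".join(pieces)


# Stand-in chunks so the sketch runs without loading a model
fake_chunks = [SimpleNamespace(text=t) for t in ["Once ", "upon ", "a time."]]
story = collect_stream(fake_chunks)
```

With a loaded model, the same helper could wrap the generator directly, for example `collect_stream(stream_generate(model, tokenizer, prompt=prompt, max_tokens=512))`.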

## Serving with mlx-lm

The `mlx-lm` package includes a server that exposes an OpenAI-compatible HTTP API. Start it with:

```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
mlx_lm.server --model mlx-community/LFM2-1.2B-8bit --port 8080
```

### Using the Server

Once the server is running, query it with the OpenAI Python client:

```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="mlx-community/LFM2-1.2B-8bit",
    messages=[
        {"role": "user", "content": "Explain quantum computing."}
    ],
    temperature=0.3,
    max_tokens=512,
    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
```

You can also use curl to interact with the server:

<Accordion title="Curl request example">
  ```bash theme={"theme":{"light":"github-light","dark":"github-dark"}}
  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mlx-community/LFM2-1.2B-8bit",
      "messages": [{"role": "user", "content": "Hello!"}],
      "temperature": 0.3,
      "min_p": 0.15,
      "repetition_penalty": 1.05
    }'
  ```
</Accordion>
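If adding the `openai` package is not desirable, the same request can be sent with Python's standard library. This is a minimal sketch that mirrors the curl example; `build_chat_request` and `send_chat_request` are hypothetical helpers, and the response parsing assumes the usual OpenAI chat-completion shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080/v1",
                       model="mlx-community/LFM2-1.2B-8bit"):
    """Build a POST request for the server's chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.3,
        "min_p": 0.15,
        "repetition_penalty": 1.05,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Hello!")
print(request.get_method(), request.full_url)
# With the server running: print(send_chat_request(request))
```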

## Vision Models

LFM2-VL models support both text and image inputs for multimodal inference. Use `mlx_vlm` (installed separately with `pip install mlx-vlm`) to load and generate with vision models:

<Accordion title="Single Image Example">
  ```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
  from mlx_vlm import load, generate
  from mlx_vlm.prompt_utils import apply_chat_template
  from PIL import Image

  # Load vision model
  model, processor = load("mlx-community/LFM2-VL-1.6B-8bit")

  # Load image
  image = Image.open("path/to/image.jpg")

  # Create prompt
  messages = [
      {
          "role": "user",
          "content": [
              {"type": "image"},
              {"type": "text", "text": "What's in this image?"}
          ]
      }
  ]

  # Apply chat template
  prompt = apply_chat_template(processor, messages)

  # Generate
  output = generate(model, processor, image, prompt, verbose=False)
  print(output)
  ```
</Accordion>

<Accordion title="Multiple Images Example">
  ```python theme={"theme":{"light":"github-light","dark":"github-dark"}}
  # Reuses model, processor, and the imports from the single-image example
  images = [
      Image.open("path/to/first.jpg"),
      Image.open("path/to/second.jpg")
  ]

  messages = [
      {
          "role": "user",
          "content": [
              {"type": "image"},
              {"type": "image"},
              {"type": "text", "text": "What are the differences between these images?"}
          ]
      }
  ]

  prompt = apply_chat_template(processor, messages)
  output = generate(model, processor, images, prompt, verbose=False)
  print(output)
  ```
</Accordion>
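Both examples use the same message layout: one `{"type": "image"}` placeholder per image, followed by the text question. A small helper can generate that structure for any number of images. This is a sketch in plain Python; `build_vision_messages` is a hypothetical name, not part of `mlx_vlm`.

```python
def build_vision_messages(question, num_images):
    """Build a single-turn user message with one image placeholder per image."""
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


messages = build_vision_messages(
    "What are the differences between these images?", num_images=2
)
print(messages)
```

Keeping the placeholder count tied to `len(images)` avoids a mismatch between the prompt and the images passed to `generate()`.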
