Use ONNX for cross-platform deployment, edge devices, and browser-based inference with WebGPU and Transformers.js.
ONNX (Open Neural Network Exchange) is a portable format that enables LFM inference across diverse hardware and runtimes. ONNX models run on CPUs, GPUs, NPUs, and in browsers via WebGPU—making them ideal for edge deployment and web applications.
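As a minimal illustration of that portability, the sketch below loads an exported LFM model with ONNX Runtime in Python and selects an execution provider based on what the host machine exposes. The model path matches the export layout used later on this page; treat the provider preference order as an assumption about which runtime packages you have installed.

import onnxruntime as ort

# Path produced by the export commands below; adjust to your own export directory.
MODEL_PATH = "./exports/LFM2.5-1.2B-Instruct-ONNX/onnx/model_q4.onnx"

# Prefer CUDA when onnxruntime-gpu is installed, otherwise fall back to the CPU provider.
available = ort.get_available_providers()
providers = [p for p in ("CUDAExecutionProvider", "CPUExecutionProvider") if p in available]

session = ort.InferenceSession(MODEL_PATH, providers=providers)
print("Active providers:", session.get_providers())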

LiquidONNX

LiquidONNX is the official tool for exporting LFM models to ONNX and running inference.

Installation

git clone https://github.com/Liquid4All/onnx-export.git
cd onnx-export
uv sync

# For GPU inference
uv sync --extra gpu

Supported Models

Family                       | Quantization Formats
LFM2.5, LFM2 (text)          | fp32, fp16, q4, q8
LFM2.5-VL, LFM2-VL (vision)  | fp32, fp16, q4, q8
LFM2-MoE                     | fp32, fp16, q4, q4f16
LFM2.5-Audio                 | fp32, fp16, q4, q8

Export

# Text models - export with all precisions (fp16, q4, q8)
uv run lfm2-export LiquidAI/LFM2.5-1.2B-Instruct --precision

# Vision-language models
uv run lfm2-vl-export LiquidAI/LFM2.5-VL-1.6B --precision

# MoE models
uv run lfm2-moe-export LiquidAI/LFM2-8B-A1B --precision

# Audio models
uv run lfm2-audio-export LiquidAI/LFM2.5-Audio-1.5B --precision

Inference

# Text model chat
uv run lfm2-infer --model ./exports/LFM2.5-1.2B-Instruct-ONNX/onnx/model_q4.onnx

# Vision-language with images
uv run lfm2-vl-infer --model ./exports/LFM2.5-VL-1.6B-ONNX \
    --images photo.jpg --prompt "Describe this image"

# Audio transcription (ASR)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode asr \
    --audio input.wav --precision q4

# Text-to-speech (TTS)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode tts \
    --prompt "Hello, how are you?" --output speech.wav --precision q4

For complete documentation and advanced options, see the LiquidONNX GitHub repository.

Pre-exported Models

Many LFM models are available as pre-exported ONNX packages from LiquidAI and the onnx-community organization on Hugging Face. Check the Model Library for a complete list of available formats.
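For example, a pre-exported package can be pulled straight from the Hub with huggingface_hub. The repository id below is the one used in the examples later on this page; the allow_patterns filter, which fetches only the Q4 variant plus config files, is an assumption about the repository layout.

from huggingface_hub import snapshot_download

# Download a pre-exported ONNX package, restricted to the Q4 files to save disk space.
# The onnx/ paths match those used elsewhere on this page; widen the patterns if your repo differs.
local_dir = snapshot_download(
    "LiquidAI/LFM2.5-1.2B-Instruct-ONNX",
    allow_patterns=["*.json", "onnx/model_q4.onnx*"],
)
print("Model files downloaded to:", local_dir)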

Quantization Options

Each ONNX export includes multiple precision levels:

  • Q4: recommended for most deployments; supported on WebGPU, CPU, and GPU
  • FP16: higher quality; WebGPU and GPU
  • Q8: quality/size balance; server-only (CPU/GPU)
  • FP32: full-precision baseline
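When choosing a variant programmatically, a small mapping like the sketch below can resolve a precision level to a file inside an exported package. The model_q4.onnx filename appears elsewhere on this page; the fp16, q8, and fp32 filenames are assumptions that follow the same naming convention, so verify them against your export.

# Hypothetical mapping from precision level to repo-relative ONNX filename.
# Only model_q4.onnx is confirmed on this page; the other names are assumed.
PRECISION_FILES = {
    "q4": "onnx/model_q4.onnx",      # WebGPU, CPU, GPU (recommended default)
    "fp16": "onnx/model_fp16.onnx",  # WebGPU, GPU
    "q8": "onnx/model_q8.onnx",      # CPU/GPU servers only
    "fp32": "onnx/model.onnx",       # full-precision baseline
}

def onnx_file_for(precision: str) -> str:
    """Return the repo-relative path for a given precision level."""
    if precision not in PRECISION_FILES:
        raise ValueError(f"Unknown precision {precision!r}; expected one of {sorted(PRECISION_FILES)}")
    return PRECISION_FILES[precision]

print(onnx_file_for("q4"))  # onnx/model_q4.onnx

Note that quantized models may ship external data files (for example model_q4.onnx_data) that must be downloaded alongside the .onnx file, as the Python inference example below does.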

Hugging Face Spaces

Fully deployed examples of WebGPU and ONNX inference with LFM models are available as Hugging Face Spaces.

WebGPU Inference

ONNX models run in browsers via Transformers.js with WebGPU acceleration. This enables fully client-side inference without server infrastructure.

Setup

  1. Install Transformers.js:
npm install @huggingface/transformers
  2. Enable WebGPU in your browser:
    • Chrome/Edge: Navigate to chrome://flags/#enable-unsafe-webgpu, enable, and restart
    • Verify: Check chrome://gpu for WebGPU status

Usage

import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";

const modelId = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX";

// Load model with WebGPU
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model = await AutoModelForCausalLM.from_pretrained(modelId, {
  device: "webgpu",
  dtype: "q4",  // or "fp16"
});

// Generate with streaming
const messages = [{ role: "user", content: "What is the capital of France?" }];
const input = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  return_dict: true,
});

const streamer = new TextStreamer(tokenizer, { skip_prompt: true });
const output = await model.generate({
  ...input,
  max_new_tokens: 256,
  do_sample: false,
  streamer,
});

console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));

WebGPU supports Q4 and FP16 precision. Q8 quantization is not available in browser environments.

Python Inference

Install with pip:
pip install onnxruntime transformers numpy huggingface_hub jinja2

# For GPU support
pip install onnxruntime-gpu transformers numpy huggingface_hub jinja2

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download, list_repo_files
from transformers import AutoTokenizer

# Download Q4 model (recommended)
model_id = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"
model_path = hf_hub_download(model_id, "onnx/model_q4.onnx")

# Download external data files
for f in list_repo_files(model_id):
    if f.startswith("onnx/model_q4.onnx_data"):
        hf_hub_download(model_id, f)

# Load model and tokenizer
session = ort.InferenceSession(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Prepare input
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer.encode(prompt, add_special_tokens=False)
input_ids = np.array([inputs], dtype=np.int64)

# Initialize KV cache
DTYPE_MAP = {
    "tensor(float)": np.float32,
    "tensor(float16)": np.float16,
    "tensor(int64)": np.int64
}
cache = {}
for inp in session.get_inputs():
    if inp.name in {"input_ids", "attention_mask", "position_ids"}:
        continue
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    for i, d in enumerate(inp.shape):
        if isinstance(d, str) and "sequence" in d.lower():
            shape[i] = 0
    dtype = DTYPE_MAP.get(inp.type, np.float32)
    cache[inp.name] = np.zeros(shape, dtype=dtype)

# Generate tokens
seq_len = input_ids.shape[1]
generated = []
input_names = {inp.name for inp in session.get_inputs()}

for step in range(100):
    if step == 0:
        ids = input_ids
        pos = np.arange(seq_len, dtype=np.int64).reshape(1, -1)
    else:
        ids = np.array([[generated[-1]]], dtype=np.int64)
        pos = np.array([[seq_len + len(generated) - 1]], dtype=np.int64)

    attn_mask = np.ones((1, seq_len + len(generated)), dtype=np.int64)
    feed = {"input_ids": ids, "attention_mask": attn_mask, **cache}
    if "position_ids" in input_names:
        feed["position_ids"] = pos

    outputs = session.run(None, feed)
    next_token = int(np.argmax(outputs[0][0, -1]))
    generated.append(next_token)

    # Update cache
    for i, out in enumerate(session.get_outputs()[1:], 1):
        name = out.name.replace("present_conv", "past_conv")
        name = name.replace("present.", "past_key_values.")
        if name in cache:
            cache[name] = outputs[i]

    if next_token == tokenizer.eos_token_id:
        break

print(tokenizer.decode(generated, skip_special_tokens=True))