Use ONNX for cross-platform deployment, edge devices, and browser-based inference with WebGPU and Transformers.js.
ONNX (Open Neural Network Exchange) is a portable format that enables LFM inference across diverse hardware and runtimes. ONNX models run on CPUs, GPUs, NPUs, and in browsers via WebGPU—making them ideal for edge deployment and web applications.
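As a minimal illustration of that portability, the sketch below loads an exported LFM model with ONNX Runtime in Python and selects an execution provider based on what the host machine exposes. The model path matches the export layout used later on this page; treat the provider preference order as an assumption about which runtime packages you have installed.

import onnxruntime as ort

# Path produced by the export commands below; adjust to your own export directory.
MODEL_PATH = "./exports/LFM2.5-1.2B-Instruct-ONNX/onnx/model_q4.onnx"

# Prefer CUDA when onnxruntime-gpu is installed, otherwise fall back to the CPU provider.
available = ort.get_available_providers()
providers = [p for p in ("CUDAExecutionProvider", "CPUExecutionProvider") if p in available]

session = ort.InferenceSession(MODEL_PATH, providers=providers)
print("Active providers:", session.get_providers())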

LiquidONNX

LiquidONNX is the official tool for exporting LFM models to ONNX and running inference.

Installation

git clone https://github.com/Liquid4All/onnx-export.git
cd onnx-export
uv sync

# For GPU inference
uv sync --extra gpu

Supported Models

Family                       | Quantization Formats
LFM2.5, LFM2 (text)          | fp32, fp16, q4, q8
LFM2.5-VL, LFM2-VL (vision)  | fp32, fp16, q4, q8
LFM2-MoE                     | fp32, fp16, q4, q4f16
LFM2.5-Audio                 | fp32, fp16, q4, q8

Export

# Text models - export with all precisions (fp16, q4, q8)
uv run lfm2-export LiquidAI/LFM2.5-1.2B-Instruct --precision

# Vision-language models
uv run lfm2-vl-export LiquidAI/LFM2.5-VL-1.6B --precision

# MoE models
uv run lfm2-moe-export LiquidAI/LFM2-8B-A1B --precision

# Audio models
uv run lfm2-audio-export LiquidAI/LFM2.5-Audio-1.5B --precision

Inference

# Text model chat
uv run lfm2-infer --model ./exports/LFM2.5-1.2B-Instruct-ONNX/onnx/model_q4.onnx

# Vision-language with images
uv run lfm2-vl-infer --model ./exports/LFM2.5-VL-1.6B-ONNX \
    --images photo.jpg --prompt "Describe this image"

# Audio transcription (ASR)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode asr \
    --audio input.wav --precision q4

# Text-to-speech (TTS)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode tts \
    --prompt "Hello, how are you?" --output speech.wav --precision q4

For complete documentation and advanced options, see the LiquidONNX GitHub repository.

Pre-exported Models

Many LFM models are available as pre-exported ONNX packages from LiquidAI and the onnx-community organization on Hugging Face. Check the Model Library for a complete list of available formats.
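For example, a pre-exported package can be pulled straight from the Hub with huggingface_hub. The repository id below is the one used in the examples later on this page; the allow_patterns filter, which fetches only the Q4 variant plus config files, is an assumption about the repository layout.

from huggingface_hub import snapshot_download

# Download a pre-exported ONNX package, restricted to the Q4 files to save disk space.
# The onnx/ paths match those used elsewhere on this page; widen the patterns if your repo differs.
local_dir = snapshot_download(
    "LiquidAI/LFM2.5-1.2B-Instruct-ONNX",
    allow_patterns=["*.json", "onnx/model_q4.onnx*"],
)
print("Model files downloaded to:", local_dir)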

Quantization Options

Each ONNX export includes multiple precision levels:

  • Q4: recommended for most deployments; supported on WebGPU, CPU, and GPU
  • FP16: higher quality; WebGPU and GPU
  • Q8: quality/size balance; server-only (CPU/GPU)
  • FP32: full-precision baseline
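When choosing a variant programmatically, a small mapping like the sketch below can resolve a precision level to a file inside an exported package. The model_q4.onnx filename appears elsewhere on this page; the fp16, q8, and fp32 filenames are assumptions that follow the same naming convention, so verify them against your export.

# Hypothetical mapping from precision level to repo-relative ONNX filename.
# Only model_q4.onnx is confirmed on this page; the other names are assumed.
PRECISION_FILES = {
    "q4": "onnx/model_q4.onnx",      # WebGPU, CPU, GPU (recommended default)
    "fp16": "onnx/model_fp16.onnx",  # WebGPU, GPU
    "q8": "onnx/model_q8.onnx",      # CPU/GPU servers only
    "fp32": "onnx/model.onnx",       # full-precision baseline
}

def onnx_file_for(precision: str) -> str:
    """Return the repo-relative path for a given precision level."""
    if precision not in PRECISION_FILES:
        raise ValueError(f"Unknown precision {precision!r}; expected one of {sorted(PRECISION_FILES)}")
    return PRECISION_FILES[precision]

print(onnx_file_for("q4"))  # onnx/model_q4.onnx

Note that quantized models may ship external data files (for example model_q4.onnx_data) that must be downloaded alongside the .onnx file, as the Python inference example below does.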

Hugging Face Spaces

Fully deployed examples of WebGPU and ONNX inference with LFM models are available as Hugging Face Spaces.

WebGPU Inference

ONNX models run in browsers via Transformers.js with WebGPU acceleration. This enables fully client-side inference without server infrastructure.

Setup

  1. Install Transformers.js:
npm install @huggingface/transformers
  2. Enable WebGPU in your browser:
    • Chrome/Edge: Navigate to chrome://flags/#enable-unsafe-webgpu, enable, and restart
    • Verify: Check chrome://gpu for WebGPU status

Usage

import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";

const modelId = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX";

// Load model with WebGPU
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model = await AutoModelForCausalLM.from_pretrained(modelId, {
  device: "webgpu",
  dtype: "q4",  // or "fp16"
});

// Generate with streaming
const messages = [{ role: "user", content: "What is the capital of France?" }];
const input = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  return_dict: true,
});

const streamer = new TextStreamer(tokenizer, { skip_prompt: true });
const output = await model.generate({
  ...input,
  max_new_tokens: 256,
  do_sample: false,
  streamer,
});

console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));

WebGPU supports Q4 and FP16 precision. Q8 quantization is not available in browser environments.

Python Inference

Install with pip:
pip install onnxruntime transformers numpy huggingface_hub jinja2

# For GPU support
pip install onnxruntime-gpu transformers numpy huggingface_hub jinja2

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download, list_repo_files
from transformers import AutoTokenizer

# Download Q4 model (recommended)
model_id = "LiquidAI/LFM2.5-1.2B-Instruct-ONNX"
model_path = hf_hub_download(model_id, "onnx/model_q4.onnx")

# Download external data files
for f in list_repo_files(model_id):
    if f.startswith("onnx/model_q4.onnx_data"):
        hf_hub_download(model_id, f)

# Load model and tokenizer
session = ort.InferenceSession(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Prepare input
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer.encode(prompt, add_special_tokens=False)
input_ids = np.array([inputs], dtype=np.int64)

# Initialize KV cache
DTYPE_MAP = {
    "tensor(float)": np.float32,
    "tensor(float16)": np.float16,
    "tensor(int64)": np.int64
}
cache = {}
for inp in session.get_inputs():
    if inp.name in {"input_ids", "attention_mask", "position_ids"}:
        continue
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    for i, d in enumerate(inp.shape):
        if isinstance(d, str) and "sequence" in d.lower():
            shape[i] = 0
    dtype = DTYPE_MAP.get(inp.type, np.float32)
    cache[inp.name] = np.zeros(shape, dtype=dtype)

# Generate tokens
seq_len = input_ids.shape[1]
generated = []
input_names = {inp.name for inp in session.get_inputs()}

for step in range(100):
    if step == 0:
        ids = input_ids
        pos = np.arange(seq_len, dtype=np.int64).reshape(1, -1)
    else:
        ids = np.array([[generated[-1]]], dtype=np.int64)
        pos = np.array([[seq_len + len(generated) - 1]], dtype=np.int64)

    attn_mask = np.ones((1, seq_len + len(generated)), dtype=np.int64)
    feed = {"input_ids": ids, "attention_mask": attn_mask, **cache}
    if "position_ids" in input_names:
        feed["position_ids"] = pos

    outputs = session.run(None, feed)
    next_token = int(np.argmax(outputs[0][0, -1]))
    generated.append(next_token)

    # Update cache
    for i, out in enumerate(session.get_outputs()[1:], 1):
        name = out.name.replace("present_conv", "past_conv")
        name = name.replace("present.", "past_key_values.")
        if name in cache:
            cache[name] = outputs[i]

    if next_token == tokenizer.eos_token_id:
        break

print(tokenizer.decode(generated, skip_special_tokens=True))