ONNX provides a platform-agnostic inference specification that lets models run on device-specific runtimes targeting CPU, GPU, NPU, and WebGPU.
Use ONNX for cross-platform deployment, edge devices, and browser-based inference with WebGPU and Transformers.js.
ONNX (Open Neural Network Exchange) is a portable format that enables LFM inference across diverse hardware and runtimes. ONNX models run on CPUs, GPUs, and NPUs, and in browsers via WebGPU, making them ideal for edge deployment and web applications.
```bash
# Text models - export with all precisions (fp16, q4, q8)
uv run lfm2-export LiquidAI/LFM2.5-1.2B-Instruct --precision

# Vision-language models
uv run lfm2-vl-export LiquidAI/LFM2.5-VL-1.6B --precision

# MoE models
uv run lfm2-moe-export LiquidAI/LFM2-8B-A1B --precision

# Audio models
uv run lfm2-audio-export LiquidAI/LFM2.5-Audio-1.5B --precision
```
```bash
# Text model chat
uv run lfm2-infer --model ./exports/LFM2.5-1.2B-Instruct-ONNX/onnx/model_q4.onnx

# Vision-language with images
uv run lfm2-vl-infer --model ./exports/LFM2.5-VL-1.6B-ONNX \
    --images photo.jpg --prompt "Describe this image"

# Audio transcription (ASR)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode asr \
    --audio input.wav --precision q4

# Text-to-speech (TTS)
uv run lfm2-audio-infer LFM2.5-Audio-1.5B-ONNX --mode tts \
    --prompt "Hello, how are you?" --output speech.wav --precision q4
```
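The exported `.onnx` files can also be consumed programmatically by any ONNX Runtime binding. The sketch below is illustrative only: it assumes the `onnxruntime-node` package and the export path from the commands above, and it simply loads the Q4 text model and inspects its input and output signatures (tokenization and KV-cache handling are left out).

```typescript
import * as ort from "onnxruntime-node";

// Load the Q4 export produced by lfm2-export (path assumed from the example above).
const session = await ort.InferenceSession.create(
  "./exports/LFM2.5-1.2B-Instruct-ONNX/onnx/model_q4.onnx",
  { executionProviders: ["cpu"] },
);

// Inspect the graph's input and output names before wiring up a full generation loop.
console.log("inputs:", session.inputNames);
console.log("outputs:", session.outputNames);
```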
Many LFM models are available as pre-exported ONNX packages from LiquidAI and the onnx-community. Check the Model Library for a complete list of available formats.
Each ONNX export includes multiple precision levels. Q4 is recommended for most deployments and supports WebGPU, CPU, and GPU. FP16 offers higher quality and works on WebGPU and GPU. Q8 provides a quality/size balance but is server-only (CPU/GPU). FP32 is the full precision baseline.
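For browser deployment, Transformers.js exposes these precision levels through its `dtype` option alongside a `device` option for WebGPU. The snippet below is a minimal sketch: the model ID is a placeholder (check the Model Library for real package names), and the available `dtype` values should be confirmed against the files shipped with a given export.

```typescript
import { pipeline } from "@huggingface/transformers";

// Placeholder model ID for a pre-exported ONNX package.
const generator = await pipeline("text-generation", "LiquidAI/LFM2.5-1.2B-Instruct-ONNX", {
  device: "webgpu", // run in the browser via WebGPU
  dtype: "q4",      // Q4 is the recommended precision for WebGPU deployments
});

const output = await generator(
  [{ role: "user", content: "Give me a one-sentence summary of ONNX." }],
  { max_new_tokens: 64 },
);
console.log(output);
```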