View Source Code: browse the complete example on GitHub.
This example showcases the LFM2.5-Audio-1.5B model running entirely in the browser using WebGPU and ONNX Runtime Web. The demo provides three audio processing modes: automatic speech recognition, text-to-speech synthesis, and interleaved audio-text conversation.

What's Inside?

The demo provides three primary capabilities powered by LFM2.5-Audio-1.5B:
  • ASR (Automatic Speech Recognition): Convert spoken audio into accurate text transcriptions
  • TTS (Text-to-Speech): Transform written text into natural-sounding audio output
  • Interleaved Mode: Enable mixed conversations combining both audio and text inputs
All processing happens locally in your browser; no data is sent to external servers.

Quick Start

  1. Clone the repository
    git clone https://github.com/Liquid4All/cookbook.git
    cd cookbook/examples/audio-webgpu-demo
    
  2. Verify you have npm installed on your system
    npm --version
    
  3. Install dependencies
    npm install
    
  4. Start the development server
    npm run dev
    
  5. Access the application at http://localhost:5173 in your browser

Understanding the Architecture

This demo uses the LFM2.5-Audio-1.5B model, a 1.5 billion parameter audio model that handles both speech recognition and speech synthesis. The model has been quantized and converted to ONNX format for efficient browser-based inference.

Model Architecture

The implementation uses quantized ONNX models sourced from the LiquidAI/LFM2.5-Audio-1.5B-ONNX repository on Hugging Face. These models are optimized to run with WebGPU acceleration, providing fast inference directly in the browser.
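
As a rough sketch of what browser-side loading can look like with ONNX Runtime Web's WebGPU backend (the model file name, session options, and fallback order below are assumptions for illustration, not the demo's actual code):

  import * as ort from "onnxruntime-web/webgpu";

  // Hypothetical file name; the quantized artifacts live in the
  // LiquidAI/LFM2.5-Audio-1.5B-ONNX repository on Hugging Face.
  const MODEL_URL =
    "https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B-ONNX/resolve/main/onnx/model_q4.onnx";

  export async function createSession(): Promise<ort.InferenceSession> {
    // Prefer the WebGPU execution provider; fall back to WASM when WebGPU is unavailable.
    return ort.InferenceSession.create(MODEL_URL, {
      executionProviders: ["webgpu", "wasm"],
      graphOptimizationLevel: "all",
    });
  }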

Three Operation Modes

1. Automatic Speech Recognition (ASR)
  • Input: Audio file or microphone recording (see the capture sketch after this list)
  • Output: Text transcription
  • Use case: Transcribe meetings, lectures, or voice notes
2. Text-to-Speech (TTS)
  • Input: Written text
  • Output: Natural-sounding audio
  • Use case: Create voice assistants, audiobooks, or accessibility features
3. Interleaved Mode
  • Input: Mixed audio and text
  • Output: Conversational responses in text or audio
  • Use case: Interactive voice assistants and chatbots
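
For the ASR mode, audio must reach the model as raw PCM samples. Below is a minimal sketch of capturing a microphone clip in the browser and resampling it to mono Float32 PCM; the 16 kHz target rate and helper names are assumptions, not taken from the demo:

  // Record a short microphone clip and return mono Float32 PCM samples.
  async function recordClip(durationMs: number): Promise<Float32Array> {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const recorder = new MediaRecorder(stream);
    const chunks: Blob[] = [];
    recorder.ondataavailable = (e) => chunks.push(e.data);
    const stopped = new Promise<void>((resolve) => (recorder.onstop = () => resolve()));

    recorder.start();
    await new Promise((r) => setTimeout(r, durationMs));
    recorder.stop();
    await stopped;
    stream.getTracks().forEach((t) => t.stop());

    // 16 kHz is an assumed target rate for the model's feature extractor.
    return decodeToPcm(new Blob(chunks), 16000);
  }

  // Decode compressed audio (e.g. the recorder's webm/opus output) and resample
  // it to mono at targetRate using an OfflineAudioContext.
  async function decodeToPcm(blob: Blob, targetRate: number): Promise<Float32Array> {
    const decoded = await new AudioContext().decodeAudioData(await blob.arrayBuffer());
    const offline = new OfflineAudioContext(1, Math.ceil(decoded.duration * targetRate), targetRate);
    const source = offline.createBufferSource();
    source.buffer = decoded;
    source.connect(offline.destination);
    source.start();
    return (await offline.startRendering()).getChannelData(0);
  }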

System Requirements

WebGPU Support Required: This demo requires a modern web browser with WebGPU support:
  • Chrome 113 or later (recommended)
  • Edge 113 or later
If WebGPU is not enabled by default, you may need to manually activate it via browser flags:
  • Chrome: chrome://flags/#enable-unsafe-webgpu
  • Edge: edge://flags/#enable-unsafe-webgpu
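
Before loading the model, the app can check for WebGPU and show a fallback message when it is missing. This is a generic feature check, not necessarily how the demo itself handles it:

  // Feature-detect WebGPU before creating an inference session.
  // `as any` keeps the sketch self-contained; real projects would use @webgpu/types.
  async function hasWebGpu(): Promise<boolean> {
    const gpu = (navigator as any).gpu;
    if (!gpu) return false;
    try {
      return (await gpu.requestAdapter()) !== null;
    } catch {
      return false;
    }
  }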

Model Licensing

LFM 1.0 License: The model weights are distributed under the LFM 1.0 License. For complete licensing details, refer to the official Hugging Face repository.

Build for Production

To create an optimized production build:
npm run build
The build output will be in the dist/ directory, ready for deployment to any web server.
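If the project uses Vite's standard scripts (a reasonable assumption given the default dev-server port 5173, but worth verifying in package.json), the production build can also be previewed locally before deployment:
npm run preview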

Further Improvements

Potential enhancements for this demo:
  • Streaming Inference: Real-time processing for longer audio inputs
  • Voice Customization: Add controls for pitch, speed, and voice characteristics in TTS mode
  • Noise Reduction: Integrate preprocessing to improve ASR accuracy in noisy environments
  • Batch Processing: Support for processing multiple audio files simultaneously
  • Model Caching: Optimize initial load time with better caching strategies
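
As one hedged sketch of the model-caching idea above, the browser's Cache Storage API can keep the downloaded model bytes across visits (the cache name is illustrative and this is not the demo's current behavior):

  // Fetch model bytes through the Cache Storage API so repeat visits skip the download.
  async function fetchModelCached(url: string): Promise<ArrayBuffer> {
    const cache = await caches.open("lfm-audio-models-v1"); // illustrative cache name
    let response = await cache.match(url);
    if (!response) {
      response = await fetch(url);
      if (!response.ok) throw new Error(`Model download failed: ${response.status}`);
      await cache.put(url, response.clone());
    }
    return response.arrayBuffer();
  }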

Need help?