A browser driving game you control with your hands and voice, powered by models that run fully locally. Steer by holding both hands up like a steering wheel. Speak commands to accelerate, brake, toggle the headlights, and play music. No cloud calls, no server round-trips: everything runs in your browser tab.

How it works

Two models run in parallel, entirely client-side:
  • MediaPipe Hand Landmarker tracks your hand positions via webcam at ~30 fps. The angle between your two wrists drives the steering.
  • LFM2.5-Audio-1.5B runs in a Web Worker with ONNX Runtime Web. It listens for speech via the Silero VAD and transcribes each utterance on-device. Matched keywords control game state.
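The wrist-angle steering described above can be sketched as follows. This is a minimal illustration, not the example's actual code: the function name, the normalized-coordinate assumption, and the 45° full-lock mapping are all assumptions.

```javascript
// Hypothetical sketch: turn two wrist landmarks (normalized 0..1 coords,
// as MediaPipe's Hand Landmarker reports them) into a steering value
// in [-1, 1]. The 45-degree full-lock angle is illustrative.
function steeringFromWrists(leftWrist, rightWrist) {
  const dx = rightWrist.x - leftWrist.x;
  const dy = rightWrist.y - leftWrist.y;
  const angle = Math.atan2(dy, dx);   // radians; 0 = hands level
  const fullLock = Math.PI / 4;       // tilt of 45 degrees = full steering
  return Math.max(-1, Math.min(1, angle / fullLock));
}
```

Clamping keeps extreme hand tilts from producing steering values outside the expected range.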
The audio model loads from Hugging Face and is cached in IndexedDB after the first run, so subsequent starts are instant.
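The download-once caching pattern can be sketched like this. The function and parameter names are hypothetical; storage is injected so the same logic could sit on top of IndexedDB in the browser, while any object with async `get`/`put` works for testing.

```javascript
// Hypothetical sketch of download-once model caching. In the browser,
// `storage` would wrap IndexedDB; `fetchBytes` would stream the weights
// from Hugging Face. Both are injected here for illustration.
async function fetchModelCached(url, storage, fetchBytes) {
  const cached = await storage.get(url);
  if (cached) return cached;            // cache hit: no network request
  const bytes = await fetchBytes(url);  // cache miss: download once
  await storage.put(url, bytes);        // persist for subsequent starts
  return bytes;
}
```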

Voice commands

| Say | Effect |
| --- | --- |
| speed / fast / go | Accelerate to 120 km/h |
| slow / stop / brake | Decelerate to 0 km/h |
| lights on | Enable headlights |
| lights off | Disable headlights |
| music / play | Start the techno beat |
| stop music / silence | Stop the beat |
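Keyword matching over a transcript can be sketched as below. The action names and ordering are assumptions for illustration: longer phrases are checked first so that "stop music" is not swallowed by the brake keyword "stop".

```javascript
// Hypothetical keyword matcher. Ordering matters: "stop music" must be
// tested before the brake keyword "stop". Action names are illustrative.
const COMMANDS = [
  { phrases: ["stop music", "silence"], action: "musicOff" },
  { phrases: ["lights off"],            action: "lightsOff" },
  { phrases: ["lights on"],             action: "lightsOn" },
  { phrases: ["music", "play"],         action: "musicOn" },
  { phrases: ["speed", "fast", "go"],   action: "accelerate" },
  { phrases: ["slow", "stop", "brake"], action: "brake" },
];

function matchCommand(transcript) {
  const text = transcript.toLowerCase();
  for (const { phrases, action } of COMMANDS) {
    if (phrases.some((p) => text.includes(p))) return action;
  }
  return null; // no keyword matched; ignore the utterance
}
```

Substring matching keeps the matcher forgiving of transcription noise ("go faster" still triggers acceleration via "fast").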

Prerequisites

  • Chrome 113+ or Edge 113+ (WebGPU enables fast audio inference; without it, inference falls back to slower WASM)
  • Webcam and microphone access
  • Node.js 18+ to run the dev server
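The WebGPU-or-WASM decision can be sketched with a simple feature check. `navigator.gpu` is the standard WebGPU entry point; everything else here is an assumption about how the example picks its backend.

```javascript
// Hypothetical backend selection: prefer WebGPU when the browser exposes
// it, otherwise fall back to WASM. In non-browser environments `navigator`
// may be missing entirely, so guard the access.
const hasWebGPU = typeof navigator !== "undefined" && !!navigator.gpu;
const backend = hasWebGPU ? "webgpu" : "wasm";
```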

Run locally

```bash
npm install
npm run dev
```
Then open http://localhost:3001. On first load the audio model (~900 MB at Q4 quantization) downloads from Hugging Face and is cached in your browser. Hand detection assets load from CDN and MediaPipe’s model storage.

Architecture

```
Browser tab
β”œβ”€β”€ main thread
β”‚   β”œβ”€β”€ MediaPipe HandLandmarker  (webcam β†’ hand angles β†’ steering)
β”‚   β”œβ”€β”€ Canvas 2D renderer        (road, scenery, dashboard, HUD)
β”‚   └── Web Audio API             (procedural techno synthesizer)
└── audio-worker.js (Web Worker)
    β”œβ”€β”€ Silero VAD                (mic β†’ speech segments)
    └── LFM2.5-Audio-1.5B ONNX    (speech segment β†’ transcript β†’ keyword)
```
The game loop runs on requestAnimationFrame. Hand detection is throttled to ~30 fps so it does not block rendering. Voice processing happens off the main thread and delivers results via postMessage.
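The loop described above can be sketched as two small pieces. Both function names and the message shape are illustrative assumptions, not the example's real identifiers: a throttle gate for hand detection, and a handler for messages arriving from the worker.

```javascript
// Hypothetical throttle: returns true at most once per interval, so hand
// detection runs at ~30 fps while rendering runs on every rAF tick.
function makeThrottle(intervalMs) {
  let last = -Infinity;
  return (nowMs) => {
    if (nowMs - last >= intervalMs) {
      last = nowMs;
      return true;
    }
    return false;
  };
}

// Hypothetical worker message handler: the worker posts plain
// (structured-cloneable) objects; the main thread updates game state.
function handleWorkerMessage(msg, game) {
  if (msg.type === "transcript") game.lastTranscript = msg.text;
  return game;
}
```

In the real loop, the throttle gate would be consulted inside the `requestAnimationFrame` callback before invoking the hand landmarker, and the handler would be wired to the worker's `onmessage`.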

Need help?