API Spec

Latest version: v0.7.0.

Leap

Leap is the static entry point for loading on-device models.

public struct Leap {
    public static func load(
        url: URL,
        options: LiquidInferenceEngineOptions? = nil
    ) async throws -> ModelRunner
}

load(url:options:)

  • Loads a local model file (either a .bundle package or a .gguf checkpoint) and returns a ModelRunner instance.
  • Throws LeapError.modelLoadingFailure if the file cannot be loaded.
  • Automatically detects companion files placed alongside your model:
      • mmproj-*.gguf enables multimodal vision tokens for both bundle and GGUF flows.
      • Audio decoder artifacts whose filename contains "audio" and "decoder" with a .gguf or .bin extension unlock audio input/output for compatible checkpoints.
  • Must be called from an async context (for example inside an async function or a Task). Keep the returned ModelRunner alive while you interact with the model.
// ExecuTorch backend via .bundle
let bundleURL = Bundle.main.url(forResource: "qwen3-0_6b", withExtension: "bundle")!
let runner = try await Leap.load(url: bundleURL)

// llama.cpp backend via .gguf
let ggufURL = Bundle.main.url(forResource: "qwen3-0_6b", withExtension: "gguf")!
let ggufRunner = try await Leap.load(url: ggufURL)

LiquidInferenceEngineOptions

Pass a LiquidInferenceEngineOptions value when you need to override the default runtime configuration.

public struct LiquidInferenceEngineOptions {
    public var bundlePath: String
    public let cacheOptions: LiquidCacheOptions?
    public let cpuThreads: UInt32?
    public let contextSize: UInt32?
    public let nGpuLayers: UInt32?
    public let mmProjPath: String?
    public let audioDecoderPath: String?
    public let chatTemplate: String?
    public let audioTokenizerPath: String?
    public let extras: String?
}
  • bundlePath: Path to the model file on disk. When you call Leap.load(url:), this is filled automatically.
  • cacheOptions: Configure persistence of KV-cache data between generations.
  • cpuThreads: Number of CPU threads for token generation.
  • contextSize: Override the default maximum context length for the model.
  • nGpuLayers: Number of layers to offload to GPU (for macOS/macCatalyst targets with Metal support).
  • mmProjPath: Optional path to an auxiliary multimodal projection model. Leave nil to auto-detect a sibling mmproj-*.gguf.
  • audioDecoderPath: Optional audio decoder model. Leave nil to auto-detect nearby decoder artifacts.
  • chatTemplate: Advanced override for backend chat templating.
  • audioTokenizerPath: Optional tokenizer for audio-capable checkpoints.
  • extras: Backend-specific configuration payload (advanced use only).
info

Backend selection is automatic: .bundle files run on the ExecuTorch backend, while .gguf checkpoints use the embedded llama.cpp backend. Bundled models reference their projection data in metadata; GGUF checkpoints look for sibling companion files (multimodal projection, audio decoder, audio tokenizer) unless you override the paths through LiquidInferenceEngineOptions. Ensure these artifacts are co-located when you want vision or audio features.

Example overriding the number of CPU threads and context size:

let options = LiquidInferenceEngineOptions(
    bundlePath: bundleURL.path,
    cpuThreads: 6,
    contextSize: 8192
)
let runner = try await Leap.load(url: bundleURL, options: options)
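
You can also point the runtime at companion artifacts explicitly instead of relying on sibling auto-detection. A minimal sketch, assuming the initializer exposes mmProjPath as an optional parameter with a default (as it does for cpuThreads and contextSize above) and that the resource names below match artifacts you actually ship:

let visionModelURL = Bundle.main.url(forResource: "my-vision-model", withExtension: "gguf")!
let projectionURL = Bundle.main.url(forResource: "mmproj-my-vision-model", withExtension: "gguf")!

let visionOptions = LiquidInferenceEngineOptions(
    bundlePath: visionModelURL.path,
    mmProjPath: projectionURL.path // assumed initializer label; leave nil to auto-detect a sibling mmproj-*.gguf
)
let visionRunner = try await Leap.load(url: visionModelURL, options: visionOptions)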

ModelRunner

A ModelRunner represents a loaded model instance. The SDK returns concrete ModelRunner implementations, but your code only needs the protocol surface:

public protocol ModelRunner {
    func createConversation(systemPrompt: String?) -> Conversation
    func createConversationFromHistory(history: [ChatMessage]) -> Conversation
    func generateResponse(
        conversation: Conversation,
        generationOptions: GenerationOptions?,
        onResponseCallback: @escaping (MessageResponse) -> Void,
        onErrorCallback: ((LeapError) -> Void)?
    ) -> GenerationHandler
    func unload() async
    var modelId: String { get }
}

Lifecycle

  • Create conversations using createConversation(systemPrompt:) or createConversationFromHistory(history:).
  • Hold a strong reference to the ModelRunner for as long as you need to perform generations.
  • Call unload() when you are done to release native resources (optional, happens automatically on deinit).
  • Access modelId to identify the loaded model (for analytics, debugging, or UI labels). See the sketch after this list for these steps put together.
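
A compact sketch of that lifecycle, from load to unload (generation calls omitted; bundleURL is resolved as in the earlier examples):

// Keep the runner alive for as long as the chat session lives.
let runner = try await Leap.load(url: bundleURL)
print("Loaded model:", runner.modelId)

let conversation = runner.createConversation(systemPrompt: "You are a helpful assistant.")

// ... perform generations through `conversation` ...

// Optionally release native resources explicitly when the session ends.
await runner.unload()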

Low-level generation API

generateResponse(...) drives generation with callbacks and returns a GenerationHandler you can store to cancel the run. Most apps call the higher-level streaming helpers on Conversation, but you can invoke this method directly when you need fine-grained control (for example, integrating with custom async primitives).

let handler = runner.generateResponse(
    conversation: conversation,
    generationOptions: options,
    onResponseCallback: { message in
        // Handle MessageResponse values here
    },
    onErrorCallback: { error in
        // Handle LeapError
    }
)

// Stop generation early if needed
handler.stop()

GenerationHandler

public protocol GenerationHandler: Sendable {
    func stop()
}

The handler returned by ModelRunner.generateResponse or Conversation.generateResponse(..., onResponse:) lets you cancel generation without tearing down the conversation.

Conversation

Conversation tracks chat state and provides streaming helpers built on top of the model runner.

public class Conversation {
    public let modelRunner: ModelRunner
    public private(set) var history: [ChatMessage]
    public private(set) var functions: [LeapFunction]
    public private(set) var isGenerating: Bool

    public init(modelRunner: ModelRunner, history: [ChatMessage])

    public func registerFunction(_ function: LeapFunction)
    public func exportToJSON() throws -> [[String: Any]]

    public func generateResponse(
        userTextMessage: String,
        generationOptions: GenerationOptions? = nil
    ) -> AsyncThrowingStream<MessageResponse, Error>

    public func generateResponse(
        message: ChatMessage,
        generationOptions: GenerationOptions? = nil
    ) -> AsyncThrowingStream<MessageResponse, Error>

    @discardableResult
    public func generateResponse(
        message: ChatMessage,
        generationOptions: GenerationOptions? = nil,
        onResponse: @escaping (MessageResponse) -> Void
    ) -> GenerationHandler?
}

Properties

  • history: Copy of the accumulated chat messages. The SDK appends the assistant reply when a generation finishes successfully.
  • functions: Functions registered via registerFunction(_:) for function calling.
  • isGenerating: Boolean flag indicating whether a generation is currently running. Attempts to start a new generation while this is true immediately finish with an empty stream (or nil handler for the callback variant). The sketch after this list shows one way to guard against that.
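
For example, a minimal guard in your own send helper (send(_:) and render(_:) are placeholder names, not SDK API):

func send(_ text: String) {
    // Ignore the request while a previous reply is still streaming.
    guard !conversation.isGenerating else { return }

    let message = ChatMessage(role: .user, content: [.text(text)])
    Task {
        do {
            for try await response in conversation.generateResponse(message: message) {
                render(response)
            }
        } catch {
            print("Generation failed: \(error)")
        }
    }
}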

Streaming Convenience

The most common pattern is to use the async-stream helpers:

let user = ChatMessage(role: .user, content: [.text("Hello! What can you do?")])

Task {
    do {
        for try await response in conversation.generateResponse(
            message: user,
            generationOptions: GenerationOptions(temperature: 0.7)
        ) {
            switch response {
            case .chunk(let delta):
                print(delta, terminator: "")
            case .reasoningChunk(let thought):
                print("Reasoning:", thought)
            case .functionCall(let calls):
                handleFunctionCalls(calls)
            case .audioSample(let samples, let sampleRate):
                audioRenderer.enqueue(samples, sampleRate: sampleRate)
            case .complete(let completion):
                let text = completion.message.content.compactMap { item -> String? in
                    if case .text(let value) = item { return value }
                    return nil
                }.joined()
                print("\nComplete:", text)
                if let stats = completion.stats {
                    print("Prompt tokens: \(stats.promptTokens), completion tokens: \(stats.completionTokens)")
                }
            }
        }
    } catch {
        print("Generation failed: \(error)")
    }
}

Cancelling the task that iterates the stream stops generation and cleans up native resources.
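
For example, keep a handle to the Task that consumes the stream and cancel it from a stop control (streamingTask, user, and updateUI(with:) are placeholders from your own code):

let streamingTask = Task {
    for try await response in conversation.generateResponse(message: user) {
        updateUI(with: response)
    }
}

// Later, when the user taps "Stop":
streamingTask.cancel()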

Callback Convenience

Use generateResponse(message:onResponse:) when you prefer callbacks or need to integrate with imperative UI components:

let handler = conversation.generateResponse(message: user) { response in
    updateUI(with: response)
}

// Later
handler?.stop()

If a generation is already running, the method returns nil and emits a .complete message with finishReason == .stop via the callback.

warning

The callback overload does not surface generation errors. Use the async-stream helper or call ModelRunner.generateResponse with onErrorCallback when you need error handling.

Export Chat History

exportToJSON() serializes the conversation history into a [[String: Any]] payload that mirrors OpenAI's chat-completions format. This is useful for persistence, analytics, or debugging tools.
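
A small persistence sketch, assuming you simply write the payload to the app's documents directory (the file name is arbitrary):

let payload = try conversation.exportToJSON()
let data = try JSONSerialization.data(withJSONObject: payload, options: [.prettyPrinted])

let historyURL = FileManager.default
    .urls(for: .documentDirectory, in: .userDomainMask)[0]
    .appendingPathComponent("chat-history.json")
try data.write(to: historyURL)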

MessageResponse

public enum MessageResponse {
    case chunk(String)
    case reasoningChunk(String)
    case audioSample(samples: [Float], sampleRate: Int)
    case functionCall([LeapFunctionCall])
    case complete(MessageCompletion)
}

public struct MessageCompletion {
    public let message: ChatMessage
    public let finishReason: GenerationFinishReason
    public let stats: GenerationStats?

    public var info: GenerationCompleteInfo { get }
}

public struct GenerationCompleteInfo {
    public let finishReason: GenerationFinishReason
    public let stats: GenerationStats?
}

public struct GenerationStats {
    public var promptTokens: UInt64
    public var completionTokens: UInt64
    public var totalTokens: UInt64
    public var tokenPerSecond: Float
}
  • chunk: Partial assistant text emitted during streaming.
  • reasoningChunk: Model reasoning tokens wrapped between <think> / </think> (only for models that expose reasoning traces).
  • audioSample: PCM audio frames streamed from audio-capable checkpoints. Feed them into an audio renderer or buffer for later playback.
  • functionCall: One or more function/tool invocations requested by the model. See the Function Calling guide.
  • complete: Signals the end of generation. Access the assembled assistant reply through completion.message. Stats and finish reason live on the completion object; completion.info is provided for backward compatibility.

Errors surfaced during streaming are delivered through the thrown error of AsyncThrowingStream, or via the onErrorCallback closure when using the lower-level API.

Chat Messages

Roles

public enum ChatMessageRole: String {
    case user
    case system
    case assistant
    case tool
}

Include .tool messages when you append function-call results back into the conversation.
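
A hedged sketch of feeding a tool result back into the exchange, assuming your tool produces a JSON string and that the .tool message is sent like any other; see the Function Calling guide for the authoritative flow:

// Payload produced by your own tool implementation (hypothetical).
let toolResult = #"{"temperature_c": 21, "condition": "sunny"}"#

let toolMessage = ChatMessage(role: .tool, content: [.text(toolResult)])

for try await response in conversation.generateResponse(message: toolMessage) {
    process(response)
}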

Message Structure

public struct ChatMessage {
    public var role: ChatMessageRole
    public var content: [ChatMessageContent]
    public var reasoningContent: String?
    public var functionCalls: [LeapFunctionCall]?

    public init(
        role: ChatMessageRole,
        content: [ChatMessageContent],
        reasoningContent: String? = nil,
        functionCalls: [LeapFunctionCall]? = nil
    )

    public init(from json: [String: Any]) throws
}
  • content: Ordered fragments of the message. The SDK supports .text, .image, and .audio parts.
  • reasoningContent: Optional text produced inside <think> tags by eligible models.
  • functionCalls: Attach the calls returned by MessageResponse.functionCall when you include tool execution results in the history.

Message Content

public enum ChatMessageContent {
    case text(String)
    case image(Data) // JPEG bytes
    case audio(Data) // WAV bytes

    public init(from json: [String: Any]) throws
}

Provide JPEG-encoded bytes for .image and WAV data for .audio. Helper initializers such as ChatMessageContent.fromUIImage, ChatMessageContent.fromNSImage, ChatMessageContent.fromWAVData, and ChatMessageContent.fromFloatSamples(_:sampleRate:channelCount:) simplify interop with platform-native buffers. On the wire, image parts are encoded as OpenAI-style image_url payloads and audio parts as input_audio arrays with Base64 data.
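
A minimal sketch of a mixed text-and-image message using the raw .image case (photoURL is your own file URL; the helper initializers above are an alternative, but their exact signatures are not shown here):

let jpegData = try Data(contentsOf: photoURL) // JPEG bytes
let visionMessage = ChatMessage(
    role: .user,
    content: [
        .text("What landmark is shown in this photo?"),
        .image(jpegData)
    ]
)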

GenerationOptions

Tune generation behaviour with GenerationOptions.

public struct GenerationOptions {
    public var temperature: Float?
    public var topP: Float?
    public var minP: Float?
    public var repetitionPenalty: Float?
    public var jsonSchemaConstraint: String?
    public var functionCallParser: LeapFunctionCallParserProtocol?

    public init(
        temperature: Float? = nil,
        topP: Float? = nil,
        minP: Float? = nil,
        repetitionPenalty: Float? = nil,
        jsonSchemaConstraint: String? = nil,
        functionCallParser: LeapFunctionCallParserProtocol? = LFMFunctionCallParser()
    )
}
  • Leave a field as nil to fall back to the defaults packaged with the model bundle.
  • functionCallParser controls how tool-call tokens are parsed. LFMFunctionCallParser (the default) handles Liquid Foundation Model Pythonic function calling. Supply HermesFunctionCallParser() for Hermes/Qwen3 formats, or set the parser to nil to receive raw tool-call text in MessageResponse.chunk.
  • jsonSchemaConstraint activates constrained generation. Use setResponseFormat(type:) to populate it from a type annotated with the @Generatable macro.

extension GenerationOptions {
    public mutating func setResponseFormat<T: GeneratableType>(type: T.Type) throws {
        self.jsonSchemaConstraint = try JSONSchemaGenerator.getJSONSchema(for: type)
    }
}

var options = GenerationOptions(temperature: 0.6, topP: 0.9)
try options.setResponseFormat(type: CityFact.self)

for try await response in conversation.generateResponse(
    message: user,
    generationOptions: options
) {
    // Handle structured output
}

LiquidInferenceEngineRunner exposes advanced utilities such as getPromptTokensSize(messages:addBosToken:) for applications that need to budget tokens ahead of time. These methods are backend-specific and may be elevated to the ModelRunner protocol in a future release.
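
A sketch of token budgeting, assuming the loaded runner can be downcast to LiquidInferenceEngineRunner and that getPromptTokensSize(messages:addBosToken:) is synchronous, may throw, and returns an integer count:

if let engineRunner = runner as? LiquidInferenceEngineRunner {
    // Assumed signature: synchronous, throwing, returning a token count.
    let promptTokens = try engineRunner.getPromptTokensSize(
        messages: conversation.history,
        addBosToken: true
    )
    print("Prompt would consume \(promptTokens) tokens")
}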

Function Calling Types

public struct LeapFunction {
    public let name: String
    public let description: String
    public let parameters: [LeapFunctionParameter]
}

public struct LeapFunctionParameter {
    public let name: String
    public let type: LeapFunctionParameterType
    public let description: String
    public let optional: Bool
}

public indirect enum LeapFunctionParameterType: Codable, Equatable {
    case string(StringType)
    case number(NumberType)
    case integer(IntegerType)
    case boolean(BooleanType)
    case array(ArrayType)
    case object(ObjectType)
    case null(NullType)
}

The parameter type wrappers (StringType, NumberType, etc.) let you attach descriptions and enumerations so that JSON schemas accurately describe your tools. See the Function Calling guide for in-depth usage patterns.
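
A hedged sketch of a tool definition, assuming memberwise-style initializers on these types and a no-argument StringType(); check the Function Calling guide for the exact wrapper initializers:

let weatherFunction = LeapFunction(
    name: "get_weather",
    description: "Look up the current weather for a city.",
    parameters: [
        LeapFunctionParameter(
            name: "city",
            type: .string(StringType()), // assumed initializer
            description: "City name, e.g. Kyoto",
            optional: false
        )
    ]
)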

Errors

Errors are surfaced as LeapError values. The most common cases are:

  • LeapError.modelLoadingFailure: Problems reading or validating the model bundle.
  • LeapError.generationFailure: Unexpected native inference errors.
  • LeapError.promptExceedContextLengthFailure: Prompt length exceeded the configured context size.
  • LeapError.serializationFailure: JSON encoding/decoding problems when working with chat history or function calls.

Handle thrown errors with do / catch when using async streams, or use the onErrorCallback in the lower-level API.
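
For example, with the async-stream helper (showAlert(_:) is a placeholder; the error cases are matched without binding any associated values):

do {
    for try await response in conversation.generateResponse(message: user) {
        process(response)
    }
} catch let error as LeapError {
    switch error {
    case .promptExceedContextLengthFailure:
        showAlert("Your message is too long for this model's context window.")
    case .modelLoadingFailure, .generationFailure, .serializationFailure:
        showAlert("Generation failed: \(error)")
    default:
        showAlert("Unexpected error: \(error)")
    }
} catch {
    print("Unexpected error: \(error)")
}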

Putting it together

let runner = try await Leap.load(url: bundleURL)
let conversation = runner.createConversation(systemPrompt: "You are a travel assistant.")

conversation.registerFunction(weatherFunction)

var options = GenerationOptions(temperature: 0.8)
try options.setResponseFormat(type: TripRecommendation.self)

let userMessage = ChatMessage(
    role: .user,
    content: [.text("Plan a 3-day trip to Kyoto with food highlights")]
)

for try await response in conversation.generateResponse(
    message: userMessage,
    generationOptions: options
) {
    process(response)
}

Refer to the Quick Start for end-to-end project setup, Function Calling for tool invocation, and Constrained Generation for structured outputs.