API Spec

Latest version: v0.7.0.

Leap

Leap is the static entry point for loading on-device models.

public struct Leap {
    public static func load(
        url: URL,
        options: LiquidInferenceEngineOptions? = nil
    ) async throws -> ModelRunner
}

load(url:options:)

  • Loads a local model file (either a .bundle package or a .gguf checkpoint) and returns a ModelRunner instance.
  • Throws LeapError.modelLoadingFailure if the file cannot be loaded.
  • Automatically detects companion files placed alongside your model:
      • mmproj-*.gguf enables multimodal vision tokens for both bundle and GGUF flows.
      • Audio decoder artifacts whose filename contains "audio" and "decoder" with a .gguf or .bin extension unlock audio input/output for compatible checkpoints.
  • Must be called from an async context (for example inside an async function or a Task). Keep the returned ModelRunner alive while you interact with the model.
// ExecuTorch backend via .bundle
let bundleURL = Bundle.main.url(forResource: "qwen3-0_6b", withExtension: "bundle")!
let runner = try await Leap.load(url: bundleURL)

// llama.cpp backend via .gguf
let ggufURL = Bundle.main.url(forResource: "qwen3-0_6b", withExtension: "gguf")!
let ggufRunner = try await Leap.load(url: ggufURL)

LiquidInferenceEngineOptions

Pass a LiquidInferenceEngineOptions value when you need to override the default runtime configuration.

public struct LiquidInferenceEngineOptions {
    public var bundlePath: String
    public let cacheOptions: LiquidCacheOptions?
    public let cpuThreads: UInt32?
    public let contextSize: UInt32?
    public let nGpuLayers: UInt32?
    public let mmProjPath: String?
    public let audioDecoderPath: String?
    public let chatTemplate: String?
    public let audioTokenizerPath: String?
    public let extras: String?
}
  • bundlePath: Path to the model file on disk. When you call Leap.load(url:), this is filled automatically.
  • cacheOptions: Configure persistence of KV-cache data between generations.
  • cpuThreads: Number of CPU threads for token generation.
  • contextSize: Override the default maximum context length for the model.
  • nGpuLayers: Number of layers to offload to GPU (for macOS/macCatalyst targets with Metal support).
  • mmProjPath: Optional path to an auxiliary multimodal projection model. Leave nil to auto-detect a sibling mmproj-*.gguf.
  • audioDecoderPath: Optional audio decoder model. Leave nil to auto-detect nearby decoder artifacts.
  • chatTemplate: Advanced override for backend chat templating.
  • audioTokenizerPath: Optional tokenizer for audio-capable checkpoints.
  • extras: Backend-specific configuration payload (advanced use only).
info

Backend selection is automatic: .bundle files run on the ExecuTorch backend, while .gguf checkpoints use the embedded llama.cpp backend. Bundled models reference their projection data in metadata; GGUF checkpoints look for sibling companion files (multimodal projection, audio decoder, audio tokenizer) unless you override the paths through LiquidInferenceEngineOptions. Ensure these artifacts are co-located when you want vision or audio features.

Example overriding the number of CPU threads and context size:

let options = LiquidInferenceEngineOptions(
    bundlePath: bundleURL.path,
    cpuThreads: 6,
    contextSize: 8192
)
let runner = try await Leap.load(url: bundleURL, options: options)
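
You can also point the runtime at companion artifacts explicitly instead of relying on sibling auto-detection. A minimal sketch, assuming the initializer exposes mmProjPath as an optional parameter with a default (as it does for cpuThreads and contextSize above) and that the resource names below match artifacts you actually ship:

let visionModelURL = Bundle.main.url(forResource: "my-vision-model", withExtension: "gguf")!
let projectionURL = Bundle.main.url(forResource: "mmproj-my-vision-model", withExtension: "gguf")!

let visionOptions = LiquidInferenceEngineOptions(
    bundlePath: visionModelURL.path,
    mmProjPath: projectionURL.path // assumed initializer label; leave nil to auto-detect a sibling mmproj-*.gguf
)
let visionRunner = try await Leap.load(url: visionModelURL, options: visionOptions)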

ModelRunner

A ModelRunner represents a loaded model instance. The SDK returns concrete ModelRunner implementations, but your code only needs the protocol surface:

public protocol ModelRunner {
    func createConversation(systemPrompt: String?) -> Conversation
    func createConversationFromHistory(history: [ChatMessage]) -> Conversation
    func generateResponse(
        conversation: Conversation,
        generationOptions: GenerationOptions?,
        onResponseCallback: @escaping (MessageResponse) -> Void,
        onErrorCallback: ((LeapError) -> Void)?
    ) -> GenerationHandler
    func unload() async
    var modelId: String { get }
}

Lifecycle

  • Create conversations using createConversation(systemPrompt:) or createConversationFromHistory(history:).
  • Hold a strong reference to the ModelRunner for as long as you need to perform generations.
  • Call unload() when you are done to release native resources (optional, happens automatically on deinit).
  • Access modelId to identify the loaded model (for analytics, debugging, or UI labels). See the sketch after this list for these steps put together.
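
A compact sketch of that lifecycle, from load to unload (generation calls omitted; bundleURL is resolved as in the earlier examples):

// Keep the runner alive for as long as the chat session lives.
let runner = try await Leap.load(url: bundleURL)
print("Loaded model:", runner.modelId)

let conversation = runner.createConversation(systemPrompt: "You are a helpful assistant.")

// ... perform generations through `conversation` ...

// Optionally release native resources explicitly when the session ends.
await runner.unload()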

Low-level generation API

generateResponse(...) drives generation with callbacks and returns a GenerationHandler you can store to cancel the run. Most apps call the higher-level streaming helpers on Conversation, but you can invoke this method directly when you need fine-grained control (for example, integrating with custom async primitives).

let handler = runner.generateResponse(
    conversation: conversation,
    generationOptions: options,
    onResponseCallback: { message in
        // Handle MessageResponse values here
    },
    onErrorCallback: { error in
        // Handle LeapError
    }
)

// Stop generation early if needed
handler.stop()

GenerationHandler

public protocol GenerationHandler: Sendable {
    func stop()
}

The handler returned by ModelRunner.generateResponse or Conversation.generateResponse(..., onResponse:) lets you cancel generation without tearing down the conversation.

Conversation

Conversation tracks chat state and provides streaming helpers built on top of the model runner.

public class Conversation {
    public let modelRunner: ModelRunner
    public private(set) var history: [ChatMessage]
    public private(set) var functions: [LeapFunction]
    public private(set) var isGenerating: Bool

    public init(modelRunner: ModelRunner, history: [ChatMessage])

    public func registerFunction(_ function: LeapFunction)
    public func exportToJSON() throws -> [[String: Any]]

    public func generateResponse(
        userTextMessage: String,
        generationOptions: GenerationOptions? = nil
    ) -> AsyncThrowingStream<MessageResponse, Error>

    public func generateResponse(
        message: ChatMessage,
        generationOptions: GenerationOptions? = nil
    ) -> AsyncThrowingStream<MessageResponse, Error>

    @discardableResult
    public func generateResponse(
        message: ChatMessage,
        generationOptions: GenerationOptions? = nil,
        onResponse: @escaping (MessageResponse) -> Void
    ) -> GenerationHandler?
}

Properties

  • history: Copy of the accumulated chat messages. The SDK appends the assistant reply when a generation finishes successfully.
  • functions: Functions registered via registerFunction(_:) for function calling.
  • isGenerating: Boolean flag indicating whether a generation is currently running. Attempts to start a new generation while this is true immediately finish with an empty stream (or nil handler for the callback variant). The sketch after this list shows one way to guard against that.
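
For example, a minimal guard in your own send helper (send(_:) and render(_:) are placeholder names, not SDK API):

func send(_ text: String) {
    // Ignore the request while a previous reply is still streaming.
    guard !conversation.isGenerating else { return }

    let message = ChatMessage(role: .user, content: [.text(text)])
    Task {
        do {
            for try await response in conversation.generateResponse(message: message) {
                render(response)
            }
        } catch {
            print("Generation failed: \(error)")
        }
    }
}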

Streaming Convenience

The most common pattern is to use the async-stream helpers:

let user = ChatMessage(role: .user, content: [.text("Hello! What can you do?")])

Task {
    do {
        for try await response in conversation.generateResponse(
            message: user,
            generationOptions: GenerationOptions(temperature: 0.7)
        ) {
            switch response {
            case .chunk(let delta):
                print(delta, terminator: "")
            case .reasoningChunk(let thought):
                print("Reasoning:", thought)
            case .functionCall(let calls):
                handleFunctionCalls(calls)
            case .audioSample(let samples, let sampleRate):
                audioRenderer.enqueue(samples, sampleRate: sampleRate)
            case .complete(let completion):
                let text = completion.message.content.compactMap { item -> String? in
                    if case .text(let value) = item { return value }
                    return nil
                }.joined()
                print("\nComplete:", text)
                if let stats = completion.stats {
                    print("Prompt tokens: \(stats.promptTokens), completion tokens: \(stats.completionTokens)")
                }
            }
        }
    } catch {
        print("Generation failed: \(error)")
    }
}

Cancelling the task that iterates the stream stops generation and cleans up native resources.
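
For example, keep a handle to the Task that consumes the stream and cancel it from a stop control (streamingTask, user, and updateUI(with:) are placeholders from your own code):

let streamingTask = Task {
    for try await response in conversation.generateResponse(message: user) {
        updateUI(with: response)
    }
}

// Later, when the user taps "Stop":
streamingTask.cancel()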

Callback Convenience

Use generateResponse(message:onResponse:) when you prefer callbacks or need to integrate with imperative UI components:

let handler = conversation.generateResponse(message: user) { response in
    updateUI(with: response)
}

// Later
handler?.stop()

If a generation is already running, the method returns nil and emits a .complete message with finishReason == .stop via the callback.

warning

The callback overload does not surface generation errors. Use the async-stream helper or call ModelRunner.generateResponse with onErrorCallback when you need error handling.

Export Chat History

exportToJSON() serializes the conversation history into a [[String: Any]] payload that mirrors OpenAI's chat-completions format. This is useful for persistence, analytics, or debugging tools.
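
A small persistence sketch, assuming you simply write the payload to the app's documents directory (the file name is arbitrary):

let payload = try conversation.exportToJSON()
let data = try JSONSerialization.data(withJSONObject: payload, options: [.prettyPrinted])

let historyURL = FileManager.default
    .urls(for: .documentDirectory, in: .userDomainMask)[0]
    .appendingPathComponent("chat-history.json")
try data.write(to: historyURL)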

MessageResponse

public enum MessageResponse {
    case chunk(String)
    case reasoningChunk(String)
    case audioSample(samples: [Float], sampleRate: Int)
    case functionCall([LeapFunctionCall])
    case complete(MessageCompletion)
}

public struct MessageCompletion {
    public let message: ChatMessage
    public let finishReason: GenerationFinishReason
    public let stats: GenerationStats?

    public var info: GenerationCompleteInfo { get }
}

public struct GenerationCompleteInfo {
    public let finishReason: GenerationFinishReason
    public let stats: GenerationStats?
}

public struct GenerationStats {
    public var promptTokens: UInt64
    public var completionTokens: UInt64
    public var totalTokens: UInt64
    public var tokenPerSecond: Float
}
  • chunk: Partial assistant text emitted during streaming.
  • reasoningChunk: Model reasoning tokens wrapped between <think> / </think> (only for models that expose reasoning traces).
  • audioSample: PCM audio frames streamed from audio-capable checkpoints. Feed them into an audio renderer or buffer for later playback.
  • functionCall: One or more function/tool invocations requested by the model. See the Function Calling guide.
  • complete: Signals the end of generation. Access the assembled assistant reply through completion.message. Stats and finish reason live on the completion object; completion.info is provided for backward compatibility.

Errors surfaced during streaming are delivered through the thrown error of AsyncThrowingStream, or via the onErrorCallback closure when using the lower-level API.

Chat Messages

Roles

public enum ChatMessageRole: String {
    case user
    case system
    case assistant
    case tool
}

Include .tool messages when you append function-call results back into the conversation.
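
A hedged sketch of feeding a tool result back into the exchange, assuming your tool produces a JSON string and that the .tool message is sent like any other; see the Function Calling guide for the authoritative flow:

// Payload produced by your own tool implementation (hypothetical).
let toolResult = #"{"temperature_c": 21, "condition": "sunny"}"#

let toolMessage = ChatMessage(role: .tool, content: [.text(toolResult)])

for try await response in conversation.generateResponse(message: toolMessage) {
    process(response)
}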

Message Structure

public struct ChatMessage {
    public var role: ChatMessageRole
    public var content: [ChatMessageContent]
    public var reasoningContent: String?
    public var functionCalls: [LeapFunctionCall]?

    public init(
        role: ChatMessageRole,
        content: [ChatMessageContent],
        reasoningContent: String? = nil,
        functionCalls: [LeapFunctionCall]? = nil
    )

    public init(from json: [String: Any]) throws
}
  • content: Ordered fragments of the message. The SDK supports .text, .image, and .audio parts.
  • reasoningContent: Optional text produced inside <think> tags by eligible models.
  • functionCalls: Attach the calls returned by MessageResponse.functionCall when you include tool execution results in the history.

Message Content

public enum ChatMessageContent {
    case text(String)
    case image(Data) // JPEG bytes
    case audio(Data) // WAV bytes

    public init(from json: [String: Any]) throws
}

Provide JPEG-encoded bytes for .image and WAV data for .audio. Helper initializers such as ChatMessageContent.fromUIImage, ChatMessageContent.fromNSImage, ChatMessageContent.fromWAVData, and ChatMessageContent.fromFloatSamples(_:sampleRate:channelCount:) simplify interop with platform-native buffers. On the wire, image parts are encoded as OpenAI-style image_url payloads and audio parts as input_audio arrays with Base64 data.
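
A minimal sketch of a mixed text-and-image message using the raw .image case (photoURL is your own file URL; the helper initializers above are an alternative, but their exact signatures are not shown here):

let jpegData = try Data(contentsOf: photoURL) // JPEG bytes
let visionMessage = ChatMessage(
    role: .user,
    content: [
        .text("What landmark is shown in this photo?"),
        .image(jpegData)
    ]
)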

GenerationOptions

Tune generation behaviour with GenerationOptions.

public struct GenerationOptions {
    public var temperature: Float?
    public var topP: Float?
    public var minP: Float?
    public var repetitionPenalty: Float?
    public var jsonSchemaConstraint: String?
    public var functionCallParser: LeapFunctionCallParserProtocol?

    public init(
        temperature: Float? = nil,
        topP: Float? = nil,
        minP: Float? = nil,
        repetitionPenalty: Float? = nil,
        jsonSchemaConstraint: String? = nil,
        functionCallParser: LeapFunctionCallParserProtocol? = LFMFunctionCallParser()
    )
}
  • Leave a field as nil to fall back to the defaults packaged with the model bundle.
  • functionCallParser controls how tool-call tokens are parsed. LFMFunctionCallParser (the default) handles Liquid Foundation Model Pythonic function calling. Supply HermesFunctionCallParser() for Hermes/Qwen3 formats, or set the parser to nil to receive raw tool-call text in MessageResponse.chunk.
  • jsonSchemaConstraint activates constrained generation. Use setResponseFormat(type:) to populate it from a type annotated with the @Generatable macro.

extension GenerationOptions {
    public mutating func setResponseFormat<T: GeneratableType>(type: T.Type) throws {
        self.jsonSchemaConstraint = try JSONSchemaGenerator.getJSONSchema(for: type)
    }
}

var options = GenerationOptions(temperature: 0.6, topP: 0.9)
try options.setResponseFormat(type: CityFact.self)

for try await response in conversation.generateResponse(
    message: user,
    generationOptions: options
) {
    // Handle structured output
}

LiquidInferenceEngineRunner exposes advanced utilities such as getPromptTokensSize(messages:addBosToken:) for applications that need to budget tokens ahead of time. These methods are backend-specific and may be elevated to the ModelRunner protocol in a future release.
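
A sketch of token budgeting, assuming the loaded runner can be downcast to LiquidInferenceEngineRunner and that getPromptTokensSize(messages:addBosToken:) is synchronous, may throw, and returns an integer count:

if let engineRunner = runner as? LiquidInferenceEngineRunner {
    // Assumed signature: synchronous, throwing, returning a token count.
    let promptTokens = try engineRunner.getPromptTokensSize(
        messages: conversation.history,
        addBosToken: true
    )
    print("Prompt would consume \(promptTokens) tokens")
}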

Function Calling Types

public struct LeapFunction {
    public let name: String
    public let description: String
    public let parameters: [LeapFunctionParameter]
}

public struct LeapFunctionParameter {
    public let name: String
    public let type: LeapFunctionParameterType
    public let description: String
    public let optional: Bool
}

public indirect enum LeapFunctionParameterType: Codable, Equatable {
    case string(StringType)
    case number(NumberType)
    case integer(IntegerType)
    case boolean(BooleanType)
    case array(ArrayType)
    case object(ObjectType)
    case null(NullType)
}

The parameter type wrappers (StringType, NumberType, etc.) let you attach descriptions and enumerations so that JSON schemas accurately describe your tools. See the Function Calling guide for in-depth usage patterns.
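
A hedged sketch of a tool definition, assuming memberwise-style initializers on these types and a no-argument StringType(); check the Function Calling guide for the exact wrapper initializers:

let weatherFunction = LeapFunction(
    name: "get_weather",
    description: "Look up the current weather for a city.",
    parameters: [
        LeapFunctionParameter(
            name: "city",
            type: .string(StringType()), // assumed initializer
            description: "City name, e.g. Kyoto",
            optional: false
        )
    ]
)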

Errors

Errors are surfaced as LeapError values. The most common cases are:

  • LeapError.modelLoadingFailure: Problems reading or validating the model bundle.
  • LeapError.generationFailure: Unexpected native inference errors.
  • LeapError.promptExceedContextLengthFailure: Prompt length exceeded the configured context size.
  • LeapError.serializationFailure: JSON encoding/decoding problems when working with chat history or function calls.

Handle thrown errors with do / catch when using async streams, or use the onErrorCallback in the lower-level API.
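
For example, with the async-stream helper (showAlert(_:) is a placeholder; the error cases are matched without binding any associated values):

do {
    for try await response in conversation.generateResponse(message: user) {
        process(response)
    }
} catch let error as LeapError {
    switch error {
    case .promptExceedContextLengthFailure:
        showAlert("Your message is too long for this model's context window.")
    case .modelLoadingFailure, .generationFailure, .serializationFailure:
        showAlert("Generation failed: \(error)")
    default:
        showAlert("Unexpected error: \(error)")
    }
} catch {
    print("Unexpected error: \(error)")
}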

Putting it together

let runner = try await Leap.load(url: bundleURL)
let conversation = runner.createConversation(systemPrompt: "You are a travel assistant.")

conversation.registerFunction(weatherFunction)

var options = GenerationOptions(temperature: 0.8)
try options.setResponseFormat(type: TripRecommendation.self)

let userMessage = ChatMessage(
    role: .user,
    content: [.text("Plan a 3-day trip to Kyoto with food highlights")]
)

for try await response in conversation.generateResponse(
    message: userMessage,
    generationOptions: options
) {
    process(response)
}

Refer to the Quick Start for end-to-end project setup, Function Calling for tool invocation, and Constrained Generation for structured outputs.