API Spec
Latest version: v0.7.0.
Leap
Leap is the static entry point for loading on-device models.
public struct Leap {
public static func load(
url: URL,
options: LiquidInferenceEngineOptions? = nil
) async throws -> ModelRunner
}
load(url:options:)
- Loads a local model file (either a .bundle package or a .gguf checkpoint) and returns a ModelRunner instance.
- Throws LeapError.modelLoadingFailure if the file cannot be loaded.
- Automatically detects companion files placed alongside your model:
  - mmproj-*.gguf enables multimodal vision tokens for both bundle and GGUF flows.
  - Audio decoder artifacts whose filename contains "audio" and "decoder" with a .gguf or .bin extension unlock audio input/output for compatible checkpoints.
- Must be called from an async context (for example inside an async function or a Task). Keep the returned ModelRunner alive while you interact with the model.
// ExecuTorch backend via .bundle
let bundleURL = Bundle.main.url(forResource: "qwen3-0_6b", withExtension: "bundle")!
let runner = try await Leap.load(url: bundleURL)
// llama.cpp backend via .gguf
let ggufURL = Bundle.main.url(forResource: "qwen3-0_6b", withExtension: "gguf")!
let ggufRunner = try await Leap.load(url: ggufURL)
LiquidInferenceEngineOptions
Pass a LiquidInferenceEngineOptions value when you need to override the default runtime configuration.
public struct LiquidInferenceEngineOptions {
public var bundlePath: String
public let cacheOptions: LiquidCacheOptions?
public let cpuThreads: UInt32?
public let contextSize: UInt32?
public let nGpuLayers: UInt32?
public let mmProjPath: String?
public let audioDecoderPath: String?
public let chatTemplate: String?
public let audioTokenizerPath: String?
public let extras: String?
}
- bundlePath: Path to the model file on disk. When you call Leap.load(url:), this is filled automatically.
- cacheOptions: Configure persistence of KV-cache data between generations.
- cpuThreads: Number of CPU threads for token generation.
- contextSize: Override the default maximum context length for the model.
- nGpuLayers: Number of layers to offload to GPU (for macOS/macCatalyst targets with Metal support).
- mmProjPath: Optional path to an auxiliary multimodal projection model. Leave nil to auto-detect a sibling mmproj-*.gguf.
- audioDecoderPath: Optional audio decoder model. Leave nil to auto-detect nearby decoder artifacts.
- chatTemplate: Advanced override for backend chat templating.
- audioTokenizerPath: Optional tokenizer for audio-capable checkpoints.
- extras: Backend-specific configuration payload (advanced use only).
Backend selection is automatic: .bundle files run on the ExecuTorch backend, while .gguf
checkpoints use the embedded llama.cpp backend. Bundled models reference their projection data in
metadata; GGUF checkpoints look for sibling companion files (multimodal projection, audio decoder,
audio tokenizer) unless you override the paths through LiquidInferenceEngineOptions. Ensure
these artifacts are co-located when you want vision or audio features.
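If a companion file lives somewhere auto-detection cannot see, or uses a non-standard name, point to it explicitly. A minimal sketch, assuming hypothetical resource names and that the remaining initializer parameters have defaults:
let modelURL = Bundle.main.url(forResource: "my-vlm", withExtension: "gguf")!
// Hypothetical projector name that auto-detection (mmproj-*.gguf) would not find.
let projURL = Bundle.main.url(forResource: "projector-my-vlm", withExtension: "gguf")!
let visionOptions = LiquidInferenceEngineOptions(
    bundlePath: modelURL.path,
    mmProjPath: projURL.path
)
let visionRunner = try await Leap.load(url: modelURL, options: visionOptions)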
Example overriding the number of CPU threads and context size:
let options = LiquidInferenceEngineOptions(
bundlePath: bundleURL.path,
cpuThreads: 6,
contextSize: 8192
)
let runner = try await Leap.load(url: bundleURL, options: options)
ModelRunner
A ModelRunner represents a loaded model instance. The SDK returns concrete ModelRunner implementations, but your code only needs the protocol surface:
public protocol ModelRunner {
func createConversation(systemPrompt: String?) -> Conversation
func createConversationFromHistory(history: [ChatMessage]) -> Conversation
func generateResponse(
conversation: Conversation,
generationOptions: GenerationOptions?,
onResponseCallback: @escaping (MessageResponse) -> Void,
onErrorCallback: ((LeapError) -> Void)?
) -> GenerationHandler
func unload() async
var modelId: String { get }
}
Lifecycle
- Create conversations using createConversation(systemPrompt:) or createConversationFromHistory(history:).
- Hold a strong reference to the ModelRunner for as long as you need to perform generations.
- Call unload() when you are done to release native resources (optional, happens automatically on deinit).
- Access modelId to identify the loaded model (for analytics, debugging, or UI labels).
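A minimal lifecycle sketch tying these steps together:
let runner = try await Leap.load(url: bundleURL)
print("Loaded model:", runner.modelId)
// Keep a strong reference to `runner` while conversations are in use.
let conversation = runner.createConversation(systemPrompt: "You are a helpful assistant.")
// ... perform generations with `conversation` ...
// Optional: release native resources early instead of waiting for deinit.
await runner.unload()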
Low-level generation API
generateResponse(...) drives generation with callbacks and returns a GenerationHandler you can store to cancel the run. Most apps call the higher-level streaming helpers on Conversation, but you can invoke this method directly when you need fine-grained control (for example, integrating with custom async primitives).
let handler = runner.generateResponse(
conversation: conversation,
generationOptions: options,
onResponseCallback: { message in
// Handle MessageResponse values here
},
onErrorCallback: { error in
// Handle LeapError
}
)
// Stop generation early if needed
handler.stop()
GenerationHandler
public protocol GenerationHandler: Sendable {
func stop()
}
The handler returned by ModelRunner.generateResponse or Conversation.generateResponse(..., onResponse:) lets you cancel generation without tearing down the conversation.
Conversation
Conversation tracks chat state and provides streaming helpers built on top of the model runner.
public class Conversation {
public let modelRunner: ModelRunner
public private(set) var history: [ChatMessage]
public private(set) var functions: [LeapFunction]
public private(set) var isGenerating: Bool
public init(modelRunner: ModelRunner, history: [ChatMessage])
public func registerFunction(_ function: LeapFunction)
public func exportToJSON() throws -> [[String: Any]]
public func generateResponse(
userTextMessage: String,
generationOptions: GenerationOptions? = nil
) -> AsyncThrowingStream<MessageResponse, Error>
public func generateResponse(
message: ChatMessage,
generationOptions: GenerationOptions? = nil
) -> AsyncThrowingStream<MessageResponse, Error>
@discardableResult
public func generateResponse(
message: ChatMessage,
generationOptions: GenerationOptions? = nil,
onResponse: @escaping (MessageResponse) -> Void
) -> GenerationHandler?
}
Properties
- history: Copy of the accumulated chat messages. The SDK appends the assistant reply when a generation finishes successfully.
- functions: Functions registered via registerFunction(_:) for function calling.
- isGenerating: Boolean flag indicating whether a generation is currently running. Attempts to start a new generation while this is true immediately finish with an empty stream (or a nil handler for the callback variant).
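For example, a view model might consult these properties before kicking off another run (a small sketch):
if conversation.isGenerating {
    // A generation is already in flight; starting another would yield an empty stream.
} else {
    print("Messages so far:", conversation.history.count)
    print("Registered functions:", conversation.functions.map(\.name))
}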
Streaming Convenience
The most common pattern is to use the async-stream helpers:
let user = ChatMessage(role: .user, content: [.text("Hello! What can you do?")])
Task {
do {
for try await response in conversation.generateResponse(
message: user,
generationOptions: GenerationOptions(temperature: 0.7)
) {
switch response {
case .chunk(let delta):
print(delta, terminator: "")
case .reasoningChunk(let thought):
print("Reasoning:", thought)
case .functionCall(let calls):
handleFunctionCalls(calls)
case .audioSample(let samples, let sampleRate):
audioRenderer.enqueue(samples, sampleRate: sampleRate)
case .complete(let completion):
let text = completion.message.content.compactMap { item in
if case .text(let value) = item { return value }
return nil
}.joined()
print("\nComplete:", text)
if let stats = completion.stats {
print("Prompt tokens: \(stats.promptTokens), completions: \(stats.completionTokens)")
}
}
}
} catch {
print("Generation failed: \(error)")
}
}
Cancelling the task that iterates the stream stops generation and cleans up native resources.
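For example, keep a reference to the consuming Task and cancel it when the user leaves the screen (a minimal sketch reusing the message from above):
let streamingTask = Task {
    for try await response in conversation.generateResponse(message: user) {
        updateUI(with: response)
    }
}
// Later, e.g. when the view disappears:
streamingTask.cancel()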
Callback Convenience
Use generateResponse(message:onResponse:) when you prefer callbacks or need to integrate with imperative UI components:
let handler = conversation.generateResponse(message: user) { response in
updateUI(with: response)
}
// Later
handler?.stop()
If a generation is already running, the method returns nil and emits a .complete message with finishReason == .stop via the callback.
The callback overload does not surface generation errors. Use the async-stream helper or call
ModelRunner.generateResponse with onErrorCallback when you need error handling.
Export Chat History
exportToJSON() serializes the conversation history into a [[String: Any]] payload that mirrors OpenAI's chat-completions format. This is useful for persistence, analytics, or debugging tools.
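For example, to write the transcript to disk (a minimal sketch; the file location is illustrative):
let payload = try conversation.exportToJSON()
let data = try JSONSerialization.data(withJSONObject: payload, options: [.prettyPrinted])
let historyURL = FileManager.default.temporaryDirectory.appendingPathComponent("chat-history.json")
try data.write(to: historyURL)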
MessageResponse
public enum MessageResponse {
case chunk(String)
case reasoningChunk(String)
case audioSample(samples: [Float], sampleRate: Int)
case functionCall([LeapFunctionCall])
case complete(MessageCompletion)
}
public struct MessageCompletion {
public let message: ChatMessage
public let finishReason: GenerationFinishReason
public let stats: GenerationStats?
public var info: GenerationCompleteInfo { get }
}
public struct GenerationCompleteInfo {
public let finishReason: GenerationFinishReason
public let stats: GenerationStats?
}
public struct GenerationStats {
public var promptTokens: UInt64
public var completionTokens: UInt64
public var totalTokens: UInt64
public var tokenPerSecond: Float
}
- chunk: Partial assistant text emitted during streaming.
- reasoningChunk: Model reasoning tokens wrapped between <think>/</think> (only for models that expose reasoning traces).
- audioSample: PCM audio frames streamed from audio-capable checkpoints. Feed them into an audio renderer or buffer for later playback.
- functionCall: One or more function/tool invocations requested by the model. See the Function Calling guide.
- complete: Signals the end of generation. Access the assembled assistant reply through completion.message. Stats and finish reason live on the completion object; completion.info is provided for backward compatibility.
Errors surfaced during streaming are delivered through the thrown error of AsyncThrowingStream, or via the onErrorCallback closure when using the lower-level API.
Chat Messages
Roles
public enum ChatMessageRole: String {
case user
case system
case assistant
case tool
}
Include .tool messages when you append function-call results back into the conversation.
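For example, after running the tool the model asked for, send its output back as a .tool message (a sketch; the JSON result encoding here is an assumption, not a prescribed format):
// Hypothetical output from executing the requested tool.
let toolResult = #"{"temperature_c": 21, "condition": "sunny"}"#
let toolMessage = ChatMessage(role: .tool, content: [.text(toolResult)])
// Sending the .tool message resumes generation with the result in context.
for try await response in conversation.generateResponse(message: toolMessage) {
    // Handle the follow-up assistant response.
}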
Message Structure
public struct ChatMessage {
public var role: ChatMessageRole
public var content: [ChatMessageContent]
public var reasoningContent: String?
public var functionCalls: [LeapFunctionCall]?
public init(
role: ChatMessageRole,
content: [ChatMessageContent],
reasoningContent: String? = nil,
functionCalls: [LeapFunctionCall]? = nil
)
public init(from json: [String: Any]) throws
}
- content: Ordered fragments of the message. The SDK supports .text, .image, and .audio parts.
- reasoningContent: Optional text produced inside <think> tags by eligible models.
- functionCalls: Attach the calls returned by MessageResponse.functionCall when you include tool execution results in the history.
Message Content
public enum ChatMessageContent {
case text(String)
case image(Data) // JPEG bytes
case audio(Data) // WAV bytes
public init(from json: [String: Any]) throws
}
Provide JPEG-encoded bytes for .image and WAV data for .audio. Helper initializers such as ChatMessageContent.fromUIImage, ChatMessageContent.fromNSImage, ChatMessageContent.fromWAVData, and ChatMessageContent.fromFloatSamples(_:sampleRate:channelCount:) simplify interop with platform-native buffers. On the wire, image parts are encoded as OpenAI-style image_url payloads and audio parts as input_audio arrays with Base64 data.
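For example, a vision prompt can mix text and JPEG bytes (a minimal sketch using UIKit; on macOS use NSImage or the fromNSImage helper instead):
// JPEG-encode a UIImage and attach it alongside a text prompt.
let photo = UIImage(named: "receipt")!
let jpegBytes = photo.jpegData(compressionQuality: 0.8)!
let visionMessage = ChatMessage(
    role: .user,
    content: [
        .text("What is the total on this receipt?"),
        .image(jpegBytes)
    ]
)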
GenerationOptions
Tune generation behaviour with GenerationOptions.
public struct GenerationOptions {
public var temperature: Float?
public var topP: Float?
public var minP: Float?
public var repetitionPenalty: Float?
public var jsonSchemaConstraint: String?
public var functionCallParser: LeapFunctionCallParserProtocol?
public init(
temperature: Float? = nil,
topP: Float? = nil,
minP: Float? = nil,
repetitionPenalty: Float? = nil,
jsonSchemaConstraint: String? = nil,
functionCallParser: LeapFunctionCallParserProtocol? = LFMFunctionCallParser()
)
}
- Leave a field as nil to fall back to the defaults packaged with the model bundle.
- functionCallParser controls how tool-call tokens are parsed. LFMFunctionCallParser (the default) handles Liquid Foundation Model Pythonic function calling. Supply HermesFunctionCallParser() for Hermes/Qwen3 formats, or set the parser to nil to receive raw tool-call text in MessageResponse.chunk (see the parser sketch after the structured-output example below).
- jsonSchemaConstraint activates constrained generation. Use setResponseFormat(type:) to populate it from a type annotated with the @Generatable macro.
extension GenerationOptions {
public mutating func setResponseFormat<T: GeneratableType>(type: T.Type) throws {
self.jsonSchemaConstraint = try JSONSchemaGenerator.getJSONSchema(for: type)
}
}
var options = GenerationOptions(temperature: 0.6, topP: 0.9)
try options.setResponseFormat(type: CityFact.self)
for try await response in conversation.generateResponse(
message: user,
generationOptions: options
) {
// Handle structured output
}
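Switching the tool-call parser works the same way; a small sketch for a Hermes/Qwen3-format checkpoint:
// Parse Hermes/Qwen3-style tool-call tokens instead of the LFM default.
let hermesOptions = GenerationOptions(
    temperature: 0.7,
    functionCallParser: HermesFunctionCallParser()
)
// Or opt out of parsing entirely to receive raw tool-call text as .chunk values.
let rawOptions = GenerationOptions(functionCallParser: nil)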
LiquidInferenceEngineRunner exposes advanced utilities such as getPromptTokensSize(messages:addBosToken:) for applications that need to budget tokens ahead of time. These methods are backend-specific and may be elevated to the ModelRunner protocol in a future release.
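A hedged sketch of pre-flight token budgeting, assuming the loaded runner can be downcast to LiquidInferenceEngineRunner and that the method throws and returns an integer token count:
// Assumptions: the downcast succeeds for the current backend implementations, and
// getPromptTokensSize throws on failure and returns a token count.
if let engine = runner as? LiquidInferenceEngineRunner {
    let promptTokens = try engine.getPromptTokensSize(
        messages: conversation.history,
        addBosToken: true
    )
    print("Prompt would consume \(promptTokens) tokens")
}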
Function Calling Types
public struct LeapFunction {
public let name: String
public let description: String
public let parameters: [LeapFunctionParameter]
}
public struct LeapFunctionParameter {
public let name: String
public let type: LeapFunctionParameterType
public let description: String
public let optional: Bool
}
public indirect enum LeapFunctionParameterType: Codable, Equatable {
case string(StringType)
case number(NumberType)
case integer(IntegerType)
case boolean(BooleanType)
case array(ArrayType)
case object(ObjectType)
case null(NullType)
}
The parameter type wrappers (StringType, NumberType, etc.) let you attach descriptions and enumerations so that JSON schemas accurately describe your tools. See the Function Calling guide for in-depth usage patterns.
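The weatherFunction referenced in the final example could be declared roughly like this (a sketch; the memberwise initializers and StringType's empty initializer are assumptions):
// Sketch only: initializer shapes for the parameter-type wrappers are assumed.
let weatherFunction = LeapFunction(
    name: "get_weather",
    description: "Look up the current weather for a city.",
    parameters: [
        LeapFunctionParameter(
            name: "city",
            type: .string(StringType()),
            description: "City name, for example \"Kyoto\"",
            optional: false
        )
    ]
)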
Errors
Errors are surfaced as LeapError values. The most common cases are:
- LeapError.modelLoadingFailure: Problems reading or validating the model bundle.
- LeapError.generationFailure: Unexpected native inference errors.
- LeapError.promptExceedContextLengthFailure: Prompt length exceeded the configured context size.
- LeapError.serializationFailure: JSON encoding/decoding problems when working with chat history or function calls.
Handle thrown errors with do / catch when using async streams, or use the onErrorCallback in the lower-level API.
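A minimal sketch of handling a load failure:
do {
    let runner = try await Leap.load(url: bundleURL)
    print("Loaded", runner.modelId)
} catch let error as LeapError {
    // Match on specific cases such as .modelLoadingFailure as needed.
    print("Leap error: \(error)")
} catch {
    print("Unexpected error: \(error)")
}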
Putting it together
let runner = try await Leap.load(url: bundleURL)
let conversation = runner.createConversation(systemPrompt: "You are a travel assistant.")
conversation.registerFunction(weatherFunction)
var options = GenerationOptions(temperature: 0.8)
try options.setResponseFormat(type: TripRecommendation.self)
let userMessage = ChatMessage(
role: .user,
content: [.text("Plan a 3-day trip to Kyoto with food highlights")]
)
for try await response in conversation.generateResponse(
message: userMessage,
generationOptions: options
) {
process(response)
}
Refer to the Quick Start for end-to-end project setup, Function Calling for tool invocation, and Constrained Generation for structured outputs.