Documentation Index
Fetch the complete documentation index at: https://docs.liquid.ai/llms.txt
Use this file to discover all available pages before exploring further.
All functions documented on this page are safe to call from the main/UI thread; callbacks run on the main thread unless explicitly noted. The API surface is identical across iOS, macOS, Android, JVM, and Kotlin/Native; only the language and a handful of platform conventions differ.
ModelRunner
A ModelRunner represents a loaded model instance. Obtain one via:
- Android (recommended): `LeapModelDownloader.loadModel(...)` / `loadSimpleModel(...)`, a one-shot load that transparently routes through the optional Leap Model Service when installed and adds WorkManager-backed background download staging on top.
- iOS / macOS (recommended): `ModelDownloader.loadModel(...)` / `loadSimpleModel(...)`, a one-shot load that routes file transfers through `URLSession`. Pass `sessionConfiguration: .background(withIdentifier:)` for downloads that survive app suspension. (The class ships in the `LeapModelDownloader` SPM library product.)
- All platforms (iOS, Android, JVM, Linux native, Windows native, macOS Kotlin): `LeapDownloader.loadModel(...)` / `loadSimpleModel(...)`, the cross-platform manifest loader with no platform-native background integration. Used directly on JVM/native, and as the underlying loader inside both the iOS `ModelDownloader` and the Android `LeapModelDownloader`.

Call `unload()` when you are finished to release native resources. See Model Loading for the full reference.
- Swift (iOS / macOS)
- Kotlin (all platforms)
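A minimal Kotlin loading sketch. The `loadSimpleModel` argument, the overload shown, and the SDK import paths (omitted here) are assumptions; see Model Loading for the real signatures.

```kotlin
// Sketch only: the model slug and the loadSimpleModel signature are assumptions.
suspend fun runOnce() {
    val runner: ModelRunner = LeapDownloader.loadSimpleModel("LFM2-1.2B")
    try {
        val conversation = runner.createConversation("You are a helpful assistant.")
        // ... drive generation via the streaming API (see Streaming generation below) ...
    } finally {
        runner.unload()   // suspend on Kotlin; releases native memory
    }
}
```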
`getPromptTokensSize(messages:, addBosToken:)` returns the prompt token count for a hypothetical generation against `messages`, which is useful for context-budget checks before a request lands.
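For example, a pre-flight budget check might look like this (Kotlin argument spelling and return type assumed):

```kotlin
// Sketch: reject a request that would overflow the context window.
// The 4096 limit is a placeholder; use the loaded model's actual context size.
suspend fun fitsInContext(runner: ModelRunner, messages: List<ChatMessage>): Boolean {
    val promptTokens = runner.getPromptTokensSize(messages, addBosToken = true)
    return promptTokens < 4096
}
```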
Lifecycle
- Use `createConversation(systemPrompt:)` for a fresh chat, or `createConversationFromHistory(history:)` to resume from persisted state.
- Call `unload()` when you're done. On iOS this is `async`; on Kotlin it's a `suspend` function. Both release native memory.
- If the model runner is unloaded, any conversation it created becomes read-only.
Android lifecycle: If you need a model runner to survive activity destruction, wrap it in an Android Service. For most apps a ViewModel is sufficient: `viewModelScope` keeps the model alive across configuration changes, and the cleanup pattern below unloads it on destruction.
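A minimal sketch of that cleanup pattern, assuming the runner lives on the ViewModel and that `unload()` can be launched on any coroutine dispatcher (SDK import paths omitted):

```kotlin
import androidx.lifecycle.ViewModel
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.launch

class ChatViewModel : ViewModel() {
    // Loaded elsewhere, e.g. in an init { viewModelScope.launch { ... } } block.
    var modelRunner: ModelRunner? = null

    // viewModelScope is cancelled around onCleared(), so use a separate scope
    // for the final suspend unload() call (sketch; adapt to your setup).
    private val cleanupScope = CoroutineScope(SupervisorJob() + Dispatchers.Default)

    override fun onCleared() {
        super.onCleared()
        val runner = modelRunner
        modelRunner = null
        cleanupScope.launch { runner?.unload() }
    }
}
```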
Conversation
Conversation tracks chat state and exposes the streaming generation API. Instances are always created through a ModelRunner; don't construct one directly.
- Swift (iOS / macOS)
- Kotlin (all platforms)
- `appendToHistory(message)`: record a message without triggering generation. Useful for replaying persisted state, or for inserting tool-result messages (role: `.tool`) after handling a function call.
- `removeLastMessage()`: pop the trailing message. No-op on an empty history. Useful when a generation was cancelled and you want to drop the dangling user turn.
- `registerFunctions(functions)`: bulk-register tool definitions; equivalent to looping over `registerFunction(_:)`.
Properties
- `history`: a snapshot copy of the chat messages. Mutations don't affect generation. Once the stream emits `Complete`, `history` includes the final assistant reply.
- `isGenerating`: `true` while a generation is in flight. Starting a second generation while one is running is blocked.
- `functions` (Swift-only field, registered via `registerFunction` on both platforms): tool definitions the model may invoke.
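A short Kotlin sketch using these members together; the exact Kotlin spellings are assumed to mirror the names above.

```kotlin
// Sketch: member names follow the lists above; spellings are assumptions.
fun dropDanglingTurnAndSnapshot(conversation: Conversation): List<ChatMessage> {
    // Only mutate history when no generation is in flight.
    if (!conversation.isGenerating) {
        // e.g. after a cancelled generation, remove the unanswered user turn.
        conversation.removeLastMessage()
    }
    // history is a snapshot copy; persisting or mutating it never affects generation.
    return conversation.history
}
```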
Streaming generation
The async stream is the recommended way to drive generation; both platforms emit the same `MessageResponse` cases in the same order. Cancel the consuming task / coroutine to stop generation cleanly.
- Swift (iOS / macOS)
- Kotlin (all platforms)
`onEnum(of:)` (introduced in v0.10.0) gives exhaustive switching on Kotlin-bridged sealed types; the compiler errors if a new `MessageResponse` case is added.

Cancellation. Cancelling the Swift `Task` or the Kotlin coroutine `Job` stops generation and frees native resources. On both platforms cancellation is cooperative: the engine checks between tokens, so there's at most one extra token of slack after `cancel()`.
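A Kotlin sketch of both collecting the stream and cancelling it when a time budget runs out. The `generateResponse` entry point returning a `Flow<MessageResponse>` and the `Chunk` payload field `text` are assumptions about the Kotlin bindings:

```kotlin
import kotlinx.coroutines.flow.collect
import kotlinx.coroutines.withTimeoutOrNull

// Sketch: adapt the generateResponse call to your SDK version's actual signature.
suspend fun generateWithBudget(conversation: Conversation, prompt: String) {
    // Cancelling the collecting coroutine stops generation cooperatively;
    // withTimeoutOrNull does that when the time budget is exhausted.
    withTimeoutOrNull(10_000L) {
        conversation.generateResponse(prompt).collect { response ->
            when (response) {
                is MessageResponse.Chunk -> print(response.text)     // partial assistant text
                is MessageResponse.Complete -> println("\n[done]")    // final marker; history updated
                else -> Unit                                          // other cases: see MessageResponse below
            }
        }
    }
}
```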
Export chat history
Both platforms expose a serializer compatible with OpenAI's chat-completions message format. Useful for persistence, analytics, or replaying conversations through a cloud fallback.
- Swift (iOS / macOS)
- Kotlin (all platforms)
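Since this page doesn't name the serializer, the sketch below uses a hypothetical `exportToOpenAIMessages()` helper only to show where the call fits; substitute the real export function from the full reference.

```kotlin
import java.io.File

// Hypothetical helper name: exportToOpenAIMessages() stands in for the real call.
fun persistTranscript(conversation: Conversation, file: File) {
    val openAiStyleJson: String = conversation.exportToOpenAIMessages()
    file.writeText(openAiStyleJson)   // persist for analytics or a cloud fallback request
}
```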
MessageResponse
A sealed type with one case per kind of incremental output the engine emits.
- Swift (iOS / macOS)
- Kotlin (all platforms)
Use `onEnum(of:)` for exhaustive switching.
- `Chunk`: partial assistant text. Append to your UI buffer.
- `ReasoningChunk`: thinking-style tokens emitted by reasoning models (wrapped between `<think>` / `</think>` upstream). Only fires when `GenerationOptions.enableThinking = true` and the model supports it.
- `FunctionCalls`: one or more tool invocations the model wants you to execute. See Function Calling.
- `AudioSample`: float32 mono PCM frames from audio-capable checkpoints. The sample rate is constant for a generation; route the frames to a renderer.
- `Complete`: final marker. `fullMessage` is the assembled assistant `ChatMessage` (also present in `conversation.history`). `stats` holds token counts and `tokenPerSecond` (may be `null` on some backends).
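Putting the cases together, a hedged Kotlin handler might look like this; the payload property names and the assumption that these five are the only cases follow the list above, not a verified Kotlin signature.

```kotlin
// Sketch: one branch per documented case.
fun handle(response: MessageResponse, ui: StringBuilder) {
    when (response) {
        is MessageResponse.Chunk -> ui.append(response.text)      // visible assistant text
        is MessageResponse.ReasoningChunk -> Unit                  // optionally surface thinking tokens
        is MessageResponse.FunctionCalls -> Unit                   // execute tools, append results via appendToHistory(...)
        is MessageResponse.AudioSample -> Unit                     // route float32 mono PCM frames to a renderer
        is MessageResponse.Complete -> {
            // conversation.history now ends with response.fullMessage
            println("finish reason: ${response.finishReason}")
        }
    }
}
```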
GenerationFinishReason
Complete.finishReason is one of:
| Value | Meaning |
|---|---|
| STOP | The model emitted its EOS token; clean completion. |
| EXCEED_CONTEXT | The model hit the context-window limit before stopping. The reply may be truncated mid-sentence. |
| INTERRUPTED | Generation was cancelled by the caller (collector cancelled the flow / task). |
| CONSTRAINT | A constrained-generation constraint (e.g. a JSON schema) forced an early stop. |
| ERROR | An internal error occurred. The partial `fullMessage` is not appended to history; your error handler should run instead. |
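A sketch of branching on the finish reason inside the `Complete` case; the enum entry spellings follow the table above, and the Kotlin type name is assumed to match this page's heading.

```kotlin
// Sketch: react to how a generation ended.
fun onComplete(reason: GenerationFinishReason) {
    when (reason) {
        GenerationFinishReason.STOP -> Unit                // clean completion
        GenerationFinishReason.EXCEED_CONTEXT ->
            println("Reply may be truncated: context window exhausted")
        GenerationFinishReason.INTERRUPTED -> Unit         // caller cancelled; consider removeLastMessage()
        GenerationFinishReason.CONSTRAINT -> Unit          // schema forced an early stop
        GenerationFinishReason.ERROR ->
            println("Generation failed; partial reply was not added to history")
    }
}
```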
GenerationOptions
Tune sampling, structured output, tool-call parsing, and reasoning behavior per request. Leave any field as null to fall back to the model bundle's defaults.
- Swift (iOS / macOS)
- Kotlin (all platforms)
Builder-style helpers are available: `.with(temperature:)`, `.with(topP:)`, `.with(maxTokens:)`, etc.
- Sampling fields (`temperature`, `topP`, `minP`, `topK`, `repetitionPenalty`): standard sampling knobs. Use the values from the LEAP bundle manifest (`sampling_parameters` under `generation_time_parameters` in each model's `<Quant>.json` on LiquidAI/LeapBundles); they're tuned per checkpoint by the training team and differ from the HF model card defaults (the manifest values are tuned for the llama.cpp-engine path the SDK runs). Arbitrary "0.7" defaults from generic AI tutorials usually underperform.
- `rngSeed`: set for deterministic / reproducible output (testing, debugging). Default is non-deterministic.
- `maxTokens`: cap the response length. The model stops after this many completion tokens (prompt tokens don't count). Defaults to "until EOS or context limit." Useful for cost control with constrained output.
- `jsonSchemaConstraint`: a JSON Schema string for constrained generation. Prefer the higher-level `setResponseFormat(type:)` / `setResponseFormatType(...)` helpers with `@Generatable` types. See Constrained Generation.
- `injectSchemaIntoPrompt`: when `true` (default), the schema is appended to the system message for semantic guidance in addition to the structural constraint at decode time. Set `false` to skip the prompt injection (matches `llama-server` grammar mode) and save prompt tokens for large schemas.
- `functionCallParser`: picks the tool-call parser matching the format the model emits. `LFMFunctionCallParser` (default) for Liquid Foundation Models; `HermesFunctionCallParser()` for Hermes/Qwen3 formats; `null` to receive raw tool-call text in `Chunk`s.
- `enableThinking`: turn on reasoning mode for models that support it (e.g. LFM2.5-Thinking). Reasoning tokens arrive as `ReasoningChunk`s.
- `inlineThinkingTags`: when `true`, thinking tokens are emitted as ordinary `Chunk`s with the literal `<think>...</think>` tags intact (instead of `ReasoningChunk`). `ChatMessage.reasoningContent` is still populated on the final message.
- `extras`: backend-specific JSON payload (internal use).
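A hedged Kotlin sketch of per-request options; the constructor-style parameters mirror the field names above (Swift uses the `.with(...)` helpers), and the numeric values are placeholders to be replaced by the bundle-manifest values.

```kotlin
// Sketch: placeholder values; take sampling numbers from the LEAP bundle manifest.
val options = GenerationOptions(
    temperature = 0.3f,        // from the bundle manifest, not a generic default
    topP = 0.95f,
    maxTokens = 256,           // cap completion tokens for cost control
    rngSeed = 42L,             // reproducible output for tests
    enableThinking = true      // reasoning models only; tokens arrive as ReasoningChunk
)
```

Pass the options with the generation request; fields left unset stay `null` and fall back to the bundle defaults.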
GenerationStats
`cachedPromptTokens` is useful for observing KV-cache effectiveness: a high ratio of cached tokens to total prompt tokens means the prefix matched and you skipped the prefill compute for those tokens.
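For example, a rough hit-rate log (assuming `GenerationStats` also exposes a total prompt-token count, named `promptTokens` here purely for illustration):

```kotlin
// Sketch: log KV-cache effectiveness from the Complete case's stats.
// promptTokens is an assumed field name; cachedPromptTokens comes from this page.
fun logCacheHitRate(stats: GenerationStats) {
    val total = stats.promptTokens
    if (total > 0) {
        val hitRate = 100.0 * stats.cachedPromptTokens / total
        println("KV-cache reuse: ${"%.1f".format(hitRate)}% of $total prompt tokens")
    }
}
```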