Modal

Modal is a serverless cloud platform for running AI/ML workloads with instant autoscaling on GPUs and CPUs.

This guide provides scripts for deploying Liquid AI models on Modal.

Clone the repository

git clone https://github.com/Liquid4All/lfm-inference

Option 1. Use the vLLM Docker image

You can use the vLLM Docker image vllm/vllm-openai to deploy LFM models.

Launch command:

cd modal

# deploy LFM2 8B MoE model
modal deploy deploy-vllm-docker.py

# deploy another LFM2 model; MODEL_NAME defaults to LiquidAI/LFM2-8B-A1B
MODEL_NAME=LiquidAI/<model-slug> modal deploy deploy-vllm-docker.py

See the full list of open-source LFM models on Hugging Face.

note

This is the recommended approach for production deployment.
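
For reference, such a script is essentially a thin Modal wrapper around the vLLM container. The following is a minimal sketch, assuming an H100 GPU, the latest image tag, and port 8000; deploy-vllm-docker.py in the repository is the source of truth.

# Minimal sketch of a Modal app serving vLLM from the vllm/vllm-openai Docker
# image. GPU type, image tag, port, and timeouts are assumptions; see
# deploy-vllm-docker.py in the lfm-inference repository for the real script.
import os
import modal

MODEL_NAME = os.environ.get("MODEL_NAME", "LiquidAI/LFM2-8B-A1B")

# Use the official vLLM OpenAI-compatible server image and clear its default
# entrypoint so that Modal controls container startup.
vllm_image = modal.Image.from_registry("vllm/vllm-openai:latest").entrypoint([])

app = modal.App("lfm-vllm-docker")

@app.function(image=vllm_image, gpu="H100", timeout=20 * 60)
@modal.concurrent(max_inputs=64)  # serve many requests per container
@modal.web_server(port=8000, startup_timeout=10 * 60)
def serve():
    import subprocess

    # Start the OpenAI-compatible vLLM server inside the container.
    subprocess.Popen(
        ["vllm", "serve", MODEL_NAME, "--host", "0.0.0.0", "--port", "8000"]
    )

Running modal deploy on a script like this prints the resulting HTTPS endpoint URL, which is the <modal-deployment-url> used in the test commands below.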

Option 2. Use vLLM PyPI package

Alternatively, you can use the vLLM PyPI package to deploy LFM. This approach is based on the Modal example for deploying an OpenAI-compatible LLM service with vLLM, with a few modifications.

Launch command:

cd modal

# deploy LFM2 8B MoE model
modal deploy deploy-vllm-pypi.py

# deploy another LFM2 model; MODEL_NAME defaults to LiquidAI/LFM2-8B-A1B
MODEL_NAME=LiquidAI/<model-slug> modal deploy deploy-vllm-pypi.py
Detailed modifications (a sketch of where they land follows the list):
  • Change the MODEL_NAME and MODEL_REVISION to the latest LFM model.
    • E.g. for LFM2-8B-A1B:
      • MODEL_NAME = "LiquidAI/LFM2-8B-A1B"
      • MODEL_REVISION = "6df6a75822a5779f7bf4a21e765cb77d0383935d"
  • Optionally, turn off FAST_BOOT.
  • Optionally, add these environment variables:
    • HF_XET_HIGH_PERFORMANCE=1,
    • VLLM_USE_V1=1,
    • VLLM_USE_FUSED_MOE_GROUPED_TOPK=0.
  • Optionally, add these launch arguments:
    • --dtype bfloat16
    • --gpu-memory-utilization 0.6
    • --max-model-len 32768
    • --max-num-seqs 600
    • --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'
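
As a rough illustration of where these modifications land, the snippet below follows the layout of the Modal vLLM example; the exact structure of deploy-vllm-pypi.py may differ, and the package choices are assumptions.

# Sketch of the modifications listed above, following the Modal vLLM example.
# Names and package versions are illustrative, not copied from deploy-vllm-pypi.py.
import modal

MODEL_NAME = "LiquidAI/LFM2-8B-A1B"
MODEL_REVISION = "6df6a75822a5779f7bf4a21e765cb77d0383935d"
FAST_BOOT = False  # optionally turn off FAST_BOOT

# Image built from the vLLM PyPI package, with the optional environment
# variables applied.
vllm_image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("vllm", "huggingface_hub[hf_transfer]")
    .env(
        {
            "HF_XET_HIGH_PERFORMANCE": "1",
            "VLLM_USE_V1": "1",
            "VLLM_USE_FUSED_MOE_GROUPED_TOPK": "0",
        }
    )
)

# Optional launch arguments appended to the `vllm serve` command line.
extra_args = [
    "--dtype", "bfloat16",
    "--gpu-memory-utilization", "0.6",
    "--max-model-len", "32768",
    "--max-num-seqs", "600",
    "--compilation-config", '{"cudagraph_mode": "FULL_AND_PIECEWISE"}',
]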

Production deployment

  • Prefer the deploy-vllm-docker.py script.
  • vLLM takes over 2 minutes to cold start. If you run the inference server in production, keep a minimum number of warm instances by setting min_containers = 1 and buffer_containers = 1 (see the sketch after this list). The buffer_containers setting is necessary because all Modal GPUs are subject to preemption. See the Modal docs for details on cold start performance tuning.
  • Warm up the vLLM server after deployment by sending a single request. This warm-up step is already included in the deploy-vllm-docker.py script.
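
Both warm-instance settings are parameters of the Modal function decorator. A minimal sketch, with the GPU type and function body as placeholders:

# Sketch of keeping warm capacity for production traffic; the GPU type and
# function body are placeholders, not the actual deployment script.
import modal

app = modal.App("lfm-vllm-docker")

@app.function(
    gpu="H100",           # assumed GPU type
    min_containers=1,     # always keep one warm container running
    buffer_containers=1,  # keep a spare container to absorb preemptions and bursts
)
@modal.web_server(port=8000, startup_timeout=10 * 60)
def serve():
    ...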

Test commands

Test the deployed server with the following curl commands (replace <modal-deployment-url> with your actual deployment URL):

# List the deployed models
curl https://<modal-deployment-url>/v1/models

# Query the deployed LFM model
curl https://<modal-deployment-url>/v1/chat/completions \
  --json '{
    "model": "LiquidAI/LFM2-8B-A1B",
    "messages": [
      {
        "role": "user",
        "content": "What is the melting temperature of silver?"
      }
    ],
    "max_tokens": 32,
    "temperature": 0
  }'
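
Since the server exposes the standard OpenAI-compatible API, you can also query it from Python with the openai client. A minimal sketch; the placeholder URL and the API key handling are assumptions:

# Minimal sketch of querying the deployed server with the openai client.
# Replace <modal-deployment-url>; the api_key value is a placeholder and only
# matters if your deployment enforces authentication.
from openai import OpenAI

client = OpenAI(
    base_url="https://<modal-deployment-url>/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="LiquidAI/LFM2-8B-A1B",
    messages=[
        {"role": "user", "content": "What is the melting temperature of silver?"}
    ],
    max_tokens=32,
    temperature=0,
)
print(response.choices[0].message.content)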