Modal
Modal is a serverless cloud platform for running AI/ML workloads with instant autoscaling on GPUs and CPUs.
This guide provides scripts for deploying Liquid AI models on Modal.
Clone the repository
git clone https://github.com/Liquid4All/lfm-inference
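If you have not used Modal before, you will also need the Modal CLI, which is typically installed with pip install modal and authenticated with modal setup.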
Option 1. Use vLLM docker image
You can use the vLLM docker image vllm/vllm-openai to deploy LFM.
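Under the hood, the deploy script wraps this image in a Modal app with a web server. The sketch below shows the rough shape of such an app; the image tag, GPU type, timeouts, and function names are placeholders, and the actual deploy-vllm-docker.py script may differ:

# Minimal sketch (not the actual script): serve an LFM model on Modal from the
# vllm/vllm-openai docker image. Image tag and GPU type are placeholders.
import subprocess
import modal

MODEL_NAME = "LiquidAI/LFM2-8B-A1B"

image = (
    modal.Image.from_registry("vllm/vllm-openai:latest")  # pin a specific version tag in practice
    .entrypoint([])  # clear the image entrypoint so Modal controls startup
)

app = modal.App("lfm-vllm-docker", image=image)

@app.function(gpu="H100", timeout=20 * 60)
@modal.web_server(port=8000, startup_timeout=10 * 60)
def serve():
    # Start the OpenAI-compatible vLLM server; Modal exposes port 8000 as a public URL.
    subprocess.Popen(["vllm", "serve", MODEL_NAME, "--port", "8000"])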
Launch command:
cd modal
# deploy LFM2 8B MoE model
modal deploy deploy-vllm-docker.py
# deploy another LFM2 model, MODEL_NAME defaults to LiquidAI/LFM2-8B-A1B
MODEL_NAME=LiquidAI/<model-slug> modal deploy deploy-vllm-docker.py
See the full list of open-source LFM models on Hugging Face.
Note: This is the recommended approach for production deployment.
Option 2. Use vLLM PyPI package
Alternatively, you can use the vLLM PyPI package to deploy LFM. This approach is based on Modal's example for deploying an OpenAI-compatible LLM service with vLLM, with a few modifications.
Launch command:
cd modal
# deploy LFM2 8B MoE model
modal deploy deploy-vllm-pypi.py
# deploy any LFM2 model, MODEL_NAME defaults to LiquidAI/LFM2-8B-A1B
MODEL_NAME=LiquidAI/<model-slug> modal deploy deploy-vllm-pypi.py
Detailed modifications:
- Change MODEL_NAME and MODEL_REVISION to the latest LFM model. E.g. for LFM2-8B-A1B:
  MODEL_NAME = "LiquidAI/LFM2-8B-A1B"
  MODEL_REVISION = "6df6a75822a5779f7bf4a21e765cb77d0383935d"
- Optionally, turn off FAST_BOOT.
- Optionally, add these environment variables: HF_XET_HIGH_PERFORMANCE=1, VLLM_USE_V1=1, VLLM_USE_FUSED_MOE_GROUPED_TOPK=0.
- Optionally, add these launch arguments (see the sketch after this list):
  --dtype bfloat16
  --gpu-memory-utilization 0.6
  --max-model-len 32768
  --max-num-seqs 600
  --compilation-config '{\"cudagraph_mode\": \"FULL_AND_PIECEWISE\"}'
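As a concrete illustration, these settings might appear in deploy-vllm-pypi.py roughly as follows. The sketch assumes the structure of Modal's vLLM example; the variable names, vLLM pin, and image layout are assumptions rather than the script's actual contents:

# Sketch only: where the modifications above could live in deploy-vllm-pypi.py.
import modal

MODEL_NAME = "LiquidAI/LFM2-8B-A1B"
MODEL_REVISION = "6df6a75822a5779f7bf4a21e765cb77d0383935d"
FAST_BOOT = False  # optionally turned off, as noted above

vllm_image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("vllm")  # pin the vLLM version you have validated
    .env(
        {
            "HF_XET_HIGH_PERFORMANCE": "1",
            "VLLM_USE_V1": "1",
            "VLLM_USE_FUSED_MOE_GROUPED_TOPK": "0",
        }
    )
)

# Optional launch arguments appended to the vllm serve command:
extra_args = [
    "--dtype", "bfloat16",
    "--gpu-memory-utilization", "0.6",
    "--max-model-len", "32768",
    "--max-num-seqs", "600",
    "--compilation-config", '{"cudagraph_mode": "FULL_AND_PIECEWISE"}',
]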
Production deployment
- Prefer the deploy-vllm-docker.py script.
- vLLM takes over 2 minutes to cold start, so if you run the inference server in production, keep a minimum number of warm instances with min_containers = 1 and buffer_containers = 1 (see the sketch after this list). The buffer_containers setting is necessary because all Modal GPUs are subject to preemption. See the Modal docs for details about cold start performance tuning.
- Warm up the vLLM server after deployment by sending a single request. The deploy-vllm-docker.py script already includes this warm-up step.
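For reference, here is a hedged sketch of where the warm-instance settings go on the Modal function; the parameter names follow Modal's API, while the GPU type, app name, and function body are placeholders:

import modal

app = modal.App("lfm-vllm-docker")

@app.function(
    gpu="H100",           # placeholder GPU type
    min_containers=1,     # keep one warm instance to avoid the ~2 minute vLLM cold start
    buffer_containers=1,  # keep a spare instance because Modal GPUs can be preempted
)
@modal.web_server(port=8000, startup_timeout=10 * 60)
def serve():
    ...  # launch vLLM as in the docker-image sketch above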
Test commands
Test the deployed server with the following curl commands (replace <modal-deployment-url> with your actual deployment URL):
# List the deployed models
curl https://<modal-deployment-url>/v1/models
# Query the deployed LFM model
curl https://<modal-deployment-url>/v1/chat/completions \
--json '{
"model": "LiquidAI/LFM2-8B-A1B",
"messages": [
{
"role": "user",
"content": "What is the melting temperature of silver?"
}
],
"max_tokens": 32,
"temperature": 0
}'
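The same endpoint can also be queried from Python with the OpenAI client. The snippet below is a sketch that assumes no API key is enforced on the deployment; pass your key instead if you configured one:

# Query the deployed LFM model through the OpenAI-compatible API.
# Replace <modal-deployment-url> with your actual deployment URL.
from openai import OpenAI

client = OpenAI(
    base_url="https://<modal-deployment-url>/v1",
    api_key="EMPTY",  # assumption: the endpoint does not require an API key
)

response = client.chat.completions.create(
    model="LiquidAI/LFM2-8B-A1B",
    messages=[{"role": "user", "content": "What is the melting temperature of silver?"}],
    max_tokens=32,
    temperature=0,
)
print(response.choices[0].message.content)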