Modal

Modal is a serverless cloud platform for running AI/ML workloads with instant autoscaling on GPUs and CPUs.

This guide provides scripts for deploying Liquid AI models on Modal.

Clone the repository

git clone https://github.com/Liquid4All/lfm-inference

Option 1. Use the vLLM Docker image

You can use the vLLM Docker image vllm/vllm-openai to deploy LFM models.

Launch command:

cd modal

# deploy LFM2 8B MoE model
modal deploy deploy-vllm-docker.py

# deploy another LFM2 model; MODEL_NAME defaults to LiquidAI/LFM2-8B-A1B
MODEL_NAME=LiquidAI/<model-slug> modal deploy deploy-vllm-docker.py

See the full list of open-source LFM models on Hugging Face.

note

This is the recommended approach for production deployment.
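
For reference, such a script is essentially a thin Modal wrapper around the vLLM container. The following is a minimal sketch, assuming an H100 GPU, the latest image tag, and port 8000; deploy-vllm-docker.py in the repository is the source of truth.

# Minimal sketch of a Modal app serving vLLM from the vllm/vllm-openai Docker
# image. GPU type, image tag, port, and timeouts are assumptions; see
# deploy-vllm-docker.py in the lfm-inference repository for the real script.
import os
import modal

MODEL_NAME = os.environ.get("MODEL_NAME", "LiquidAI/LFM2-8B-A1B")

# Use the official vLLM OpenAI-compatible server image and clear its default
# entrypoint so that Modal controls container startup.
vllm_image = modal.Image.from_registry("vllm/vllm-openai:latest").entrypoint([])

app = modal.App("lfm-vllm-docker")

@app.function(image=vllm_image, gpu="H100", timeout=20 * 60)
@modal.concurrent(max_inputs=64)  # serve many requests per container
@modal.web_server(port=8000, startup_timeout=10 * 60)
def serve():
    import subprocess

    # Start the OpenAI-compatible vLLM server inside the container.
    subprocess.Popen(
        ["vllm", "serve", MODEL_NAME, "--host", "0.0.0.0", "--port", "8000"]
    )

Running modal deploy on a script like this prints the resulting HTTPS endpoint URL, which is the <modal-deployment-url> used in the test commands below.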

Option 2. Use vLLM PyPI package

Alternatively, you can use the vLLM PyPI package to deploy LFM. This approach is based on the Modal example for deploying an OpenAI-compatible LLM service with vLLM, with a few modifications.

Launch command:

cd modal

# deploy LFM2 8B MoE model
modal deploy deploy-vllm-pypi.py

# deploy another LFM2 model; MODEL_NAME defaults to LiquidAI/LFM2-8B-A1B
MODEL_NAME=LiquidAI/<model-slug> modal deploy deploy-vllm-pypi.py
Detailed modifications (a sketch of where they land follows the list):
  • Change the MODEL_NAME and MODEL_REVISION to the latest LFM model.
    • E.g. for LFM2-8B-A1B:
      • MODEL_NAME = "LiquidAI/LFM2-8B-A1B"
      • MODEL_REVISION = "6df6a75822a5779f7bf4a21e765cb77d0383935d"
  • Optionally, turn off FAST_BOOT.
  • Optionally, add these environment variables:
    • HF_XET_HIGH_PERFORMANCE=1,
    • VLLM_USE_V1=1,
    • VLLM_USE_FUSED_MOE_GROUPED_TOPK=0.
  • Optionally, add these launch arguments:
    • --dtype bfloat16
    • --gpu-memory-utilization 0.6
    • --max-model-len 32768
    • --max-num-seqs 600
    • --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'
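
As a rough illustration of where these modifications land, the snippet below follows the layout of the Modal vLLM example; the exact structure of deploy-vllm-pypi.py may differ, and the package choices are assumptions.

# Sketch of the modifications listed above, following the Modal vLLM example.
# Names and package versions are illustrative, not copied from deploy-vllm-pypi.py.
import modal

MODEL_NAME = "LiquidAI/LFM2-8B-A1B"
MODEL_REVISION = "6df6a75822a5779f7bf4a21e765cb77d0383935d"
FAST_BOOT = False  # optionally turn off FAST_BOOT

# Image built from the vLLM PyPI package, with the optional environment
# variables applied.
vllm_image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("vllm", "huggingface_hub[hf_transfer]")
    .env(
        {
            "HF_XET_HIGH_PERFORMANCE": "1",
            "VLLM_USE_V1": "1",
            "VLLM_USE_FUSED_MOE_GROUPED_TOPK": "0",
        }
    )
)

# Optional launch arguments appended to the `vllm serve` command line.
extra_args = [
    "--dtype", "bfloat16",
    "--gpu-memory-utilization", "0.6",
    "--max-model-len", "32768",
    "--max-num-seqs", "600",
    "--compilation-config", '{"cudagraph_mode": "FULL_AND_PIECEWISE"}',
]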

Production deployment

  • Prefer the deploy-vllm-docker.py script.
  • vLLM takes over 2 minutes to cold start. If you run the inference server in production, keep a minimum number of warm instances by setting min_containers = 1 and buffer_containers = 1 (see the sketch after this list). The buffer_containers setting is necessary because all Modal GPUs are subject to preemption. See the Modal docs for details on cold start performance tuning.
  • Warm up the vLLM server after deployment by sending a single request. This warm-up step is already included in the deploy-vllm-docker.py script.
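
Both warm-instance settings are parameters of the Modal function decorator. A minimal sketch, with the GPU type and function body as placeholders:

# Sketch of keeping warm capacity for production traffic; the GPU type and
# function body are placeholders, not the actual deployment script.
import modal

app = modal.App("lfm-vllm-docker")

@app.function(
    gpu="H100",           # assumed GPU type
    min_containers=1,     # always keep one warm container running
    buffer_containers=1,  # keep a spare container to absorb preemptions and bursts
)
@modal.web_server(port=8000, startup_timeout=10 * 60)
def serve():
    ...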

Test commands

Test the deployed server with the following curl commands (replace <modal-deployment-url> with your actual deployment URL):

# List the deployed models
curl https://<modal-deployment-url>/v1/models

# Query the deployed LFM model
curl https://<modal-deployment-url>/v1/chat/completions \
  --json '{
    "model": "LiquidAI/LFM2-8B-A1B",
    "messages": [
      {
        "role": "user",
        "content": "What is the melting temperature of silver?"
      }
    ],
    "max_tokens": 32,
    "temperature": 0
  }'
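
Since the server exposes the standard OpenAI-compatible API, you can also query it from Python with the openai client. A minimal sketch; the placeholder URL and the API key handling are assumptions:

# Minimal sketch of querying the deployed server with the openai client.
# Replace <modal-deployment-url>; the api_key value is a placeholder and only
# matters if your deployment enforces authentication.
from openai import OpenAI

client = OpenAI(
    base_url="https://<modal-deployment-url>/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="LiquidAI/LFM2-8B-A1B",
    messages=[
        {"role": "user", "content": "What is the melting temperature of silver?"}
    ],
    max_tokens=32,
    temperature=0,
)
print(response.choices[0].message.content)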