Using vLLM with Curator

You can use vLLM as a backend for Curator in two modes: offline (local) and online (server). This guide demonstrates both approaches using structured recipe generation as an example.

Prerequisites

  • Python 3.10+

  • Curator: Install via pip install bespokelabs-curator

  • vLLM: Install via pip install vllm

Offline Mode (Local)

In offline mode, vLLM runs locally on your machine, loading the model directly into memory.

1. Create Pydantic Models for Structured Output

First, define your data structure using Pydantic models:

from pydantic import BaseModel, Field 
from typing import List

class Recipe(BaseModel): 
    title: str = Field(description="Title of the recipe") 
    ingredients: List[str] = Field(description="List of ingredients needed") 
    instructions: List[str] = Field(description="Step by step cooking instructions") 
    prep_time: int = Field(description="Preparation time in minutes") 
    cook_time: int = Field(description="Cooking time in minutes") 
    servings: int = Field(description="Number of servings")

2. Create a Curator LLM Subclass

Create a class that inherits from curator.LLM and implements two key methods: prompt(), which builds the prompt for each input row, and parse(), which turns the model's structured response into output rows.
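
A minimal sketch is shown below. It assumes Curator's standard LLM interface (prompt() and parse(), with the response format declared as a class attribute); the cuisine input column is a hypothetical example field.

from bespokelabs import curator

class RecipeGenerator(curator.LLM):
    # Tell Curator to request structured output matching the Recipe model
    response_format = Recipe

    def prompt(self, input: dict) -> str:
        # Build the prompt sent to the model for one input row
        return f"Generate a random {input['cuisine']} recipe. Be creative but keep it realistic."

    def parse(self, input: dict, response: Recipe) -> dict:
        # Turn the structured Recipe response into a flat output row
        return {
            "title": response.title,
            "ingredients": response.ingredients,
            "instructions": response.instructions,
            "prep_time": response.prep_time,
            "cook_time": response.cook_time,
            "servings": response.servings,
        }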

3. Initialize and Use the Generator
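
The sketch below shows one way to initialize the generator for offline mode. The model name and input dataset are examples, backend_params accepts the offline parameters listed at the end of this guide, and the call is assumed here to return a Hugging Face Dataset of parsed rows.

from datasets import Dataset

recipe_generator = RecipeGenerator(
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",  # example model
    backend="vllm",
    backend_params={
        "tensor_parallel_size": 1,
        "gpu_memory_utilization": 0.95,
    },
)

# A tiny example input dataset with one row per cuisine
cuisines = Dataset.from_dict({"cuisine": ["Italian", "Mexican", "Japanese"]})

recipes = recipe_generator(cuisines)
print(recipes.to_pandas())  # assuming the call returns a Hugging Face Dataset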

Online Mode (Server)

In online mode, vLLM runs as a standalone server that exposes an OpenAI-compatible API and can handle multiple concurrent requests.

1. Start the vLLM Server

Start the vLLM server with your chosen model:
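
For example, using vLLM's serve command (the model name, host, and port below are placeholders):

vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --host localhost --port 8787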

2. Configure the Generator

Use the same Pydantic models and LLM subclass as in offline mode, but initialize with server configuration:
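
One plausible configuration is sketched below. It assumes the generator reaches the vLLM server's OpenAI-compatible endpoint through Curator's LiteLLM backend using a hosted_vllm/ model prefix; check the exact backend name and backend_params keys against your Curator version. The URL matches the example server command above.

from datasets import Dataset

model_name = "hosted_vllm/meta-llama/Meta-Llama-3.1-8B-Instruct"  # hosted_vllm/ prefix routes the request through LiteLLM

recipe_generator = RecipeGenerator(
    model_name=model_name,
    backend="litellm",
    backend_params={
        "base_url": "http://localhost:8787/v1",  # URL of the vLLM server started above
    },
)

cuisines = Dataset.from_dict({"cuisine": ["Italian", "Mexican", "Japanese"]})
recipes = recipe_generator(cuisines)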

Example Output

The generated recipes will be returned as structured data like:
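
An illustrative (made-up) output row might look like:

{
    "title": "Classic Margherita Pizza",
    "ingredients": ["Pizza dough", "Tomato sauce", "Fresh mozzarella", "Basil leaves", "Olive oil"],
    "instructions": [
        "Preheat the oven to 250°C.",
        "Spread the tomato sauce over the dough.",
        "Add the mozzarella and bake for 10 minutes.",
        "Top with basil and a drizzle of olive oil.",
    ],
    "prep_time": 20,
    "cook_time": 10,
    "servings": 4,
}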

vLLM Offline Configuration

Backend Parameters (for Offline Mode)

  • tensor_parallel_size: Number of GPUs for tensor parallelism (default: 1)

  • gpu_memory_utilization: GPU memory usage fraction between 0 and 1 (default: 0.95)

  • max_model_length: Maximum sequence length (default: 4096)

  • max_tokens: Maximum number of tokens to generate (default: 4096)

  • min_tokens: Minimum number of tokens to generate (default: 1)

  • enforce_eager: Whether to enforce eager execution (default: False)

  • batch_size: Size of batches for processing (default: 256)
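
As a sketch, these parameters are passed through backend_params when constructing the generator; the values below are illustrative.

recipe_generator = RecipeGenerator(
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    backend="vllm",
    backend_params={
        "tensor_parallel_size": 2,        # split the model across 2 GPUs
        "gpu_memory_utilization": 0.90,
        "max_model_length": 4096,
        "max_tokens": 1024,
        "enforce_eager": True,            # disable CUDA graph capture
        "batch_size": 128,
    },
)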
