Using vLLM with Curator

You can use vLLM as a backend for Curator in two modes: offline (local) and online (server). This guide demonstrates both approaches using structured recipe generation as an example.

Prerequisites

  • Python 3.10+

  • Curator: Install via pip install bespokelabs-curator

  • vLLM: Install via pip install vllm
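
vLLM loads the model onto a GPU, so it is worth confirming a CUDA device is visible before running the offline examples. A minimal check, assuming PyTorch (installed as a vLLM dependency):

# Optional sanity check: is a CUDA device visible?
import torch

if torch.cuda.is_available():
    print(f"Found {torch.cuda.device_count()} GPU(s): {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device found; offline vLLM inference will not run.")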

Offline Mode (Local)

In offline mode, vLLM runs locally on your machine, loading the model directly into memory.

1. Create Pydantic Models for Structured Output

First, define your data structure using Pydantic models:

from pydantic import BaseModel, Field 
from typing import List

class Recipe(BaseModel): 
    title: str = Field(description="Title of the recipe") 
    ingredients: List[str] = Field(description="List of ingredients needed") 
    instructions: List[str] = Field(description="Step by step cooking instructions") 
    prep_time: int = Field(description="Preparation time in minutes") 
    cook_time: int = Field(description="Cooking time in minutes") 
    servings: int = Field(description="Number of servings")

2. Create a Curator LLM Subclass

Create a class that inherits from curator.LLM and implements two key methods: prompt, which builds the prompt for each input row, and parse, which converts the structured response into output rows:

from bespokelabs import curator

class RecipeGenerator(curator.LLM): 
    response_format = Recipe
    
    def prompt(self, input: dict) -> str:
        return f"Generate a random {input['cuisine']} recipe. Be creative but keep it realistic."
    
    def parse(self, input: dict, response: Recipe) -> dict:
        return {
            "title": response.title,
            "ingredients": response.ingredients,
            "instructions": response.instructions,
            "prep_time": response.prep_time,
            "cook_time": response.cook_time,
            "servings": response.servings,
        }

3. Initialize and Use the Generator

# Initialize with a local model
generator = RecipeGenerator( 
    model_name="Qwen/Qwen2.5-3B-Instruct", 
    backend="vllm", 
    backend_params={ 
        "tensor_parallel_size": 1, # Adjust based on GPU count 
        "gpu_memory_utilization": 0.7 
    }
)
# Create input dataset
cuisines = [{"cuisine": c} for c in ["Italian", "Chinese", "Mexican"]] 
recipes = generator(cuisines) 
print(recipes.to_pandas())

Online Mode (Server)

In online mode, vLLM runs as a standalone server that exposes an OpenAI-compatible API and can handle multiple requests concurrently.

1. Start the vLLM Server

Start the vLLM server with your chosen model:

vllm serve Qwen/Qwen2.5-3B-Instruct \
    --host localhost \
    --port 8787 \
    --api-key token-abc123
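
Before pointing Curator at the server, you can optionally confirm it is reachable. A minimal sketch using the openai client (an assumption here; install with pip install openai, and note that the host, port, and key match the command above):

# Optional: confirm the vLLM server is up by listing the models it serves
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8787/v1", api_key="token-abc123")
print([model.id for model in client.models.list().data])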

2. Configure the Generator

Use the same Pydantic models and LLM subclass as in offline mode, but initialize with server configuration:

import os

# Set the API key expected by the server (matches --api-key above)
os.environ["HOSTED_VLLM_API_KEY"] = "token-abc123"

# Initialize with server connection
generator = RecipeGenerator( 
    model_name="hosted_vllm/Qwen/Qwen2.5-3B-Instruct", 
    backend="litellm", 
    backend_params={ 
        "base_url": "http://localhost:8787/v1", 
        "request_timeout": 30 
    } 
)

# Generate recipes
recipes = generator(cuisines)
print(recipes.to_pandas())

Example Output

The generated recipes are returned as structured data, for example:

{ 
    "title": "Spicy Szechuan Noodles", 
    "ingredients": [ 
        "400g wheat noodles", 
        "2 tbsp Szechuan peppercorns", 
        "3 cloves garlic, minced", 
        "2 tbsp soy sauce" 
    ], 
    "instructions": [ 
        "Boil noodles according to package instructions", 
        "Heat oil in a wok over medium-high heat", 
        "Add peppercorns and garlic, stir-fry until fragrant", 
        "Add noodles and soy sauce, toss to combine" 
    ], 
    "prep_time": 15, 
    "cook_time": 20, 
    "servings": 4 
}
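
The generator returns a dataset-style object; to_pandas() (used above) converts it to a DataFrame that you can persist however you like. For instance, a minimal sketch writing every generated recipe to a JSON Lines file (the filename is illustrative):

# Persist the generated recipes for later use
df = recipes.to_pandas()
df.to_json("recipes.jsonl", orient="records", lines=True)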

vLLM Offline Configuration

Backend Parameters (for Offline Mode)

These parameters can be passed via backend_params when using the vllm backend; a configuration sketch follows the list.

  • tensor_parallel_size: Number of GPUs for tensor parallelism (default: 1)

  • gpu_memory_utilization: GPU memory usage fraction between 0 and 1 (default: 0.95)

  • max_model_length: Maximum sequence length (default: 4096)

  • max_tokens: Maximum number of tokens to generate (default: 4096)

  • min_tokens: Minimum number of tokens to generate (default: 1)

  • enforce_eager: Whether to enforce eager execution (default: False)

  • batch_size: Size of batches for processing (default: 256)
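
As a sketch of how these map onto the constructor, here is an offline configuration with every parameter set explicitly (values are illustrative; tune them to your hardware):

# Illustrative offline configuration; adjust values to your GPUs and model
generator = RecipeGenerator(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    backend="vllm",
    backend_params={
        "tensor_parallel_size": 2,       # shard the model across 2 GPUs
        "gpu_memory_utilization": 0.90,  # fraction of each GPU's memory to use
        "max_model_length": 8192,        # maximum sequence length
        "max_tokens": 2048,              # cap on tokens generated per request
        "min_tokens": 1,                 # minimum tokens generated per request
        "enforce_eager": True,           # enforce eager execution
        "batch_size": 128,               # requests processed per batch
    },
)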
