# Using vLLM with Curator

You can use vLLM as a backend for Curator in two modes: offline (local) and online (server). This guide demonstrates both approaches using structured recipe generation as an example.

## Prerequisites <a href="#prerequisites" id="prerequisites"></a>

* Python 3.10+
* Curator: Install via `pip install bespokelabs-curator`
* vLLM: Install via `pip install vllm`

## Offline Mode (Local) <a href="#offline-mode-local" id="offline-mode-local"></a>

In offline mode, vLLM runs locally on your machine, loading the model directly into memory.

### 1. Create Pydantic Models for Structured Output <a href="#id-1-create-pydantic-models-for-structured-output" id="id-1-create-pydantic-models-for-structured-output"></a>

First, define your data structure using Pydantic models:

```python
from pydantic import BaseModel, Field
from typing import List

class Recipe(BaseModel):
    title: str = Field(description="Title of the recipe")
    ingredients: List[str] = Field(description="List of ingredients needed")
    instructions: List[str] = Field(description="Step by step cooking instructions")
    prep_time: int = Field(description="Preparation time in minutes")
    cook_time: int = Field(description="Cooking time in minutes")
    servings: int = Field(description="Number of servings")
```
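As a quick sanity check, you can instantiate the model by hand and let Pydantic validate the field types before wiring it into Curator. The snippet below repeats the model so it is self-contained, and assumes Pydantic v2 (`model_dump`); the sample values are purely illustrative:

```python
from pydantic import BaseModel, Field
from typing import List

class Recipe(BaseModel):
    title: str = Field(description="Title of the recipe")
    ingredients: List[str] = Field(description="List of ingredients needed")
    instructions: List[str] = Field(description="Step by step cooking instructions")
    prep_time: int = Field(description="Preparation time in minutes")
    cook_time: int = Field(description="Cooking time in minutes")
    servings: int = Field(description="Number of servings")

# Validate a hand-written recipe against the schema;
# a type mismatch here would raise a ValidationError.
sample = Recipe(
    title="Toast",
    ingredients=["2 slices of bread"],
    instructions=["Toast the bread"],
    prep_time=1,
    cook_time=3,
    servings=1,
)
print(sample.model_dump())
```

If the model accepts your sample data, the same schema will constrain the structured output the LLM produces.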

### 2. Create a Curator LLM Subclass <a href="#id-2-create-a-curator-llm-subclass" id="id-2-create-a-curator-llm-subclass"></a>

Create a class that inherits from `curator.LLM` and implements two key methods, `prompt` and `parse`:

```python
from bespokelabs import curator

class RecipeGenerator(curator.LLM):
    response_format = Recipe

    def prompt(self, input: dict) -> str:
        return f"Generate a random {input['cuisine']} recipe. Be creative but keep it realistic."

    def parse(self, input: dict, response: Recipe) -> dict:
        return {
            "title": response.title,
            "ingredients": response.ingredients,
            "instructions": response.instructions,
            "prep_time": response.prep_time,
            "cook_time": response.cook_time,
            "servings": response.servings,
        }
```

### 3. Initialize and Use the Generator <a href="#id-3-initialize-and-use-the-generator" id="id-3-initialize-and-use-the-generator"></a>

```python
# Initialize with a local model
generator = RecipeGenerator(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    backend="vllm",
    backend_params={
        "tensor_parallel_size": 1,     # Adjust based on GPU count
        "gpu_memory_utilization": 0.7,
    },
)

# Create input dataset and generate
cuisines = [{"cuisine": c} for c in ["Italian", "Chinese", "Mexican"]]
recipes = generator(cuisines)
print(recipes.dataset.to_pandas())
```

## Online Mode (Server) <a href="#online-mode-server" id="online-mode-server"></a>

In online mode, vLLM runs as a standalone server that can handle multiple requests.

### 1. Start the vLLM Server <a href="#id-1-start-the-vllm-server" id="id-1-start-the-vllm-server"></a>

Start the vLLM server with your chosen model:

```bash
vllm serve Qwen/Qwen2.5-3B-Instruct \
    --host localhost \
    --port 8787 \
    --api-key token-abc123
```
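`vllm serve` exposes an OpenAI-compatible API, so a quick way to verify the server is reachable is to query its `/v1/models` endpoint with the API key in the `Authorization` header. The helper below (a hypothetical name, not part of Curator or vLLM) only builds the request; the commented `urlopen` call at the end is what actually contacts the server:

```python
import json
import urllib.request

def build_models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated GET request for the server's model listing."""
    return urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )

req = build_models_request("http://localhost:8787", "token-abc123")

# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.dumps(json.load(resp), indent=2))
```

A successful response lists the served model, confirming the host, port, and API key before you point Curator at the server.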

### 2. Configure the Generator <a href="#id-2-configure-the-generator" id="id-2-configure-the-generator"></a>

Use the same Pydantic models and LLM subclass as in offline mode, but initialize with server configuration:

```python
import os

# Set the API key expected by the server
os.environ["HOSTED_VLLM_API_KEY"] = "token-abc123"

# Initialize with server connection
generator = RecipeGenerator(
    model_name="hosted_vllm/Qwen/Qwen2.5-3B-Instruct",
    backend="litellm",
    backend_params={
        "base_url": "http://localhost:8787/v1",
        "request_timeout": 30,
    },
)

# Generate recipes
cuisines = [{"cuisine": c} for c in ["Italian", "Chinese", "Mexican"]]
recipes = generator(cuisines)
print(recipes.dataset.to_pandas())
```

## Example Output <a href="#example-output" id="example-output"></a>

The generated recipes are returned as structured data, for example:

```json
{
    "title": "Spicy Szechuan Noodles",
    "ingredients": [
        "400g wheat noodles",
        "2 tbsp Szechuan peppercorns",
        "3 cloves garlic, minced",
        "2 tbsp soy sauce"
    ],
    "instructions": [
        "Boil noodles according to package instructions",
        "Heat oil in a wok over medium-high heat",
        "Add peppercorns and garlic, stir-fry until fragrant",
        "Add noodles and soy sauce, toss to combine"
    ],
    "prep_time": 15,
    "cook_time": 20,
    "servings": 4
}
```

## vLLM Offline Configuration <a href="#configuration-options" id="configuration-options"></a>

### Backend Parameters (for Offline Mode) <a href="#backend-parameters" id="backend-parameters"></a>

* `tensor_parallel_size`: Number of GPUs for tensor parallelism (default: 1)
* `gpu_memory_utilization`: GPU memory usage fraction between 0 and 1 (default: 0.95)
* `max_model_length`: Maximum sequence length (default: 4096)
* `max_tokens`: Maximum number of tokens to generate (default: 4096)
* `min_tokens`: Minimum number of tokens to generate (default: 1)
* `enforce_eager`: Whether to enforce eager execution (default: False)
* `batch_size`: Size of batches for processing (default: 256)
