# Using vLLM with Curator

You can use vLLM as a backend for Curator in two modes: offline (local) and online (server). This guide demonstrates both approaches using structured recipe generation as an example.

## Prerequisites <a href="#prerequisites" id="prerequisites"></a>

* Python 3.10+
* Curator: Install via `pip install bespokelabs-curator`
* vLLM: Install via `pip install vllm`

## Offline Mode (Local) <a href="#offline-mode-local" id="offline-mode-local"></a>

In offline mode, vLLM runs locally on your machine, loading the model directly into memory.

### 1. Create Pydantic Models for Structured Output <a href="#id-1-create-pydantic-models-for-structured-output" id="id-1-create-pydantic-models-for-structured-output"></a>

First, define your data structure using Pydantic models:

```python
from pydantic import BaseModel, Field
from typing import List

class Recipe(BaseModel):
    title: str = Field(description="Title of the recipe")
    ingredients: List[str] = Field(description="List of ingredients needed")
    instructions: List[str] = Field(description="Step by step cooking instructions")
    prep_time: int = Field(description="Preparation time in minutes")
    cook_time: int = Field(description="Cooking time in minutes")
    servings: int = Field(description="Number of servings")
```
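Before wiring the model into Curator, you can sanity-check it with plain Pydantic. This sketch (assuming Pydantic v2, which provides `model_json_schema`) instantiates `Recipe` with sample data and inspects the JSON schema that will constrain the structured output:

```python
from pydantic import BaseModel, Field
from typing import List

class Recipe(BaseModel):
    title: str = Field(description="Title of the recipe")
    ingredients: List[str] = Field(description="List of ingredients needed")
    instructions: List[str] = Field(description="Step by step cooking instructions")
    prep_time: int = Field(description="Preparation time in minutes")
    cook_time: int = Field(description="Cooking time in minutes")
    servings: int = Field(description="Number of servings")

# Instantiate with sample data to confirm the fields validate
sample = Recipe(
    title="Test Pasta",
    ingredients=["pasta", "olive oil"],
    instructions=["Boil pasta", "Toss with oil"],
    prep_time=5,
    cook_time=10,
    servings=2,
)

# The JSON schema is what constrains the model's structured output
schema = Recipe.model_json_schema()
print(sorted(schema["properties"]))
# → ['cook_time', 'ingredients', 'instructions', 'prep_time', 'servings', 'title']
```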

### 2. Create a Curator LLM Subclass <a href="#id-2-create-a-curator-llm-subclass" id="id-2-create-a-curator-llm-subclass"></a>

Create a class that inherits from `curator.LLM` and implements two key methods, `prompt` and `parse`:

```python
from bespokelabs import curator

class RecipeGenerator(curator.LLM):
    response_format = Recipe

    def prompt(self, input: dict) -> str:
        return f"Generate a random {input['cuisine']} recipe. Be creative but keep it realistic."

    def parse(self, input: dict, response: Recipe) -> dict:
        return {
            "title": response.title,
            "ingredients": response.ingredients,
            "instructions": response.instructions,
            "prep_time": response.prep_time,
            "cook_time": response.cook_time,
            "servings": response.servings,
        }
```
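The `parse` method above is a plain field mapping from the validated `Recipe` back to an output row. As a standalone sketch of that logic (no vLLM or Curator needed; `parse_recipe` is a hypothetical helper mirroring `RecipeGenerator.parse`):

```python
from pydantic import BaseModel
from typing import List

class Recipe(BaseModel):
    title: str
    ingredients: List[str]
    instructions: List[str]
    prep_time: int
    cook_time: int
    servings: int

def parse_recipe(response: Recipe) -> dict:
    # Same mapping as RecipeGenerator.parse: one output row per response
    return {
        "title": response.title,
        "ingredients": response.ingredients,
        "instructions": response.instructions,
        "prep_time": response.prep_time,
        "cook_time": response.cook_time,
        "servings": response.servings,
    }

resp = Recipe(
    title="Garlic Noodles",
    ingredients=["noodles", "garlic"],
    instructions=["Boil", "Toss"],
    prep_time=5,
    cook_time=10,
    servings=2,
)
row = parse_recipe(resp)
print(row["title"])  # → Garlic Noodles
```

For a flat model like this, the mapping is equivalent to `response.model_dump()`; writing it out explicitly just makes the output columns obvious.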

### 3. Initialize and Use the Generator <a href="#id-3-initialize-and-use-the-generator" id="id-3-initialize-and-use-the-generator"></a>

```python
# Initialize with a local model
generator = RecipeGenerator(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    backend="vllm",
    backend_params={
        "tensor_parallel_size": 1,  # Adjust based on GPU count
        "gpu_memory_utilization": 0.7,
    },
)

# Create input dataset
cuisines = [{"cuisine": c} for c in ["Italian", "Chinese", "Mexican"]]
recipes = generator(cuisines)
print(recipes.dataset.to_pandas())
```

## Online Mode (Server) <a href="#online-mode-server" id="online-mode-server"></a>

In online mode, vLLM runs as a server that exposes an OpenAI-compatible API and can handle multiple requests.

### 1. Start the vLLM Server <a href="#id-1-start-the-vllm-server" id="id-1-start-the-vllm-server"></a>

Start the vLLM server with your chosen model:

```bash
vllm serve Qwen/Qwen2.5-3B-Instruct \
    --host localhost \
    --port 8787 \
    --api-key token-abc123
```

### 2. Configure the Generator <a href="#id-2-configure-the-generator" id="id-2-configure-the-generator"></a>

Use the same Pydantic models and LLM subclass as in offline mode, but initialize with server configuration:

```python
import os

# Set the API key expected by the server, if one is required
os.environ["HOSTED_VLLM_API_KEY"] = "token-abc123"

# Initialize with server connection
generator = RecipeGenerator(
    model_name="hosted_vllm/Qwen/Qwen2.5-3B-Instruct",
    backend="litellm",
    backend_params={
        "base_url": "http://localhost:8787/v1",
        "request_timeout": 30,
    },
)

# Generate recipes
recipes = generator(cuisines)
print(recipes.dataset.to_pandas())
```

## Example Output <a href="#example-output" id="example-output"></a>

Each generated recipe is returned as structured data, for example:

```json
{ 
    "title": "Spicy Szechuan Noodles", 
    "ingredients": [ 
        "400g wheat noodles", 
        "2 tbsp Szechuan peppercorns", 
        "3 cloves garlic, minced", 
        "2 tbsp soy sauce" 
    ], 
    "instructions": [ 
        "Boil noodles according to package instructions", 
        "Heat oil in a wok over medium-high heat", 
        "Add peppercorns and garlic, stir-fry until fragrant", 
        "Add noodles and soy sauce, toss to combine" 
    ], 
    "prep_time": 15, 
    "cook_time": 20, 
    "servings": 4 
}
```
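Because the output conforms to the `Recipe` schema, you can round-trip records through the Pydantic model as a cheap validation step before downstream use. A sketch assuming Pydantic v2 (`model_validate_json`):

```python
from pydantic import BaseModel
from typing import List

class Recipe(BaseModel):
    title: str
    ingredients: List[str]
    instructions: List[str]
    prep_time: int
    cook_time: int
    servings: int

raw = """{
    "title": "Spicy Szechuan Noodles",
    "ingredients": ["400g wheat noodles", "2 tbsp Szechuan peppercorns"],
    "instructions": ["Boil noodles", "Stir-fry aromatics"],
    "prep_time": 15,
    "cook_time": 20,
    "servings": 4
}"""

# Raises pydantic.ValidationError if a field is missing or mistyped
recipe = Recipe.model_validate_json(raw)
total_time = recipe.prep_time + recipe.cook_time
print(total_time)  # → 35
```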

## vLLM Offline Configuration <a href="#configuration-options" id="configuration-options"></a>

### Backend Parameters (for Offline Mode) <a href="#backend-parameters" id="backend-parameters"></a>

* `tensor_parallel_size`: Number of GPUs for tensor parallelism (default: 1)
* `gpu_memory_utilization`: GPU memory usage fraction between 0 and 1 (default: 0.95)
* `max_model_length`: Maximum sequence length (default: 4096)
* `max_tokens`: Maximum number of tokens to generate (default: 4096)
* `min_tokens`: Minimum number of tokens to generate (default: 1)
* `enforce_eager`: Whether to enforce eager execution (default: False)
* `batch_size`: Size of batches for processing (default: 256)
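Putting these parameters together, a fuller offline configuration might look like the following; the values here are illustrative examples, not recommendations:

```python
# Illustrative backend_params for offline vLLM (values are examples only)
backend_params = {
    "tensor_parallel_size": 2,       # shard the model across 2 GPUs
    "gpu_memory_utilization": 0.9,   # leave ~10% GPU memory headroom
    "max_model_length": 8192,        # maximum sequence length
    "max_tokens": 1024,              # cap on generated tokens per request
    "enforce_eager": False,          # keep eager mode off (default)
    "batch_size": 128,               # requests processed per batch
}
print(len(backend_params))  # → 6
```

This dict would be passed as `backend_params=...` when constructing the generator, as shown in the offline example above.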

