# Batch Processing

This guide demonstrates how to use **Curator** for batch processing, specifically reannotating datasets. We'll walk through an example using the **WildChat** dataset to create new responses for its conversations.

## **Prerequisites**

* **Python 3.10+**
* **Curator**: Install via `pip install bespokelabs-curator`
* Access to an LLM provider (e.g., OpenAI or equivalent API)

## **Steps**

### **1. Load and Prepare the Dataset**

Use the **Hugging Face Datasets** library to load the WildChat dataset and select a subset for reannotation.

```python
from datasets import load_dataset

dataset = load_dataset("allenai/WildChat", split="train")
dataset = dataset.select(range(3_000))  # Select a subset of 3,000 samples
```

### **2. Create a Curator.LLM Subclass**

Define a subclass of `curator.LLM` to handle prompt generation and parsing. The subclass defines how the model processes inputs and outputs.

```python
from bespokelabs import curator

class WildChatReannotator(curator.LLM):
    """A reannotator for the WildChat dataset."""

    def prompt(self, input: dict) -> str:
        """Extract the first message from a conversation to use as the prompt."""
        return input["conversation"][0]["content"]

    def parse(self, input: dict, response: str) -> dict:
        """Parse the model response along with the input to the model into the desired output format."""
        instruction = input["conversation"][0]["content"]
        return {"instruction": instruction, "new_response": response}
```

### **3. Configurer Batch Processing**

Set up the LLM reannotator with batch processing enabled. Specify the `batch_size` parameter to determine the number of samples processed in each batch.

```python
import logging

# Enable detailed logging for batch processing
logger = logging.getLogger("bespokelabs.curator")
logger.setLevel(logging.INFO)

# Initialize the reannotator with batch processing
distiller = WildChatReannotator(
    model_name="gpt-4o-mini",
    batch=True,  # Enable batch processing
    backend_params={"batch_size": 1_000},  # Specify batch size
)
```

### **4. Process the Dataset**

Run the distiller on the dataset to generate new annotations.

```python
distilled_dataset = distiller(dataset)
```

### **5. Inspect the Results**

Print the distilled dataset to verify the new annotations.

```python
print(distilled_dataset)
print(distilled_dataset[0])
```

Example Output:

```python
{'instruction': 'What is the capital of France?', 'new_response': 'The capital of France is Paris.'}
```

### Batch Processing Configuration:

### **1. Supported models**

**`model_name`**

Currently, we only support batch mode for Anthropic and OpenAI models. You can change the model by setting the `model_name` argument in the `LLM` constructor.

### **2. Batch Size**

**`batch_size`**

* **Description**: Maximum number of requests to process in a single batch.
* **Best Practice**:
  * For large datasets, choose a value that balances efficiency and memory usage.
  * For LLM APIs refer the number of requests per batch allowed and set this accordingly.

**Example**:

```python
backend_params={
    "batch_size": 1_000  # Process 1,000 requests per batch
}
```

### **3. Batch Check Interval**

**`batch_check_interval`**

* **Description**: Time in seconds between status checks for active batches.
* **Best Practice**:
  * Set a low value (e.g., 5–10 seconds) for near real-time monitoring.
  * Increase the interval for long-running jobs to reduce overhead.

**Example**:

```python
backend_params={
    "batch_check_interval": 10  # Check batch status every 10 seconds
}
```

### **4. Delete Successful Batch Files**

**`delete_successful_batch_files`**

* **Description**: Whether to delete batch files after successful processing to save storage.
* **Best Practice**:
  * Enable (`True`) for production or disk-constrained environments.
  * Keep (`False`) for debugging or when an audit trail is needed.

**Example**:

```python
backend_params={
    "delete_successful_batch_files": True  # Automatically delete successful batch files
}
```

### **5. Delete Failed Batch Files**

**`delete_failed_batch_files`**

* **Description**: Whether to delete batch files after failed processing to free up space.
* **Best Practice**:
  * Enable (`True`) if the failures are logged elsewhere or can be regenerated.
  * Disable (`False`) when debugging or troubleshooting errors.

**Example**:

```python
backend_params={
    "delete_failed_batch_files": False  # Retain failed batch files for debugging
}
```

### **Example Configuration**

Below is an example combining all the options for optimized batch processing:

```python
backend_params={
    "batch_size": 1_000,                    # Process 1,000 requests per batch
    "batch_check_interval": 10,            # Check batch status every 10 seconds
    "delete_successful_batch_files": True, # Delete files after successful processing
    "delete_failed_batch_files": False     # Retain files after failed processing
}
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/batch-processing.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
