Batch Processing

This guide demonstrates how to use Curator for batch processing, specifically reannotating datasets. We'll walk through an example using the WildChat dataset to create new responses for its conversations.

Prerequisites

  • Python 3.10+

  • Curator: Install via pip install bespokelabs-curator

  • Access to an LLM provider (e.g., OpenAI or equivalent API)

Steps

1. Load and Prepare the Dataset

Use the Hugging Face Datasets library to load the WildChat dataset and select a subset for reannotation.

from datasets import load_dataset

dataset = load_dataset("allenai/WildChat", split="train")
dataset = dataset.select(range(3_000))  # Select a subset of 3,000 samples

2. Create a Curator.LLM Subclass

Define a subclass of curator.LLM to handle prompt generation and parsing. The subclass defines how the model processes inputs and outputs.

from bespokelabs import curator

class WildChatReannotator(curator.LLM):
    """A reannotator for the WildChat dataset."""

    def prompt(self, input: dict) -> str:
        """Extract the first message from a conversation to use as the prompt."""
        return input["conversation"][0]["content"]

    def parse(self, input: dict, response: str) -> dict:
        """Parse the model response along with the input to the model into the desired output format."""
        instruction = input["conversation"][0]["content"]
        return {"instruction": instruction, "new_response": response}

3. Configurer Batch Processing

Set up the LLM reannotator with batch processing enabled. Specify the batch_size parameter to determine the number of samples processed in each batch.

4. Process the Dataset

Run the distiller on the dataset to generate new annotations.

5. Inspect the Results

Print the distilled dataset to verify the new annotations.

Example Output:

Batch Processing Configuration:

1. Supported models

model_name

Currently, we only support batch mode for Anthropic and OpenAI models. You can change the model by setting the model_name argument in the LLM constructor.

2. Batch Size

batch_size

  • Description: Maximum number of requests to process in a single batch.

  • Best Practice:

    • For large datasets, choose a value that balances efficiency and memory usage.

    • For LLM APIs refer the number of requests per batch allowed and set this accordingly.

Example:

3. Batch Check Interval

batch_check_interval

  • Description: Time in seconds between status checks for active batches.

  • Best Practice:

    • Set a low value (e.g., 5–10 seconds) for near real-time monitoring.

    • Increase the interval for long-running jobs to reduce overhead.

Example:

4. Delete Successful Batch Files

delete_successful_batch_files

  • Description: Whether to delete batch files after successful processing to save storage.

  • Best Practice:

    • Enable (True) for production or disk-constrained environments.

    • Keep (False) for debugging or when an audit trail is needed.

Example:

5. Delete Failed Batch Files

delete_failed_batch_files

  • Description: Whether to delete batch files after failed processing to free up space.

  • Best Practice:

    • Enable (True) if the failures are logged elsewhere or can be regenerated.

    • Disable (False) when debugging or troubleshooting errors.

Example:

Example Configuration

Below is an example combining all the options for optimized batch processing:

Last updated