Basics

Key Features, Installation, and Example.

Overview

Curator is an open-source library for building highly programmable and efficient synthetic data pipelines for post-training and structured data extraction at scale. It has built-in support for performance optimization, caching, and retries, and comes with the Curator Viewer, which lets you inspect outputs and iterate on your data generation.

Key Features

  1. Programmability and Structured Outputs: Synthetic data generation is a lot more than calling a single prompt; it is a sequence of LLM calls. You can orchestrate complex pipelines of LLM calls and use structured outputs to decide on control flow. Curator treats structured outputs as first-class citizens.

  2. Built-in Performance Optimization: We often see LLMs called in loops, or multi-threading implemented inefficiently. We have baked in performance optimizations so that you don't need to worry about them!

  3. Intelligent Caching and Fault Recovery: Since LLM calls add up in cost and time, failures are undesirable but sometimes unavoidable. We cache LLM requests and responses so that it is easy to recover from a failure. Moreover, when working on a multi-stage pipeline, caching each stage makes it easy to iterate.

  4. Native HuggingFace Dataset Integration: Work directly with HuggingFace Dataset objects throughout your pipeline. Your synthetic data is immediately ready for fine-tuning!

  5. Interactive Curator Viewer: Improve your prompts using the built-in viewer. Inspect LLM requests and responses in real time, and refine your data generation strategy with immediate feedback.

Coming soon: Observability for cost monitoring, pre-baked validators.

Installation

pip install bespokelabs-curator

Quick Example

from bespokelabs import curator
from datasets import Dataset
from pydantic import BaseModel, Field
from typing import List

# Create a dataset object with the topics you want to create poems about.
topics = Dataset.from_dict({"topic": [
    "Urban loneliness in a bustling city",
    "Beauty of Bespoke Labs's Curator library"
]})

# Define a class to encapsulate a list of poems.
class Poem(BaseModel):
    poem: str = Field(description="A poem.")

class Poems(BaseModel):
    poems_list: List[Poem] = Field(description="A list of poems.")


# Define a Prompter that generates poems; it is applied to each row of the topics dataset.
poet = curator.Prompter(
    # `prompt_func` takes a row of the dataset as input.
    # `row` is a dictionary with a single key 'topic' in this case.
    prompt_func=lambda row: f"Write two poems about {row['topic']}.",
    model_name="gpt-4o-mini",
    response_format=Poems,
    # `row` is the input row, and `poems` is an instance of the `Poems` class,
    # parsed from the LLM's structured output.
    parse_func=lambda row, poems: [
        {"topic": row["topic"], "poem": p.poem} for p in poems.poems_list
    ],
)

poems = poet(topics)
print(poems.to_pandas())
# Example output:
#    topic                                     poem
# 0  Urban loneliness in a bustling city       In the city's heart, where the sirens wail,\nA...
# 1  Urban loneliness in a bustling city       City streets hum with a bittersweet song,\nHor...
# 2  Beauty of Bespoke Labs's Curator library  In whispers of design and crafted grace,\nBesp...
# 3  Beauty of Bespoke Labs's Curator library  In the hushed breath of parchment and ink,\nBe...

poems is a HuggingFace Dataset object with two columns:

Dataset({
    features: ['topic', 'poem'],
    num_rows: 4
})
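
Because the output is a standard HuggingFace Dataset, you can save it or push it to the Hub like any other dataset. The repository name below is a placeholder:

# Save locally, or push to the HuggingFace Hub (requires a HuggingFace token).
poems.save_to_disk("poems_dataset")
poems.push_to_hub("your-username/curator-poems")  # placeholder repo id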

Key Components of Prompter

Here is an invocation of Prompter:

poet = curator.Prompter(
    prompt_func=lambda row: f"Write two poems about {row['topic']}.",
    model_name="gpt-4o-mini",
    response_format=Poems,
    parse_func=lambda row, poems: [
        {"topic": row["topic"], "poem": p} for p in poems.poems_list
    ],
)

It has two important arguments: prompt_func and parse_func.

prompt_func

This is applied to every row of the input dataset, and the resulting prompts are sent to the LLM in parallel (a named-function equivalent is sketched after the list below). It:

  1. Takes a dataset row as input

  2. Returns the prompt for the LLM.
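
A lambda works for short prompts, as in the Quick Example; an equivalent named function (mirroring the lambda above) looks like this:

def prompt_func(row):
    # `row` is a dictionary with the dataset's columns; here just 'topic'.
    return f"Write two poems about {row['topic']}."

# Pass it to the Prompter as prompt_func=prompt_func.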

parse_func

Converts the LLM's response into new rows that are added back to the dataset (a named-function version is sketched after the list below). It:

  1. Takes two arguments:

    • Input row (this was given to the LLM).

    • The LLM's response, in the type given by response_format (a string or a Pydantic object).

  2. Returns new rows (as a list of dictionaries).
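
Likewise, the parse_func lambda from the Quick Example can be written as a named function:

def parse_func(row, poems):
    # `poems` is an instance of the `Poems` model parsed from the LLM's structured output.
    return [{"topic": row["topic"], "poem": p.poem} for p in poems.poems_list]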

Data Flow Example

Input Dataset:

Row A 
Row B 

Processing by Prompter:

Row A → prompt_func(A) → Response R1 → parse_func(A, R1) → [C, D] 
Row B → prompt_func(B) → Response R2 → parse_func(B, R2) → [E, F]

Output Dataset:

Row C 
Row D 
Row E 
Row F

In this example:

  • The two input rows (A and B) are processed in parallel to prompt the LLM

  • Each generates a response (R1 and R2)

  • The parse function converts each response into (multiple) new rows (C, D, E, F)

  • The final dataset contains all generated rows

You can chain prompters together to iteratively build up a dataset.
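
For example, a second Prompter could consume the poems dataset generated above and add a critique column. The Critique model and critic prompter below are illustrative, not part of the library:

# Hypothetical second stage: critique each generated poem.
class Critique(BaseModel):
    critique: str = Field(description="A short critique of the poem.")

critic = curator.Prompter(
    prompt_func=lambda row: f"Critique this poem about {row['topic']}:\n\n{row['poem']}",
    model_name="gpt-4o-mini",
    response_format=Critique,
    parse_func=lambda row, c: [{**row, "critique": c.critique}],
)

critiqued_poems = critic(poems)  # the Dataset returned by `poet` feeds `critic`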
