Structured Output

Structured Output for Data Generation with LLMs

This example demonstrates how to use structured output with a custom LLM class to generate poems on different topics while maintaining a clean data structure:

from typing import Dict, List
from datasets import Dataset
from pydantic import BaseModel, Field
from bespokelabs import curator

# Define our structured output models
class Poem(BaseModel):
    poem: str = Field(description="A poem.")

class Poems(BaseModel):
    poems: List[Poem] = Field(description="A list of poems.")

# Create a custom LLM class with specialized prompting and parsing
class Poet(curator.LLM):
    response_format = Poems

    def prompt(self, input: Dict) -> str:
        return f"Write two poems about {input['topic']}."

    def parse(self, input: Dict, response: Poems) -> List[Dict]:
        # Flatten the structured response: one row per poem, keyed to its topic
        return [{"topic": input["topic"], "poem": p.poem} for p in response.poems]

# Initialize our custom LLM
poet = Poet(model_name="gpt-4o-mini")

# Create a dataset of topics
topics = Dataset.from_dict({
    "topic": [
        "Urban loneliness in a bustling city", 
        "Beauty of Bespoke Labs's Curator library"
    ]
})

# Generate poems
poem = poet(topics)
print(poem.to_pandas())

# Output:
#                                       topic                                               poem
# 0       Urban loneliness in a bustling city  In the city's heart, where the lights never di...
# 1       Urban loneliness in a bustling city  Steps echo loudly, pavement slick with rain,\n...
# 2  Beauty of Bespoke Labs's Curator library  In the heart of Curation's realm,\nWhere art...
# 3  Beauty of Bespoke Labs's Curator library  Step within the library's embrace,\nA sanctu...

How This Works:

  1. Structured Models: We define Pydantic models (Poem and Poems) that specify the expected structure of our LLM output.

  2. Custom Poet Class: By inheriting from curator.LLM, we create a specialized class that:

    • Sets response_format = Poems to specify the output structure

    • Implements a prompt() method that formats our input into a proper prompt

    • Implements a parse() method that transforms the structured response into a list of dictionaries where each poem is a separate row with its associated topic

  3. Processing Pipeline: When we call poet(topics), our custom class:

    • Takes each topic from the dataset

    • Creates a prompt for each topic

    • Sends the prompt to the LLM

    • Parses the structured response

    • Returns a dataset where each row contains a topic and a single poem

This approach gives us clean, structured data that's ready for analysis or further processing while maintaining the relationship between inputs (topics) and outputs (poems).
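
Under the hood, setting response_format = Poems tells Curator to request output conforming to the schema and to hand parse() an already-validated Poems instance. As a minimal sketch (assuming Pydantic v2), you can reproduce that validation step yourself on a raw JSON payload shaped like the LLM's response:

from typing import List
from pydantic import BaseModel, Field

class Poem(BaseModel):
    poem: str = Field(description="A poem.")

class Poems(BaseModel):
    poems: List[Poem] = Field(description="A list of poems.")

# A raw payload shaped like the structured response from the LLM
raw = '{"poems": [{"poem": "First poem..."}, {"poem": "Second poem..."}]}'

# model_validate_json raises a ValidationError if the payload
# does not match the schema
parsed = Poems.model_validate_json(raw)
print(parsed.poems[0].poem)  # First poem...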

Chaining LLM calls with structured output

Using structured output along with custom prompting and parsing logic lets you chain multiple LLM objects together into powerful data generation pipelines.

Let's return to our poem example. Suppose we also want an LLM to generate the topics of the poems. We can accomplish this with a second LLM object, as shown below. Note that Muse.prompt() ignores its input and muse() is called without a dataset, so the prompt runs a single time and parse() fans the response out into one row per generated topic.

from typing import Dict, List

from pydantic import BaseModel, Field

from bespokelabs import curator


class Topic(BaseModel):
    topic: str = Field(description="A topic.")


class Topics(BaseModel):
    topics: List[Topic] = Field(description="A list of topics.")


class Muse(curator.LLM):
    response_format = Topics

    def prompt(self, input: Dict) -> str:
        return "Generate ten evocative poetry topics."

    def parse(self, input: Dict, response: Topics) -> List[Dict]:
        # One row per generated topic
        return [{"topic": topic.topic} for topic in response.topics]


class Poem(BaseModel):
    poem: str = Field(description="A poem.")


class Poems(BaseModel):
    poems: List[Poem] = Field(description="A list of poems.")


class Poet(curator.LLM):
    response_format = Poems

    def prompt(self, input: Dict) -> str:
        return f"Write two poems about {input['topic']}."

    def parse(self, input: Dict, response: Poems) -> List[Dict]:
        # Flatten the structured response: one row per poem, keyed to its topic
        return [{"topic": input["topic"], "poem": p.poem} for p in response.poems]


muse = Muse(model_name="gpt-4o-mini")
topics = muse()
print(topics.to_pandas())

poet = Poet(model_name="gpt-4o-mini")
poem = poet(topics)
print(poem.to_pandas())
# Output:
#                                                topic                                               poem
# 0               The fleeting beauty of autumn leaves  In a whisper of wind, they dance and they sway...
# 1               The fleeting beauty of autumn leaves  Once vibrant with life, now a radiant fade,\nC...
# 2                 The whispers of an abandoned house  In shadows deep where light won’t tread,  \nAn...
# 3                 The whispers of an abandoned house  Abandoned now, my heart does fade,  \nOnce a h...
# 4               The warmth of a forgotten summer day  In the stillness of a memory's embrace,  \nA w...
# 5               The warmth of a forgotten summer day  A gentle breeze delivers the trace  \nOf a day...
# ...

Chaining multiple LLM calls this way lets us build powerful synthetic data pipelines that can create millions of examples.
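
For instance, a third stage could score each generated poem. The Critic class below is an illustrative sketch following the same pattern, not part of Curator itself; it consumes the poem dataset produced above:

from typing import Dict
from pydantic import BaseModel, Field
from bespokelabs import curator

class Rating(BaseModel):
    score: int = Field(description="A quality score from 1 to 10.")

class Critic(curator.LLM):
    response_format = Rating

    def prompt(self, input: Dict) -> str:
        return f"Rate this poem about {input['topic']} from 1 to 10:\n\n{input['poem']}"

    def parse(self, input: Dict, response: Rating) -> Dict:
        # Keep topic and poem in the row so the score stays joined to its inputs
        return {"topic": input["topic"], "poem": input["poem"], "score": response.score}

critic = Critic(model_name="gpt-4o-mini")
rated = critic(poem)  # `poem` is the dataset produced by Poet above
print(rated.to_pandas())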
