Structured Output

Structured Output for Data Generation with LLMs

This example demonstrates how to use structured output with a custom LLM class to generate poems on different topics while maintaining a clean data structure:

from typing import Dict, List
from datasets import Dataset
from pydantic import BaseModel, Field
from bespokelabs import curator

# Define our structured output models
class Poem(BaseModel):
    poem: str = Field(description="A poem.")

class Poems(BaseModel):
    poems: List[Poem] = Field(description="A list of poems.")

# Create a custom LLM class with specialized prompting and parsing
class Poet(curator.LLM):
    response_format = Poems

    def prompt(self, input: Dict) -> str:
        return f"Write two poems about {input['topic']}."

    def parse(self, input: Dict, response: Poems) -> List[Dict]:
        # Flatten the structured response: one row per poem, keyed to its topic
        return [{"topic": input["topic"], "poem": p.poem} for p in response.poems]

# Initialize our custom LLM
poet = Poet(model_name="gpt-4o-mini")

# Create a dataset of topics
topics = Dataset.from_dict({
    "topic": [
        "Urban loneliness in a bustling city", 
        "Beauty of Bespoke Labs's Curator library"
    ]
})

# Generate poems
poem = poet(topics)
print(poem.to_pandas())

# Output:
#                                       topic                                               poem
# 0       Urban loneliness in a bustling city  In the city's heart, where the lights never di...
# 1       Urban loneliness in a bustling city  Steps echo loudly, pavement slick with rain,\n...
# 2  Beauty of Bespoke Labs's Curator library  In the heart of Curation's realm,\nWhere art...
# 3  Beauty of Bespoke Labs's Curator library  Step within the library's embrace,\nA sanctu...

How This Works:

  1. Structured Models: We define Pydantic models (Poem and Poems) that specify the expected structure of our LLM output.

  2. Custom Poet Class: By inheriting from curator.LLM, we create a specialized class that:

    • Sets response_format = Poems to specify the output structure

    • Implements a prompt() method that formats our input into a proper prompt

    • Implements a parse() method that transforms the structured response into a list of dictionaries where each poem is a separate row with its associated topic

  3. Processing Pipeline: When we call poet(topics), our custom class:

    • Takes each topic from the dataset

    • Creates a prompt for each topic

    • Sends the prompt to the LLM

    • Parses the structured response

    • Returns a dataset where each row contains a topic and a single poem

This approach gives us clean, structured data that's ready for analysis or further processing while maintaining the relationship between inputs (topics) and outputs (poems).
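
Under the hood, setting response_format = Poems tells Curator to request output conforming to the schema and to hand parse() an already-validated Poems instance. As a minimal sketch (assuming Pydantic v2), you can reproduce that validation step yourself on a raw JSON payload shaped like the LLM's response:

from typing import List
from pydantic import BaseModel, Field

class Poem(BaseModel):
    poem: str = Field(description="A poem.")

class Poems(BaseModel):
    poems: List[Poem] = Field(description="A list of poems.")

# A raw payload shaped like the structured response from the LLM
raw = '{"poems": [{"poem": "First poem..."}, {"poem": "Second poem..."}]}'

# model_validate_json raises a ValidationError if the payload
# does not match the schema
parsed = Poems.model_validate_json(raw)
print(parsed.poems[0].poem)  # First poem...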

Chaining LLM calls with structured output

Using structured output along with custom prompting and parsing logic lets you chain multiple LLM objects together into powerful data generation pipelines.

Let's return to our poem example. Suppose we also want an LLM to generate the topics of the poems. We can accomplish this with a second LLM object, as shown below. Note that Muse.prompt() ignores its input and muse() is called without a dataset, so the prompt runs a single time and parse() fans the response out into one row per generated topic.

from typing import Dict, List

from pydantic import BaseModel, Field

from bespokelabs import curator


class Topic(BaseModel):
    topic: str = Field(description="A topic.")


class Topics(BaseModel):
    topics: List[Topic] = Field(description="A list of topics.")


class Muse(curator.LLM):
    response_format = Topics

    def prompt(self, input: Dict) -> str:
        return "Generate ten evocative poetry topics."

    def parse(self, input: Dict, response: Topics) -> List[Dict]:
        # One row per generated topic
        return [{"topic": topic.topic} for topic in response.topics]


class Poem(BaseModel):
    poem: str = Field(description="A poem.")


class Poems(BaseModel):
    poems: List[Poem] = Field(description="A list of poems.")


class Poet(curator.LLM):
    response_format = Poems

    def prompt(self, input: Dict) -> str:
        return f"Write two poems about {input['topic']}."

    def parse(self, input: Dict, response: Poems) -> List[Dict]:
        # Flatten the structured response: one row per poem, keyed to its topic
        return [{"topic": input["topic"], "poem": p.poem} for p in response.poems]


muse = Muse(model_name="gpt-4o-mini")
topics = muse()
print(topics.to_pandas())

poet = Poet(model_name="gpt-4o-mini")
poem = poet(topics)
print(poem.to_pandas())
# Output:
#                                                topic                                               poem
# 0               The fleeting beauty of autumn leaves  In a whisper of wind, they dance and they sway...
# 1               The fleeting beauty of autumn leaves  Once vibrant with life, now a radiant fade,\nC...
# 2                 The whispers of an abandoned house  In shadows deep where light won’t tread,  \nAn...
# 3                 The whispers of an abandoned house  Abandoned now, my heart does fade,  \nOnce a h...
# 4               The warmth of a forgotten summer day  In the stillness of a memory's embrace,  \nA w...
# 5               The warmth of a forgotten summer day  A gentle breeze delivers the trace  \nOf a day...
# ...

Chaining multiple LLM calls this way lets us build powerful synthetic data pipelines that can create millions of examples.
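
For instance, a third stage could score each generated poem. The Critic class below is an illustrative sketch following the same pattern, not part of Curator itself; it consumes the poem dataset produced above:

from typing import Dict
from pydantic import BaseModel, Field
from bespokelabs import curator

class Rating(BaseModel):
    score: int = Field(description="A quality score from 1 to 10.")

class Critic(curator.LLM):
    response_format = Rating

    def prompt(self, input: Dict) -> str:
        return f"Rate this poem about {input['topic']} from 1 to 10:\n\n{input['poem']}"

    def parse(self, input: Dict, response: Rating) -> Dict:
        # Keep topic and poem in the row so the score stays joined to its inputs
        return {"topic": input["topic"], "poem": input["poem"], "score": response.score}

critic = Critic(model_name="gpt-4o-mini")
rated = critic(poem)  # `poem` is the dataset produced by Poet above
print(rated.to_pandas())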
