Structured Output
Structured Output for Data Generation with LLMs
This example demonstrates how to use structured output with a custom LLM class to generate poems on different topics while maintaining a clean data structure:
How This Works:

1. Structured Models: We define Pydantic models (`Poem` and `Poems`) that specify the expected structure of our LLM output.
2. Custom Poet Class: By inheriting from `curator.LLM`, we create a specialized class that:
   - Sets `response_format = Poems` to specify the output structure
   - Implements a `prompt()` method that formats our input into a proper prompt
   - Implements a `parse()` method that transforms the structured response into a list of dictionaries, where each poem is a separate row with its associated topic
3. Processing Pipeline: When we call `poet(topics)`, our custom class:
   - Takes each topic from the dataset
   - Creates a prompt for each topic
   - Sends the prompt to the LLM
   - Parses the structured response
   - Returns a dataset where each row contains a topic and a single poem
This approach gives us clean, structured data that's ready for analysis or further processing while maintaining the relationship between inputs (topics) and outputs (poems).
Chaining LLM calls with structured output
Using structured output along with custom prompting and parsing logic allows you to chain together multiple calls to the `LLM` class to create powerful data generation pipelines.
Let's return to our example of generating poems. Suppose we also want to use an LLM to generate the topics of the poems. This can be accomplished by using another `LLM` object to generate the topics, as shown in the example below.
Chaining multiple `LLM` calls in this way allows us to build powerful synthetic data pipelines that can create millions of examples.