Quick Tour

Installation

pip install bespokelabs-curator
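Curator picks up provider credentials from environment variables. For the OpenAI examples below you would typically export OPENAI_API_KEY in your shell before running; as an illustration, the same thing can be done from Python (the key value here is a placeholder):

```python
import os

# Curator's OpenAI backend reads the standard OPENAI_API_KEY variable.
# Setting it from Python is an alternative to exporting it in your shell.
os.environ["OPENAI_API_KEY"] = "sk-..."  # replace with your real key
```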

Hello World with LLM

The LLM class provides a flexible interface to generate data with LLMs. Below is a minimal example of using LLM: we simply create an LLM object with a model_name, in this case gpt-4o-mini, and pass in a prompt.

from bespokelabs import curator
llm = curator.LLM(model_name="gpt-4o-mini")
poem = llm("Write a poem about the importance of data in AI.")
print(poem.to_pandas())
# Output:
#                                             response
# 0  In the realm where silence once held sway,  \n...

# Or you can pass a list of prompts to generate multiple responses.
poems = llm(["Write a poem about the importance of data in AI.",
            "Write a haiku about the importance of data in AI."])
print(poems.to_pandas())
# Output:
#                                             response
# 0  In the realm where silence once held sway,  \n...
# 1  Silent streams of truth,  \nData shapes the le...

Using Different Models

You can also use models from other providers by simply changing model_name (supported via LiteLLM). Note that each provider expects its own API key in the environment (for example, ANTHROPIC_API_KEY for Claude models):

from bespokelabs import curator

llm = curator.LLM(model_name="claude-3-5-haiku-20241022")
poem = llm("Write a poem about the importance of data in AI.")
print(poem.to_pandas())

Using Structured Output

Adding structured output to your generation

Let's look at some more interesting examples of data generation using structured output.

Suppose you want to generate multiple poems from a single LLM call. Structured output is your friend! Using structured output allows you to easily validate and parse LLM responses:

from typing import List

from pydantic import BaseModel, Field

from bespokelabs import curator


class Poem(BaseModel):
    poem: str = Field(description="A poem.")


class Poems(BaseModel):
    poems_list: List[Poem] = Field(description="A list of poems.")


llm = curator.LLM(model_name="gpt-4o-mini", response_format=Poems)
poems = llm(["Write two poems about the importance of data in AI.", 
              "Write three haikus about the importance of data in AI."])
print(poems.to_pandas())

# Output: 
#                                           poems_list
# 0  [{'poem': 'In shadows deep where silence lies,...
# 1  [{'poem': 'Data whispers truth,  \nPatterns wea...

Note how each row in the dataset now holds a parsed Poems object that is easy to manipulate using Python code.

Defining your custom prompting and parsing logic

Sometimes, it might not be enough to simply get back the responses. For example, you might want to preserve the mapping between each topic and its corresponding poems, and you might want each poem to occupy only a single row. In this case, you can define a Poet object that inherits from LLM, and define your custom prompting and parsing logic:

from typing import Dict, List

from datasets import Dataset
from pydantic import BaseModel, Field

from bespokelabs import curator


class Poem(BaseModel):
    poem: str = Field(description="A poem.")


class Poems(BaseModel):
    poems: List[Poem] = Field(description="A list of poems.")


class Poet(curator.LLM):
    response_format = Poems

    def prompt(self, input: Dict) -> str:
        return f"Write two poems about {input['topic']}."

    def parse(self, input: Dict, response: Poems) -> List[Dict]:
        return [{"topic": input["topic"], "poem": p.poem} for p in response.poems]


poet = Poet(model_name="gpt-4o-mini")

topics = Dataset.from_dict({"topic": ["Urban loneliness in a bustling city", "Beauty of Bespoke Labs's Curator library"]})
poem = poet(topics)
print(poem.to_pandas())
# Output:
#                                       topic                                               poem
# 0       Urban loneliness in a bustling city  In the city’s heart, where the lights never di...
# 1       Urban loneliness in a bustling city  Steps echo loudly, pavement slick with rain,\n...
# 2  Beauty of Bespoke Labs's Curator library  In the heart of Curation’s realm,  \nWhere art...
# 3  Beauty of Bespoke Labs's Curator library  Step within the library’s embrace,  \nA sanctu...

In the Poet class:

  • response_format is the structured output class we defined above.

  • prompt takes the input (input) and returns the prompt for the LLM.

  • parse takes the input (input) and the structured output (response) and converts it to a list of dictionaries. This is so that we can easily convert the output to a HuggingFace Dataset object.
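To make the fan-out that parse produces concrete, here is a rough offline sketch (no LLM call; plain dataclasses stand in for the pydantic response classes): one input row with a topic becomes one output row per poem.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Poem:
    poem: str


@dataclass
class Poems:
    poems: List[Poem]


def parse(input: Dict, response: Poems) -> List[Dict]:
    # Mirrors Poet.parse above: one output row per poem,
    # carrying the topic along from the input row.
    return [{"topic": input["topic"], "poem": p.poem} for p in response.poems]


rows = parse({"topic": "rain"},
             Poems(poems=[Poem(poem="first"), Poem(poem="second")]))
print(rows)
# [{'topic': 'rain', 'poem': 'first'}, {'topic': 'rain', 'poem': 'second'}]
```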

Chaining LLM calls with structured output

Using structured output along with custom prompting and parsing logic allows you to chain together multiple calls to the LLM class to create powerful data generation pipelines.

Let's return to our example of generating poems. Suppose we want to also use LLMs to generate the topics of the poems. This can be accomplished by using another LLM object to generate the topics, as shown in the example below.

from typing import Dict, List

from pydantic import BaseModel, Field

from bespokelabs import curator


class Topic(BaseModel):
    topic: str = Field(description="A topic.")


class Topics(BaseModel):
    topics: List[Topic] = Field(description="A list of topics.")


class Muse(curator.LLM):
    response_format = Topics

    def prompt(self, input: Dict) -> str:
        return "Generate ten evocative poetry topics."

    def parse(self, input: Dict, response: Topics) -> List[Dict]:
        return [{"topic": topic.topic} for topic in response.topics]


class Poem(BaseModel):
    poem: str = Field(description="A poem.")


class Poems(BaseModel):
    poems: List[Poem] = Field(description="A list of poems.")


class Poet(curator.LLM):
    response_format = Poems

    def prompt(self, input: Dict) -> str:
        return f"Write two poems about {input['topic']}."

    def parse(self, input: Dict, response: Poems) -> List[Dict]:
        return [{"topic": input["topic"], "poem": p.poem} for p in response.poems]


muse = Muse(model_name="gpt-4o-mini")
topics = muse()
print(topics.to_pandas())

poet = Poet(model_name="gpt-4o-mini")
poem = poet(topics)
print(poem.to_pandas())
# Output:
#                                                topic                                               poem
# 0               The fleeting beauty of autumn leaves  In a whisper of wind, they dance and they sway...
# 1               The fleeting beauty of autumn leaves  Once vibrant with life, now a radiant fade,\nC...
# 2                 The whispers of an abandoned house  In shadows deep where light won’t tread,  \nAn...
# 3                 The whispers of an abandoned house  Abandoned now, my heart does fade,  \nOnce a h...
# 4               The warmth of a forgotten summer day  In the stillness of a memory's embrace,  \nA w...
# 5               The warmth of a forgotten summer day  A gentle breeze delivers the trace  \nOf a day...
# ...

Chaining multiple LLM calls this way allows us to build powerful synthetic data pipelines that can create millions of examples.
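The dataflow of the two-stage pipeline above can be sketched offline with toy stand-ins (no API calls): the first stage emits topic rows, and the second stage consumes them and fans each topic out into two poem rows.

```python
from typing import Dict, List


def muse_stage() -> List[Dict]:
    # Stand-in for Muse: would normally ask the LLM for topics.
    return [{"topic": "autumn leaves"}, {"topic": "abandoned house"}]


def poet_stage(rows: List[Dict]) -> List[Dict]:
    # Stand-in for Poet: would normally ask the LLM for two poems per topic.
    out = []
    for row in rows:
        for i in (1, 2):
            out.append({"topic": row["topic"],
                        "poem": f"poem {i} about {row['topic']}"})
    return out


poems = poet_stage(muse_stage())
print(len(poems))  # 4 rows: two poems for each of the two topics
```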

What's next?

For an in-depth tutorial of the core features in our library, please continue to Tutorials.

For how-to guides on specific topics and workflows, please continue to How-to Guides.

For an end-to-end example, see Finetuning a model to identify features of a product.
