Conceptual Guides

Key Components of curator.LLM

Conceptually, curator.LLM has two important methods, prompt and parse.

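from typing import Dict, List

from bespokelabs import curator
from pydantic import BaseModel, Field


class Poem(BaseModel):
    poem: str = Field(description="A poem.")


class Poems(BaseModel):
    poems_list: List[Poem] = Field(description="A list of poems.")


class Poet(curator.LLM):
    response_format = Poems

    def prompt(self, input: Dict) -> str:
        # Build the prompt for a single input row.
        return f"Write two poems about {input['topic']}."

    def parse(self, input: Dict, response: Poems) -> List[Dict]:
        # Emit one output row per poem in the structured response.
        return [{"topic": input["topic"], "poem": p.poem} for p in response.poems_list]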

prompt

Builds the prompt for each row of the input dataset. The LLM is then called on the rows in parallel.

  1. Takes a dataset row as input

  2. Returns the prompt for the LLM.

parse

Converts the LLM's output into structured rows that are added back to the dataset.

  1. Takes two arguments:

    • The input row (the same row that was passed to prompt).

    • The LLM's response, in the form set by response_format (a plain string or a Pydantic model).

  2. Returns new rows as a list of dictionaries.
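
As a minimal sketch of this contract, independent of the Curator library itself, the two methods can be written as plain functions. The Poem and Poems dataclasses here are stand-ins for the Pydantic models, and the hand-built response is a hypothetical value in place of what an LLM would return.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Poem:
    poem: str


@dataclass
class Poems:
    poems_list: List[Poem]


def prompt(row: Dict) -> str:
    # Takes one dataset row and returns the prompt text.
    return f"Write two poems about {row['topic']}."


def parse(row: Dict, response: Poems) -> List[Dict]:
    # Takes the input row plus the structured response and returns new rows.
    return [{"topic": row["topic"], "poem": p.poem} for p in response.poems_list]


row = {"topic": "autumn"}
# Hypothetical response, built by hand instead of calling an LLM.
response = Poems(poems_list=[Poem("Leaves fall."), Poem("Cold wind.")])

prompt_text = prompt(row)
new_rows = parse(row, response)
```

Here one input row yields two output rows, one per poem, each carrying the original topic alongside the generated text.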

Data Flow Example

Input Dataset:

Row A 
Row B 

Processing by curator.LLM:

Row A → prompt(A) → Response R1 → parse(A, R1) → [C, D] 
Row B → prompt(B) → Response R2 → parse(B, R2) → [E, F]

Output Dataset:

Row C 
Row D 
Row E 
Row F

In this example:

  • The two input rows (A and B) are processed in parallel to prompt the LLM

  • Each generates a response (R1 and R2)

  • The parse function converts each response into one or more new rows (C and D from R1, E and F from R2)

  • The final dataset contains all generated rows
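
The flow above can be simulated in plain Python. This is only a sketch of the idea, not how Curator is implemented: fake_llm, run, and the row contents are hypothetical, and a thread pool stands in for whatever parallelism the library actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def fake_llm(prompt_text: str) -> str:
    # Hypothetical stand-in for a model call: returns two lines of text.
    return f"first answer to: {prompt_text}\nsecond answer to: {prompt_text}"


def prompt(row: Dict) -> str:
    return f"Describe {row['name']}."


def parse(row: Dict, response: str) -> List[Dict]:
    # One output row per line of the response.
    return [{"source": row["name"], "text": line} for line in response.splitlines()]


def run(rows, prompt_fn, parse_fn, llm) -> List[Dict]:
    def process(row):
        return parse_fn(row, llm(prompt_fn(row)))

    # Rows are independent, so they can be processed concurrently.
    # pool.map keeps results in input order (a choice made in this sketch).
    with ThreadPoolExecutor() as pool:
        per_row = list(pool.map(process, rows))
    # Flatten [[C, D], [E, F]] into [C, D, E, F].
    return [new_row for group in per_row for new_row in group]


output = run([{"name": "A"}, {"name": "B"}], prompt, parse, fake_llm)
```

Two input rows go in and four rows come out, because each parse call returns a list and the lists are concatenated into the final dataset.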

You can chain LLM objects together to iteratively build up a dataset.
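
A rough sketch of chaining in plain Python, assuming two hypothetical stages: the rows returned by one stage's parse become the input rows of the next. The names run_stage, fake_llm, and both stages are illustrative and not part of Curator.

```python
from typing import Dict, List


def fake_llm(prompt_text: str) -> str:
    # Hypothetical stand-in for a model call: upper-cases the prompt.
    return prompt_text.upper()


def run_stage(rows, prompt_fn, parse_fn) -> List[Dict]:
    out: List[Dict] = []
    for row in rows:
        out.extend(parse_fn(row, fake_llm(prompt_fn(row))))
    return out


# Stage 1: topic -> draft
def draft_prompt(row: Dict) -> str:
    return f"draft about {row['topic']}"


def draft_parse(row: Dict, response: str) -> List[Dict]:
    return [{"topic": row["topic"], "draft": response}]


# Stage 2: draft -> title, keeping the columns from stage 1
def title_prompt(row: Dict) -> str:
    return f"title for {row['draft']}"


def title_parse(row: Dict, response: str) -> List[Dict]:
    return [{**row, "title": response}]


topics = [{"topic": "rain"}]
drafts = run_stage(topics, draft_prompt, draft_parse)
final = run_stage(drafts, title_prompt, title_parse)
```

Each stage adds a column, so the dataset grows step by step while earlier fields stay available to later prompts.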
