Basics
Key Features, Installation, and Example.
Overview
Curator is an open-source library for building highly programmable, efficient synthetic data pipelines for post-training and structured data extraction at scale. It has built-in support for performance optimization, caching, and retries, and ships with the Curator Viewer, which lets you iterate on your data generation.
Key Features
Programmability and Structured Outputs: Synthetic data generation is a lot more than calling a single prompt; it is a sequence of LLM calls. You can orchestrate complex pipelines of LLM calls and use structured outputs to drive control flow. Curator treats structured outputs as first-class citizens.
Built-in Performance Optimization: We often see LLMs called in loops, or inefficient hand-rolled multi-threading. Performance optimizations are baked in so that you don't need to worry about those!
Intelligent Caching and Fault Recovery: Because LLM calls add up in cost and time, failures are undesirable but sometimes unavoidable. Curator caches LLM requests and responses so that recovering from a failure is easy. Moreover, in a multi-stage pipeline, per-stage caching makes it easy to iterate.
Native HuggingFace Dataset Integration: Work directly on HuggingFace Dataset objects throughout your pipeline. Your synthetic data is immediately ready for fine-tuning!
Interactive Curator Viewer: Improve and iterate on your prompts using the built-in viewer. Inspect LLM requests and responses in real time, allowing you to refine your data generation strategy with immediate feedback.
Coming soon: observability for cost monitoring and pre-baked validators.
Installation
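Curator is distributed on PyPI; assuming the package name `bespokelabs-curator`:

```bash
pip install bespokelabs-curator
```

This installs the library and, assuming the default entry points, the `curator-viewer` command for the interactive viewer.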
Quick Example
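Below is a minimal sketch of a `Prompter` pipeline that generates poems. The model name, topics, and `Poems` schema are illustrative, and the snippet assumes an OpenAI API key is set in your environment:

```python
from typing import List

from bespokelabs import curator
from datasets import Dataset
from pydantic import BaseModel, Field

# A small input dataset with a single "topic" column.
topics = Dataset.from_dict({"topic": ["the beauty of nature", "urban loneliness"]})

# Structured output schema: the LLM must return a list of poems.
class Poems(BaseModel):
    poems_list: List[str] = Field(description="A list of poems.")

poet = curator.Prompter(
    model_name="gpt-4o-mini",  # illustrative model choice
    prompt_func=lambda row: f"Write two poems about {row['topic']}.",
    response_format=Poems,
    # Fan each structured response out into one output row per poem.
    parse_func=lambda row, poems: [
        {"topic": row["topic"], "poem": p} for p in poems.poems_list
    ],
)

poems = poet(topics)
```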
`poems` is a HuggingFace `Dataset` object with two columns (in the sketch above, `topic` and `poem`).
Key Components of Prompter
Here is an invocation of `Prompter`, repeated from the quick example above:
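```python
poet = curator.Prompter(
    model_name="gpt-4o-mini",
    prompt_func=lambda row: f"Write two poems about {row['topic']}.",
    response_format=Poems,
    parse_func=lambda row, poems: [
        {"topic": row["topic"], "poem": p} for p in poems.poems_list
    ],
)
```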
It has two important arguments: `prompt_func` and `parse_func`.
prompt_func
This is called on each row of the input dataset, and the resulting prompts are sent to the LLM in parallel.
Takes a dataset row as input.
Returns the prompt for the LLM (see the sketch below).
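For instance, the lambda from the quick example can be written as a named function (names are illustrative):

```python
def prompt_func(row):
    # `row` is a single dataset row, e.g. {"topic": "urban loneliness"}.
    return f"Write two poems about {row['topic']}."
```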
parse_func
Converts LLM output into structured data by adding it back to the dataset.
Takes two arguments:
The input row (this was given to the LLM).
The LLM's response (typed by `response_format`: a string or a Pydantic object).
Returns new rows as a list of dictionaries (see the sketch below).
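Continuing the sketch, a named `parse_func` could look like this; the `Poems` schema is the illustrative one defined earlier:

```python
def parse_func(row, poems):
    # `row` is the input row that produced the prompt; `poems` is a
    # Poems instance because response_format=Poems was passed in.
    # Returning a list of dicts fans one response out into many rows.
    return [{"topic": row["topic"], "poem": p} for p in poems.poems_list]
```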
Data Flow Example
Input Dataset: two rows, A and B.
Processing by Prompter: `prompt_func(A)` → LLM → response R1; `prompt_func(B)` → LLM → response R2, in parallel.
Output Dataset: `parse_func(A, R1)` → rows C and D; `parse_func(B, R2)` → rows E and F.
In this example:
The two input rows (A and B) are processed in parallel, each producing a prompt for the LLM.
Each prompt generates a response (R1 and R2, respectively).
The parse function converts each response into (possibly multiple) new rows (C, D, E, F).
The final dataset contains all generated rows.
You can chain prompters together to iteratively build up a dataset.
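As a sketch of chaining (the second model, prompt, and column names are illustrative), the output dataset of one `Prompter` can be passed directly to the next:

```python
# A second Prompter that consumes the `poems` dataset produced above.
# With no response_format, the response arrives as a plain string.
critic = curator.Prompter(
    model_name="gpt-4o-mini",
    prompt_func=lambda row: f"Critique this poem in one sentence:\n\n{row['poem']}",
    parse_func=lambda row, critique: [{**row, "critique": critique}],
)

critiqued_poems = critic(poems)
```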