Data Curation Recipes

Here are some simple data curation recipes to get you started with generating synthetic data at scale using Curator:

In addition to these examples, we also have the following larger examples in our GitHub repo:

| Task | Goal |
| --- | --- |
| Reasoning dataset generation (Bespoke Stratos) | Generate the Bespoke-Stratos-17k dataset, focusing on reasoning traces from math, coding, and problem-solving datasets. |
| Reasoning dataset generation (Open Thoughts) | Generate the Open-Thoughts-114k dataset, focusing on reasoning traces from math, coding, and problem-solving datasets. |
| 3Blue1Brown video generation | Generate videos similar to 3Blue1Brown and render them using code execution. |
