Using SimpleStrat block for generating diverse data


Last updated 1 month ago

StratifiedGenerator: Generate Balanced Question-Answer Pairs

Overview

StratifiedGenerator creates question-answer (QA) pairs with balanced, diverse coverage across your input questions. It is designed to reduce bias toward the most common answers and to represent more of your input space in the generated dataset.

📝 Research Background: For the methodology and theoretical foundation, see the paper "Stratified Generation for Artificial Data in Question Answering".

Installation

pip install bespokelabs-curator

Quick Start Example

from datasets import Dataset
from bespokelabs.curator.blocks.simplestrat import StratifiedGenerator

# Create a simple dataset of questions
questions = Dataset.from_dict({"question": [f"{i}. Name a periodic element" for i in range(20)]})

# Initialize the generator with your preferred model
generator = StratifiedGenerator(model_name="gpt-4o-mini")

# Generate stratified QA pairs
qa_pairs = generator(questions)

# Examine the results
print(f"Generated {len(qa_pairs)} QA pairs")
print(qa_pairs[0])  # View the first QA pair

How StratifiedGenerator Works

  1. Input: A Dataset containing questions you want to generate answers for

  2. Stratification Process: The algorithm:

    • Clusters similar questions together

    • Ensures balanced coverage across different question types

    • Prevents overrepresentation of common patterns

  3. Model Integration: Uses your specified LLM to generate high-quality answers

  4. Output: Returns a new dataset of QA pairs with well-distributed coverage
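The idea of balanced coverage in the steps above can be illustrated with a small, library-independent sketch. This is not StratifiedGenerator's actual implementation: the `balanced_sample` helper and the `stratum_of` grouping function are hypothetical names used only to show how sampling per group keeps a common question type from dominating.

```python
import random
from collections import defaultdict


def balanced_sample(items, stratum_of, per_stratum, seed=0):
    """Group items by stratum, then draw up to `per_stratum` items from each group.

    Hypothetical helper for illustration; not part of bespokelabs-curator.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[stratum_of(item)].append(item)

    sample = []
    for key in sorted(groups):  # sorted for a deterministic stratum order
        members = groups[key]
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Toy input: one common question type and one rare one.
questions = ["Name a metal"] * 8 + ["Name a noble gas"] * 2
picked = balanced_sample(questions, stratum_of=lambda q: q, per_stratum=2)
```

In this toy input the common question type contributes 2 of the 4 sampled questions, rather than 8 of 10 as in the raw list.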

Advanced Usage

Customizing the Generator

# With custom parameters
generator = StratifiedGenerator(
    model_name="gpt-4o",  # Use a more powerful model
    generation_params={
        "temperature": 0.7,  # Adjust creativity
    }
)

Saving and Loading Results

# Push your generated QA pairs to the Hugging Face Hub (replace the placeholder repository name)
qa_pairs.push_to_hub('hf_org/dataset_name')
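To load the results back later, a minimal sketch is shown below. It assumes `qa_pairs` was a Hugging Face `Dataset` (as the `push_to_hub` call suggests) and that it was stored under the default `train` split. The `load_qa_pairs` helper is a hypothetical wrapper, and `hf_org/dataset_name` remains a placeholder.

```python
def load_qa_pairs(repo_id, split="train"):
    """Load a dataset previously pushed with `push_to_hub`.

    `repo_id` is a placeholder such as 'hf_org/dataset_name'.
    Hypothetical wrapper around `datasets.load_dataset`.
    """
    # Imported lazily so this helper can be defined without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split)


# Usage (requires network access and read permission for the repository):
# qa_pairs = load_qa_pairs("hf_org/dataset_name")
```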

Performance Considerations

  • Model Selection: Larger models (e.g., gpt-4o) generally produce higher-quality answers but cost more per request; a smaller model such as gpt-4o-mini is a cheaper starting point
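The cost trade-off can be estimated before a run with simple arithmetic. The sketch below is an assumption-laden illustration: `estimate_cost_usd` is a hypothetical helper, the token counts are guesses, and the prices are placeholders rather than real provider rates. The generator may also issue more than one request per input question, so treat `n_requests` as an estimate.

```python
def estimate_cost_usd(n_requests, input_tokens, output_tokens,
                      usd_per_m_input, usd_per_m_output):
    """Rough cost estimate: tokens per request times price per million tokens."""
    total_in = n_requests * input_tokens
    total_out = n_requests * output_tokens
    return (total_in * usd_per_m_input + total_out * usd_per_m_output) / 1_000_000


# Placeholder prices, NOT real provider rates: substitute current values.
small = estimate_cost_usd(1_000, 200, 100, usd_per_m_input=1.0, usd_per_m_output=2.0)
large = estimate_cost_usd(1_000, 200, 100, usd_per_m_input=10.0, usd_per_m_output=20.0)
```

With these placeholder numbers, a model priced ten times higher per token costs ten times more for the same workload.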

Common Applications

  • Creating balanced training datasets for QA systems

  • Generating diverse test sets for robustness evaluation

  • Augmenting existing datasets with additional QA pairs

  • Creating instruction-tuning datasets with varied coverage

Troubleshooting

Q: My generated answers seem too similar across different questions.
A: Try increasing the temperature parameter or the number of clusters.
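Before changing parameters, it can help to measure how similar the answers actually are. The following is a minimal sketch, assuming each generated row is a dict with an `answer` field; that column name and the `answer_diversity` helper are hypothetical, so adjust them to match your output.

```python
from collections import Counter


def answer_diversity(rows, answer_key="answer"):
    """Return (distinct_ratio, most_common) for the answers in `rows`.

    `rows` is any iterable of dicts; `answer_key` is an assumed column name.
    """
    # Normalize case and whitespace so trivially different answers match.
    answers = [str(row[answer_key]).strip().lower() for row in rows]
    if not answers:
        return 0.0, []
    counts = Counter(answers)
    return len(counts) / len(answers), counts.most_common(3)


# Toy rows standing in for generated QA pairs.
rows = [
    {"question": "Name a periodic element", "answer": "Hydrogen"},
    {"question": "Name a periodic element", "answer": "hydrogen"},
    {"question": "Name a periodic element", "answer": "Oxygen"},
    {"question": "Name a periodic element", "answer": "Gold"},
]
ratio, top = answer_diversity(rows)
```

A ratio close to 1.0 means most answers are distinct; a low ratio, or one answer dominating `top`, suggests the output is collapsing onto a few answers.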

Q: I'm getting API errors during generation.
A: Try a different model or adjust the backend parameters. See the API Reference for the available options.

Additional Resources

  • Stratified Generation for Artificial Data in Question Answering

  • BespokeLabs Documentation

Happy data generation!