# Using SimpleStrat block for generating diverse data

## StratifiedGenerator: Generate Balanced Question-Answer Pairs

### Overview

`StratifiedGenerator` creates question-answer (QA) pairs with balanced, diverse coverage across your input questions. It aims to reduce bias in the generated dataset and give broader representation of your input space.

> 📝 **Research Background**: For the methodology and theoretical foundation, see the paper: [SimpleStrat: Diversifying Language Model Generation with Stratification](https://arxiv.org/pdf/2410.09038)

### Installation

```bash
pip install bespokelabs-curator
```

### Quick Start Example

```python
from datasets import Dataset
from bespokelabs.curator.blocks.simplestrat import StratifiedGenerator

# Create a simple dataset of questions
questions = Dataset.from_dict({"question": [f"{i}. Name a periodic element" for i in range(20)]})

# Initialize the generator with your preferred model
generator = StratifiedGenerator(model_name="gpt-4o-mini")

# Generate stratified QA pairs
qa_pairs = generator(questions).dataset

# Examine the results
print(f"Generated {len(qa_pairs)} QA pairs")
print(qa_pairs[0])  # View the first QA pair
```

### How StratifiedGenerator Works

1. **Input**: A `Dataset` containing questions you want to generate answers for
2. **Stratification Process**: The algorithm:
   * Clusters similar questions together
   * Ensures balanced coverage across different question types
   * Prevents overrepresentation of common patterns
3. **Model Integration**: Uses your specified LLM to generate high-quality answers
4. **Output**: Returns a new dataset of QA pairs with well-distributed coverage
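The balancing idea in step 2 can be illustrated with plain stratified sampling. This is a minimal sketch using only the standard library, not the `StratifiedGenerator` implementation: the `stratified_sample` helper, the stratum labels, and the element list are all hypothetical and exist only for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(items, stratum_of, per_stratum, seed=0):
    """Draw up to `per_stratum` items from each stratum so no group dominates."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)

    sample = []
    for label in sorted(strata):  # sorted for a stable output order
        members = strata[label]
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical, skewed pool: many metals, few noble gases and nonmetals.
elements = [
    ("iron", "metal"), ("copper", "metal"), ("gold", "metal"),
    ("zinc", "metal"), ("silver", "metal"), ("nickel", "metal"),
    ("helium", "noble gas"), ("neon", "noble gas"),
    ("oxygen", "nonmetal"), ("carbon", "nonmetal"),
]

balanced = stratified_sample(elements, stratum_of=lambda e: e[1], per_stratum=2)

counts = defaultdict(int)
for _, group in balanced:
    counts[group] += 1
print(dict(counts))  # {'metal': 2, 'noble gas': 2, 'nonmetal': 2}
```

Sampling uniformly from the raw pool would return metals about 60% of the time; sampling per stratum gives each group equal weight, which is the kind of overrepresentation the stratification step is meant to prevent.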

### Advanced Usage

#### Customizing the Generator

```python
# With custom parameters
generator = StratifiedGenerator(
    model_name="gpt-4o",  # Use a more powerful model
    generation_params={
        "temperature": 0.7,  # Adjust creativity
    }
)
```
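To build intuition for what `temperature` does, the sketch below applies temperature scaling to a softmax over made-up scores. It illustrates the general sampling concept only; it is not how Curator or any particular model backend is implemented, and the `softmax_with_temperature` helper and the `logits` values are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring candidate, while higher temperatures spread it more evenly, which is why raising the temperature tends to produce more varied output.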

#### Saving Results

```python
# Push your generated QA pairs to the Hugging Face Hub
qa_pairs.push_to_hub("hf_org/dataset_name")
```

### Performance Considerations

* **Model Selection**: Larger models (e.g., `gpt-4o`) generally produce higher-quality answers but cost more per request

### Common Applications

* Creating balanced training datasets for QA systems
* Generating diverse test sets for robustness evaluation
* Augmenting existing datasets with additional QA pairs
* Creating instruction-tuning datasets with varied coverage

### Troubleshooting

**Q: My generated answers seem too similar across different questions.**\
A: Try increasing the temperature parameter or the number of clusters.
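Before tuning anything, it can help to quantify how similar the answers actually are. The following is a minimal sketch of one simple measure, the fraction of unique answers after light normalization. The `distinct_ratio` helper and the sample answers are hypothetical and not part of Curator.

```python
def distinct_ratio(answers):
    """Fraction of answers that are unique after trimming and lowercasing."""
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        return 0.0
    return len(set(normalized)) / len(normalized)


# Hypothetical answers; substitute the answer column of your own output dataset.
sample_answers = ["Hydrogen", "hydrogen ", "Helium", "Oxygen", "Hydrogen"]
print(distinct_ratio(sample_answers))  # 3 unique out of 5 -> 0.6
```

A ratio close to 1.0 means most answers are distinct; a low ratio suggests the model is collapsing onto a few common answers. Comparing the ratio before and after a change shows whether the change helped.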

**Q: I'm getting API errors during generation.**\
A: Try changing the model or the backend parameters. The API reference lists the available options.

### Additional Resources

* [BespokeLabs Documentation](https://docs.bespokelabs.ai/)

Happy data generation!

