Generating a diverse QA dataset

This tutorial will guide you through creating a hierarchical dataset of diverse question-answer pairs using a structured approach similar to the CAMEL dataset. We'll build a pipeline that generates subjects, subsubjects, and corresponding Q&A pairs.

Introduction

In many AI training scenarios, having diverse question-answer pairs across multiple domains is valuable. This tutorial demonstrates how to create "ungrounded" Q&A pairs, meaning pairs that are generated by a language model rather than extracted from existing texts.

Let's begin!

Step 1: Setting Up the Environment

First, let's set up our environment and import the required libraries.

# Install required packages if not already installed
# !pip install pydantic bespokelabs-curator

# Import necessary libraries
from typing import List
from pydantic import BaseModel, Field
from bespokelabs import curator

import os
# Remove this line if you don't want to use Curator Viewer.
# Environment variable values must be strings, so use "1" rather than 1.
os.environ["CURATOR_VIEWER"] = "1"

Step 2: Define Data Models

We'll use Pydantic to create structured data models that will help validate our language model outputs.

class Subject(BaseModel):
    """A single subject."""
    subject: str = Field(description="A subject")

class Subjects(BaseModel):
    """A list of subjects."""
    subjects: List[Subject] = Field(description="A list of subjects")

class QA(BaseModel):
    """A question and answer pair."""
    question: str = Field(description="A question")
    answer: str = Field(description="An answer")

class QAs(BaseModel):
    """A list of question and answer pairs."""
    qas: List[QA] = Field(description="A list of QAs")

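These models ensure that our data maintains a consistent structure throughout the pipeline. Pydantic validates each response against the declared field types, and the Field descriptions are carried into the JSON schema generated from each model, where they document what each field should contain.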
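To make the validation role concrete, here is a minimal sketch that checks two hand-written JSON payloads against the QAs model. The payloads and the `try_parse` helper are hypothetical and not part of Curator; they only illustrate what Pydantic accepts and rejects.

```python
import json
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class QA(BaseModel):
    """A question and answer pair."""
    question: str = Field(description="A question")
    answer: str = Field(description="An answer")


class QAs(BaseModel):
    """A list of question and answer pairs."""
    qas: List[QA] = Field(description="A list of QAs")


# Hypothetical raw payloads, written by hand for illustration.
good_raw = '{"qas": [{"question": "What is 2 + 2?", "answer": "4"}]}'
bad_raw = '{"qas": [{"question": "What is 2 + 2?"}]}'  # "answer" is missing


def try_parse(raw: str) -> Optional[QAs]:
    """Return a QAs instance, or None if the payload does not match the model."""
    try:
        return QAs(**json.loads(raw))
    except ValidationError:
        return None


good = try_parse(good_raw)
bad = try_parse(bad_raw)
print(good is not None, bad is None)
```

A payload that omits a required field is rejected as a whole, so malformed responses do not leak partially filled rows into the dataset.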

Step 3: Create Subject Generator

Now, let's build our first component - a generator for high-level subjects.

class SubjectGenerator(curator.LLM):
    """A subject generator that generates diverse subjects."""
    response_format = Subjects

    def prompt(self, input: dict) -> str:
        """Generate a prompt for the subject generator."""
        return "Generate a diverse list of 3 subjects. Keep it high-level (e.g. Math, Science)."

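    def parse(self, input: dict, response: Subjects) -> List[dict]:
        """Parse the model response into one row per subject."""
        # Return plain dicts so each row exposes a "subject" key to the next stage.
        return [{"subject": subject.subject} for subject in response.subjects]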

This generator will produce high-level subjects like "Math," "History," or "Computer Science." The response_format tells the system what structure to expect from the language model's response.

Step 4: Create Subsubject Generator

Next, we'll create a generator for subsubjects that takes each subject and generates more specific topics within it.

class SubsubjectGenerator(curator.LLM):
    """A subsubject generator that generates diverse subsubjects for a given subject."""
    response_format = Subjects

    def prompt(self, input: dict) -> str:
        """Generate a prompt for the subsubject generator."""
        return f"For the given subject {input['subject']}, generate 3 diverse subsubjects. No explanation."

    def parse(self, input: dict, response: Subjects) -> List[dict]:
        """Parse the model response into one row per subsubject."""
        return [{"subject": input["subject"], "subsubject": subsubject.subject}
                for subsubject in response.subjects]

For example, if the subject is "Math," this generator might produce subsubjects like "Calculus," "Algebra," and "Statistics."
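Because parse returns a list, one input row turns into several output rows. The sketch below replays that row-building logic on a stand-in response, so no model is called. `parse_subsubjects` and `fake_response` are hypothetical names used only for illustration.

```python
from types import SimpleNamespace


def parse_subsubjects(input: dict, response) -> list:
    """Same row-building logic as SubsubjectGenerator.parse."""
    return [
        {"subject": input["subject"], "subsubject": s.subject}
        for s in response.subjects
    ]


# Stand-in for a structured response; it mimics the shape of Subjects.
fake_response = SimpleNamespace(
    subjects=[
        SimpleNamespace(subject=name)
        for name in ("Calculus", "Algebra", "Statistics")
    ]
)

rows = parse_subsubjects({"subject": "Math"}, fake_response)
for row in rows:
    print(row)
```

Each output row carries the parent subject along with the new subsubject, which is what lets the next stage keep the hierarchy intact.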

Step 5: Create QA Generator

Now, we'll create our final generator component that produces question-answer pairs for each subsubject.

class QAGenerator(curator.LLM):
    """A QA generator that generates diverse questions and answers for a given subsubject."""
    response_format = QAs

    def prompt(self, input: dict) -> str:
        """Generate a prompt for the QA generator."""
        return f"For the given subsubject {input['subsubject']}, generate 3 diverse questions and answers. No explanation."

    def parse(self, input: dict, response: QAs) -> List[dict]:
        """Parse the model response into the desired output format."""
        return [
            {
                "subject": input["subject"],
                "subsubject": input["subsubject"],
                "question": qa.question,
                "answer": qa.answer,
            }
            for qa in response.qas
        ]

This generator takes a subsubject and creates Q&A pairs relevant to that topic. It maintains the hierarchical structure by keeping track of both the subject and subsubject.
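With 3 items requested at each level, the hierarchy fans out to 3 subjects, 9 subsubjects, and 27 Q&A rows, assuming the model returns exactly the requested number every time. The dry run below checks that arithmetic with stub functions in place of the LLM-backed generators. All names here are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def fan_out(rows: List[Row], expand: Callable[[Row], List[Row]]) -> List[Row]:
    """Apply expand to every row and flatten the results into one list."""
    return [new_row for row in rows for new_row in expand(row)]


# Hypothetical stubs standing in for the three generators.
def stub_subjects() -> List[Row]:
    return [{"subject": s} for s in ("Math", "History", "Biology")]


def stub_subsubjects(row: Row) -> List[Row]:
    return [
        {"subject": row["subject"], "subsubject": f"{row['subject']} topic {i}"}
        for i in range(1, 4)
    ]


def stub_qas(row: Row) -> List[Row]:
    return [
        {**row, "question": f"Q{i} about {row['subsubject']}?", "answer": f"A{i}"}
        for i in range(1, 4)
    ]


subjects = stub_subjects()
subsubjects = fan_out(subjects, stub_subsubjects)
qa_rows = fan_out(subsubjects, stub_qas)
print(len(subjects), len(subsubjects), len(qa_rows))
```

In a real run the counts can differ, since a model may return more or fewer items than the prompt asks for.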

Step 6: Run the Complete Pipeline

Now let's run our complete pipeline and see the results:

# Step 1: Generate subjects
subject_generator = SubjectGenerator(model_name="gpt-4o-mini")
subject_dataset = subject_generator()

# Step 2: Generate subsubjects for each subject
subsubject_generator = SubsubjectGenerator(model_name="gpt-4o-mini")
subsubject_dataset = subsubject_generator(subject_dataset.dataset)

# Step 3: Generate Q&A pairs for each subsubject
qa_generator = QAGenerator(model_name="gpt-4o-mini")
qa_dataset = qa_generator(subsubject_dataset.dataset)

# Clean up answers by stripping whitespace
qa_dataset = qa_dataset.dataset.map(lambda row: {"answer": row["answer"].strip()}, num_proc=2)

# Print the final dataset
print(qa_dataset.to_pandas())
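The cleanup lambda returns only the answer key. Assuming qa_dataset.dataset is a Hugging Face datasets.Dataset (the map call with num_proc suggests it is), map merges the returned dict into each row, so the other columns are kept. The plain-Python sketch below emulates that merge; `map_rows` is a hypothetical helper, not a library function.

```python
from typing import Callable, Dict, List


def map_rows(rows: List[Dict], fn: Callable[[Dict], Dict]) -> List[Dict]:
    """Emulate the merge: keys returned by fn overwrite or extend each row."""
    return [{**row, **fn(row)} for row in rows]


rows = [
    {
        "subject": "Math",
        "subsubject": "Calculus",
        "question": "What is the derivative of x^2?",
        "answer": "  2x \n",
    },
]

cleaned = map_rows(rows, lambda row: {"answer": row["answer"].strip()})
print(cleaned[0])
```

Only the answer value changes. The subject, subsubject, and question columns pass through untouched, and the input rows are not modified.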

Example Output

When run, the code might produce output like this:

Generated Subjects:
- Mathematics
- History
- Biology

Generated Subsubjects:
- Mathematics → Calculus
- Mathematics → Linear Algebra
- Mathematics → Number Theory
- History → Ancient Civilizations
- History → World War II
- History → Renaissance Period
- Biology → Genetics
- Biology → Ecology
- Biology → Cell Biology

Sample of Generated Q&A Pairs:
         subject       subsubject                                  question                                             answer
0    Mathematics        Calculus    What is the derivative of the function f(x) = e^x?    The derivative of f(x) = e^x is also e^x. This is a special property of the exponential function.
1    Mathematics        Calculus    What is the fundamental theorem of calculus?    The fundamental theorem of calculus states that differentiation and integration are inverse processes. It connects the concept of the derivative of a function with the concept of the definite integral.
2    Mathematics        Calculus    How do you find the area under a curve?    To find the area under a curve, you can use a definite integral. First, identify the function and the bounds of integration. Then calculate the integral over that interval.
3    Mathematics    Linear Algebra    What are eigenvalues and eigenvectors?    Eigenvalues are special scalars associated with a linear system of equations, while eigenvectors are the corresponding vectors that, when that linear transformation is applied, change only in scale (not direction). For a matrix A, if Av = λv, then λ is an eigenvalue and v is an eigenvector.
...

Customizing the Pipeline

Now that you have the basic pipeline working, here are some ways you can customize it:

# Modify the subject generator to produce more subjects
class CustomSubjectGenerator(SubjectGenerator):
    def prompt(self, input: dict) -> str:
        return "Generate a diverse list of 5 subjects spanning sciences, arts, and humanities."

# Customize the Q&A generator to produce more complex questions
class ComplexQAGenerator(QAGenerator):
    def prompt(self, input: dict) -> str:
        return f"""
        For the given subsubject {input['subsubject']}, generate 3 diverse questions and answers.
        Include at least one factual question, one conceptual question, and one application question.
        Make the questions challenging but clear.
        """

Conclusion

You've now built a complete pipeline for generating diverse, hierarchical question-answer datasets! This approach is similar to how the CAMEL dataset was created, though with a simplified implementation that focuses on educational utility.

This technique is particularly useful for:

- Creating training data for question-answering systems
- Developing educational resources across diverse domains
- Evaluating AI systems on breadth of knowledge
- Generating prompts for further research or content creation

You can extend this framework by adding question types, difficulty levels, or specialized domains based on your specific needs.
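As one possible extension, difficulty can be threaded through the prompt text. The helper below is a hypothetical sketch, not part of Curator, and the three difficulty labels are assumptions chosen for illustration.

```python
# Hypothetical difficulty labels; adjust to your own taxonomy.
DIFFICULTIES = ("introductory", "intermediate", "advanced")


def build_qa_prompt(
    subsubject: str,
    num_questions: int = 3,
    difficulty: str = "intermediate",
) -> str:
    """Build a Q&A generation prompt for one subsubject at a given difficulty."""
    if difficulty not in DIFFICULTIES:
        raise ValueError(f"difficulty must be one of {DIFFICULTIES}")
    if num_questions < 1:
        raise ValueError("num_questions must be at least 1")
    return (
        f"For the given subsubject {subsubject}, generate {num_questions} diverse "
        f"questions and answers at an {difficulty} level. No explanation."
    )


print(build_qa_prompt("Genetics", num_questions=5, difficulty="advanced"))
```

A QAGenerator subclass could return build_qa_prompt(input["subsubject"], difficulty="advanced") from its prompt method, and its parse method could add the difficulty to each output row so the label is stored alongside the question.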

