Generating a diverse QA dataset

This tutorial will guide you through creating a hierarchical dataset of diverse question-answer pairs using a structured approach similar to the CAMEL dataset. We'll build a pipeline that generates subjects, subsubjects, and corresponding Q&A pairs.

Introduction

In many AI training scenarios, it is valuable to have diverse question-answer pairs that span multiple domains. This tutorial demonstrates how to create "ungrounded" Q&A pairs, meaning pairs that are generated directly by a language model rather than extracted from existing texts.

Let's begin!

Step 1: Setting Up the Environment

First, let's set up our environment and import the required libraries.

# Install required packages if not already installed
# !pip install pydantic bespokelabs-curator

# Import necessary libraries
from typing import List
from pydantic import BaseModel, Field
from bespokelabs import curator

import os
# Enables the Curator Viewer; remove this line if you don't want to use it.
# Environment variable values must be strings, so use "1" rather than 1.
os.environ["CURATOR_VIEWER"] = "1"
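As a side note on the environment variable, Python's os.environ only accepts string values. The short sketch below shows what happens when an integer is assigned and how to set the value as a string. It only demonstrates standard-library behavior and makes no assumption about how Curator itself reads the variable.

```python
import os

# Assigning a non-string value to os.environ raises TypeError.
try:
    os.environ["CURATOR_VIEWER"] = 1  # type: ignore[assignment]
except TypeError as exc:
    print(f"Rejected non-string value: {exc}")

# Assigning the string "1" works as expected.
os.environ["CURATOR_VIEWER"] = "1"
viewer_enabled = os.environ.get("CURATOR_VIEWER") == "1"
print(viewer_enabled)
```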

Step 2: Define Data Models

We'll use Pydantic to create structured data models that will help validate our language model outputs.

class Subject(BaseModel):
    """A single subject."""
    subject: str = Field(description="A subject")

class Subjects(BaseModel):
    """A list of subjects."""
    subjects: List[Subject] = Field(description="A list of subjects")

class QA(BaseModel):
    """A question and answer pair."""
    question: str = Field(description="A question")
    answer: str = Field(description="An answer")

class QAs(BaseModel):
    """A list of question and answer pairs."""
    qas: List[QA] = Field(description="A list of QAs")

These models ensure that our data maintains a consistent structure throughout the pipeline. Pydantic validates each field's type when a model is instantiated, and the Field descriptions document what each field is expected to contain.
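To see what this validation looks like in practice, here is a minimal sketch that builds a QAs instance from a plain dictionary and then shows a malformed payload being rejected. The payloads are hypothetical examples written for illustration; they are not real model responses.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class QA(BaseModel):
    """A question and answer pair."""
    question: str = Field(description="A question")
    answer: str = Field(description="An answer")


class QAs(BaseModel):
    """A list of question and answer pairs."""
    qas: List[QA] = Field(description="A list of QAs")


# Hypothetical payload shaped like a structured response.
good_payload = {"qas": [{"question": "What is 2 + 2?", "answer": "4"}]}
parsed = QAs(**good_payload)
print(parsed.qas[0].question)

# A payload that is missing the required "answer" field fails validation.
bad_payload = {"qas": [{"question": "What is 2 + 2?"}]}
try:
    QAs(**bad_payload)
    validation_failed = False
except ValidationError:
    validation_failed = True
print(validation_failed)
```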

Step 3: Create Subject Generator

Now, let's build our first component - a generator for high-level subjects.

class SubjectGenerator(curator.LLM):
    """A subject generator that generates diverse subjects."""
    response_format = Subjects

    def prompt(self, input: dict) -> str:
        """Generate a prompt for the subject generator."""
        return "Generate a diverse list of 3 subjects. Keep it high-level (e.g. Math, Science)."

    def parse(self, input: dict, response: Subjects) -> List[Subject]:
        """Parse the model response into the desired output format."""
        return response.subjects

This generator produces high-level subjects such as "Math," "History," or "Computer Science." The response_format attribute tells Curator which Pydantic model the language model's response should conform to.

Step 4: Create Subsubject Generator

Next, we'll create a generator for subsubjects that takes each subject and generates more specific topics within it.

class SubsubjectGenerator(curator.LLM):
    """A subsubject generator that generates diverse subsubjects for a given subject."""
    response_format = Subjects

    def prompt(self, input: dict) -> str:
        """Generate a prompt for the subsubject generator."""
        return f"For the given subject {input['subject']}. Generate 3 diverse subsubjects. No explanation."

    def parse(self, input: dict, response: Subjects) -> List[dict]:
        """Parse the model response into the desired output format."""
        return [{"subject": input["subject"], "subsubject": subsubject.subject} 
                for subsubject in response.subjects]

For example, if the subject is "Math," this generator might produce subsubjects like "Calculus," "Algebra," and "Statistics."
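The key idea in this step is the fan-out: one input row becomes several output rows, each carrying its parent subject. The sketch below mimics the parse logic in plain Python without calling a model. FakeSubject, FakeSubjects, and parse_subsubjects are hypothetical stand-ins written for this illustration; they are not part of Curator.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeSubject:
    """Stand-in for the Subject model (hypothetical, for illustration)."""
    subject: str


@dataclass
class FakeSubjects:
    """Stand-in for the Subjects model (hypothetical, for illustration)."""
    subjects: List[FakeSubject]


def parse_subsubjects(input: dict, response: FakeSubjects) -> List[Dict[str, str]]:
    # One output row per subsubject, each keeping the parent subject.
    return [
        {"subject": input["subject"], "subsubject": s.subject}
        for s in response.subjects
    ]


row = {"subject": "Math"}
response = FakeSubjects(
    [FakeSubject("Calculus"), FakeSubject("Algebra"), FakeSubject("Statistics")]
)
rows = parse_subsubjects(row, response)
for r in rows:
    print(r)
```

Because every output row repeats the subject, later stages can read both levels of the hierarchy from a single row.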

Step 5: Create QA Generator

Now, we'll create our final generator component that produces question-answer pairs for each subsubject.

class QAGenerator(curator.LLM):
    """A QA generator that generates diverse questions and answers for a given subsubject."""
    response_format = QAs

    def prompt(self, input: dict) -> str:
        """Generate a prompt for the QA generator."""
        return f"For the given subsubject {input['subsubject']}. Generate 3 diverse questions and answers. No explanation."

    def parse(self, input: dict, response: QAs) -> List[dict]:
        """Parse the model response into the desired output format."""
        return [
            {
                "subject": input["subject"],
                "subsubject": input["subsubject"],
                "question": qa.question,
                "answer": qa.answer,
            }
            for qa in response.qas
        ]

This generator takes a subsubject and creates Q&A pairs relevant to that topic. It maintains the hierarchical structure by keeping track of both the subject and subsubject.
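To get a feel for how the hierarchy multiplies, the following sketch flattens three levels into rows using stubbed data instead of model calls. The helper functions fake_subsubjects and fake_qas are hypothetical placeholders, and the counts assume each level returns exactly three items, which a real model may not always do.

```python
from typing import Dict, List, Tuple


def fake_subsubjects(subject: str) -> List[str]:
    # Hypothetical stub: three placeholder subsubjects per subject.
    return [f"{subject} topic {i}" for i in range(1, 4)]


def fake_qas(subsubject: str) -> List[Tuple[str, str]]:
    # Hypothetical stub: three placeholder Q&A pairs per subsubject.
    return [(f"Question {i} about {subsubject}?", f"Answer {i}") for i in range(1, 4)]


subjects = ["Mathematics", "History", "Biology"]

# Flatten the hierarchy so every row records its subject and subsubject.
rows: List[Dict[str, str]] = [
    {"subject": subject, "subsubject": subsubject, "question": q, "answer": a}
    for subject in subjects
    for subsubject in fake_subsubjects(subject)
    for q, a in fake_qas(subsubject)
]

print(len(rows))  # 3 subjects x 3 subsubjects x 3 pairs = 27 rows
print(rows[0])
```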

Step 6: Run the Complete Pipeline

Now let's run our complete pipeline and see the results:

# Step 1: Generate subjects
subject_generator = SubjectGenerator(model_name="gpt-4o-mini")
subject_dataset = subject_generator()

# Step 2: Generate subsubjects for each subject
subsubject_generator = SubsubjectGenerator(model_name="gpt-4o-mini")
subsubject_dataset = subsubject_generator(subject_dataset.dataset)

# Step 3: Generate Q&A pairs for each subsubject
qa_generator = QAGenerator(model_name="gpt-4o-mini")
qa_dataset = qa_generator(subsubject_dataset.dataset)

# Clean up answers by stripping whitespace
qa_dataset = qa_dataset.dataset.map(lambda row: {"answer": row["answer"].strip()}, num_proc=2)

# Print the final dataset
print(qa_dataset.to_pandas())
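The cleanup step returns only the "answer" column from its lambda, yet the other columns survive. Assuming the dataset behaves like a Hugging Face datasets.Dataset, map merges the returned dictionary into each existing row. The pure-Python sketch below imitates that merge on a list of dictionaries; map_rows is a hypothetical helper written for illustration, not a library function.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def map_rows(rows: List[Row], fn: Callable[[Row], Row]) -> List[Row]:
    # Merge the returned columns into a copy of each row; other columns are kept.
    return [{**row, **fn(row)} for row in rows]


rows = [
    {
        "subject": "Mathematics",
        "subsubject": "Calculus",
        "question": "What is the derivative of e^x?",
        "answer": "  e^x \n",
    }
]

cleaned = map_rows(rows, lambda row: {"answer": row["answer"].strip()})
print(cleaned[0])
```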

Example Output

When run, the code might produce output like this:

Generated Subjects:
- Mathematics
- History
- Biology

Generated Subsubjects:
- Mathematics → Calculus
- Mathematics → Linear Algebra
- Mathematics → Number Theory
- History → Ancient Civilizations
- History → World War II
- History → Renaissance Period
- Biology → Genetics
- Biology → Ecology
- Biology → Cell Biology

Sample of Generated Q&A Pairs:
         subject       subsubject                                  question                                             answer
0    Mathematics        Calculus    What is the derivative of the function f(x) = e^x?    The derivative of f(x) = e^x is also e^x. This is a special property of the exponential function.
1    Mathematics        Calculus    What is the fundamental theorem of calculus?    The fundamental theorem of calculus states that differentiation and integration are inverse processes. It connects the concept of the derivative of a function with the concept of the definite integral.
2    Mathematics        Calculus    How do you find the area under a curve?    To find the area under a curve, you can use a definite integral. First, identify the function and the bounds of integration. Then calculate the integral over that interval.
3    Mathematics    Linear Algebra    What are eigenvalues and eigenvectors?    Eigenvalues are special scalars associated with a linear transformation, while eigenvectors are the corresponding nonzero vectors that, when that linear transformation is applied, change only in scale (not direction). For a matrix A, if Av = λv, then λ is an eigenvalue and v is an eigenvector.
...

Customizing the Pipeline

Now that you have the basic pipeline working, here are some ways you can customize it:

# Modify the subject generator to produce more subjects
class CustomSubjectGenerator(SubjectGenerator):
    def prompt(self, input: dict) -> str:
        return "Generate a diverse list of 5 subjects spanning sciences, arts, and humanities."

# Customize the Q&A generator to produce more complex questions
class ComplexQAGenerator(QAGenerator):
    def prompt(self, input: dict) -> str:
        return f"""
        For the given subsubject {input['subsubject']}, generate 3 diverse questions and answers.
        Include at least one factual question, one conceptual question, and one application question.
        Make the questions challenging but clear.
        """

Conclusion

You've now built a complete pipeline for generating diverse, hierarchical question-answer datasets! This approach is similar to how the CAMEL dataset was created, though with a simplified implementation that focuses on educational utility.

This technique is particularly useful for:

  • Creating training data for question-answering systems

  • Developing educational resources across diverse domains

  • Evaluating AI systems on breadth of knowledge

  • Generating prompts for further research or content creation

You can extend this framework by adding question types, difficulty levels, or specialized domains based on your specific needs.
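As one possible way to add difficulty levels and question types, the sketch below builds the QA prompt from parameters. The function build_qa_prompt and its arguments are hypothetical names chosen for this illustration. In a real pipeline you would call something like it from a generator's prompt method.

```python
from typing import Sequence


def build_qa_prompt(
    subsubject: str,
    difficulty: str = "intermediate",
    question_types: Sequence[str] = ("factual", "conceptual", "application"),
    num_questions: int = 3,
) -> str:
    """Build a QA prompt with a configurable difficulty and question mix."""
    types = ", ".join(question_types)
    return (
        f"For the given subsubject {subsubject}, generate {num_questions} diverse "
        f"questions and answers at a {difficulty} level. "
        f"Cover these question types: {types}. No explanation."
    )


prompt = build_qa_prompt("Genetics", difficulty="advanced")
print(prompt)
```

Keeping the prompt text in a small function makes it easy to generate several variants of the dataset by changing only the arguments.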
