Synthetic Data for Function Calling

This step-by-step tutorial will guide you through creating a system that generates customized function calls using different parameters for each row in a dataset. We'll explore how to override default generation parameters at the row level when using language models.

Introduction

In this tutorial, we'll learn how to:

  1. Create a function call generator using Curator

  2. Define different function tools (APIs)

  3. Configure different generation parameters for each row in a dataset

  4. Handle both successful function calls and regular message responses

Let's dive in!

Step 1: Import Required Libraries

First, let's set up our environment and import the necessary libraries:

# pip install bespokelabs-curator 

import json
from typing import Dict

from datasets import Dataset
from bespokelabs import curator

Step 2: Define the Function Call Generator

We'll create a custom LLM class that generates function calls based on user requests:

class FunctionCallGenerator(curator.LLM):
    """A simple function calling generator."""

    return_completions_object = True

    def prompt(self, input: Dict) -> str:
        """The prompt is used to generate the function call."""
        return f"""You are a function calling expert. Given the user request:
        {input['user_request']}.
        Generate a function call that can be used to satisfy the user request.
        """

    def parse(self, input: Dict, response) -> Dict:
        """Parse the response to extract the function call or the message."""
        if "tool_calls" in response["choices"][0]["message"]:
            input["function_call"] = str([tool_call["function"] for tool_call in response["choices"][0]["message"]["tool_calls"]])
        else:
            # Handle the case where the model returns a string instead of a function call
            input["function_call"] = response["choices"][0]["message"]["content"]
        return input

This class does two main things:

  • Generates a prompt asking the model to create a function call based on a user request

  • Parses the response to extract either the function call or regular message
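To see the branching in `parse` concretely, here is a minimal stand-alone sketch that applies the same logic to mocked OpenAI-style completion dicts. The `parse_completion` helper and both response objects are hypothetical examples constructed for illustration, not real API output:

```python
import json

def parse_completion(row, response):
    """Mimic the parse() logic above on a raw chat-completions dict."""
    message = response["choices"][0]["message"]
    if "tool_calls" in message:
        # Collect every function payload from the tool calls
        row["function_call"] = str([tc["function"] for tc in message["tool_calls"]])
    else:
        # Plain-text fallback when the model did not call a tool
        row["function_call"] = message["content"]
    return row

# A mocked tool-call response
tool_response = {
    "choices": [{"message": {"tool_calls": [
        {"function": {"name": "get_weather",
                      "arguments": json.dumps({"location": "New York, USA", "units": "celsius"})}}
    ]}}]
}
print(parse_completion({"user_request": "Weather in NY?"}, tool_response)["function_call"])

# A mocked plain-message response
text_response = {"choices": [{"message": {"content": "Could you tell me the city?"}}]}
print(parse_completion({"user_request": "Weather?"}, text_response)["function_call"])
```

Note that the tool-call branch stringifies a list, so a single response can carry multiple function calls.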

Step 3: Define Function Tools

Now, let's define two function tools that our model can use:

function_docs = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Retrieves current weather for the given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and country e.g. Bogotá, Colombia"},
                    "units": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "Units the temperature will be returned in."},
                },
                "required": ["location", "units"],
                "additionalProperties": False,
            },
            "strict": True,
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_local_time",
            "description": "Get the local time of a given location",
            "strict": True,
            "parameters": {
                "type": "object",
                "required": ["location", "timezone"],
                "properties": {
                    "location": {"type": "string", "description": "The name or coordinates of the location for which to get the local time"},
                    "timezone": {"type": "string", "description": "The timezone of the location, defaults to the location's timezone if not provided"},
                },
                "additionalProperties": False,
            },
        },
    },
]

These function definitions describe:

  • A weather API that requires location and units parameters

  • A local time API that requires location and timezone parameters
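Before handing tool definitions to the model, it can help to sanity-check their structure. The `check_tool` helper below is a hypothetical convenience, not part of Curator; it verifies that every `required` parameter is actually declared under `properties` (shown here on an abbreviated copy of the `get_weather` definition):

```python
def check_tool(tool):
    """Lightweight sanity check for an OpenAI-style tool definition."""
    fn = tool["function"]
    params = fn["parameters"]
    # Every required parameter must be declared in properties
    missing = [p for p in params.get("required", []) if p not in params["properties"]]
    assert tool["type"] == "function" and fn["name"], "malformed tool"
    assert not missing, f"required params not defined: {missing}"
    return fn["name"]

weather_tool = {  # abbreviated copy of the get_weather definition above
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}, "units": {"type": "string"}},
            "required": ["location", "units"],
        },
    },
}
print(check_tool(weather_tool))
```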

Step 4: Create an LLM Instance with Default Parameters

Let's instantiate our function call generator with default parameters:

llm = FunctionCallGenerator(
    model_name="gpt-4o-mini",
    # Default generation_params has both functions
    generation_params={"tools": function_docs},
    backend_params={"max_retries": 1, "require_all_responses": False},
)

This LLM instance has:

  • The "gpt-4o-mini" model

  • Both function tools available by default

  • Configuration for retries and response handling

Step 5: Create a Dataset with Row-Level Parameters

Now, let's create a dataset where each row has its own generation parameters:

dataset = Dataset.from_dict(
    {
        "user_request": ["What's the current temperature in New York?", "What time is it in Tokyo?"],
        # WARNING: generation_params must be stored as a JSON string; otherwise
        # Dataset operations automatically expand the dictionary keys into columns.
        # See https://github.com/bespokelabsai/curator/issues/325 for more detail.
        # The generation_params from the row override the default generation_params during inference.
        "generation_params": [json.dumps({"tools": [function_docs[0]]}), json.dumps({"tools": [function_docs[1]]})],
    }
)

Important notes:

  • The first row only has access to the weather function

  • The second row only has access to the time function

  • The generation_params must be JSON strings to prevent dataset operations from expanding dictionary keys
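The JSON-string round trip can be sketched without the `datasets` library at all. The abbreviated tool dicts below are stand-ins for the `function_docs` entries; the point is that each row's column value is opaque text that deserializes back into a params dict at inference time:

```python
import json

# Stand-ins for the two function_docs entries
weather_tool = {"type": "function", "function": {"name": "get_weather"}}
time_tool = {"type": "function", "function": {"name": "get_local_time"}}

# Stored as JSON strings so the column holds opaque text rather than nested
# dicts (which Dataset.from_dict would otherwise expand into sub-columns)
rows = [
    {"user_request": "What's the current temperature in New York?",
     "generation_params": json.dumps({"tools": [weather_tool]})},
    {"user_request": "What time is it in Tokyo?",
     "generation_params": json.dumps({"tools": [time_tool]})},
]

# At inference time, each string round-trips back into a params dict
for row in rows:
    params = json.loads(row["generation_params"])
    print(row["user_request"], "->", params["tools"][0]["function"]["name"])
```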

Step 6: Run the Generator and Display Results

Let's run our function call generator on the dataset:

function_calls = llm(dataset)
# The model is expected to return a function call for each row
print(function_calls.to_pandas())

This will:

  • Process each row with its specific generation parameters

  • Generate appropriate function calls for each user request

  • Display the results in a pandas DataFrame
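If you need the structured call back out of the DataFrame, note that `parse` stored it with `str(...)`, i.e. as a Python literal. Assuming that format, `ast.literal_eval` recovers the list, and the nested `arguments` field is itself a JSON string (the `function_call` value below is a hypothetical example shaped like the parser's output):

```python
import ast
import json

# A hypothetical stringified function_call, shaped like the parse() output
function_call = str([{"name": "get_weather",
                      "arguments": json.dumps({"location": "New York, USA", "units": "celsius"})}])

# str() produced a Python literal, so ast.literal_eval recovers the list
calls = ast.literal_eval(function_call)
args = json.loads(calls[0]["arguments"])  # arguments is itself a JSON string
print(calls[0]["name"], args["location"])
```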

Practical Applications

This technique is useful for:

  • Processing diverse user requests with specialized tools

  • A/B testing different function configurations

  • Creating targeted function call generators for specific domains

  • Building efficient pipelines that adapt to different input types

Conclusion

You've learned how to create a flexible function call generation system that can adapt to different rows in a dataset. This approach allows for more targeted and efficient use of language models when generating function calls, particularly when different requests require different tools or configurations.

Remember to properly configure both default and row-level parameters, and to handle both function call and regular message responses in your parsing logic.
