Synthetic Data for Function Calling

This step-by-step tutorial will guide you through creating a system that generates customized function calls using different parameters for each row in a dataset. We'll explore how to override default generation parameters at the row level when using language models.

Introduction

In this tutorial, we'll learn how to:

  1. Create a function call generator using Curator

  2. Define different function tools (APIs)

  3. Configure different generation parameters for each row in a dataset

  4. Handle both successful function calls and regular message responses

Let's dive in!

Step 1: Import Required Libraries

First, let's set up our environment and import the necessary libraries:

# pip install bespokelabs-curator 

import json
from typing import Dict

from datasets import Dataset
from bespokelabs import curator

Step 2: Define the Function Call Generator

We'll create a custom LLM class that generates function calls based on user requests:

class FunctionCallGenerator(curator.LLM):
    """A simple function calling generator."""

    return_completions_object = True

    def prompt(self, input: Dict) -> str:
        """The prompt is used to generate the function call."""
        return f"""You are a function calling expert. Given the user request:
        {input['user_request']}.
        Generate a function call that can be used to satisfy the user request.
        """

    def parse(self, input: Dict, response) -> Dict:
        """Parse the response to extract the function call or the message."""
        if "tool_calls" in response["choices"][0]["message"]:
            input["function_call"] = str([tool_call["function"] for tool_call in response["choices"][0]["message"]["tool_calls"]])
        else:
            # Handle the case where the model returns a string instead of a function call
            input["function_call"] = response["choices"][0]["message"]["content"]
        return input

This class does two main things:

  • Generates a prompt asking the model to create a function call based on a user request

  • Parses the response to extract either the function call or regular message
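To see the branching in `parse` concretely, here is a minimal stand-alone sketch that applies the same logic to mocked OpenAI-style completion dicts. The `parse_completion` helper and both response objects are hypothetical examples constructed for illustration, not real API output:

```python
import json

def parse_completion(row, response):
    """Mimic the parse() logic above on a raw chat-completions dict."""
    message = response["choices"][0]["message"]
    if "tool_calls" in message:
        # Collect every function payload from the tool calls
        row["function_call"] = str([tc["function"] for tc in message["tool_calls"]])
    else:
        # Plain-text fallback when the model did not call a tool
        row["function_call"] = message["content"]
    return row

# A mocked tool-call response
tool_response = {
    "choices": [{"message": {"tool_calls": [
        {"function": {"name": "get_weather",
                      "arguments": json.dumps({"location": "New York, USA", "units": "celsius"})}}
    ]}}]
}
print(parse_completion({"user_request": "Weather in NY?"}, tool_response)["function_call"])

# A mocked plain-message response
text_response = {"choices": [{"message": {"content": "Could you tell me the city?"}}]}
print(parse_completion({"user_request": "Weather?"}, text_response)["function_call"])
```

Note that the tool-call branch stringifies a list, so a single response can carry multiple function calls.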

Step 3: Define Function Tools

Now, let's define two function tools that our model can use:

function_docs = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Retrieves current weather for the given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and country e.g. Bogotá, Colombia"},
                    "units": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "Units the temperature will be returned in."},
                },
                "required": ["location", "units"],
                "additionalProperties": False,
            },
            "strict": True,
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_local_time",
            "description": "Get the local time of a given location",
            "strict": True,
            "parameters": {
                "type": "object",
                "required": ["location", "timezone"],
                "properties": {
                    "location": {"type": "string", "description": "The name or coordinates of the location for which to get the local time"},
                    "timezone": {"type": "string", "description": "The timezone of the location, defaults to the location's timezone if not provided"},
                },
                "additionalProperties": False,
            },
        },
    },
]

These function definitions describe:

  • A weather API that requires location and units parameters

  • A local time API that requires location and timezone parameters
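Before handing tool definitions to the model, it can help to sanity-check their structure. The `check_tool` helper below is a hypothetical convenience, not part of Curator; it verifies that every `required` parameter is actually declared under `properties` (shown here on an abbreviated copy of the `get_weather` definition):

```python
def check_tool(tool):
    """Lightweight sanity check for an OpenAI-style tool definition."""
    fn = tool["function"]
    params = fn["parameters"]
    # Every required parameter must be declared in properties
    missing = [p for p in params.get("required", []) if p not in params["properties"]]
    assert tool["type"] == "function" and fn["name"], "malformed tool"
    assert not missing, f"required params not defined: {missing}"
    return fn["name"]

weather_tool = {  # abbreviated copy of the get_weather definition above
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}, "units": {"type": "string"}},
            "required": ["location", "units"],
        },
    },
}
print(check_tool(weather_tool))
```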

Step 4: Create an LLM Instance with Default Parameters

Let's instantiate our function call generator with default parameters:

llm = FunctionCallGenerator(
    model_name="gpt-4o-mini",
    # Default generation_params has both functions
    generation_params={"tools": function_docs},
    backend_params={"max_retries": 1, "require_all_responses": False},
)

This LLM instance has:

  • The "gpt-4o-mini" model

  • Both function tools available by default

  • Configuration for retries and response handling

Step 5: Create a Dataset with Row-Level Parameters

Now, let's create a dataset where each row has its own generation parameters:

dataset = Dataset.from_dict(
    {
        "user_request": ["What's the current temperature in New York?", "What time is it in Tokyo?"],
        # WARNING: generation_params must be stored as a JSON string; otherwise
        # Dataset operations automatically expand the dictionary keys into columns.
        # See https://github.com/bespokelabsai/curator/issues/325 for more detail.
        # The generation_params from the row override the default generation_params during inference.
        "generation_params": [json.dumps({"tools": [function_docs[0]]}), json.dumps({"tools": [function_docs[1]]})],
    }
)

Important notes:

  • The first row only has access to the weather function

  • The second row only has access to the time function

  • The generation_params must be JSON strings to prevent dataset operations from expanding dictionary keys
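The JSON-string round trip can be sketched without the `datasets` library at all. The abbreviated tool dicts below are stand-ins for the `function_docs` entries; the point is that each row's column value is opaque text that deserializes back into a params dict at inference time:

```python
import json

# Stand-ins for the two function_docs entries
weather_tool = {"type": "function", "function": {"name": "get_weather"}}
time_tool = {"type": "function", "function": {"name": "get_local_time"}}

# Stored as JSON strings so the column holds opaque text rather than nested
# dicts (which Dataset.from_dict would otherwise expand into sub-columns)
rows = [
    {"user_request": "What's the current temperature in New York?",
     "generation_params": json.dumps({"tools": [weather_tool]})},
    {"user_request": "What time is it in Tokyo?",
     "generation_params": json.dumps({"tools": [time_tool]})},
]

# At inference time, each string round-trips back into a params dict
for row in rows:
    params = json.loads(row["generation_params"])
    print(row["user_request"], "->", params["tools"][0]["function"]["name"])
```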

Step 6: Run the Generator and Display Results

Let's run our function call generator on the dataset:

function_calls = llm(dataset)
# The model is expected to return a function call for each row
print(function_calls.to_pandas())

This will:

  • Process each row with its specific generation parameters

  • Generate appropriate function calls for each user request

  • Display the results in a pandas DataFrame
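If you need the structured call back out of the DataFrame, note that `parse` stored it with `str(...)`, i.e. as a Python literal. Assuming that format, `ast.literal_eval` recovers the list, and the nested `arguments` field is itself a JSON string (the `function_call` value below is a hypothetical example shaped like the parser's output):

```python
import ast
import json

# A hypothetical stringified function_call, shaped like the parse() output
function_call = str([{"name": "get_weather",
                      "arguments": json.dumps({"location": "New York, USA", "units": "celsius"})}])

# str() produced a Python literal, so ast.literal_eval recovers the list
calls = ast.literal_eval(function_call)
args = json.loads(calls[0]["arguments"])  # arguments is itself a JSON string
print(calls[0]["name"], args["location"])
```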

Practical Applications

This technique is useful for:

  • Processing diverse user requests with specialized tools

  • A/B testing different function configurations

  • Creating targeted function call generators for specific domains

  • Building efficient pipelines that adapt to different input types

Conclusion

You've learned how to create a flexible function call generation system that can adapt to different rows in a dataset. This approach allows for more targeted and efficient use of language models when generating function calls, particularly when different requests require different tools or configurations.

Remember to properly configure both default and row-level parameters, and to handle both function call and regular message responses in your parsing logic.
