Synthetic Data for Function Calling
This step-by-step tutorial will guide you through creating a system that generates customized function calls using different parameters for each row in a dataset. We'll explore how to override default generation parameters at the row level when using language models.
Introduction
In this tutorial, we'll learn how to:
Create a function call generator using Curator
Define different function tools (APIs)
Configure different generation parameters for each row in a dataset
Handle both successful function calls and regular message responses
Let's dive in!
Step 1: Import Required Libraries
First, let's set up our environment and import the necessary libraries:
# pip install bespokelabs-curator
import json
from typing import Dict
from datasets import Dataset
from bespokelabs import curator
Step 2: Define the Function Call Generator
We'll create a custom LLM class that generates function calls based on user requests:
class FunctionCallGenerator(curator.LLM):
    """A simple function calling generator."""

    return_completions_object = True

    def prompt(self, input: Dict) -> str:
        """The prompt is used to generate the function call."""
        return f"""You are a function calling expert. Given the user request:
{input['user_request']}.
Generate a function call that can be used to satisfy the user request.
"""

    def parse(self, input: Dict, response) -> Dict:
        """Parse the response to extract the function call or the message."""
        if "tool_calls" in response["choices"][0]["message"]:
            input["function_call"] = str([tool_call["function"] for tool_call in response["choices"][0]["message"]["tool_calls"]])
        else:
            # Handle the case where the model returns a string instead of a function call
            input["function_call"] = response["choices"][0]["message"]["content"]
        return input
This class does two main things:
Generates a prompt asking the model to create a function call based on a user request
Parses the response to extract either the function call or regular message
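The branching in parse can be exercised in isolation. Below is a minimal sketch using a hand-built response dict shaped like the OpenAI chat-completions object; the response contents are hypothetical illustration data, not real model output:

```python
# Hypothetical response dict shaped like the chat-completions object
# (illustration only, not real model output).
response_with_tool_call = {
    "choices": [{
        "message": {
            "tool_calls": [{
                "function": {
                    "name": "get_weather",
                    "arguments": '{"location": "New York, USA", "units": "celsius"}',
                }
            }]
        }
    }]
}

# Same extraction logic as parse(): prefer tool_calls, fall back to content.
message = response_with_tool_call["choices"][0]["message"]
if "tool_calls" in message:
    extracted = str([tool_call["function"] for tool_call in message["tool_calls"]])
else:
    extracted = message["content"]

print(extracted)
```

The fallback branch matters because a model is not guaranteed to emit a tool call for every request; it may answer with plain text instead.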
Step 3: Define Function Tools
Now, let's define two function tools that our model can use:
function_docs = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Retrieves current weather for the given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and country e.g. Bogotá, Colombia"},
                    "units": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "Units the temperature will be returned in."},
                },
                "required": ["location", "units"],
                "additionalProperties": False,
            },
            "strict": True,
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_local_time",
            "description": "Get the local time of a given location",
            "strict": True,
            "parameters": {
                "type": "object",
                "required": ["location", "timezone"],
                "properties": {
                    "location": {"type": "string", "description": "The name or coordinates of the location for which to get the local time"},
                    "timezone": {"type": "string", "description": "The timezone of the location, defaults to the location's timezone if not provided"},
                },
                "additionalProperties": False,
            },
        },
    },
]
These function definitions describe:
A weather API that requires location and units parameters
A local time API that requires location and timezone parameters
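Because the model returns arguments as a JSON string, a quick sanity check is to confirm a payload supplies every field the schema marks as required. The args_json value below is a hypothetical example for illustration, not real model output:

```python
import json

# Hypothetical arguments string, as a model might return for get_weather
# (illustration only, not real model output).
args_json = '{"location": "Bogotá, Colombia", "units": "celsius"}'
args = json.loads(args_json)

# The required fields come straight from the get_weather schema above.
required = ["location", "units"]
missing = [field for field in required if field not in args]
assert not missing, f"missing required arguments: {missing}"
print(args["location"])
```

With "strict": True the API constrains the model to the schema, but validating decoded arguments before acting on them is still a cheap safeguard.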
Step 4: Create an LLM Instance with Default Parameters
Let's instantiate our function call generator with default parameters:
llm = FunctionCallGenerator(
    model_name="gpt-4o-mini",
    # Default generation_params has both functions
    generation_params={"tools": function_docs},
    backend_params={"max_retries": 1, "require_all_responses": False},
)
This LLM instance has:
The "gpt-4o-mini" model
Both function tools available by default
Configuration for retries and response handling
Step 5: Create a Dataset with Row-Level Parameters
Now, let's create a dataset where each row has its own generation parameters:
dataset = Dataset.from_dict(
    {
        "user_request": ["What's the current temperature in New York?", "What time is it in Tokyo?"],
        # WARNING: generation_params must be a JSON string here; otherwise Dataset
        # operations automatically expand dictionary keys into separate columns.
        # See https://github.com/bespokelabsai/curator/issues/325 for more detail.
        # The generation_params from the row override the default generation_params during inference.
        "generation_params": [json.dumps({"tools": [function_docs[0]]}), json.dumps({"tools": [function_docs[1]]})],
    }
)
Important notes:
The first row only has access to the weather function
The second row only has access to the time function
The generation_params values must be JSON strings to prevent dataset operations from expanding dictionary keys into separate columns
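A sketch of the round-trip these rows rely on: each row's parameters are serialized with json.dumps so they travel as a single string column, then decoded back to a dict when the row is processed. Here weather_tool is an abbreviated stand-in for function_docs[0], used only to keep the example self-contained:

```python
import json

# Abbreviated stand-in for function_docs[0] (illustration only).
weather_tool = {"type": "function", "function": {"name": "get_weather"}}

# Encode per-row parameters as a JSON string, so dataset operations
# treat them as one opaque value instead of expanding the dict's keys...
row_params = json.dumps({"tools": [weather_tool]})
assert isinstance(row_params, str)

# ...and decode them back to a dict at inference time.
decoded = json.loads(row_params)
print(decoded["tools"][0]["function"]["name"])
```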
Step 6: Run the Generator and Display Results
Let's run our function call generator on the dataset:
function_calls = llm(dataset)
# The model is expected to return a function call for each row
print(function_calls.dataset.to_pandas())
This will:
Process each row with its specific generation parameters
Generate appropriate function calls for each user request
Display the results in a pandas DataFrame
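Note that parse() stores the tool calls with str(), so the function_call column holds a Python literal rather than JSON. A minimal sketch of recovering structured data from such a value, using a hypothetical example string rather than real model output:

```python
import ast
import json

# A value shaped like what parse() writes into the function_call column:
# str() on a list of dicts yields a Python literal, not JSON, so use
# ast.literal_eval (not json.loads) to recover the structure.
function_call_str = (
    "[{'name': 'get_weather', "
    "'arguments': '{\"location\": \"New York, USA\", \"units\": \"celsius\"}'}]"
)

calls = ast.literal_eval(function_call_str)
# The nested arguments field is itself a JSON string.
args = json.loads(calls[0]["arguments"])
print(calls[0]["name"], args["location"])
```

Alternatively, you could have parse() store json.dumps of the tool calls instead, which makes the column directly loadable with json.loads downstream.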
Practical Applications
This technique is useful for:
Processing diverse user requests with specialized tools
A/B testing different function configurations
Creating targeted function call generators for specific domains
Building efficient pipelines that adapt to different input types
Conclusion
You've learned how to create a flexible function call generation system that can adapt to different rows in a dataset. This approach allows for more targeted and efficient use of language models when generating function calls, particularly when different requests require different tools or configurations.
Remember to properly configure both default and row-level parameters, and to handle both function call and regular message responses in your parsing logic.