In this notebook, we demonstrate how to use Curator to distill capabilities from a large language model into a much smaller 3B-parameter model.
We will use a Yelp restaurant reviews dataset to train an aspect-based sentiment analysis model: we generate a synthetic dataset with Curator, then fine-tune a model using Together's fine-tuning API.
!pip install bespokelabs-curator datasets together
Imports
from bespokelabs import curator
from datasets import load_dataset
from together import Together
import os
import json
import getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
os.environ["TOGETHER_API_KEY"] = getpass.getpass("Enter your Together API key: ")
# We use the Curator viewer to quickly visualize the data.
# You can comment this out if you don't want to use it.
os.environ['HOSTED_CURATOR_VIEWER']='1'
Dataset Curation
The data curation process is simple: we prompt the model to analyze each review and output a sentiment label for each aspect.
Note that we are not using structured outputs here, since the same prompt/curator block will also be used to evaluate the base model that we fine-tune below (Llama-3.2-3B-Instruct), which doesn't support structured outputs.
PROMPT ="""You are a sentiment analysis expert specializing in restaurant reviews. You need to analyze the sentiment of the given restaurant review.
Analyze the review for the following specific aspects:
1. Food: Quality, taste, presentation, menu variety, etc.
2. Service: Staff behavior, responsiveness, professionalism, etc.
3. Ambience: Atmosphere, decor, comfort, noise level, etc.
4. Price: Value for money, affordability, etc.
5. Overall: General impression of the restaurant experience
For each aspect, classify the sentiment as exactly one of the following:
- Positive: The review expresses satisfaction or praise
- Negative: The review expresses dissatisfaction or criticism
- Neutral: The review is balanced or doesn't mention the aspect
If an aspect is not mentioned in the review, classify it as Neutral.
Output the sentiment for each aspect in the following format:
```json
{{
"food_sentiment": "Positive",
"service_sentiment": "Negative",
"ambience_sentiment": "Neutral",
"price_sentiment": "Positive",
"overall_sentiment": "Negative"
}}```
"""
class AspectBasedSentimentCurator(curator.LLM):
    def prompt(self, input: dict) -> list:
        # Build the chat messages for a single review.
        return [
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": f"The review is: {input['text']}"},
        ]

    def parse(self, input: dict, raw_response: str) -> dict:
        # Extract the JSON block from the model's response; fall back to an
        # empty dict if the response is missing or malformed.
        try:
            response = json.loads(raw_response.split("```json")[1].split("```")[0])
        except (IndexError, json.JSONDecodeError):
            response = {}
        return {
            **input,
            "food_sentiment": response.get("food_sentiment", "None"),
            "service_sentiment": response.get("service_sentiment", "None"),
            "ambience_sentiment": response.get("ambience_sentiment", "None"),
            "price_sentiment": response.get("price_sentiment", "None"),
            "overall_sentiment": response.get("overall_sentiment", "None"),
        }
We will run this curator on the Yelp restaurant reviews dataset to generate aspect-based sentiment annotations for each review.
source_dataset = load_dataset("bespokelabs/yelp_restaurant_reviews", split="train")
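Next we annotate the reviews with the teacher model. Below is a minimal sketch of this step: GPT-4o as the teacher, the *_gt column naming, and the 80/20 split are assumptions that the rest of the notebook relies on.

ASPECTS = ["food_sentiment", "service_sentiment", "ambience_sentiment",
           "price_sentiment", "overall_sentiment"]

# Annotate every review with the teacher model (GPT-4o is an assumption here).
# The curator call returns a Hugging Face Dataset with the parsed labels.
annotated_dataset = AspectBasedSentimentCurator(model_name="gpt-4o")(source_dataset)

# Treat the teacher's labels as ground truth: rename them to *_gt columns so
# that student predictions (written to the un-suffixed columns) can be
# compared against them later.
for aspect in ASPECTS:
    annotated_dataset = annotated_dataset.rename_column(aspect, f"{aspect}_gt")

# Hold out a test split for evaluating the base and fine-tuned models.
splits = annotated_dataset.train_test_split(test_size=0.2, seed=42)
train_dataset, test_dataset = splits["train"], splits["test"]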
# We can easily visualize the data using the Curator viewer.
from bespokelabs.curator.utils import push_to_viewer
url = push_to_viewer(source_dataset)
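Before fine-tuning, we measure how well the base model does on the held-out test set. Below is a sketch of the evaluation helper and the base-model run; the together_ai model identifier is an assumption, so check Together's model list for the exact name.

def evaluate_sentiment(predictions):
    # Compare each predicted aspect label against its *_gt counterpart and
    # report per-aspect accuracy plus a micro-averaged overall accuracy.
    aspect_accuracies = {}
    total_correct = 0
    for aspect in ASPECTS:
        correct = sum(row[aspect] == row[f"{aspect}_gt"] for row in predictions)
        aspect_accuracies[aspect] = correct / len(predictions)
        total_correct += correct
    return {
        "overall_accuracy": total_correct / (len(predictions) * len(ASPECTS)),
        "aspect_accuracies": aspect_accuracies,
    }

# Run the (not yet fine-tuned) Llama-3.2-3B-Instruct on the test set.
# The model identifier below is an assumption; verify it on Together.
llama_base_output = AspectBasedSentimentCurator(
    "together_ai/meta-llama/Llama-3.2-3B-Instruct-Turbo",
    generation_params={"temperature": 0.0},
)(test_dataset)

llama_base_eval = evaluate_sentiment(llama_base_output)
print(json.dumps(llama_base_eval, indent=4))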
Above, we can see that the base model's overall accuracy is 82.8%, and some aspect accuracies (ambience and price in particular) are not very good.
Thus we will use the curated dataset to fine-tune a 3B-parameter model. You can also explore the curated dataset in the viewer if you wish to analyze it further.
Formatting the dataset for fine-tuning
def _format_response(data_point):
return f"""
```json
{{
"food_sentiment": "{data_point['food_sentiment_gt']}",
"service_sentiment": "{data_point['service_sentiment_gt']}",
"ambience_sentiment": "{data_point['ambience_sentiment_gt']}",
"price_sentiment": "{data_point['price_sentiment_gt']}",
"overall_sentiment": "{data_point['overall_sentiment_gt']}"
}}
```
"""
finetuning_dataset = []
for data_point in train_dataset:
finetuning_dataset.append({
"messages": [
{"role": "system", "content": PROMPT},
{"role": "user", "content": f"The review is: {data_point['text']}"},
{"role": "assistant", "content": _format_response(data_point)}
],
})
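As a quick sanity check, inspect the first training example before uploading:

print(json.dumps(finetuning_dataset[0], indent=2))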
# Write the dataset to a local JSONL file, then upload it to Together.
with open("finetuning_dataset.jsonl", "w") as f:
for data_point in finetuning_dataset:
f.write(json.dumps(data_point) + "\n")
# Upload the file to Together
client = Together()
file = client.files.upload(file="finetuning_dataset.jsonl")
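With the file uploaded, we can launch the fine-tuning job. The base-model identifier and hyperparameters below are illustrative assumptions; consult Together's fine-tuning docs for the exact model name.

# Launch the fine-tuning job (model name and hyperparameters are assumptions).
ft_job = client.fine_tuning.create(
    training_file=file.id,
    model="meta-llama/Llama-3.2-3B-Instruct",
    n_epochs=3,
    suffix="aspect-based-sentiment-analysis",
)
print(ft_job.id)  # use this job ID to monitor progress below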
# Wait until the job completes. You can monitor its progress with the Together CLI:
!together fine-tuning list-events ft-1e714d5e-71b7 # paste your job ID here
+----+---------------------------------------------------+---------------------------------------------+----------------------------+--------+
| | Message | Type | Created At | Hash |
+====+===================================================+=============================================+============================+========+
| 0 | Fine tune request created | FinetuneEventType.JOB_PENDING | 2025-03-22 00:27:51.988000 | |
+----+---------------------------------------------------+---------------------------------------------+----------------------------+--------+
| 1 | Job started at Sat Mar 22 00:28:31 UTC 2025 | FinetuneEventType.JOB_START | 2025-03-22 00:28:31 | |
+----+---------------------------------------------------+---------------------------------------------+----------------------------+--------+
| 2 | Model data downloaded for togethercomputer/Meta- | FinetuneEventType.MODEL_DOWNLOAD_COMPLETE | 2025-03-22 00:28:34 | |
| | Llama-3.2-3B-Instruct-Reference__TOG__FT at Sat | | | |
| | Mar 22 00:28:33 UTC 2025 | | | |
+----+---------------------------------------------------+---------------------------------------------+----------------------------+--------+
| 3 | Data downloaded for togethercomputer/Meta- | FinetuneEventType.TRAINING_DATA_DOWNLOADING | 2025-03-22 00:28:39 | |
| | Llama-3.2-3B-Instruct-Reference__TOG__FT at | | | |
| | $2025-03-22T00:28:39.643415 | | | |
+----+---------------------------------------------------+---------------------------------------------+----------------------------+--------+
| 4 | Training started for model togethercomputer/Meta- | FinetuneEventType.TRAINING_START | 2025-03-22 00:29:08 | |
| | Llama-3.2-3B-Instruct-Reference__TOG__FT | | | |
+----+---------------------------------------------------+---------------------------------------------+----------------------------+--------+
| 5 | Epoch completed, at step 6 | FinetuneEventType.EPOCH_COMPLETE | 2025-03-22 00:29:30 | |
+----+---------------------------------------------------+---------------------------------------------+----------------------------+--------+
| 6 | Epoch completed, at step 12 | FinetuneEventType.EPOCH_COMPLETE | 2025-03-22 00:29:48 | |
+----+---------------------------------------------------+---------------------------------------------+----------------------------+--------+
| 7 | Epoch completed, at step 18 | FinetuneEventType.EPOCH_COMPLETE | 2025-03-22 00:30:13 | |
+----+---------------------------------------------------+---------------------------------------------+----------------------------+--------+
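You can also poll the job from Python. This is a sketch that assumes the `ft_job` object from the create call above and the `output_name` field of the together SDK's fine-tuning response:

# Retrieve the job and print its status and the resulting model name.
job = client.fine_tuning.retrieve(ft_job.id)
print(job.status)       # e.g. "completed" once training finishes
print(job.output_name)  # model ID to pass to the curator for inference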
# Run the fine-tuned Llama model on the test dataset
llama_ft_output = AspectBasedSentimentCurator(
    # Replace with the model ID of your fine-tuned model.
    # You can find it here: https://api.together.xyz/models
    "together_ai/username/Llama-3.2-3B-Instruct--aspect-based-sentiment-analysis-abcdef",
    generation_params={
        "temperature": 0.0,
    },
    backend_params={
        "max_tokens_per_minute": 100000000,
    },
)(test_dataset)
llama_ft_eval = evaluate_sentiment(llama_ft_output)
print(json.dumps(llama_ft_eval, indent=4))
Comparing Results
import pandas as pd
from IPython.display import display
base_model_results = llama_base_eval
fine_tuned_results = llama_ft_eval
# Create a comparison table
comparison_data = {
"Metric": ["Overall Accuracy"] + [f"{k.replace('_', ' ').title()}" for k in base_model_results["aspect_accuracies"].keys()],
"Base Model": [base_model_results["overall_accuracy"]] + list(base_model_results["aspect_accuracies"].values()),
"Fine-tuned Model": [fine_tuned_results["overall_accuracy"]] + list(fine_tuned_results["aspect_accuracies"].values()),
}
# Create and display the DataFrame
comparison_df = pd.DataFrame(comparison_data)
display(comparison_df.style.format({
"Base Model": "{:.3f}",
"Fine-tuned Model": "{:.3f}",
}).set_caption("Model Performance Comparison"))
Model Performance Comparison

   Metric               Base Model  Fine-tuned Model
0  Overall Accuracy          0.828             0.844
1  Food Sentiment            0.870             0.870
2  Service Sentiment         0.930             0.940
3  Ambience Sentiment        0.650             0.680
4  Price Sentiment           0.770             0.800
5  Overall Sentiment         0.920             0.930
Conclusion
We can see that the fine-tuned model has higher overall accuracy as well as better per-aspect accuracies. It is also 16x cheaper to run than the GPT-4o teacher! As next steps, we could rerun with a larger dataset and better hyperparameter settings to close the remaining gap to GPT-4o.