Automatic recovery and caching
Curator automatically caches the output generated by the LLM class. This is very useful for:
- Recovering from failures and interruptions: During large data generation runs, you can run into unexpected failures or interruptions. Caching partially completed responses from a data generation run allows you to resume from the latest completed output instead of starting from scratch when you restart your run.
- Reusing previously completed runs: When working with multi-stage pipelines, you might want to reuse earlier stages in the pipeline while iterating on the later stages. Caching completed runs from the earlier stages allows you to iterate quickly while saving time and money.
To see caching in action, try running the Hello World example below twice. The second run should reuse the cached responses from the first run instead of making an LLM call.
from bespokelabs import curator

# Responses are cached under ~/.cache/curator by default, so a second
# run of this script reuses the first run's output instead of calling the API.
llm = curator.LLM(model_name="gpt-4o-mini")
poem = llm("Write a poem about the importance of data in AI.")
print(poem.dataset.to_pandas())
Disable caching
To disable caching, set CURATOR_DISABLE_CACHE=1 before generating your data.
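Because the variable must be set before Curator generates any data, one way to do this from within a script is to set it in the process environment before the first generation call. A minimal sketch:

```python
import os

# Disable Curator's response caching for this process.
# This must run before any generation call for the setting to take effect.
os.environ["CURATOR_DISABLE_CACHE"] = "1"
```

Alternatively, set it in the shell when launching your script, e.g. `CURATOR_DISABLE_CACHE=1 python my_script.py` (where `my_script.py` is whatever script runs your generation).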
Custom cache directory
By default, all cached datasets are saved to ~/.cache/curator, but you can change this in two ways:
- Setting the CURATOR_CACHE_DIR environment variable to the desired directory
- Passing the desired directory to the working_dir parameter when applying the LLM object on a dataset, e.g. llm("Write a poem about the importance of data in AI.", working_dir="/path/to/my/poems")
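As a sketch, both options look like this (the paths below are illustrative, not defaults):

```python
import os

# Option 1: redirect the whole cache via the CURATOR_CACHE_DIR
# environment variable; set it before generating any data.
os.environ["CURATOR_CACHE_DIR"] = os.path.expanduser("~/my_curator_cache")

# Option 2: per-call override via the working_dir parameter,
# as in the docs example above:
# llm("Write a poem about the importance of data in AI.",
#     working_dir="/path/to/my/poems")
```

Option 1 moves every run's cache; option 2 scopes the cache to a single call, which is handy when you want different pipelines to keep separate caches.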
Cache Internals (subject to future changes)
The cache directory contains the following:
- metadata.db: a SQLite database containing metadata about data generation runs
- Cache directories of individual data generation runs: each directory is named after the fingerprint of its data generation run
>> ls ~/.cache/curator
032bc5ead2892f8f 6d1f31229726231d
137851647e75f9a7 91f4ad23d5821c9f
24b1d8917f7ef6f1 a2a3c8e5a58e3fc3
metadata.db
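Since metadata.db is an ordinary SQLite file, you can peek at it with the standard library. A minimal read-only sketch (the schema is internal and subject to change, so don't build tooling on it):

```python
import os
import sqlite3

def list_metadata_tables(cache_dir="~/.cache/curator"):
    """Return the table names in metadata.db, or [] if no cache exists yet.

    The database schema is a Curator internal and may change between
    releases; this helper only lists table names for exploration.
    """
    db_path = os.path.join(os.path.expanduser(cache_dir), "metadata.db")
    if not os.path.exists(db_path):
        return []
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"
        ).fetchall()
    return [name for (name,) in rows]

print(list_metadata_tables())
```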
The fingerprint of a data generation run is based on the following:
- The input dataset on which the LLM object is being applied
- The prompt function of the LLM object
- Whether or not the data generation is using batch mode
- The response format of the LLM object
- The model name defined in the LLM object
- Generation parameters, e.g. temperature, top_k, etc.
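To build intuition for why changing any of these components triggers a fresh run, here is an illustrative fingerprint (a sketch, not Curator's actual implementation): hash a canonical serialization of the components, so any change produces a different directory name.

```python
import hashlib
import json

def run_fingerprint(components):
    """Illustrative only: Curator's real fingerprinting is internal.

    Hash a stable (sorted-key) serialization of everything that defines
    a data generation run; any changed component changes the digest.
    """
    blob = json.dumps(components, sort_keys=True, default=str)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]

run = {
    "input_dataset": None,                  # dataset the LLM is applied to
    "prompt_func": "def prompt(row): ...",  # source of the prompt function
    "batch_mode": False,
    "response_format": None,
    "model_name": "gpt-4o-mini",
    "generation_params": {"temperature": 1.0},
}
print(run_fingerprint(run))  # a short hex name, like the directories above
```

Tweaking any field, e.g. the model name or temperature, yields a new fingerprint, which is why such changes start a new cache directory instead of reusing an old one.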
Troubleshooting
Corrupt or full working directory
The cache directory can grow too large or become corrupt due to unexpected errors. You can recover from these failures by deleting the cache directory: rm -rf ~/.cache/curator. Note that this will delete *all* cached responses.