Automatic recovery and caching

Curator automatically caches the responses generated by the LLM class. This is useful in two situations:

Recovering from failures and interruptions: Large data generation runs can hit unexpected failures or interruptions. Because partially completed responses are cached, a restarted run resumes from the last completed output instead of starting from scratch.
Caching previously completed runs: When working with multi-stage pipelines, you might want to reuse earlier stages while iterating on later ones. Cached results from the earlier stages let you iterate quickly while saving time and money.

To see caching in action, try running the Hello World example below twice. The second run should reuse the cached responses from the first run instead of making an LLM call.

from bespokelabs import curator
llm = curator.LLM(model_name="gpt-4o-mini")
poem = llm("Write a poem about the importance of data in AI.")
print(poem.dataset.to_pandas())

Disable caching

To disable caching, set the CURATOR_DISABLE_CACHE environment variable to 1 before generating your data.
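If you would rather control this from Python than from the shell, one option is to set the variable at the top of your script. This is a minimal sketch that assumes Curator reads CURATOR_DISABLE_CACHE from the process environment; setting it before importing the library is the conservative choice.

```python
import os

# Set the variable before importing curator so it is already present
# whenever the library reads it (assumption: it is read from os.environ).
os.environ["CURATOR_DISABLE_CACHE"] = "1"

# The Hello World example then runs as usual:
# from bespokelabs import curator
# llm = curator.LLM(model_name="gpt-4o-mini")
# poem = llm("Write a poem about the importance of data in AI.")
```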

Custom cache directory

By default, all cached datasets are saved to ~/.cache/curator, but you can change it in two ways:

Setting the CURATOR_CACHE_DIR environment variable to the desired directory.
Passing the desired directory to the working_dir parameter when applying the LLM object to a dataset, e.g. llm("Write a poem about the importance of data in AI.", working_dir="/path/to/my/poems").
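The first option can also be applied from inside a script. The sketch below is illustrative: the directory name curator_cache is a hypothetical choice, and it assumes Curator reads CURATOR_CACHE_DIR from the process environment, so the variable is set before the library is imported.

```python
import os
from pathlib import Path

# Hypothetical location chosen for illustration; any writable path works.
cache_dir = Path.cwd() / "curator_cache"

# Creating the directory up front is harmless and avoids relying on
# the library to create it.
cache_dir.mkdir(parents=True, exist_ok=True)

# Assumption: Curator picks this up from os.environ.
os.environ["CURATOR_CACHE_DIR"] = str(cache_dir)

# from bespokelabs import curator
# llm = curator.LLM(model_name="gpt-4o-mini")
```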

Cache internals (subject to future changes)

The cache directory contains the following:

metadata.db: a SQLite database containing metadata about data generation runs
cache directories of individual data generation runs: each directory is named after the fingerprint of the data generation run.

>> ls ~/.cache/curator 
032bc5ead2892f8f        6d1f31229726231d
137851647e75f9a7        91f4ad23d5821c9f
24b1d8917f7ef6f1        a2a3c8e5a58e3fc3
metadata.db
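To inspect this layout programmatically, a small helper along the following lines may help. It is a sketch, not part of Curator: summarize_cache is a hypothetical name, and it only lists subdirectories and queries SQLite's built-in sqlite_master catalog, so it makes no assumption about the tables Curator actually stores in metadata.db.

```python
import sqlite3
from pathlib import Path


def summarize_cache(cache_dir):
    """Return (run directory names, table names found in metadata.db)."""
    cache_dir = Path(cache_dir).expanduser()

    # Each subdirectory corresponds to one data generation run fingerprint.
    run_dirs = sorted(p.name for p in cache_dir.iterdir() if p.is_dir())

    tables = []
    db_path = cache_dir / "metadata.db"
    if db_path.exists():
        conn = sqlite3.connect(str(db_path))
        try:
            # sqlite_master is SQLite's own schema catalog.
            rows = conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            ).fetchall()
            tables = [name for (name,) in rows]
        finally:
            conn.close()
    return run_dirs, tables
```

Calling summarize_cache("~/.cache/curator") would return the fingerprint directory names alongside whatever tables the metadata database currently contains.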

The fingerprint of a data generation run is based on the following:

The input dataset on which the LLM object is being applied.
The prompt function of the LLM object.
Whether the data generation run uses batch mode.
The response format of the LLM object.
The model name defined in the LLM object.
Generation parameters, e.g. temperature, top_k, etc.
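The practical consequence of the list above is that changing any one of these inputs produces a new fingerprint, and therefore a fresh cache directory. The sketch below illustrates that idea only; it is not Curator's actual fingerprinting code. The function run_fingerprint, its argument names, the use of SHA-256, and the 16-character truncation are all assumptions made for illustration.

```python
import hashlib
import json


def run_fingerprint(dataset_rows, prompt_source, batch, response_format,
                    model_name, generation_params):
    """Illustrative only: NOT Curator's real fingerprint algorithm."""
    payload = json.dumps(
        {
            "dataset": dataset_rows,
            "prompt": prompt_source,
            "batch": batch,
            "response_format": response_format,
            "model_name": model_name,
            "generation_params": generation_params,
        },
        sort_keys=True,  # stable key order so equal inputs hash equally
    )
    # Truncated to 16 hex characters only to resemble the directory names above.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


base = dict(
    dataset_rows=[{"topic": "data"}],
    prompt_source="Write a poem about {topic}.",
    batch=False,
    response_format=None,
    model_name="gpt-4o-mini",
    generation_params={"temperature": 0.7},
)
same = run_fingerprint(**base)
changed = run_fingerprint(**{**base, "generation_params": {"temperature": 0.2}})

print(same == run_fingerprint(**base))  # True: identical inputs, identical fingerprint
print(same == changed)                  # False: a changed parameter gives a new fingerprint
```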

Troubleshooting

Corrupt or full working directory

The cache directory can get too large or become corrupt due to unexpected errors. You can recover from these types of failures by deleting the cache directory: rm -rf ~/.cache/curator. Note that this will delete *all* cached responses.
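Before deleting anything, it can be useful to check how much space the cache occupies. The helpers below are a sketch using only the Python standard library; cache_size_bytes and clear_cache are hypothetical names, not Curator functions, and clear_cache has the same effect as the rm -rf command above.

```python
import shutil
from pathlib import Path


def cache_size_bytes(cache_dir):
    """Total size of all regular files under cache_dir (0 if it is missing)."""
    root = Path(cache_dir).expanduser()
    if not root.exists():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def clear_cache(cache_dir):
    """Delete the whole cache directory and return the bytes it occupied."""
    root = Path(cache_dir).expanduser()
    freed = cache_size_bytes(root)
    # ignore_errors=True mirrors `rm -rf`: a missing directory is not an error.
    shutil.rmtree(root, ignore_errors=True)
    return freed
```

For example, cache_size_bytes("~/.cache/curator") reports the current size, and clear_cache("~/.cache/curator") removes every cached response.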

