Automatic recovery and caching
Curator automatically caches the output generated by the LLM class. This is useful for:
Recovering from failures and interruptions: During large data generation runs, you can run into unexpected failures or interruptions. Caching partially completed responses from a data generation run allows you to resume from the latest completed output instead of starting from scratch when you restart your run.
Caching previously completed runs: When working with multi-stage pipelines, you might want to reuse earlier stages in the pipeline while iterating on the later stages. Caching completed runs from the earlier stages lets you iterate quickly while saving time and money.
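The recovery behaviour described above can be illustrated with a minimal sketch. This is not Curator's implementation: `generate_with_cache`, the JSONL file layout, and `fake_model` are hypothetical names chosen for illustration. The sketch only shows the general idea of persisting each completed response so that a restarted run skips work that has already been done.

```python
import json
import os
import tempfile


def generate_with_cache(prompts, call_model, cache_path):
    """Return one response per prompt, reusing responses already on disk."""
    cached = {}
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            for line in f:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip a line truncated by an interrupted write
                cached[record["prompt"]] = record["response"]

    with open(cache_path, "a", encoding="utf-8") as f:
        for prompt in prompts:
            if prompt in cached:
                continue  # completed in an earlier run, no model call needed
            response = call_model(prompt)
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
            f.flush()  # write each response out so an interruption loses little work
            cached[prompt] = response
    return [cached[p] for p in prompts]


# Stand-in for an LLM call that records how often it is invoked.
calls = []


def fake_model(prompt):
    calls.append(prompt)
    return prompt.upper()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "responses.jsonl")
    first = generate_with_cache(["a", "b"], fake_model, path)
    # The second run only calls the model for the new prompt "c".
    second = generate_with_cache(["a", "b", "c"], fake_model, path)
```

In this sketch the second run makes a single model call, because the responses for the first two prompts are read back from disk.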
To see caching in action, try running the Hello World example below twice. The second run should reuse the cached responses from the first run instead of making an LLM call.
Disable caching
To disable caching, set the environment variable CURATOR_DISABLE_CACHE=1 before generating your data.
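If you prefer to set the variable from inside a Python script rather than in the shell, something like the following should work. Setting it before importing Curator is a conservative choice made for this sketch. The page does not say when the library reads the variable.

```python
import os

# Set the variable before any data generation starts. Doing it before the
# Curator import is an assumption made here to be safe, not a documented rule.
os.environ["CURATOR_DISABLE_CACHE"] = "1"

# from bespokelabs import curator  # then create and call the LLM object as usual
```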
Custom cache directory
By default, all cached datasets are saved to ~/.cache/curator, but you can change this in two ways:
Setting the CURATOR_CACHE_DIR environment variable to the desired directory.
Passing the desired directory to the working_dir parameter when applying the LLM object to a dataset, e.g. llm("Write a poem about the importance of data in AI.", working_dir="/path/to/my/poems").
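The three possible sources of the cache location can be summarised in a small helper. `resolve_cache_dir` is a hypothetical function, not part of Curator, and the precedence it encodes (explicit `working_dir`, then `CURATOR_CACHE_DIR`, then the default) is an assumption that this page does not confirm.

```python
import os

DEFAULT_CACHE_DIR = "~/.cache/curator"  # default location stated in the text


def resolve_cache_dir(working_dir=None, env=None):
    """Hypothetical helper: choose a cache directory from the three sources."""
    env = os.environ if env is None else env
    # Assumed precedence: explicit argument, then environment variable, then default.
    chosen = working_dir or env.get("CURATOR_CACHE_DIR") or DEFAULT_CACHE_DIR
    return os.path.abspath(os.path.expanduser(chosen))
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.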
Cache Internals (subject to future changes)
The cache directory contains the following:
metadata.db: a SQLite database containing metadata about data generation runs
cache directories of individual data generation runs: each directory is named after the fingerprint of the data generation run.
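To look at this layout on your own machine, a read-only inspection along these lines may help. `describe_cache` is a hypothetical helper. Because the internals are subject to change, it makes no assumption about the schema of metadata.db and only lists whatever tables happen to exist.

```python
import os
import sqlite3
from pathlib import Path


def describe_cache(cache_dir="~/.cache/curator"):
    """Return (run directory names, metadata.db table names) for a cache directory."""
    cache_dir = os.path.expanduser(cache_dir)
    run_dirs = sorted(
        name
        for name in os.listdir(cache_dir)
        if os.path.isdir(os.path.join(cache_dir, name))
    )

    tables = []
    db_path = Path(cache_dir, "metadata.db")
    if db_path.exists():
        # Open the database read-only so that inspection cannot modify the cache.
        uri = db_path.resolve().as_uri() + "?mode=ro"
        conn = sqlite3.connect(uri, uri=True)
        try:
            rows = conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            ).fetchall()
            tables = [row[0] for row in rows]
        finally:
            conn.close()
    return run_dirs, tables
```

Each entry in the first list should correspond to the fingerprint of one data generation run.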
The fingerprint of a data generation run is based on the following:
The input dataset on which the LLM object is being applied.
The prompt function of the LLM object.
Whether or not the data generation is using batch mode.
The response format of the LLM object.
The model name defined in the LLM object.
Generation parameters, e.g. temperature, top_k, etc.
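One way to picture a fingerprint built from the inputs listed above is as a stable hash over all of them. The sketch below is illustrative only: `run_fingerprint`, its parameter names, the use of SHA-256, and the 16-character truncation are assumptions, not Curator's actual scheme. The prompt function is represented by its source text to keep the example simple.

```python
import hashlib
import json


def run_fingerprint(dataset_rows, prompt_source, batch_mode,
                    response_format, model_name, generation_params):
    """Hypothetical fingerprint: a stable hash over the inputs listed above."""
    payload = {
        "dataset": dataset_rows,
        "prompt": prompt_source,
        "batch": batch_mode,
        "response_format": response_format,
        "model": model_name,
        "generation_params": generation_params,
    }
    # sort_keys makes the serialization independent of dict insertion order.
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:16]


# Illustrative values only; the model name is a placeholder.
base = dict(
    dataset_rows=[{"topic": "data"}],
    prompt_source="Write a poem about {topic}.",
    batch_mode=False,
    response_format=None,
    model_name="gpt-4o-mini",
    generation_params={"temperature": 0.7},
)
same = run_fingerprint(**base)
changed = run_fingerprint(**{**base, "generation_params": {"temperature": 0.2}})
```

The practical consequence is that changing any one of these inputs, such as the temperature, yields a different fingerprint and therefore a fresh run rather than a cache hit.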
Troubleshooting
Corrupt or full working directory
The cache directory can get too large or become corrupt due to unexpected errors. You can recover from these types of failures by deleting the cache directory: rm -rf ~/.cache/curator. Note that this will delete *all* cached responses.
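The same cleanup can be done from Python, which also makes it easy to check how much space the cache occupies before removing it. `cache_size_bytes` and `clear_cache` are hypothetical helpers written for this page, not Curator functions.

```python
import os
import shutil


def cache_size_bytes(cache_dir):
    """Total size of all regular files under cache_dir, in bytes."""
    total = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def clear_cache(cache_dir="~/.cache/curator"):
    """Delete the whole cache directory and return the number of bytes freed."""
    cache_dir = os.path.expanduser(cache_dir)
    if not os.path.isdir(cache_dir):
        return 0  # nothing to delete
    freed = cache_size_bytes(cache_dir)
    shutil.rmtree(cache_dir)  # removes every cached response, like rm -rf
    return freed
```

As with the shell command, this removes every cached run, so later runs will regenerate all responses.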