Some of the benchmarks (e.g., NewsQA) depend on data that is not public. All such data will be stored in the `restricted` directory, so you need to make sure that directory exists.
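For example, assuming `helm-run` is invoked from the repository root and `restricted` lives at the top level (an assumption; adjust the path to match your setup):

```
mkdir -p restricted
```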
`helm-run` provides several flags that can be used to test that the configuration and scenarios are working correctly without actually sending requests to the model:
```
# Just load the config file
helm-run --conf src/helm/benchmark/presentation/run_specs_small.conf --max-eval-instances 10 --suite v1 --skip-instances

# Create the instances and the requests, but don't send requests to the model
helm-run --conf src/helm/benchmark/presentation/run_specs_small.conf --max-eval-instances 10 --suite v1 --dry-run
```
To estimate token usage without making any requests, append the `--dry-run` option:

```
helm-run -r <RunSpec to estimate token usage> --suite $SUITE --max-eval-instances <Number of eval instances> --dry-run
```
and check the output in `benchmark_output/runs/$SUITE`. The `sum` value indicates the estimated total number of tokens used for the specific `RunSpec`.
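As a concrete illustration (the RunSpec and instance count below are examples, not requirements):

```
helm-run -r mmlu:subject=philosophy,model=openai/davinci --suite v1 --max-eval-instances 10 --dry-run
```

The estimated token counts then appear under `benchmark_output/runs/v1`.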
For the OpenAI models, we use the GPT-2 tokenizer to estimate token usage. The tokenizer will be downloaded and cached when running a dry run.
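As a rough sketch of what that estimate is based on (this one-liner assumes the Hugging Face `transformers` package is installed; it is only an illustration, not part of `helm-run`):

```
# Count GPT-2 tokens for a sample prompt (illustrative only)
python -c "from transformers import GPT2TokenizerFast; tok = GPT2TokenizerFast.from_pretrained('gpt2'); print(len(tok('The quick brown fox')['input_ids']))"
```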
We use Google's Perspective API to calculate the toxicity of completions. To send requests to Perspective API, we need to generate an API key from GCP. Follow the Get Started guide to request the service and the Enable the API guide to generate the API key. Once you have a valid API key, add an entry to `credentials.conf`:
```
perspectiveApiKey: <Generated API key>
```
By default, Perspective API allows only 1 query per second. Fill out this form to increase the request quota.
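To sanity-check the key before running a full evaluation, you can send a single request to the API directly (the sample comment text is arbitrary; substitute your generated key):

```
curl -s -X POST \
  "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze?key=<Generated API key>" \
  -H "Content-Type: application/json" \
  -d '{"comment": {"text": "hello world"}, "requestedAttributes": {"TOXICITY": {}}}'
```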