Here's a bird's-eye view of how the benchmarking process interacts with the main classes (see `benchmark`):

- A `Scenario` (given by a `ScenarioSpec`) specifies a task and a data distribution. It specifies a set of `Instance`s, where each `Instance` has an input (e.g., a question) and a set of `Reference` outputs (e.g., multiple-choice answers).
- A `DataPreprocessor` takes in a `Scenario` and produces a list of `Instance`s. Each `Instance` is given a unique ID. The set of `Instance`s is augmented according to `DataAugmenterSpec`.
- An `Adapter` (given by an `AdapterSpec`) takes a list of `Instance`s and adapts it to a set of `Request`s to the API (e.g., the model, temperature, number of in-context training examples). Formally, the output is a `ScenarioState` containing a set of `RequestState`s, where each `RequestState` consists of a `Request` and any metadata used to track the role of this `Request` (e.g., the relevant `Instance` and `Reference`).
- An `Executor` (given by an `ExecutionSpec`) executes each `Request` in the `RequestState`s to produce a `RequestResult` for each one; everything is encapsulated in a `ScenarioState`.
- A `Metric` (given by a `MetricSpec`) takes a `ScenarioState` containing `RequestResult`s and produces a set of `Stat`s (e.g., accuracy, accuracy@5, toxicity, bias).
- A `Runner` is the top-level controller that runs the above steps and is driven by a set of `RunSpec`s. A simplified sketch of this flow appears below.
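To make the data flow concrete, here is a highly simplified, illustrative sketch using stand-in types. These are not the real HELM classes, and `run_pipeline` is a hypothetical function that only mirrors the Scenario → Adapter → Executor → Metric steps at a high level.

```python
# Simplified stand-in types (not the real HELM classes) illustrating the
# Scenario -> Adapter -> Executor -> Metric data flow described above.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Reference:
    output: str
    tags: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class Instance:
    input: str
    references: List[Reference]
    split: str


@dataclass(frozen=True)
class Request:
    model: str
    prompt: str
    temperature: float


def run_pipeline(instances: List[Instance], model: str) -> Dict[str, float]:
    """Hypothetical: adapt instances to requests, 'execute' them, and compute a toy accuracy stat."""
    requests = [Request(model=model, prompt=i.input, temperature=0.0) for i in instances]
    completions = ["stub completion" for _ in requests]  # stand-in for real API calls
    correct = sum(
        any(c == ref.output for ref in inst.references)
        for inst, c in zip(instances, completions)
    )
    return {"accuracy": correct / len(instances) if instances else 0.0}
```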
There are three types of classes:

- Specifications (e.g., `AdapterSpec`, `ExecutionSpec`, `RunSpec`): specified manually by the user. Note that `Scenario` and `Metric` are subclassed, so they are constructed by `ObjectSpec`, which specifies the subclass name and a free-form dictionary of arguments (see the example after this list).
- States (e.g., `Instance`, `ScenarioState`, `Request`, `RequestResult`): these are automatically generated and can be serialized.
- Controllers (e.g., `Scenario`, `Adapter`, `Executor`, `Metric`, `Runner`): these have the bulk of the code and should not be serialized.
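For example, because `Scenario` and `Metric` are subclassed, a concrete scenario is referenced via the `ObjectSpec` pattern: the subclass's Python path plus a free-form dictionary of constructor arguments. A minimal sketch of this pattern; the class path and arguments below are placeholders, and the import path may differ across HELM versions:

```python
# ObjectSpec pattern: name the subclass by its Python path and pass its
# constructor arguments as a free-form dictionary.
# The scenario path and args below are placeholders.
from helm.benchmark.scenarios.scenario import ScenarioSpec

scenario_spec = ScenarioSpec(
    class_name="helm.benchmark.scenarios.your_scenario.YourScenario",
    args={"subset": "easy"},
)
```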
In order to implement new scenarios:

- Create a new Python scenario file in the `scenarios` folder.
- Within the scenario file, create a `Scenario` class, e.g. `YourScenario`.
- `YourScenario` should implement `get_instances`, a method that downloads the dataset files if they don't already exist and returns a list of `Instance`s. Each `Instance` must have a list of (potentially one) `Reference` answers: a correct answer may be indicated with a `CORRECT_TAG` in a `Reference` instance's `tags` argument. In addition, you must specify the `split` of the `Instance` as one of the `TRAIN_SPLIT`, `VALID_SPLIT`, or `TEST_SPLIT` constants as in `scenario.py`. A minimal sketch appears after this list.
    - For `Scenario`s with datasets that cannot be publicly shared, place a copy of the dataset at path `restricted/<Name of the Scenario>` and read from that path. See `NewsQAScenario` and `ICEScenario` for some examples.
- Note that you need not enumerate every possible correct answer (nor must there even necessarily be a correct answer).
- Make sure to document your scenario well with a clear docstring.
- In addition, specify its `name`, `description`, and `tags`.
- Define a function `get_specname_spec` in `run_specs.py` to retrieve a `ScenarioSpec` for your scenario, using a class name corresponding to the Python path of the class (e.g. `helm.benchmark.scenarios.your_scenario.YourScenario`) and any arguments, which must be passed as a dictionary of `args`.
- Have the `get_specname_spec` function retrieve an `AdapterSpec` for your scenario, specifying the type of language model generation that must be performed for the task.
- Identify the appropriate metric for your task in one of the `*_metrics.py` files. If the metric you'd like to use does not exist, follow the directions in Adding new metrics. Many will be in `basic_metrics.py`.
- Have a `get_metric_spec` function retrieve one or more `MetricSpec` objects for your task, specifying the class name with the Python path of the object, with the same arguments as the `ScenarioSpec` constructor.
- Have the `get_specname_spec` function return a `RunSpec` object, with a `name` corresponding to the scenario name and any patterns to match in curly braces, a `scenario_spec`, an `adapter_spec`, `metric_specs`, and `groups` (see the sketch after this list).
- Attempt to run your task with `venv/bin/helm-run -r yourscenarioname:arg=value`, where `yourscenarioname` matches the `name` specified in `YourScenario`.
- Add the spec to the dictionary `CANONICAL_RUN_SPEC_FUNCS` in `src/helm/benchmark/run_specs.py`.
- Update `src/helm/proxy/static/contamination.yaml` with models that were trained on your scenario (i.e., contaminated).
- Add a schema to `src/helm/benchmark/static/schema.yaml` and add the scenario to `subgroups` as needed.
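The following sketch puts these steps together. It is illustrative only: `YourScenario`, `get_yourscenario_spec`, the dataset, and the argument values are placeholders; the import paths, the `AdapterSpec` fields, and the `BasicMetric` usage are assumptions that should be checked against the version of HELM you are working with.

```python
# your_scenario.py -- hypothetical scenario (constructor fields and import
# paths are assumptions; check them against your HELM version).
from typing import List

from helm.benchmark.scenarios.scenario import (
    CORRECT_TAG,
    TEST_SPLIT,
    Instance,
    Reference,
    Scenario,
    ScenarioSpec,
)


class YourScenario(Scenario):
    """Toy question-answering scenario illustrating the steps above."""

    name = "yourscenario"
    description = "Toy scenario: answer simple questions."
    tags = ["question_answering"]

    def get_instances(self) -> List[Instance]:
        # A real scenario would download and parse the dataset files here
        # if they don't already exist.
        raw_examples = [("What is 2 + 2?", "4")]
        return [
            Instance(
                input=question,
                references=[Reference(output=answer, tags=[CORRECT_TAG])],
                split=TEST_SPLIT,
            )
            for question, answer in raw_examples
        ]


# run_specs.py -- hypothetical run-spec function wiring the pieces together.
from helm.benchmark.adaptation.adapter_spec import AdapterSpec  # path may differ
from helm.benchmark.metrics.metric import MetricSpec            # path may differ
from helm.benchmark.runner import RunSpec                       # path may differ


def get_yourscenario_spec(arg: str = "value") -> RunSpec:
    scenario_spec = ScenarioSpec(
        class_name="helm.benchmark.scenarios.your_scenario.YourScenario",
        args={"arg": arg},
    )
    # Field names and values here are assumptions about AdapterSpec.
    adapter_spec = AdapterSpec(
        method="generation",
        max_train_instances=5,
        max_tokens=20,
        temperature=0.0,
    )
    metric_specs = [
        MetricSpec(
            class_name="helm.benchmark.metrics.basic_metrics.BasicMetric",
            args={"names": ["exact_match"]},
        )
    ]
    return RunSpec(
        name=f"yourscenario:arg={arg}",
        scenario_spec=scenario_spec,
        adapter_spec=adapter_spec,
        metric_specs=metric_specs,
        groups=["yourscenario"],
    )
```

With something like this in place and registered in `CANONICAL_RUN_SPEC_FUNCS`, `venv/bin/helm-run -r yourscenario:arg=value` should be able to pick up the run spec.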
To add a new metric:

- If the metric is task-specific, create a new `yourtask_metrics.py` file. Otherwise, if the metric is generic and likely to be widely used, add it to `basic_metrics.py`.
- If you are creating a task-specific metric, create a `YourTaskMetric` which inherits from `Metric` in `metric.py`.
- Define methods `__init__` and `evaluate_generation` returning a list of `Stat` objects (a minimal sketch follows this list).
- Each `Stat` should correspond to a distinct aggregate measurement over the generated examples. Some may have one metric (e.g., accuracy), while others may quantify multiple aspects (e.g., multiple distance metrics).
- For each `value` generated for a `Stat`, add it to `yourstat` using `yourstat.add(value)`. Usually, there will only be one value for each `Stat`, but multiple can be used, e.g., to show variance.
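Here is a minimal sketch of a task-specific metric following the steps above. The `evaluate_generation` signature, the `MetricName` wrapper, the import paths, and the reference/completion field accesses are based on one version of HELM and may differ in yours; `YourTaskMetric` and the stat name are placeholders.

```python
from typing import List

# Import paths are assumptions and may vary across HELM versions.
from helm.benchmark.adaptation.adapter_spec import AdapterSpec
from helm.benchmark.adaptation.request_state import RequestState
from helm.benchmark.metrics.metric import Metric
from helm.benchmark.metrics.metric_name import MetricName
from helm.benchmark.metrics.metric_service import MetricService
from helm.benchmark.metrics.statistic import Stat


class YourTaskMetric(Metric):
    """Scores each generation by whether it exactly matches a correct reference."""

    def __init__(self, strip_whitespace: bool = True):
        self.strip_whitespace = strip_whitespace

    def evaluate_generation(
        self,
        adapter_spec: AdapterSpec,
        request_state: RequestState,
        metric_service: MetricService,
        eval_cache_path: str,
    ) -> List[Stat]:
        # Compare the first completion against the reference outputs.
        completion = request_state.result.completions[0].text
        if self.strip_whitespace:
            completion = completion.strip()
        correct_outputs = [reference.output for reference in request_state.instance.references]

        # One Stat per aggregate measurement; add one value per generation.
        stat = Stat(MetricName("your_exact_match"))
        stat.add(float(completion in correct_outputs))
        return [stat]
```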
To apply data augmentation, create a `DataAugmenterSpec` with a list of `PerturbationSpec`s and pass it into `RunSpec`. The following is an example:

```python
data_augmenter_spec = DataAugmenterSpec(
    perturbation_specs=[
        PerturbationSpec(
            class_name="helm.benchmark.augmentations.perturbation.ExtraSpacePerturbation",
            args={"num_spaces": 5},
        )
    ],
    should_perturb_references=False,
    should_augment_train_instances=False,
    should_include_original_train=False,
    should_augment_eval_instances=True,
    should_include_original_eval=True,
)
run_spec = RunSpec(
    ...
    data_augmenter_spec=data_augmenter_spec,
)
```

In the example above, the `DataPreprocessor` will augment the set of evaluation instances by perturbing the original instances with `ExtraSpacePerturbation`, which replaces each space in the text with `num_spaces` spaces.

We currently only support applying a single perturbation to an instance; chaining multiple perturbations and applying them to a single instance is not supported.
- To add a new perturbation to the framework, create a new file at `src/helm/benchmark/augmentations` with the name `<Name of perturbation>_perturbation.py`, e.g., `typo_perturbation.py`. Inside the file, create a new class (name it `<Name of the perturbation>Perturbation`, e.g., `TypoPerturbation`) that extends the abstract class `Perturbation` and implements the `perturb` method, which takes in text and outputs the perturbed text. A sketch is shown after this list.
- Add a test for the new perturbation in `test_perturbation.py`.
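Below is a minimal sketch of such a perturbation. `ShoutingPerturbation` is a hypothetical example, and the sketch assumes the base class only requires a `name` and a `perturb` method; the real abstract class may require additional members (e.g., a description), and `perturb` may also take a random-number generator in some HELM versions.

```python
# src/helm/benchmark/augmentations/shouting_perturbation.py
# Hypothetical example; the base class may require additional members,
# and the perturb signature may also take a random-number generator
# depending on the HELM version.
from helm.benchmark.augmentations.perturbation import Perturbation


class ShoutingPerturbation(Perturbation):
    """Upper-cases the input text."""

    # Short identifier used to tag perturbed instances.
    name: str = "shouting"

    def perturb(self, text: str) -> str:
        return text.upper()
```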
To add a new tokenizer:

- Give the tokenizer a name. Use the same name that's used in Hugging Face (e.g., "EleutherAI/gpt-j-6B").
- In `HuggingFaceTokenizers`, we load and cache tokenizers in memory. Add logic to handle the tokenizer in the `load_tokenizer` method.
- Add a test in `test_huggingface_tokenizer.py` to make sure we can load the tokenizer from Hugging Face.
- Add a new class `<Name of tokenizer>WindowService` in file `<Name of tokenizer>_window_service.py`. Follow what we did for `GPTJWindowService`. A sketch is shown after this list.
- Import the new `WindowService` and map the model(s) to it in `WindowServiceFactory`.
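The following is a sketch of a window service for the new tokenizer, modeled loosely on the `GPTJWindowService` pattern mentioned above. The base class, property names, and values are assumptions that should be checked against the existing window services in the codebase; `YourTokenizerWindowService` and its values are placeholders.

```python
# src/helm/benchmark/window_services/your_tokenizer_window_service.py
# Illustrative sketch only: the base class and the set of required
# properties are assumptions; mirror GPTJWindowService in your checkout.
from helm.benchmark.window_services.local_window_service import LocalWindowService


class YourTokenizerWindowService(LocalWindowService):
    @property
    def tokenizer_name(self) -> str:
        # Same name used by Hugging Face (placeholder value).
        return "your-org/your-tokenizer"

    @property
    def max_sequence_length(self) -> int:
        # Maximum number of tokens the model can attend to (placeholder value).
        return 2048

    @property
    def max_request_length(self) -> int:
        return self.max_sequence_length + 1

    @property
    def end_of_text_token(self) -> str:
        return "<|endoftext|>"

    @property
    def prefix_token(self) -> str:
        return self.end_of_text_token
```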