Evaluate documentation
Evaluator
The evaluator classes for automatic evaluation.
Evaluator classes
The main entry point for using the evaluator:
evaluate.evaluator
( task: str = None ) → Evaluator
Parameters
- task (str) -- The task defining which evaluator will be returned. Currently accepted tasks are:
  - "image-classification": will return an ImageClassificationEvaluator.
  - "question-answering": will return a QuestionAnsweringEvaluator.
  - "text-classification" (alias "sentiment-analysis" available): will return a TextClassificationEvaluator.
  - "token-classification": will return a TokenClassificationEvaluator.
Returns
An evaluator suitable for the task.
Utility factory method to build an Evaluator.
Evaluators encapsulate a task and a default metric name. They leverage pipeline functionality from transformers to simplify the evaluation of multiple combinations of models, datasets and metrics for a given task.
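A factory of this kind can be pictured as a lookup from task name to evaluator class, with aliases resolved first. The sketch below is purely illustrative: the stub classes, `TASK_REGISTRY`, `TASK_ALIASES`, and `build_evaluator` are hypothetical names, not the library's implementation.

```python
# Illustrative sketch only: stub classes and a hypothetical registry,
# not the evaluate library's actual code.
class StubEvaluator:
    def __init__(self, task):
        self.task = task


class StubTextClassificationEvaluator(StubEvaluator):
    pass


class StubTokenClassificationEvaluator(StubEvaluator):
    pass


# Hypothetical mapping from canonical task name to evaluator class.
TASK_REGISTRY = {
    "text-classification": StubTextClassificationEvaluator,
    "token-classification": StubTokenClassificationEvaluator,
}

# Aliases are resolved to a canonical task name before the lookup.
TASK_ALIASES = {"sentiment-analysis": "text-classification"}


def build_evaluator(task):
    canonical = TASK_ALIASES.get(task, task)
    if canonical not in TASK_REGISTRY:
        raise KeyError(f"Unknown task {task!r}; expected one of {sorted(TASK_REGISTRY)}")
    return TASK_REGISTRY[canonical](task=canonical)
```

Under these assumptions, `build_evaluator("sentiment-analysis")` and `build_evaluator("text-classification")` return the same kind of object.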
The base class for all evaluator classes:
The Evaluator class is the class from which all evaluators inherit. Refer to this class for methods shared across different evaluators. Base class implementing evaluator operations.
compute_metric
( metric: EvaluationModule, metric_inputs: typing.Dict, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, random_state: typing.Optional[int] = None )
Compute and return metrics.
A core method of the Evaluator class, which processes the pipeline outputs for compatibility with the metric.
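The `strategy`, `confidence_level`, `n_resamples`, and `random_state` arguments in the signature above relate to bootstrap resampling. For intuition, here is a minimal percentile bootstrap written with the standard library only. It is a simplified stand-in, not the `scipy` routine the library delegates to; the helper names `accuracy` and `percentile_bootstrap` and the shape of the returned dictionary are assumptions made for this sketch.

```python
import random


def accuracy(predictions, references):
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def percentile_bootstrap(predictions, references, confidence_level=0.95,
                         n_resamples=999, random_state=None):
    """Hypothetical helper: percentile confidence interval for accuracy."""
    rng = random.Random(random_state)  # a fixed seed makes runs repeatable
    n = len(references)
    scores = []
    for _ in range(n_resamples):
        # Resample example indices with replacement and re-score.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(accuracy([predictions[i] for i in idx],
                               [references[i] for i in idx]))
    scores.sort()
    alpha = (1.0 - confidence_level) / 2.0
    low = scores[int(alpha * (n_resamples - 1))]
    high = scores[int((1.0 - alpha) * (n_resamples - 1))]
    return {"score": accuracy(predictions, references),
            "confidence_interval": (low, high)}
```

Passing the same `random_state` twice yields the same interval, which is why a fixed seed is convenient when debugging.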
prepare_data
( data: typing.Union[str, datasets.arrow_dataset.Dataset], input_column: str, label_column: str ) → dict
Parameters
- data (str or Dataset, defaults to None) -- Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- input_column (str, defaults to "text") -- The name of the column containing the text feature in the dataset specified by data.
- label_column (str, defaults to "label") -- The name of the column containing the labels in the dataset specified by data.
Returns
dict: metric inputs.
list: pipeline inputs.
Prepare data.
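The two return values can be pictured as one column routed to the metric and another routed to the pipeline. The sketch below works on a plain list of row dictionaries rather than a `Dataset`; the function name `split_inputs` and the `"references"` key are assumptions for illustration.

```python
def split_inputs(rows, input_column="text", label_column="label"):
    """Hypothetical helper: separate labels (for the metric) from inputs (for the pipeline)."""
    # Fail early if a requested column is missing from any row.
    for column in (input_column, label_column):
        if any(column not in row for row in rows):
            raise ValueError(f"Column {column!r} not found in the data")
    metric_inputs = {"references": [row[label_column] for row in rows]}
    pipeline_inputs = [row[input_column] for row in rows]
    return metric_inputs, pipeline_inputs
```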
prepare_metric
( metric: typing.Union[str, evaluate.module.EvaluationModule] )
Prepare metric.
prepare_pipeline
( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')], tokenizer: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None, feature_extractor: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None, device: int = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) -- If the argument is not specified, we initialize the default pipeline for the task. If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- preprocessor (PreTrainedTokenizerBase or FeatureExtractionMixin, optional, defaults to None) -- Can be used to overwrite the default preprocessor if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
Prepare pipeline.
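The three cases described for `model_or_pipeline` amount to a small decision tree. The sketch below mirrors that description with stand-ins: `resolve_pipeline`, the `build_pipeline` callable, and the `model_types` tuple are hypothetical and do not come from the library.

```python
def resolve_pipeline(model_or_pipeline, task, build_pipeline,
                     preprocessor=None, model_types=()):
    """Hypothetical helper mirroring the three documented cases.

    build_pipeline is an assumed callable: (task, model, preprocessor) -> pipeline.
    model_types is an assumed tuple of classes that count as "model instances".
    """
    if model_or_pipeline is None:
        # Nothing specified: default pipeline for the task, preprocessor ignored.
        return build_pipeline(task, None, None)
    if isinstance(model_or_pipeline, (str, *model_types)):
        # Model name or model instance: build a new pipeline around it.
        return build_pipeline(task, model_or_pipeline, preprocessor)
    # Anything else is treated as a pre-initialized pipeline; preprocessor ignored.
    return model_or_pipeline
```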
The task specific evaluators
ImageClassificationEvaluator
class evaluate.ImageClassificationEvaluator
( task = 'image-classification', default_metric_name = None )
Image classification evaluator.
This image classification evaluator can currently be loaded from evaluator() using the default task name image-classification.
Methods in this class assume a data format compatible with the ImageClassificationPipeline.
compute
( input_column: str = 'image', *args, **kwargs )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) -- If the argument is not specified, we initialize the default pipeline for the task (in this case image-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None) -- Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- metric (str or EvaluationModule, defaults to None) -- Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name and load it. Otherwise we assume it represents a pre-loaded metric.
- feature_extractor (str or FeatureExtractionMixin, optional, defaults to None) -- Can be used to overwrite the default feature extractor if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple") -- Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- confidence_level (float, defaults to 0.95) -- The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999) -- The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None) -- Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will use the CPU; a positive integer will run the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None) -- The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
- input_column (str, defaults to "image") -- The name of the column containing the images as PIL ImageFile in the dataset specified by data.
- label_column (str, defaults to "label") -- The name of the column containing the labels in the dataset specified by data.
- label_mapping (Dict[str, Number], optional, defaults to None) -- Maps class labels defined by the model in the pipeline to values consistent with those defined in the label_column of the data dataset.
Compute the metric for a given pipeline and dataset combination.
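The role of `label_mapping` is easiest to see on toy data: the model emits string labels, the dataset stores integers, and the mapping reconciles the two before scoring. The sketch below assumes each pipeline output is a single dictionary with a `"label"` key, which is a simplification; `map_predictions` is a hypothetical helper, not a library function.

```python
def map_predictions(pipeline_outputs, label_mapping=None):
    """Hypothetical helper: turn pipeline labels into dataset label values."""
    predictions = []
    for output in pipeline_outputs:
        label = output["label"]
        # Without a mapping, the model's own label is used unchanged.
        predictions.append(label_mapping[label] if label_mapping else label)
    return predictions


outputs = [{"label": "bean_rust", "score": 0.9}, {"label": "healthy", "score": 0.8}]
mapping = {"angular_leaf_spot": 0, "bean_rust": 1, "healthy": 2}
mapped = map_predictions(outputs, mapping)  # integers comparable with the label column
```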
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("image-classification")
>>> data = load_dataset("beans", split="test[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="nateraw/vit-base-beans",
...     data=data,
...     label_column="labels",
...     metric="accuracy",
...     label_mapping={'angular_leaf_spot': 0, 'bean_rust': 1, 'healthy': 2},
...     strategy="bootstrap"
... )
QuestionAnsweringEvaluator
class evaluate.QuestionAnsweringEvaluator
( task = 'question-answering', default_metric_name = None )
Question answering evaluator. This evaluator handles extractive question answering, where the answer to the question is extracted from a context.
This question answering evaluator can currently be loaded from evaluator() using the default task name question-answering.
Methods in this class assume a data format compatible with the QuestionAnsweringPipeline.
compute
( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, question_column: str = 'question', context_column: str = 'context', id_column: str = 'id', label_column: str = 'answers', squad_v2_format: typing.Optional[bool] = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) -- If the argument is not specified, we initialize the default pipeline for the task (in this case question-answering). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None) -- Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- metric (str or EvaluationModule, defaults to None) -- Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None) -- Can be used to overwrite the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple") -- Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- confidence_level (float, defaults to 0.95) -- The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999) -- The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None) -- Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will use the CPU; a positive integer will run the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None) -- The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
- question_column (str, defaults to "question") -- The name of the column containing the question in the dataset specified by data.
- context_column (str, defaults to "context") -- The name of the column containing the context in the dataset specified by data.
- id_column (str, defaults to "id") -- The name of the column containing the identification field of the question and answer pair in the dataset specified by data.
- label_column (str, defaults to "answers") -- The name of the column containing the answers in the dataset specified by data.
- squad_v2_format (bool, optional, defaults to None) -- Whether the dataset follows the format of the squad_v2 dataset, where a question may have no answer in the context. If this parameter is not provided, the format will be automatically inferred.
Compute the metric for a given pipeline and dataset combination.
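One plausible way to infer `squad_v2_format` from the data is to check whether any row has an empty answer list. The sketch below is an assumption about how such inference could work, not the library's documented behaviour; `looks_like_squad_v2` is a hypothetical name, and rows are assumed to follow the SQuAD layout where `answers` holds parallel `text` and `answer_start` lists.

```python
def looks_like_squad_v2(rows, label_column="answers"):
    """Hypothetical check: does any question lack an answer in its context?"""
    return any(len(row[label_column]["text"]) == 0 for row in rows)


v1_rows = [{"answers": {"text": ["Paris"], "answer_start": [10]}}]
v2_rows = v1_rows + [{"answers": {"text": [], "answer_start": []}}]
```

A check of this kind depends on the rows it sees: a small slice of a v2-style dataset may happen to contain only answerable questions, which is one reason to set the flag explicitly.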
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="sshleifer/tiny-distilbert-base-cased-distilled-squad",
...     data=data,
...     metric="squad",
... )
Datasets where the answer may be missing from the context are supported, for example the SQuAD v2 dataset. In this case, it is safer to pass squad_v2_format=True to the compute() call.
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad_v2", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="mrm8488/bert-tiny-finetuned-squadv2",
...     data=data,
...     metric="squad_v2",
...     squad_v2_format=True,
... )
TextClassificationEvaluator
class evaluate.TextClassificationEvaluator
( task = 'text-classification', default_metric_name = None )
Text classification evaluator.
This text classification evaluator can currently be loaded from evaluator() using the default task name text-classification or with a "sentiment-analysis" alias.
Methods in this class assume a data format compatible with the TextClassificationPipeline: a single textual feature as input and a categorical label as output.
compute
( *args, **kwargs )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) -- If the argument is not specified, we initialize the default pipeline for the task (in this case text-classification or its alias sentiment-analysis). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None) -- Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- metric (str or EvaluationModule, defaults to None) -- Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None) -- Can be used to overwrite the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple") -- Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- confidence_level (float, defaults to 0.95) -- The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999) -- The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None) -- Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will use the CPU; a positive integer will run the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None) -- The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
- input_column (str, defaults to "text") -- The name of the column containing the text feature in the dataset specified by data.
- label_column (str, defaults to "label") -- The name of the column containing the labels in the dataset specified by data.
- label_mapping (Dict[str, Number], optional, defaults to None) -- Maps class labels defined by the model in the pipeline to values consistent with those defined in the label_column of the data dataset.
Compute the metric for a given pipeline and dataset combination.
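The `device` rules in the parameter list reduce to a short fallback. The sketch below restates them in plain Python; `resolve_device` and its `cuda_available` flag are hypothetical, standing in for whatever hardware check the library performs.

```python
def resolve_device(device=None, cuda_available=False):
    """Hypothetical helper: return the device ordinal that would be used."""
    if device is None:
        # Inferred: first CUDA device if one is available, otherwise CPU (-1).
        return 0 if cuda_available else -1
    # An explicit ordinal is used as given: -1 for CPU, otherwise a CUDA device ID.
    return device
```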
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text-classification")
>>> data = load_dataset("imdb", split="test[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
...     data=data,
...     metric="accuracy",
...     label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
...     strategy="bootstrap",
...     n_resamples=10,
...     random_state=0
... )
TokenClassificationEvaluator
class evaluate.TokenClassificationEvaluator
( task = 'token-classification', default_metric_name = None )
Token classification evaluator.
This token classification evaluator can currently be loaded from evaluator() using the default task name token-classification.
Methods in this class assume a data format compatible with the TokenClassificationPipeline.
compute
( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, metric: typing.Union[str, ForwardRef('EvaluationModule')] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: typing.Optional[int] = None, random_state: typing.Optional[int] = None, input_column: str = 'tokens', label_column: str = 'ner_tags', join_by: typing.Optional[str] = ' ' )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) -- If the argument is not specified, we initialize the default pipeline for the task (in this case token-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None) -- Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- metric (str or EvaluationModule, defaults to None) -- Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None) -- Can be used to overwrite the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple") -- Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
- confidence_level (float, defaults to 0.95) -- The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999) -- The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None) -- Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will use the CPU; a positive integer will run the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None) -- The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
- input_column (str, defaults to "tokens") -- The name of the column containing the tokens feature in the dataset specified by data.
- label_column (str, defaults to "ner_tags") -- The name of the column containing the labels in the dataset specified by data.
- join_by (str, optional, defaults to " ") -- This evaluator supports datasets whose input column is a list of words. This parameter specifies how to join words to generate a string input. This is especially useful for languages that do not separate words by a space.
Compute the metric for a given pipeline and dataset combination.
The dataset input and label columns are expected to be formatted as a list of words and a list of labels respectively, following the conll2003 dataset. Datasets whose inputs are single strings and whose labels are a list of offsets are not supported.
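To make the effect of `join_by` concrete, the sketch below joins a word list into one string and records where each word lands, so that character-level predictions could later be matched back to words. The helper names `join_words` and `words_to_offsets` and the inclusive-end convention are assumptions for this illustration.

```python
def join_words(words, join_by=" "):
    """Build the single string that would be fed to the pipeline."""
    return join_by.join(words)


def words_to_offsets(words, join_by=" "):
    """Hypothetical helper: (start, end) character span of each word, end inclusive."""
    offsets = []
    start = 0
    for word in words:
        end = start + len(word) - 1
        offsets.append((start, end))
        # The next word begins after this word plus the separator.
        start = end + len(join_by) + 1
    return offsets


english = ["New", "York", "is"]
japanese = ["東京", "は"]  # no separator between words, so join_by=""
```

With `join_by=""`, the words are concatenated directly, which suits languages that do not separate words with spaces.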
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("token-classification")
>>> data = load_dataset("conll2003", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="elastic/distilbert-base-uncased-finetuned-conll03-english",
...     data=data,
...     metric="seqeval",
... )
For example, the following dataset format is accepted by the evaluator:
from datasets import ClassLabel, Dataset, Features, Sequence, Value

# One row: a list of words and a parallel list of integer tag ids.
dataset = Dataset.from_dict(
    mapping={
        "tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
        "ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
    },
    features=Features({
        "tokens": Sequence(feature=Value(dtype="string")),
        "ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
    }),
)
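In the accepted format, each integer in `ner_tags` is an index into the `ClassLabel` names list. A metric such as seqeval works on string tags, so the ids have to be translated to names at some point; the plain-Python sketch below shows that translation. Where exactly the evaluator performs it is not stated here, and `ids_to_names` is a hypothetical helper.

```python
NAMES = ["O", "B-LOC", "I-LOC", "B-PER", "I-PER"]


def ids_to_names(tag_ids, names=NAMES):
    """Hypothetical helper: look up the string tag for each integer id."""
    return [names[i] for i in tag_ids]


tags = ids_to_names([1, 2, 0, 0, 0, 0, 3, 0, 0, 0])
```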
For example, the following dataset format is not accepted by the evaluator:
from datasets import Dataset, Features, Sequence, Value

# One row: a single string with character offsets for each entity.
dataset = Dataset.from_dict(
    mapping={
        "tokens": ["New York is a city and Felix a person."],
        "starts": [[0, 23]],
        "ends": [[7, 27]],
        "ner_tags": [["LOC", "PER"]],
    },
    features=Features({
        "tokens": Value(dtype="string"),
        "starts": Sequence(feature=Value(dtype="int32")),
        "ends": Sequence(feature=Value(dtype="int32")),
        "ner_tags": Sequence(feature=Value(dtype="string")),
    }),
)