Evaluator

The evaluator classes for automatic evaluation.

Evaluator classes

The main entry point for using the evaluator:

evaluate.evaluator

( task: str = None ) → Evaluator

Parameters

task (str) — The task defining which evaluator will be returned. Currently accepted tasks are:
- "image-classification": will return an ImageClassificationEvaluator.
- "question-answering": will return a QuestionAnsweringEvaluator.
- "text-classification" (alias "sentiment-analysis" available): will return a TextClassificationEvaluator.
- "token-classification": will return a TokenClassificationEvaluator.

Returns

Evaluator

An evaluator suitable for the task.

Utility factory method to build an Evaluator. Evaluators encapsulate a task and a default metric name. They leverage pipeline functionality from transformers to simplify the evaluation of multiple combinations of models, datasets and metrics for a given task.
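The factory behaviour described above (a task string, possibly an alias, mapped to an evaluator that carries a task and a default metric name) can be sketched in plain Python. This is only an illustration: `SketchEvaluator`, `TASK_ALIASES`, `SUPPORTED_TASKS` and `build_evaluator` are hypothetical names, and the default metric names are assumptions borrowed from the examples on this page, not the library's actual registry.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SketchEvaluator:
    """Hypothetical stand-in for an evaluator: a task plus a default metric name."""
    task: str
    default_metric_name: Optional[str] = None


# Alias -> canonical task name (only the alias mentioned in the text).
TASK_ALIASES: Dict[str, str] = {"sentiment-analysis": "text-classification"}

# Canonical task -> default metric name (assumed values, for illustration only).
SUPPORTED_TASKS: Dict[str, str] = {
    "image-classification": "accuracy",
    "question-answering": "squad",
    "text-classification": "accuracy",
    "token-classification": "seqeval",
}


def build_evaluator(task: str) -> SketchEvaluator:
    """Resolve aliases, validate the task, and return a matching evaluator."""
    canonical = TASK_ALIASES.get(task, task)
    if canonical not in SUPPORTED_TASKS:
        raise KeyError(f"Unknown task {task!r}; available: {sorted(SUPPORTED_TASKS)}")
    return SketchEvaluator(task=canonical, default_metric_name=SUPPORTED_TASKS[canonical])


sketch = build_evaluator("sentiment-analysis")
print(sketch.task)
```

Resolving the alias before the lookup keeps a single entry per task, so an alias can never drift out of sync with the task it points to.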

Examples:

>>> from evaluate import evaluator
>>> # Sentiment analysis evaluator
>>> evaluator("sentiment-analysis")

The base class for all evaluator classes:

class evaluate.Evaluator

( task: str default_metric_name: str = None )

Base class implementing evaluator operations. All evaluators inherit from the Evaluator class; refer to it for the methods shared across the different evaluators.

compute_metric

( metric: EvaluationModule metric_inputs: typing.Dict strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 random_state: typing.Optional[int] = None )

Compute and return metrics.
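To give a feel for what the "bootstrap" strategy adds on top of "simple", here is a minimal standard-library sketch of a bootstrap confidence interval for accuracy. It uses the plain percentile method for brevity, which is not necessarily the method scipy's bootstrap applies, and the function names are hypothetical. The parameters `confidence_level`, `n_resamples` and `random_state` mirror the signature above.

```python
import random
from typing import List, Optional, Sequence, Tuple


def accuracy(predictions: Sequence[int], references: Sequence[int]) -> float:
    """Fraction of predictions that match their reference."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def percentile_bootstrap(
    predictions: Sequence[int],
    references: Sequence[int],
    confidence_level: float = 0.95,
    n_resamples: int = 9999,
    random_state: Optional[int] = None,
) -> Tuple[float, float]:
    """Percentile bootstrap interval: resample pairs with replacement, rescore, take quantiles."""
    rng = random.Random(random_state)  # a fixed seed makes the interval reproducible
    n = len(references)
    scores: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(accuracy([predictions[i] for i in idx], [references[i] for i in idx]))
    scores.sort()
    alpha = (1.0 - confidence_level) / 2.0
    low = scores[int(alpha * (n_resamples - 1))]
    high = scores[int((1.0 - alpha) * (n_resamples - 1))]
    return low, high


preds = [1, 0, 1, 1, 0, 1, 0, 1]
refs = [1, 0, 0, 1, 0, 1, 1, 1]
print("score:", accuracy(preds, refs))
print("interval:", percentile_bootstrap(preds, refs, n_resamples=200, random_state=0))
```

Passing `random_state` is what makes two runs return the same interval, which is why it is described as useful for debugging.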

predictions_processor

( *args **kwargs )

A core method of the Evaluator class, which processes the pipeline outputs for compatibility with the metric.
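As an illustration of what "processing the pipeline outputs for compatibility with the metric" can involve, the sketch below converts classification-style pipeline outputs into the values a metric would compare against the references. The output shape (a list of dicts with "label" and "score" keys), the "predictions" key and the function name `process_predictions` are assumptions made for this example.

```python
from typing import Any, Dict, List, Optional


def process_predictions(
    pipeline_outputs: List[Dict[str, Any]],
    label_mapping: Optional[Dict[str, float]] = None,
) -> Dict[str, List[Any]]:
    """Keep only the predicted label, translated through label_mapping when one is given."""
    predictions = [
        label_mapping[output["label"]] if label_mapping is not None else output["label"]
        for output in pipeline_outputs
    ]
    return {"predictions": predictions}


# Assumed pipeline output shape: one dict per input example.
outputs = [{"label": "LABEL_1", "score": 0.93}, {"label": "LABEL_0", "score": 0.71}]
print(process_predictions(outputs, label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0}))
```

Without a mapping, the model's own label strings are passed through unchanged, so they would have to match the dataset's label values directly.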

prepare_data

( data: typing.Union[str, datasets.arrow_dataset.Dataset] input_column: str label_column: str ) → dict

Parameters

data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
input_column (str, defaults to "text") — the name of the column containing the text feature in the dataset specified by data.
label_column (str, defaults to "label") — the name of the column containing the labels in the dataset specified by data.

Returns

dict

A dict with the metric inputs and a list with the pipeline inputs.

Prepare data.
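The split into metric inputs and pipeline inputs can be pictured with a small sketch. A plain dict of columns stands in for a Dataset here, and both the "references" key and the helper name `split_inputs` are assumptions for illustration.

```python
from typing import Any, Dict, List, Tuple


def split_inputs(
    data: Dict[str, List[Any]],
    input_column: str = "text",
    label_column: str = "label",
) -> Tuple[Dict[str, List[Any]], List[Any]]:
    """Return (metric inputs, pipeline inputs) taken from the named columns."""
    for column in (input_column, label_column):
        if column not in data:
            raise ValueError(f"Column {column!r} not found; available: {sorted(data)}")
    metric_inputs = {"references": list(data[label_column])}
    pipeline_inputs = list(data[input_column])
    return metric_inputs, pipeline_inputs


toy = {"text": ["great film", "dull plot"], "label": [1, 0]}
metric_inputs, pipeline_inputs = split_inputs(toy)
print(metric_inputs, pipeline_inputs)
```

Checking the column names up front gives a clear error before any model is run on the data.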

prepare_metric

( metric: typing.Union[str, evaluate.module.EvaluationModule] )

Parameters

metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.

Prepare metric.

prepare_pipeline

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] tokenizer: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None feature_extractor: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None device: int = None )

Parameters

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task. If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
preprocessor (PreTrainedTokenizerBase or FeatureExtractionMixin, optional, defaults to None) — Argument can be used to overwrite a default preprocessor if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.

Prepare pipeline.

The task specific evaluators

ImageClassificationEvaluator

class evaluate.ImageClassificationEvaluator

( task = 'image-classification' default_metric_name = None )

Image classification evaluator. This image classification evaluator can currently be loaded from evaluator() using the default task name image-classification. Methods in this class assume a data format compatible with the ImageClassificationPipeline.

compute

( input_column: str = 'image' *args **kwargs )

Parameters

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case image-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
feature_extractor (str or FeatureExtractionMixin, optional, defaults to None) — Argument can be used to overwrite a default feature extractor if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to "simple") — specifies the evaluation strategy. Possible values are:
- "simple" - we evaluate the metric and return the scores.
- "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy’s bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
input_column (str, defaults to "image") — the name of the column containing the images as PIL ImageFile in the dataset specified by data.
label_column (str, defaults to "label") — the name of the column containing the labels in the dataset specified by data.
label_mapping (Dict[str, Number], optional, defaults to None) — Maps the class labels defined by the model in the pipeline to values consistent with those defined in the label_column of the data dataset.

Compute the metric for a given pipeline and dataset combination.
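The rule for the device parameter listed above can be written out as a small decision function. This is a sketch only: `resolve_device` is a hypothetical helper, CUDA availability is passed in as a flag so the example has no dependencies, and treating 0 as CUDA device 0 is an assumption consistent with the "CUDA:0" default.

```python
from typing import Optional


def resolve_device(device: Optional[int], cuda_available: bool) -> str:
    """Map the device argument to a device name following the documented rule."""
    if device is None:
        # Inferred: first CUDA device when available, CPU otherwise.
        return "cuda:0" if cuda_available else "cpu"
    if device == -1:
        return "cpu"
    if device >= 0:
        return f"cuda:{device}"
    raise ValueError(f"Unsupported device ordinal: {device}")


for requested in (None, -1, 1):
    print(requested, "->", resolve_device(requested, cuda_available=True))
```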

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("image-classification")
>>> data = load_dataset("beans", split="test[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="nateraw/vit-base-beans",
...     data=data,
...     label_column="labels",
...     metric="accuracy",
...     label_mapping={'angular_leaf_spot': 0, 'bean_rust': 1, 'healthy': 2},
...     strategy="bootstrap"
... )

QuestionAnsweringEvaluator

class evaluate.QuestionAnsweringEvaluator

( task = 'question-answering' default_metric_name = None )

Question answering evaluator. This evaluator handles extractive question answering, where the answer to the question is extracted from a context.

This question answering evaluator can currently be loaded from evaluator() using the default task name question-answering.

Methods in this class assume a data format compatible with the QuestionAnsweringPipeline.

compute

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None question_column: str = 'question' context_column: str = 'context' id_column: str = 'id' label_column: str = 'answers' squad_v2_format: typing.Optional[bool] = None )

Parameters

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case question-answering). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to "simple") — specifies the evaluation strategy. Possible values are:
- "simple" - we evaluate the metric and return the scores.
- "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy’s bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
question_column (str, defaults to "question") — the name of the column containing the question in the dataset specified by data.
context_column (str, defaults to "context") — the name of the column containing the context in the dataset specified by data.
id_column (str, defaults to "id") — the name of the column containing the identification field of the question and answer pair in the dataset specified by data.
label_column (str, defaults to "answers") — the name of the column containing the answers in the dataset specified by data.
squad_v2_format (bool, optional, defaults to None) — whether the dataset follows the format of squad_v2 dataset where a question may have no answer in the context. If this parameter is not provided, the format will be automatically inferred.

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="sshleifer/tiny-distilbert-base-cased-distilled-squad",
...     data=data,
...     metric="squad",
... )

Datasets where the answer may be missing from the context are supported, for example the SQuAD v2 dataset. In this case, it is safer to pass squad_v2_format=True to the compute() call.

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad_v2", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="mrm8488/bert-tiny-finetuned-squadv2",
...     data=data,
...     metric="squad_v2",
...     squad_v2_format=True,
... )
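Why passing squad_v2_format explicitly is the safer option becomes clearer with a sketch of how such a format could be inferred. The heuristic below, which looks for at least one example with an empty answer list, is an assumption made for illustration and not necessarily what the library does. `infer_squad_v2_format` is a hypothetical name, and the answers layout follows the SQuAD convention of "text" and "answer_start" lists.

```python
from typing import Any, Dict, List


def infer_squad_v2_format(answers_column: List[Dict[str, List[Any]]]) -> bool:
    """Guess SQuAD v2 format: true if any question has no answer in its context."""
    return any(len(answers["text"]) == 0 for answers in answers_column)


answerable = {"text": ["Denver Broncos"], "answer_start": [177]}
unanswerable = {"text": [], "answer_start": []}

print(infer_squad_v2_format([answerable, unanswerable]))
print(infer_squad_v2_format([answerable, answerable]))
```

A small slice of a v2-style dataset may happen to contain only answerable questions, in which case a heuristic of this kind would take it for the v1 format. Setting the flag by hand avoids that ambiguity.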

TextClassificationEvaluator

class evaluate.TextClassificationEvaluator

( task = 'text-classification' default_metric_name = None )

Text classification evaluator. This text classification evaluator can currently be loaded from evaluator() using the default task name text-classification or with a "sentiment-analysis" alias. Methods in this class assume a data format compatible with the TextClassificationPipeline - a single textual feature as input and a categorical label as output.

compute

( *args **kwargs )

Parameters

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case text-classification or its alias - sentiment-analysis). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to "simple") — specifies the evaluation strategy. Possible values are:
- "simple" - we evaluate the metric and return the scores.
- "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy’s bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
input_column (str, defaults to "text") — the name of the column containing the text feature in the dataset specified by data.
label_column (str, defaults to "label") — the name of the column containing the labels in the dataset specified by data.
label_mapping (Dict[str, Number], optional, defaults to None) — Maps the class labels defined by the model in the pipeline to values consistent with those defined in the label_column of the data dataset.

Compute the metric for a given pipeline and dataset combination.

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text-classification")
>>> data = load_dataset("imdb", split="test[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
...     data=data,
...     metric="accuracy",
...     label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
...     strategy="bootstrap",
...     n_resamples=10,
...     random_state=0
... )

TokenClassificationEvaluator

class evaluate.TokenClassificationEvaluator

( task = 'token-classification' default_metric_name = None )

Token classification evaluator.

This token classification evaluator can currently be loaded from evaluator() using the default task name token-classification.

Methods in this class assume a data format compatible with the TokenClassificationPipeline.

compute

( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None metric: typing.Union[str, ForwardRef('EvaluationModule')] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: typing.Optional[int] = None random_state: typing.Optional[int] = None input_column: str = 'tokens' label_column: str = 'ner_tags' join_by: typing.Optional[str] = ' ' )

Parameters

model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case token-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Argument can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
strategy (Literal["simple", "bootstrap"], defaults to "simple") — specifies the evaluation strategy. Possible values are:
- "simple" - we evaluate the metric and return the scores.
- "bootstrap" - on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy’s bootstrap method https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html.
confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if "bootstrap" strategy is chosen.
n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if "bootstrap" strategy is chosen.
device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU, a positive integer will run the model on the associated CUDA device ID. If None is provided it will be inferred and CUDA:0 used if available, CPU otherwise.
random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if "bootstrap" strategy is chosen. Useful for debugging.
input_column (str, defaults to "tokens") — the name of the column containing the tokens feature in the dataset specified by data.
label_column (str, defaults to "ner_tags") — the name of the column containing the labels in the dataset specified by data.
join_by (str, optional, defaults to " ") — This evaluator supports datasets whose input column is a list of words. This parameter specifies how to join words to generate a string input. This is especially useful for languages that do not separate words by a space.

Compute the metric for a given pipeline and dataset combination.

The dataset input and label columns are expected to be formatted as a list of words and a list of labels respectively, following the conll2003 dataset. Datasets whose inputs are single strings and whose labels are a list of offsets are not supported.
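The role of join_by can be illustrated with a short sketch that joins a list of words into one input string and records where each word lands in it. The helper name `join_words` and the idea of returning character offsets are assumptions for this example. The offsets only show how word-level labels could be lined up with a string-level prediction.

```python
from typing import List, Tuple


def join_words(words: List[str], join_by: str = " ") -> Tuple[str, List[Tuple[int, int]]]:
    """Join words with join_by and return the text plus each word's (start, end) span."""
    text = join_by.join(words)
    offsets: List[Tuple[int, int]] = []
    start = 0
    for word in words:
        end = start + len(word)
        offsets.append((start, end))
        start = end + len(join_by)  # skip over the separator
    return text, offsets


text, offsets = join_words(["New", "York", "is", "a", "city"])
print(text)
print(offsets)

# For a language written without spaces, an empty separator keeps the text natural.
print(join_words(["東京", "は", "都市"], join_by=""))
```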

Examples:

>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("token-classification")
>>> data = load_dataset("conll2003", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="elastic/distilbert-base-uncased-finetuned-conll03-english",
...     data=data,
...     metric="seqeval",
... )

For example, the following dataset format is accepted by the evaluator:

from datasets import ClassLabel, Dataset, Features, Sequence, Value

dataset = Dataset.from_dict(
    mapping={
        "tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
        "ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
    },
    features=Features({
        "tokens": Sequence(feature=Value(dtype="string")),
        "ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
    }),
)
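In the accepted format, each integer in ner_tags is an index into the ClassLabel names. The sketch below decodes the example row by hand with plain lists, so the alignment between words and string tags is visible. `decode_tags` is a hypothetical helper, not part of the library.

```python
from typing import List


def decode_tags(tag_ids: List[int], names: List[str]) -> List[str]:
    """Translate integer class ids into their string tag names."""
    return [names[tag_id] for tag_id in tag_ids]


names = ["O", "B-LOC", "I-LOC", "B-PER", "I-PER"]
tokens = ["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]
ner_tags = [1, 2, 0, 0, 0, 0, 3, 0, 0, 0]

labels = decode_tags(ner_tags, names)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Each word carries exactly one tag, which is what makes this layout usable with word-level sequence labelling metrics.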

For example, the following dataset format is not accepted by the evaluator:

from datasets import Dataset, Features, Sequence, Value

dataset = Dataset.from_dict(
    mapping={
        # A single string per example, with labels given as character offsets.
        "tokens": ["New York is a city and Felix a person."],
        "starts": [[0, 23]],
        "ends": [[7, 27]],
        "ner_tags": [["LOC", "PER"]],
    },
    features=Features({
        "tokens": Value(dtype="string"),
        "starts": Sequence(feature=Value(dtype="int32")),
        "ends": Sequence(feature=Value(dtype="int32")),
        "ner_tags": Sequence(feature=Value(dtype="string")),
    }),
)
