Evaluation

Once you have finetuned a model, through either SFT or LoRA SFT, you should evaluate it on standard benchmarks. As a machine learning engineer, you should maintain a suite of evaluations that is relevant to your target domain. In this page, we will look at some of the most common benchmarks and how to use them to evaluate your model. We'll also look at how to create custom benchmarks for your specific use case.

Automatic Benchmarks

Automatic benchmarks serve as standardized tools for evaluating language models across different tasks and capabilities. While they provide a useful starting point for understanding model performance, it’s important to recognize that they represent only one piece of a comprehensive evaluation strategy.

Understanding Automatic Benchmarks

Automatic benchmarks typically consist of curated datasets with predefined tasks and evaluation metrics. These benchmarks aim to assess various aspects of model capability, from basic language understanding to complex reasoning. The key advantage of using automatic benchmarks is their standardization - they allow for consistent comparison across different models and provide reproducible results.

However, it’s crucial to understand that benchmark performance doesn’t always translate directly to real-world effectiveness. A model that excels at academic benchmarks may still struggle with specific domain applications or practical use cases.

General Knowledge Benchmarks

MMLU (Massive Multitask Language Understanding) tests knowledge across 57 subjects, from science to humanities. While comprehensive, it may not reflect the depth of expertise needed for specific domains. TruthfulQA evaluates a model’s tendency to reproduce common misconceptions, though it can’t capture all forms of misinformation.

Reasoning Benchmarks

BBH (Big Bench Hard) and GSM8K focus on complex reasoning tasks. BBH tests logical thinking and planning, while GSM8K specifically targets mathematical problem-solving. These benchmarks help assess analytical capabilities but may not capture the nuanced reasoning required in real-world scenarios.

Language Understanding

HELM provides a holistic evaluation framework. Benchmarks like HELM offer insights into language processing capabilities on aspects like commonsense, world knowledge, and reasoning, but they may not fully represent the complexity of natural conversation or domain-specific terminology.

Domain-Specific Benchmarks

Let’s look at a few benchmarks that focus on specific domains like math, coding, and chat.

The MATH benchmark is another important evaluation tool for mathematical reasoning. It consists of 12,500 problems from mathematics competitions, covering algebra, geometry, number theory, counting, probability, and more. What makes MATH particularly challenging is that it requires multi-step reasoning, an understanding of formal mathematical notation, and the ability to generate step-by-step solutions. Unlike simpler arithmetic tasks, MATH problems often demand sophisticated problem-solving strategies and the application of mathematical concepts.
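
Scoring such free-form solutions usually means extracting just the final answer, which in MATH-style solutions is conventionally wrapped in \boxed{}, and comparing it to the reference. Below is a minimal sketch of that idea; the extract_boxed_answer helper and the example strings are illustrative and not the benchmark's official grader, which also normalizes mathematically equivalent answers.

import re

def extract_boxed_answer(solution: str) -> str | None:
    # Return the contents of the last \boxed{...} in a worked solution.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def is_correct(model_solution: str, reference_solution: str) -> bool:
    # Exact match on the final answer only; the intermediate steps are not scored here.
    prediction = extract_boxed_answer(model_solution)
    gold = extract_boxed_answer(reference_solution)
    return prediction is not None and prediction == gold

print(is_correct(
    "The roots sum to $-b/a = 7$, so the answer is $\\boxed{7}$.",
    "Summing the roots gives $\\boxed{7}$.",
))  # True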

The HumanEval Benchmark is a coding-focused evaluation dataset consisting of 164 programming problems. The benchmark tests a model’s ability to generate functionally correct Python code that solves the given programming tasks. What makes HumanEval particularly valuable is that it evaluates both code generation capabilities and functional correctness through actual test case execution, rather than just superficial similarity to reference solutions. The problems range from basic string manipulation to more complex algorithms and data structures.
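
To make "functional correctness" concrete, the sketch below defines a toy problem, a candidate completion, and assert-based tests, then checks whether the candidate passes by actually executing it. This is only an illustration: the real HumanEval harness runs untrusted code in a sandboxed subprocess and reports pass@k, the probability that at least one of k sampled completions passes all tests.

# A candidate completion as it might be generated by the model.
candidate = """
def reverse_words(s):
    return " ".join(reversed(s.split()))
"""

# Assert-based tests, analogous to a HumanEval problem's test cases.
tests = """
assert reverse_words("hello world") == "world hello"
assert reverse_words("a") == "a"
"""

def passes_tests(candidate_code: str, test_code: str) -> bool:
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function
        exec(test_code, namespace)       # run the tests against it
        return True
    except Exception:
        return False

print(passes_tests(candidate, tests))  # True only if every assertion passes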

Alpaca Eval is an automated evaluation framework designed to assess the quality of instruction-following language models. It uses GPT-4 as a judge to evaluate model outputs across various dimensions including helpfulness, honesty, and harmlessness. The framework includes a dataset of 805 carefully curated prompts and can evaluate responses against multiple reference models like Claude, GPT-4, and others. What makes Alpaca Eval particularly useful is its ability to provide consistent, scalable evaluations without requiring human annotators, while still capturing nuanced aspects of model performance that traditional metrics might miss.

Alternative Evaluation Approaches

Many organizations have developed alternative evaluation methods to address the limitations of standard benchmarks:

LLM-as-Judge

Using one language model to evaluate another’s outputs has become increasingly popular. This approach can provide more nuanced feedback than traditional metrics, though it comes with its own biases and limitations.
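
In practice this means prompting a strong "judge" model with the question, the candidate answer, and a rubric, then parsing a score out of its reply. The sketch below shows that flow using huggingface_hub's InferenceClient; the judge checkpoint and the rubric wording are placeholder choices for illustration, not a prescribed setup.

from huggingface_hub import InferenceClient

JUDGE_TEMPLATE = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer from 1 (poor) to 5 (excellent) for helpfulness and accuracy.
Reply with a single integer."""

# Placeholder judge model: any strong chat model can play this role.
client = InferenceClient(model="meta-llama/Llama-3.1-70B-Instruct")

def score_answer(question: str, answer: str) -> int:
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    reply = client.chat_completion(
        messages=[{"role": "user", "content": prompt}], max_tokens=8
    )
    text = reply.choices[0].message.content
    digits = [c for c in text if c.isdigit()]
    # Fall back to the lowest score if the judge's reply cannot be parsed.
    return int(digits[0]) if digits else 1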

Evaluation Arenas

Evaluation arenas like Chatbot Arena offer a unique approach to LLM assessment through crowdsourced feedback. In these platforms, users engage in anonymous “battles” between two LLMs, asking questions and voting on which model provides better responses. This approach captures real-world usage patterns and preferences through diverse, challenging questions, with studies showing strong agreement between crowd-sourced votes and expert evaluations. While powerful, these platforms have limitations including potential user base bias, skewed prompt distributions, and a primary focus on helpfulness rather than safety considerations.
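
Behind the leaderboard, each "battle" is just a pairwise preference record, and the simplest way to aggregate such records is a head-to-head win rate per model (production leaderboards like Chatbot Arena fit Elo-style rating models on top of the same kind of votes). The vote records below are made up purely for illustration.

from collections import defaultdict

# Each vote says which of two anonymously presented models the user preferred.
votes = [
    {"model_a": "model-x", "model_b": "model-y", "winner": "model_a"},
    {"model_a": "model-y", "model_b": "model-x", "winner": "model_a"},
    {"model_a": "model-x", "model_b": "model-y", "winner": "model_a"},
]

wins = defaultdict(int)
battles = defaultdict(int)
for vote in votes:
    for model in (vote["model_a"], vote["model_b"]):
        battles[model] += 1
    wins[vote[vote["winner"]]] += 1

for model, count in battles.items():
    print(f"{model}: {wins[model] / count:.0%} win rate over {count} battles")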

Custom Benchmark Suites

Organizations often develop internal benchmark suites tailored to their specific needs and use cases. These might include domain-specific knowledge tests or evaluation scenarios that mirror actual deployment conditions.

Custom Evaluation

While standard benchmarks provide a useful baseline, they shouldn’t be your only evaluation method. Here’s how to develop a more comprehensive approach:

  1. Start with relevant standard benchmarks to establish a baseline and enable comparison with other models.

  2. Identify the specific requirements and challenges of your use case. What tasks will your model actually perform? What kinds of errors would be most problematic?

  3. Develop custom evaluation datasets that reflect your actual use case (a small example follows this list). This might include:

    • Real user queries from your domain
    • Common edge cases you’ve encountered
    • Examples of particularly challenging scenarios
  4. Consider implementing a multi-layered evaluation strategy:

    • Automated metrics for quick feedback
    • Human evaluation for nuanced understanding
    • Domain expert review for specialized applications
    • A/B testing in controlled environments
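
As mentioned in step 3, a custom evaluation set can start small: a handful of real queries with reference answers in a simple format such as JSON Lines, plus a quick automated scorer for the first layer of feedback. The file name, fields, and contains-based metric below are illustrative choices rather than a standard, and the generate function stands in for your model's inference code.

import json

# Hypothetical custom eval set: one JSON object per line with a query,
# a reference answer, and tags describing the scenario.
examples = [
    {"query": "What is the maximum daily dose of drug X?", "reference": "40 mg", "tags": ["dosage"]},
    {"query": "Can drug X be taken with food?", "reference": "yes", "tags": ["edge-case"]},
]
with open("custom_eval.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

def generate(query: str) -> str:
    # Stand-in for your finetuned model's inference call.
    return "The usual maximum is 40 mg per day."

def contains_reference(prediction: str, reference: str) -> bool:
    # Crude automated check: does the answer mention the reference?
    # Useful for quick feedback only; keep human and expert review in the loop.
    return reference.lower() in prediction.lower()

with open("custom_eval.jsonl") as f:
    dataset = [json.loads(line) for line in f]

accuracy = sum(
    contains_reference(generate(ex["query"]), ex["reference"]) for ex in dataset
) / len(dataset)
print(f"contains-reference accuracy: {accuracy:.2f}")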

Implementing Custom Evaluations

In this section, we will implement evaluation for our finetuned model. We can use lighteval to evaluate it on standard benchmarks; the library has a wide range of tasks built in. We just need to define the tasks we want to evaluate and the parameters for the evaluation.

LightEval tasks are defined using a specific format:

{suite}|{task}|{num_few_shot}|{auto_reduce}

| Parameter | Description |
|-----------|-------------|
| suite | The benchmark suite (e.g., 'mmlu', 'truthfulqa') |
| task | Specific task within the suite (e.g., 'abstract_algebra') |
| num_few_shot | Number of examples to include in the prompt (0 for zero-shot) |
| auto_reduce | Whether to automatically reduce few-shot examples if the prompt is too long (0 or 1) |

Example: "mmlu|abstract_algebra|0|0" evaluates on MMLU’s abstract algebra task with zero-shot inference.

Example Evaluation Pipeline

Let’s set up an evaluation pipeline for our finetuned model. We will evaluate the model on a set of subtasks that relate to the domain of medicine.

Here’s a complete example of evaluating on automatic benchmarks relevant to one specific domain using Lighteval with the accelerate backend:

lighteval accelerate \
    "pretrained=your-model-name" \
    "mmlu|anatomy|0|0,mmlu|high_school_biology|0|0,mmlu|high_school_chemistry|0|0,mmlu|professional_medicine|0|0" \
    --max_samples 40 \
    --batch_size 1 \
    --output_path "./results" \
    --save_generations true

Results are displayed in a tabular format showing:

|                  Task                  |Version|Metric|Value |   |Stderr|
|----------------------------------------|------:|------|-----:|---|-----:|
|all                                     |       |acc   |0.3333|±  |0.1169|
|leaderboard:mmlu:_average:5             |       |acc   |0.3400|±  |0.1121|
|leaderboard:mmlu:anatomy:5              |      0|acc   |0.4500|±  |0.1141|
|leaderboard:mmlu:high_school_biology:5  |      0|acc   |0.1500|±  |0.0819|

Lighteval also includes a Python API for more detailed evaluation tasks, which is useful for manipulating the results in a more flexible way. Check out the Lighteval documentation for more information.
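
As a rough sketch of what that API looks like, an evaluation is assembled from an evaluation tracker, pipeline parameters, a model config, and a Pipeline. Exact import paths and class names vary between lighteval versions (the model config import below is one such assumption), so treat this as an outline and check the documentation for your installed version.

from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters
# The model config class depends on your lighteval version and backend;
# a Transformers-based config is assumed here.
from lighteval.models.transformers.transformers_model import TransformersModelConfig

evaluation_tracker = EvaluationTracker(output_dir="./results", save_details=True)
pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.ACCELERATE,
    max_samples=40,  # limit samples per task, as in the CLI example above
)
model_config = TransformersModelConfig(model_name="your-model-name")

pipeline = Pipeline(
    tasks="mmlu|anatomy|0|0,mmlu|professional_medicine|0|0",
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model_config=model_config,
)
pipeline.evaluate()
pipeline.show_results()  # prints the same kind of table as the CLI
results = pipeline.get_results()  # results as a dictionary for further analysis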

✏️ Try it out! Evaluate your finetuned model on a specific task in lighteval.

End-of-chapter quiz

1. What are the main advantages of using automatic benchmarks for model evaluation?

2. Which benchmark specifically tests knowledge across 57 different subjects?

3. What is LLM-as-Judge?

4. What should be included in a comprehensive evaluation strategy?

5. What is a limitation of automatic benchmarks?

6. What is the purpose of creating custom evaluation datasets?
