Replicating results
Hi there - I'm trying to replicate the results you have for a handful of open-source models with 10B params or less (e.g., Llama 3.2, Mistral-7B, others). My hardware setup is a single A100.
It currently takes ~45 minutes to run a full 5-shot evaluation. I'm using vLLM, and mostly using your script with a couple of modifications.
A couple of questions:
(A) You report in your paper that it takes 20-30 minutes to run a 7B-param model on a single A100. Any ideas for specific modifications or tricks to make the evals faster, apart from using vLLM - e.g., changes to your script as it currently exists? Quantization, I guess - anything else?
(B) When you run your own eval scripts on your leaderboard for larger models (e.g., Mistral-Large-Instruct, which is 123B params), what hardware setup do you use? In general, I'd love to understand the hardware structure you use for generating your leaderboard.
Thanks very much!
Hello,
Thank you for your interest in replicating our evaluation results. You raise some good points about the evaluation time.
Regarding your questions:
(A) The evaluation time for 7B-parameter models can indeed vary based on several factors, particularly the length of the generated text. While our paper reports 20-30 minutes on a single A100, the 45 minutes you're seeing is still within a reasonable range. I'm curious about your motivation for accelerating the evaluation further - is there a specific constraint you're working with? In general, both 30 and 45 minutes are reasonable runtimes for a thorough 5-shot evaluation.
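If you do want to squeeze out more throughput, here is a minimal sketch of the vLLM knobs that usually matter most. To be clear, the model name, batch size, and token budget below are placeholders for illustration, not our actual leaderboard configuration:

```python
from vllm import LLM, SamplingParams

# Illustrative settings only; model name and values are placeholders.
llm = LLM(
    model="mistralai/Mistral-7B-v0.1",
    dtype="bfloat16",             # bf16 halves memory vs. fp32
    gpu_memory_utilization=0.90,  # leave a little headroom for the KV cache
    max_num_seqs=256,             # larger batches generally improve throughput
)

# Greedy decoding with a tight token budget keeps generations short,
# which is usually the dominant cost in few-shot evals.
params = SamplingParams(temperature=0.0, max_tokens=32)

prompts = ["<your 5-shot eval prompts here>"]  # placeholder inputs
outputs = llm.generate(prompts, params)
```

Capping max_tokens tends to help the most, since generation length dominates runtime; quantization (e.g., an AWQ checkpoint passed via vLLM's quantization option) can also help, at some cost to score fidelity.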
(B) For evaluating larger models like the 123B Mistral-Large-Instruct, we typically use a multi-GPU setup. In bf16, 123B parameters alone occupy roughly 246 GB of weights (2 bytes per parameter), so models of this size generally require 4 or 8 A100 or H100 GPUs with 80GB of memory each, with the larger configuration leaving more headroom for the KV cache. The exact configuration depends on the model architecture and memory requirements.
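For reference, a minimal sketch of how a model at this scale would be loaded with tensor parallelism in vLLM - the checkpoint is the public Hugging Face one, and the GPU count is illustrative rather than our exact setup:

```python
from vllm import LLM

# Sketch of a tensor-parallel launch for a ~123B model; the GPU count
# is an assumption, not our exact leaderboard configuration.
llm = LLM(
    model="mistralai/Mistral-Large-Instruct-2407",
    tensor_parallel_size=4,  # shards weights across 4 GPUs; 8 gives more
                             # headroom for the KV cache at long contexts
    dtype="bfloat16",
)
```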
Best