The leaderboard
About
With the rapid release of numerous large language models (LLMs) and chatbots, often accompanied by bold claims regarding their performance, it can be challenging to discern genuine progress from the open-source community and identify the current state-of-the-art models.
The Hugging Face Open LLM Leaderboard addresses this need by providing a transparent and standardised evaluation of these models.
It uses the EleutherAI Language Model Evaluation Harness, a unified framework for testing generative language models on various tasks, to evaluate the models on six key benchmarks.
Detailed information about the evaluation tasks, results, and how to reproduce them is provided below.
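As a rough sketch of what reproducing a run can look like through the harness's Python API (the "leaderboard" task group name and the example checkpoint are assumptions; exact task identifiers depend on the installed harness version):

```python
# Sketch: evaluate one model on the leaderboard benchmarks via the
# lm-evaluation-harness Python API. The "leaderboard" task group name and
# the checkpoint below are assumptions; check your harness version's task
# list for the exact identifiers.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",  # example checkpoint
    tasks=["leaderboard"],  # assumed group covering the six benchmarks below
    batch_size="auto",
)

# Per-task metrics are keyed by task name in results["results"].
for task_name, metrics in results["results"].items():
    print(task_name, metrics)
```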
Evaluation Tasks
The models are evaluated on the following six benchmarks; a sketch of how their shot settings might be applied follows the list:
IFEval
Description: Tests the model’s ability to follow explicit instructions, focusing on formatting adherence.
Shots: 0-shot.
Big Bench Hard (BBH)
Description: Evaluates models on 23 challenging tasks from the BigBench dataset.
Shots: 3-shot.
Subtasks include sports understanding, object tracking, logical deduction, and more.
MATH Level 5
Description: High-school-level math competition problems drawn from the hardest tier (Level 5) of the MATH benchmark, with answers required in a specific output format.
Shots: 4-shot.
Graduate-Level Google-Proof Q&A Benchmark (GPQA)
Description: Contains challenging knowledge questions crafted by PhD-level experts in various fields.
Shots: 0-shot.
Multistep Soft Reasoning (MuSR)
Description: Consists of complex problems requiring reasoning and long-range context parsing.
Shots: 0-shot.
Subtasks: murder mysteries, object placement, and team allocation.
Massive Multitask Language Understanding - Professional (MMLU-PRO)
Description: A refined version of the MMLU dataset, featuring more challenging and noise-reduced questions.
Shots: 5-shot.
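A minimal sketch, assuming "leaderboard_*" task names in the harness, of how these shot counts map onto per-benchmark runs (the harness's bundled leaderboard configs already pin these values, so the explicit num_fewshot here is purely illustrative):

```python
# Sketch: run each benchmark with the shot count listed above. Task names
# and the checkpoint are assumptions; verify them against your harness version.
import lm_eval

FEWSHOT = {
    "leaderboard_ifeval": 0,
    "leaderboard_bbh": 3,
    "leaderboard_math_hard": 4,
    "leaderboard_gpqa": 0,
    "leaderboard_musr": 0,
    "leaderboard_mmlu_pro": 5,
}

for task, shots in FEWSHOT.items():
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",  # example checkpoint
        tasks=[task],
        num_fewshot=shots,
        batch_size="auto",
    )
    for name, metrics in out["results"].items():
        print(name, metrics)
```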
Model Comparison
The leaderboard provides a detailed, sortable comparison of the evaluated models across these benchmark categories.
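As a rough illustration of how per-benchmark scores can be collapsed into a single ranking, the sketch below averages scores for two models; the names and numbers are hypothetical placeholders, not actual leaderboard results.

```python
# Hypothetical example: rank models by their mean score across the six
# benchmarks. All names and scores are placeholders, not real results.
scores = {
    "model-a": {"IFEval": 80.1, "BBH": 48.9, "MATH Lvl 5": 21.3,
                "GPQA": 10.1, "MuSR": 8.4, "MMLU-PRO": 37.7},
    "model-b": {"IFEval": 61.5, "BBH": 25.2, "MATH Lvl 5": 5.9,
                "GPQA": 4.4, "MuSR": 3.3, "MMLU-PRO": 20.6},
}

ranking = sorted(
    ((name, sum(vals.values()) / len(vals)) for name, vals in scores.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, average in ranking:
    print(f"{name}: {average:.1f}")
```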