# The leaderboard

### About

With the rapid release of numerous large language models (LLMs) and chatbots, often accompanied by bold claims about their performance, it can be hard to distinguish genuine progress in the open-source community and to identify the current state-of-the-art models.

The Hugging Face Open LLM Leaderboard addresses this need by providing a transparent and standardised evaluation of these models.

{% embed url="<https://huggingface.co/open-llm-leaderboard>" %}

The leaderboard uses the EleutherAI Language Model Evaluation Harness, a unified framework for testing generative language models on a range of tasks, to evaluate the models on six key benchmarks.

The evaluation tasks and a comparison of model results are described below.

### <mark style="color:purple;">Evaluation Tasks</mark>

The models are evaluated on the following six benchmarks:

1. <mark style="color:blue;">**IFEval**</mark>
   * **Description**: Tests the model’s ability to <mark style="color:yellow;">follow explicit instructions</mark>, focusing on formatting adherence.
   * **Shots**: 0-shot.
2. <mark style="color:blue;">**Big Bench Hard (BBH)**</mark>
   * **Description**: Evaluates models on 23 challenging tasks from the BIG-Bench dataset.
   * **Shots**: 3-shot.
   * **Subtasks** include sports understanding, object tracking, logical deduction, and more.
3. <mark style="color:blue;">**MATH Level 5**</mark>
   * **Description**: Compiles high-school level competition problems requiring specific output formatting.
   * **Shots**: 4-shot.
4. <mark style="color:blue;">**Graduate-Level Google-Proof Q\&A Benchmark (GPQA)**</mark>
   * **Description**: Contains challenging knowledge questions crafted by PhD-level experts in various fields.
   * **Shots**: 0-shot.
5. <mark style="color:blue;">**Multistep Soft Reasoning (MuSR)**</mark>
   * **Description**: Consists of complex problems requiring reasoning and long-range context parsing.
   * **Shots**: 0-shot.
   * **Subtasks** include murder mysteries, object placement, and team allocation.
6. <mark style="color:blue;">**Massive Multitask Language Understanding - Professional (MMLU-PRO)**</mark>
   * **Description**: A refined version of the MMLU dataset, featuring more challenging and noise-reduced questions.
   * **Shots**: 5-shot.
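
The few-shot settings listed above can be captured as a small configuration table. The following is a minimal sketch only: the `Benchmark` class and the `group_by_shots` helper are hypothetical names introduced for illustration, and they are not part of the leaderboard or of the evaluation harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    """Hypothetical record pairing a benchmark with its few-shot setting."""
    name: str
    shots: int


# Values copied from the list above.
BENCHMARKS = [
    Benchmark("IFEval", 0),
    Benchmark("BBH", 3),
    Benchmark("MATH Level 5", 4),
    Benchmark("GPQA", 0),
    Benchmark("MuSR", 0),
    Benchmark("MMLU-PRO", 5),
]


def group_by_shots(benchmarks):
    """Group benchmark names by the number of in-context examples they use."""
    groups = {}
    for bench in benchmarks:
        groups.setdefault(bench.shots, []).append(bench.name)
    return groups


if __name__ == "__main__":
    for shots, names in sorted(group_by_shots(BENCHMARKS).items()):
        print(f"{shots}-shot: {', '.join(names)}")
```

Grouping by shot count makes it easy to see that half of the benchmarks are run zero-shot, while the remaining three each use a different number of in-context examples.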

### <mark style="color:purple;">Model Comparison</mark>

The table below compares language models across several evaluation categories. A dash marks a score that is not reported.

<table data-view="cards"><thead><tr><th>Model Name</th><th>Average</th><th>Multi choice</th><th>Reasoning</th><th>Coding</th><th>Future Capabilities</th><th>Grade School Math</th><th>Math Problems</th></tr></thead><tbody><tr><td><strong>Claude 3.5 Sonnet</strong></td><td>88.38%</td><td>88.70%</td><td>89.00%</td><td>92.00%</td><td>93.10%</td><td>96.40%</td><td>71.10%</td></tr><tr><td><strong>Claude 3 Opus</strong></td><td>84.83%</td><td>86.80%</td><td>95.40%</td><td>84.90%</td><td>86.80%</td><td>95.00%</td><td>60.10%</td></tr><tr><td><strong>Gemini 1.5 Pro</strong></td><td>80.08%</td><td>81.90%</td><td>92.50%</td><td>71.90%</td><td>84.00%</td><td>91.70%</td><td>58.50%</td></tr><tr><td><strong>Gemini Ultra</strong></td><td>79.52%</td><td>83.70%</td><td>87.80%</td><td>74.40%</td><td>83.60%</td><td>94.40%</td><td>53.20%</td></tr><tr><td><strong>GPT-4</strong></td><td>79.45%</td><td>86.40%</td><td>95.30%</td><td>67.00%</td><td>83.10%</td><td>92.00%</td><td>52.90%</td></tr><tr><td><strong>Llama 3 Instruct - 70B</strong></td><td>79.23%</td><td>82.00%</td><td>87.00%</td><td>81.70%</td><td>81.30%</td><td>93.00%</td><td>50.40%</td></tr><tr><td><strong>Claude 3 Sonnet</strong></td><td>76.55%</td><td>79.00%</td><td>89.00%</td><td>73.00%</td><td>82.90%</td><td>92.30%</td><td>43.10%</td></tr><tr><td><strong>Claude 3 Haiku</strong></td><td>73.08%</td><td>75.20%</td><td>85.90%</td><td>75.90%</td><td>73.70%</td><td>88.90%</td><td>38.90%</td></tr><tr><td><strong>Gemini Pro</strong></td><td>68.28%</td><td>71.80%</td><td>84.70%</td><td>67.70%</td><td>75.00%</td><td>77.90%</td><td>32.60%</td></tr><tr><td><strong>GPT-3.5</strong></td><td>65.46%</td><td>70.00%</td><td>85.50%</td><td>48.10%</td><td>66.60%</td><td>57.10%</td><td>34.10%</td></tr><tr><td><strong>Mixtral 8x7B</strong></td><td>59.79%</td><td>70.60%</td><td>84.40%</td><td>40.20%</td><td>60.76%</td><td>74.40%</td><td>28.40%</td></tr><tr><td><strong>Llama 3 Instruct - 
8B</strong></td><td>-</td><td>68.40%</td><td>-</td><td>62.00%</td><td>61.00%</td><td>79.60%</td><td>30.00%</td></tr></tbody></table>
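
The Average column in the table is consistent with a plain arithmetic mean of the six category scores, rounded to two decimals. The sketch below checks this for two rows, assuming that an average is left blank whenever any category score is missing (as in the Llama 3 Instruct - 8B row). The `average_score` function is a hypothetical helper written for this illustration, not the leaderboard's own code.

```python
from typing import Optional, Sequence


def average_score(scores: Sequence[Optional[float]]) -> Optional[float]:
    """Return the mean of the scores rounded to 2 decimals.

    Returns None if any score is missing (assumption: a row with a
    missing category has no reported average).
    """
    if any(score is None for score in scores):
        return None
    return round(sum(scores) / len(scores), 2)


# Category scores in table order: Multi choice, Reasoning, Coding,
# Future Capabilities, Grade School Math, Math Problems.
claude_35_sonnet = [88.70, 89.00, 92.00, 93.10, 96.40, 71.10]
claude_3_opus = [86.80, 95.40, 84.90, 86.80, 95.00, 60.10]
llama_3_8b = [68.40, None, 62.00, 61.00, 79.60, 30.00]

if __name__ == "__main__":
    print(average_score(claude_35_sonnet))  # matches the 88.38% in the table
    print(average_score(claude_3_opus))     # matches the 84.83% in the table
    print(average_score(llama_3_8b))        # None, shown as "-" in the table
```

Because the average weights every category equally, a single low score such as Math Problems pulls the overall figure down noticeably even when the other categories are above 85%.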

