The leaderboard

About

With the rapid release of numerous large language models (LLMs) and chatbots, often accompanied by bold performance claims, it can be difficult to discern genuine progress in the open-source community and to identify the current state-of-the-art models.

The Hugging Face leaderboard addresses this need by providing a transparent and standardised evaluation of these models.

It uses the EleutherAI Language Model Evaluation Harness, a unified framework for testing generative language models on a wide variety of tasks, to evaluate models on six key benchmarks.

Detailed information about the evaluation tasks, results, and how to reproduce them is provided below.
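
As a minimal illustration of what a reproduction run looks like, the sketch below evaluates one benchmark through the harness's Python API. It assumes a recent lm-eval release (`pip install lm-eval`); the model id is a placeholder, and exact task names can vary between harness versions.

```python
# Minimal sketch of one evaluation run via the harness's Python API.
# Assumes a recent lm-eval release; the model id is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face transformers backend
    model_args="pretrained=mistralai/Mistral-7B-v0.1",
    tasks=["leaderboard_ifeval"],                 # IFEval leaderboard task (name may vary)
    batch_size="auto",
)
print(results["results"])                         # per-task metric dictionary
```

The same run can also be launched from the harness's `lm_eval` command-line entry point.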

Evaluation Tasks

The models are evaluated on the following six benchmarks (a sketch for running all six through the harness follows the list):

  1. IFEval

    • Description: Tests the model’s ability to follow explicit instructions, focusing on formatting adherence.

    • Shots: 0-shot.

  2. Big Bench Hard (BBH)

    • Description: Evaluates models on 23 challenging tasks from the BIG-Bench suite.

    • Shots: 3-shot.

    • Subtasks include sports understanding, object tracking, logical deduction, and more.

  3. MATH Level 5

    • Description: Compiles high-school mathematics competition problems, filtered to the hardest difficulty tier (Level 5), with answers requiring specific output formatting.

    • Shots: 4-shot.

  4. Graduate-Level Google-Proof Q&A Benchmark (GPQA)

    • Description: Contains challenging knowledge questions crafted by PhD-level experts in various fields.

    • Shots: 0-shot.

  5. Multistep Soft Reasoning (MuSR)

    • Description: Consists of complex problems requiring reasoning and long-range context parsing.

    • Shots: 0-shot.

    • Subtasks: murder mysteries, object placement, and team allocation.

  6. Massive Multitask Language Understanding - Professional (MMLU-Pro)

    • Description: A refined version of the MMLU dataset, featuring more challenging and noise-reduced questions.

    • Shots: 5-shot.
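
Taken together, the shot settings above fully specify a leaderboard run. As a rough sketch (not the leaderboard's official script), the snippet below maps each benchmark to its few-shot count and prints the corresponding harness CLI invocations. The `leaderboard_*` task names are assumptions based on the harness's leaderboard task group and may differ between versions; the model id is a placeholder.

```python
# Hedged sketch: the six benchmarks above, mapped to the few-shot settings
# listed in the text, emitted as lm-eval CLI commands.
FEWSHOT = {
    "leaderboard_ifeval": 0,      # IFEval, 0-shot
    "leaderboard_bbh": 3,         # Big Bench Hard, 3-shot
    "leaderboard_math_hard": 4,   # MATH Level 5, 4-shot
    "leaderboard_gpqa": 0,        # GPQA, 0-shot
    "leaderboard_musr": 0,        # MuSR, 0-shot
    "leaderboard_mmlu_pro": 5,    # MMLU-Pro, 5-shot
}
MODEL = "mistralai/Mistral-7B-v0.1"  # placeholder model id

for task, shots in FEWSHOT.items():
    print(
        f"lm_eval --model hf --model_args pretrained={MODEL} "
        f"--tasks {task} --num_fewshot {shots} --batch_size auto"
    )
```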

Model Comparison

Here is a detailed comparison of different language models across various evaluation categories:

| Model Name | Average | Multi choice | Reasoning | Coding | Future Capabilities | Grade School Math | Math Problems |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude 3.5 Sonnet | 88.38% | 88.70% | 89.00% | 92.00% | 93.10% | 96.40% | 71.10% |
| Claude 3 Opus | 84.83% | 86.80% | 95.40% | 84.90% | 86.80% | 95.00% | 60.10% |
| Gemini 1.5 Pro | 80.08% | 81.90% | 92.50% | 71.90% | 84.00% | 91.70% | 58.50% |
| Gemini Ultra | 79.52% | 83.70% | 87.80% | 74.40% | 83.60% | 94.40% | 53.20% |
| GPT-4 | 79.45% | 86.40% | 95.30% | 67.00% | 83.10% | 92.00% | 52.90% |
| Llama 3 Instruct - 70B | 79.23% | 82.00% | 87.00% | 81.70% | 81.30% | 93.00% | 50.40% |
| Claude 3 Sonnet | 76.55% | 79.00% | 89.00% | 73.00% | 82.90% | 92.30% | 43.10% |
| Claude 3 Haiku | 73.08% | 75.20% | 85.90% | 75.90% | 73.70% | 88.90% | 38.90% |
| Gemini Pro | 68.28% | 71.80% | 84.70% | 67.70% | 75.00% | 77.90% | 32.60% |
| GPT-3.5 | 65.46% | 70.00% | 85.50% | 48.10% | 66.60% | 57.10% | 34.10% |
| Mixtral 8x7B | 59.79% | 70.60% | 84.40% | 40.20% | 60.76% | 74.40% | 28.40% |
| Llama 3 Instruct - 8B | - | 68.40% | - | 62.00% | 61.00% | 79.60% | 30.00% |
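
As a sanity check on the figures, the Average column appears to be the unweighted mean of the six category scores. The sketch below reproduces the Claude 3.5 Sonnet average from the numbers in its row.

```python
# Verify the "Average" column: unweighted mean of the six category scores.
# Figures are copied from the Claude 3.5 Sonnet row of the table above.
scores = {
    "Multi choice": 88.70,
    "Reasoning": 89.00,
    "Coding": 92.00,
    "Future Capabilities": 93.10,
    "Grade School Math": 96.40,
    "Math Problems": 71.10,
}
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}%")  # prints 88.38%, matching the table
```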
