The leaderboard

About

With the rapid release of numerous large language models (LLMs) and chatbots, often accompanied by bold claims regarding their performance, it can be challenging to discern genuine progress from the open-source community and identify the current state-of-the-art models.

The Hugging Face Open LLM Leaderboard addresses this need by providing a transparent and standardised evaluation of these models.

The leaderboard uses the EleutherAI Language Model Evaluation Harness (lm-evaluation-harness), a unified framework for testing generative language models on a wide range of tasks, to evaluate models on six key benchmarks.

Detailed information about the evaluation tasks, results, and how to reproduce them is provided below.
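For orientation, here is a minimal sketch of what such an evaluation run looks like using the harness's Python API (lm-evaluation-harness v0.4+). The `leaderboard` task group name and the model identifier are assumptions that may differ between harness versions; this is a sketch, not the leaderboard's exact submission pipeline.

```python
# Minimal sketch: scoring a model on the leaderboard benchmarks with the
# EleutherAI lm-evaluation-harness (v0.4+). The "leaderboard" task group and
# the model name below are assumptions and may vary between harness versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=bfloat16",
    tasks=["leaderboard"],  # group covering IFEval, BBH, MATH, GPQA, MuSR, MMLU-PRO
    batch_size=8,
    device="cuda:0",
)

# Per-task metrics (accuracy, exact match, ...) keyed by task name.
for task_name, metrics in results["results"].items():
    print(task_name, metrics)
```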

Evaluation Tasks

The models are evaluated on the following six benchmarks (a sketch of running one of them in isolation follows the list):

  1. IFEval

    • Description: Tests the model’s ability to follow explicit instructions, focusing on formatting adherence.

    • Shots: 0-shot.

  2. Big Bench Hard (BBH)

    • Description: Evaluates models on 23 challenging tasks from the BIG-Bench suite.

    • Shots: 3-shot.

    • Subtasks include sports understanding, object tracking, logical deduction, and more.

  3. MATH Level 5

    • Description: High-school mathematics competition problems from the MATH dataset, restricted to the hardest (Level 5) problems and requiring answers in a specific output format.

    • Shots: 4-shot.

  4. Graduate-Level Google-Proof Q&A Benchmark (GPQA)

    • Description: Contains challenging knowledge questions crafted by PhD-level experts in various fields.

    • Shots: 0-shot.

  5. Multistep Soft Reasoning (MuSR)

    • Description: Consists of complex problems requiring reasoning and long-range context parsing.

    • Shots: 0-shot.

    • Subtasks: murder mysteries, object placement, and team allocation.

  6. Massive Multitask Language Understanding - Professional (MMLU-PRO)

    • Description: A refined version of the MMLU dataset, featuring more challenging and noise-reduced questions.

    • Shots: 5-shot.
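As a rough illustration of how these settings translate into harness runs, the sketch below evaluates a single benchmark in isolation. The task name `leaderboard_bbh` is an assumption based on recent harness releases, where the listed shot counts are baked into each task's configuration; the `limit` argument simply keeps the run small for a quick check.

```python
# Sketch: running a single leaderboard benchmark (BBH, 3-shot) on its own.
# "leaderboard_bbh" is the assumed task name in recent lm-evaluation-harness
# releases; its 3-shot setting is part of the task configuration itself.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mixtral-8x7B-Instruct-v0.1",
    tasks=["leaderboard_bbh"],
    batch_size=4,
    limit=20,  # only 20 examples per subtask, for a quick sanity check
)

for task_name, metrics in results["results"].items():
    print(task_name, metrics)
```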

Model Comparison

Here is a detailed comparison of leading language models across commonly reported evaluation benchmarks:

| Model | Average | MMLU | HellaSwag | HumanEval | BBH | GSM8K | MATH |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude 3.5 Sonnet | 88.38% | 88.70% | 89.00% | 92.00% | 93.10% | 96.40% | 71.10% |
| Claude 3 Opus | 84.83% | 86.80% | 95.40% | 84.90% | 86.80% | 95.00% | 60.10% |
| Gemini 1.5 Pro | 80.08% | 81.90% | 92.50% | 71.90% | 84.00% | 91.70% | 58.50% |
| Gemini Ultra | 79.52% | 83.70% | 87.80% | 74.40% | 83.60% | 94.40% | 53.20% |
| GPT-4 | 79.45% | 86.40% | 95.30% | 67.00% | 83.10% | 92.00% | 52.90% |
| Llama 3 Instruct - 70B | 79.23% | 82.00% | 87.00% | 81.70% | 81.30% | 93.00% | 50.40% |
| Claude 3 Sonnet | 76.55% | 79.00% | 89.00% | 73.00% | 82.90% | 92.30% | 43.10% |
| Claude 3 Haiku | 73.08% | 75.20% | 85.90% | 75.90% | 73.70% | 88.90% | 38.90% |
| Gemini Pro | 68.28% | 71.80% | 84.70% | 67.70% | 75.00% | 77.90% | 32.60% |
| GPT-3.5 | 65.46% | 70.00% | 85.50% | 48.10% | 66.60% | 57.10% | 34.10% |
| Mixtral 8x7B | 59.79% | 70.60% | 84.40% | 40.20% | 60.76% | 74.40% | 28.40% |
| Llama 3 Instruct - 8B | - | 68.40% | - | 62.00% | 61.00% | 79.60% | 30.00% |
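The Average column appears to be the simple mean of the six benchmark scores; the quick check below reproduces the figure for Claude 3 Opus from the numbers in the table.

```python
# Reproduce Claude 3 Opus's Average as the mean of its six benchmark scores
# from the table above (MMLU, HellaSwag, HumanEval, BBH, GSM8K, MATH).
opus_scores = [86.80, 95.40, 84.90, 86.80, 95.00, 60.10]
average = sum(opus_scores) / len(opus_scores)
print(f"{average:.2f}%")  # prints 84.83%, matching the table
```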
