eleutherai/lm-evaluation-harness

★ 12,550 | Python | Audience · researcher | Complexity · 3/5 | Setup · moderate

Mindmap

mindmap
  root((repo))
    What It Does
      AI model benchmarking
      Standardized evaluation
      Reproducible results
    Benchmarks Included
      60 plus task suites
      Reading comprehension
      Math and coding tasks
    Model Support
      Hugging Face models
      vLLM fast inference
      OpenAI and Anthropic APIs
    Customization
      YAML task definitions
      Custom scoring methods
      Python API access


Things people build with this

USE CASE 1

Run standardized benchmarks on any Hugging Face model to measure its performance on reading comprehension, reasoning, and math tasks.

USE CASE 2

Compare two language models head-to-head on the same tasks using reproducible settings so results can be cited in research.

USE CASE 3

Define a custom evaluation task with a YAML file to test a model on your own domain-specific prompts and scoring logic.

USE CASE 4

Evaluate commercial models via API (OpenAI or Anthropic) against open academic benchmarks and compare them with open-source alternatives.

Tech stack

Python · vLLM · Hugging Face

Getting it running

Difficulty · moderate | Time to first run · 30 min

Model backends such as vLLM or transformers must be installed separately, and API keys are needed to evaluate commercial models.

In plain English

The Language Model Evaluation Harness is a Python framework for testing AI language models against a wide range of standardized benchmarks. It is maintained by EleutherAI, a research group focused on open AI research, and it is the same framework that powers the Hugging Face Open LLM Leaderboard, which many people use to compare publicly available models.

Its main purpose is to give researchers and developers a consistent, reproducible way to measure how well a language model performs on tasks such as reading comprehension, common-sense reasoning, math, and coding. Over 60 standard academic benchmarks are included, covering hundreds of subtasks. Because everyone runs the same prompts in the same way, results from different teams or papers can be compared directly.

The framework supports several ways to load and run models. You can evaluate models from the Hugging Face model library, run models locally using vLLM for faster inference, or call commercial API providers such as OpenAI or Anthropic. Quantized models (compressed to use less memory) are also supported through additional optional packages. The base installation is lightweight; you install the model backend separately, depending on which kind of model you want to test.

Running an evaluation from the command line involves specifying the model, the tasks to run, and a batch size. Results are printed to the terminal and can be saved to a file. There is also a Python API for running evaluations programmatically inside scripts or notebooks. Custom tasks can be defined in YAML configuration files, which let you specify prompts, answer extraction logic, and scoring methods without writing Python code.

The project's changelog shows regular additions, including multimodal (text plus image) evaluation support, support for chain-of-thought reasoning traces, and a refactored command-line interface with subcommands.
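The command-line workflow described above (a model, a list of tasks, a batch size, and an optional output location) can be sketched as a small Python helper that assembles the invocation without running it. This is a minimal sketch, not the project's own tooling: `build_lm_eval_command` is a hypothetical helper, the flag names (`--model`, `--model_args`, `--tasks`, `--batch_size`, `--output_path`) are assumed from the project's long-standing `lm_eval` entry point and may differ in releases that use the refactored subcommand interface, and the model and task names are illustrative only.

```python
import shlex


def build_lm_eval_command(model, model_args, tasks, batch_size=8, output_path=None):
    """Return an argv list for an lm_eval run (hypothetical helper).

    Flag names are assumptions based on the classic `lm_eval` CLI;
    check `lm_eval --help` for the installed version before relying on them.
    """
    if not tasks:
        raise ValueError("at least one task is required")
    argv = [
        "lm_eval",
        "--model", model,              # backend name, e.g. "hf" or "vllm"
        "--model_args", model_args,    # backend-specific key=value string
        "--tasks", ",".join(tasks),    # tasks are passed as a comma-separated list
        "--batch_size", str(batch_size),
    ]
    if output_path is not None:
        # Only request saved results when a location was given.
        argv += ["--output_path", output_path]
    return argv


# Illustrative values only: a small Hugging Face model and two common tasks.
command = build_lm_eval_command(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag", "arc_challenge"],
    batch_size=8,
    output_path="results",
)

# Print a shell-safe version of the command for review or copy-paste.
print(shlex.join(command))
```

Building the command as a list rather than a single string means it could later be handed to `subprocess.run(command, check=True)` without shell quoting concerns, once the chosen backend is installed.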

Copy-paste prompts

Prompt 1

Use eleutherai/lm-evaluation-harness to benchmark a Hugging Face model on HellaSwag and ARC-Challenge using vLLM for faster inference, then save results to JSON.

Prompt 2

Create a custom YAML task file in lm-evaluation-harness to evaluate a model on a multiple-choice dataset I provide in CSV format.

Prompt 3

Run lm-evaluation-harness to compare a quantized 4-bit version and a full-precision version of the same model on MMLU and report the accuracy gap.

Prompt 4

Set up lm-evaluation-harness to call the Anthropic API and evaluate a Claude model on the GSM8K math benchmark, then print the results.


Verify against the repo before relying on details.