open-compass/opencompass

★ 6,992 Python Audience · researcher Complexity · 4/5 Setup · hard

Mindmap

mindmap
  root((OpenCompass))
    What it does
      Evaluate LLMs
      100 plus benchmarks
      Public leaderboard
    Benchmark Areas
      General knowledge
      Math reasoning
      Coding ability
      Long documents
    Model Backends
      HuggingFace
      vLLM
      LMDeploy
      OpenAI API
    Special Features
      AI judge evaluator
      Multi-GPU support
      Config-based setup


Things people build with this

USE CASE 1

Run a new open-source language model through math, coding, and general knowledge benchmarks to get standardized scores for a research paper.

USE CASE 2

Compare GPT-4 against Llama or Qwen across multiple tasks using a consistent evaluation framework instead of ad-hoc tests.

USE CASE 3

Use a second AI model as a judge to evaluate open-ended text generation quality where there is no single correct answer.

USE CASE 4

Distribute a large evaluation across multiple GPUs or machines to finish faster.

Tech stack

Python, HuggingFace, vLLM, LMDeploy, OpenAI API

Getting it running

Difficulty · hard Time to first run · 1h+

Requires GPU access and potentially large model weights. Distributed evaluation additionally needs a multi-machine or multi-GPU configuration.

No license information was provided in the explanation.

In plain English

OpenCompass is a Python platform for testing and comparing AI language models. When researchers or companies build or choose an AI model (the kind that generates text, answers questions, or writes code), they need a consistent way to measure how well it performs across a range of tasks. OpenCompass provides that measurement framework, supporting over 100 benchmarks and more than 100 models, including GPT-4, Llama, Qwen, Mistral, InternLM, Claude, and GLM.

The platform runs a model through standardized tests covering general knowledge, mathematical reasoning, coding ability, long-document understanding, and scientific knowledge. It then produces scores that can be compared across models and tracked over time. Results are published on a public leaderboard at rank.opencompass.org.cn, and a benchmark hub at hub.opencompass.org.cn collects the datasets used for evaluation.

For evaluation that requires judgment calls rather than fixed right-or-wrong answers, the platform includes tools that use a second AI model as a judge. There are also specialized evaluators for mathematical reasoning and for cascading multiple evaluation methods in sequence.

Evaluations can be distributed across multiple machines or GPU clusters to finish faster. Models connect through several backends, including HuggingFace, vLLM, and LMDeploy, as well as API-based models from OpenAI and other providers. Configuration is file-based, and the repository includes example scripts for evaluating specific models or benchmarks.

The project is maintained by the OpenCompass team at Shanghai AI Laboratory and was recommended by Meta AI as a validation tool for Llama models. Documentation is available at opencompass.readthedocs.io.
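The idea of cascading evaluation methods, with a judge model as the fallback for answers a fixed rule cannot settle, can be illustrated with a small standalone sketch. This is not the OpenCompass API: `exact_match`, `cascade_score`, and `toy_judge` are hypothetical names invented for illustration, and the judge here is a trivial substring check standing in for a call to a second model.

```python
from typing import Callable, Optional


def exact_match(prediction: str, reference: str) -> Optional[bool]:
    """Cheap rule-based check (hypothetical helper).

    Returns True on a normalized exact match, or None when the rule
    cannot decide and the next evaluator should take over.
    """
    if prediction.strip().lower() == reference.strip().lower():
        return True
    return None


def cascade_score(
    prediction: str,
    reference: str,
    judge: Callable[[str, str], bool],
) -> bool:
    """Try the rule first; fall back to the judge only when undecided."""
    verdict = exact_match(prediction, reference)
    if verdict is not None:
        return verdict
    return judge(prediction, reference)


def toy_judge(prediction: str, reference: str) -> bool:
    # Stand-in for a second model acting as judge (assumption):
    # accept any prediction that contains the reference answer.
    return reference.strip().lower() in prediction.lower()


# (prediction, reference) pairs for a tiny made-up benchmark.
samples = [("4", "4"), ("The answer is 4.", "4"), ("5", "4")]
scores = [cascade_score(p, r, toy_judge) for p, r in samples]
accuracy = sum(scores) / len(scores)
print(f"accuracy = {accuracy:.2f}")
```

In this sketch the expensive judge runs only for the two samples the exact-match rule leaves undecided, which is the usual motivation for ordering evaluators from cheapest to most costly.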

Copy-paste prompts

Prompt 1

I want to evaluate a HuggingFace model on the MMLU and HumanEval benchmarks using OpenCompass. Show me the config file and the command to run the evaluation.

Prompt 2

How do I set up OpenCompass to evaluate a model hosted behind an OpenAI-compatible API endpoint instead of loading weights locally?

Prompt 3

I want to distribute an OpenCompass evaluation across 4 GPUs to speed it up. Walk me through the configuration changes needed.

Prompt 4

How do I add a custom evaluation dataset to OpenCompass and wire it into the benchmark pipeline?


Verify against the repo before relying on details.