Simple-evals is a lightweight Python library from OpenAI for measuring how language models perform on standardized benchmarks. OpenAI published it to show the methodology behind the accuracy numbers it reports when releasing new models, so others can see exactly how those scores are produced.

The library runs models through several well-known benchmarks from the AI research community: MMLU, which covers a wide range of academic subjects; GPQA, a set of difficult graduate-level science questions; MATH, which tests mathematical problem-solving; HumanEval, which tests code generation; and SimpleQA, which checks factual accuracy on short questions. The repository also contains reference implementations for HealthBench, BrowseComp, and additional benchmarks.

A key design choice is the use of zero-shot prompts with simple, plain instructions rather than elaborate setups. The README explains that some older evaluation methods relied on few-shot examples or role-playing prompts, which were carry-overs from evaluating base models and do not reflect how modern instruction-tuned models behave in practice.

The repository includes a large results table comparing many models from OpenAI and other providers across these benchmarks, giving a reference point for how different systems perform on the same tests.

In July 2025, OpenAI announced that the repository will no longer be updated for new models or benchmark results. It continues to host reference implementations for HealthBench, BrowseComp, and SimpleQA, and the code can still be used to run evaluations, but active maintenance has stopped. It is intended for researchers and developers who want to reproduce or study published benchmark numbers.
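To make the zero-shot idea concrete, here is a minimal Python sketch of how a multiple-choice question could be posed with one plain instruction and then scored. It is an illustration only, not code from simple-evals: the template wording and the names `build_prompt`, `extract_answer`, and `accuracy` are hypothetical, and the model call itself is left out.

```python
import re

# Hypothetical template: one plain instruction, with no few-shot examples
# and no role-playing persona. The wording is an assumption for illustration.
PROMPT_TEMPLATE = (
    "Answer the following multiple choice question. "
    "End your response with a line of the form 'Answer: X', "
    "where X is one of A, B, C, or D.\n\n"
    "{question}\n\n"
    "A) {a}\nB) {b}\nC) {c}\nD) {d}"
)

# Matches the final answer line, tolerating case and spacing differences.
ANSWER_PATTERN = re.compile(r"Answer\s*:\s*([ABCD])", re.IGNORECASE)


def build_prompt(question, choices):
    """Format a single zero-shot prompt from a question and four choices."""
    a, b, c, d = choices
    return PROMPT_TEMPLATE.format(question=question, a=a, b=b, c=c, d=d)


def extract_answer(response):
    """Return the last answer letter found in a response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].upper() if matches else None


def accuracy(responses, gold):
    """Fraction of responses whose extracted letter equals the gold letter."""
    if not gold:
        return 0.0
    correct = sum(extract_answer(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)
```

A response with no recognizable answer line counts as incorrect here, which keeps the score conservative when a model ignores the requested format.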
Verify against the repo before relying on details.