explaingit

openai/evals

18,487PythonAudience · developerComplexity · 3/5MaintainedLicenseSetup · moderate

TLDR

A framework and registry of benchmark tests for evaluating how well large language models perform on different tasks, with tools to create custom evaluations.

Mindmap

mindmap
  root((repo))
    What it does
      Benchmark registry
      Custom eval builder
      Model grading
      Performance testing
    How to use
      JSON data format
      YAML config
      OpenAI API key
      Python code optional
    Use cases
      Test model quality
      Build private evals
      Measure capabilities
      Compare model versions
    Tech stack
      Python framework
      Git LFS storage
      Snowflake logging
      OpenAI API

Things people build with this

USE CASE 1

Test how well your language model performs on specific tasks before deploying it to users.

USE CASE 2

Create private benchmarks using your own proprietary data to measure model quality without sharing data publicly.

USE CASE 3

Compare performance across different model versions or configurations to pick the best one for your application.

USE CASE 4

Build evaluations that use another AI model to automatically grade whether outputs are correct or helpful.

Tech stack

PythonOpenAI APIGit LFSSnowflakeYAMLJSON

Getting it running

Difficulty · moderate Time to first run · 30min

Requires OpenAI API key and potentially Snowflake credentials for full functionality.

Use freely for any purpose, including commercial use, as long as you keep the copyright notice and license text.

In plain English

OpenAI Evals is a framework for evaluating large language models (LLMs), AI systems that generate text, and an open-source registry of benchmark tests for measuring their capabilities. An "eval" in this context is a structured test that runs a model against a set of inputs and measures how well its outputs match expected results. The project serves two purposes. First, it provides an existing library of benchmarks that test different capabilities of language models. Second, it gives developers a framework to write their own custom evaluations for use cases specific to their application, including private evals that use proprietary data without exposing it publicly. Custom evals can be built in two ways: model-graded evals, where another language model judges whether the output is correct (these are currently accepted as contributions), or evals with custom Python code (currently not accepted as community submissions). For basic evals, no coding is required, you provide data in JSON format and specify parameters in a YAML configuration file. To run evals, you need an OpenAI API key and Python 3.9 or later. The eval registry data is stored using Git LFS (Large File Storage), a Git extension for tracking large binary files, which needs to be fetched separately after cloning the repository. Results can optionally be logged to a Snowflake database. An interactive dashboard version is also available directly in the OpenAI platform without needing this codebase.

Copy-paste prompts

Prompt 1
How do I set up OpenAI Evals to run benchmarks on a language model using my own test data in JSON format?
Prompt 2
Show me how to create a custom evaluation in OpenAI Evals where another model grades whether the output is correct.
Prompt 3
I want to log eval results to Snowflake, what configuration do I need in OpenAI Evals?
Prompt 4
How do I write a basic eval in OpenAI Evals without coding, just using YAML and JSON files?
Prompt 5
What's the difference between model-graded evals and custom Python evals in OpenAI Evals, and which should I use?
Open on GitHub → Explain another repo

Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.