explaingit

confident-ai/deepeval

15,363 | Python | Audience · developer | Complexity · 3/5 | Setup · moderate

TLDR

An open-source testing framework for AI apps. It works like Pytest but has built-in metrics for scoring chatbot, agent, and RAG output quality, so you can catch regressions before they ship.

Mindmap

mindmap
  root((deepeval))
    What it does
      LLM app testing
      Pytest-style tests
      Metric scoring
    Metrics
      G-Eval custom rubric
      RAG faithfulness recall
      Agent task completion
      Multi-turn chat quality
    Use cases
      Prompt regression testing
      Model swap validation
      CI quality gates
    Integrations
      LangChain OpenAI SDK
      Confident AI platform


Things people build with this

USE CASE 1

Write repeatable test cases to check whether your chatbot or AI agent gives correct, relevant answers.

USE CASE 2

Compare prompt versions or model swaps (e.g. OpenAI to Claude) to measure which actually performs better.

USE CASE 3

Test a RAG pipeline for faithfulness and contextual recall after changing the retrieval or chunking setup.

USE CASE 4

Add LLM quality gates to CI so model regressions are caught before a prompt or model change ships to users.
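
The quality-gate idea in the last use case can be sketched in plain Python, assuming you already have per-test metric scores for a baseline version and a candidate version. `regression_gate` and its `tolerance` parameter are hypothetical names made up for this illustration; they are not part of DeepEval.

```python
from statistics import mean


def regression_gate(baseline_scores, candidate_scores, tolerance=0.02):
    """Compare mean metric scores of two versions (hypothetical helper).

    Returns (passed, delta), where delta is the candidate mean minus the
    baseline mean. The gate passes unless the candidate's mean score drops
    by more than `tolerance`.
    """
    delta = mean(candidate_scores) - mean(baseline_scores)
    return delta >= -tolerance, delta


# Illustrative scores only, not real measurements.
baseline = [0.82, 0.90, 0.76]
candidate = [0.70, 0.85, 0.60]

passed, delta = regression_gate(baseline, candidate)
print(f"mean score change: {delta:+.3f}, gate passed: {passed}")
```

In a CI job, a script like this would typically exit with a non-zero status when the gate fails, which is what blocks the change from shipping.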

Tech stack

Python · LangChain · OpenAI SDK

Getting it running

Difficulty · moderate | Time to first run · 30 min

Requires an LLM API key (e.g. OpenAI or Anthropic) to power the evaluation metrics.

In plain English

DeepEval is an open-source framework for testing large language model (LLM) applications: chatbots, AI agents, retrieval pipelines, and the like. The pitch in the README is that it works "similar to Pytest but specialized for unit testing LLM apps": you write small test cases that check whether your AI is doing what you expect, and the framework runs them and reports the results.

The hard part of testing an LLM is that there is rarely a single correct answer to compare against, so DeepEval ships a catalogue of ready-made metrics that score outputs in different ways. Some are general-purpose, like G-Eval (a research-backed approach that uses another LLM as a judge against custom criteria) and DAG (a graph-based deterministic judge builder). Others are grouped by use case: agentic metrics such as task completion, tool correctness, and plan adherence; RAG metrics such as answer relevancy, faithfulness, and contextual recall; multi-turn metrics for chatbots covering knowledge retention and role adherence; and MCP-specific metrics. The metrics can be powered by any LLM you choose, by statistical methods, or by smaller NLP models that run locally on your machine.

You would reach for DeepEval when you are building an AI app and want a repeatable way to know whether a change to the prompt, the model, or the retrieval setup actually made the system better. That includes provider swaps, for example moving from OpenAI to Claude with confidence that quality has not dropped. It is a Python package designed to plug into stacks like LangChain or the OpenAI SDK, and it pairs with the paid Confident AI platform for storing and sharing test runs.
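
As a rough illustration of the pattern described above (a test case, a metric that scores it, and a pass threshold), here is a minimal sketch in plain Python. It deliberately does not use DeepEval's API: `EvalCase`, `overlap_faithfulness`, and `assert_metric` are hypothetical names invented for this example, and the token-overlap score is only a crude statistical stand-in for the LLM-judged faithfulness metric that the framework provides.

```python
import re
from dataclasses import dataclass


@dataclass
class EvalCase:
    """Hypothetical container: a prompt, the app's answer, and retrieved context."""
    input: str
    actual_output: str
    retrieval_context: list[str]


def _tokens(text: str) -> set[str]:
    # Lowercased alphanumeric tokens, with duplicates removed.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_faithfulness(case: EvalCase) -> float:
    """Fraction of distinct answer tokens that also appear in the context.

    An assumption-laden stand-in: a real faithfulness metric judges claims,
    not word overlap.
    """
    answer = _tokens(case.actual_output)
    if not answer:
        return 0.0
    context = _tokens(" ".join(case.retrieval_context))
    return len(answer & context) / len(answer)


def assert_metric(case: EvalCase, metric, threshold: float) -> float:
    """Fail the test when the metric score falls below the threshold."""
    score = metric(case)
    assert score >= threshold, (
        f"{metric.__name__} scored {score:.2f}, below threshold {threshold}"
    )
    return score


def test_refund_answer_is_grounded():
    # Pytest-style test function: collected and run like any other unit test.
    case = EvalCase(
        input="What is the refund window?",
        actual_output="Refunds are available within 30 days.",
        retrieval_context=["Refunds are available within 30 days of purchase."],
    )
    assert_metric(case, overlap_faithfulness, threshold=0.7)
```

The point of the design is that the assertion is on a score and a threshold rather than on an exact string, which is what makes the test usable when there is no single correct answer.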

Copy-paste prompts

Prompt 1
Using DeepEval, write a pytest-style test that checks whether my RAG chatbot's answers are faithful to the retrieved context chunks.
Prompt 2
Set up a DeepEval test suite that uses G-Eval with a custom rubric to score my AI customer-support agent's responses.
Prompt 3
Write a DeepEval test to compare GPT-4 vs Claude Sonnet on task completion for my multi-step agent and report which scores higher.
Prompt 4
Show me how to add DeepEval tests to a GitHub Actions CI workflow so LLM quality regressions are caught automatically on every push.
Prompt 5
Write a DeepEval test measuring answer relevancy and contextual recall for a LangChain RAG pipeline with 5 example question-answer pairs.

Verify against the repo before relying on details.