explaingit

vibrantlabsai/ragas

13,901 · Python · Audience: developer · Complexity: 3/5 · License: Apache 2.0 · Setup: easy

TLDR

Ragas is a Python toolkit for automatically scoring the quality of AI-powered apps, measuring whether answers are accurate and grounded in source material, with built-in test data generation so you can start evaluating without a pre-made test set.

Mindmap

mindmap
  root((Ragas))
    What it does
      LLM app evaluation
      Quality scoring
      Test data generation
    Metrics
      Faithfulness
      Answer relevance
      Custom prompts
    Integrations
      pip install
      AI frameworks
      CI pipelines
    License
      Apache 2.0
      Open source

Things people build with this

USE CASE 1

Automatically score whether your RAG chatbot's answers are grounded in the source documents it retrieved, without manual review.

USE CASE 2

Generate a test dataset from your existing content to start evaluating your AI app immediately without hand-crafting test cases.

USE CASE 3

Define a custom evaluation metric by writing a plain-English prompt describing what good output looks like, then run it across your entire output set.
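The idea of a prompt-defined check can be sketched without any library at all. The snippet below is a library-agnostic illustration, not the Ragas API: `CRITERION`, `build_prompt`, `score_outputs`, and `stub_judge` are hypothetical names, and `stub_judge` is an offline stand-in for a real LLM call so the example runs without API keys.

```python
from typing import Callable, Iterable

# Hypothetical plain-English criterion; not taken from the Ragas docs.
CRITERION = "The text avoids superlatives such as 'best' or 'greatest'."


def build_prompt(criterion: str, output: str) -> str:
    """Combine the criterion and one output into a pass/fail judging prompt."""
    return (
        f"Criterion: {criterion}\n"
        f"Output: {output}\n"
        "Answer PASS or FAIL."
    )


def score_outputs(
    outputs: Iterable[str],
    judge: Callable[[str], str],
    criterion: str = CRITERION,
) -> list[bool]:
    """Apply the same criterion to every output; True means the judge said PASS."""
    return [
        judge(build_prompt(criterion, text)).strip().upper() == "PASS"
        for text in outputs
    ]


def stub_judge(prompt: str) -> str:
    """Offline stand-in for an LLM call (assumption: a real judge would be a model)."""
    # Inspect only the "Output:" line, since the criterion line itself names the banned words.
    output_line = prompt.split("\n")[1].lower()
    return "FAIL" if any(word in output_line for word in ("best", "greatest")) else "PASS"


results = score_outputs(
    ["A sturdy steel kettle.", "The best kettle ever made."],
    stub_judge,
)
```

In practice the judge would be a model call, and the Ragas documentation should be consulted for how the library itself expresses custom metrics.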

USE CASE 4

Run regression tests on your AI pipeline after prompt changes to catch quality drops before they reach users.
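A regression check of this kind usually reduces to comparing a stored baseline against the scores from the changed pipeline. The sketch below assumes the scores have already been produced (by Ragas or anything else) and are available as plain dictionaries; `find_regressions`, the `tolerance` value, and the numbers are illustrative placeholders, not real results.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return each metric whose score dropped by more than `tolerance`, mapped to the drop."""
    drops: dict[str, float] = {}
    for name, old_score in baseline.items():
        new_score = candidate.get(name)
        if new_score is None:
            continue  # metric missing from the new run; handle separately if needed
        drop = old_score - new_score
        if drop > tolerance:
            drops[name] = round(drop, 4)
    return drops


# Illustrative numbers only.
baseline = {"faithfulness": 0.91, "answer_relevance": 0.88}
candidate = {"faithfulness": 0.84, "answer_relevance": 0.89}

regressions = find_regressions(baseline, candidate)
```

A CI job could fail the build whenever `regressions` is non-empty, which blocks a prompt change that lowers quality beyond the chosen tolerance.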

Tech stack

Python

Getting it running

Difficulty · easy · Time to first run · 30 min

Collects anonymized usage data by default; set RAGAS_DO_NOT_TRACK=true to opt out before the first run.
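One way to apply the opt-out from inside a Python script is to set the variable before the library is imported. This is a minimal sketch; it assumes the variable only needs to be present in the process environment, which is worth confirming against the repo.

```python
import os

# Set the opt-out flag first so it is in place before Ragas runs.
os.environ["RAGAS_DO_NOT_TRACK"] = "true"

# import ragas  # assumption: import the library only after the flag is set
```

Exporting the same variable in the shell or in the CI configuration has the same effect and avoids depending on import order.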

Apache 2.0: use freely for any purpose, including commercial use; modify and distribute with attribution; no copyleft restrictions.

In plain English

Ragas is a Python toolkit for testing and measuring the quality of applications built on large language models. If you have built something that uses an AI model to answer questions, summarize text, or retrieve information, Ragas gives you a structured way to score how well it is working. The core idea is to move evaluation away from manual, subjective review and toward repeatable, data-driven scoring.

Ragas provides a set of pre-built metrics that can assess things like whether a summary is accurate or whether a generated answer is grounded in the source material. You can also define your own custom scoring criteria by writing a prompt that describes what you want to check, and Ragas will apply that check to your outputs automatically.

One practical problem the library addresses is the cold-start problem for testing: many teams want to run evaluations but do not have a ready-made set of test cases. Ragas includes a test data generation feature that can create a range of scenarios from your existing content, so you can start evaluating without building a test set by hand.

Ragas is installed via pip and works alongside common AI orchestration frameworks. It collects anonymized usage data by default, which you can opt out of by setting an environment variable. The quickstart command provides template projects for common evaluation scenarios like RAG (retrieval-augmented generation) systems, with additional templates for agent evaluation and prompt testing listed as coming soon.

The project is open source under the Apache 2.0 license and maintained by VibrantLabs, who also offer paid consulting for teams needing help scaling their evaluation workflows.

Copy-paste prompts

Prompt 1
I have a RAG pipeline that answers questions using company documentation. Show me how to use Ragas to score faithfulness and answer relevance on 50 sample question-answer pairs.
Prompt 2
Write a Ragas custom metric that checks whether a generated product description avoids using superlatives like 'best' or 'greatest'.
Prompt 3
I want to generate 100 test questions from a PDF about our product for use in Ragas evaluation. Show me how to use Ragas test data generation with LangChain document loaders.
Prompt 4
How do I integrate Ragas scoring into a CI pipeline so every pull request that changes my prompt gets an automated quality score comparison?

Verify against the repo before relying on details.