explaingit

conardli/easy-dataset

14,231JavaScriptAudience · dataComplexity · 2/5LicenseSetup · easy

TLDR

Easy Dataset is a desktop and web app that automatically converts your documents into structured question-and-answer training data for fine-tuning or evaluating AI language models.

Mindmap

mindmap
  root((repo))
    What it does
      Document to dataset
      QA pair generation
      Model fine-tuning data
    Input formats
      PDF and Word
      Markdown and text
    Output types
      QA pairs
      Multi-turn conversation
      Evaluation datasets
    Integrations
      OpenAI compatible APIs
      Ollama local models
      Hugging Face export
    Audience
      AI researchers
      ML practitioners
Click or tap to explore — scroll the page freely

Code map

Detail Auto

An interactive map of this repo's files and how they connect — its source is parsed live in your browser. Click Visualize to build it.

filefunction / class

Things people build with this

USE CASE 1

Convert a product manual or knowledge base PDF into question-answer pairs for fine-tuning an AI model.

USE CASE 2

Generate multi-turn conversation training data from existing documentation to customize a language model.

USE CASE 3

Run automated model evaluation with the built-in judge to score and compare two models side by side.

USE CASE 4

Export a finished dataset directly to Hugging Face in standard AI training formats.

Tech stack

JavaScriptNode.jsDocker

Getting it running

Difficulty · easy Time to first run · 30min
Free to use for non-commercial purposes under AGPL-3.0, commercial use requires a separate agreement.

In plain English

Easy Dataset is a desktop and web application for turning documents into training data for AI language models. If you want to teach an AI model something specific, such as the contents of a product manual, a legal guide, or a technical knowledge base, you need a collection of question-and-answer pairs drawn from that material. Easy Dataset automates that process. You start by uploading documents in formats like PDF, Word, Markdown, or plain text. The app splits the content into segments and then uses an AI model of your choice to generate questions and answers from each segment. The result is a structured dataset you can use to fine-tune an AI model or to power a retrieval-augmented generation setup, which is a technique for letting an AI pull from a custom knowledge base when answering questions. Beyond basic question-and-answer pairs, the tool can generate multi-turn conversation data, image-based question pairs, and evaluation datasets for testing how well a model performs. The evaluation side includes multiple-choice and open-ended questions, an automated judge that scores model answers, and a side-by-side blind comparison mode where you can pit two models against each other without knowing in advance which is which. The app connects to a wide range of AI provider APIs as long as they follow the standard OpenAI request format. This covers services like OpenAI, Ollama for running models locally, and various others. Once your dataset is ready, you can export it in several common formats used in AI training pipelines and upload directly to the Hugging Face model repository platform. Desktop installers are available for Windows, macOS, and Linux. You can also run it locally via Node.js or Docker. The interface supports Chinese, English, Turkish, and Portuguese. The project is open source under the AGPL-3.0 license.

Copy-paste prompts

Prompt 1
I want to fine-tune an AI model on my company's product documentation. Show me how to use Easy Dataset to upload a PDF, generate Q&A pairs, and export the dataset to Hugging Face.
Prompt 2
Walk me through connecting Easy Dataset to a local Ollama model so I can generate training data without sending documents to a cloud API.
Prompt 3
I have a legal guide I want to turn into a retrieval-augmented generation knowledge base. How do I use Easy Dataset to segment the document and produce the right dataset format?
Prompt 4
Show me how to set up Easy Dataset's blind comparison mode to evaluate two AI models on the same test questions and get automatic scores.
Open on GitHub → Explain another repo

← conardli on gitmyhub — every repo by this author, as a profile.

Verify against the repo before relying on details.