explaingit

mlabonne/llm-datasets

4,551Audience · researcherComplexity · 1/5Setup · easy

TLDR

A curated, regularly updated list of datasets and tools for fine-tuning large language models, covering instruction, math, code, safety, preference, and reasoning data with license and size details.

Mindmap

mindmap
  root((repo))
    Dataset types
      Instruction data
      Math reasoning
      Code datasets
      Safety and alignment
      Preference data
      Reasoning data
    Quality criteria
      Accuracy
      Diversity
      Complexity
    Use cases
      Fine-tuning LLMs
      Reward model training
      Reasoning research
    Audience
      ML researchers
      Model trainers
Click or tap to explore — scroll the page freely

Code map

Detail Auto

An interactive map of this repo's files and how they connect — its source is parsed live in your browser. Click Visualize to build it.

filefunction / class

Things people build with this

USE CASE 1

Find a ready-made instruction dataset for fine-tuning a language model on math or coding tasks with permissive licensing.

USE CASE 2

Discover preference datasets for training a reward model using reinforcement learning from human feedback.

USE CASE 3

Select a reasoning dataset that includes step-by-step thinking traces to teach a model to reason before answering.

USE CASE 4

Compare datasets by sample count, license, and content before committing to a training run.

Getting it running

Difficulty · easy Time to first run · 5min
Individual datasets vary in license from fully open to restricted, licensing details are noted per entry in the tables.

In plain English

This repository is a curated list of datasets and tools used to train and improve large language models after their initial pre-training phase. That process, called post-training or fine-tuning, is what shapes a raw language model into a useful assistant that can answer questions, write code, solve math problems, and follow instructions. The list is maintained by a researcher and author of a book on building with language models. The repository organizes datasets by purpose. Instruction datasets are used for supervised fine-tuning, which teaches a model to respond helpfully to prompts. These are broken into subcategories: general-purpose datasets that cover a broad mix of chat, code, and math, math-focused datasets that include step-by-step reasoning traces, science datasets with physics, chemistry, and biology problems, and code datasets covering many programming languages and difficulty levels. Additional sections cover safety and alignment data, preference datasets used to train models via reinforcement learning from human feedback, and reasoning datasets designed to teach models to think through problems before answering. The README opens with a short explanation of what makes a good dataset, identifying three qualities: accuracy (answers should be correct), diversity (covering as many situations as possible), and complexity (including multi-step, multi-language, and multi-turn examples). It notes that dataset quality typically requires a combination of human review, rule-based filtering, and automated scoring. Each entry in the tables includes the dataset name, the number of samples, whether the dataset includes explicit reasoning traces, licensing notes, and a short description of what is in it and how it was created. Most datasets listed are under open or permissive licenses, though some carry restrictions that are noted inline. The list is updated regularly as new datasets are released.

Copy-paste prompts

Prompt 1
From the mlabonne/llm-datasets list, recommend the best instruction dataset for fine-tuning a small model on Python coding questions with at least 50K samples and a permissive license.
Prompt 2
I want to train a math reasoning model. Which datasets in llm-datasets include step-by-step reasoning traces? List the top options by sample count.
Prompt 3
Help me write a Python script using the HuggingFace datasets library to load one of the instruction datasets from llm-datasets and format it for supervised fine-tuning.
Prompt 4
I'm designing a preference dataset pipeline. Based on the examples in llm-datasets, explain what makes a good preference pair and help me plan a data collection strategy.
Open on GitHub → Explain another repo

← mlabonne on gitmyhub — every repo by this author, as a profile.

Verify against the repo before relying on details.