Find a ready-made instruction dataset for fine-tuning a language model on math or coding tasks with permissive licensing.
Discover preference datasets for training a reward model as part of reinforcement learning from human feedback.
Select a reasoning dataset that includes step-by-step thinking traces to teach a model to reason before answering.
Compare datasets by sample count, license, and content before committing to a training run.
This repository is a curated list of datasets and tools used to train and improve large language models after their initial pre-training phase. That process, called post-training or fine-tuning, is what shapes a raw language model into a useful assistant that can answer questions, write code, solve math problems, and follow instructions. The list is maintained by a researcher and author of a book on building with language models.
The repository organizes datasets by purpose. Instruction datasets are used for supervised fine-tuning, which teaches a model to respond helpfully to prompts. These are broken into subcategories: general-purpose datasets that cover a broad mix of chat, code, and math; math-focused datasets that include step-by-step reasoning traces; science datasets with physics, chemistry, and biology problems; and code datasets covering many programming languages and difficulty levels. Additional sections cover safety and alignment data, preference datasets used to train models via reinforcement learning from human feedback, and reasoning datasets designed to teach models to think through problems before answering.
The README opens with a short explanation of what makes a good dataset, identifying three qualities: accuracy (answers should be correct), diversity (covering as many situations as possible), and complexity (including multi-step, multi-language, and multi-turn examples). It notes that achieving dataset quality typically requires a combination of human review, rule-based filtering, and automated scoring.
Each entry in the tables includes the dataset name, the number of samples, whether the dataset includes explicit reasoning traces, licensing notes, and a short description of what is in it and how it was created. Most datasets listed are under open or permissive licenses, though some carry restrictions that are noted inline. The list is updated regularly as new datasets are released.
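The table columns described above (name, sample count, reasoning traces, license, description) map onto a small record type, which makes it easy to shortlist candidates before a training run. The following is a minimal sketch, not something taken from the repository: the `DatasetEntry` class, the `shortlist` helper, the placeholder entries, and the set of licenses treated as permissive are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DatasetEntry:
    """One table row; field names are hypothetical, mirroring the described columns."""
    name: str
    samples: int
    has_reasoning: bool
    license: str
    description: str


# Assumption: which licenses count as "permissive" is a project-specific decision.
PERMISSIVE = {"apache-2.0", "mit", "cc-by-4.0"}

# Placeholder entries for illustration only; these are not real datasets.
CATALOG = [
    DatasetEntry("example-general-chat", 1_000_000, False, "cc-by-nc-4.0",
                 "Mixed chat, code, and math prompts."),
    DatasetEntry("example-math-traces", 250_000, True, "apache-2.0",
                 "Math problems with step-by-step reasoning."),
    DatasetEntry("example-code-sft", 80_000, False, "mit",
                 "Coding tasks across several languages."),
]


def shortlist(entries, min_samples=0, require_reasoning=False, licenses=PERMISSIVE):
    """Keep entries matching the license and size constraints, largest first."""
    picked = [
        e for e in entries
        if e.samples >= min_samples
        and e.license.lower() in licenses
        and (e.has_reasoning or not require_reasoning)
    ]
    return sorted(picked, key=lambda e: e.samples, reverse=True)


if __name__ == "__main__":
    for entry in shortlist(CATALOG, require_reasoning=True):
        print(f"{entry.name}: {entry.samples:,} samples, {entry.license}")
```

Filtering on a normalized license string is a deliberate simplification; licensing notes in a real list are often free text and still need a manual read.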
Verify against the repo before relying on details.