explaingit

ahammadmejbah/awesome-datasets-hub

Analysis updated 2026-06-24

Stars · 118 | Audience · researcher | Complexity · 1/5 | Setup · easy

TLDR

Hand-curated awesome-list of public datasets for training and evaluating LLMs, with tables covering domain, task, size, strength rating, languages, and license.

Mindmap

mindmap
  root((Awesome-Datasets-Hub))
    Inputs
      Manual research
      Dataset homepages
      License pages
    Outputs
      Comparison tables
      License flags
      Strength scores
    Use Cases
      Pick training data
      Pick eval benchmark
      Compliance check
    Tech Stack
      Markdown
      GitHub

What do people build with it?

USE CASE 1

Pick a medical-domain training set for an LLM and confirm the license before downloading.

USE CASE 2

Compare instruction-tuning datasets by size, language coverage, and the author's strength rating.

USE CASE 3

Find evaluation benchmarks for reasoning and code generation in one place.

USE CASE 4

Identify which clinical sets, such as MIMIC, require a PhysioNet data use agreement before access is granted.

What is it built with?

Markdown

How does it compare?

Repo: ahammadmejbah/awesome-datasets-hub | krishnaik06/complete-machine-learning-2023 | jackson-video-resources/markov-hedge-fund-method
Stars: 118 | 119 | 120
Language: Jupyter Notebook | Python
Last pushed: 2023-09-16
Maintenance: Dormant
Setup difficulty: easy | easy | easy
Complexity: 1/5 | 1/5 | 3/5
Audience: researcher | general | developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · easy | Time to first run · 5 min

In plain English

Awesome-Datasets-Hub is not a piece of software. It is a hand-curated index, in the style of the many awesome-list repositories on GitHub, that gathers public datasets used to train and test large language models. The README is a long series of comparison tables. Each row points to one dataset and gives a short tag for its domain, the task it targets, its approximate size, a strength rating out of ten, the languages it covers, and the license under which it is released. The repository description lists the broader scope: medical AI, natural language processing, multimodal learning, instruction tuning, reasoning, code generation, and evaluation benchmarks.

The portion of the README analyzed here begins with the medical datasets table. It lists well-known resources such as MedQA, based on USMLE licensing exam questions; MedMCQA, drawn from Indian medical entrance exams; PubMedQA, covering biomedical yes/no/maybe questions; BioASQ; and clinical text corpora such as MIMIC-III and MIMIC-IV, which contain hospital records for tens to hundreds of thousands of patients. Each row links to the dataset's homepage or paper repository.

The tables include practical details that researchers usually have to dig up one dataset at a time, in particular the license. Some datasets carry permissive MIT or Apache 2.0 licenses, others use Creative Commons licenses, and several clinical sets such as MIMIC and n2c2 require a signed data use agreement before download (for MIMIC, through PhysioNet). The strength scores are the author's own ratings rather than a standardized benchmark.

The author is Mejbah Ahammad. The page header has badges linking to their email, LinkedIn, a YouTube channel called Intelligence Academy, a ResearchGate profile, and a personal website. The repo serves as a starting reference for anyone choosing training or evaluation data for an LLM project, especially in regulated areas like healthcare where dataset licensing is itself a research question.
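Because the README is organized as comparison tables with one row per dataset, its contents can be turned into structured records with a small parser. The sketch below is a minimal illustration, not the repo's own tooling. It assumes the tables are ordinary pipe-delimited Markdown tables. The column names (`Dataset`, `License`, `Languages`, and so on) and the sample rows are hypothetical stand-ins chosen to match the description above, so check them against the actual README before use.

```python
import json

# Hypothetical sample in the shape described above. The real README's
# column names and values may differ; these rows are invented placeholders.
SAMPLE_TABLE = """\
| Dataset | Domain | Task | Size | Strength | Languages | License |
|---------|--------|------|------|----------|-----------|---------|
| ExampleQA | Medical | Multiple-choice QA | 10k | 8/10 | English | MIT |
| ExampleNotes | Clinical | Text mining | 50k | 9/10 | English | Data use agreement |
"""


def split_row(line):
    """Split one pipe-delimited table row into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def is_separator(line):
    """Return True for the header separator row, e.g. |---|:---:|."""
    cells = split_row(line)
    return all(cell and set(cell) <= set("-: ") for cell in cells)


def parse_markdown_table(text):
    """Return one dict per data row, keyed by the header cells."""
    lines = [ln for ln in text.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []
    header = split_row(lines[0])
    records = []
    for line in lines[1:]:
        if is_separator(line):
            continue
        cells = split_row(line)
        if len(cells) != len(header):
            continue  # skip malformed rows rather than guess at alignment
        records.append(dict(zip(header, cells)))
    return records


records = parse_markdown_table(SAMPLE_TABLE)

# Reduce each record to the fields a license check needs.
index = [
    {"name": r["Dataset"], "license": r["License"], "languages": r["Languages"]}
    for r in records
]
print(json.dumps(index, indent=2))
```

Rows whose cell count does not match the header are skipped instead of being padded, which keeps a license value from silently landing in the wrong column. Cells that contain escaped pipe characters are not handled by this sketch.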

Copy-paste prompts

Prompt 1
Pick three medical QA datasets from Awesome-Datasets-Hub that have permissive licenses and are above 8 in strength. Justify the choice.
Prompt 2
Build a Python script that scrapes the README tables from this repo and outputs a JSON index of dataset name, license, and language.
Prompt 3
I want to fine-tune a small open model for biomedical Q&A. Pick the right datasets from this index and write the download steps.
Prompt 4
Compare MedQA, MedMCQA, and PubMedQA on size, task format, and license. Use only the rows in this repo.
Prompt 5
List every dataset in this repo that requires a data use agreement, and group them by which agreement is needed.
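The selection logic behind Prompt 1 and Prompt 5 can also be checked by hand once the rows are in structured form. The following is a minimal sketch under stated assumptions: the rows, the field names, and the `PERMISSIVE` set are hypothetical, and which licenses count as permissive is a judgment call to confirm against each dataset's own license page.

```python
from collections import defaultdict

# Hypothetical rows; names, ratings, and licenses are invented placeholders,
# not values taken from the repo.
ROWS = [
    {"name": "ExampleQA", "strength": "9/10", "license": "MIT"},
    {"name": "ExampleBench", "strength": "7/10", "license": "Apache 2.0"},
    {"name": "ExampleNotes", "strength": "9/10", "license": "PhysioNet DUA"},
    {"name": "ExampleAbstracts", "strength": "8.5/10", "license": "CC BY 4.0"},
]

# Assumption: only these license strings are treated as permissive here.
PERMISSIVE = {"mit", "apache 2.0", "apache-2.0"}


def strength_value(text):
    """Parse a rating such as '9/10' or '8.5' into a float, else None."""
    try:
        return float(text.split("/")[0].strip())
    except ValueError:
        return None


def permissive_and_strong(rows, min_strength=8.0):
    """Names of rows with a permissive license and a rating above min_strength."""
    picked = []
    for row in rows:
        score = strength_value(row["strength"])
        if score is None:
            continue  # an unparseable rating is excluded, not guessed
        if row["license"].strip().lower() in PERMISSIVE and score > min_strength:
            picked.append(row["name"])
    return picked


def group_by_license(rows):
    """Map each license string to the dataset names released under it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["license"]].append(row["name"])
    return dict(groups)


print(permissive_and_strong(ROWS))
print(group_by_license(ROWS))
```

Grouping by the raw license string keeps datasets that share an access agreement together, which is the shape Prompt 5 asks for. Since the strength scores are the author's own ratings, a threshold on them is only a rough first filter.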

Frequently asked questions

What is awesome-datasets-hub?

Hand-curated awesome-list of public datasets for training and evaluating LLMs, with tables covering domain, task, size, strength rating, languages, and license.

How hard is awesome-datasets-hub to set up?

Setup difficulty is rated easy, with roughly 5 minutes to a first successful run.

Who is awesome-datasets-hub for?

Mainly researchers.

Verify against the repo before relying on details.