explaingit

niderhoff/nlp-datasets

5,980
Audience · researcher
Complexity · 1/5
Setup · easy

TLDR

A curated alphabetical list of free, public-domain text datasets for natural language processing. It is not software, just a well-organized reference document with links, descriptions, and file sizes pointing to datasets hosted elsewhere.

Mindmap

mindmap
  root((NLP Datasets))
    What it is
      Dataset reference list
      No code or software
    Dataset types
      Web crawl corpora
      News text
      Product reviews
      Dialogue scripts
    Scale range
      Megabytes to terabytes
      Millions of items
    Focus
      Unstructured raw text
      Public domain data
    Audience
      NLP researchers
      ML practitioners

Things people build with this

USE CASE 1

Find a large text dataset for training or fine-tuning a language model.

USE CASE 2

Browse available public-domain corpora when starting a new NLP research project.

USE CASE 3

Locate a domain-specific dataset such as news text, product reviews, or dialogue scripts matched to your use case.

Getting it running

Difficulty · easy
Time to first run · 5 min

In plain English

This repository is a curated, alphabetical list of free and public-domain text datasets for natural language processing work. It is not a library or a piece of software you install and run. It is a reference document: a long list of links with brief descriptions, pointing to datasets hosted elsewhere on the internet.

The datasets span a wide range of content and scale. Entries include Amazon product reviews (35 million reviews, 11 GB), all papers published on arXiv (270 GB of full text), the Common Crawl web corpus (over 5 billion pages, 541 TB), movie dialogue scripts, news headlines, email archives, government contract records, and many more. File sizes range from a few megabytes to hundreds of terabytes, so the list is useful whether you are working on a small project or a large infrastructure setup.

The focus is on unstructured raw text rather than labeled or annotated data. The README notes that if you need annotated corpora or linguistic treebanks, those are covered by separate sources listed at the bottom of the document.

There is no code in the repository. Its value is as a starting point when you need a text dataset for a project and do not know where to look. Each entry includes the dataset name, a short description, an approximate size, and a link to where it can be accessed or downloaded. This summary is based on only part of the README; the full list is longer.
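Because the listed sizes vary so widely, it can help to shortlist entries by how much storage you have. The following is a minimal sketch, not something the repository provides: the record layout, the helper names `size_in_bytes` and `fits_budget`, and the use of decimal units (1 GB = 10^9 bytes) are all assumptions made for illustration. Only the three sizes are taken from the description above.

```python
import re

# Sizes as quoted in the summary above. The dict layout is hypothetical;
# the README itself is a plain list, not structured data.
ENTRIES = [
    {"name": "Amazon product reviews", "size": "11 GB"},
    {"name": "arXiv full text", "size": "270 GB"},
    {"name": "Common Crawl", "size": "541 TB"},
]

# Assumption: decimal units. The README does not say which convention it uses.
UNITS = {"MB": 10**6, "GB": 10**9, "TB": 10**12}


def size_in_bytes(text):
    """Convert a string such as '11 GB' into an approximate byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(MB|GB|TB)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    value, unit = match.groups()
    return int(float(value) * UNITS[unit])


def fits_budget(entries, budget):
    """Return the names of entries whose size does not exceed the budget."""
    limit = size_in_bytes(budget)
    return [e["name"] for e in entries if size_in_bytes(e["size"]) <= limit]


if __name__ == "__main__":
    print(fits_budget(ENTRIES, "1 TB"))
```

With a 1 TB budget, the two smaller corpora pass the filter and Common Crawl does not. Treat the listed sizes as approximate and check the dataset's own download page before planning storage.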

Copy-paste prompts

Prompt 1
Using the nlp-datasets list, find me a large English text corpus suitable for pre-training a language model that is freely downloadable.
Prompt 2
I need Amazon product review data for sentiment analysis training. What dataset in nlp-datasets covers this and how large is it?
Prompt 3
I want to train a dialogue model. What conversational text datasets are listed in niderhoff/nlp-datasets and where can I download them?
Prompt 4
Point me to the largest web crawl dataset listed in nlp-datasets and explain any size or licensing considerations.

Verify against the repo before relying on details.