Find a large text dataset for training or fine-tuning a language model.
Browse available public-domain corpora when starting a new NLP research project.
Locate a domain-specific dataset, such as news text, product reviews, or dialogue scripts, that matches your use case.
This repository is a curated, alphabetical list of free and public-domain text datasets for natural language processing work. It is not a library or software that you install and run, and it contains no code. It is a reference document: a long list of links with brief descriptions, pointing to datasets hosted elsewhere on the internet.

The datasets vary widely in content and scale. Entries include Amazon product reviews (35 million reviews, 11 GB), all papers published on arXiv (270 GB of full text), the Common Crawl web corpus (over 5 billion pages, 541 TB), movie dialogue scripts, news headlines, email archives, government contract records, and many more. Sizes range from a few megabytes to hundreds of terabytes, so the list is useful for both small projects and large-scale infrastructure work.

The focus is on unstructured raw text rather than labeled or annotated data. The README notes that annotated corpora and linguistic treebanks are covered by separate sources listed at the bottom of the document.

The list is most valuable as a starting point when you need a text dataset for a project and do not know where to look. Each entry includes the dataset name, a short description, an approximate size, and a link to where it can be accessed or downloaded. The full README is longer than the portion summarized here.
Repository author: niderhoff.
Verify against the repo before relying on details.