Find the right Chinese text dataset for an NLP task such as sentiment analysis, question answering, or entity recognition
Download labeled Chinese news articles, product reviews, or doctor-patient dialogues to train or fine-tune a language model
Start a Chinese text classification project using the Toutiao news dataset of 380,000 labeled short articles across 15 categories
This repository is a curated collection of datasets for Chinese natural language processing (NLP) research and experimentation. NLP is the field of computer science that teaches machines to read, understand, and work with human language. Because Chinese has its own grammar, writing system, and linguistic quirks, researchers need datasets built specifically for Chinese text rather than English-language ones.

The collection is organized into several categories. One section covers reading comprehension, where a model reads a passage and answers questions about it. Datasets here include DuReader from Baidu, which contains 300,000 questions paired with 1.4 million documents, and CMRC 2018 from Harbin Institute of Technology. Another section covers task-oriented dialogue, meaning conversations where a user wants to accomplish something specific, like booking a car or getting a medical diagnosis. Examples include a medical diagnosis dataset from Fudan University built from real online doctor-patient exchanges, and several datasets from the annual SMP and NLPCC evaluation competitions.

There are also datasets for text classification, such as a Toutiao news dataset with 380,000 labeled short articles across 15 categories, and a Tsinghua news corpus covering topics like sports, finance, technology, and entertainment. Sentiment analysis datasets appear as well, covering hotel reviews, food delivery reviews, online shopping reviews across 10 product types, and labeled Weibo posts. The project also indexes datasets for named entity recognition (identifying people, places, and organizations in text), text similarity, question answering, and knowledge graph tasks.

Most entries include the dataset size, the institution that created it, links to the original papers, and download addresses. This is a reference and index resource, not a software tool.
A researcher or developer working on Chinese text AI would browse the tables, find the dataset that fits their task, and download it from the linked source. The README is the main artifact here, and the repository is open to pull requests that add new datasets to the index.
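After downloading a labeled dataset from one of the linked sources, a common first step is to check how the labels are distributed. The following is a minimal Python sketch of that check, not part of the repository. It assumes a hypothetical layout of one example per line with the label and the text separated by a tab; the actual file format, separator, and label names differ per dataset and should be taken from that dataset's own documentation. The helper names `parse_labeled_lines` and `label_counts` and the sample rows are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_labeled_lines(lines: Iterable[str], sep: str = "\t") -> List[Tuple[str, str]]:
    """Split each non-empty line into (label, text) at the first separator.

    The "label<sep>text" layout is an assumption, not a documented format.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        label, found, text = line.partition(sep)
        if not found:
            continue  # skip lines that lack the separator
        pairs.append((label.strip(), text.strip()))
    return pairs


def label_counts(pairs: List[Tuple[str, str]]) -> Counter:
    """Count how many examples carry each label."""
    return Counter(label for label, _ in pairs)


# Tiny in-memory sample standing in for a downloaded file (made-up rows).
sample = [
    "sports\t国足今晚迎战对手\n",
    "finance\t股市今日小幅上涨\n",
    "sports\t篮球联赛新赛季开幕\n",
    "\n",
]
pairs = parse_labeled_lines(sample)
counts = label_counts(pairs)
print(counts.most_common())
```

For a real file, the same functions can be fed an open file handle, for example `parse_labeled_lines(open(path, encoding="utf-8"))`, with `sep` adjusted to whatever delimiter the dataset actually uses.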
Verify against the repo before relying on details.