explaingit

insanelife/chinesenlpcorpus

4,597 · Python · Audience · researcher · Complexity · 1/5 · Setup · easy

TLDR

A curated index of datasets for Chinese natural language processing research, covering reading comprehension, dialogue, text classification, sentiment analysis, named entity recognition, and more, with download links for each.

Mindmap

mindmap
  root((Chinese NLP Corpus))
    What it is
      Dataset index
      Download links
      Paper references
    Categories
      Reading comprehension
      Task-oriented dialogue
      Text classification
      Sentiment analysis
    Sources
      Baidu DuReader
      Tsinghua news
      Fudan medical
    Audience
      NLP researchers
      AI developers

Things people build with this

USE CASE 1

Find the right Chinese text dataset for an NLP task such as sentiment analysis, question answering, or entity recognition

USE CASE 2

Download labeled Chinese news articles, product reviews, or doctor-patient dialogues to train or fine-tune a language model

USE CASE 3

Start a Chinese text classification project using the Toutiao news dataset of 380,000 labeled short articles across 15 categories

Tech stack

Python

Getting it running

Difficulty · easy
Time to first run · 30 min
No license information is mentioned in the explanation.

In plain English

This repository is a curated collection of datasets for Chinese natural language processing (NLP) research and experimentation. NLP is the field of computer science that teaches machines to read, understand, and work with human language. Because Chinese has its own grammar, writing system, and linguistic quirks, researchers need datasets specifically built for Chinese text rather than English ones.

The collection is organized into several categories. One section covers reading comprehension, where a model reads a passage and answers questions about it. Datasets here include DuReader from Baidu, which contains 300,000 questions paired with 1.4 million documents, and CMRC 2018 from Harbin Institute of Technology. Another section covers task-oriented dialogue, meaning conversations where a user wants to accomplish something specific, like booking a car or getting a medical diagnosis. Examples include a medical diagnosis dataset from Fudan University built from real online doctor-patient exchanges, and several datasets from the annual SMP and NLPCC evaluation competitions.

There are also datasets for text classification, such as a Toutiao news dataset with 380,000 labeled short articles across 15 categories, and a Tsinghua news corpus covering topics like sports, finance, technology, and entertainment. Sentiment analysis datasets appear as well, covering hotel reviews, food delivery reviews, online shopping reviews across 10 product types, and labeled Weibo posts. The project also indexes datasets for named entity recognition (identifying people, places, and organizations in text), text similarity, question answering, and knowledge graph tasks.

Most entries include the dataset size, the institution that created it, links to the original papers, and download addresses. This is a reference and index resource, not a software tool. A researcher or developer working on Chinese text AI would browse the tables, find the dataset that fits their task, and download it from the linked source. The README is the main artifact here, and the repository is open to pull requests that add new datasets to the index.
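The browse-and-pick workflow described above can be sketched in a few lines of Python. This is only an illustration: the repository ships a README, not a machine-readable index, so the `DatasetEntry` class, the `INDEX` list, and the helper functions `find_by_task` and `largest` are all hypothetical. The entries are hand-copied from the figures quoted in this summary, and download links are left out because they are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatasetEntry:
    """One row of a hypothetical, hand-built version of the README index."""
    name: str
    task: str
    source: Optional[str] = None   # institution, when this summary names one
    size: Optional[int] = None     # item count, when this summary states one


# Hand-copied from the summary above; the real README lists many more datasets.
INDEX = [
    DatasetEntry("DuReader", "reading comprehension", source="Baidu", size=300_000),
    DatasetEntry("CMRC 2018", "reading comprehension",
                 source="Harbin Institute of Technology"),
    DatasetEntry("Medical diagnosis dialogues", "task-oriented dialogue",
                 source="Fudan University"),
    DatasetEntry("Toutiao news", "text classification", size=380_000),
]


def find_by_task(index: List[DatasetEntry], task: str) -> List[DatasetEntry]:
    """Return the entries whose task matches, ignoring case and outer spaces."""
    wanted = task.strip().lower()
    return [entry for entry in index if entry.task == wanted]


def largest(entries: List[DatasetEntry]) -> Optional[DatasetEntry]:
    """Return the entry with the biggest known size, or None if no size is known."""
    sized = [entry for entry in entries if entry.size is not None]
    return max(sized, key=lambda entry: entry.size) if sized else None


if __name__ == "__main__":
    candidates = find_by_task(INDEX, "Reading comprehension")
    print([entry.name for entry in candidates])
    best = largest(candidates)
    print(best.name if best else "no sized entry")
```

Entries without a stated size are skipped when ranking rather than treated as zero, so a dataset is never ranked last just because this summary omits its count.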

Copy-paste prompts

Prompt 1
I need a Chinese sentiment analysis dataset for online shopping reviews. Which dataset in insanelife/chinesenlpcorpus should I use and how do I download it?
Prompt 2
Help me find a Chinese task-oriented dialogue dataset for training a medical chatbot from the chinesenlpcorpus index.
Prompt 3
What Chinese named entity recognition datasets are listed in insanelife/chinesenlpcorpus and where can I download them?
Prompt 4
I want to train a reading comprehension model on Chinese text. Which dataset from chinesenlpcorpus has the most question-document pairs?

Verify against the repo before relying on details.