explaingit

649453932/chinese-text-classification-pytorch

5,733 | Python | Audience · researcher | Complexity · 2/5 | Setup · easy

TLDR

A PyTorch toolkit with 7 ready-to-train neural network models for classifying Chinese text into categories such as finance, sports, and technology. It trains on 200,000 Chinese news headlines from THUCNews, reaches roughly 90 to 92 percent test accuracy, and starts with a single command.

Mindmap

mindmap
  root((repo))
    Models
      TextCNN
      TextRNN
      FastText
      TextRCNN
      BiLSTM with Attention
      DPCNN
      Transformer
    Dataset
      THUCNews corpus
      10 news categories
      200k headlines
    Training
      Single command run
      Character level input
      Pretrained vectors
    Results
      90 to 92 percent accuracy
      FastText top performer
    Requirements
      Python 3.7
      PyTorch 1.1

Things people build with this

USE CASE 1

Train a model to automatically sort Chinese news articles into topic categories like sports, finance, or technology.

USE CASE 2

Use as a reference implementation to learn how different neural network architectures compare on a Chinese text classification task.

USE CASE 3

Fine-tune one of the 7 included models on your own Chinese text dataset using the provided training scripts.

USE CASE 4

Benchmark classic models like TextCNN and BiLSTM against a Transformer on a standardized Chinese NLP dataset.

Tech stack

Python, PyTorch, TextCNN, LSTM, Transformer, FastText, DPCNN

Getting it running

Difficulty · easy | Time to first run · 30 min

Requires Python 3.7 and PyTorch 1.1. Train a model with a single command: python run.py --model TextCNN. Optional pretrained character vectors can be downloaded to improve accuracy.
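The single-command run can also be scripted from Python. The following is a minimal sketch, assuming it is launched against a local clone whose root contains run.py. The helper names build_command and run_model are hypothetical and not part of the repository, and only the --model TextCNN invocation is confirmed by the text above.

```python
import subprocess
import sys
from typing import List


def build_command(model: str = "TextCNN") -> List[str]:
    """Return the argv list for `python run.py --model <model>`."""
    # sys.executable reuses the interpreter running this script, so the
    # active environment (e.g. Python 3.7 with PyTorch 1.1) is preserved.
    return [sys.executable, "run.py", "--model", model]


def run_model(model: str = "TextCNN", repo_dir: str = ".") -> int:
    """Launch training from repo_dir and return the process exit code."""
    # Assumption: repo_dir is the cloned repository root containing run.py.
    completed = subprocess.run(build_command(model), cwd=repo_dir)
    return completed.returncode


# Show the command without launching a training run.
print(" ".join(build_command("TextCNN")[1:]))
```

Calling run_model("TextCNN", repo_dir="path/to/clone") would start training. A non-zero return code signals that the run failed, for example because a dependency is missing.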

No license information is mentioned in this explanation.

In plain English

This repository is a collection of neural network models for classifying Chinese text, built with PyTorch. The task it solves is automatic categorization: given a short piece of Chinese text such as a news headline, a model predicts which topic category it belongs to. The included models range from older convolutional and recurrent architectures to a Transformer, all implemented and ready to train without heavy setup.

The dataset used in the examples comes from THUCNews, a large Chinese news corpus published by Tsinghua University. The author extracted 200,000 headlines across 10 categories: finance, real estate, stocks, education, technology, society, current affairs, sports, gaming, and entertainment. The data is split into a training set of 180,000 examples plus validation and test sets of 10,000 each. Input is processed at the character level rather than the word level, and optional pre-trained character vectors are available for download to improve results.

Seven models are available out of the box: TextCNN, TextRNN, FastText, TextRCNN, BiLSTM with Attention, DPCNN, and Transformer. Benchmark accuracy on the test set ranges from about 90 to 92 percent depending on the model, with FastText reaching 92.23 percent despite being the simplest architecture. BERT and ERNIE models are covered in a companion repository linked from the README.

Running a model takes a single command, for example: python run.py --model TextCNN. The README is written in Chinese, and the project requires Python 3.7 and PyTorch 1.1. It is intended as a reference implementation for researchers and practitioners learning text classification in Chinese.
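Character-level input means each Chinese character becomes one token, so no word segmenter is needed. The sketch below illustrates that idea under stated assumptions and is not the repository's actual preprocessing code. The function names, the special tokens, the pad_size value, and the two sample headlines are all made up for illustration.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical special tokens for unknown characters and padding.
PAD, UNK = "<PAD>", "<UNK>"


def build_vocab(texts: List[str], max_size: int = 10000,
                min_freq: int = 1) -> Dict[str, int]:
    """Map each sufficiently frequent character to an integer id."""
    counts = Counter(ch for text in texts for ch in text)
    frequent = [ch for ch, n in counts.most_common(max_size) if n >= min_freq]
    vocab = {ch: idx for idx, ch in enumerate(frequent)}
    # Reserve the last two ids for unknown characters and padding.
    vocab[UNK] = len(vocab)
    vocab[PAD] = len(vocab)
    return vocab


def encode(text: str, vocab: Dict[str, int], pad_size: int = 32) -> List[int]:
    """Convert text to exactly pad_size ids, truncating or padding."""
    ids = [vocab.get(ch, vocab[UNK]) for ch in text[:pad_size]]
    ids += [vocab[PAD]] * (pad_size - len(ids))
    return ids


# Made-up sample headlines, not taken from THUCNews.
headlines = ["股市今日大涨", "球队赢得冠军"]
vocab = build_vocab(headlines)
ids = encode("股市大跌", vocab, pad_size=8)
print(len(vocab), ids)
```

Fixed-length id lists like these can be stacked into a tensor and fed to an embedding layer. Characters missing from the vocabulary fall back to the unknown-token id instead of raising an error.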

Copy-paste prompts

Prompt 1
Using the TextCNN model from chinese-text-classification-pytorch, how do I train it on my own Chinese text dataset instead of THUCNews? Show me what files to change and the exact command to run.
Prompt 2
I want to add an 11th category to the THUCNews classifier in chinese-text-classification-pytorch. Walk me through exactly which files and lines I need to edit.
Prompt 3
Explain the difference between the TextRNN and BiLSTM with Attention models in chinese-text-classification-pytorch. Which should I choose if I care more about accuracy than speed?
Prompt 4
How do I load the pre-trained character vectors for chinese-text-classification-pytorch and confirm they are improving my model's accuracy during training?
Prompt 5
I have a list of 500 Chinese sentences and I want to run inference using the FastText model from chinese-text-classification-pytorch. Write a Python script that loads the trained model and outputs a predicted category for each sentence.

Verify against the repo before relying on details.