Train a model to automatically sort Chinese news articles into topic categories like sports, finance, or technology.
Use as a reference implementation to learn how different neural network architectures compare on a Chinese text classification task.
Fine-tune one of the 7 included models on your own Chinese text dataset using the provided training scripts.
Benchmark classic models like TextCNN and BiLSTM against a Transformer on a standardized Chinese NLP dataset.
Requires Python 3.7 and PyTorch 1.1. Run with a single command: python run.py --model TextCNN. Optional pretrained character vectors are available for download to improve accuracy.
This repository is a collection of neural network models for classifying Chinese text, built with PyTorch. The task is automatic categorization: given a short piece of Chinese text such as a news headline, a model predicts which topic category it belongs to. The included models range from older convolutional and recurrent architectures to a Transformer, all implemented and ready to train without heavy setup.
The dataset used in the examples comes from THUCNews, a large Chinese news corpus published by Tsinghua University. The author extracted 200,000 headlines across 10 categories: finance, real estate, stocks, education, technology, society, current affairs, sports, gaming, and entertainment. The data is split into a training set of 180,000 examples and validation and test sets of 10,000 each. Input is processed at the character level rather than the word level, and optional pretrained character vectors are available for download to improve results.
Seven models are available out of the box: TextCNN, TextRNN, FastText, TextRCNN, BiLSTM with Attention, DPCNN, and Transformer. Benchmark accuracy on the test set ranges from about 90 to 92 percent depending on the model, with FastText reaching 92.23 percent despite being the simplest architecture. BERT and ERNIE models are covered in a companion repository linked from the README.
Running a model is a single command, for example: python run.py --model TextCNN. The README is written in Chinese, and the code requires Python 3.7 and PyTorch 1.1. The project is intended as a reference implementation for researchers and practitioners learning text classification in Chinese.
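The character-level input processing mentioned above can be illustrated with a minimal sketch. This is not the repository's own preprocessing code: the function names build_vocab and encode, the special tokens <PAD> and <UNK>, and the pad_size parameter are all assumptions chosen for illustration. The idea is that each Chinese character, rather than each segmented word, becomes one token id, and every headline is truncated or padded to a fixed length before being fed to a model.

```python
from collections import Counter

# Hypothetical special tokens; the repository may use different names.
PAD, UNK = "<PAD>", "<UNK>"


def build_vocab(texts, max_size=10000, min_freq=1):
    """Map each character to an integer id, most frequent first."""
    counts = Counter(ch for text in texts for ch in text)
    kept = [ch for ch, n in counts.most_common(max_size) if n >= min_freq]
    vocab = {ch: i for i, ch in enumerate(kept)}
    # Reserve the last two ids for unknown characters and padding.
    vocab[UNK] = len(vocab)
    vocab[PAD] = len(vocab)
    return vocab


def encode(text, vocab, pad_size=32):
    """Turn a string into a fixed-length list of character ids."""
    ids = [vocab.get(ch, vocab[UNK]) for ch in text[:pad_size]]
    ids += [vocab[PAD]] * (pad_size - len(ids))
    return ids


# Toy headlines standing in for the THUCNews data.
headlines = ["中国队夺冠", "股市大涨"]
vocab = build_vocab(headlines)
encoded = encode("中国股市", vocab, pad_size=6)
print(encoded)
```

Working at the character level avoids the need for a Chinese word segmenter and keeps the vocabulary small, at the cost of leaving it to the model to learn which character sequences form meaningful words.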
Verify against the repo before relying on details.