Filter common Chinese words out of text before running sentiment analysis or topic classification.
Load one or more word lists into a search index to improve Chinese full-text search relevance.
Use the Harbin or Baidu stopword list as a baseline for training a Chinese NLP or language model.
Combine multiple lists to build a comprehensive Chinese stopword set for a data cleaning pipeline.
This repository is a collection of Chinese stopword lists in plain text format. Stopwords are common words that text-processing systems typically filter out before analyzing or searching content. In English, words like "the", "and", and "is" are typical stopwords. In Chinese, the equivalents are short, high-frequency words that carry little meaning on their own, such as 的, 了, and 是.

The repository contains four separate word lists, each from a different source. One is a general Chinese stopword list. A second comes from the Harbin Institute of Technology, a well-known source for Chinese natural language processing research. A third is Baidu's stopword list. The fourth is from the machine intelligence lab at Sichuan University. Each list is a separate text file, so you can pick the one that fits your project or combine several.

The README is minimal and consists mainly of a table mapping each list name to its filename. The repository contains no code, no installation instructions, and no usage examples. You would typically load these files into a text-processing pipeline, a search index, or a machine learning data-cleaning step that strips common words from Chinese text before further analysis. The lists are useful for anyone building search tools, text classifiers, or language models that process Chinese content.
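Since the repository ships no usage example, here is a minimal Python sketch of the loading and filtering step described above. It rests on assumptions not confirmed by the source: that each file is UTF-8 encoded with one stopword per line, and that the input text has already been split into words by a segmenter of your choice (Chinese text has no spaces between words). The function names `load_stopwords` and `remove_stopwords` are hypothetical, and the file paths are whatever the README table lists.

```python
from pathlib import Path
from typing import Iterable


def load_stopwords(paths: Iterable[str]) -> set[str]:
    """Return the union of the stopwords found in the given files.

    Assumes UTF-8 text with one stopword per line (an assumption,
    not something the repository documents).
    """
    words: set[str] = set()
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        for line in text.splitlines():
            word = line.strip()
            if word:  # skip blank lines
                words.add(word)
    return words


def remove_stopwords(tokens: Iterable[str], stopwords: set[str]) -> list[str]:
    """Keep only the tokens that are not in the stopword set.

    `tokens` must already be segmented into words; this function
    does no word segmentation of its own.
    """
    return [token for token in tokens if token not in stopwords]
```

Passing a single path to `load_stopwords` uses one list on its own, and passing several paths merges them into one set with duplicates collapsed. Storing the words in a set keeps each membership check in `remove_stopwords` at roughly constant time, which matters when filtering a large corpus.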
Verify against the repo before relying on details.