goto456/stopwords

★ 5,518
Audience · data
Complexity · 1/5
Setup · easy

Mindmap

mindmap
  root((stopwords))
    What it is
      Chinese word lists
      Plain text files
      No code included
    Word lists
      General Chinese list
      Harbin Institute list
      Baidu stopwords
      Sichuan University list
    Use cases
      Text classifiers
      Search indexing
      NLP pipelines
    Audience
      Data engineers
      NLP researchers
      ML practitioners

Things people build with this

USE CASE 1

Filter common Chinese words out of text before running sentiment analysis or topic classification.

USE CASE 2

Load one or more word lists into a search index to improve Chinese full-text search relevance.

USE CASE 3

Use the Harbin or Baidu stopword list as a baseline filter when preparing training data for a Chinese NLP or language model.

USE CASE 4

Combine multiple lists to build a comprehensive Chinese stopword set for a data cleaning pipeline.
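The filtering and combining use cases above can be sketched in a few lines of Python. This is only an illustration, not code from the repository (which ships none): the function names `merge_stopwords` and `filter_tokens` are hypothetical, the two small lists stand in for the real files, and the sentence is assumed to have been split into words already by some Chinese tokenizer.

```python
from typing import Iterable, List, Set


def merge_stopwords(*word_lists: Iterable[str]) -> Set[str]:
    """Union several word lists into one set, ignoring blank entries."""
    merged: Set[str] = set()
    for words in word_lists:
        merged.update(w.strip() for w in words if w.strip())
    return merged


def filter_tokens(tokens: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Keep tokens that are not stopwords, preserving their order."""
    return [tok for tok in tokens if tok not in stopwords]


# Hypothetical data: two tiny hand-written lists standing in for the real files.
list_a = ["的", "了"]
list_b = ["了", "是"]
stopwords = merge_stopwords(list_a, list_b)

# Assumes the sentence was already segmented into words by some tokenizer.
tokens = ["这", "部", "电影", "的", "剧情", "是", "很", "精彩"]
print(filter_tokens(tokens, stopwords))
```

Using a set for the merged list makes deduplication automatic and keeps each membership check fast, which matters when filtering large corpora.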

Getting it running

Difficulty · easy
Time to first run · 5 min

No explicit license is stated in the repository.

In plain English

This repository is a collection of Chinese stopword lists in plain text format. Stopwords are common words that text-processing systems typically filter out before analyzing or searching content. In English, words like "the", "and", and "is" are typical stopwords. In Chinese, the equivalents are short, high-frequency words that carry little meaning on their own.

The repository contains four separate word lists, each from a different source. One is a general Chinese stopword list. A second comes from the Harbin Institute of Technology, a well-known source for Chinese natural language processing research. A third is Baidu's stopword list. The fourth is from the machine intelligence lab at Sichuan University. Each list is a separate text file, so you can pick the one that fits your project or use them in combination.

The README is minimal and consists mainly of a table mapping each list name to its filename. There is no code in this repository, no installation instructions, and no usage examples. You would typically use these files by loading them into a text processing pipeline, a search index, or a machine learning data-cleaning step that needs to strip common words from Chinese text before further analysis. This kind of resource is useful for anyone building search tools, text classifiers, or language models that process Chinese content.
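Since the repository ships no loading code, here is a minimal sketch of reading one of the lists into memory. It assumes the file is UTF-8 text with one stopword per line; that layout is an assumption to check against the actual files, and `load_stopwords` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Set


def load_stopwords(path: str) -> Set[str]:
    """Read a one-word-per-line text file into a set, skipping blank lines.

    Assumption: the file is UTF-8. The "utf-8-sig" codec also accepts plain
    UTF-8 and strips a byte-order mark if one happens to be present.
    """
    text = Path(path).read_text(encoding="utf-8-sig")
    return {line.strip() for line in text.splitlines() if line.strip()}
```

Calling `load_stopwords("path/to/list.txt")` with the path of a downloaded list (the path shown is a placeholder) returns a set that can be passed to any token filter. Note that stripping whitespace drops any entry that consists only of whitespace, which may or may not be what a given pipeline wants.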

Copy-paste prompts

Prompt 1

I'm building a Chinese text classifier in Python. Show me how to load the Harbin Institute stopword list from this repo and use it to filter tokens before training.

Prompt 2

Help me compare the four stopword lists in this repo and decide which one to use for a Baidu search-related project versus an academic NLP task.

Prompt 3

I want to index Chinese product reviews in Elasticsearch. Show me how to configure a custom stopword filter using one of these text files.

Prompt 4

Write a Python script that loads all four stopword list files from this repo, deduplicates them, and saves a combined master list.

Prompt 5

Explain what stopwords are to a non-technical PM and why filtering them matters before running AI analysis on Chinese customer feedback.

Verify against the repo before relying on details.