Analysis updated 2026-06-20
Tokenize Chinese text into words before training a sentiment analysis or text classification model
Build a Chinese search engine that indexes word segments so users can find partial phrase matches
Extract keywords from Chinese articles using TF-IDF or TextRank for content summarization
Add part-of-speech tagging to a Chinese text pipeline to identify nouns, verbs, and named locations
| | fxsjy/jieba | wshobson/agents | geekcomputers/python |
|---|---|---|---|
| Stars | 34,931 | 34,878 | 35,001 |
| Language | Python | Python | Python |
| Setup difficulty | easy | moderate | easy |
| Complexity | 2/5 | 3/5 | 1/5 |
| Audience | data | developer | vibe coder |
Figures from each repo's GitHub metadata at analysis time.
Install via pip with no system dependencies; the deep-learning mode requires PaddlePaddle to be installed separately.
Jieba is a Chinese text segmentation library for Python. The challenge it solves is fundamental to processing Chinese text on computers: unlike English, written Chinese does not place spaces between words, so before you can analyze or search text, software has to work out where one word ends and the next begins.
The library offers several segmentation modes suited to different tasks. Precise mode tries to split a sentence into the most accurate set of words, which is best for text analysis. Full mode scans the sentence and extracts every possible word very quickly, though it cannot resolve ambiguous cases. Search engine mode builds on precise mode by further splitting long words, which is useful when building search indexes that should also match partial phrases. There is also a deep-learning mode based on PaddlePaddle for higher accuracy and part-of-speech tagging.
Under the hood, jieba builds a directed acyclic graph of all possible word combinations in a sentence, then uses dynamic programming to find the most probable path through that graph based on word frequencies. For words that are not in its dictionary, it falls back on a Hidden Markov Model (a statistical technique for inferring hidden states from sequences).
Beyond splitting text into words, jieba supports keyword extraction using two approaches: TF-IDF, which scores words by how distinctive they are to the document, and TextRank, which scores words by how centrally they connect to other words. It can also tag each word with its part of speech (noun, verb, location name, etc.) and report the exact character positions where each word starts and ends.
You would use jieba in any Python project that processes Chinese text: search engines, chatbots, sentiment analysis pipelines, document classifiers, or natural language processing research. The entire library is written in Python, is compatible with Python 2 and 3, and is installable via pip.
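The graph-plus-dynamic-programming idea described above can be sketched in plain Python. This is a toy illustration, not jieba's actual code: the dictionary `TOY_FREQ`, its frequency counts, and the function names are all hypothetical, and the Hidden Markov Model step for unknown words is left out (unknown characters simply become single-character words).

```python
import math

# Hypothetical toy dictionary: word -> frequency count (made-up numbers).
TOY_FREQ = {
    "北京": 50, "大学": 40, "北京大学": 30, "大学生": 20, "学生": 25,
    "北": 3, "京": 2, "大": 8, "学": 6, "生": 5,
}


def build_dag(sentence, freq):
    """Map each start index to the inclusive end indices of dictionary words."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = [end for end in range(start, n)
                if sentence[start:end + 1] in freq]
        # Characters missing from the dictionary become one-character words.
        dag[start] = ends or [start]
    return dag


def best_route(sentence, dag, freq):
    """Right-to-left dynamic programming over the DAG.

    route[i] holds (best log-probability of sentence[i:], end index of the
    first word on that best path).
    """
    log_total = math.log(sum(freq.values()))
    n = len(sentence)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(freq.get(sentence[start:end + 1], 1)) - log_total
             + route[end + 1][0], end)
            for end in dag[start]
        )
    return route


def segment(sentence, freq=TOY_FREQ):
    """Follow the best route from left to right and collect the words."""
    dag = build_dag(sentence, freq)
    route = best_route(sentence, dag, freq)
    words, i = [], 0
    while i < len(sentence):
        end = route[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words


print(segment("北京大学生"))
```

With these made-up counts, the path 北京 / 大学生 scores higher than 北京大学 / 生, which shows how frequency-weighted paths resolve an ambiguity that a greedy longest-match scan would get wrong.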
Jieba is a Python library that splits Chinese text into individual words, essential preprocessing for any project that analyzes, searches, or classifies Chinese language content, since Chinese has no spaces between words.
Mainly Python.
Setup difficulty is rated easy, with roughly 5 minutes to a first successful run.
Mainly data.
Verify against the repo before relying on details.