fxsjy/jieba

Analysis updated 2026-06-20

★ 34,931 | Python | Audience · data | Complexity · 2/5 | Setup · easy

Mindmap

mindmap
  root((repo))
    What it does
      Chinese word splitting
      Keyword extraction
    Segmentation modes
      Precise mode
      Full mode
      Search engine mode
    Extra features
      Part-of-speech tags
      TF-IDF keywords
      TextRank keywords
    Who uses it
      NLP engineers
      Data scientists


What do people build with it?

USE CASE 1

Tokenize Chinese text into words before training a sentiment analysis or text classification model

USE CASE 2

Build a Chinese search engine that indexes word segments so users can find partial phrase matches

USE CASE 3

Extract keywords from Chinese articles using TF-IDF or TextRank for content summarization

USE CASE 4

Add part-of-speech tagging to a Chinese text pipeline to identify nouns, verbs, and named locations

What is it built with?

Python

How does it compare?

	fxsjy/jieba	wshobson/agents	geekcomputers/python
Stars	34,931	34,878	35,001
Language	Python	Python	Python
Setup difficulty	easy	moderate	easy
Complexity	2/5	3/5	1/5
Audience	data	developer	vibe coder

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · easy | Time to first run · 5 min

Install via pip; there are no system dependencies. The deep-learning mode requires PaddlePaddle, which is installed separately.

In plain English

Jieba is a Chinese text segmentation library for Python. The problem it solves is fundamental to processing Chinese text on computers: unlike English, written Chinese does not place spaces between words, so before you can analyze or search text, software has to work out where one word ends and the next begins.

The library offers several segmentation modes suited to different tasks. Precise mode tries to split a sentence into the most accurate set of words, which is best for text analysis. Full mode scans the sentence and quickly extracts every word it can find, but it does not resolve ambiguous cases. Search engine mode builds on precise mode by further splitting long words, which is useful when building search indexes that should also match partial phrases. There is also a deep-learning mode based on PaddlePaddle for higher accuracy and part-of-speech tagging.

Under the hood, jieba builds a directed acyclic graph of all possible word combinations in a sentence, then uses dynamic programming to find the most probable segmentation based on word frequency. For words that are not in its dictionary, it falls back on a Hidden Markov Model (a statistical technique for inferring hidden states from sequences).

Beyond splitting text into words, jieba supports keyword extraction using two approaches: TF-IDF, which scores words by how distinctive they are to the document, and TextRank, which scores words by how centrally they connect to other words. It also returns the part of speech (noun, verb, location name, etc.) for each word, and can report the exact character positions where each word starts and ends.

You would use jieba in any Python project that processes Chinese text: search engines, chatbots, sentiment analysis pipelines, document classifiers, or natural language processing research. The entire library is written in Python, is compatible with Python 2 and 3, and is installable via pip.
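The graph-and-dynamic-programming idea described above can be illustrated with a toy sketch. This is not jieba's source code: the dictionary `TOY_FREQ`, its frequencies, and the function names are all hypothetical, and the Hidden Markov Model step is left out entirely. It only shows how a word graph plus a right-to-left pass can pick the most probable split.

```python
import math

# Hypothetical toy frequencies; a real dictionary is far larger.
TOY_FREQ = {
    "研究": 50, "研究生": 20, "生命": 40, "命": 5, "生": 10,
    "起源": 30, "的": 200,
}
LOG_TOTAL = math.log(sum(TOY_FREQ.values()))
MAX_WORD_LEN = max(len(word) for word in TOY_FREQ)


def build_dag(sentence):
    """Map each start index to the end indices of dictionary words."""
    dag = {}
    for start in range(len(sentence)):
        limit = min(len(sentence), start + MAX_WORD_LEN)
        ends = [
            end for end in range(start, limit)
            if sentence[start:end + 1] in TOY_FREQ
        ]
        # An unknown character still forms a one-character edge.
        dag[start] = ends or [start]
    return dag


def best_route(sentence, dag):
    """Right-to-left dynamic programming: best (log-prob, end) per start."""
    n = len(sentence)
    route = {n: (0.0, n)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (
                # Unknown words are scored as if they had frequency 1.
                math.log(TOY_FREQ.get(sentence[start:end + 1], 1))
                - LOG_TOTAL
                + route[end + 1][0],
                end,
            )
            for end in dag[start]
        )
    return route


def segment(sentence):
    """Follow the best route from the first character to the last."""
    route = best_route(sentence, build_dag(sentence))
    words, start = [], 0
    while start < len(sentence):
        end = route[start][1]
        words.append(sentence[start:end + 1])
        start = end + 1
    return words


if __name__ == "__main__":
    print(segment("研究生命的起源"))
```

With these toy frequencies, the sentence 研究生命的起源 has two candidate openings, 研究 and 研究生. The dynamic-programming pass prefers 研究 / 生命 / 的 / 起源 because 生命 is far more frequent than the leftover single character 命.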

Copy-paste prompts

Prompt 1

Using jieba in Python, segment the Chinese sentence '我今天去北京出差' into individual words and print each word with its part of speech.

Prompt 2

Help me use jieba's TF-IDF keyword extraction on a list of Chinese news articles to find the top 10 most distinctive terms per article.

Prompt 3

Show me how to add a custom word to jieba's dictionary so it always recognizes a brand name or technical term as a single token rather than splitting it.

Prompt 4

Using jieba's search engine mode, tokenize a list of Chinese product descriptions for indexing into Elasticsearch.

Frequently asked questions

What is jieba?

Jieba is a Python library that splits Chinese text into individual words. This is essential preprocessing for any project that analyzes, searches, or classifies Chinese-language content, since written Chinese has no spaces between words.

What language is jieba written in?

Python. No other languages are listed in the stack.

How hard is jieba to set up?

Setup difficulty is rated easy, with roughly 5 minutes to a first successful run.

Who is jieba for?

Mainly data practitioners, such as NLP engineers and data scientists.

Verify against the repo before relying on details.