explaingit

fxsjy/jieba

Analysis updated 2026-06-20

34,931 stars · Python · Audience: data · Complexity: 2/5 · Setup: easy

TLDR

Jieba is a Python library that splits Chinese text into individual words, essential preprocessing for any project that analyzes, searches, or classifies Chinese language content, since Chinese has no spaces between words.

Mindmap

mindmap
  root((repo))
    What it does
      Chinese word splitting
      Keyword extraction
    Segmentation modes
      Precise mode
      Full mode
      Search engine mode
    Extra features
      Part-of-speech tags
      TF-IDF keywords
      TextRank keywords
    Who uses it
      NLP engineers
      Data scientists

What do people build with it?

USE CASE 1

Tokenize Chinese text into words before training a sentiment analysis or text classification model

USE CASE 2

Build a Chinese search engine that indexes word segments so users can find partial phrase matches

USE CASE 3

Extract keywords from Chinese articles using TF-IDF or TextRank for content summarization

USE CASE 4

Add part-of-speech tagging to a Chinese text pipeline to identify nouns, verbs, and named locations

What is it built with?

Python

How does it compare?

Repo                fxsjy/jieba   wshobson/agents   geekcomputers/python
Stars               34,931        34,878            35,001
Language            Python        Python            Python
Setup difficulty    easy          moderate          easy
Complexity          2/5           3/5               1/5
Audience            data          developer         vibe coder

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · easy · Time to first run · 5 min

Install via pip with no system dependencies. The deep-learning mode requires PaddlePaddle, which must be installed separately.

In plain English

Jieba is a Chinese text segmentation library for Python. It solves a problem fundamental to processing Chinese text on computers: unlike English, written Chinese does not place spaces between words, so before you can analyze or search text, software has to work out where one word ends and the next begins.

The library offers several segmentation modes suited to different tasks. Precise mode tries to split a sentence into the most accurate set of words, which is best for text analysis. Full mode scans the sentence and extracts every possible word very quickly, though it cannot resolve ambiguous cases. Search engine mode works on top of precise mode by further splitting long words, which is useful when building search indexes where you want to match partial phrases too. There is also a deep-learning mode based on PaddlePaddle for higher accuracy and part-of-speech tagging.

Under the hood, jieba builds a directed acyclic graph of all possible word combinations in a sentence from its dictionary, then uses dynamic programming to find the most probable path based on word frequency. For words that are not in the dictionary, it falls back to a Hidden Markov Model (a statistical technique for inferring hidden states from sequences).

Beyond splitting text into words, jieba supports keyword extraction using two approaches: TF-IDF, which scores words by how distinctive they are to the document, and TextRank, which scores words by how centrally they connect to other words. It also returns the part of speech (noun, verb, location name, etc.) for each word, and can report the exact character positions where each word starts and ends.

You would use jieba in any Python project that processes Chinese text: search engines, chatbots, sentiment analysis pipelines, document classifiers, or natural language processing research. The library is written entirely in Python, supports both Python 2 and 3, and can be installed via pip.
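
The graph-plus-dynamic-programming idea described above can be illustrated with a toy sketch. This is not jieba's actual code: the dictionary `TOY_FREQ`, its frequency counts, and the helper names `build_dag`, `best_route`, `segment`, and `all_words` are all hypothetical, and the Hidden Markov Model step for unknown words is left out.

```python
import math

# Hypothetical toy dictionary: word -> frequency count (not jieba's real data).
TOY_FREQ = {
    "我": 50, "来": 30, "到": 40, "来到": 20,
    "北": 5, "京": 5, "北京": 60,
    "清": 5, "华": 5, "大": 30, "学": 20,
    "清华": 15, "华大": 2, "大学": 40, "清华大学": 25,
}
TOTAL = sum(TOY_FREQ.values())


def build_dag(sentence, freq):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    n = len(sentence)
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in freq]
        # Fall back to a single character so every position has an outgoing edge.
        dag[i] = ends or [i]
    return dag


def best_route(sentence, dag, freq, total):
    """Right-to-left dynamic programming that maximizes summed log probability."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    log_total = math.log(total)
    for i in range(n - 1, -1, -1):
        # Unknown single characters are scored as if they had frequency 1.
        route[i] = max(
            (math.log(freq.get(sentence[i:j + 1], 1)) - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    return route


def segment(sentence, freq=TOY_FREQ, total=TOTAL):
    """Follow the best route from left to right and collect the words."""
    dag = build_dag(sentence, freq)
    route = best_route(sentence, dag, freq, total)
    words, i = [], 0
    while i < len(sentence):
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


def all_words(sentence, freq=TOY_FREQ):
    """Every edge in the graph, loosely analogous to what full mode enumerates."""
    dag = build_dag(sentence, freq)
    return [sentence[i:j + 1] for i in range(len(sentence)) for j in dag[i]]


print(segment("我来到北京清华大学"))
print(all_words("我来到北京清华大学"))
```

With these made-up counts, the single word 清华大学 scores higher than the split 清华 + 大学, so the best route keeps it whole, while `all_words` still lists the overlapping candidates such as 清华 and 华大. Real results depend on the actual dictionary frequencies.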

Copy-paste prompts

Prompt 1
Using jieba in Python, segment the Chinese sentence '我今天去北京出差' into individual words and print each word with its part of speech.
Prompt 2
Help me use jieba's TF-IDF keyword extraction on a list of Chinese news articles to find the top 10 most distinctive terms per article.
Prompt 3
Show me how to add a custom word to jieba's dictionary so it always recognizes a brand name or technical term as a single token rather than splitting it.
Prompt 4
Using jieba's search engine mode, tokenize a list of Chinese product descriptions for indexing into Elasticsearch.

Frequently asked questions

What is jieba?

Jieba is a Python library that splits Chinese text into individual words, essential preprocessing for any project that analyzes, searches, or classifies Chinese language content, since Chinese has no spaces between words.

What language is jieba written in?

Python.

How hard is jieba to set up?

Setup difficulty is rated easy, with roughly 5 minutes to a first successful run.

Who is jieba for?

Mainly data practitioners, such as NLP engineers and data scientists.

Verify against the repo before relying on details.