explaingit

google/sentencepiece

11,815 · C++ · Audience · researcher · Complexity · 2/5 · Setup · easy

TLDR

A fast text tokenizer from Google that converts raw text into token sequences for language models using BPE or unigram algorithms, with no language-specific preprocessing required.

Mindmap

mindmap
  root((sentencepiece))
    What it does
      Text to token IDs
      Token IDs to text
      No preprocessing needed
    Algorithms
      Byte-pair encoding
      Unigram language model
    Features
      50k sentences per second
      Fixed vocabulary size
      Reproducible output
    Use Cases
      LLM preprocessing
      Multilingual tokenization
      Training data pipeline

Things people build with this

USE CASE 1

Train a custom tokenizer on your text corpus with a fixed vocabulary size ready for use in a neural network model.

USE CASE 2

Convert raw sentences into integer token ID sequences that a language model can process as input.

USE CASE 3

Decode model output token IDs back into human-readable text after inference.

USE CASE 4

Tokenize languages like Chinese or Japanese that have no word-boundary spaces, without any preprocessing step.
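The round trip in use cases 3 and 4 depends on spaces being treated as ordinary symbols. SentencePiece escapes each space with the meta symbol "▁" (U+2581) before segmenting, so decoding is just joining the pieces and restoring the spaces. The sketch below illustrates that idea in plain Python. It is not the library's implementation: `to_pieces` is a hypothetical stand-in that splits into single characters, where a trained model would produce longer subword pieces.

```python
# Illustrative sketch only; a trained SentencePiece model replaces to_pieces().
META = "\u2581"  # "▁", the meta symbol used to mark spaces


def escape_spaces(text: str) -> str:
    """Replace each space with the meta symbol so spaces are ordinary symbols."""
    return text.replace(" ", META)


def to_pieces(text: str) -> list[str]:
    """Hypothetical stand-in for a trained model: one piece per character."""
    return list(escape_spaces(text))


def detokenize(pieces: list[str]) -> str:
    """Join the pieces and turn the meta symbol back into spaces."""
    return "".join(pieces).replace(META, " ")


# The same code path handles text with spaces and text without them.
for sentence in ("Hello world.", "こんにちは世界"):
    pieces = to_pieces(sentence)
    assert detokenize(pieces) == sentence
    print(pieces)
```

Because no word splitting happens before segmentation, the Japanese sentence needs no special handling. The sketch assumes the input does not already contain the "▁" character.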

Tech stack

C++ · Python

Getting it running

Difficulty · easy · Time to first run · 5 min

In plain English

SentencePiece is a tool from Google that prepares text for machine learning language models. Before a language model can process a sentence, the text has to be broken into small units called tokens. SentencePiece handles that conversion, turning raw text into a sequence of integer token IDs the model can work with, and converting those IDs back into readable text afterward.

The central design choice is that SentencePiece works directly from raw text, with no language-specific preprocessing. Many tokenizers expect text to be split into words first, which is harder for languages like Chinese or Japanese that do not put spaces between words. SentencePiece avoids that requirement by treating the input as a plain character sequence in which a space is just another symbol.

It supports two main algorithms for deciding how to split text: byte-pair encoding (BPE), which starts from single characters and repeatedly merges the most frequent adjacent pair of symbols into a new unit, and a unigram language model, which starts from a large candidate vocabulary, prunes it down to the target size, and picks the most probable segmentation of each sentence. In both cases the vocabulary size is fixed before training, which is what most neural network models require.

The tool is fast, processing around 50,000 sentences per second, and has a small memory footprint. Once a model file is trained, the same file always produces the same tokenization, which makes results reproducible. Python bindings are available via pip, and there is also a C++ library for use in compiled applications.
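To make the byte-pair encoding idea concrete, here is a minimal toy sketch of the merge loop in plain Python. It is not SentencePiece's implementation: the function names and the tiny corpus are made up for illustration, each list item is treated as its own unit, and ties between equally frequent pairs are broken by first occurrence, which is an assumption of this sketch.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all units, weighted by frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # max() returns the first maximal item, so ties go to the earliest pair seen.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(symbols, pair):
    """Replace every occurrence of the pair in a symbol tuple with one symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(corpus, num_merges):
    """Learn up to num_merges merge rules, starting from single characters."""
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break  # nothing left to merge
        merges.append(pair)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, pair)] += freq
        words = new_words
    return merges


# Hypothetical toy corpus; a real run would use a large text file.
toy_corpus = ["low", "low", "lower", "lowest", "newest", "newest"]
merges = learn_bpe(toy_corpus, num_merges=3)
print(merges)
```

Each merge adds one symbol to the vocabulary, so the number of merges controls the final vocabulary size. That is how a target such as 32,000 can be fixed before training.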

Copy-paste prompts

Prompt 1
Using the sentencepiece Python library (pip install sentencepiece), train a BPE tokenizer with a vocabulary of 32,000 on my text file and then encode a new sentence into token IDs.
Prompt 2
Show me how to use google/sentencepiece to tokenize a Japanese sentence without any preprocessing, then decode the resulting token IDs back to readable text.
Prompt 3
Help me integrate a pre-trained SentencePiece .model file into my PyTorch training loop to encode batches of raw strings into padded integer tensors.
Prompt 4
Write a Python script that trains a unigram SentencePiece tokenizer on a multilingual corpus, verifies that the same sentence always produces the same token sequence, and saves the model file.
Prompt 5
What is the difference between the BPE and unigram training modes in google/sentencepiece, and when should I choose one over the other for a language model project?

Verify against the repo before relying on details.