huggingface/tokenizers

★ 10,720 | Rust | Audience · researcher | Complexity · 2/5 | Setup · easy

Mindmap

mindmap
  root((tokenizers))
    What it does
      Text to tokens
      For AI models
    Methods
      Byte-Pair Encoding
      WordPiece
      Unigram
    Languages
      Rust core
      Python wrapper
      Node.js wrapper
      Ruby wrapper
    Features
      Fast processing
      Alignment tracking
      Padding and truncation
    Use cases
      Train new vocab
      Load pre-built vocab
      Production pipelines

Things people build with this

USE CASE 1

Prepare text data for a language model by tokenizing it with BPE, WordPiece, or Unigram methods.

USE CASE 2

Train a custom tokenizer vocabulary on your own text dataset in just a few lines of Python.

USE CASE 3

Process a gigabyte of text in under 20 seconds for large-scale dataset preparation.

USE CASE 4

Trace any token back to its exact position in the original text when building NLP pipelines.
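The alignment idea in Use Case 4 can be illustrated without the library. The sketch below is a simplified stand-in, not how tokenizers implements it: it assumes a plain regex split into words and punctuation, and the helper name `tokenize_with_offsets` is hypothetical. It records a (start, end) character span for each token so the token can be sliced back out of the original string.

```python
import re


def tokenize_with_offsets(text):
    """Split text into word and punctuation tokens, keeping (start, end) spans.

    Hypothetical helper for illustration only; a real tokenizer would also
    apply normalization and subword splitting before reporting offsets.
    """
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]


text = "Hello, world!"
pairs = tokenize_with_offsets(text)

for token, (start, end) in pairs:
    # Each span indexes back into the untouched original string.
    print(f"{token!r} -> text[{start}:{end}] = {text[start:end]!r}")
```

In the Python bindings, the encoding object returned by `Tokenizer.encode` exposes comparable spans through its `offsets` attribute.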

Tech stack

Rust · Python · Node.js · Ruby

Getting it running

Difficulty · easy | Time to first run · 5 min

In plain English

Before a language model can read text, it needs to break that text into smaller pieces called tokens. Words get split into fragments, punctuation gets separated out, and special markers get inserted. This library, from Hugging Face, is the software that does that job. It supports the most widely used tokenization methods in modern AI, including Byte-Pair Encoding, WordPiece, and Unigram.

The library is written in Rust, a programming language known for speed. The README notes it can process a gigabyte of text in under 20 seconds on a standard server CPU. Despite being written in Rust, you do not need to know Rust to use it. Hugging Face provides ready-made wrappers for Python, Node.js, and Ruby, and the Python package is installable with a single pip command.

Aside from splitting text into tokens, the library handles the surrounding preparation steps that AI models require: padding sequences to a fixed length, truncating sequences that are too long, and inserting any special tokens a particular model expects. It also tracks alignment, meaning you can trace any token back to exactly where it appeared in the original input text, which is useful when you need to highlight specific spans in the original sentence.

You can either train a new tokenizer vocabulary from scratch on your own text files, or load a pre-built vocabulary. The Python API keeps both options to just a few lines of code.

Hugging Face created and maintains the library. It is used across their broader ecosystem of AI tools and is intended for both research and production deployment.
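As a minimal sketch of the training, padding, and truncation steps described above, the snippet below trains a small BPE vocabulary in memory and then encodes a batch to a fixed length. The toy corpus, the vocabulary size of 200, the fixed length of 8, and the `[UNK]` and `[PAD]` token names are assumptions chosen for illustration; a real run would train on your own text files and use the special tokens your model expects. It assumes the package has been installed with `pip install tokenizers`.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy corpus (assumption): stands in for lines read from your own dataset.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps while the quick fox runs",
    "tokenizers split text into smaller pieces",
]

# Build an untrained BPE tokenizer that first splits on whitespace and punctuation.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Learn merges from the in-memory corpus.
trainer = BpeTrainer(
    vocab_size=200,
    special_tokens=["[UNK]", "[PAD]"],
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Clip long inputs to 8 tokens and pad short ones up to 8 tokens.
tokenizer.enable_truncation(max_length=8)
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
    length=8,
)

short_enc, long_enc = tokenizer.encode_batch(
    ["the dog", "the quick brown fox jumps over the lazy dog"]
)

print(short_enc.tokens, short_enc.attention_mask)
print(long_enc.tokens, long_enc.attention_mask)
```

Both encodings come back with the same length, and the attention mask marks padded positions with 0 so a model can ignore them. To load a pre-built vocabulary instead of training one, the library provides `Tokenizer.from_file` for a saved tokenizer JSON file.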

Copy-paste prompts

Prompt 1

Using the Hugging Face tokenizers Python library, show me how to load a pre-built tokenizer and tokenize a list of sentences.

Prompt 2

How do I train a new BPE tokenizer vocabulary from a folder of text files using the tokenizers library?

Prompt 3

Show me how to tokenize text and get the character offsets so I can map tokens back to the original string.

Prompt 4

What is the difference between BPE, WordPiece, and Unigram tokenization, and when should I use each?

Verify against the repo before relying on details.