Normalize thousands of documents by replacing multiple company name variants with one canonical name in a single fast pass.
Extract all mentions of a large predefined keyword list from a text corpus without slowing down as the list grows.
Redact sensitive terms from documents at scale by replacing them with placeholder text.
Tag documents by topic by detecting which category keywords appear in each document.
FlashText is a Python library for finding and replacing words or phrases in text. You give it a list of keywords to look for, and it either pulls them out of any text you pass in, or swaps them for replacement terms. It is built on a custom algorithm that performs both jobs much faster than regular expressions when the list of keywords is large.

The core use case is normalizing text that uses multiple names for the same thing. For example, you might teach it that "Big Apple" and "NYC" both refer to "New York", then run it over thousands of documents to extract or replace those mentions with the standard name.

You can load keywords one at a time, from a list, or from a dictionary that maps canonical names to their variants. Keywords can also be removed later, and the processor tracks all of them so you can inspect or count what it knows. By default the library is case-insensitive, but you can switch it to case-sensitive mode. When extracting keywords, you can also ask for span information, which returns the start and end character positions of each match alongside the matched term. This is useful if you need to know exactly where in the text something appeared.

The README includes benchmark charts comparing FlashText to Python's built-in regular expression module. The advantage grows as the keyword list grows: with hundreds or thousands of terms, FlashText stays roughly constant in speed while regex slows down proportionally. This makes it practical for tasks like redacting sensitive terms, tagging documents by topic, or cleaning inconsistent terminology across a large dataset.

Installation is a single pip command. The library has no unusual dependencies and the README includes short examples for every feature it describes.
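To make the "Big Apple" / "NYC" example concrete, and to show why lookup time need not grow with the keyword list, here is a minimal pure-Python sketch. It is not FlashText's code or API: the class `KeywordTrie` and its methods `add`, `add_from_dict`, `extract`, and `replace` are hypothetical names invented for illustration. The sketch assumes a character trie with whole-word matching, and it assumes lowercasing does not change string length (true for ASCII text).

```python
def _is_word_char(ch):
    """Treat letters, digits, and underscore as part of a word."""
    return ch.isalnum() or ch == "_"


class KeywordTrie:
    """Illustrative stand-in for a keyword processor (hypothetical API)."""

    _END = "_end_"  # multi-character sentinel, cannot collide with a 1-char key

    def __init__(self, case_sensitive=False):
        self.case_sensitive = case_sensitive
        self.root = {}

    def add(self, keyword, canonical=None):
        """Store one keyword; matches report `canonical` (or the keyword)."""
        key = keyword if self.case_sensitive else keyword.lower()
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[self._END] = canonical or keyword

    def add_from_dict(self, mapping):
        """Load a {canonical name: [variants]} dictionary."""
        for canonical, variants in mapping.items():
            for variant in variants:
                self.add(variant, canonical)

    def extract(self, text, span_info=False):
        """Return canonical names found in text, optionally with (start, end)."""
        haystack = text if self.case_sensitive else text.lower()
        results, i, n = [], 0, len(haystack)
        while i < n:
            # Only start a match at the beginning of a word.
            if i == 0 or not _is_word_char(haystack[i - 1]):
                node, j, best = self.root, i, None
                while j < n and haystack[j] in node:
                    node = node[haystack[j]]
                    j += 1
                    # Accept only if the keyword also ends at a word boundary;
                    # keep walking so the longest match wins.
                    if self._END in node and (j == n or not _is_word_char(haystack[j])):
                        best = (node[self._END], i, j)
                if best:
                    results.append(best if span_info else best[0])
                    i = best[2]
                    continue
            i += 1
        return results

    def replace(self, text):
        """Return text with every matched variant swapped for its canonical name."""
        out, last = [], 0
        for name, start, end in self.extract(text, span_info=True):
            out.append(text[last:start])
            out.append(name)
            last = end
        out.append(text[last:])
        return "".join(out)


trie = KeywordTrie()
trie.add_from_dict({"New York": ["Big Apple", "NYC"]})
sentence = "I love Big Apple and NYC."
print(trie.extract(sentence))                  # ['New York', 'New York']
print(trie.extract(sentence, span_info=True))  # [('New York', 7, 16), ('New York', 21, 24)]
print(trie.replace(sentence))                  # I love New York and New York.
```

The work done at each text position depends on the length of the longest keyword sharing that prefix, not on how many keywords are stored. That is one plausible reason a dictionary-walk approach can stay flat while an alternation regex over the same list slows down.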
Verify against the repo before relying on details.