Expand a small text dataset by generating synonym replacements, typos, or paraphrases so your AI model trains on more varied examples.
Improve speech recognition models by augmenting audio clips with noise, pitch shifts, and time shifts that mimic real recording conditions.
Build a multilingual training set by using back-translation to create natural paraphrase variants in multiple languages.
Chain multiple text or audio transformations into a single reusable pipeline to apply consistent augmentation across your whole dataset.
Install via pip or conda. Requires Python 3.5+. Works on Linux and Windows. Some augmenters need additional model downloads (e.g. BERT).
nlpaug is a Python library for creating variations of text and audio data to use in machine learning training. The core idea is data augmentation: if you have a limited set of examples for training an AI model, you can use nlpaug to generate modified copies of those examples, which helps the model become more accurate without collecting additional real-world data.

For text, the library can make changes at the character level, word level, or sentence level. Character-level augmenters simulate typos such as keyboard distance errors or OCR misreads. Word-level augmenters can replace words with synonyms or antonyms, insert or delete words randomly, correct or introduce spelling mistakes, or use language models such as BERT to find contextually fitting substitutions. Sentence-level augmenters can summarize paragraphs or generate new sentences using language generation models. A back-translation method translates text to another language and then back, which naturally produces paraphrase variations.

For audio, the library can crop sections, adjust volume or pitch, inject noise, shift the audio forward or backward in time, or apply frequency masking to spectrograms. These augmentations help train models that need to handle variations in recording quality and environment.

The library is organized around two core concepts. An Augmenter is a single transformation step. A Flow chains several augmenters together in a pipeline, either applying them all in sequence or randomly applying a subset of them. This makes it possible to define complex augmentation strategies in a few lines of code.

Installation is done with pip or conda. The library supports Python 3.5 and above and works on both Linux and Windows. Example notebooks in the repository cover common use cases including multilingual text augmentation and custom augmenter creation.
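The Augmenter and Flow concepts described above can be illustrated with a small standard-library sketch. This is not nlpaug's API: the class names (KeyboardTypo, RandomWordDelete, Sequential, Sometimes), the neighbor table, and the rate and p parameters are all hypothetical stand-ins chosen for illustration, so check the repository for the real classes and signatures.

```python
import random

# Hypothetical stand-ins for illustration only; these are NOT nlpaug's classes.
# A tiny, incomplete map of keys to physically adjacent keys on a QWERTY layout.
KEYBOARD_NEIGHBORS = {
    "a": "qwsz", "e": "wrsd", "i": "uojk", "o": "ipkl",
    "s": "awedxz", "t": "rfgy", "n": "bhjm",
}


class KeyboardTypo:
    """Character-level step: swap some letters for an adjacent key."""

    def __init__(self, rate=0.1, rng=None):
        self.rate = rate
        self.rng = rng if rng is not None else random.Random()

    def augment(self, text):
        out = []
        for ch in text:
            neighbors = KEYBOARD_NEIGHBORS.get(ch)
            if neighbors and self.rng.random() < self.rate:
                out.append(self.rng.choice(neighbors))
            else:
                out.append(ch)
        return "".join(out)


class RandomWordDelete:
    """Word-level step: drop each word with probability `rate`."""

    def __init__(self, rate=0.1, rng=None):
        self.rate = rate
        self.rng = rng if rng is not None else random.Random()

    def augment(self, text):
        kept = [w for w in text.split() if self.rng.random() >= self.rate]
        # Fall back to the input rather than return an empty string.
        return " ".join(kept) if kept else text


class Sequential:
    """Flow: apply every step, in order."""

    def __init__(self, steps):
        self.steps = steps

    def augment(self, text):
        for step in self.steps:
            text = step.augment(text)
        return text


class Sometimes:
    """Flow: apply each step independently with probability `p`."""

    def __init__(self, steps, p=0.5, rng=None):
        self.steps = steps
        self.p = p
        self.rng = rng if rng is not None else random.Random()

    def augment(self, text):
        for step in self.steps:
            if self.rng.random() < self.p:
                text = step.augment(text)
        return text


rng = random.Random(0)  # seeded so repeated runs give the same variants
pipeline = Sequential([RandomWordDelete(0.2, rng), KeyboardTypo(0.1, rng)])
sentence = "the quick brown fox jumps over the lazy dog"
variants = [pipeline.augment(sentence) for _ in range(3)]
for v in variants:
    print(v)
```

Because a flow exposes the same augment method as a single step, flows can be nested inside other flows, which is one way a pipeline stays reusable across a whole dataset.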
Verify against the repo before relying on details.