explaingit

makcedward/nlpaug

4,657 | Jupyter Notebook | Audience · data | Complexity · 2/5 | Setup · easy

TLDR

Python library for generating modified copies of text and audio data to boost AI model training without collecting more real-world examples. Supports character, word, and sentence-level text changes plus audio transformations, all chainable into pipelines.

Mindmap

mindmap
  root((nlpaug))
    Text Augmentation
      Character level typos
      Word synonyms antonyms
      BERT substitutions
      Back translation
    Audio Augmentation
      Noise injection
      Pitch and volume
      Time shifting
      Frequency masking
    Pipeline System
      Sequential flow
      Random subset flow
    Installation
      pip install
      conda install
    Examples
      Multilingual text
      Custom augmenters

Things people build with this

USE CASE 1

Expand a small text dataset by generating synonym replacements, typos, or paraphrases so your AI model trains on more varied examples.

USE CASE 2

Improve speech recognition models by augmenting audio clips with noise, pitch shifts, and time shifts that mimic real recording conditions.

USE CASE 3

Build a multilingual training set by using back-translation to create natural paraphrase variants in multiple languages.

USE CASE 4

Chain multiple text or audio transformations into a single reusable pipeline to apply consistent augmentation across your whole dataset.

Tech stack

Python · BERT · Jupyter Notebook · pip · conda

Getting it running

Difficulty · easy | Time to first run · 30 min

Install via pip or conda. Requires Python 3.5+. Works on Linux and Windows. Some augmenters need additional model downloads (e.g. BERT).

License terms are not mentioned in the explanation.

In plain English

nlpaug is a Python library for creating variations of text and audio data to use in machine learning training. The core idea is data augmentation: if you have a limited set of examples for training an AI model, you can use nlpaug to generate modified copies of those examples, which helps the model become more accurate without collecting additional real-world data.

For text, the library can make changes at the character level, word level, or sentence level. Character-level augmenters simulate typos such as keyboard distance errors or OCR misreads. Word-level augmenters can replace words with synonyms or antonyms, insert or delete words randomly, correct or introduce spelling mistakes, or use language models such as BERT to find contextually fitting substitutions. Sentence-level augmenters can summarize paragraphs or generate new sentences using language generation models. A back-translation method translates text to another language and then back, which naturally produces paraphrase variations.

For audio, the library can crop sections, adjust volume or pitch, inject noise, shift the audio forward or backward in time, or apply frequency masking to spectrograms. These augmentations help train models that need to handle variations in recording quality and environment.

The library is organized around two core concepts. An Augmenter is a single transformation step. A Flow chains several augmenters together in a pipeline, either applying them all in sequence or randomly applying a subset of them. This makes it possible to define complex augmentation strategies in a few lines of code.

Installation is done with pip or conda. The library supports Python 3.5 and above and works on both Linux and Windows. Example notebooks in the repository cover common use cases including multilingual text augmentation and custom augmenter creation.
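To make the Augmenter and Flow idea concrete, here is a minimal standard-library sketch of the pattern. It is not nlpaug's code or API: the class names, the `rate` and `p` parameters, and the tiny `KEY_NEIGHBOURS` table are all hypothetical stand-ins chosen for illustration. It shows two single-step augmenters (a keyboard-distance typo and a random word deletion) and two flows, one that applies every step in order and one that applies a random subset.

```python
import random

# Hypothetical, deliberately tiny map of neighbouring QWERTY keys.
KEY_NEIGHBOURS = {
    "a": "qsz", "e": "wrd", "i": "uok", "o": "ipl", "s": "adw", "t": "ryg",
}


class KeyboardTypo:
    """Augmenter: replace some characters with an adjacent key."""

    def __init__(self, rate, rng):
        self.rate = rate  # probability of altering each eligible character
        self.rng = rng

    def augment(self, text):
        out = []
        for ch in text:
            if ch in KEY_NEIGHBOURS and self.rng.random() < self.rate:
                out.append(self.rng.choice(KEY_NEIGHBOURS[ch]))
            else:
                out.append(ch)
        return "".join(out)


class RandomWordDelete:
    """Augmenter: drop words at random, always keeping at least one."""

    def __init__(self, rate, rng):
        self.rate = rate  # probability of dropping each word
        self.rng = rng

    def augment(self, text):
        words = text.split()
        kept = [w for w in words if self.rng.random() >= self.rate]
        if not kept and words:
            kept = [self.rng.choice(words)]
        return " ".join(kept)


class Sequential:
    """Flow: apply every augmenter, in order."""

    def __init__(self, augmenters):
        self.augmenters = augmenters

    def augment(self, text):
        for aug in self.augmenters:
            text = aug.augment(text)
        return text


class Sometimes:
    """Flow: apply each augmenter with probability p, so a random subset runs."""

    def __init__(self, augmenters, p, rng):
        self.augmenters = augmenters
        self.p = p
        self.rng = rng

    def augment(self, text):
        for aug in self.augmenters:
            if self.rng.random() < self.p:
                text = aug.augment(text)
        return text


rng = random.Random(0)  # seeded so repeated runs give the same variants
steps = [KeyboardTypo(0.3, rng), RandomWordDelete(0.2, rng)]
sentence = "the quick brown fox jumps over the lazy dog"
print(Sequential(steps).augment(sentence))
print(Sometimes(steps, 0.5, rng).augment(sentence))
```

Because both flows expose the same `augment` method as the individual steps, a flow can itself be used as a step inside another flow. For real projects, check the repository's example notebooks for the library's actual class names and parameters.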

Copy-paste prompts

Prompt 1
Using the nlpaug library, write Python code that takes a list of customer review sentences and generates 3 augmented versions of each using word-level synonym replacement and back-translation.
Prompt 2
Show me how to use nlpaug's Flow to chain a keyboard-typo augmenter and a BERT word substitution augmenter together, then apply them to a pandas DataFrame column of text.
Prompt 3
Write a Python script using nlpaug to augment an audio dataset folder: apply noise injection and pitch shift to every .wav file and save the results to a new folder.
Prompt 4
Using nlpaug, create a custom augmenter class that randomly removes punctuation from sentences, then integrate it into a Sequential flow with an existing word-deletion augmenter.
Prompt 5
Explain and demonstrate how to use nlpaug's back-translation augmenter to create paraphrase variations of English sentences for fine-tuning a text classification model.

Verify against the repo before relying on details.