explaingit

piskvorky/gensim

16,409 · Python · Audience: data · Complexity: 3/5 · Setup: moderate

TLDR

Gensim is a Python library for topic modelling and document similarity over large text collections. It streams corpora instead of loading them into memory and includes word2vec, LDA, and LSA implementations.

Mindmap

mindmap
  root((gensim))
    What it does
      Topic modelling
      Document similarity
      Word embeddings
    Algorithms
      LDA
      LSA
      Random projections
      word2vec and fastText
    Key design
      Streaming corpora
      Memory-efficient
      Multi-core support
    Tech Stack
      Python
      NumPy and BLAS
    Audience
      NLP researchers
      Data scientists

Things people build with this

USE CASE 1

Run LDA topic modelling on a corpus of support tickets or articles to automatically surface recurring themes.

USE CASE 2

Train word2vec embeddings on a domain-specific text collection for use in a downstream NLP model.

USE CASE 3

Build a find-similar-documents search feature over a large archive without loading the whole corpus into memory.

USE CASE 4

Process a multi-gigabyte text dataset by streaming it through Gensim to avoid running out of RAM.

Tech stack

Python · NumPy · C · Fortran · BLAS

Getting it running

Difficulty: moderate · Time to first run: 30 min

Performance depends on the BLAS backend that NumPy is linked against; installing MKL or OpenBLAS can significantly speed up training on large corpora.
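
A quick, heuristic way to see which BLAS backend your NumPy build reports is to capture the output of `numpy.show_config()`. The output format differs between NumPy versions, so the substring check below is only a rough hint, and the list of backend names is an assumption, not an exhaustive one.

```python
import contextlib
import io

import numpy as np

# show_config() prints build information to stdout; capture it as text.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    np.show_config()
config_text = buffer.getvalue()

# Look for common backend names in the captured text (heuristic only).
for name in ("mkl", "openblas", "accelerate", "blis"):
    if name in config_text.lower():
        print("BLAS backend hint:", name)
```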

In plain English

Gensim is a Python library for natural-language-processing work that involves digging through very large text collections to find structure: discovering the hidden topics in a set of documents, indexing those documents, and looking up which ones are similar to a given query. The maintainers describe its audience as the natural language processing (NLP) and information retrieval (IR) communities.

The library is built around the idea that you should never have to load your whole corpus into memory at once. You hand Gensim a stream of documents, and its algorithms process them in chunks. Those algorithms include Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), Random Projections, Hierarchical Dirichlet Process and the word2vec family of word-embedding methods. There are efficient multi-core implementations of these algorithms, and LSA and LDA can also be run across a cluster of computers for very large jobs. Although Gensim itself is written in Python, the heavy lifting is delegated through NumPy to optimised Fortran and C numerical libraries (BLAS), which keeps it fast despite the high-level interface.

You would reach for Gensim if you have a large body of text (articles, support tickets, research papers, product descriptions) and want to figure out what themes run through it, build a "find me similar documents" feature, or train word vectors for a downstream model. It is installed with pip, depends on NumPy, and is currently in stable maintenance mode: bug fixes and documentation updates are still accepted, but new features are not. This summary is based on an excerpt of the README; the full README is longer.

Copy-paste prompts

Prompt 1
Show me how to train an LDA topic model on a folder of .txt files using gensim and print the top 10 words for each topic.
Prompt 2
Write a Python script using gensim word2vec to train embeddings on a custom corpus, then find the 5 most similar words to a given query word.
Prompt 3
How do I use gensim's similarity index to build a find-similar-documents feature over 100k articles without loading them all into memory at once?
Prompt 4
Stream a large JSON Lines file through gensim's Dictionary and build a bag-of-words corpus for LDA training without reading the whole file at once.

Verify against the repo before relying on details.