stanfordnlp/glove

★ 7,217 | C | Audience · researcher | Complexity · 3/5 | License | Setup · moderate

Mindmap

mindmap
  root((glove))
    What it does
      Words to vectors
      Semantic similarity
      Word analogies
    Pre-trained Vectors
      Wikipedia
      Common Crawl
      Twitter
    Training
      Custom corpus
      C command line
    Audience
      NLP researchers
      ML practitioners

Things people build with this

USE CASE 1

Download pre-trained GloVe vectors and use them as input features for a text classification or sentiment analysis model.
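One common way to turn word vectors into input features is to average the vectors of the words in a sentence. The sketch below assumes that approach. The 3-dimensional vectors are made-up toy values standing in for real GloVe rows, and `sentence_features` is a hypothetical helper name, not part of the repository.

```python
import numpy as np

# Toy 3-dimensional vectors standing in for real GloVe rows (hypothetical values).
toy_vectors = {
    "great": np.array([0.9, 0.1, 0.0]),
    "movie": np.array([0.1, 0.8, 0.1]),
    "awful": np.array([-0.9, 0.1, 0.0]),
}


def sentence_features(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# "tonight" is not in the toy vocabulary, so it is skipped.
features = sentence_features(["great", "movie", "tonight"], toy_vectors, dim=3)
print(features)
```

The resulting fixed-length array can be passed to any classifier that accepts numeric features. Averaging discards word order, so it is a baseline rather than a recommendation.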

USE CASE 2

Train custom word vectors on a domain-specific corpus where specialized vocabulary differs from general web language.

USE CASE 3

Measure semantic similarity between words or perform word analogy tasks using the vector arithmetic properties.
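The similarity and analogy operations can be sketched with cosine similarity and plain vector arithmetic. The vectors below are hand-picked toy values chosen so the analogy works out. Real GloVe vectors are much longer and give approximate rather than exact matches. The helper names `cosine` and `most_similar` are illustrative only.

```python
import numpy as np

# Hand-picked toy vectors (hypothetical values, not taken from any GloVe file).
toy = {
    "king": np.array([0.8, 0.9, 0.1]),
    "queen": np.array([0.8, 0.1, 0.9]),
    "man": np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([-0.7, 0.2, 0.2]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def most_similar(query_vec, vectors, exclude=(), topn=3):
    """Rank words by cosine similarity to query_vec, skipping excluded words."""
    scored = [(w, cosine(query_vec, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Analogy: king - man + woman, then look for the nearest remaining word.
target = toy["king"] - toy["man"] + toy["woman"]
best = most_similar(target, toy, exclude={"king", "man", "woman"}, topn=1)
print(best[0][0])
```

Excluding the three query words from the candidates is a common convention for analogy lookups, since the nearest neighbour of the combined vector is otherwise often one of the inputs.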

Tech stack

C · Python

Getting it running

Difficulty · moderate | Time to first run · 30 min

Using pre-trained vectors is just a download. Training custom vectors requires building the C code and preparing a tokenized corpus.
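Preparing a tokenized corpus can be sketched as follows, assuming the training tools read plain text in which tokens are separated by whitespace and each document sits on its own line. The regular-expression tokenizer is a deliberately simple stand-in, and `tokenize` and `prepare_corpus` are hypothetical helper names. A real project would likely use a tokenizer suited to its domain.

```python
import re


def tokenize(line):
    """Lowercase and split into word-like tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", line.lower())


def prepare_corpus(raw_lines):
    """Return one whitespace-separated token line per non-empty input line."""
    out = []
    for line in raw_lines:
        tokens = tokenize(line)
        if tokens:
            out.append(" ".join(tokens))
    return out


raw = ["The patient's ECG was normal.", "", "No ST-elevation observed!"]
corpus_lines = prepare_corpus(raw)
print("\n".join(corpus_lines))
```

Writing `corpus_lines` to a single text file gives input of the assumed shape. The exact expectations of the C tools should be checked against the repository before training.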

Use freely in commercial and open-source projects, with attribution required when distributing.

In plain English

GloVe, which stands for Global Vectors for Word Representation, is a research project from Stanford University that turns words into lists of numbers so that computers can work with language mathematically. Each word in a vocabulary is assigned a vector, which is a fixed-length sequence of decimal numbers. The key property of these vectors is that words with similar meanings end up close to each other in the vector space. The classic example is that the vector for "king" minus the vector for "man" plus the vector for "woman" comes out close to the vector for "queen."

This kind of word representation is called a word embedding, and it was one of the foundational techniques in natural language processing before the era of large language models. Many machine learning systems that work with text still use, or have historically used, these vectors as a starting point.

The repository offers two ways to use GloVe. The first is to download pre-trained vectors that Stanford has already computed from large text collections. Options include vectors trained on Wikipedia, on Common Crawl (a large web crawl covering billions of web pages), and on Twitter. A 2024 update added vectors trained on the Dolma dataset, a 220-billion-word open-source text collection, and Stanford published a report analyzing the quality of the updated vectors. These pre-trained files can be downloaded and used directly in other projects without any training step.

The second option is to train your own vectors on a custom body of text, which is useful when the language of a specialized field differs significantly from general web text. The training code is written in C and runs from the command line.

The project is licensed under the Apache 2.0 license, which allows use in commercial and open-source applications.
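To make "a word mapped to a fixed-length sequence of decimal numbers" concrete, here is a minimal loader sketch. It assumes the pre-trained files are plain text with one word per line followed by its space-separated values. The two sample lines are invented for illustration, and `load_glove_text` is a hypothetical helper, not a function from the repository.

```python
import io

import numpy as np


def load_glove_text(handle):
    """Parse lines of the form 'word v1 v2 ... vn' into a dict of float32 arrays."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


# In-memory sample in the assumed format (made-up values, 3 dimensions).
sample = io.StringIO("the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\n")
vectors = load_glove_text(sample)
print(len(vectors), vectors["cat"].shape)
```

With a downloaded file, the same function could be given an open file handle in place of the `io.StringIO` sample. Large files may take a while to load and need a lot of memory, since every row is kept in a Python dict.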

Copy-paste prompts

Prompt 1

I want to use pre-trained GloVe 300-dimensional vectors in a Python NLP project. How do I load them into a NumPy array and find the 10 words most similar to a given word?

Prompt 2

Show me how to train custom GloVe word vectors on my own text corpus using the C training scripts, from tokenizing the input to running the training command.

Prompt 3

I have GloVe vectors and want to use them as an embedding layer in a PyTorch text classifier. How do I load the vectors and freeze them during training?

Verify against the repo before relying on details.