explaingit

spotify/annoy

14,239 · C++ · Audience: data · Complexity: 3/5 · Setup: easy

TLDR

Annoy is a Spotify-built library for fast approximate nearest neighbor search that finds similar items in a large dataset by comparing feature vectors, with indexes saved to disk and shared across many processes.

Mindmap

mindmap
  root((repo))
    What it does
      Nearest neighbor search
      Approximate matching
      Disk-based indexes
    Use cases
      Music recommendations
      Product suggestions
      Image matching
    Distance metrics
      Euclidean
      Cosine
      Dot product
    Languages
      Python
      C++
      Go
    Audience
      Data scientists
      ML engineers

Things people build with this

USE CASE 1

Build a music or product recommendation engine that finds similar items based on numerical feature vectors.

USE CASE 2

Pre-compute a similarity index and serve it to many worker processes simultaneously via memory-mapped disk files.

USE CASE 3

Speed up an image or text similarity search pipeline by trading a small accuracy loss for much faster queries.

USE CASE 4

Find approximate nearest neighbors in datasets with up to a few hundred dimensions using cosine or dot product distance.
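The use cases above all reduce to one operation: rank stored vectors by distance to a query vector. As a point of reference, the sketch below does this exactly, by brute force, in plain Python. It is not how Annoy works internally; it is the slow baseline that an approximate index is meant to replace. The item names, vectors, and helper functions are hypothetical and chosen only for illustration.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity: compares direction and ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(items, query, n, distance):
    """Exact search: score every item against the query, keep the n closest."""
    ranked = sorted(items, key=lambda item_id: distance(items[item_id], query))
    return ranked[:n]


# Hypothetical 2-dimensional feature vectors.
items = {
    "track_a": [1.0, 0.0],
    "track_b": [0.9, 0.1],
    "track_c": [0.0, 1.0],
    "track_d": [5.0, 0.25],  # same direction as the query, much larger magnitude
}
query = [1.0, 0.05]

by_euclidean = nearest(items, query, 2, euclidean)
by_cosine = nearest(items, query, 2, cosine_distance)
print(by_euclidean)
print(by_cosine)
```

The two metrics disagree here: `track_d` is far from the query in Euclidean terms but points in the same direction, so it ranks first under cosine distance. Brute force costs one distance computation per stored item per query, which is the cost an approximate index avoids.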

Tech stack

C++ · Python · Go · Lua

Getting it running

Difficulty: easy · Time to first run: 30 min

In plain English

Annoy is a library built at Spotify to answer one specific question quickly: given a large collection of items, each described as a list of numbers (a vector), which items are most similar to a particular item? This is called nearest neighbor search, and it shows up in music recommendations, product suggestions, image matching, and similar problems. Annoy takes an approximate approach: it trades a small amount of accuracy for a large gain in speed.

Building the index and searching it are separate steps. You add your items, build the index, and then ask it for the closest matches at query time. Once an index is built, you cannot add more items to it. This is a deliberate trade-off: keeping the index immutable is what makes it safe to memory-map.

What sets Annoy apart from similar tools is that it saves indexes to disk as static files and loads them back using memory mapping, so many processes can read the same index file at once without each holding its own copy in memory. This makes it practical to share a pre-built index across many workers, for example in a batch processing cluster.

Two parameters control the balance between accuracy and speed. The number of trees (n_trees) is set at build time: more trees give better results and a larger file. The number of nodes inspected (search_k) is set at query time: more nodes give better results but slower queries.

Python is the primary supported language for most users, and the library is installable with pip. The underlying code is written in C++ and can be used directly from C++. Bindings also exist for Go and Lua. Supported distance metrics are Euclidean, Manhattan, cosine (called "angular" in the API), Hamming, and dot product. The README notes that Annoy works best with fewer than about 100 dimensions, though it performs well up to around 1,000.

Copy-paste prompts

Prompt 1
I have 100,000 songs each represented as a 50-dimensional feature vector. Show me how to build an Annoy index in Python, save it to disk, and query it for the 10 most similar songs to a given track.
Prompt 2
I'm running a recommendation service with multiple worker processes. Explain how to use Annoy's memory-mapped index so all workers share the same file without duplicating RAM.
Prompt 3
What trade-off do the n_trees and search_k parameters in Annoy control? Show me how to tune them for higher accuracy versus faster queries.
Prompt 4
Walk me through using Annoy for cosine similarity search on text embeddings to find semantically similar documents.

Verify against the repo before relying on details.