Build a music or product recommendation engine that finds similar items based on numerical feature vectors.
Pre-compute a similarity index and serve it to many worker processes simultaneously via memory-mapped disk files.
Speed up an image or text similarity search pipeline by trading a small accuracy loss for much faster queries.
Find approximate nearest neighbors in datasets with up to a few hundred dimensions using cosine or dot product distance.
Annoy is a library built at Spotify to answer one specific question quickly: given a large collection of items described as lists of numbers, which items in that collection are most similar to a particular item? This kind of search is called nearest neighbor search, and it shows up in music recommendations, product suggestions, image matching, and many similar problems. The library takes an approximate approach, meaning it trades a small amount of accuracy for a large gain in speed. You build an index from your data, and then at query time you can ask for the closest matches. The key feature that sets Annoy apart from similar tools is that it saves indexes to disk as static files and loads them back using a technique called memory mapping, which lets many different processes read the same index file at once without duplicating it in memory. This makes it practical in environments where you want to share a pre-built index across many workers, such as a batch processing cluster. Building the index and searching it are treated as separate steps. Once an index is built and saved, you cannot add more items to it. This is a deliberate trade-off: keeping the index immutable makes it possible to memory-map it safely. You control accuracy versus speed with two parameters: the number of trees built at index time (more trees means better results and a larger file) and the number of nodes inspected at search time (more nodes means better results but slower queries). Python is the primary supported language for most users, with the library available via pip. The underlying code is written in C++ and can be used directly from C++ as well. Bindings also exist for Go and Lua. Supported distance metrics include Euclidean, Manhattan, cosine, Hamming, and dot product. The README notes it works best with fewer than a few hundred dimensions, though it can handle up to around a thousand.
← spotify on gitmyhub — every repo by this author, as a profile.
Verify against the repo before relying on details.