Compare the speed vs accuracy tradeoff of FAISS, Qdrant, and Annoy on your specific dataset before committing to a vector search library
Run reproducible performance benchmarks on your own machine using pre-built datasets for image similarity or text embedding retrieval
Add a new similarity search library to the benchmark suite by writing a Docker container config and harness adapter
Each algorithm runs in an isolated Docker container, so you need Docker installed and enough disk space for the pre-built datasets.
This repository is a benchmarking project that measures the performance of many different tools for doing fast similarity searches across large datasets. The core problem: given a large collection of items (such as images, songs, or text passages each encoded as a list of numbers), how quickly and accurately can you find the items most similar to a given query? This type of search is called nearest neighbor search, and the approximate version accepts a small accuracy tradeoff in exchange for much faster results.

The project covers more than 30 different search libraries. The list includes FAISS (built by Facebook Research), Annoy (built by Spotify), pgvector (a PostgreSQL extension), Qdrant, Milvus, Elasticsearch, RediSearch, and many others. Each library is run inside an isolated container so that results are fair and reproducible across different machines. Pre-built datasets are provided in a standard file format covering tasks like image similarity lookup and text embedding retrieval.

The benchmark measures two things simultaneously: how fast the search runs (queries per second) and how accurate the results are (recall, meaning what fraction of the true nearest neighbors the algorithm actually found). These two factors trade off against each other. A library might return results much faster but miss some of the correct matches. The project plots both metrics together on charts for each library, so users can compare the full performance curve rather than a single headline number.

Running a benchmark requires installing the project, selecting a dataset and algorithm, and executing the provided Python scripts. Docker handles all per-algorithm environment setup automatically. Results are saved and can then be visualized as comparison charts.

This is a purely research and evaluation tool, not a production search library itself. It exists to help developers choose the right similarity search tool for their specific situation, whether that means optimizing for speed, accuracy, memory usage, or the scale of the dataset.
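To make the two metrics concrete, here is a minimal sketch of how recall and queries per second can be computed for a toy approximate search. It is not the project's harness: every function name below is hypothetical, the "approximate" search simply scans a random subset of the data, and only the Python standard library is used.

```python
import random
import time


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_neighbors(data, query, k):
    """Indices of the k true nearest neighbors, found by brute force."""
    order = sorted(range(len(data)), key=lambda i: squared_distance(data[i], query))
    return order[:k]


def approximate_neighbors(data, query, k, candidates):
    """Toy approximate search (an assumption for illustration):
    scan only the given candidate indices instead of the whole dataset."""
    order = sorted(candidates, key=lambda i: squared_distance(data[i], query))
    return order[:k]


def recall(true_ids, found_ids):
    """Fraction of the true nearest neighbors that the search returned."""
    return len(set(true_ids) & set(found_ids)) / len(true_ids)


def benchmark(data, queries, k, candidates):
    """Return (mean recall, queries per second) for the toy search."""
    # Ground truth is computed outside the timed region.
    truth = [exact_neighbors(data, q, k) for q in queries]
    start = time.perf_counter()
    found = [approximate_neighbors(data, q, k, candidates) for q in queries]
    elapsed = time.perf_counter() - start
    mean_recall = sum(recall(t, f) for t, f in zip(truth, found)) / len(queries)
    qps = len(queries) / elapsed if elapsed > 0 else float("inf")
    return mean_recall, qps


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[rng.random() for _ in range(8)] for _ in range(2000)]
    queries = [[rng.random() for _ in range(8)] for _ in range(20)]
    # Scanning fewer candidates is faster but misses more true neighbors.
    for fraction in (0.1, 0.5, 1.0):
        candidates = rng.sample(range(len(data)), int(fraction * len(data)))
        mean_recall, qps = benchmark(data, queries, 10, candidates)
        print(f"scan {fraction:.0%} of data: recall={mean_recall:.2f}, qps={qps:.0f}")
```

Each setting of `fraction` yields one (recall, queries per second) point. Sweeping a library's tuning parameters in the same way produces the kind of curve the charts compare.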
Verify against the repo before relying on details.