erikkaum/maxsim

Analysis updated 2026-06-24

★ 17
Metal
Audience · researcher
Complexity · 4/5
License · Apache 2.0
Setup · moderate

Mindmap

mindmap
  root((maxsim))
    Inputs
      Query token tensors
      Document token tensors
      Offset arrays
    Outputs
      Per pair scores
      Faster reranking
    Use Cases
      ColBERT scoring
      PyLate reranking
      Multi vector search
    Tech Stack
      Metal
      CUDA
      PyTorch
      Hugging Face kernels

What do people build with it?

USE CASE 1

Speed up ColBERT or PyLate reranking on Apple Silicon or Nvidia GPUs

USE CASE 2

Replace a naive PyTorch MaxSim loop in an existing search pipeline

USE CASE 3

Benchmark a multi-vector retriever across M3 Pro, A10G, L4, and A100

What is it built with?

Metal · CUDA · PyTorch · Python

How do you get it running?

Difficulty · moderate
Time to first run · 30 min

Needs an Apple Silicon Mac (Metal) or a recent Nvidia GPU (CUDA). There is no backward pass yet, and the fast CUDA path requires dim and Lq to be multiples of 16.
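If a model's shapes do not meet the multiple-of-16 condition, zero-padding is one possible workaround. The sketch below is plain Python and does not call the kernel. The helper names (`round_up`, `pad_dim`, `pad_query`) are hypothetical, and whether padded inputs actually take the fast CUDA path is an assumption to check against the repo. What the sketch does show is that the padding leaves the MaxSim score unchanged: extra zero dimensions add nothing to any dot product, and an all-zero query token has similarity 0 with every document token, so it adds 0 to the sum.

```python
from typing import List

Matrix = List[List[float]]  # one row per token, one column per embedding dim


def round_up(n: int, multiple: int = 16) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


def pad_dim(tokens: Matrix, multiple: int = 16) -> Matrix:
    """Append zero columns so the embedding dim is a multiple of `multiple`."""
    new_dim = round_up(len(tokens[0]), multiple)
    return [list(t) + [0.0] * (new_dim - len(t)) for t in tokens]


def pad_query(query: Matrix, multiple: int = 16) -> Matrix:
    """Pad the embedding dim, then append all-zero query tokens so Lq fits too."""
    padded = pad_dim(query, multiple)
    extra = round_up(len(padded), multiple) - len(padded)
    return padded + [[0.0] * len(padded[0]) for _ in range(extra)]


def maxsim(query: Matrix, doc: Matrix) -> float:
    """Reference score: sum over query tokens of the best document-token dot product."""
    return sum(
        max(sum(x * y for x, y in zip(q, d)) for d in doc)
        for q in query
    )


# Toy inputs (made up for illustration): Lq = 2, Ld = 3, dim = 3.
query = [[1.0, -2.0, 0.5], [0.0, 1.0, 1.0]]
doc = [[2.0, 1.0, 0.0], [-1.0, 0.0, 4.0], [1.0, 1.0, 1.0]]

padded_query = pad_query(query)  # 16 tokens x 16 dims
padded_doc = pad_dim(doc)        # 3 tokens x 16 dims (document length untouched)

original_score = maxsim(query, doc)
padded_score = maxsim(padded_query, padded_doc)
```

Only the query length and the embedding dim are padded here. Adding zero tokens to a document is not score-preserving in general, because a zero document token can become the maximum when every real similarity is negative.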

Apache 2.0, free to use commercially with attribution and a patent grant.

In plain English

MaxSim is a small, low-level compute package built for one specific job inside modern search systems. With a late-interaction model such as ColBERT, or a library such as PyLate, each query and each document is represented not by one vector but by many small vectors, one per token. To score a query against a document, you take each query token, find the most similar document token, and add up those best matches. That operation is called MaxSim, and this repository provides a fast, memory-efficient version of it written as a hardware-specific kernel.

The point of a custom kernel is to avoid the obvious but expensive approach of materializing the full similarity matrix between every query token and every document token. The kernel instead walks through the document tokens in tiles, keeps a running best score per query token in fast on-chip memory, and writes out only the final per-pair score. The result is the same number a textbook implementation would give, with less memory traffic.

The package is distributed through Hugging Face's kernels system. You install kernels with pip or uv, then call get_kernel with the name erikkaum/maxsim. There are two entry points. The packed form takes flat tensors of query tokens and document tokens, offset arrays marking where each query and document starts, and lists saying which query is scored against which document. The padded form takes a more familiar shape, B batches of Lq query tokens against C candidates of Ld document tokens, and is meant for the common reranking case. A pure-PyTorch reference implementation ships alongside the kernels for tests and benchmarks.

The kernel supports two backends: Metal on Apple Silicon, and CUDA on Nvidia GPUs of the Ampere or Ada Lovelace generation. Inputs can be fp32, fp16, or bf16, and accumulation is in fp32. Benchmarks in the README show speedups of two to five times over a naive PyTorch baseline on an M3 Pro, an A10G, an L4, and an A100, with the largest gains on heavy reranking workloads and on the more memory-bound GPUs.

The author is upfront about the limits: there is no backward pass yet, there is no argmax position output, the fastest CUDA path needs dim and Lq to be multiples of 16, and Hopper GPUs are supported but not yet specially tuned. The license is Apache 2.0.
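The scoring idea described above can be sketched in plain Python. This is an illustration of the arithmetic only, not the repository's kernel or its API. The function names are hypothetical, and the packed layout assumes CSR-style offset arrays with one more entry than there are sequences, which may differ from what the real entry point expects. `maxsim_naive` builds the whole Lq by Ld similarity matrix, `maxsim_tiled` streams document tokens in tiles while keeping one running maximum per query token, and `maxsim_packed` scores (query, document) pairs taken from flat token lists.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_naive(query: Sequence[Vector], doc: Sequence[Vector]) -> float:
    """Textbook MaxSim: materialize the full Lq x Ld similarity matrix."""
    sim = [[dot(q, d) for d in doc] for q in query]
    return sum(max(row) for row in sim)


def maxsim_tiled(query: Sequence[Vector], doc: Sequence[Vector], tile: int = 2) -> float:
    """Stream document tokens tile by tile; store only one running max per query token."""
    best = [float("-inf")] * len(query)
    for start in range(0, len(doc), tile):
        block = doc[start:start + tile]  # the only document tokens "in flight"
        for i, q in enumerate(query):
            for d in block:
                s = dot(q, d)
                if s > best[i]:
                    best[i] = s
    return sum(best)


def maxsim_packed(q_flat, q_offsets, d_flat, d_offsets, pair_q, pair_d) -> List[float]:
    """Score pairs from flat token lists (assumed layout: offsets[k]..offsets[k+1])."""
    scores = []
    for qi, di in zip(pair_q, pair_d):
        query = q_flat[q_offsets[qi]:q_offsets[qi + 1]]
        doc = d_flat[d_offsets[di]:d_offsets[di + 1]]
        scores.append(maxsim_tiled(query, doc))
    return scores


# Toy data (made up for illustration): two queries and two documents, dim = 2.
q_flat = [[1.0, 0.0], [0.0, 1.0],   # query 0: two tokens
          [1.0, 1.0]]               # query 1: one token
q_offsets = [0, 2, 3]
d_flat = [[2.0, 1.0], [0.0, 3.0], [1.0, 1.0],   # document 0: three tokens
          [0.0, 1.0], [4.0, 0.0]]               # document 1: two tokens
d_offsets = [0, 3, 5]

naive_score = maxsim_naive(q_flat[0:2], d_flat[0:3])
tiled_score = maxsim_tiled(q_flat[0:2], d_flat[0:3])
pair_scores = maxsim_packed(q_flat, q_offsets, d_flat, d_offsets,
                            pair_q=[0, 1, 1], pair_d=[0, 0, 1])
```

The tiled version never holds more than `tile` similarities per query token at once, which is the property a GPU kernel can exploit by keeping the running maxima in on-chip memory. Both functions return the same score for the same inputs.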

Copy-paste prompts

Prompt 1

Show me how to load maxsim via Hugging Face kernels and call the padded entry point for a B by Lq by Ld batch

Prompt 2

Help me swap the naive PyTorch MaxSim in my ColBERT reranker for the maxsim kernel

Prompt 3

Walk me through the tiled document token loop and why it avoids the full similarity matrix

Prompt 4

Explain what limits the CUDA path needing dim and Lq as multiples of 16 imposes on my model

Frequently asked questions

What is maxsim?

Hardware kernel for the MaxSim scoring step used by ColBERT-style multi-vector retrievers, with Metal and CUDA backends.

What language is maxsim written in?

Mainly Metal. The stack also includes CUDA, PyTorch, and Python.

What license does maxsim use?

Apache 2.0, free to use commercially with attribution and a patent grant.

How hard is maxsim to set up?

Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.

Who is maxsim for?

Mainly researchers.

Verify against the repo before relying on details.