explaingit

erikkaum/maxsim

17 | Metal | Audience · researcher | Complexity · 4/5 | Active | License | Setup · moderate

TLDR

GPU kernel for the MaxSim scoring step used by ColBERT-style multi-vector retrievers, with Metal and CUDA backends.

Mindmap

mindmap
  root((maxsim))
    Inputs
      Query token tensors
      Document token tensors
      Offset arrays
    Outputs
      Per pair scores
      Faster reranking
    Use Cases
      ColBERT scoring
      PyLate reranking
      Multi vector search
    Tech Stack
      Metal
      CUDA
      PyTorch
      Hugging Face kernels

Things people build with this

USE CASE 1

Speed up ColBERT or PyLate reranking on Apple Silicon or Nvidia GPUs

USE CASE 2

Replace a naive PyTorch MaxSim loop in an existing search pipeline

USE CASE 3

Benchmark a multi-vector retriever across M3 Pro, A10G, L4, and A100

Tech stack

Metal · CUDA · PyTorch · Python

Getting it running

Difficulty · moderate | Time to first run · 30 min

Needs an Apple Silicon machine (Metal) or a recent Nvidia GPU. There is no backward pass yet, and the fast CUDA path requires dim and Lq to be multiples of 16.

Apache 2.0, free to use commercially with attribution and a patent grant.

In plain English

MaxSim is a small, low-level computation package built for one specific job inside modern search systems. When you use a ColBERT-style model, for example through PyLate, each query and each document is represented not by one vector but by many small vectors, one per token. To score a query against a document, you take each query token, find the document token most similar to it, and add up those best matches. That operation is called MaxSim, and this repository provides a fast, memory-efficient version of it written as a hardware-specific kernel.

The point of writing a custom kernel is to avoid the obvious but expensive approach of building the full similarity matrix between every query token and every document token. The kernel instead walks through the document tokens in tiles, keeps a running best score per query token in fast on-chip memory, and writes out only the final per-pair score. The result is the same number a textbook implementation would give, but with less memory traffic.

The package is distributed through Hugging Face's kernels system. You install kernels with pip or uv, then call get_kernel with the name erikkaum/maxsim. There are two entry points. The packed form takes flat tensors of query tokens and document tokens, offset arrays marking where each query and document begins, and lists of which query goes with which document. The padded form takes a more familiar shape, B batches of Lq query tokens against C candidates of Ld document tokens, and is meant for the common reranking case. A pure-PyTorch reference implementation ships alongside the kernels for tests and benchmarks.

The kernel supports two backends: Metal on Apple Silicon and CUDA on Nvidia GPUs of the Ampere or Lovelace generation, with fp32, fp16, and bf16 inputs and fp32 accumulation. Benchmarks in the README show speedups of two to five times over a naive PyTorch baseline on an M3 Pro, an A10G, an L4, and an A100, with the largest gains on heavy reranking workloads and on the more memory-bound GPUs.
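
To make the tiling idea described above concrete, here is a minimal sketch in plain Python. It is not the repository's kernel or its PyTorch reference: the function names, the use of a dot product as the similarity, and the tile size are all assumptions chosen for illustration, and plain lists stand in for tensors so the example runs without a GPU.

```python
def dot(u, v):
    """Dot product of two equal-length vectors (assumed similarity measure)."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_reference(query, doc):
    """Textbook MaxSim: materialize the full Lq x Ld similarity matrix,
    take the row-wise max, then sum over query tokens."""
    sim = [[dot(q, d) for d in doc] for q in query]
    return sum(max(row) for row in sim)


def maxsim_tiled(query, doc, tile=2):
    """Same score, but only one running max per query token is stored.
    Document tokens are visited `tile` at a time (hypothetical tile size)."""
    best = [float("-inf")] * len(query)
    for start in range(0, len(doc), tile):
        block = doc[start:start + tile]  # one tile of document tokens
        for i, q in enumerate(query):
            for d in block:
                s = dot(q, d)
                if s > best[i]:
                    best[i] = s  # update the running best for query token i
    return sum(best)


# Two query tokens and three document tokens, each of dimension 2.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

score_ref = maxsim_reference(query, doc)
score_tiled = maxsim_tiled(query, doc)
print(score_ref, score_tiled)  # both 3.0: best matches are 1.0 and 2.0
```

The reference version holds Lq x Ld similarities at once, while the tiled version holds only Lq running maxima regardless of document length, which is the memory saving the text describes. The tile size changes how the work is chunked, not the result.
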
The author is upfront about the limits: there is no backward pass yet, no output of argmax positions, the fastest CUDA path needs dim and Lq to be multiples of 16, and Hopper GPUs are supported but not yet specifically tuned. The license is Apache 2.0.
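
If a model's dim or Lq is not a multiple of 16, one possible workaround is to zero-pad the inputs before scoring. The sketch below is an assumption, not something the repository documents: the helper names are hypothetical, the similarity is assumed to be a dot product, and the score is assumed to be a plain sum over query tokens. Under those assumptions, extra zero dimensions add nothing to any dot product, and extra all-zero query tokens each contribute a best match of 0, so the score is unchanged. Check the repository for how the kernel actually handles other sizes.

```python
def round_up(n, multiple=16):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def pad_query(query, multiple=16):
    """Zero-pad both the embedding dim and the number of query tokens (Lq)."""
    dim = len(query[0])
    padded_dim = round_up(dim, multiple)
    padded_lq = round_up(len(query), multiple)
    rows = [list(q) + [0.0] * (padded_dim - dim) for q in query]
    rows += [[0.0] * padded_dim for _ in range(padded_lq - len(query))]
    return rows


def pad_doc(doc, multiple=16):
    """Zero-pad only the embedding dim; the document token count is untouched."""
    dim = len(doc[0])
    padded_dim = round_up(dim, multiple)
    return [list(d) + [0.0] * (padded_dim - dim) for d in doc]


def maxsim(query, doc):
    """Plain MaxSim with a dot-product similarity, used here only as a check."""
    return sum(
        max(sum(a * b for a, b in zip(q, d)) for d in doc) for q in query
    )


query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

padded_query = pad_query(query)
padded_doc = pad_doc(doc)
print(len(padded_query), len(padded_query[0]))  # 16 16
print(maxsim(query, doc) == maxsim(padded_query, padded_doc))  # True
```

Padding the document token count is deliberately left out: an all-zero document token would score 0 against every query token and could replace a negative true maximum, so it would need masking rather than plain zeros.
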

Copy-paste prompts

Prompt 1
Show me how to load maxsim via Hugging Face kernels and call the padded entry point for a B by Lq by Ld batch
Prompt 2
Help me swap the naive PyTorch MaxSim in my ColBERT reranker for the maxsim kernel
Prompt 3
Walk me through the tiled document token loop and why it avoids the full similarity matrix
Prompt 4
Explain what constraints the fast CUDA path's requirement that dim and Lq be multiples of 16 imposes on my model

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.