Speed up ColBERT or PyLate reranking on Apple Silicon or Nvidia GPUs
Replace a naive PyTorch MaxSim loop in an existing search pipeline
Benchmark a multi-vector retriever across M3 Pro, A10G, L4, and A100
Needs Metal or a recent Nvidia GPU; there is no backward pass yet, and the fast CUDA path requires dim and Lq to be multiples of 16.
MaxSim is a small, low-level compute package aimed at one specific job inside modern search systems. With a late-interaction model such as ColBERT, or one used through PyLate, each query and each document is represented not by one vector but by many small vectors, one per token. To score a query against a document, you find the most similar document token for each query token and sum those best-match similarities. That operation is called MaxSim, and this repository provides a fast, memory-efficient version of it written as a hardware-specific kernel.
The point of writing a custom kernel is to avoid the obvious but expensive approach of materializing the full similarity matrix between every query token and every document token. Instead, the kernel walks through document tokens in tiles, keeps a running best score per query token in fast on-chip memory, and writes out only the final per-pair score. The result is the same number a textbook implementation would give, but with less memory traffic.
The package is distributed through Hugging Face's kernels system. You install kernels with pip or uv, then call get_kernel with the name erikkaum/maxsim. There are two entry points. The packed form takes flat tensors of query tokens and document tokens together with offset arrays, plus lists of which query goes with which document. The padded form takes a more familiar shape, a batch of B queries of Lq tokens each scored against C candidates of Ld document tokens, and is meant for the common reranking case. A pure-PyTorch reference implementation ships alongside the kernels for tests and benchmarks.
The kernel supports two backends: Metal on Apple Silicon and CUDA on Nvidia GPUs of the Ampere or Ada Lovelace generation, with fp32, fp16, and bf16 inputs and fp32 accumulation. Benchmarks in the README show speedups of two to five times over a naive PyTorch baseline on an M3 Pro, an A10G, an L4, and an A100, with the largest gains on heavy reranking workloads and on the more memory-bound GPUs.
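The scoring rule and the tiled evaluation described above can be sketched in plain Python. This is an illustration only, not the repository's kernel or its PyTorch reference: the function names maxsim_reference and maxsim_tiled, the tile size, and the choice of dot-product similarity are assumptions made for the example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity between two token vectors (assumed metric)."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_reference(query: List[Vector], doc: List[Vector]) -> float:
    """Textbook MaxSim: build the full Lq x Ld similarity matrix, then reduce."""
    sim = [[dot(q, d) for d in doc] for q in query]
    return sum(max(row) for row in sim)


def maxsim_tiled(query: List[Vector], doc: List[Vector], tile: int = 2) -> float:
    """Same score, but document tokens are visited tile by tile and only a
    running maximum per query token is kept; no full matrix is stored."""
    best = [float("-inf")] * len(query)
    for start in range(0, len(doc), tile):
        doc_tile = doc[start:start + tile]
        for i, q in enumerate(query):
            for d in doc_tile:
                s = dot(q, d)
                if s > best[i]:
                    best[i] = s
    return sum(best)


# Two query tokens, three document tokens, dimension 2 (toy data).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

score_ref = maxsim_reference(query, doc)
score_tiled = maxsim_tiled(query, doc, tile=2)
print(score_ref, score_tiled)  # both 3.0: best matches are 1.0 and 2.0
```

The tiled version needs only one running value per query token regardless of document length, which is the property a GPU kernel can exploit by keeping those values in on-chip memory.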
The author lists honest limits: there is no backward pass yet, no argmax position output, the fastest CUDA path needs dim and Lq sizes that are multiples of 16, and Hopper GPUs are supported but not yet specially tuned. License is Apache 2.0.
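If inputs do not meet the multiple-of-16 requirement, one possible workaround is zero-padding, which leaves the MaxSim score unchanged under dot-product similarity: extra zero dimensions add nothing to any dot product, and an all-zero query token has a best match of exactly 0. The plain-Python sketch below checks this on toy data. The source does not say whether the kernel expects or performs such padding, and the helper names round_up, pad_dim, pad_query, and maxsim are hypothetical.

```python
from typing import List


def round_up(n: int, multiple: int = 16) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def pad_dim(tokens: List[List[float]], multiple: int = 16) -> List[List[float]]:
    """Append zero components so the vector dimension is a multiple of `multiple`."""
    if not tokens:
        return []
    target = round_up(len(tokens[0]), multiple)
    return [list(t) + [0.0] * (target - len(t)) for t in tokens]


def pad_query(query: List[List[float]], multiple: int = 16) -> List[List[float]]:
    """Append all-zero query tokens until Lq is a multiple of `multiple`."""
    dim = len(query[0])
    target = round_up(len(query), multiple)
    extra = [[0.0] * dim for _ in range(target - len(query))]
    return [list(q) for q in query] + extra


def maxsim(query: List[List[float]], doc: List[List[float]]) -> float:
    """Sum over query tokens of the best dot product against any document token."""
    return sum(
        max(sum(x * y for x, y in zip(q, d)) for d in doc)
        for q in query
    )


# Toy data: Lq = 2, Ld = 3, dim = 3, none of which is a multiple of 16.
query = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
doc = [[0.5, 0.5, 0.5], [1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]

padded_query = pad_query(pad_dim(query))  # 16 tokens of dimension 16
padded_doc = pad_dim(doc)                 # 3 tokens of dimension 16

original = maxsim(query, doc)
padded = maxsim(padded_query, padded_doc)
print(original, padded)  # both 5.0
```

Padding trades extra compute and memory for eligibility for the fast path, so whether it pays off would need to be measured on the target hardware.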
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.