explaingit

isalia20/metalblas

17 | Python | Audience · researcher | Complexity · 3/5 | License · MIT | Setup · moderate

TLDR

A Python library that replaces PyTorch's matrix multiplication on Apple Silicon Macs with custom Metal GPU code. It runs 2-3x faster than PyTorch for 32-bit floats and matches or beats it for the 16-bit formats used in large language models.

Mindmap

mindmap
  root((metalblas))
    What it does
      Faster matrix multiply
      Apple Silicon only
      Custom Metal GPU code
    Performance
      2-3x speedup float32
      Matches PyTorch bfloat16
      Auto-tunes per shape
    Tech stack
      Python
      PyTorch
      Metal Shading Language
      Apple MPS
    Usage
      Import matmul
      Drop-in replacement
      Works on tensors

Things people build with this

USE CASE 1

Speed up matrix multiplication in a PyTorch model on a Mac with an M-series chip by swapping one import.

USE CASE 2

Run large language model inference faster on Apple Silicon by replacing the default matrix multiply with metalBLAS.

USE CASE 3

Benchmark Apple GPU matrix multiplication performance for float32 versus bfloat16 across different matrix shapes.
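
The import swap and the benchmark in the use cases above can be sketched together. This is a minimal illustration, not the library's own benchmark: the import path `metalblas.matmul` is an assumption based on the repository name, and `gflops`, `time_call`, and `compare_on_mps` are hypothetical helper names. The PyTorch calls used (`torch.backends.mps.is_available`, `torch.mps.synchronize`) are standard.

```python
import time


def gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of an (m x k) @ (k x n) product, which costs 2*m*n*k flops."""
    return 2.0 * m * n * k / seconds / 1e9


def time_call(fn, *args, warmup: int = 3, iters: int = 10, sync=None) -> float:
    """Return the mean seconds per call of fn(*args).

    sync is an optional zero-argument callable that blocks until queued GPU
    work has finished; without it, asynchronous backends are under-timed.
    """
    for _ in range(warmup):
        fn(*args)
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / iters


def compare_on_mps(m: int = 2048, n: int = 2048, k: int = 2048):
    """Compare torch.matmul with metalBLAS on one float32 shape.

    Returns a dict of GFLOP/s per implementation, or None when PyTorch,
    metalBLAS, or the MPS device is unavailable.
    """
    try:
        import torch
        from metalblas import matmul as metal_matmul  # assumed import path
    except ImportError:
        return None
    if not torch.backends.mps.is_available():
        return None
    a = torch.randn(m, k, device="mps", dtype=torch.float32)
    b = torch.randn(k, n, device="mps", dtype=torch.float32)
    results = {}
    for name, fn in (("torch.matmul", torch.matmul),
                     ("metalblas.matmul", metal_matmul)):
        seconds = time_call(fn, a, b, sync=torch.mps.synchronize)
        results[name] = gflops(m, n, k, seconds)
    return results
```

Repeating `compare_on_mps` with `dtype=torch.bfloat16` and a range of shapes would cover the float32 versus bfloat16 comparison. The explicit synchronize matters because MPS kernels are launched asynchronously.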

Tech stack

Python, PyTorch, Metal, C++

Getting it running

Difficulty · moderate | Time to first run · 30 min

Requires PyTorch 2.12+ and works only on Apple Silicon Macs using the MPS device. Intel Macs and CUDA are not supported.
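
Before installing, the requirements above can be checked with a short script. This is a sketch only: `parse_major_minor` and `check_environment` are hypothetical helper names, and the 2.12 threshold is simply the figure quoted on this page.

```python
import platform
import re


def parse_major_minor(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '2.12.0+cpu'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_environment(min_torch=(2, 12)) -> dict:
    """Report whether this machine meets the stated requirements."""
    report = {
        # A Python running under Rosetta reports x86_64 and fails this check.
        "apple_silicon": platform.system() == "Darwin"
        and platform.machine() == "arm64",
        "torch_version_ok": False,
        "mps_available": False,
    }
    try:
        import torch
    except ImportError:
        return report
    # Compare as integer tuples: as strings, "2.9" would sort after "2.12".
    report["torch_version_ok"] = parse_major_minor(torch.__version__) >= min_torch
    report["mps_available"] = torch.backends.mps.is_available()
    return report


if __name__ == "__main__":
    for check, ok in check_environment().items():
        print(f"{check}: {'ok' if ok else 'missing'}")
```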

License: use freely for any purpose, including commercial use, as long as you keep the copyright notice.

In plain English

metalBLAS is a Python library that provides faster matrix multiplication for Apple Silicon computers, meaning the M-series chips found in modern Macs. Matrix multiplication is a core mathematical operation that machine learning models rely on constantly, and how fast you can do it directly affects training and inference speed. This library was written specifically to squeeze better performance out of Apple's GPU on that one operation.

When you use PyTorch on a Mac with Apple Silicon, PyTorch routes GPU work through Apple's own graphics system called Metal. metalBLAS plugs into that system by writing custom low-level GPU code in Apple's Metal Shading Language, then compiling it on the fly when your program runs. This lets it take shortcuts that a general-purpose library cannot: it knows exactly what Apple's tensor processing unit can do and writes code that targets those capabilities directly.

The performance gains are most visible with 32-bit floating-point numbers, where metalBLAS runs 2 to 3 times faster than PyTorch's built-in matrix multiply on the same hardware. For the 16-bit formats common in large language model work (bfloat16 and float16), it matches or beats the standard path depending on the shape of the matrices involved.

Using it takes just a few lines of code. You import matmul from the package and call it on your existing PyTorch tensors. The library automatically picks the right internal approach for your data type and matrix shape. It also includes an autotuner that silently runs a short test the first time it sees a particular shape and caches the fastest setting for future calls.

The library requires PyTorch 2.12 or newer and only works on Macs with Apple Silicon using the MPS (Metal Performance Shaders) device. It is released under the MIT license.
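
The autotuning behaviour described above (time the options the first time a shape appears, then reuse the winner) follows a common pattern that can be sketched in plain Python. This is not metalBLAS's implementation: `ShapeAutotuner` is a hypothetical class, the cache here is an in-memory dict (the page does not say where the library stores its cache), and the two candidate kernels are pure-Python stand-ins for GPU kernel variants.

```python
import time


class ShapeAutotuner:
    """Time every candidate once per key and cache the name of the fastest."""

    def __init__(self, candidates: dict, trials: int = 3):
        self.candidates = candidates  # name -> callable(a, b)
        self.trials = trials
        self.cache = {}  # key, e.g. (m, n, k, dtype) -> fastest candidate name

    def _time(self, fn, a, b) -> float:
        fn(a, b)  # warm-up call, excluded from the measurement
        start = time.perf_counter()
        for _ in range(self.trials):
            fn(a, b)
        return time.perf_counter() - start

    def __call__(self, key, a, b):
        if key not in self.cache:  # first time this shape is seen
            timings = {name: self._time(fn, a, b)
                       for name, fn in self.candidates.items()}
            self.cache[key] = min(timings, key=timings.get)
        return self.candidates[self.cache[key]](a, b)


def matmul_indexed(a, b):
    """Triple-loop product of two lists of rows."""
    n, k = len(b[0]), len(b)
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def matmul_transposed(a, b):
    """Same product, iterating over the columns of b via zip."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]


tuner = ShapeAutotuner({"indexed": matmul_indexed,
                        "transposed": matmul_transposed})
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
product = tuner((2, 2, 2, "float"), a, b)  # tunes, then computes
```

Keying the cache on shape and data type means the tuning cost is paid once per distinct problem. A persistent cache would also need a hardware identifier in the key, since the fastest setting on one chip may not be the fastest on another.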

Copy-paste prompts

Prompt 1
I'm running a PyTorch model on my M3 MacBook. Show me how to replace torch.matmul with metalBLAS and benchmark the speedup on my model's matrix shapes.
Prompt 2
I want to use metalBLAS for bfloat16 matrix multiplication in a language model on Apple MPS. Walk me through the correct import and any shape restrictions.
Prompt 3
The metalBLAS autotuner ran on my matrix shapes. Show me where the cached settings are stored and how to reset them if I move to a different Mac.
Prompt 4
I installed metalBLAS but it is not faster than PyTorch on my matrices. Help me diagnose whether my matrix shapes or data type are in the supported fast path.

Verify against the repo before relying on details.