Speed up matrix multiplication in a PyTorch model on a Mac with an M-series chip by swapping one import.
Run large language model inference faster on Apple Silicon by replacing the default matrix multiply with metalBLAS.
Benchmark Apple GPU matrix multiplication performance for float32 versus bfloat16 across different matrix shapes.
Requires PyTorch 2.12+ and only works on Macs with Apple Silicon using the MPS device. Intel Macs and CUDA are not supported.
metalBLAS is a Python library that provides faster matrix multiplication for Apple Silicon computers, meaning the M-series chips found in modern Macs. Matrix multiplication is a core mathematical operation that machine learning models rely on constantly, and how fast you can do it directly affects training and inference speed. This library was written specifically to squeeze better performance out of Apple's GPU on that one operation.

When you use PyTorch on a Mac with Apple Silicon, PyTorch routes GPU work through Apple's own graphics system called Metal. metalBLAS plugs into that system by writing custom low-level GPU code in Apple's Metal Shading Language, then compiling it on the fly when your program runs. This lets it take shortcuts that a general-purpose library cannot: it knows exactly what Apple's tensor processing unit can do and writes code that targets those capabilities directly.

The performance gains are most visible with 32-bit floating-point numbers, where metalBLAS runs 2 to 3 times faster than PyTorch's built-in matrix multiply on the same hardware. For the 16-bit formats common in large language model work (bfloat16 and float16), it matches or beats the standard path depending on the shape of the matrices involved.

Using it takes just a few lines of code. You import matmul from the package and call it on your existing PyTorch tensors. The library automatically picks the right internal approach for your data type and matrix shape. It also includes an autotuner that silently runs a short test the first time it sees a particular shape and caches the fastest setting for future calls.

The library requires PyTorch 2.12 or newer and only works on Macs with Apple Silicon using the MPS (Metal Performance Shaders) device. It is released under the MIT license.
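The autotuning idea described above (time a few candidates the first time a shape appears, then reuse the winner) can be illustrated with a minimal pure-Python sketch. This is not metalBLAS's implementation: the class name `ShapeAutotuner`, the two toy kernels, and the cache layout are all hypothetical, and plain nested lists stand in for GPU tensors so the example runs anywhere.

```python
import time
from typing import Callable, Dict, List, Tuple

Matrix = List[List[float]]
Kernel = Callable[[Matrix, Matrix], Matrix]


def matmul_ijk(a: Matrix, b: Matrix) -> Matrix:
    """Textbook triple loop: one dot product per output element."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def matmul_ikj(a: Matrix, b: Matrix) -> Matrix:
    """Loop-reordered variant: walks rows of b while accumulating."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = a[i][p]
            for j in range(n):
                out[i][j] += a_ip * b[p][j]
    return out


class ShapeAutotuner:
    """Hypothetical autotuner: picks the fastest kernel per (M, K, N) shape."""

    def __init__(self, candidates: Dict[str, Kernel], trials: int = 3) -> None:
        self.candidates = candidates
        self.trials = trials
        self.cache: Dict[Tuple[int, int, int], str] = {}

    def _best_time(self, kernel: Kernel, a: Matrix, b: Matrix) -> float:
        # Keep the minimum over a few runs to reduce timing noise.
        best = float("inf")
        for _ in range(self.trials):
            start = time.perf_counter()
            kernel(a, b)
            best = min(best, time.perf_counter() - start)
        return best

    def matmul(self, a: Matrix, b: Matrix) -> Matrix:
        shape = (len(a), len(b), len(b[0]))  # (M, K, N)
        if shape not in self.cache:
            # First time this shape is seen: time every candidate once.
            timings = {name: self._best_time(kernel, a, b)
                       for name, kernel in self.candidates.items()}
            self.cache[shape] = min(timings, key=timings.get)
        # Later calls with the same shape skip straight to the cached winner.
        return self.candidates[self.cache[shape]](a, b)


tuner = ShapeAutotuner({"ijk": matmul_ijk, "ikj": matmul_ikj})
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
result = tuner.matmul(a, b)        # tunes for shape (2, 2, 2), then computes
result_again = tuner.matmul(a, b)  # reuses the cached choice
```

A GPU version of this idea would presumably also key the cache on data type, and its timing loop would need to wait for queued GPU work to finish (for example with `torch.mps.synchronize()`) before reading the clock. How metalBLAS handles either point should be checked against the repo.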
← isalia20 on gitmyhub — every repo by this author, as a profile.
Verify against the repo before relying on details.