Analysis updated 2026-07-03
Work through hands-on C++ labs to identify and fix cache misses and branch mispredictions in real code.
Practice CPU auto-vectorization and SIMD compiler intrinsics by optimizing benchmark problems.
Submit lab solutions to an automated CI benchmarking system to verify whether your changes actually made things faster.
Study memory-bound bottlenecks like false sharing, cache-friendly data layouts, and software prefetching.
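As a rough illustration of what a "cache-friendly" access pattern means, here is a minimal sketch that is not taken from the course labs. The `Matrix` struct and both function names are hypothetical. It sums the same row-major buffer in two loop orders. The row-first order walks memory contiguously, while the column-first order jumps `cols` elements on every step, which tends to cause more cache misses on large matrices.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix: element (r, c) lives at data[r * cols + c].
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<double> data;
};

// Column-first traversal: each inner-loop step strides by `cols` elements,
// so consecutive accesses usually land on different cache lines.
double sum_column_first(const Matrix& m) {
    double total = 0.0;
    for (std::size_t c = 0; c < m.cols; ++c) {
        for (std::size_t r = 0; r < m.rows; ++r) {
            total += m.data[r * m.cols + c];
        }
    }
    return total;
}

// Row-first traversal: the inner loop reads adjacent elements, so each
// cache line that is fetched gets fully used before moving on.
double sum_row_first(const Matrix& m) {
    double total = 0.0;
    for (std::size_t r = 0; r < m.rows; ++r) {
        for (std::size_t c = 0; c < m.cols; ++c) {
            total += m.data[r * m.cols + c];
        }
    }
    return total;
}
```

Both functions visit the same elements and differ only in the order of memory accesses. Whether the difference is measurable depends on matrix size, compiler optimizations, and the CPU, so it is worth benchmarking rather than assuming.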
| | dendibakh/perf-ninja | nextcloud/desktop | farbrausch/fr_public |
|---|---|---|---|
| Stars | 3,702 | 3,703 | 3,700 |
| Language | C++ | C++ | C++ |
| Setup difficulty | moderate | hard | hard |
| Complexity | 4/5 | 3/5 | 5/5 |
| Audience | developer | general | developer |
Figures from each repo's GitHub metadata at analysis time.
Requires C++ build tools on Linux, Windows, or Mac. Accurate performance measurements depend on the CPU architecture (Intel 12th gen, AMD Zen3, and Apple M1 are CI-tested).
Performance Ninja is a free, hands-on course for learning how to make code run faster at the hardware level. It is not about high-level design decisions or algorithms. Instead it focuses on the kind of low-level problems that show up on modern CPUs: cache misses, branch mispredictions, and missed opportunities for the processor to do multiple things at once. The course was created by Denis Bakhvalov, author of a book on the same topic, and pairs written lab assignments with companion YouTube videos.
The format is almost entirely practical. The README notes that students spend at least 90 percent of their time actually analyzing and improving code rather than reading theory. Each lab targets one specific problem, and completion times range from 30 minutes to 4 hours depending on your background. When you finish improving the code in a lab, you can submit it to GitHub and an automated benchmarking system checks whether your changes actually made things faster.
Labs are organized into categories: Core Bound labs cover topics like auto-vectorization (getting the CPU to process multiple data items in a single instruction), function inlining, dependency chains, and compiler intrinsics. Memory Bound labs cover data layout, cache-friendly loop patterns, software prefetching, false sharing between CPU cores, and memory alignment. Bad Speculation labs work through situations where the processor guesses wrong about which code path to take next, and how to rewrite code to avoid that.
The assignments are written in C++, and basic C++ knowledge is listed as a hard requirement. The course runs on Linux, Windows, and Mac, and the CI system tests submissions on Intel 12th-gen, AMD Zen3, and Apple M1 machines. Community members have also ported the labs to Rust and Zig. The course is licensed under Creative Commons CC BY 4.0.
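To make the Bad Speculation idea concrete, here is a minimal sketch that is not taken from the course labs. Both function names are hypothetical. The first version has a data-dependent `if`, which the branch predictor may guess wrong often on random input. The second folds the comparison result into arithmetic, so the loop body contains no data-dependent branch to mispredict.

```cpp
#include <cstddef>
#include <vector>

// Branchy version: whether the `if` is taken depends on the data, so
// unpredictable input can lead to frequent branch mispredictions.
std::size_t count_below_branchy(const std::vector<int>& values, int threshold) {
    std::size_t count = 0;
    for (int v : values) {
        if (v < threshold) {
            ++count;
        }
    }
    return count;
}

// Branchless version: the comparison yields 0 or 1, which is added
// unconditionally, leaving only the loop branch itself.
std::size_t count_below_branchless(const std::vector<int>& values, int threshold) {
    std::size_t count = 0;
    for (int v : values) {
        count += static_cast<std::size_t>(v < threshold);
    }
    return count;
}
```

Optimizing compilers often perform this transformation on their own, or vectorize the loop outright, so the two versions may compile to the same machine code. Inspecting the generated assembly and measuring is the only reliable way to tell.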
A free hands-on C++ course of labs for learning low-level CPU performance optimization: cache misses, branch mispredictions, auto-vectorization, and hardware bottlenecks, with automated benchmarking to verify your fixes.
Mainly C++. Community ports of the labs also exist in Rust and Zig.
Free to use and share for any purpose, including commercial, as long as you credit the original author (CC BY 4.0).
Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.
Mainly developers.
Verify against the repo before relying on details.