explaingit

dendibakh/perf-ninja

Analysis updated 2026-07-03

3,702 stars | C++ | Audience · developer | Complexity · 4/5 | Setup · moderate

TLDR

A free hands-on C++ course of labs for learning low-level CPU performance optimization: cache misses, branch mispredictions, auto-vectorization, and hardware bottlenecks, with automated benchmarking to verify your fixes.

Mindmap

mindmap
  root((perf-ninja))
    What it does
      Hands on C++ labs
      Low level CPU tuning
      90 percent practice
    Lab categories
      Core Bound
        Auto-vectorization
        Inlining and intrinsics
        Dependency chains
      Memory Bound
        Cache friendly layouts
        False sharing
        Software prefetch
      Bad Speculation
        Branch misprediction
        Code restructuring
    Validation
      CI benchmarking system
      Intel AMD Apple M1
      Submit and compare
    Background
      Denis Bakhvalov
      YouTube companion videos
      CC BY 4.0 license

What do people build with it?

USE CASE 1

Work through hands-on C++ labs to identify and fix cache misses and branch mispredictions in real code.
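As a rough illustration of the branch-misprediction side of this use case, the sketch below counts values under a threshold in two ways. It is not taken from the labs: the function names `countBelowBranchy` and `countBelowBranchless` are hypothetical, and an optimizing compiler may already emit branch-free code for the first form, so any real gain has to be measured.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical example, not lab code.
// Branchy form: the `if` is hard to predict when the input is random.
std::size_t countBelowBranchy(const std::vector<std::uint32_t>& values,
                              std::uint32_t threshold) {
  std::size_t count = 0;
  for (std::uint32_t v : values) {
    if (v < threshold) {
      ++count;
    }
  }
  return count;
}

// Branchless form: the comparison yields 0 or 1, which is added directly,
// so the loop body has no data-dependent jump in the source.
std::size_t countBelowBranchless(const std::vector<std::uint32_t>& values,
                                 std::uint32_t threshold) {
  std::size_t count = 0;
  for (std::uint32_t v : values) {
    count += static_cast<std::size_t>(v < threshold);
  }
  return count;
}
```

Both functions return the same count; only the shape of the generated code is meant to differ.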

USE CASE 2

Practice CPU auto-vectorization and SIMD compiler intrinsics by optimizing benchmark problems.
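For a sense of what an auto-vectorization-friendly loop tends to look like, here is a minimal sketch: contiguous data, a simple trip count, and no branches or cross-iteration dependencies in the body. The function name `scaledSum` is hypothetical and not from the course. Whether a compiler actually vectorizes it depends on the optimization level and target; reports such as GCC's `-fopt-info-vec` or Clang's `-Rpass=loop-vectorize` can confirm it.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical example, not lab code.
// Computes out[i] = a * x[i] + y[i] over the common length of x and y.
std::vector<float> scaledSum(float a, const std::vector<float>& x,
                             const std::vector<float>& y) {
  const std::size_t n = std::min(x.size(), y.size());
  std::vector<float> out(n);
  const float* xs = x.data();
  const float* ys = y.data();
  float* os = out.data();
  // Each iteration is independent of the others, which is what lets a
  // compiler process several elements per instruction.
  for (std::size_t i = 0; i < n; ++i) {
    os[i] = a * xs[i] + ys[i];
  }
  return out;
}
```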

USE CASE 3

Submit lab solutions to an automated CI benchmarking system to verify whether your changes actually made things faster.
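Before submitting, it can help to time a change locally. The sketch below is a minimal wall-clock harness built on `std::chrono`; it is an assumption for illustration only and says nothing about how the course's CI measures submissions. The helper name `bestTimeNs` is hypothetical.

```cpp
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper, not part of perf-ninja.
// Runs `work` `repeats` times and returns the fastest run in nanoseconds.
// Taking the minimum reduces noise from other processes. With zero repeats
// the function returns the largest representable value.
template <typename Work>
std::int64_t bestTimeNs(Work&& work, int repeats) {
  using Clock = std::chrono::steady_clock;
  std::int64_t best = std::numeric_limits<std::int64_t>::max();
  for (int r = 0; r < repeats; ++r) {
    const Clock::time_point start = Clock::now();
    work();
    const Clock::time_point stop = Clock::now();
    const std::int64_t elapsed = static_cast<std::int64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
            .count());
    if (elapsed < best) {
      best = elapsed;
    }
  }
  return best;
}
```

To compare two versions, pass each one as a lambda and make sure its result is stored somewhere the compiler cannot discard, for example a `volatile` variable, otherwise the measured work may be optimized away.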

USE CASE 4

Study memory-bound bottlenecks like false sharing, cache-friendly data layouts, and software prefetching.
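One common way to make a data layout more cache-friendly is to switch from an array of structures to a structure of arrays, so a loop that reads one field does not drag the unused fields through the cache. The sketch below shows that idea under assumed, hypothetical types (`ParticleAoS`, `ParticlesSoA`); it is not one of the course's labs.

```cpp
#include <vector>

// Hypothetical example, not lab code.
// Array of structures: the fields of each particle are interleaved, so
// summing only `mass` still loads x, y, and z into the cache.
struct ParticleAoS {
  float x;
  float y;
  float z;
  float mass;
};

float totalMassAoS(const std::vector<ParticleAoS>& particles) {
  float total = 0.0f;
  for (const ParticleAoS& p : particles) {
    total += p.mass;
  }
  return total;
}

// Structure of arrays: every field has its own contiguous array, so the
// same sum reads only the bytes it needs.
struct ParticlesSoA {
  std::vector<float> x;
  std::vector<float> y;
  std::vector<float> z;
  std::vector<float> mass;
};

float totalMassSoA(const ParticlesSoA& particles) {
  float total = 0.0f;
  for (float m : particles.mass) {
    total += m;
  }
  return total;
}
```

Which layout wins depends on the access pattern: code that touches every field of one particle at a time may still prefer the first form.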

What is it built with?

C++, Rust, Zig, CMake, Linux

How does it compare?

                  dendibakh/perf-ninja  nextcloud/desktop  farbrausch/fr_public
Stars             3,702                 3,703              3,700
Language          C++                   C++                C++
Setup difficulty  moderate              hard               hard
Complexity        4/5                   3/5                5/5
Audience          developer             general            developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate | Time to first run · 30 min

Requires C++ build tools on Linux, Windows, or Mac. Accurate performance measurements depend on the CPU architecture; Intel 12th gen, AMD Zen3, and Apple M1 are the CI-tested targets.

Free to use and share for any purpose, including commercial, as long as you credit the original author (CC BY 4.0).

In plain English

Performance Ninja is a free, hands-on course for learning how to make code run faster at the hardware level. It is not about high-level design decisions or algorithms. Instead it focuses on the kind of low-level problems that show up on modern CPUs: cache misses, branch mispredictions, and missed opportunities for the processor to do multiple things at once. The course was created by Denis Bakhvalov, author of a book on the same topic, and pairs written lab assignments with companion YouTube videos.

The format is almost entirely practical. The README notes that students spend at least 90 percent of their time analyzing and improving code rather than reading theory. Each lab targets one specific problem, and completion times range from 30 minutes to 4 hours depending on your background. When you finish improving the code in a lab, you can submit it to GitHub, where an automated benchmarking system checks whether your changes actually made things faster.

Labs are organized into categories. Core Bound labs cover topics like auto-vectorization (getting the compiler to emit instructions that process multiple data items at once), function inlining, dependency chains, and compiler intrinsics. Memory Bound labs cover data layout, cache-friendly loop patterns, software prefetching, false sharing between CPU cores, and memory alignment. Bad Speculation labs work through situations where the processor guesses wrong about which code path to take next, and how to rewrite code to avoid that.

The assignments are written in C++, and basic C++ knowledge is listed as a hard requirement. The course runs on Linux, Windows, and Mac, and the CI system tests submissions on Intel 12th-gen, AMD Zen3, and Apple M1 machines. Community members have also ported the labs to Rust and Zig. The course is licensed under Creative Commons CC BY 4.0.
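The dependency-chain topic listed under the Core Bound labs can be illustrated with a small reduction. In the sketch below, the first function forms one long chain where every addition waits for the previous one; the second splits the work across four independent accumulators so the CPU can overlap them. The function names are hypothetical and the code is not from the course. Floating-point addition is not associative, so the two versions can differ in the last bits for general inputs, which is also why compilers do not apply this rewrite on their own without relaxed floating-point flags.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example, not lab code.
// One accumulator: each addition depends on the result of the previous one.
double sumSingleChain(const std::vector<double>& v) {
  double sum = 0.0;
  for (double x : v) {
    sum += x;
  }
  return sum;
}

// Four accumulators: four independent chains that can execute in parallel.
double sumFourChains(const std::vector<double>& v) {
  double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
  const std::size_t n = v.size();
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    s0 += v[i];
    s1 += v[i + 1];
    s2 += v[i + 2];
    s3 += v[i + 3];
  }
  // Handle the remaining 0 to 3 elements.
  for (; i < n; ++i) {
    s0 += v[i];
  }
  return (s0 + s1) + (s2 + s3);
}
```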

Copy-paste prompts

Prompt 1
I'm starting the perf-ninja course. Walk me through the first auto-vectorization lab: what the code does, why it's slow, and how to fix it.
Prompt 2
In the perf-ninja Memory Bound labs, how do I detect and fix a false sharing problem between two threads writing to adjacent memory?
Prompt 3
What is a branch misprediction and how do the perf-ninja Bad Speculation labs teach you to rewrite C++ code to avoid it?
Prompt 4
How does the perf-ninja automated benchmark CI work, and what metrics does it report to confirm that my lab solution is faster?
Prompt 5
Walk me through a perf-ninja Core Bound dependency chain lab and explain how to restructure the loop so the CPU can execute more instructions in parallel.

Frequently asked questions

What is perf-ninja?

A free hands-on C++ course of labs for learning low-level CPU performance optimization: cache misses, branch mispredictions, auto-vectorization, and hardware bottlenecks, with automated benchmarking to verify your fixes.

What language is perf-ninja written in?

Mainly C++. Community ports of the labs also exist in Rust and Zig.

What license does perf-ninja use?

Free to use and share for any purpose, including commercial, as long as you credit the original author (CC BY 4.0).

How hard is perf-ninja to set up?

Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.

Who is perf-ninja for?

Mainly developers.

Verify against the repo before relying on details.