Analysis updated 2026-07-03
Build a study plan from foundational papers to understand how Google, Amazon, and LinkedIn designed their core infrastructure systems.
Find practical blog posts on alerting strategy, postmortem writing, and game-day exercises for improving your team's reliability practices.
Recommend curated reading material to a new site reliability engineer or platform engineer joining your team.
Explore the human and organizational side of reliability with book recommendations covering error theory, checklists, and retrospectives.
| | mmcgrana/services-engineering | alchaincyf/hermes-agent-orange-book | callstack/haul |
|---|---|---|---|
| Stars | 3,684 | 3,684 | 3,684 |
| Language | — | — | TypeScript |
| Setup difficulty | easy | moderate | moderate |
| Complexity | 1/5 | 2/5 | 3/5 |
| Audience | ops, devops | developer | developer |
Figures from each repo's GitHub metadata at analysis time.
This repository is a curated reading list for engineers who build and operate cloud infrastructure services. It collects papers, blog posts, presentations, and books that cover the theory and practice of running large-scale distributed systems reliably.
The papers section is the heaviest part of the list. It includes foundational technical papers from major technology companies describing systems like Google's Bigtable, Spanner, MapReduce, and the Google File System, as well as Amazon's Dynamo key-value store and the Kafka messaging system from LinkedIn. These papers are often referenced when engineers want to understand the design decisions behind widely used infrastructure. The list also includes papers on consensus algorithms, disk failure rates, distributed tracing, and the theory of complexity in software systems.
The blog posts section takes a more practical angle, covering topics like how to think about alerting, how to run game-day exercises that deliberately break systems to find weaknesses, how to write useful postmortems after incidents, and common mistakes people make when reasoning about distributed systems. Several posts come from well-known engineers writing about their experience at Netflix, Stripe, Heroku, Twitter, and similar companies.
The books section is short and leans toward the human and organizational side of reliability: how people understand and cause errors, how to run retrospectives, and how checklists improve performance under pressure. There are also a couple of more technical titles on browser networking and web operations.
The list accepts community contributions and the README links to a contributing guide. It does not include code or tutorials, only pointers to external reading material.
A curated reading list of foundational papers, blog posts, and books on building and operating large-scale distributed systems reliably, covering classic papers from Google, Amazon, and LinkedIn alongside practical posts on alerting, postmortems, and incident response.
Setup difficulty is rated easy, with roughly 5 minutes to a first successful run.
Audience: mainly ops and DevOps engineers.
Verify against the repo before relying on details.