explaingit

mmcgrana/services-engineering

Analysis updated 2026-07-03

Stars · 3,684
Audience · ops devops
Complexity · 1/5
Setup · easy

TLDR

A curated reading list of foundational papers, blog posts, and books on building and operating large-scale distributed systems reliably, covering classic papers from Google, Amazon, and LinkedIn alongside practical posts on alerting, postmortems, and incident response.

Mindmap

mindmap
  root((Services Engineering))
    Papers
      Google Bigtable
      Amazon Dynamo
      Apache Kafka
      Consensus algorithms
    Blog Posts
      Alerting best practices
      Game-day exercises
      Postmortem writing
      Distributed fallacies
    Books
      Human factors
      Retrospectives
      Web operations
    Community
      Open contributions
      Reading only no code

What do people build with it?

USE CASE 1

Build a study plan from foundational papers to understand how Google, Amazon, and LinkedIn designed their core infrastructure systems.

USE CASE 2

Find practical blog posts on alerting strategy, postmortem writing, and game-day exercises for improving your team's reliability practices.

USE CASE 3

Recommend curated reading material to a new site reliability engineer or platform engineer joining your team.

USE CASE 4

Explore the human and organizational side of reliability with book recommendations covering error theory, checklists, and retrospectives.

How does it compare?

Repo              mmcgrana/services-engineering  alchaincyf/hermes-agent-orange-book  callstack/haul
Stars             3,684                          3,684                                3,684
Language          -                              -                                    TypeScript
Setup difficulty  easy                           moderate                             moderate
Complexity        1/5                            2/5                                  3/5
Audience          ops devops                     developer                            developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · easy
Time to first run · 5 min

In plain English

This repository is a curated reading list for engineers who build and operate cloud infrastructure services. It collects papers, blog posts, presentations, and books that cover the theory and practice of running large-scale distributed systems reliably.

The papers section is the heaviest part of the list. It includes foundational technical papers from major technology companies describing systems like Google's Bigtable, Spanner, MapReduce, and the Google File System, as well as Amazon's Dynamo key-value store and the Kafka messaging system from LinkedIn. These papers are often referenced when engineers want to understand the design decisions behind widely used infrastructure. The list also includes papers on consensus algorithms, disk failure rates, distributed tracing, and the theory of complexity in software systems.

The blog posts section takes a more practical angle, covering topics like how to think about alerting, how to run game-day exercises that deliberately break systems to find weaknesses, how to write useful postmortems after incidents, and common mistakes people make when reasoning about distributed systems. Several posts come from well-known engineers writing about their experience at Netflix, Stripe, Heroku, Twitter, and similar companies.

The books section is short and leans toward the human and organizational side of reliability: how people understand and cause errors, how to run retrospectives, and how checklists improve performance under pressure. There are also a couple of more technical titles on browser networking and web operations.

The list accepts community contributions and the README links to a contributing guide. It does not include code or tutorials, only pointers to external reading material.

Copy-paste prompts

Prompt 1
I want to study distributed systems fundamentals. Based on the services-engineering reading list, create a 4-week study plan starting with the most foundational papers.
Prompt 2
I need to improve how my team writes postmortems. From the services-engineering blog post list, summarize the key principles for writing useful incident reports.
Prompt 3
Help me prepare a game-day exercise based on the practices described in the services-engineering reading list to find weaknesses in my system before they cause outages.
Prompt 4
I'm onboarding a new site reliability engineer. Create a first-month reading list from the services-engineering repo, ordered from beginner-friendly to advanced.

Frequently asked questions

What is services-engineering?

A curated reading list of foundational papers, blog posts, and books on building and operating large-scale distributed systems reliably, covering classic papers from Google, Amazon, and LinkedIn alongside practical posts on alerting, postmortems, and incident response.

How hard is services-engineering to set up?

Setup difficulty is rated easy, with roughly 5 minutes to a first successful run.

Who is services-engineering for?

Mainly ops and DevOps engineers.

Verify against the repo before relying on details.