dastergon/awesome-sre

★ 13,201
Audience · ops, devops
Complexity · 1/5
Setup · easy

Mindmap

mindmap
  root((awesome-sre))
    Topic areas
      Reliability
      Monitoring alerting
      On-call postmortems
      Capacity planning
    Learning formats
      Books courses
      Blogs podcasts
      Conference talks
    Audience
      New SREs
      Experienced engineers
    Metrics
      SLAs SLOs
      Error budgets
      Performance

Things people build with this

USE CASE 1

Find curated conference talks and engineering blog posts to learn what Site Reliability Engineering means in day-to-day practice.

USE CASE 2

Discover monitoring, alerting, and on-call tools used by companies like Google, Netflix, and Uber.

USE CASE 3

Build a structured reading plan for a new SRE hire using the organized topic sections on reliability and post-mortems.

Getting it running

Difficulty · easy
Time to first run · 5 min

In plain English

Awesome SRE is a curated list of resources about Site Reliability Engineering and Production Engineering. It does not contain software. It is a long, organized collection of links to articles, talks, books, podcasts, and tools gathered from across the web. Lists like this are common on GitHub and usually carry the word "awesome" in their name to signal that they are hand-picked rather than automatically generated.

The README opens by answering the question of what Site Reliability Engineering is, using a quote from Ben Treynor Sloss of Google, who is credited with founding the discipline. He describes it as what happens when you ask a software engineer to design an operations function. In plain terms, it is the practice of keeping large online services running reliably by writing software and following engineering methods, rather than handling outages by hand.

The bulk of the page is a table of contents that splits the links into themed sections:

Culture, Education, and Books, for newcomers learning the field.
Reliability, Monitoring and Alerting, On-Call, and Post-Mortem, for the day-to-day work of keeping systems healthy and reviewing failures.
Capacity Planning, Service Level Agreements, and Performance, for planning and measuring how a service behaves.
Blogs, newsletters, conferences, Twitter accounts, podcasts, and a set of SRE tools.

Each section is a list of outbound links, many of them to conference talks, company engineering blogs, and well-known industry presentations from organizations such as Google, Facebook, Netflix, Uber, and Dropbox. The repository therefore works as a reading and viewing guide: a starting map for someone who wants to learn how reliability is handled at scale, or a source of deeper material for an experienced engineer.

The page notes that contributions are welcome and points to a separate contribution guide for anyone who wants to add a resource. This summary is based on only part of the README; the full file is longer.
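Because each section is a heading followed by a list of outbound links, the list can be turned into a per-topic reading queue with a few lines of Python. The following is a minimal sketch, not a tool shipped with the repository. It assumes the README uses ordinary Markdown headings ("## Section") and inline links ("[title](url)"). The function name `links_by_section` and the sample text are hypothetical, and the sample links are placeholders rather than real entries from the list.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical excerpt in the usual "awesome list" layout. The section names
# mirror the summary above; the links are placeholders, not real entries.
SAMPLE_README = """\
# Awesome SRE

## Monitoring & Alerting
* [Example talk on alerting](https://example.com/alerting-talk)
* [Example post on dashboards](https://example.com/dashboards)

## Post-Mortem
* [Example post-mortem write-up](https://example.com/postmortem)
"""

# Level 2-6 headings start a new section; the level-1 title is skipped.
HEADING = re.compile(r"^#{2,6}\s+(.*\S)\s*$")
# Inline Markdown links of the form [title](http(s)://url).
LINK = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def links_by_section(markdown: str) -> Dict[str, List[Tuple[str, str]]]:
    """Group (title, url) pairs under the most recent section heading."""
    sections: Dict[str, List[Tuple[str, str]]] = {}
    current = None
    for line in markdown.splitlines():
        heading = HEADING.match(line)
        if heading:
            current = heading.group(1)
            sections[current] = []
        elif current is not None:
            # Links that appear before the first section are ignored.
            sections[current].extend(LINK.findall(line))
    return sections


if __name__ == "__main__":
    for section, links in links_by_section(SAMPLE_README).items():
        print(f"{section}: {len(links)} link(s)")
        for title, url in links:
            print(f"  - {title} <{url}>")
```

To try it on the real list, replace `SAMPLE_README` with the contents of a local copy of the README. Whether every section of the actual file follows this exact heading and link layout should be checked against the repository.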

Copy-paste prompts

Prompt 1

I am new to SRE. Give me a 4-week reading plan using only the topics and links found in the awesome-sre list.

Prompt 2

Find me resources from the awesome-sre list specifically about writing useful blameless post-mortems.

Prompt 3

I need to introduce SLAs and error budgets to my engineering team. Which sections of awesome-sre should I start with?

Prompt 4

List the podcast and newsletter resources from awesome-sre that suit a software engineer moving into an SRE role.

Verify against the repo before relying on details.