explaingit

netflix/chaosmonkey

16,886 · Go · Audience · ops devops · Complexity · 4/5 · Setup · hard

TLDR

Netflix's resilience tool that randomly kills instances in production so engineers learn to build systems that survive failures gracefully.

Mindmap

mindmap
    root((chaosmonkey))
      Inputs
        Spinnaker config
        Cloud credentials
        Schedule
      Outputs
        Random instance terminations
        Failure events
      Use Cases
        Test service resiliency
        Validate failover paths
        Practice chaos engineering
      Tech Stack
        Go
        Spinnaker
        AWS
        Kubernetes

Things people build with this

USE CASE 1

Randomly terminate instances in a Spinnaker-managed service to test resiliency

USE CASE 2

Introduce scheduled chaos experiments into a staging cloud environment

USE CASE 3

Validate that autoscaling and failover handle node loss without user impact

Tech stack

Go · Spinnaker · AWS · Kubernetes

Getting it running

Difficulty · hard · Time to first run · 1 day+

Requires a working Spinnaker deployment, itself a multi-service install, before Chaos Monkey can be wired in.

In plain English

Chaos Monkey is a resilience testing tool from Netflix that deliberately causes random failures in a running production system. It works by randomly terminating virtual machine instances and containers, the individual units an application runs on, while the system is live. The goal is not to break things maliciously but to force engineers to build services that survive unexpected failures gracefully. If your system can handle Chaos Monkey randomly killing pieces of it, it is far less likely to collapse during an unplanned real-world outage. This approach is part of a broader discipline called Chaos Engineering: the practice of intentionally introducing controlled failures to expose weaknesses before they cause customer-facing problems.

Chaos Monkey is written in Go and is designed to work with Spinnaker, a continuous delivery platform (a system for automatically deploying software updates). You need to be managing your applications through Spinnaker to use Chaos Monkey. Through Spinnaker, it can target cloud backends including AWS, Google Compute Engine, Azure, Kubernetes, and Cloud Foundry.

You would use this tool if you are a reliability or infrastructure engineer at a company running large cloud-based services and you want to proactively test whether your system degrades gracefully when individual components fail.

Copy-paste prompts

Prompt 1
Walk me through installing Spinnaker on AWS just so I can run Chaos Monkey against a demo service
Prompt 2
Compare Chaos Monkey, Gremlin, and chaos-mesh for a Kubernetes-only stack and recommend one
Prompt 3
Draft a runbook for a first Chaos Monkey experiment that limits blast radius to one service
Prompt 4
Show me the smallest possible Chaos Monkey config that kills one EC2 instance per workday

Verify against the repo before relying on details.