explaingit

shenli/distributed-system-testing

Analysis updated 2026-06-24

Stars · 158 | Audience · ops devops | Complexity · 4/5 | Setup · moderate

TLDR

Two Markdown skill files for AI coding agents that design and run claim-driven chaos and fault-injection test plans for distributed and stateful systems.

Mindmap

mindmap
  root((distributed-system-testing))
    Inputs
      System under test docs
      Existing test inventory
      Product claims
    Outputs
      Test plan markdown
      Findings report
      Per scenario verdicts
    Use Cases
      Pre release chaos audit
      Falsify safety claims
      Reviewer ready evidence
    Tech Stack
      Markdown
      Shell
      Claude Code
      Codex
      Cursor

What do people build with it?

USE CASE 1

Generate a structured chaos and fault injection test plan for a distributed database or queue before a release

USE CASE 2

Run claim-driven scenarios that try to falsify linearizability, idempotency, or no-lost-ack guarantees

USE CASE 3

Produce a reviewer-ready findings report so a manager can sign off without rerunning every test

What is it built with?

Markdown, Shell, Claude Code, Cursor

How does it compare?

| | shenli/distributed-system-testing | vadimsemenykv/saboteur | helpmeeadice/bandori-pet-rev |
|---|---|---|---|
| Stars | 158 | 157 | 156 |
| Language | | Go | Python |
| Setup difficulty | moderate | easy | moderate |
| Complexity | 4/5 | 3/5 | 3/5 |
| Audience | ops devops | developer | general |

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate Time to first run · 1h+

Requires an AI agent that can read Markdown and run shell commands, such as Claude Code or Cursor. The skills run real chaos scenarios, so the system under test needs an isolated environment.

In plain English

This repository ships two skill files for AI coding agents (Claude Code, Codex, Copilot CLI, Cursor, Gemini, or anything that reads Markdown and runs shell commands) aimed at testing distributed and stateful systems. One skill designs a test plan, the other runs it. The deliverables are two Markdown artifacts: a structured plan and a findings report. A reviewer reads those two files and decides whether the system is safe to ship, without having to re-run the tests themselves.

The pitch is that the usual approach to testing distributed systems, writing a handful of integration tests, misses the bugs that actually break them in production: partial network partitions, non-deterministic concurrency, crash and recovery sequences, upgrades and rollbacks, replay-induced idempotency issues, and timing-sensitive ordering.

The skills push an opinionated workflow built around what the README calls claim-driven testing: every scenario is named after a product claim it tries to falsify under one specific fault, rather than being named after its setup. For claims about safety, durability, idempotency, isolation, ordering, or membership, each scenario also has to declare an abstract model (register, queue, log, lock, lease, ledger, or similar), an operation-history schema, a named checker such as linearizability, serializability, session-consistency, no-lost-ack, or exactly-once, plus a nemesis that injects the fault along with observable landing evidence proving the fault actually fired.

Outcomes use a 9-state verdict set so that a clean chaos script run cannot quietly be read as a passed claim. Failures carry a blame tag classifying whether the system under test, the harness, the checker, or the environment is at fault, so reproducers reach the right team.
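To make the scenario ingredients concrete, here is a minimal Python sketch of a claim-driven scenario for a queue model with a no-lost-ack checker. It is an illustration only, not code from the repository: the `Op`, `Scenario`, `lost_acked_values`, and `evaluate` names, the history fields, and the three outcome strings are all hypothetical, and the sketch assumes the queue is fully drained before the history is checked. The actual skills use a 9-state verdict set that this summary does not enumerate.

```python
from dataclasses import dataclass, field
from enum import Enum


class Blame(Enum):
    # The four blame targets named in the summary; the string values are assumptions.
    SYSTEM_UNDER_TEST = "system-under-test"
    HARNESS = "harness"
    CHECKER = "checker"
    ENVIRONMENT = "environment"


@dataclass(frozen=True)
class Op:
    """One entry in a hypothetical operation-history schema for a queue model."""
    kind: str    # "enqueue" or "dequeue"
    value: int
    acked: bool  # True if the client saw a success response


@dataclass
class Scenario:
    claim: str                  # product claim the scenario tries to falsify
    fault: str                  # the one specific fault the nemesis injects
    model: str                  # abstract model, e.g. "queue"
    checker: str                # named checker, e.g. "no-lost-ack"
    history: list[Op] = field(default_factory=list)
    fault_landed: bool = False  # set from observable landing evidence


def lost_acked_values(history: list[Op]) -> set[int]:
    """Values whose enqueue was acknowledged but which were never dequeued."""
    acked = {op.value for op in history if op.kind == "enqueue" and op.acked}
    seen = {op.value for op in history if op.kind == "dequeue"}
    return acked - seen


def evaluate(s: Scenario) -> str:
    """Hypothetical three-way outcome, far coarser than a 9-state verdict set."""
    lost = lost_acked_values(s.history)
    if lost:
        return f"claim falsified: lost {sorted(lost)}"
    if not s.fault_landed:
        # A clean run proves nothing if the fault never actually fired.
        return "inconclusive: no landing evidence"
    return "claim held under fault"


scenario = Scenario(
    claim="acknowledged messages are never lost",
    fault="broker restart",
    model="queue",
    checker="no-lost-ack",
    history=[Op("enqueue", 1, True), Op("enqueue", 2, True), Op("dequeue", 1, True)],
    fault_landed=True,
)
print(evaluate(scenario))  # prints "claim falsified: lost [2]"
```

The point of the separate `fault_landed` flag is the one the summary makes: a history with no violations only counts toward the claim when there is evidence that the fault was actually injected.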
The outputs are organized as a testing-plans Markdown file plus a per-run test-sessions directory containing a session log, raw logs, metric snapshots, ephemeral harness artifacts, per-scenario findings written as the run proceeds, and a summary report.

The plan itself follows a fixed structure with sections covering architecture, scope, claims under test, missing claims found through documentation versus code drift, the system model, an inventory of existing tests, failure-mode hypotheses tied to claim IDs, a coverage matrix, technique selection, environment requirements, the scenarios themselves, a coverage adequacy argument, residual uncertainty, a confidence statement, and open follow-ups.

Installation is a single prompt pasted into an agent that tells it to fetch INSTALL.md from the repository. The agent then clones the repository into ~/.local/share/distributed-testing-skills/ and wires the skills into the agent's own skill configuration. The skills are plain SKILL.md files, not a binary or service, so any agent that can read Markdown and run shell commands can use them.
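The fixed plan structure lends itself to a simple completeness check. The following Python sketch generates a plan skeleton and reports which sections a draft is missing. The section list follows the summary above, but the exact heading text, the `## ` heading level, and the function names are assumptions rather than the repository's actual template.

```python
# Section names taken from the summary; the exact wording of each heading is assumed.
PLAN_SECTIONS = [
    "Architecture",
    "Scope",
    "Claims under test",
    "Missing claims",
    "System model",
    "Existing test inventory",
    "Failure-mode hypotheses",
    "Coverage matrix",
    "Technique selection",
    "Environment requirements",
    "Scenarios",
    "Coverage adequacy argument",
    "Residual uncertainty",
    "Confidence statement",
    "Open follow-ups",
]


def plan_skeleton(title: str) -> str:
    """Build an empty plan with one second-level heading per section."""
    lines = [f"# {title}", ""]
    for name in PLAN_SECTIONS:
        lines += [f"## {name}", "", "TODO", ""]
    return "\n".join(lines)


def missing_sections(plan_text: str) -> list[str]:
    """Return the expected sections that have no matching '## ' heading."""
    headings = {
        line[3:].strip().lower()
        for line in plan_text.splitlines()
        if line.startswith("## ")
    }
    return [name for name in PLAN_SECTIONS if name.lower() not in headings]


draft = plan_skeleton("Test plan: example system").replace("## Coverage matrix", "## Notes")
print(missing_sections(draft))  # prints "['Coverage matrix']"
```

A check like this could run before a reviewer reads the plan, so that a missing coverage matrix or confidence statement is caught mechanically rather than by eye.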

Copy-paste prompts

Prompt 1
Install the distributed-system-testing skills into my Claude Code agent and design a test plan for my Postgres logical replication setup
Prompt 2
Use the planning skill to draft scenarios that falsify exactly-once delivery in my Kafka consumer group under a broker-restart nemesis
Prompt 3
Run the execution skill against the attached plan and write per-scenario findings using the 9-state verdict set
Prompt 4
Map my service's product claims to model types like register, queue, log, lock, lease, or ledger and pick the right checker
Prompt 5
Show me the testing-plans Markdown structure and fill in the failure-mode hypotheses for a Raft-based metadata store

Frequently asked questions

What is distributed-system-testing?

Two Markdown skill files for AI coding agents that design and run claim-driven chaos and fault-injection test plans for distributed and stateful systems.

How hard is distributed-system-testing to set up?

Setup difficulty is rated moderate, with roughly 1h+ to a first successful run.

Who is distributed-system-testing for?

Mainly ops and DevOps teams.


Verify against the repo before relying on details.