shenli/distributed-system-testing

★ 158

In plain English

This repository ships two skill files for AI coding agents (Claude Code, Codex, Copilot CLI, Cursor, Gemini, or anything that reads Markdown and runs shell commands) aimed at testing distributed and stateful systems. One skill designs a test plan, the other runs it. The deliverables are two Markdown artifacts: a structured plan and a findings report. A reviewer reads those two files and decides whether the system is safe to ship, without having to re-run the tests themselves.

The pitch is that the usual approach to testing distributed systems, writing a handful of integration tests, misses the bugs that actually break them in production: partial network partitions, non-deterministic concurrency, crash and recovery sequences, upgrades and rollbacks, replay-induced idempotency issues, and timing-sensitive ordering.

The skills push an opinionated workflow built around what the README calls claim-driven testing: every scenario is named after a product claim it tries to falsify under one specific fault, rather than being named after its setup. For claims about safety, durability, idempotency, isolation, ordering, or membership, each scenario also has to declare an abstract model (register, queue, log, lock, lease, ledger, or similar), an operation-history schema, a named checker such as linearizability, serializability, session-consistency, no-lost-ack, or exactly-once, plus a nemesis that injects the fault along with observable landing evidence proving the fault actually fired. Outcomes use a 9-state verdict set so that a clean chaos script run cannot quietly be read as a passed claim. Failures carry a blame tag classifying whether the system under test, the harness, the checker, or the environment is at fault, so reproducers reach the right team.

The outputs are organized as a testing-plans Markdown file plus a per-run test-sessions directory containing a session log, raw logs, metric snapshots, ephemeral harness artifacts, per-scenario findings written as the run proceeds, and a summary report. The plan itself follows a fixed structure with sections covering architecture, scope, claims under test, missing claims found through documentation versus code drift, the system model, an inventory of existing tests, failure-mode hypotheses tied to claim IDs, a coverage matrix, technique selection, environment requirements, the scenarios themselves, a coverage adequacy argument, residual uncertainty, a confidence statement, and open follow-ups.

Installation is a single prompt pasted into an agent that tells it to fetch INSTALL.md from the repository. The agent then clones the repository into ~/.local/share/distributed-testing-skills/ and wires the skills into the agent's own skill configuration. The skills are plain SKILL.md files, not a binary or service, so any agent that can read Markdown and run shell commands can use them.
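
To make the claim-driven scenario declaration described above more concrete, here is a minimal Python sketch. The class, field names, and helper functions are hypothetical illustrations and are not taken from the repository's SKILL.md files. The verdict set is left out because this summary does not list its nine states, and the no-lost-ack check is a toy version, assuming the harness can collect acknowledged writes and a final read as sets.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Claim categories that, per the summary above, require the full declaration.
STRICT_CATEGORIES = {
    "safety", "durability", "idempotency", "isolation", "ordering", "membership",
}


class Blame(Enum):
    """Who is at fault for a failure (the four parties named in the summary)."""
    SYSTEM_UNDER_TEST = "system-under-test"
    HARNESS = "harness"
    CHECKER = "checker"
    ENVIRONMENT = "environment"


@dataclass
class Scenario:
    """Hypothetical record for one claim-driven scenario (illustrative names)."""
    claim: str                              # product claim the scenario tries to falsify
    category: str                           # e.g. "durability"
    fault: str                              # the one specific fault applied
    model: Optional[str] = None             # abstract model: register, queue, log, ...
    history_schema: Optional[str] = None    # shape of the recorded operation history
    checker: Optional[str] = None           # e.g. "no-lost-ack"
    nemesis: Optional[str] = None           # what injects the fault
    landing_evidence: Optional[str] = None  # observable proof the fault fired


def missing_declarations(s: Scenario) -> list[str]:
    """Return the required fields left empty for strict-category claims."""
    if s.category not in STRICT_CATEGORIES:
        return []
    required = ["model", "history_schema", "checker", "nemesis", "landing_evidence"]
    return [name for name in required if not getattr(s, name)]


def lost_acks(acked_writes: set[str], final_reads: set[str]) -> set[str]:
    """Toy no-lost-ack check: every acknowledged write must be readable at the end."""
    return acked_writes - final_reads


# Usage: an incomplete durability scenario and a history with one lost write.
scenario = Scenario(
    claim="Acknowledged writes survive a leader crash",
    category="durability",
    fault="leader process killed mid-write",
    model="log",
    checker="no-lost-ack",
)
print(missing_declarations(scenario))  # ['history_schema', 'nemesis', 'landing_evidence']
print(lost_acks({"w1", "w2", "w3"}, {"w1", "w3"}))  # {'w2'}
```

A non-empty result from `missing_declarations` would mean the scenario is not yet ready to run, and a non-empty result from `lost_acks` would be a candidate finding that still needs a `Blame` tag before it is routed to a team.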

Generated 2026-05-21 · Model: sonnet-4-6 · Verify against the repo before relying on details.