explaingit

nathanmarz/storm

8,784 · Java
Audience · data
Complexity · 4/5
Setup · hard

TLDR

Storm is a distributed system for processing continuous streams of data in real time across a cluster of machines. It reacts to each event as it arrives rather than processing data in batches as Hadoop does.

Mindmap

mindmap
  root((Storm))
    What it does
      Real-time streaming
      Distributed processing
      Event-by-event reaction
    Core Concepts
      Spouts emit data
      Bolts process data
      Topologies connect them
    Use Cases
      Stream processing
      Running aggregations
      Distributed RPC
    Compared to Hadoop
      Real-time vs batch
      Same multi-machine idea
    History
      Created by Nathan Marz
      Donated to Apache

Things people build with this

USE CASE 1

Build a real-time analytics pipeline that reacts to each incoming event (a click, a transaction, a sensor reading) the moment it arrives.

USE CASE 2

Keep a running count or aggregation updated continuously as a stream of data flows in, without waiting for a nightly batch job.
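As a rough illustration of the running-count idea, here is a minimal single-process sketch in plain Java. It does not use Storm's API: the `RunningCount` class and its `onEvent` and `count` methods are hypothetical names chosen for this example, and an in-memory map stands in for state that a real deployment would spread across machines.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch, not a Storm API: a per-key count updated one event at a time. */
class RunningCount {
    private final Map<String, Long> counts = new HashMap<>();

    /** Handle a single event as it arrives and return the updated count for its key. */
    long onEvent(String key) {
        return counts.merge(key, 1L, Long::sum);
    }

    /** Current count for a key; 0 if the key has never been seen. */
    long count(String key) {
        return counts.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        RunningCount rc = new RunningCount();
        // The array stands in for a live stream; each event updates the tally immediately.
        for (String event : new String[] {"click", "purchase", "click"}) {
            System.out.println(event + " -> " + rc.onEvent(event));
        }
    }
}
```

The point of the sketch is that the count is current after every event, so a reader of the tally never has to wait for a batch job to finish.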

USE CASE 3

Distribute a compute-heavy calculation across many machines and collect the combined result quickly using distributed remote procedure calls.
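The scatter-and-gather shape behind this use case can be sketched in plain Java, with a thread pool standing in for cluster nodes. This is not Storm's distributed RPC API: the `ScatterGather` class, the `sumOfSquares` method, and the choice of summing squares as the workload are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch, not Storm's DRPC API: threads stand in for cluster nodes. */
class ScatterGather {

    /** Split [1, n] into chunks, sum the squares of each chunk in parallel, then combine. */
    static long sumOfSquares(int n, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            int chunk = (n + workers - 1) / workers; // ceiling division
            // Scatter: one task per chunk of the input range.
            for (int start = 1; start <= n; start += chunk) {
                final int lo = start;
                final int hi = Math.min(n, start + chunk - 1);
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (long i = lo; i <= hi; i++) {
                        sum += i * i;
                    }
                    return sum;
                };
                partials.add(pool.submit(task));
            }
            // Gather: wait for every partial result and add them up.
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while gathering results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a worker task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(10, 4)); // prints 385
    }
}
```

The caller sees an ordinary function call that returns one value, while the work behind it is split up and recombined. That is the general pattern the use case describes; how Storm itself routes the request is not shown here.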

Tech stack

Java

Getting it running

Difficulty · hard
Time to first run · 1 day+

Production use requires a multi-machine cluster. Documentation and tutorials live on an external Apache wiki rather than in the repository.

In plain English

Storm is a system for processing continuous streams of data in real time, spread across multiple computers working together. The core idea is that data keeps arriving in a constant flow, and Storm lets you write programs that react to each piece of data as it comes in, rather than collecting everything first and then processing it in bulk.

The README draws a comparison to Hadoop, a well-known tool for batch processing large datasets on multiple machines. Storm plays the same role for real-time data: it gives developers building blocks for splitting stream-processing work across many computers, keeping that work running even if some machines fail, and doing it all at high speed.

According to the description, it supports several use cases: stream processing (reacting to each event as it arrives), continuous computation (keeping running tallies or aggregations updated as data flows in), and distributed remote procedure calls (sending a request to be computed across many nodes and getting a result back quickly). Storm was designed to work with any programming language, not just Java.

The project was created by Nathan Marz and later donated to the Apache Software Foundation, which hosts its mailing lists and ongoing development. This GitHub repository is the original pre-Apache version. The README says little about installation or configuration and points readers to a separate wiki for documentation and tutorials. It also links to a list of companies that have used the project.

Copy-paste prompts

Prompt 1
Explain Storm's topology concept in plain English. What are spouts, bolts, and streams, and how do they connect to form a real-time data processing pipeline?
Prompt 2
I want to count word frequencies in a live stream of text messages using Storm. Give me the Java code for a basic word-count topology with a spout that emits sentences and a bolt that counts words.
Prompt 3
How does Storm guarantee that each message is processed at least once even if a machine fails mid-way? Explain the acking mechanism simply.
Prompt 4
What is the difference between Storm and Apache Spark Streaming? Give me a concrete example of when I would choose Storm over Spark for a real-time job.
Prompt 5
How do I set up a local Storm cluster for development so I can test a topology on my laptop before deploying to multiple machines?

Verify against the repo before relying on details.