explaingit

apache/beam

8,583 | Java | Audience · data | Complexity · 4/5 | License | Setup · moderate

TLDR

A framework for writing data processing pipelines once in Java, Python, or Go and running them on your laptop, Google Cloud Dataflow, Apache Spark, or Flink without changing the code.

Mindmap

mindmap
  root((beam))
    What it does
      Data pipelines
      Batch processing
      Stream processing
    Languages
      Java
      Python
      Go
    Runners
      Local DirectRunner
      Google Dataflow
      Apache Flink
      Apache Spark
    Concepts
      PCollections
      PTransforms
      Pipelines

Things people build with this

USE CASE 1

Write a data transformation pipeline in Python and run it locally for testing, then submit it to Google Cloud Dataflow for production without changing the code

USE CASE 2

Build a real-time event stream processor that aggregates data continuously as it arrives

USE CASE 3

Move a Beam pipeline that runs on Spark or Flink to a different execution engine by changing its runner configuration rather than its code

Tech stack

Java, Python, Go, Apache Flink, Apache Spark, Google Cloud Dataflow

Getting it running

Difficulty · moderate | Time to first run · 30 min

Production runs require a cloud account or a running Flink or Spark cluster; local testing needs only Java or Python.

Free to use for any purpose, including commercial use. You must include the Apache license notice when redistributing.

In plain English

Apache Beam is an open-source framework for writing data processing programs that can run on many different computing systems without changing the code. You write the logic once using Java, Python, or Go, and then choose where to run it: on your laptop for testing, or on a large distributed cluster for production workloads.

The core idea is that data processing follows a common pattern regardless of scale. You define a pipeline, which is a graph of steps. Each step takes a collection of data as input, does something to it (filter, transform, aggregate), and produces another collection as output. Beam calls these collections PCollections and the steps PTransforms. The same code works whether the data is a fixed file you process once (batch) or a live stream of events arriving continuously (streaming).

Because the code is separate from where it runs, Beam supports several execution environments called runners. The DirectRunner runs everything on your local machine, which is useful for development and testing. The DataflowRunner submits the job to Google Cloud Dataflow. The FlinkRunner and SparkRunner send it to Apache Flink or Apache Spark clusters respectively. Switching runners is a configuration change, not a code change.

This design came out of earlier Google internal systems, including MapReduce and a streaming processing model that Google researchers published around 2015. Beam brought that model into open source under the Apache Software Foundation.

The repository contains the SDK code for all three languages, the runner implementations, and a large set of example programs including a classic word-count example recommended for first-time users. Documentation and quickstart guides for each language are available on the official Apache Beam website.

Copy-paste prompts

Prompt 1
Help me write an Apache Beam pipeline in Python that reads a CSV file, filters rows where a column exceeds a threshold, and writes the output to a new file using DirectRunner.
Prompt 2
I want to run an Apache Beam job on Google Cloud Dataflow. Show me how to switch from DirectRunner to DataflowRunner and what GCP project setup I need first.
Prompt 3
Walk me through creating a streaming Apache Beam pipeline in Java that reads events from a Pub/Sub topic and writes 5-minute windowed counts to BigQuery.

Verify against the repo before relying on details.