explaingit

apache/seatunnel

9,326 · Java · Audience · data · Complexity · 4/5 · License · Apache 2.0 · Setup · hard

TLDR

Apache SeaTunnel moves large amounts of data between 160+ sources and destinations (databases, cloud storage, message queues), either as scheduled batch jobs or as continuous real-time streams, without requiring custom connector code.

Mindmap

mindmap
  root((SeaTunnel))
    Data Movement
      Batch pipelines
      Real-time streaming
    Connectors
      160 plus sources
      Databases
      Cloud storage
    Engines
      Built-in Zeta
      Apache Spark
      Apache Flink
    Features
      CDC sync
      Monitoring

Things people build with this

USE CASE 1

Sync a MySQL database to a data warehouse in near real-time using change data capture

USE CASE 2

Build a batch ETL pipeline that moves files from cloud storage to PostgreSQL with column transformations

USE CASE 3

Connect 160+ data sources and destinations without writing custom connector or integration code

USE CASE 4

Monitor pipeline throughput and catch duplicate or missing records across data stores
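
As a rough illustration of what the check in use case 4 involves, the sketch below compares the primary keys read from a source with the keys written to a sink and reports throughput. It is plain Python, not SeaTunnel's monitoring API; the function names `reconcile` and `rows_per_second` are hypothetical and exist only for this example.

```python
from collections import Counter


def reconcile(source_ids, sink_ids):
    """Compare primary keys from a source with keys found in a sink.

    Hypothetical helper for illustration; not part of SeaTunnel.
    """
    source_counts = Counter(source_ids)
    sink_counts = Counter(sink_ids)
    # Keys that were read from the source but never reached the sink.
    missing = sorted(k for k in source_counts if k not in sink_counts)
    # Keys that appear in the sink more often than in the source.
    duplicated = sorted(
        k for k in source_counts if sink_counts[k] > source_counts[k]
    )
    return {"missing": missing, "duplicated": duplicated}


def rows_per_second(row_count, elapsed_seconds):
    """Average throughput over a time window."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return row_count / elapsed_seconds


if __name__ == "__main__":
    report = reconcile(source_ids=[1, 2, 3, 4], sink_ids=[1, 2, 2, 4])
    print(report)
    print(rows_per_second(row_count=1000, elapsed_seconds=4))
```

In a real pipeline the two key lists would come from queries against the actual source and destination, and the counts would usually be compared per time window instead of over the whole table.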

Tech stack

Java · Apache Spark · Apache Flink · SQL

Getting it running

Difficulty · hard
Time to first run · 1 day+

Requires a processing engine (Zeta, Spark, or Flink) and configured source and sink connectors. Production pipelines need significant infrastructure planning.

Free to use for any purpose including commercial use, as long as you keep the copyright and license notice (Apache 2.0 license).

In plain English

Apache SeaTunnel is a tool for moving large amounts of data between different data storage systems. In many organizations, data lives in many places at once: databases, data warehouses, cloud storage buckets, messaging systems, and more. SeaTunnel connects these sources and destinations so data can flow between them reliably and at high speed, whether in scheduled batch runs or as a continuous real-time stream.

The tool supports over 160 connectors, which are plugins that know how to read from or write to a specific system. Examples include databases like MySQL and PostgreSQL, cloud services, message queues, and file storage. You configure a job by specifying a source connector, any transformations to apply along the way, and a sink connector for the destination. SeaTunnel then executes that job on a processing engine.

That engine can be SeaTunnel's own built-in engine, called Zeta, or the job can be delegated to Apache Spark or Apache Flink, two widely used distributed data processing frameworks. This means teams already using Spark or Flink can adopt SeaTunnel without replacing their existing infrastructure.

One feature highlighted in the README is change data capture (CDC): SeaTunnel can watch a database for changes as they happen and forward those changes to another system in near real time, keeping two data stores in sync. It also includes monitoring so you can track throughput and catch problems like duplicate or missing records.

The project is part of the Apache Software Foundation, licensed under the Apache 2.0 license, and is used in production by companies including ByteDance, Tencent Cloud, and JP Morgan. Documentation and downloads are available on the official SeaTunnel website.
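
To make the source, transform, and sink model and the change data capture idea more concrete, here is a minimal conceptual sketch in plain Python. It is not SeaTunnel code or SeaTunnel's configuration format: `ChangeEvent`, `source`, `uppercase_name`, `DictSink`, and `run_job` are all hypothetical names invented for this example, and an in-memory dict stands in for the destination table.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Optional


@dataclass
class ChangeEvent:
    op: str               # "insert", "update", or "delete"
    key: int              # primary key of the affected row
    row: Optional[dict]   # new row contents; None for deletes


def source() -> Iterator[ChangeEvent]:
    """Hypothetical source: emits a fixed sequence of change events."""
    yield ChangeEvent("insert", 1, {"name": "ada"})
    yield ChangeEvent("insert", 2, {"name": "bob"})
    yield ChangeEvent("update", 1, {"name": "ada lovelace"})
    yield ChangeEvent("delete", 2, None)


def uppercase_name(event: ChangeEvent) -> ChangeEvent:
    """Hypothetical transform: upper-cases the 'name' column."""
    if event.row is None:
        return event  # deletes carry no row data to transform
    new_row = {**event.row, "name": event.row["name"].upper()}
    return ChangeEvent(event.op, event.key, new_row)


class DictSink:
    """Hypothetical sink: applies changes to an in-memory table."""

    def __init__(self) -> None:
        self.table: Dict[int, dict] = {}

    def write(self, event: ChangeEvent) -> None:
        if event.op == "delete":
            self.table.pop(event.key, None)
        else:
            # Inserts and updates are both treated as upserts by key.
            self.table[event.key] = event.row


def run_job(
    events: Iterable[ChangeEvent],
    transforms: List[Callable[[ChangeEvent], ChangeEvent]],
    sink: DictSink,
) -> int:
    """Pass each event through the transforms, then hand it to the sink."""
    processed = 0
    for event in events:
        for transform in transforms:
            event = transform(event)
        sink.write(event)
        processed += 1
    return processed


if __name__ == "__main__":
    sink = DictSink()
    count = run_job(source(), [uppercase_name], sink)
    print(count, sink.table)
```

Treating inserts and updates as upserts keyed by primary key is a simplification chosen for this sketch; it keeps the destination consistent even if an event is replayed, which is one common way to avoid duplicate rows.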

Copy-paste prompts

Prompt 1
Show me a SeaTunnel job config that reads from a MySQL table using CDC and streams every change into a Kafka topic in real time.
Prompt 2
Write a SeaTunnel configuration to move a large CSV file from S3 into a PostgreSQL table and cast a date column to the correct type.
Prompt 3
How do I run a SeaTunnel job on the built-in Zeta engine instead of Spark or Flink, and what are the trade-offs for each?
Prompt 4
I need to keep two databases in sync using SeaTunnel. Walk me through setting up a CDC pipeline with conflict detection and monitoring.
Prompt 5
How do I check throughput and spot missing or duplicate records in a running SeaTunnel production pipeline?

Verify against the repo before relying on details.