explaingit

apache/dolphinscheduler

14,272 | Java | Audience · data | Complexity · 4/5 | License | Setup · hard

TLDR

An open-source workflow scheduler for data pipelines: build, run, and monitor multi-step data tasks visually without much code. The README states it can handle tens of millions of tasks per day.

Mindmap

mindmap
  root((repo))
    What it does
      Workflow scheduling
      Data pipelines
      Task orchestration
    Interfaces
      Visual drag-drop
      Python API
      REST API
    Deployment
      Single server
      Cluster mode
      Docker
      Kubernetes
    Integrations
      MySQL PostgreSQL
      Hive Trino
      Browser monitoring


Things people build with this

USE CASE 1

Schedule a daily ETL pipeline that pulls from a database, transforms data, and loads it to a warehouse using a visual drag-and-drop editor.

USE CASE 2

Monitor the health and resource usage of your data workflows from a browser dashboard without SSHing into servers.

USE CASE 3

Run high-volume data workflows using distributed cluster mode to process tens of millions of tasks per day.
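To put "tens of millions of tasks per day" in perspective, the figure can be converted to a sustained per-second rate. This is a rough arithmetic sketch only; the two daily volumes below are illustrative assumptions, not numbers taken from the README, and real load is rarely spread evenly across a day.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def tasks_per_second(tasks_per_day: int) -> float:
    """Average task rate if the daily volume were spread evenly over 24 hours."""
    return tasks_per_day / SECONDS_PER_DAY


# Illustrative daily volumes (assumed, not from the README).
for daily in (10_000_000, 50_000_000):
    print(f"{daily:,} tasks/day is about {tasks_per_second(daily):.0f} tasks/s")
```

At these assumed volumes the cluster would need to sustain a few hundred task starts per second on average, with extra headroom for peak hours.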

Tech stack

Java · Python · Docker · Kubernetes

Getting it running

Difficulty · hard | Time to first run · 1h+

Production deployment requires a multi-server cluster. Docker single-server mode is easier for evaluation but is not production-grade.

Licensed under Apache 2.0: use it freely for any purpose, including commercial use, with attribution.

In plain English

Apache DolphinScheduler is a tool for planning and running data workflows. A workflow is a series of tasks that need to run in a specific order, for example: pull data from a database, transform it, then load it somewhere else. DolphinScheduler handles the scheduling of those tasks, tracks dependencies between them, and keeps everything running reliably.

You build workflows through a visual drag-and-drop interface in a web browser, without writing much code. There is also a Python programming interface and an API for teams that prefer to manage things programmatically. The tool supports a wide range of task types out of the box, meaning you can connect it to many common data systems without custom plugins.

It is built to handle large volumes of work. The README states it can process tens of millions of tasks per day and performs several times faster than comparable tools. It uses a distributed architecture where multiple servers share the load, so you can add more capacity by adding more machines rather than replacing existing hardware.

You can run it in several ways: as a single-server setup for quick evaluation, as a cluster for production use, or inside Docker or Kubernetes container environments. It supports connecting to many external databases including MySQL, PostgreSQL, Hive, and Trino. There is also built-in monitoring so you can see server health and resource usage in a browser without logging into the machines directly.

The project is part of the Apache Software Foundation and is open source under the Apache 2.0 license.
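The idea of "tasks that run in a specific order" can be illustrated with a small dependency graph. The sketch below uses only Python's standard library (`graphlib`, Python 3.9+) and is not DolphinScheduler's own API; the task names and the `run_order` helper are hypothetical, chosen to mirror the pull, transform, load example above.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),          # pull data from a database
    "transform": {"extract"},  # runs only after extract finishes
    "load": {"transform"},     # runs only after transform finishes
}


def run_order(deps):
    """Return one execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


print(run_order(dependencies))
```

A scheduler does conceptually the same thing at larger scale: it resolves which tasks are ready to run, dispatches them, and releases downstream tasks once their upstream dependencies succeed.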

Copy-paste prompts

Prompt 1
Help me set up Apache DolphinScheduler with Docker and create a simple daily ETL workflow using the visual drag-and-drop UI.
Prompt 2
How do I connect DolphinScheduler to PostgreSQL and Hive to orchestrate a data pipeline that moves data between them on a schedule?
Prompt 3
Show me how to define a DolphinScheduler workflow using the Python API instead of the visual UI, including setting task dependencies.
Prompt 4
What does the DolphinScheduler distributed architecture look like and how do I add more worker nodes to scale throughput for a production cluster?

Verify against the repo before relying on details.