explaingit

datastacktv/data-engineer-roadmap

12,752 · Audience · data · Complexity · 1/5 · Setup · easy

TLDR

A visual diagram mapping all the tools, cloud platforms, and concepts you would encounter across a data engineering career, meant as a landscape reference rather than a step-by-step curriculum.

Mindmap

mindmap
  root((repo))
    What it covers
      Cloud platforms
      Pipeline tools
      Storage formats
      Query engines
    Audience
      Beginners
      Career changers
    Format
      Visual diagram
      Text version
    Created by
      datastack.tv


Things people build with this

USE CASE 1

Get a quick overview of the entire data engineering landscape before deciding which skills to learn first

USE CASE 2

Identify gaps in your current data engineering knowledge by scanning the roadmap categories

USE CASE 3

Use the diagram to explain the data engineering field to a manager or non-technical stakeholder

USE CASE 4

Find the names of tools in areas like orchestration, storage formats, or query engines to research further

Tech stack

Markdown

Getting it running

Difficulty · easy · Time to first run · 5 min
No explicit license is stated in the repository; treat it as a reference document and check before reusing the diagram commercially.

In plain English

data-engineer-roadmap is a visual reference guide showing the tools, technologies, and concepts a person would need to learn to work as a data engineer. Data engineering is the discipline of building and maintaining the pipelines that move, store, clean, and prepare data so that analysts, dashboards, and machine learning systems can use it. This roadmap attempts to map out that entire field in a single diagram, presented as a large image hosted in the repository.

The roadmap covers the modern data engineering landscape as of 2021, grouping topics across areas such as cloud platforms, data pipeline tools, storage formats, orchestration systems, query engines, and programming languages. A text version of the diagram is included in the repository for users who cannot view the image. There is also a separate extras diagram covering additional tools that are useful to know but not strictly required for most roles.

The README includes a note for beginners: a working data engineer would typically master only a subset of these tools over several years, shaped by the company they work for and the kinds of problems they encounter. The diagram is intended as a map of the overall landscape, not a checklist to complete before getting started.

The README itself is sparse and the main content is the roadmap image, which the README links to but does not describe in text. The project was created by datastack.tv, a learning platform that produces screencast tutorials for data engineers. Community suggestions and pull requests are welcome.

Copy-paste prompts

Prompt 1
Based on the data-engineer-roadmap, I am a Python developer with no data engineering experience. Build me a 12-week learning plan covering the most essential tools from each category in priority order.
Prompt 2
I see the data-engineer-roadmap lists several orchestration tools. Compare Apache Airflow, Prefect, and Dagster in terms of setup complexity, community size, and which team sizes each suits best.
Prompt 3
Using the data-engineer-roadmap as a reference, which tools in the roadmap integrate natively with dbt, and what role does each play in a typical analytics engineering stack?
Prompt 4
I work at a startup with one data engineer. Based on the data-engineer-roadmap categories, what is the minimum viable toolset I should master first to get data flowing from our Postgres database to a BI tool?
Prompt 5
Explain the difference between a data lake and a data lakehouse as shown in the data-engineer-roadmap, and give a concrete example of when I would choose each architecture.


Verify against the repo before relying on details.