explaingit

datatalksclub/data-engineering-zoomcamp

40,939 | Jupyter Notebook | Audience · developer | Complexity · 3/5 | Active | Setup · hard

TLDR

Free nine-week course teaching data pipeline fundamentals: Docker, Terraform, workflow orchestration, BigQuery, dbt, Spark, and Kafka for aspiring data engineers.

Mindmap

mindmap
  root((repo))
    What it covers
      Docker and Terraform
      Workflow orchestration
      Data warehousing
      Batch and streaming
    Learning format
      Jupyter Notebooks
      Video lectures
      Homework assignments
    Tech stack
      Google BigQuery
      Apache Spark
      Apache Kafka
      dbt
    Use cases
      Build production pipelines
      Learn industry tools
      Complete capstone project
    Audience
      Python and SQL basics
      Career changers
      Self-paced learners

Things people build with this

USE CASE 1

Learn to build and deploy production data pipelines using industry-standard tools.

USE CASE 2

Gain hands-on experience with Docker, Terraform, and workflow orchestration for real data engineering jobs.

USE CASE 3

Master data warehousing, transformation, and streaming technologies through structured modules and a capstone project.

USE CASE 4

Transition from SQL/Python knowledge to full-stack data engineering with practical homework and real-world scenarios.

Tech stack

Python · SQL · Docker · Terraform · Kestra · Google BigQuery · Apache Spark · Apache Kafka

Getting it running

Difficulty · hard | Time to first run · 1 day+

Running the full examples requires Docker, Terraform, a GCP account with BigQuery, and multiple distributed systems (Kafka, Spark, Kestra).

License could not be detected automatically. Check the repository's LICENSE file before use.

In plain English

Data Engineering Zoomcamp is a free nine-week online course that teaches the fundamentals of building data pipelines from scratch. Data engineering is the discipline of designing and building the systems that collect, move, transform, and store data so that it can be used for analysis and machine learning. The course addresses a gap that many aspiring data professionals face: they know how to write SQL or Python but lack hands-on experience with the production infrastructure tools that real data jobs require.

The course is structured as seven modules followed by a final project. The first module covers containerization with Docker and infrastructure provisioning with Terraform, tools for packaging software and managing cloud resources consistently. Module two teaches workflow orchestration, the practice of scheduling and monitoring data pipelines, using Kestra. Later modules cover data warehousing in Google BigQuery; analytics engineering with dbt, a tool for transforming data inside a warehouse using SQL; batch processing with Apache Spark for large-scale distributed computation; and streaming data with Apache Kafka for real-time event processing. Each module includes homework assignments, and the course ends with a capstone project in which students build a complete end-to-end pipeline.

You would enroll in or self-study this course if you have basic Python and SQL knowledge and want practical experience with the tools used in industry data engineering roles. The course runs in cohorts starting each January, but all materials, including Jupyter Notebooks, lecture videos, and homework, are freely available for self-paced study. The primary format is Jupyter Notebook alongside code and configuration files.
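
As a rough illustration of the "collect, transform, and store" stages described above, here is a minimal sketch in plain Python. It is not taken from the course materials: the sample data, the `trips` table, and the function names `extract`, `transform`, `load`, and `run_pipeline` are all hypothetical, and SQLite stands in for a real warehouse such as BigQuery.

```python
import csv
import io
import sqlite3

# Hypothetical raw input (an assumption for illustration, not course data).
RAW_CSV = """trip_id,distance_km,fare
1,2.5,7.80
2,,5.00
3,10.0,25.50
"""


def extract(text):
    """Collect: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing distance and cast column types."""
    clean = []
    for row in rows:
        if not row["distance_km"]:
            continue  # skip incomplete records
        clean.append(
            (int(row["trip_id"]), float(row["distance_km"]), float(row["fare"]))
        )
    return clean


def load(rows, conn):
    """Store: write the cleaned rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trips "
        "(trip_id INTEGER PRIMARY KEY, distance_km REAL, fare REAL)"
    )
    conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    """Run all three stages, then return (row count, total fare)."""
    load(transform(extract(RAW_CSV)), conn)
    return conn.execute("SELECT COUNT(*), SUM(fare) FROM trips").fetchone()


if __name__ == "__main__":
    print(run_pipeline(sqlite3.connect(":memory:")))
```

In a production setting each stage would typically be a separate task that an orchestrator schedules and monitors, which is the kind of setup the course modules build toward.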

Copy-paste prompts

Prompt 1
Walk me through the data engineering zoomcamp module on Docker and Terraform. What problems do they solve in a data pipeline?
Prompt 2
I want to set up a data pipeline using Kestra for workflow orchestration. Show me how the zoomcamp course structures this.
Prompt 3
Explain the dbt module from data engineering zoomcamp: how do you transform data inside a warehouse using SQL?
Prompt 4
What's the difference between batch processing with Spark and streaming with Kafka? Use examples from the zoomcamp course.
Prompt 5
Help me design a capstone project for the data engineering zoomcamp that uses BigQuery, dbt, and Spark together.

Generated 2026-05-18 · Model: sonnet-4-6 · Verify against the repo before relying on details.