igorbarinov/awesome-data-engineering

★ 8,627 | Audience · data | Complexity · 1/5 | Setup · easy

Mindmap

mindmap
  root((repo))
    Databases
      Relational PostgreSQL MySQL
      Key-value Redis
      Document MongoDB
      Graph Neo4j
      Time series InfluxDB
    Data movement
      Ingestion Kafka Logstash
      Stream processing
      Batch Spark Hadoop
    Operations
      Workflow orchestration
      Monitoring
      Data quality
    Community
      Books and podcasts
      Conferences
      Forums


Things people build with this

USE CASE 1

Quickly find the right database type for your data project (relational, document, graph, or time series) from one organized reference list

USE CASE 2

Discover stream or batch processing tools like Kafka or Spark when planning a new data pipeline

USE CASE 3

Find orchestration and monitoring tools to schedule and observe your data workflows

Tech stack

Apache Kafka · Apache Spark · PostgreSQL · Redis · MongoDB · Elasticsearch

Getting it running

Difficulty · easy | Time to first run · 5 min

In plain English

This is a curated list of tools and technologies used in data engineering. Data engineering is the field focused on building and maintaining the systems that collect, store, move, and process large amounts of data so that analysts and data scientists can work with it. The repository contains no code; it is a reference list of links to relevant projects, grouped by category. Each entry is a short description with a link to the project.

The list covers many types of databases: relational databases like PostgreSQL and MySQL, key-value stores like Redis and DynamoDB, column-oriented databases like Cassandra and ClickHouse, document databases like MongoDB and Elasticsearch, graph databases like Neo4j, time series databases like InfluxDB, and distributed databases.

Beyond storage, the list covers tools for moving and processing data. There are sections on data ingestion (getting data from one place to another, with tools such as Apache Kafka and Logstash), stream processing (handling data as it arrives in real time), and batch processing (working through large stored datasets, with tools like Apache Spark and Hadoop). There are also sections on file systems and on serialization formats, which define how data is structured and stored on disk.

The list extends into operational concerns, with sections on workflow orchestration (scheduling and coordinating data pipelines), monitoring, data quality testing, and data profiling (understanding the shape and content of a dataset). It also covers charts and dashboards, ELK stack tooling, and Docker-related resources.

The list ends with community resources, including forums, conferences, podcasts, and books related to data engineering. It is formatted as a standard Awesome list, a common GitHub convention for community-maintained reference collections. This summary is based on only part of the README; the full list is longer.
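To make the difference between batch and stream processing described above more concrete, here is a minimal Python sketch. It is illustrative only and does not come from the repository, which contains no code; the function names `batch_total` and `stream_totals` and the sample numbers are hypothetical. Real tools such as Spark or Kafka apply the same idea at a much larger scale.

```python
from typing import Iterable, Iterator, Sequence


def batch_total(events: Sequence[int]) -> int:
    """Batch style: the whole dataset is already stored, so process it in one pass."""
    return sum(events)


def stream_totals(events: Iterable[int]) -> Iterator[int]:
    """Stream style: emit an updated running total as each event arrives."""
    total = 0
    for value in events:
        total += value
        yield total


if __name__ == "__main__":
    readings = [3, 5, 2]  # hypothetical sample data
    print(batch_total(readings))                # one answer after all data is read
    print(list(stream_totals(iter(readings))))  # one answer per incoming event
```

The batch function returns a single result once all the data is available, while the streaming function produces an intermediate result after every event. That trade-off (completeness versus latency) is a useful thing to have in mind when browsing the two sections of the list.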

Copy-paste prompts

Prompt 1

Based on the awesome-data-engineering list, recommend the best open-source tool for ingesting data from a REST API into PostgreSQL and explain the trade-offs

Prompt 2

Using awesome-data-engineering as context, compare Apache Kafka and Apache Spark for real-time event processing and explain when to use each

Prompt 3

I need to add data quality testing to my daily ETL job. Based on awesome-data-engineering, which tools should I evaluate and what do they each check for?


Verify against the repo before relying on details.