delta-io/delta

★ 8,798 · Scala · Audience · data · Complexity · 4/5 · License · Apache 2.0 · Setup · hard

Mindmap

mindmap
  root((Delta Lake))
    What It Does
      Reliable data storage
      Concurrent write safety
      Time travel queries
      Rollback support
    Compute Integrations
      Apache Spark
      Flink and Trino
      Hive
    Language APIs
      Scala and Java
      Python and Rust
    Where Data Lives
      Amazon S3
      Azure Data Lake
      Cloud object storage

Things people build with this

USE CASE 1

Store large datasets in S3 and safely update them from multiple concurrent jobs without corrupting data.

USE CASE 2

Query the state of a data table as it looked at a past point in time using the time travel feature.

USE CASE 3

Replace brittle CSV or Parquet pipelines with a transaction-safe format that supports rollback if a bad write occurs.

USE CASE 4

Integrate Delta Lake into an existing Apache Spark or Flink pipeline without rewriting data processing code.
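
As a rough illustration of use cases 2 and 3, the sketch below reads a table as of a past version or timestamp and restores it after a bad write. It assumes a `spark` session already configured for Delta Lake and the `delta-spark` Python package. The function names `time_travel_options`, `read_snapshot`, and `rollback` are hypothetical helpers invented for this example, while `versionAsOf`, `timestampAsOf`, and `DeltaTable.restoreToVersion` come from the Delta Lake Spark API.

```python
from typing import Any, Dict, Optional


def time_travel_options(version: Optional[int] = None,
                        timestamp: Optional[str] = None) -> Dict[str, Any]:
    """Build the reader option that pins a Delta read to a past snapshot."""
    # Exactly one of the two selectors must be given.
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return {"versionAsOf": version}
    return {"timestampAsOf": timestamp}


def read_snapshot(spark, path: str,
                  version: Optional[int] = None,
                  timestamp: Optional[str] = None):
    """Load the table at `path` as it looked at a past version or timestamp."""
    reader = spark.read.format("delta")
    for key, value in time_travel_options(version, timestamp).items():
        reader = reader.option(key, value)
    return reader.load(path)


def rollback(spark, path: str, version: int):
    """Restore the table at `path` to an earlier version after a bad write."""
    # Imported lazily so the helpers above work without delta-spark installed.
    from delta.tables import DeltaTable
    return DeltaTable.forPath(spark, path).restoreToVersion(version)
```

For example, `read_snapshot(spark, "/tmp/events", version=3)` would return a DataFrame of the table at version 3, where `/tmp/events` is a placeholder path.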

Tech stack

ScalaJavaPythonRustApache SparkFlinkTrinoHive

Getting it running

Difficulty · hard Time to first run · 1h+

Requires Apache Spark as the primary compute engine; cloud object storage such as S3 or Azure Data Lake is needed for most real use cases.
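
A minimal local setup might look like the sketch below, assuming `pyspark` and `delta-spark` are installed from pip. The two Spark configuration keys and `configure_spark_with_delta_pip` come from the Delta Lake quickstart. The function names, the app name, and the use of a local path instead of S3 are assumptions made for illustration, since writing to S3 needs extra Hadoop and credential setup not covered here.

```python
from typing import Dict


def delta_spark_conf() -> Dict[str, str]:
    """Spark settings that enable Delta Lake's SQL extension and catalog."""
    return {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog":
            "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def build_session(app_name: str = "delta-demo"):
    """Create a SparkSession with Delta enabled (needs pyspark and delta-spark)."""
    # Imported lazily so this module loads even where Spark is not installed.
    from pyspark.sql import SparkSession
    from delta import configure_spark_with_delta_pip

    builder = SparkSession.builder.appName(app_name)
    for key, value in delta_spark_conf().items():
        builder = builder.config(key, value)
    # Adds the matching Delta JARs to the session before starting it.
    return configure_spark_with_delta_pip(builder).getOrCreate()


def write_and_read(spark, path: str):
    """Write a tiny DataFrame as a Delta table at `path`, then read it back."""
    spark.range(0, 5).write.format("delta").mode("overwrite").save(path)
    return spark.read.format("delta").load(path)
```

Calling `write_and_read(build_session(), "/tmp/delta-demo")` should create a small table under the placeholder path `/tmp/delta-demo`. The first run downloads the Delta JARs, which accounts for some of the setup time.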

Apache 2.0: use, modify, and distribute freely for any purpose, including commercial use, as long as you include the license and copyright notice.

In plain English

Delta Lake is an open-source storage layer that sits on top of existing data storage systems, such as cloud object storage like Amazon S3 or Azure Data Lake, and adds capabilities those systems do not provide on their own. The most important is the ability to treat large data files more like a database: you can read and write data reliably even when multiple processes do so at the same time, roll back to an earlier version of a dataset if something goes wrong, and get consistent results when querying data that another job is updating.

The project is particularly common in data engineering and analytics, where teams store massive amounts of structured data and process it with tools like Apache Spark, Flink, Trino, or Hive. Delta Lake integrates with all of those tools through connectors, so existing data pipelines can adopt it without a complete rewrite. APIs are available for Scala, Java, Python, Rust, and Ruby.

At a technical level, Delta Lake achieves its reliability guarantees by maintaining a transaction log alongside the actual data files. Every write to a Delta table is recorded as a transaction, and the log is what enables features like time travel (querying the state of a table at a past point in time), concurrent write safety, and the ability for newer versions of the software to always read tables written by older versions.

The project originated at Databricks and is now part of the Linux Foundation. It has a companion ecosystem of related repositories covering Rust bindings, data sharing, and Kafka ingestion. The core library here is written in Scala and requires Apache Spark as the primary compute engine for most use cases. The license is Apache 2.0.
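
To make the transaction-log idea above concrete, here is a toy model in plain Python. It is not Delta Lake's implementation or file format. It only illustrates the general technique: numbered commit files record which data files were added or removed, a snapshot is rebuilt by replaying commits up to a version, and a writer that loses the race for a version number gets a conflict. The class `ToyLog`, the `_toy_log` directory, and the file names are all invented for this sketch.

```python
import json
import os
import tempfile
from typing import Dict, List, Set


class ToyLog:
    """A toy append-only commit log; NOT Delta Lake's actual implementation."""

    def __init__(self, root: str) -> None:
        self.log_dir = os.path.join(root, "_toy_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _path(self, version: int) -> str:
        return os.path.join(self.log_dir, f"{version:020d}.json")

    def commit(self, read_version: int, add: List[str], remove: List[str]) -> int:
        """Try to commit as read_version + 1; fail if another writer got there first."""
        version = read_version + 1
        try:
            # Mode "x" creates the file only if it does not exist yet, so two
            # writers racing for the same version cannot both succeed.
            with open(self._path(version), "x") as f:
                json.dump({"add": add, "remove": remove}, f)
        except FileExistsError:
            raise RuntimeError(f"conflict: version {version} already committed") from None
        return version

    def snapshot(self, version: int) -> Set[str]:
        """Replay commits 0..version to get the set of live data files."""
        files: Set[str] = set()
        for v in range(version + 1):
            with open(self._path(v)) as f:
                entry = json.load(f)
            files |= set(entry["add"])
            files -= set(entry["remove"])
        return files


def demo() -> Dict[str, object]:
    with tempfile.TemporaryDirectory() as root:
        log = ToyLog(root)
        v0 = log.commit(-1, add=["part-0.parquet"], remove=[])
        v1 = log.commit(v0, add=["part-1.parquet"], remove=["part-0.parquet"])
        # A second writer that also started from version 0 now loses the race.
        try:
            log.commit(v0, add=["part-2.parquet"], remove=[])
            conflict = False
        except RuntimeError:
            conflict = True
        return {
            "latest": log.snapshot(v1),       # current state of the table
            "time_travel": log.snapshot(v0),  # state as of the first commit
            "conflict": conflict,
        }
```

In this model a rollback would be one more commit whose adds and removes bring the live file set back to an older snapshot, so history is never rewritten. A losing writer would typically re-read the latest version and retry.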

Copy-paste prompts

Prompt 1

Show me how to write a PySpark script that saves a DataFrame as a Delta Lake table in S3 and reads it back.

Prompt 2

How do I use Delta Lake time travel to query what a table looked like two days ago using Python?

Prompt 3

How do I set up concurrent writes to a Delta Lake table from multiple Spark jobs without data corruption?

Prompt 4

What is the Delta Lake transaction log and how does it enable features like rollback and consistent reads?

Verify against the repo before relying on details.