explaingit

apache/iceberg

8,848 · Java · Audience: data · Complexity: 4/5 · Setup: hard

TLDR

A table format standard that lets multiple data tools like Spark and Flink safely read and write the same massive dataset at once, with transactions, row-level updates, and version history.

Mindmap

mindmap
  root((iceberg))
    What it does
      Organize large datasets
      Safe concurrent access
      Version and rollback
    Integrations
      Spark
      Flink
      Trino
      Hive
    Features
      Row-level updates
      Time travel
      Schema evolution
    Audience
      Data engineers
      Analytics teams

Things people build with this

USE CASE 1

Build a data lakehouse where Spark writes new records and Trino queries them simultaneously without data corruption.

USE CASE 2

Roll back a large dataset to a previous version after a bad pipeline run overwrites critical records.

USE CASE 3

Share a single dataset between teams using different processing engines like Flink and Presto without format conflicts.

USE CASE 4

Update or delete individual rows in file-based storage without rewriting entire data files.

Tech stack

Java · Spark · Flink · Python · Go · Rust

Getting it running

Difficulty · hard
Time to first run · 1h+

Requires integration with a processing engine like Spark and a storage backend like S3 or HDFS.

Use freely for any purpose, including commercial use, provided you include the license notice and state any changes you made.

In plain English

Apache Iceberg is a table format for storing and querying very large datasets. Think of it as a standardized way to organize enormous collections of data files on disk or in cloud storage so that multiple different analysis tools can read and write to the same data safely, even at the same time.

The problem it solves is that large-scale data analysis typically involves many separate tools. One tool might be reading sales records while another is writing new ones, or different teams might use different processing engines depending on what they are comfortable with. Without a shared format that understands transactions and versioning, these tools can conflict with each other or produce inconsistent results. Iceberg provides a stable specification that tools like Spark, Flink, Trino, Presto, Hive, and Impala can all integrate with, giving them a consistent view of the data.

Iceberg also handles features you would expect from a proper database table: you can update or delete individual rows, roll back to a previous version of the data if something goes wrong, and run queries efficiently without scanning every file. These capabilities are unusual for file-based storage systems, which traditionally treat data as append-only.

This repository is the reference Java implementation. There are also separate community implementations in Go, Python, Rust, and C++ for teams using other languages. The Java library is what most processing engines integrate against directly.

This is infrastructure-level software for data engineering teams. It is not an end-user application but a core component in data warehouse and analytics platform stacks.
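To make the ideas of transactions, version history, and rollback more concrete, here is a minimal pure-Python sketch. It is a toy model, not Iceberg's API: the names `ToyTable`, `Snapshot`, `CommitConflict`, and the file names are all hypothetical, and the real format keeps this state in metadata files rather than in memory. The sketch assumes each commit produces a new immutable snapshot (a list of visible data files) and that a writer's commit is rejected if another writer committed first.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: the data files visible at one version."""
    snapshot_id: int
    files: frozenset


class CommitConflict(Exception):
    """Raised when a writer's base snapshot is no longer the current one."""


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for table metadata (not Iceberg's API)."""
    snapshots: list = field(default_factory=lambda: [Snapshot(0, frozenset())])
    current_id: int = 0

    def snapshot(self, snapshot_id: int) -> Snapshot:
        # Reading an older snapshot by id is the toy version of "time travel".
        if not 0 <= snapshot_id < len(self.snapshots):
            raise KeyError(f"unknown snapshot {snapshot_id}")
        return self.snapshots[snapshot_id]

    def current(self) -> Snapshot:
        return self.snapshot(self.current_id)

    def commit(self, base_id: int, add=(), remove=()) -> Snapshot:
        # Optimistic check: the commit only succeeds if nobody else
        # has moved the table forward since this writer started.
        if base_id != self.current_id:
            raise CommitConflict(
                f"based on snapshot {base_id}, but current is {self.current_id}"
            )
        base = self.current()
        files = (base.files - frozenset(remove)) | frozenset(add)
        new = Snapshot(len(self.snapshots), files)
        self.snapshots.append(new)
        self.current_id = new.snapshot_id
        return new

    def rollback(self, snapshot_id: int) -> None:
        # Rollback only moves the "current" pointer; old snapshots are kept.
        self.snapshot(snapshot_id)  # validates that the id exists
        self.current_id = snapshot_id


table = ToyTable()
good = table.commit(base_id=0, add=["sales-001.parquet"])

# Two writers start from the same snapshot; only the first commit wins.
base_a = base_b = table.current_id
bad = table.commit(base_a, add=["bad-load.parquet"])
try:
    table.commit(base_b, add=["sales-002.parquet"])
    conflict_detected = False
except CommitConflict:
    conflict_detected = True

# Undo the bad load by pointing the table back at the earlier snapshot.
table.rollback(good.snapshot_id)
```

In this sketch the losing writer would retry against the new current snapshot, and the rolled-back snapshot remains readable by id, which is the same intuition behind querying a table as it looked at an earlier point.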

Copy-paste prompts

Prompt 1
Help me set up an Apache Iceberg table in S3 that I can write to with Spark Streaming and query with Trino without conflicts.
Prompt 2
Show me how to use Iceberg time travel to query what a table looked like 3 days ago after a bad data load.
Prompt 3
Write a Spark job that creates an Iceberg table with schema evolution so I can add new columns without breaking existing queries.
Prompt 4
Explain how Iceberg partition transforms work and how to use them to make date-range queries faster without managing partitions manually.

Verify against the repo before relying on details.