Build a data lakehouse where Spark writes new records and Trino queries them simultaneously without data corruption.
Roll back a large dataset to a previous version after a bad pipeline run overwrites critical records.
Share a single dataset between teams using different processing engines like Flink and Presto without format conflicts.
Update or delete individual rows in file-based storage without rewriting entire data files.
Requires integration with a processing engine like Spark and a storage backend like S3 or HDFS.
Apache Iceberg is a table format for storing and querying very large datasets. Think of it as a standardized way to organize enormous collections of data files on disk or in cloud storage so that multiple different analysis tools can read and write to the same data safely, even at the same time.

The problem it solves is that large-scale data analysis typically involves many separate tools. One tool might be reading sales records while another is writing new ones, or different teams might use different processing engines depending on what they are comfortable with. Without a shared format that understands transactions and versioning, these tools can conflict with each other or produce inconsistent results. Iceberg provides a stable specification that tools like Spark, Flink, Trino, Presto, Hive, and Impala can all integrate with, giving them a consistent view of the data.

Iceberg also handles features you would expect from a proper database table: you can update or delete individual rows, roll back to a previous version of the data if something goes wrong, and run queries efficiently without scanning every file. These capabilities are unusual for file-based storage systems, which traditionally treat data as append-only.

This repository is the reference Java implementation. There are also separate community implementations in Go, Python, Rust, and C++ for teams using other languages. The Java library is what most processing engines integrate against directly.

This is infrastructure-level software for data engineering teams. It is not an end-user application but a core component in data warehouse and analytics platform stacks.
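To make the versioning and rollback behaviour described above more concrete, here is a minimal toy model in Python. It is a sketch under stated assumptions, not Iceberg's API: the names ToyTable, Snapshot, commit_append, rollback_to, and CommitConflict are all hypothetical, and the file names are made up. It only illustrates the general idea that each table version is an immutable list of data files, that a commit succeeds only if the writer started from the current version, and that rolling back moves a pointer rather than rewriting data.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Snapshot:
    """An immutable table version: just the data files it contains."""
    snapshot_id: int
    data_files: Tuple[str, ...]


class CommitConflict(Exception):
    """Raised when a writer's base snapshot is no longer the current one."""


class ToyTable:
    """Hypothetical stand-in for table metadata; not Iceberg's real API."""

    def __init__(self) -> None:
        first = Snapshot(snapshot_id=1, data_files=())
        self.snapshots: Dict[int, Snapshot] = {1: first}
        self.current_id = 1

    def current(self) -> Snapshot:
        return self.snapshots[self.current_id]

    def commit_append(self, base_id: int, new_files: List[str]) -> Snapshot:
        # Optimistic check: succeed only if nobody else committed
        # since this writer read the table.
        if base_id != self.current_id:
            raise CommitConflict(
                f"base {base_id} is stale, current is {self.current_id}"
            )
        base = self.snapshots[base_id]
        snap = Snapshot(
            snapshot_id=max(self.snapshots) + 1,
            data_files=base.data_files + tuple(new_files),
        )
        self.snapshots[snap.snapshot_id] = snap
        self.current_id = snap.snapshot_id  # the single pointer swap
        return snap

    def rollback_to(self, snapshot_id: int) -> None:
        # Rolling back only moves the pointer; no data files are rewritten.
        if snapshot_id not in self.snapshots:
            raise KeyError(snapshot_id)
        self.current_id = snapshot_id


table = ToyTable()
good = table.commit_append(table.current_id, ["sales-001.parquet"])

# A reader pins the snapshot it started with.
reader_view = table.current()

# A bad pipeline run commits a faulty file while the reader is still working.
table.commit_append(good.snapshot_id, ["sales-BAD.parquet"])

print(reader_view.data_files)      # the pinned reader still sees only the good file
print(table.current().data_files)  # new readers see both files

table.rollback_to(good.snapshot_id)
print(table.current().data_files)  # back to the good version
```

In a real deployment the equivalent operations go through a processing engine and a catalog rather than an in-memory object, so treat this only as a mental model and check the Iceberg documentation for the actual procedures.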
Verify against the repo before relying on details.