explaingit

apache/datafusion

8,760 · Rust · Audience · developer · Complexity · 4/5 · License · Apache 2.0 · Setup · moderate

TLDR

Apache DataFusion is a fast, embeddable SQL and DataFrame query engine written in Rust that lets developers build database tools and data pipelines without writing query execution from scratch.

Mindmap

mindmap
  root((DataFusion))
    What it does
      SQL query execution
      DataFrame operations
      Multi-threaded processing
    Data formats
      CSV Parquet JSON
      Avro Arrow
    Extend it
      Custom data sources
      Custom functions
    Ecosystem
      DataFusion Python
      DataFusion Comet
      Apache Software Foundation


Things people build with this

USE CASE 1

Build a custom query engine in Rust that runs SQL against CSV, Parquet, or JSON files without a database server.

USE CASE 2

Run fast SQL queries on local data files from a Python script using the DataFusion Python bindings.

USE CASE 3

Speed up Apache Spark jobs by using DataFusion Comet as a drop-in execution plugin.

USE CASE 4

Create a data pipeline that reads large Parquet files, filters and aggregates them with SQL, and writes results efficiently.

Tech stack

Rust · Python · Apache Arrow · Parquet · SQL

Getting it running

Difficulty · moderate · Time to first run · 30 min

Rust users need a working Rust toolchain. Python users can run pip install datafusion and start immediately with no extra setup.

Apache 2.0 license: use freely for any purpose, including commercial, as long as you preserve copyright and license notices.

In plain English

Apache DataFusion is a query engine: software that lets you run SQL queries or DataFrame-style operations against data stored in files such as CSV, Parquet, JSON, and Avro. It is written in Rust and is designed to process data quickly by operating on batches of column values at a time and spreading the work across many CPU threads.

The project is aimed at developers who want to build their own database tools, data pipelines, or custom query systems; it is not a finished end-user product. You bring your data and your application, and DataFusion provides the core machinery for parsing queries, planning how to execute them, and running them efficiently. You can extend it with your own data sources, functions, and operators.

Two related projects make DataFusion usable without writing Rust. DataFusion Python provides a Python interface so you can run SQL or DataFrame queries from Python scripts. DataFusion Comet is a plugin for Apache Spark that uses DataFusion to speed up Spark jobs.

Out of the box, DataFusion includes a full SQL parser, support for common file formats, date and time functions, cryptographic functions, regular expression functions, and Unicode handling. Many of these features are optional and can be turned on or off depending on what your project needs.

The project is part of the Apache Software Foundation and follows Apache governance. It has an active community, documentation on its website, and a Discord channel for discussion. The README links to getting-started guides for both Rust developers and Python users.
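
The idea of processing column batches across several threads, mentioned in the description above, can be illustrated with a toy sketch in plain Python. This is not DataFusion's API or implementation: it uses only the standard library, and every name in it (TABLE, split_batches, partial_sum, grouped_sum) is hypothetical. It mimics a query of roughly the shape SELECT city, SUM(sales) FROM t WHERE sales >= 20 GROUP BY city by filtering and partially aggregating each batch on a worker thread, then merging the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical columnar table: one list per column, rows aligned by index.
TABLE = {
    "city": ["Oslo", "Rome", "Oslo", "Rome", "Oslo", "Rome"],
    "sales": [10, 20, 30, 40, 50, 60],
}


def split_batches(table, batch_size):
    """Yield column-oriented slices of the table (toy stand-ins for record batches)."""
    n_rows = len(next(iter(table.values())))
    for start in range(0, n_rows, batch_size):
        yield {name: col[start:start + batch_size] for name, col in table.items()}


def partial_sum(batch, min_sales):
    """Filter one batch and return per-city partial sums for the surviving rows."""
    sums = {}
    for city, sales in zip(batch["city"], batch["sales"]):
        if sales >= min_sales:
            sums[city] = sums.get(city, 0) + sales
    return sums


def grouped_sum(table, min_sales, batch_size=2, workers=4):
    """Aggregate each batch on a worker thread, then merge the partial results."""
    totals = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(
            lambda batch: partial_sum(batch, min_sales),
            split_batches(table, batch_size),
        )
        for partial in partials:
            for city, value in partial.items():
                totals[city] = totals.get(city, 0) + value
    return totals


result = grouped_sum(TABLE, min_sales=20)
print(result)
```

In CPython, threads running pure Python code like this mostly show the structure of the approach and do not give a real speedup. The sketch is only meant to show why splitting data into batches and merging partial aggregates lets a query use several workers at once.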

Copy-paste prompts

Prompt 1
Using DataFusion in Python, show me how to read a Parquet file, run a SQL GROUP BY aggregation, and collect the results into a Pandas DataFrame.
Prompt 2
How do I create a custom table provider in Rust with DataFusion so I can run SQL queries against my own in-memory data format?
Prompt 3
I want to build a lightweight data catalog tool in Rust that lets users query multiple Parquet files in S3 with SQL. How do I structure the DataFusion integration?
Prompt 4
Show me how to register multiple CSV files as a single logical table in DataFusion Python and query across them with a SQL JOIN.
Prompt 5
Explain the DataFusion execution pipeline from SQL parsing through logical planning, physical planning, and parallel execution. What hooks exist for customization?

Verify against the repo before relying on details.