Apache DataFusion is an open-source, embeddable analytical query engine written in Rust, built on Apache Arrow's columnar memory format.^[1]^[2] It provides SQL and DataFrame interfaces for analytical query execution and is designed to be used as a library by developers building databases, query engines, and analytical tools, rather than as a standalone database server.^[1]^[2] The project originated in 2017, was donated to the Apache Arrow project in 2019, and became a top-level project of the Apache Software Foundation in 2024.^[3]^[4]

History
Features
Comparison with related systems
Adoption and reception
Language support
Ecosystem projects
References
External links

History

DataFusion was originally authored by Andy Grove, starting in 2017. It was donated to the Apache Arrow project in February 2019.^[3] In 2024, a paper describing DataFusion was accepted to the industry track of the ACM SIGMOD conference.^[5]^[1] In April 2024, the project graduated from Apache Arrow and became a top-level Apache project.^[4]

Features

DataFusion is a fast, extensible query engine for building data systems. It provides a SQL interface and a DataFrame API for constructing queries programmatically, a query planner and rule-based optimizer, and a multithreaded vectorized execution engine that processes data in columnar batches rather than row by row.^[1]^[2]
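The difference between row-at-a-time and batch-at-a-time (vectorized) evaluation can be illustrated with a minimal sketch. The code below is not DataFusion code and does not use its API: the names ColumnBatch, to_batches, filter_sum_rows, and filter_sum_batches are hypothetical, and plain Python lists stand in for Arrow arrays. It only shows the shape of the two approaches, in which a filter and an aggregate are applied either per row or per column batch.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class ColumnBatch:
    """A hypothetical batch: one list per column, all the same length."""
    price: List[float]
    qty: List[int]


def filter_sum_rows(rows: Iterable[Tuple[float, int]], min_qty: int) -> float:
    """Row-at-a-time: evaluate the predicate and the sum one row at a time."""
    total = 0.0
    for price, qty in rows:
        if qty >= min_qty:
            total += price * qty
    return total


def to_batches(rows: List[Tuple[float, int]], batch_size: int) -> Iterator[ColumnBatch]:
    """Split rows into column-oriented batches of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield ColumnBatch(price=[r[0] for r in chunk], qty=[r[1] for r in chunk])


def filter_sum_batches(batches: Iterable[ColumnBatch], min_qty: int) -> float:
    """Batch-at-a-time: each step runs over a whole column of the batch."""
    total = 0.0
    for batch in batches:
        # Step 1: evaluate the predicate over the entire qty column.
        mask = [q >= min_qty for q in batch.qty]
        # Step 2: multiply and sum only the positions the mask selected.
        total += sum(p * q for p, q, keep in zip(batch.price, batch.qty, mask) if keep)
    return total


rows = [(2.0, 1), (3.0, 4), (1.5, 10), (4.0, 2), (0.5, 8)]
row_result = filter_sum_rows(rows, min_qty=4)
batch_result = filter_sum_batches(to_batches(rows, batch_size=2), min_qty=4)
```

Both functions return the same total. In a real vectorized engine, the per-column steps run as tight loops over contiguous typed arrays, which is where the performance benefit comes from; this sketch does not reproduce that benefit.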

The engine reads common analytical file formats natively, including Apache Parquet, CSV, JSON, Avro, and Arrow IPC, and uses Apache Arrow's columnar memory format throughout execution, avoiding serialization overhead between stages.^[1]

DataFusion is designed for in-process embedding: it runs within the host application's process rather than as a separate server, using threads for parallel query execution. Its extension points allow downstream systems to add user-defined functions, custom data sources, custom query languages, and new optimizer rules, enabling developers to build specialized database systems on top of DataFusion's planning and execution components without reimplementing them.^[1]^[2]
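The idea of an extension point for user-defined functions can be sketched as a registry that the host application populates before running queries. This is an illustrative assumption and not DataFusion's actual interface: FunctionRegistry, register_udf, and call are hypothetical names, and the functions operate on plain Python lists where an engine would use Arrow arrays.

```python
from typing import Callable, Dict, List

# A scalar function here maps one input column to one output column.
ColumnFn = Callable[[List[float]], List[float]]


class FunctionRegistry:
    """Hypothetical stand-in for an engine's table of callable functions."""

    def __init__(self) -> None:
        # One built-in function, so the registry is not empty at start.
        self._functions: Dict[str, ColumnFn] = {
            "abs": lambda col: [abs(v) for v in col],
        }

    def register_udf(self, name: str, fn: ColumnFn) -> None:
        """Add a user-defined function; reject names that already exist."""
        if name in self._functions:
            raise ValueError(f"function {name!r} is already registered")
        self._functions[name] = fn

    def call(self, name: str, column: List[float]) -> List[float]:
        """Look up a function by name and apply it to a whole column."""
        return self._functions[name](column)


registry = FunctionRegistry()
registry.register_udf(
    "celsius_to_fahrenheit",
    lambda col: [v * 9 / 5 + 32 for v in col],
)
converted = registry.call("celsius_to_fahrenheit", [0.0, 100.0])
```

Once registered, the host-supplied function is looked up by name in the same way as a built-in one, which is the property that lets an embedding application extend the engine without modifying it.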

Comparison with related systems

DataFusion is frequently compared with other columnar analytical systems including DuckDB, Polars, and Velox, but these systems differ significantly in scope and intended use.^[6]

DuckDB

DuckDB is an in-process OLAP database for direct use by end users, with its own storage format and catalog.^[7] DataFusion is a library for building such systems, providing query planning and execution components that other software can embed without a bundled persistent storage format.^[8]

Polars

Polars is also written in Rust and uses the Apache Arrow memory model, but is designed as a self-contained DataFrame library for data manipulation rather than an embeddable query engine for building other systems.^[9]^[10]

Apache Spark

Apache Spark is a distributed analytics framework for processing data at cluster scale.^[11] DataFusion executes queries within a single process and is aimed at building embedded analytics systems rather than distributed workloads.^[1] Apache projects that use DataFusion to accelerate Spark include Apache DataFusion Comet, a native execution plugin for Spark's JVM-based SQL execution engine,^[12] and Apache Auron, a Spark accelerator that combines the Apache Arrow-DataFusion library with the Spark distributed computing framework.^[13]

Velox

Velox is an execution engine library developed at Meta.^[14] Unlike DataFusion, Velox does not include a SQL frontend or query planning framework; it takes an already-optimized query plan as input and handles only execution.^[15]

Adoption and reception

DataFusion has been adopted across a range of analytics and database products. Cloudflare used DataFusion in its Log Explorer product to execute SQL queries over log data stored in Cloudflare R2.^[16] Palantir Lightweight Pipelines are powered by DataFusion.^[17]^[18] InfluxDB 3.0 uses DataFusion as part of the FDAP stack: Apache Arrow Flight, DataFusion, Arrow, and Parquet.^[19] Other users described in public sources include EDB Postgres AI,^[20] Cube,^[21] Spice AI,^[22] Pydantic Logfire,^[23] and Kamu.^[24]

In 2024, CRN included Apache DataFusion in its list of "The 10 Coolest Open-Source Software Tools Of 2024".^[25]

Language support

DataFusion itself is written in Rust. The project also has official Python bindings and community-maintained bindings and tooling for other languages and runtimes.^[26]^[27]

Language support

Language / runtime	Project	Notes
Rust	Apache DataFusion	Core implementation
Python	datafusion-python	Official Python bindings
Java	datafusion-java	Community-maintained Java bindings
C	datafusion-c	Community-maintained C bindings
Ruby	datafusion-ruby	Community-maintained Ruby bindings
WebAssembly	datafusion-wasm-bindings	Community-maintained WebAssembly bindings
Browser tooling	datafusion-wasm-playground, datafusion-fiddle	Interactive playgrounds

Ecosystem projects

Several projects in the broader Apache ecosystem and the community-maintained datafusion-contrib organization extend DataFusion's capabilities.^[27]

Apache DataFusion Comet, donated to the Apache Software Foundation by Apple in 2024, is a plugin that uses DataFusion to accelerate Apache Spark workloads as a drop-in replacement for Spark's JVM-based SQL execution engine^[12]
datafusion-federation, which allows DataFusion to resolve queries across remote query engines while pushing down as much compute as possible to the remote source
datafusion-distributed, a library for bringing distributed execution capabilities to DataFusion
datafusion-materialized-views, which provides incremental view maintenance and query rewriting for materialized views in DataFusion
datafusion-table-providers, which provides TableProvider implementations for reading data from external systems such as databases and file formats not natively supported by DataFusion

References

External links

Apache Arrow
