Skip to content

Latest commit

 

History

History
113 lines (82 loc) · 14.4 KB

File metadata and controls

113 lines (82 loc) · 14.4 KB

Apache DataFusion is an open-source, embeddable analytical query engine written in Rust, built on Apache Arrow's columnar memory format.[1][2] It provides SQL and DataFrame interfaces for analytical query execution and is designed to be used as a library by developers building databases, query engines, and analytical tools, rather than as a standalone database server.[1][2] The project originated in 2017, was donated to the Apache Arrow project in 2019, and became a top-level project of the Apache Software Foundation in 2024.[3][4]

Table of Contents

History

DataFusion originally authored by Andy Grove starting in 2017. It was donated to the Apache Arrow Project in February 2019.[3] In 2024, a paper describing DataFusion was accepted to the industry track of the ACM SIGMOD conference.[5][1] In April 2024, the project graduated from Apache Arrow and became a top-level Apache project.[4]

Features

DataFusion is a fast, extensible query engine for building data systems. It provides a SQL interface and a DataFrame API for constructing queries programmatically, a query planner and rule-based optimizer, and a multithreaded vectorized execution engine that processes data in columnar batches rather than row by row.[1][2]

The engine reads common analytical file formats natively, including Apache Parquet, CSV, JSON, Avro, and Arrow IPC, and uses Apache Arrow's columnar memory format throughout execution, avoiding serialization overhead between stages.[1]

DataFusion is designed for in-process embedding: it runs within the host application's process rather than as a separate server, using threads for parallel query execution. Its extension points allow downstream systems to add user-defined functions, custom data sources, custom query languages, and new optimizer rules, enabling developers to build specialized database systems on top of DataFusion's planning and execution components without reimplementing them.[1][2]

Comparison with related systems

DataFusion is frequently compared with other columnar analytical systems including DuckDB, Polars, and Velox, but these systems differ significantly in scope and intended use.[6]

DuckDB is an in-process OLAP database for direct use by end users, with its own storage format and catalog.[7] DataFusion is a library for building such systems, providing query planning and execution components that other software can embed without a bundled persistent storage format.[8]

Polars is also written in Rust and uses the Apache Arrow memory model, but is designed as a self-contained DataFrame library for data manipulation rather than an embeddable query engine for building other systems.[9][10]

Apache Spark is a distributed analytics framework for processing data at cluster scale.[11] DataFusion executes queries within a single process and is aimed at building embedded analytics systems rather than distributed workloads.[1] Apache projects that use DataFusion to accelerate Spark include Apache DataFusion Comet, a native execution plugin for Spark's JVM-based SQL execution engine,[12] and Apache Auron, a Spark accelerator that combines the Apache Arrow-DataFusion library with the Spark distributed computing framework.[13]

Velox

Velox is an execution engine library developed at Meta.[14] Unlike DataFusion, Velox does not include a SQL frontend or query planning framework; it takes an already-optimized query plan as input and handles only execution.[15]

Adoption and reception

DataFusion has been adopted across a range of analytics and database products. Cloudflare used DataFusion in its Log Explorer product to execute SQL queries over log data stored in Cloudflare R2.[16] Palantir Lightweight Pipelines are powered by DataFusion.[17][18] InfluxDB 3.0 uses DataFusion as part of the FDAP stack: Apache Flight, DataFusion, Arrow, and Parquet.[19] Other users described in public sources include EDB Postgres AI,[20] Cube,[21] Spice AI,[22] Pydantic Logfire,[23] and Kamu.[24]

In 2024, CRN included Apache DataFusion in its list of "The 10 Coolest Open-Source Software Tools Of 2024".[25]

Language support

DataFusion itself is written in Rust. The project also has official Python bindings and community-maintained bindings and tooling for other languages and runtimes.[26][27]

Language support
Language / runtime Project Notes
Rust Apache DataFusion Core implementation
Python datafusion-python Official Python bindings
Java datafusion-java Community-maintained Java bindings
C datafusion-c Community-maintained C bindings
Ruby datafusion-ruby Community-maintained Ruby bindings
WebAssembly datafusion-wasm-bindings Community-maintained WebAssembly bindings
Browser tooling datafusion-wasm-playground, datafusion-fiddle Interactive playgrounds

Ecosystem projects

Several projects in the broader Apache ecosystem and the community-maintained datafusion-contrib organization extend DataFusion's capabilities.[27]

  • Apache DataFusion Comet, donated to the Apache Software Foundation by Apple in 2024, is a plugin that uses DataFusion to accelerate Apache Spark workloads as a drop-in replacement for Spark's JVM-based SQL execution engine[12]
  • datafusion-federation, which allows DataFusion to resolve queries across remote query engines while pushing down as much compute as possible to the remote source
  • datafusion-distributed, a library for bringing distributed execution capabilities to DataFusion
  • datafusion-materialized-views, which provides incremental view maintenance and query rewriting for materialized views in DataFusion
  • datafusion-table-providers, which provides TableProvider implementations for reading data from external systems such as databases and file formats not natively supported by DataFusion

References

External links

  • Apache Arrow