add first draft of wikipedia article #21105
Draft: gene-bordegaray wants to merge 6 commits into apache:main from gene-bordegaray:issue-21076-wikipedia-draft
eea53d6 add first draft of article (gene-bordegaray)
b621e2c docs: tighten wikipedia draft review follow-ups (gene-bordegaray)
fa06abe add Auron link (gene-bordegaray)
2a014b2 shorten history more (gene-bordegaray)
1988656 add crates.io downloads and citation (gene-bordegaray)
09d6d47 change wording to be extensible (gene-bordegaray)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,113 @@ | ||
| <!-- | ||
| Draft Wikipedia article. | ||
| --> | ||
| | ||
| {{Short description|Open-source query engine}} | ||
| {{Draft topics|technology|software}} | ||
| {{Infobox software | ||
| | name = Apache DataFusion | ||
| | developer = [[Apache Software Foundation]] | ||
| | programming language = [[Rust (programming language)|Rust]] | ||
| | genre = Query engine | ||
| | license = [[Apache License]] | ||
| | website = {{URL|https://datafusion.apache.org/}} | ||
| }} | ||
| | ||
| '''Apache DataFusion''' is an [[open-source software|open-source]], extensible analytical query engine written in [[Rust (programming language)|Rust]], built on [[Apache Arrow]]'s columnar memory format.<ref name="sigmod-paper">{{cite journal |last1=Lamb |first1=Andrew |last2=Shen |first2=Yijie |last3=Heres |first3=Daniel |last4=Chakraborty |first4=Jayjeet |last5=Kabak |first5=Mehmet Ozan |last6=Hsieh |first6=Liang-Chi |last7=Sun |first7=Chao |title=Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine |journal=Proceedings of the 2024 International Conference on Management of Data |year=2024 |doi=10.1145/3626246.3653368}}</ref><ref name="intro-docs">{{cite web |title=Introduction |url=https://datafusion.apache.org/user-guide/introduction.html |website=Apache DataFusion |publisher=Apache Software Foundation |access-date=2026-03-22}}</ref> It provides [[SQL]] and DataFrame interfaces for analytical query execution and is designed to be used as a library by developers building databases, query engines, and analytical tools, rather than as a standalone database server.<ref name="sigmod-paper" /><ref name="intro-docs" /> The project originated in 2017, was donated to the [[Apache Arrow]] project in 2019, and became a top-level project of the [[Apache Software Foundation]] in 2024.<ref name="donation-post">{{cite web |title=DataFusion: A Rust-native Query Engine for Apache Arrow |url=https://datafusion.apache.org/blog/2019/02/04/datafusion-donation/ |website=Apache DataFusion Blog |publisher=Apache Software Foundation |date=2019-02-04 |access-date=2026-03-22}}</ref><ref name="asf-tlp">{{cite web |title=Apache Software Foundation Announces New Top-Level Project Apache DataFusion |url=https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion |website=The ASF Blog |publisher=Apache Software Foundation |date=2024-06-11 |access-date=2026-03-22}}</ref> As of March 2026, DataFusion had more than one million monthly downloads on crates.io.<ref name="crates-io">{{cite web |title=datafusion |url=https://crates.io/crates/datafusion |website=crates.io |access-date=2026-03-26}}</ref> | ||
| | ||
| == History == | ||
| | ||
| DataFusion was originally authored by Andy Grove, starting in 2017. It was donated to the Apache Arrow project in February 2019.<ref name="donation-post" /> In 2024, a paper describing DataFusion was accepted to the industry track of the [[ACM SIGMOD]] conference.<ref name="sigmod-accepted">{{cite web |title=SIGMOD 2024 Industrial Track: Accepted Papers |url=https://2024.sigmod.org/industrial-list.shtml |website=SIGMOD 2024 |access-date=2026-03-22}}</ref><ref name="sigmod-paper" /> In April 2024, the project graduated from Apache Arrow and became a top-level Apache project.<ref name="asf-tlp" /> | ||
| | ||
| == Features == | ||
| | ||
| DataFusion is a fast, extensible query engine for building data systems. It provides a SQL interface and a DataFrame API for constructing queries programmatically, a [[query plan|query planner]] and rule-based [[query optimization|optimizer]], and a multithreaded vectorized execution engine that processes data in columnar batches rather than row by row.<ref name="sigmod-paper" /><ref name="intro-docs" /> | ||
| | ||
| The engine reads common analytical file formats natively, including [[Apache Parquet]], [[comma-separated values|CSV]], [[JSON]], [[Apache Avro|Avro]], and Arrow IPC, and uses [[Apache Arrow]]'s columnar memory format throughout execution, avoiding [[serialization]] overhead between stages.<ref name="sigmod-paper" /> | ||
| | ||
| DataFusion is designed for in-process embedding: it runs within the host application's process rather than as a separate server, using threads for parallel query execution. Its extension points allow downstream systems to add [[user-defined function|user-defined functions]], custom data sources, custom query languages, and new optimizer rules, enabling developers to build specialized database systems on top of DataFusion's planning and execution components without reimplementing them.<ref name="sigmod-paper" /><ref name="intro-docs" /> | ||
| | ||
| == Comparison with related systems == | ||
| | ||
| DataFusion is frequently compared with other columnar analytical systems, including [[DuckDB]], [[Polars (software)|Polars]], and Velox, but these systems differ significantly in scope and intended use.<ref name="composable-dbms">{{cite journal |last1=Pedreira |first1=Pedro |last2=Erling |first2=Orri |last3=Karanasos |first3=Konstantinos |last4=Schneider |first4=Scott |last5=McKinney |first5=Wes |last6=Valluri |first6=Satya R. |last7=Zait |first7=Mohamed |last8=Nadeau |first8=Jacques |title=The Composable Data Management System Manifesto |journal=Proceedings of the VLDB Endowment |volume=16 |issue=10 |year=2023 |doi=10.14778/3603581.3603604}}</ref> | ||
| | ||
| === DuckDB === | ||
| | ||
| [[DuckDB]] is an in-process [[online analytical processing|OLAP]] database for direct use by end users, with its own storage format and catalog.<ref name="duckdb-official">{{cite web |title=DuckDB |url=https://duckdb.org/ |website=DuckDB |access-date=2026-03-22}}</ref> DataFusion is a library for building such systems, providing query planning and execution components that other software can embed without a bundled persistent storage format.<ref name="bauplan">{{cite web |title=Duck Hunt: Moving Bauplan from DuckDB to DataFusion |url=https://www.bauplanlabs.com/post/duck-hunt-moving-bauplan-from-duckdb-to-datafusion |website=Bauplan |date=2025-11-05 |access-date=2026-03-22}}</ref> | ||
| | ||
| === Polars === | ||
| | ||
| [[Polars (software)|Polars]] is also written in [[Rust (programming language)|Rust]] and uses the [[Apache Arrow]] memory model, but is designed as a self-contained DataFrame library for data manipulation rather than an embeddable query engine for building other systems.<ref name="polars-official">{{cite web |title=Polars |url=https://pola.rs/ |website=Polars |access-date=2026-03-22}}</ref><ref name="faq">{{cite web |title=Frequently Asked Questions |url=https://datafusion.apache.org/user-guide/faq.html |website=Apache DataFusion |publisher=Apache Software Foundation |access-date=2026-03-22}}</ref> | ||
| | ||
| === Apache Spark === | ||
| | ||
| [[Apache Spark]] is a distributed analytics framework for processing data at cluster scale.<ref name="spark-sql">{{cite web |title=Spark SQL & DataFrames |url=https://spark.apache.org/sql/ |website=Apache Spark |access-date=2026-03-22}}</ref> DataFusion executes queries within a single process and is aimed at building embedded analytics systems rather than distributed workloads.<ref name="sigmod-paper" /> Apache projects that use DataFusion to accelerate Spark include Apache DataFusion Comet, a native execution plugin for Spark's [[Java virtual machine|JVM]]-based SQL execution engine,<ref name="comet-donation">{{cite web |title=Announcing Apache Arrow DataFusion Comet |url=https://arrow.apache.org/blog/2024/03/06/comet-donation/ |website=Apache Arrow Blog |publisher=Apache Software Foundation |date=2024-03-06 |access-date=2026-03-22}}</ref> and [https://auron.apache.org/ Apache Auron], a Spark accelerator that combines the Apache Arrow-DataFusion library with the Spark distributed computing framework.<ref name="auron-intro">{{cite web |title=Introduction |url=https://auron.apache.org/introduction.html |website=Apache Auron |publisher=Apache Software Foundation |access-date=2026-03-23}}</ref> | ||
| | ||
| === Velox === | ||
| | ||
| [https://velox-lib.io/ Velox] is an execution engine library developed at [[Meta Platforms|Meta]].<ref name="velox-vldb">{{cite journal |last1=Pedreira |first1=Pedro |last2=Erling |first2=Orri |last3=Basmanova |first3=Masha |last4=Wilfong |first4=Kevin |last5=Sakka |first5=Laith |last6=Pai |first6=Krishna |last7=He |first7=Wei |last8=Chattopadhyay |first8=Biswapesh |title=Velox: Meta's Unified Execution Engine |journal=Proceedings of the VLDB Endowment |volume=15 |issue=12 |year=2022 |doi=10.14778/3554821.3554829}}</ref> Unlike DataFusion, Velox does not include a SQL frontend or query planning framework; it takes an already-optimized query plan as input and handles only execution.<ref name="velox-docs">{{cite web |title=Velox in 10 Minutes |url=https://facebookincubator.github.io/velox/velox-in-10-min.html |website=Velox |access-date=2026-03-22}}</ref> | ||
| | ||
| == Adoption and reception == | ||
| | ||
| DataFusion has been adopted across a range of analytics and database products. [[Cloudflare]] uses DataFusion in its Log Explorer product to execute SQL queries over log data stored in Cloudflare R2.<ref name="cloudflare">{{cite web |title=Cloudflare Log Explorer is now GA, providing native observability and forensics |url=https://blog.cloudflare.com/logexplorer-ga/ |website=The Cloudflare Blog |publisher=Cloudflare |date=2025-06-18 |access-date=2026-03-22}}</ref> [[Palantir Technologies|Palantir]] Lightweight Pipelines are powered by DataFusion.<ref name="palantir-2025">{{cite web |title=Announcements: July 2025 |url=https://www.palantir.com/docs/foundry/announcements/2025-07 |website=Palantir Foundry Documentation |publisher=Palantir Technologies |date=2025-07-29 |access-date=2026-03-22}}</ref><ref name="palantir-2024">{{cite web |title=Announcements: February 2024 |url=https://www.palantir.com/docs/foundry/announcements/2024-02 |website=Palantir Foundry Documentation |publisher=Palantir Technologies |date=February 2024 |access-date=2026-03-22}}</ref> [[InfluxDB]] 3.0 uses DataFusion as part of the FDAP stack: Apache Arrow Flight, DataFusion, Arrow, and Parquet.<ref name="influx-fdap">{{cite web |title=Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0 |url=https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/ |website=InfluxData |date=2023-10-25 |access-date=2026-03-22}}</ref> Other users described in public sources include EDB Postgres AI,<ref name="siliconangle-edb">{{cite web |title=Enterprise DB begins rolling AI features into PostgreSQL |url=https://siliconangle.com/2024/05/23/enterprise-db-begins-rolling-ai-features-postgresql/ |website=SiliconANGLE |date=2024-05-23 |access-date=2026-03-22}}</ref> Cube,<ref name="cube-pushdown">{{cite web |title=Query pushdown in Cube's semantic layer |url=https://cube.dev/blog/query-push-down-in-cubes-semantic-layer |website=Cube |date=2024-06-03 |access-date=2026-03-22}}</ref> Spice AI,<ref name="spice">{{cite web |title=How we use Apache DataFusion at Spice AI |url=https://spice.ai/blog/how-we-use-apache-datafusion-at-spice-ai |website=Spice AI |date=2026-01-17 |access-date=2026-03-22}}</ref> Pydantic Logfire,<ref name="logfire">{{cite web |title=We're changing database |url=https://github.com/pydantic/logfire/issues/408 |website=GitHub |date=2024-08-29 |access-date=2026-03-22}}</ref> and Kamu.<ref name="kamu">{{cite web |title=100X faster ingestion, and FlightSQL support for connecting BI tools |url=https://www.kamu.dev/blog/2023-09-datafusion-flightsql/ |website=Kamu Data |date=2023-09-26 |access-date=2026-03-22}}</ref> | ||
| | ||
| In 2024, ''CRN'' included Apache DataFusion in its list of "The 10 Coolest Open-Source Software Tools Of 2024".<ref name="crn">{{cite web |title=The 10 Coolest Open-Source Software Tools Of 2024 |url=https://www.crn.com/news/software/2024/the-10-coolest-open-source-software-tools-of-2024?page=3 |website=CRN |date=2024-11-21 |access-date=2026-03-22}}</ref> | ||
| | ||
| == Language support == | ||
| | ||
| DataFusion itself is written in Rust. The project also has official Python bindings and community-maintained bindings and tooling for other languages and runtimes.<ref name="readme-related">{{cite web |title=Apache DataFusion |url=https://github.com/apache/datafusion |website=GitHub |publisher=Apache Software Foundation |access-date=2026-03-22}}</ref><ref name="df-contrib-org">{{cite web |title=datafusion-contrib |url=https://github.com/datafusion-contrib |website=GitHub |access-date=2026-03-22}}</ref> | ||
| | ||
| {| class="wikitable" | ||
| |+ Language support | ||
| ! Language / runtime | ||
| ! Project | ||
| ! Notes | ||
| |- | ||
| | [[Rust (programming language)|Rust]] | ||
| | Apache DataFusion | ||
| | Core implementation | ||
| |- | ||
| | [[Python (programming language)|Python]] | ||
| | [https://github.com/apache/datafusion-python datafusion-python] | ||
| | Official Python bindings | ||
| |- | ||
| | [[Java (programming language)|Java]] | ||
| | [https://github.com/datafusion-contrib/datafusion-java datafusion-java] | ||
| | Community-maintained Java bindings | ||
| |- | ||
| | [[C (programming language)|C]] | ||
| | [https://github.com/datafusion-contrib/datafusion-c datafusion-c] | ||
| | Community-maintained C bindings | ||
| |- | ||
| | [[Ruby (programming language)|Ruby]] | ||
| | [https://github.com/datafusion-contrib/datafusion-ruby datafusion-ruby] | ||
| | Community-maintained Ruby bindings | ||
| |- | ||
| | [[WebAssembly]] | ||
| | [https://github.com/datafusion-contrib/datafusion-wasm-bindings datafusion-wasm-bindings] | ||
| | Community-maintained WebAssembly bindings | ||
| |- | ||
| | Browser tooling | ||
| | [https://github.com/datafusion-contrib/datafusion-wasm-playground datafusion-wasm-playground], [https://github.com/datafusion-contrib/datafusion-fiddle datafusion-fiddle] | ||
| | Interactive playgrounds | ||
| |} | ||
| | ||
| == Ecosystem projects == | ||
| | ||
| Several projects in the broader Apache ecosystem and the community-maintained [https://github.com/datafusion-contrib datafusion-contrib] organization extend DataFusion's capabilities.<ref name="df-contrib-org" /> | ||
| | ||
| * [https://github.com/apache/datafusion-comet Apache DataFusion Comet], donated to the Apache Software Foundation by [[Apple Inc.|Apple]] in 2024, is a plugin that uses DataFusion to accelerate [[Apache Spark]] workloads as a drop-in replacement for Spark's JVM-based SQL execution engine<ref name="comet-donation" /> | ||
| * [https://github.com/datafusion-contrib/datafusion-federation datafusion-federation] allows DataFusion to resolve queries across remote query engines while pushing down as much compute as possible to the remote source | ||
| * [https://github.com/datafusion-contrib/datafusion-distributed datafusion-distributed] is a library that adds distributed execution capabilities to DataFusion | ||
| * [https://github.com/datafusion-contrib/datafusion-materialized-views datafusion-materialized-views] provides incremental view maintenance and query rewriting for [[materialized view|materialized views]] in DataFusion | ||
| * [https://github.com/datafusion-contrib/datafusion-table-providers datafusion-table-providers] provides <code>TableProvider</code> implementations for reading data from external systems such as databases and file formats not natively supported by DataFusion | ||
| | ||
| == References == | ||
| | ||
| {{Reflist}} | ||
| | ||
| == External links == | ||
| | ||
| * {{Official website|https://datafusion.apache.org/}} | ||
| * {{GitHub|apache/datafusion}} | ||
| * [https://arrow.apache.org/ Apache Arrow] | ||
I'm biased to want to include a link to rerun but we don't have a blog post calling out DataFusion even though it is all over our repo. Will work on that.
that is the ideal answer!