add first draft of wikipedia article #21105

Draft
gene-bordegaray wants to merge 6 commits into apache:main from gene-bordegaray:issue-21076-wikipedia-draft

Conversation

@gene-bordegaray
Contributor

@gene-bordegaray gene-bordegaray commented Mar 23, 2026

Which issue does this PR close?

  1. Go to this page
  2. Click Edit source
  3. Paste dev/wiki/apache-datafusion.wikitext
  4. Click Show preview

@github-actions github-actions bot added the development-process Related to development process of DataFusion label Mar 23, 2026
Contributor

@alamb alamb left a comment

Thank you @gene-bordegaray -- this looks great. I left some suggestions on how to make some of this language tighter.

Maybe we can wait a few days more and then submit to the wikipedia editors 🤔

Comment thread dev/wiki/apache-datafusion.wikitext Outdated
Comment thread dev/wiki/apache-datafusion.wikitext Outdated
Comment thread dev/wiki/apache-datafusion.wikitext Outdated
@alamb alamb changed the title add first draft of article add first draft of wikipedia article Mar 23, 2026
@gene-bordegaray
Contributor Author

Also, a side note: I wanted to add the DF logo, but my account needs to be verified (I think it will be in a day or two) 😅

Comment thread dev/wiki/apache-datafusion.wikitext Outdated
| website = {{URL|https://datafusion.apache.org/}}
}}

'''Apache DataFusion''' is an [[open-source software|open-source]], embeddable analytical query engine written in [[Rust (programming language)|Rust]], built on [[Apache Arrow]]'s columnar memory format.<ref name="sigmod-paper">{{cite journal |last1=Lamb |first1=Andrew |last2=Shen |first2=Yijie |last3=Heres |first3=Daniel |last4=Chakraborty |first4=Jayjeet |last5=Kabak |first5=Mehmet Ozan |last6=Hsieh |first6=Liang-Chi |last7=Sun |first7=Chao |title=Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine |journal=Proceedings of the 2024 International Conference on Management of Data |year=2024 |doi=10.1145/3626246.3653368}}</ref><ref name="intro-docs">{{cite web |title=Introduction |url=https://datafusion.apache.org/user-guide/introduction.html |website=Apache DataFusion |publisher=Apache Software Foundation |access-date=2026-03-22}}</ref> It provides [[SQL]] and DataFrame interfaces for analytical query execution and is designed to be used as a library by developers building databases, query engines, and analytical tools, rather than as a standalone database server.<ref name="sigmod-paper" /><ref name="intro-docs" /> The project originated in 2017, was donated to the [[Apache Arrow]] project in 2019, and became a top-level project of the [[Apache Software Foundation]] in 2024.<ref name="donation-post">{{cite web |title=DataFusion: A Rust-native Query Engine for Apache Arrow |url=https://datafusion.apache.org/blog/2019/02/04/datafusion-donation/ |website=Apache DataFusion Blog |publisher=Apache Software Foundation |date=2019-02-04 |access-date=2026-03-22}}</ref><ref name="asf-tlp">{{cite web |title=Apache Software Foundation Announces New Top-Level Project Apache DataFusion |url=https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion |website=The ASF Blog |publisher=Apache Software Foundation |date=2024-06-11 |access-date=2026-03-22}}</ref>

It provides [[SQL]] and DataFrame interfaces for analytical query execution and is designed to be used as a library by developers building databases, query engines, and analytical tools, rather than as a standalone database server.

I think we can make this a bit better at introducing DataFusion and what makes it unique. Here's what I think:

Often described as the "LLVM for Databases," [Source 1] Apache DataFusion is a modular, Arrow-native query engine library designed for embedding into custom systems rather than operating as a monolithic standalone server [Source 2 and 3]. This high-performance Rust framework provides a composable foundation, allowing developers to precisely extend query planning and vectorized execution to meet unique architectural requirements. [Source 2 and 3]

Source 1: https://midas.bu.edu/assets/slides/andrew_lamb_slides.pdf (cc @alamb)

Sources 2 and 3 (these are the first two references): {{cite journal |last1=Lamb |first1=Andrew |last2=Shen |first2=Yijie |last3=Heres |first3=Daniel |last4=Chakraborty |first4=Jayjeet |last5=Kabak |first5=Mehmet Ozan |last6=Hsieh |first6=Liang-Chi |last7=Sun |first7=Chao |title=Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine |journal=Proceedings of the 2024 International Conference on Management of Data |year=2024 |doi=10.1145/3626246.3653368}} {{cite web |title=Introduction |url=https://datafusion.apache.org/user-guide/introduction.html |website=Apache DataFusion |publisher=Apache Software Foundation |access-date=2026-03-22}}

Contributor Author

I don't know if we should add the "LLVM for databases" description, mostly because its primary source is not the strongest (a slide show) and it doesn't appear in the other sources, such as the SIGMOD paper or other coverage.

I was reviewing the Wikipedia guidelines, and they advise against anything promotional unless it is well cited, so this may get flagged.

https://en.wikipedia.org/wiki/Wikipedia:Verifiability

Comment thread dev/wiki/apache-datafusion.wikitext Outdated
| website = {{URL|https://datafusion.apache.org/}}
}}

'''Apache DataFusion''' is an [[open-source software|open-source]], embeddable analytical query engine written in [[Rust (programming language)|Rust]], built on [[Apache Arrow]]'s columnar memory format.<ref name="sigmod-paper">{{cite journal |last1=Lamb |first1=Andrew |last2=Shen |first2=Yijie |last3=Heres |first3=Daniel |last4=Chakraborty |first4=Jayjeet |last5=Kabak |first5=Mehmet Ozan |last6=Hsieh |first6=Liang-Chi |last7=Sun |first7=Chao |title=Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine |journal=Proceedings of the 2024 International Conference on Management of Data |year=2024 |doi=10.1145/3626246.3653368}}</ref><ref name="intro-docs">{{cite web |title=Introduction |url=https://datafusion.apache.org/user-guide/introduction.html |website=Apache DataFusion |publisher=Apache Software Foundation |access-date=2026-03-22}}</ref> It provides [[SQL]] and DataFrame interfaces for analytical query execution and is designed to be used as a library by developers building databases, query engines, and analytical tools, rather than as a standalone database server.<ref name="sigmod-paper" /><ref name="intro-docs" /> The project originated in 2017, was donated to the [[Apache Arrow]] project in 2019, and became a top-level project of the [[Apache Software Foundation]] in 2024.<ref name="donation-post">{{cite web |title=DataFusion: A Rust-native Query Engine for Apache Arrow |url=https://datafusion.apache.org/blog/2019/02/04/datafusion-donation/ |website=Apache DataFusion Blog |publisher=Apache Software Foundation |date=2019-02-04 |access-date=2026-03-22}}</ref><ref name="asf-tlp">{{cite web |title=Apache Software Foundation Announces New Top-Level Project Apache DataFusion |url=https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion |website=The ASF Blog |publisher=Apache Software Foundation |date=2024-06-11 |access-date=2026-03-22}}</ref>

I believe the project will continue to grow, so we can write at the end:

Apache DataFusion now sees over one million monthly downloads. [cite crates.io source]

Source: https://crates.io/search?q=datafusion

Contributor

We could also say "as of March 2026, DataFusion saw one million monthly downloads" if we wanted to ensure the statement remained accurate.

Contributor Author

Ya I think this is great, definitely with the third party source 👍

Contributor

@ntjohnson1 ntjohnson1 left a comment

I got sick so fell off looking at this. I think this looks great for a first pass and we should push to wikipedia to see what the reviewers say. One note that I don't know if I have time for is that this seems to slightly over emphasize the extensibility perspective.

On a quick read through I would assume this was only for building the infrastructure and could easily miss the SQL/dataframe API bits. At rerun I use datafusion (specifically datafusion-python) quite heavily but don't really know the details about our table provider (since other people build that bit). I suspect our customers will also hit this page since we generate examples for the DataFrame API in python (and are generating more SQL examples). https://rerun.io/docs/howto/query-and-transform/dataframe_operations

Mostly just food for thought that there might be two distinct audiences interested in this page. People who build on datafusion and those who build data products using datafusion top level APIs. (I still think landing the page first makes sense then I or someone else can potentially try to add a section for more SQL/DataFrame API details)

Comment thread dev/wiki/apache-datafusion.wikitext Outdated
| website = {{URL|https://datafusion.apache.org/}}
}}

'''Apache DataFusion''' is an [[open-source software|open-source]], embeddable analytical query engine written in [[Rust (programming language)|Rust]], built on [[Apache Arrow]]'s columnar memory format.<ref name="sigmod-paper">{{cite journal |last1=Lamb |first1=Andrew |last2=Shen |first2=Yijie |last3=Heres |first3=Daniel |last4=Chakraborty |first4=Jayjeet |last5=Kabak |first5=Mehmet Ozan |last6=Hsieh |first6=Liang-Chi |last7=Sun |first7=Chao |title=Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine |journal=Proceedings of the 2024 International Conference on Management of Data |year=2024 |doi=10.1145/3626246.3653368}}</ref><ref name="intro-docs">{{cite web |title=Introduction |url=https://datafusion.apache.org/user-guide/introduction.html |website=Apache DataFusion |publisher=Apache Software Foundation |access-date=2026-03-22}}</ref> It provides [[SQL]] and DataFrame interfaces for analytical query execution and is designed to be used as a library by developers building databases, query engines, and analytical tools, rather than as a standalone database server.<ref name="sigmod-paper" /><ref name="intro-docs" /> The project originated in 2017, was donated to the [[Apache Arrow]] project in 2019, and became a top-level project of the [[Apache Software Foundation]] in 2024.<ref name="donation-post">{{cite web |title=DataFusion: A Rust-native Query Engine for Apache Arrow |url=https://datafusion.apache.org/blog/2019/02/04/datafusion-donation/ |website=Apache DataFusion Blog |publisher=Apache Software Foundation |date=2019-02-04 |access-date=2026-03-22}}</ref><ref name="asf-tlp">{{cite web |title=Apache Software Foundation Announces New Top-Level Project Apache DataFusion |url=https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion |website=The ASF Blog |publisher=Apache Software Foundation |date=2024-06-11 |access-date=2026-03-22}}</ref> As of March 2026, DataFusion exceeded one 
million monthly downloads on crates.io.<ref name="crates-io">{{cite web |title=datafusion |url=https://crates.io/crates/datafusion |website=crates.io |access-date=2026-03-26}}</ref>
Contributor

There isn't a formal page on Dataframes but there is a stub that refers to Spark, pandas, etc. After this page lands we should add a pointer to it from there. https://en.wikipedia.org/wiki/Dataframe

Comment thread dev/wiki/apache-datafusion.wikitext Outdated
| website = {{URL|https://datafusion.apache.org/}}
}}

'''Apache DataFusion''' is an [[open-source software|open-source]], embeddable analytical query engine written in [[Rust (programming language)|Rust]], built on [[Apache Arrow]]'s columnar memory format.<ref name="sigmod-paper">{{cite journal |last1=Lamb |first1=Andrew |last2=Shen |first2=Yijie |last3=Heres |first3=Daniel |last4=Chakraborty |first4=Jayjeet |last5=Kabak |first5=Mehmet Ozan |last6=Hsieh |first6=Liang-Chi |last7=Sun |first7=Chao |title=Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine |journal=Proceedings of the 2024 International Conference on Management of Data |year=2024 |doi=10.1145/3626246.3653368}}</ref><ref name="intro-docs">{{cite web |title=Introduction |url=https://datafusion.apache.org/user-guide/introduction.html |website=Apache DataFusion |publisher=Apache Software Foundation |access-date=2026-03-22}}</ref> It provides [[SQL]] and DataFrame interfaces for analytical query execution and is designed to be used as a library by developers building databases, query engines, and analytical tools, rather than as a standalone database server.<ref name="sigmod-paper" /><ref name="intro-docs" /> The project originated in 2017, was donated to the [[Apache Arrow]] project in 2019, and became a top-level project of the [[Apache Software Foundation]] in 2024.<ref name="donation-post">{{cite web |title=DataFusion: A Rust-native Query Engine for Apache Arrow |url=https://datafusion.apache.org/blog/2019/02/04/datafusion-donation/ |website=Apache DataFusion Blog |publisher=Apache Software Foundation |date=2019-02-04 |access-date=2026-03-22}}</ref><ref name="asf-tlp">{{cite web |title=Apache Software Foundation Announces New Top-Level Project Apache DataFusion |url=https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion |website=The ASF Blog |publisher=Apache Software Foundation |date=2024-06-11 |access-date=2026-03-22}}</ref> As of March 2026, DataFusion exceeded one 
million monthly downloads on crates.io.<ref name="crates-io">{{cite web |title=datafusion |url=https://crates.io/crates/datafusion |website=crates.io |access-date=2026-03-26}}</ref>
Contributor

NIT: I think "extensible analytical query engine" is clearer than "embeddable analytical query engine". "Extensible" is what is listed on the landing page for DataFusion on apache.org.


== Adoption and reception ==

DataFusion has been adopted across a range of analytics and database products. [[Cloudflare]] used DataFusion in its Log Explorer product to execute SQL queries over log data stored in Cloudflare R2.<ref name="cloudflare">{{cite web |title=Cloudflare Log Explorer is now GA, providing native observability and forensics |url=https://blog.cloudflare.com/logexplorer-ga/ |website=The Cloudflare Blog |publisher=Cloudflare |date=2025-06-18 |access-date=2026-03-22}}</ref> [[Palantir Technologies|Palantir]] Lightweight Pipelines are powered by DataFusion.<ref name="palantir-2025">{{cite web |title=Announcements: July 2025 |url=https://www.palantir.com/docs/foundry/announcements/2025-07 |website=Palantir Foundry Documentation |publisher=Palantir Technologies |date=2025-07-29 |access-date=2026-03-22}}</ref><ref name="palantir-2024">{{cite web |title=Announcements: February 2024 |url=https://www.palantir.com/docs/foundry/announcements/2024-02 |website=Palantir Foundry Documentation |publisher=Palantir Technologies |date=February 2024 |access-date=2026-03-22}}</ref> [[InfluxDB]] 3.0 uses DataFusion as part of the FDAP stack: Apache Flight, DataFusion, Arrow, and Parquet.<ref name="influx-fdap">{{cite web |title=Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0 |url=https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/ |website=InfluxData |date=2023-10-25 |access-date=2026-03-22}}</ref> Other users described in public sources include EDB Postgres AI,<ref name="siliconangle-edb">{{cite web |title=Enterprise DB begins rolling AI features into PostgreSQL |url=https://siliconangle.com/2024/05/23/enterprise-db-begins-rolling-ai-features-postgresql/ |website=SiliconANGLE |date=2024-05-23 |access-date=2026-03-22}}</ref> Cube,<ref name="cube-pushdown">{{cite web |title=Query pushdown in Cube's semantic layer |url=https://cube.dev/blog/query-push-down-in-cubes-semantic-layer |website=Cube |date=2024-06-03 
|access-date=2026-03-22}}</ref> Spice AI,<ref name="spice">{{cite web |title=How we use Apache DataFusion at Spice AI |url=https://spice.ai/blog/how-we-use-apache-datafusion-at-spice-ai |website=Spice AI |date=2026-01-17 |access-date=2026-03-22}}</ref> Pydantic Logfire,<ref name="logfire">{{cite web |title=We're changing database |url=https://github.com/pydantic/logfire/issues/408 |website=GitHub |date=2024-08-29 |access-date=2026-03-22}}</ref> and Kamu.<ref name="kamu">{{cite web |title=100X faster ingestion, and FlightSQL support for connecting BI tools |url=https://www.kamu.dev/blog/2023-09-datafusion-flightsql/ |website=Kamu Data |date=2023-09-26 |access-date=2026-03-22}}</ref>
Contributor

I'm biased to want to include a link to rerun but we don't have a blog post calling out DataFusion even though it is all over our repo. Will work on that.

Contributor

Will work on that.

that is the ideal answer!

@AndreaBozzo
Contributor

This is nice indeed; the expansions on SQL and DataFrames can be added later and would be very useful.

@alamb
Contributor

alamb commented Mar 27, 2026

@gene-bordegaray should we move this to the wikipedia site now (after addressing other comments)?

@gene-bordegaray
Contributor Author

@gene-bordegaray should we move this to the wikipedia site now (after addressing other comments)?

Ya, I can address comments tonight and tomorrow, and take some time this weekend to move it to the Wikipedia site 👍

@gene-bordegaray
Contributor Author

I got sick so fell off looking at this. I think this looks great for a first pass and we should push to wikipedia to see what the reviewers say. One note that I don't know if I have time for is that this seems to slightly over emphasize the extensibility perspective.

On a quick read through I would assume this was only for building the infrastructure and could easily miss the SQL/dataframe API bits. At rerun I use datafusion (specifically datafusion-python) quite heavily but don't really know the details about our table provider (since other people build that bit). I suspect our customers will also hit this page since we generate examples for the DataFrame API in python (and are generating more SQL examples). https://rerun.io/docs/howto/query-and-transform/dataframe_operations

Mostly just food for thought that there might be two distinct audiences interested in this page. People who build on datafusion and those who build data products using datafusion top level APIs. (I still think landing the page first makes sense then I or someone else can potentially try to add a section for more SQL/DataFrame API details)

Ya, I definitely think it leans toward the infrastructure side of things as it stands (this is what I used DF for, so guilty of that 😅). I agree with getting something up first; then someone with more experience using DF for the DataFrame / SQL aspects can step in and add what they see fit.

@gene-bordegaray
Contributor Author

gene-bordegaray commented Mar 29, 2026

I have changed the wording to "extensible" and left the other comments as is, to be addressed after we are first published / reviewed. I will leave this up for a day or so and then submit for review.

Lmk if this is alright with everyone 😄

cc: @alamb @AndreaBozzo @ntjohnson1 @NNhanptnk

@gene-bordegaray
Contributor Author

gene-bordegaray commented Mar 29, 2026

Also, if anyone has a confirmed account and sees this, I would greatly appreciate it if you could upload the DataFusion logo to Wikipedia, as I am not allowed to:

DF_original_light

or whichever you think is appropriate 👍

@ntjohnson1
Contributor

I have changed the wording to "extensible" and left the other comments as is, to be addressed after we are first published / reviewed. I will leave this up for a day or so and then submit for review.

Lmk if this is alright with everyone 😄

cc: @alamb @AndreaBozzo @ntjohnson1 @NNhanptnk

Works for me! I'd vote ship it now and we/anyone else can edit further after it goes live. Thanks for pushing this forward.

@gene-bordegaray
Contributor Author

gene-bordegaray commented Mar 30, 2026

went ahead and submitted a draft for review: https://en.wikipedia.org/wiki/Draft:Apache_DataFusion

I can keep an eye on this

@alamb
Contributor

alamb commented Apr 8, 2026

went ahead and submitted a draft for review: https://en.wikipedia.org/wiki/Draft:Apache_DataFusion

I can keep an eye on this

4k pending reviews. And I thought the DF review queue was bad 😆

Screenshot 2026-04-08 at 3 27 11 PM

Labels

development-process Related to development process of DataFusion

Development

Successfully merging this pull request may close these issues.

Write a wikipedia article for Apache DataFusion

5 participants