Commit 924de28

Merge branch 'main' into fill-null
2 parents 5a3cd8c + 10600fb commit 924de28

18 files changed

Lines changed: 2192 additions & 404 deletions

Cargo.lock

Lines changed: 371 additions & 229 deletions

Cargo.toml

Lines changed: 12 additions & 12 deletions
@@ -34,25 +34,25 @@ protoc = [ "datafusion-substrait/protoc" ]
 substrait = ["dep:datafusion-substrait"]

 [dependencies]
-tokio = { version = "1.43", features = ["macros", "rt", "rt-multi-thread", "sync"] }
-pyo3 = { version = "0.23", features = ["extension-module", "abi3", "abi3-py39"] }
-pyo3-async-runtimes = { version = "0.23", features = ["tokio-runtime"]}
-arrow = { version = "54.2.1", features = ["pyarrow"] }
-datafusion = { version = "46.0.1", features = ["avro", "unicode_expressions"] }
-datafusion-substrait = { version = "46.0.1", optional = true }
-datafusion-proto = { version = "46.0.1" }
-datafusion-ffi = { version = "46.0.1" }
+tokio = { version = "1.44", features = ["macros", "rt", "rt-multi-thread", "sync"] }
+pyo3 = { version = "0.24", features = ["extension-module", "abi3", "abi3-py39"] }
+pyo3-async-runtimes = { version = "0.24", features = ["tokio-runtime"]}
+arrow = { version = "55.0.0", features = ["pyarrow"] }
+datafusion = { version = "47.0.0", features = ["avro", "unicode_expressions"] }
+datafusion-substrait = { version = "47.0.0", optional = true }
+datafusion-proto = { version = "47.0.0" }
+datafusion-ffi = { version = "47.0.0" }
 prost = "0.13.1" # keep in line with `datafusion-substrait`
-uuid = { version = "1.12", features = ["v4"] }
+uuid = { version = "1.16", features = ["v4"] }
 mimalloc = { version = "0.1", optional = true, default-features = false, features = ["local_dynamic_tls"] }
-async-trait = "0.1.73"
+async-trait = "0.1.88"
 futures = "0.3"
-object_store = { version = "0.11.0", features = ["aws", "gcp", "azure", "http"] }
+object_store = { version = "0.12.0", features = ["aws", "gcp", "azure", "http"] }
 url = "2"

 [build-dependencies]
 prost-types = "0.13.1" # keep in line with `datafusion-substrait`
-pyo3-build-config = "0.23"
+pyo3-build-config = "0.24"

 [lib]
 name = "datafusion_python"

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -72,6 +72,7 @@ Example
    user-guide/introduction
    user-guide/basics
    user-guide/data-sources
+   user-guide/dataframe
    user-guide/common-operations/index
    user-guide/io/index
    user-guide/configuration

docs/source/user-guide/basics.rst

Lines changed: 4 additions & 1 deletion
@@ -21,7 +21,8 @@ Concepts
 ========

 In this section, we will cover a basic example to introduce a few key concepts. We will use the
-2021 Yellow Taxi Trip Records ([download](https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet)), from the [TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).
+2021 Yellow Taxi Trip Records (`download <https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet>`_),
+from the `TLC Trip Record Data <https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page>`_.

 .. ipython:: python

@@ -72,6 +73,8 @@ DataFrames are typically created by calling a method on :py:class:`~datafusion.c
 calling the transformation methods, such as :py:func:`~datafusion.dataframe.DataFrame.filter`, :py:func:`~datafusion.dataframe.DataFrame.select`, :py:func:`~datafusion.dataframe.DataFrame.aggregate`,
 and :py:func:`~datafusion.dataframe.DataFrame.limit` to build up a query definition.

+For more details on working with DataFrames, including visualization options and conversion to other formats, see :doc:`dataframe`.
+
 Expressions
 -----------


Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+DataFrames
+==========
+
+Overview
+--------
+
+DataFusion's DataFrame API provides a powerful interface for building and executing queries against data sources.
+It offers a familiar API similar to pandas and other DataFrame libraries, but with the performance benefits of Rust
+and Arrow.
+
+A DataFrame represents a logical plan that can be composed through operations like filtering, projection, and aggregation.
+The actual execution happens when terminal operations like ``collect()`` or ``show()`` are called.
+
+Basic Usage
+-----------
+
+.. code-block:: python
+
+    import datafusion
+    from datafusion import col, lit
+
+    # Create a context and register a data source
+    ctx = datafusion.SessionContext()
+    ctx.register_csv("my_table", "path/to/data.csv")
+
+    # Create and manipulate a DataFrame
+    df = ctx.sql("SELECT * FROM my_table")
+
+    # Or use the DataFrame API directly
+    df = (ctx.table("my_table")
+          .filter(col("age") > lit(25))
+          .select([col("name"), col("age")]))
+
+    # Execute and collect results
+    result = df.collect()
+
+    # Display the first few rows
+    df.show()
+
+HTML Rendering
+--------------
+
+When working in Jupyter notebooks or other environments that support HTML rendering, DataFrames will
+automatically display as formatted HTML tables, making it easier to visualize your data.
+
+The ``_repr_html_`` method is called automatically by Jupyter to render a DataFrame. This method
+controls how DataFrames appear in notebook environments, providing a richer visualization than
+plain text output.
+
+Customizing HTML Rendering
+--------------------------
+
+You can customize how DataFrames are rendered in HTML by configuring the formatter:
+
+.. code-block:: python
+
+    from datafusion.html_formatter import configure_formatter
+
+    # Change the default styling
+    configure_formatter(
+        max_rows=50,  # Maximum number of rows to display
+        max_width=None,  # Maximum width in pixels (None for auto)
+        theme="light",  # Theme: "light" or "dark"
+        precision=2,  # Floating point precision
+        thousands_separator=",",  # Separator for thousands
+        date_format="%Y-%m-%d",  # Date format
+        truncate_width=20  # Max width for string columns before truncating
+    )
+
+The formatter settings affect all DataFrames displayed after configuration.
+
+Custom Style Providers
+----------------------
+
+For advanced styling needs, you can create a custom style provider:
+
+.. code-block:: python
+
+    from datafusion.html_formatter import StyleProvider, configure_formatter
+
+    class MyStyleProvider(StyleProvider):
+        def get_table_styles(self):
+            return {
+                "table": "border-collapse: collapse; width: 100%;",
+                "th": "background-color: #007bff; color: white; padding: 8px; text-align: left;",
+                "td": "border: 1px solid #ddd; padding: 8px;",
+                "tr:nth-child(even)": "background-color: #f2f2f2;",
+            }
+
+        def get_value_styles(self, dtype, value):
+            """Return custom styles for specific values"""
+            if dtype == "float" and value < 0:
+                return "color: red;"
+            return None
+
+    # Apply the custom style provider
+    configure_formatter(style_provider=MyStyleProvider())
+
+Creating a Custom Formatter
+---------------------------
+
+For complete control over rendering, you can implement a custom formatter:
+
+.. code-block:: python
+
+    from datafusion.html_formatter import Formatter, get_formatter
+
+    class MyFormatter(Formatter):
+        def format_html(self, batches, schema, has_more=False, table_uuid=None):
+            # Create your custom HTML here
+            html = "<div class='my-custom-table'>"
+            # ... formatting logic ...
+            html += "</div>"
+            return html
+
+    # Set as the global formatter
+    configure_formatter(formatter_class=MyFormatter)
+
+    # Or use the formatter just for specific operations
+    formatter = get_formatter()
+    custom_html = formatter.format_html(batches, schema)
+
+Managing Formatters
+-------------------
+
+Reset to default formatting:
+
+.. code-block:: python
+
+    from datafusion.html_formatter import reset_formatter
+
+    # Reset to default settings
+    reset_formatter()
+
+Get the current formatter settings:
+
+.. code-block:: python
+
+    from datafusion.html_formatter import get_formatter
+
+    formatter = get_formatter()
+    print(formatter.max_rows)
+    print(formatter.theme)
+
+Contextual Formatting
+---------------------
+
+You can also use a context manager to temporarily change formatting settings:
+
+.. code-block:: python
+
+    from datafusion.html_formatter import formatting_context
+
+    # Default formatting
+    df.show()
+
+    # Temporarily use different formatting
+    with formatting_context(max_rows=100, theme="dark"):
+        df.show()  # Will use the temporary settings
+
+    # Back to default formatting
+    df.show()

python/datafusion/__init__.py

Lines changed: 4 additions & 10 deletions
@@ -26,6 +26,8 @@
 except ImportError:
     import importlib_metadata

+from datafusion.col import col, column
+
 from . import functions, object_store, substrait, unparser

 # The following imports are okay to remain as opaque to the user.
@@ -45,6 +47,7 @@
     Expr,
     WindowFrame,
 )
+from .html_formatter import configure_formatter
 from .io import read_avro, read_csv, read_json, read_parquet
 from .plan import ExecutionPlan, LogicalPlan
 from .record_batch import RecordBatch, RecordBatchStream
@@ -76,6 +79,7 @@
     "col",
     "column",
     "common",
+    "configure_formatter",
     "expr",
     "functions",
     "lit",
@@ -93,16 +97,6 @@
 ]


-def column(value: str) -> Expr:
-    """Create a column expression."""
-    return Expr.column(value)
-
-
-def col(value: str) -> Expr:
-    """Create a column expression."""
-    return Expr.column(value)
-
-
 def literal(value) -> Expr:
     """Create a literal expression."""
     return Expr.literal(value)

python/datafusion/col.py

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""Col class."""
+
+from datafusion.expr import Expr
+
+
+class Col:
+    """Create a column expression.
+
+    This helper class allows an extra syntax of creating columns using the __getattr__
+    method.
+    """
+
+    def __call__(self, value: str) -> Expr:
+        """Create a column expression."""
+        return Expr.column(value)
+
+    def __getattr__(self, value: str) -> Expr:
+        """Create a column using attribute syntax."""
+        # For autocomplete to work with IPython
+        if value.startswith("__wrapped__"):
+            return getattr(type(self), value)
+
+        return Expr.column(value)
+
+
+col: Col = Col()
+column: Col = Col()
+__all__ = ["col", "column"]
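
Note on ``python/datafusion/col.py``: the new ``Col`` class keeps the existing call syntax (``col("age")``) and adds attribute syntax (``col.age``). The sketch below reproduces the class from the diff against a minimal stand-in for ``Expr`` so that it runs without datafusion installed. The stand-in ``Expr`` and its ``name`` attribute are assumptions for illustration only; the real ``datafusion.expr.Expr`` is not modeled here.

```python
class Expr:
    """Minimal stand-in for datafusion.expr.Expr (hypothetical, illustration only)."""

    def __init__(self, name: str) -> None:
        self.name = name

    @classmethod
    def column(cls, name: str) -> "Expr":
        return cls(name)


class Col:
    """Same shape as the Col class added in this commit."""

    def __call__(self, value: str) -> Expr:
        # Call syntax: col("age")
        return Expr.column(value)

    def __getattr__(self, value: str) -> Expr:
        # Only reached when normal attribute lookup fails.
        if value.startswith("__wrapped__"):
            # The class defines no such attribute, so this lookup raises
            # AttributeError instead of returning a column expression.
            return getattr(type(self), value)
        # Attribute syntax: col.age
        return Expr.column(value)


col = Col()
by_call = col("age")
by_attr = col.age
has_wrapped = hasattr(col, "__wrapped__")
```

Because ``__getattr__`` is consulted only after normal lookup fails, any name that is not a real attribute of ``Col`` becomes a column expression. The ``__wrapped__`` guard is the single exception in the committed code: introspection that probes for that attribute gets an ``AttributeError`` rather than an ``Expr``.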
