Commit 6abfc52

Enhance documentation for DataFrame streaming and memory management in index.rst and arrow.rst

1 parent: f2242ce

2 files changed: 53 additions & 16 deletions

docs/source/user-guide/dataframe/index.rst (28 additions & 9 deletions)
@@ -25,8 +25,10 @@ The ``DataFrame`` class is the core abstraction in DataFusion that represents ta
 on that data. DataFrames provide a flexible API for transforming data through various operations such as
 filtering, projection, aggregation, joining, and more.

-A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when
-terminal operations like ``collect()``, ``show()``, or ``to_pandas()`` are called.
+A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when
+terminal operations like ``collect()``, ``show()``, or ``to_pandas()`` are called. ``collect()`` loads
+all record batches into Python memory; for large results you may want to stream data instead using
+``execute_stream()`` or ``__arrow_c_stream__()``.

 Creating DataFrames
 -------------------
@@ -128,27 +130,44 @@ DataFusion's DataFrame API offers a wide range of operations:

 Terminal Operations
 -------------------
-
-To materialize the results of your DataFrame operations:
+``collect()`` materializes every record batch in Python. While convenient, this
+eagerly loads the full result set into memory and can overwhelm the Python
+process for large queries. Alternatives that stream data from Rust avoid this
+memory growth:

 .. code-block:: python

-    # Collect all data as PyArrow RecordBatches
+    # Collect all data as PyArrow RecordBatches (loads entire result set)
     result_batches = df.collect()
-
-    # Convert to various formats
+
+    # Stream batches using the native API
+    stream = df.execute_stream()
+    for batch in stream:
+        ...  # process each RecordBatch
+
+    # Stream via the Arrow C Data Interface (requires pyarrow >= 14)
+    import pyarrow as pa
+    reader = pa.RecordBatchReader.from_stream(df)
+    for batch in reader:
+        ...
+
+    # Convert to various formats (also load all data into memory)
     pandas_df = df.to_pandas()         # Pandas DataFrame
     polars_df = df.to_polars()         # Polars DataFrame
     arrow_table = df.to_arrow_table()  # PyArrow Table
     py_dict = df.to_pydict()           # Python dictionary
     py_list = df.to_pylist()           # Python list of dictionaries
-
+
     # Display results
     df.show()  # Print tabular format to console
-
+
     # Count rows
     count = df.count()

+For large outputs, prefer engine-level writers such as ``df.write_parquet()``
+or other DataFusion writers. These stream data directly to the destination and
+avoid buffering the entire dataset in Python.
+
 HTML Rendering
 --------------

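As a minimal, self-contained sketch of the two patterns this hunk recommends (stream with ``execute_stream()``, or let the engine write the output), assuming a local ``datafusion`` install; the input and output paths are hypothetical:

    # Sketch: stream or write from the engine instead of collect()-ing
    # everything into Python. File paths below are hypothetical.
    from datafusion import SessionContext

    ctx = SessionContext()
    df = ctx.read_parquet("example.parquet")  # lazy: only builds a plan

    # Streaming keeps one RecordBatch resident in Python at a time.
    total_rows = 0
    for batch in df.execute_stream():
        total_rows += batch.to_pyarrow().num_rows
    print(total_rows)

    # Engine-level writer: batches flow from Rust straight to disk.
    df.write_parquet("result.parquet")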
docs/source/user-guide/io/arrow.rst (25 additions & 7 deletions)
@@ -57,16 +57,34 @@ and returns a ``StructArray``. Common pyarrow sources you can use are:
 Exporting from DataFusion
 -------------------------

-DataFusion DataFrames implement ``__arrow_c_stream__`` PyCapsule interface, so any
-Python library that accepts these can import a DataFusion DataFrame directly.
-The exported stream yields record batches lazily using DataFusion's
-``execute_stream`` mechanism, allowing consumers to process results incrementally
-without buffering the entire dataset in memory. This streaming behavior helps
-avoid out-of-memory failures when working with large queries.
+DataFusion DataFrames implement ``__arrow_c_stream__`` so any Python library
+that accepts this interface can import a DataFusion ``DataFrame`` directly.

+``collect()`` or ``pa.table(df)`` will materialize every record batch in
+Python. For large results this can quickly exhaust memory. Instead, stream the
+output incrementally:
+
+.. ipython:: python
+
+    # Stream batches with DataFusion's native API
+    stream = df.execute_stream()
+    for batch in stream:
+        ...  # process each RecordBatch as it arrives
+
+.. ipython:: python
+
+    # Expose a C stream that PyArrow can consume lazily (pyarrow >= 14)
+    import pyarrow as pa
+    reader = pa.RecordBatchReader.from_stream(df)
+    for batch in reader:
+        ...  # process each batch without buffering the entire table
+
+If the goal is simply to persist results, prefer engine-level writers such as
+``df.write_parquet()``. These writers stream data from Rust directly to the
+destination and avoid Python-side memory growth.

 .. ipython:: python

     df = df.select((col("a") * lit(1.5)).alias("c"), lit("df").alias("d"))
-    pa.table(df)
+    pa.table(df)  # loads all batches into memory

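To illustrate the consumer side of ``__arrow_c_stream__`` described in this hunk, a small sketch assuming PyArrow 14 or newer (for ``RecordBatchReader.from_stream``); the inline data is made up for the example:

    # Sketch: consume a DataFusion DataFrame lazily over the Arrow C
    # stream (PyCapsule) interface. Requires pyarrow >= 14.
    import pyarrow as pa
    from datafusion import SessionContext

    ctx = SessionContext()
    df = ctx.from_pydict({"a": [1.0, 2.0, 3.0]})  # made-up demo data

    reader = pa.RecordBatchReader.from_stream(df)  # consumes __arrow_c_stream__
    rows = 0
    for batch in reader:  # batches are pulled from the engine on demand
        rows += batch.num_rows
    print(rows)

This is the same capsule that ``pa.table(df)`` consumes; the difference is only whether batches are accumulated into a Table or processed one at a time.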