@@ -25,8 +25,10 @@ The ``DataFrame`` class is the core abstraction in DataFusion that represents ta
 on that data. DataFrames provide a flexible API for transforming data through various operations such as
 filtering, projection, aggregation, joining, and more.
 
-A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when
-terminal operations like ``collect()``, ``show()``, or ``to_pandas()`` are called.
+A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when
+terminal operations like ``collect()``, ``show()``, or ``to_pandas()`` are called. ``collect()`` loads
+all record batches into Python memory; for large results you may want to stream data instead using
+``execute_stream()`` or ``__arrow_c_stream__()``.
 
 Creating DataFrames
 -------------------
@@ -128,27 +130,44 @@ DataFusion's DataFrame API offers a wide range of operations:
 
 Terminal Operations
 -------------------
-
-To materialize the results of your DataFrame operations:
+``collect()`` materializes every record batch in Python. While convenient, this
+eagerly loads the full result set into memory and can overwhelm the Python
+process for large queries. Alternatives that stream data from Rust avoid this
+memory growth:
 
 .. code-block:: python
 
-    # Collect all data as PyArrow RecordBatches
+    # Collect all data as PyArrow RecordBatches (loads entire result set)
     result_batches = df.collect()
-
-    # Convert to various formats
+
+    # Stream batches using the native API
+    stream = df.execute_stream()
+    for batch in stream:
+        ...  # process each RecordBatch
+
+    # Stream via the Arrow C Data Interface (consumes df.__arrow_c_stream__())
+    import pyarrow as pa
+    reader = pa.RecordBatchReader.from_stream(df)
+    for batch in reader:
+        ...
+
+    # Convert to various formats (also load all data into memory)
     pandas_df = df.to_pandas()         # Pandas DataFrame
     polars_df = df.to_polars()         # Polars DataFrame
     arrow_table = df.to_arrow_table()  # PyArrow Table
     py_dict = df.to_pydict()           # Python dictionary
     py_list = df.to_pylist()           # Python list of dictionaries
-
+
     # Display results
     df.show()  # Print tabular format to console
-
+
     # Count rows
     count = df.count()
 
+For large outputs, prefer engine-level writers such as ``df.write_parquet()``
+or other DataFusion writers. These stream data directly to the destination and
+avoid buffering the entire dataset in Python.
+
 HTML Rendering
 --------------
 