Commit 6abfc52

Enhance documentation for DataFrame streaming and memory management in index.rst and arrow.rst

1 parent: f2242ce

2 files changed: 53 additions & 16 deletions

docs/source/user-guide/dataframe/index.rst (28 additions & 9 deletions)
@@ -25,8 +25,10 @@ The ``DataFrame`` class is the core abstraction in DataFusion that represents ta
 on that data. DataFrames provide a flexible API for transforming data through various operations such as
 filtering, projection, aggregation, joining, and more.

-A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when
-terminal operations like ``collect()``, ``show()``, or ``to_pandas()`` are called.
+A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when
+terminal operations like ``collect()``, ``show()``, or ``to_pandas()`` are called. ``collect()`` loads
+all record batches into Python memory; for large results you may want to stream data instead using
+``execute_stream()`` or ``__arrow_c_stream__()``.

 Creating DataFrames
 -------------------
@@ -128,27 +130,44 @@ DataFusion's DataFrame API offers a wide range of operations:

 Terminal Operations
 -------------------
-
-To materialize the results of your DataFrame operations:
+``collect()`` materializes every record batch in Python. While convenient, this
+eagerly loads the full result set into memory and can overwhelm the Python
+process for large queries. Alternatives that stream data from Rust avoid this
+memory growth:

 .. code-block:: python

-    # Collect all data as PyArrow RecordBatches
+    # Collect all data as PyArrow RecordBatches (loads entire result set)
     result_batches = df.collect()
-
-    # Convert to various formats
+
+    # Stream batches using the native API
+    stream = df.execute_stream()
+    for batch in stream:
+        ...  # process each RecordBatch
+
+    # Stream via the Arrow C Data Interface (requires pyarrow >= 14)
+    import pyarrow as pa
+    reader = pa.RecordBatchReader.from_stream(df)
+    for batch in reader:
+        ...
+
+    # Convert to various formats (also load all data into memory)
     pandas_df = df.to_pandas()         # Pandas DataFrame
     polars_df = df.to_polars()         # Polars DataFrame
     arrow_table = df.to_arrow_table()  # PyArrow Table
     py_dict = df.to_pydict()           # Python dictionary
     py_list = df.to_pylist()           # Python list of dictionaries
-
+
     # Display results
     df.show()  # Print tabular format to console
-
+
     # Count rows
     count = df.count()

+For large outputs, prefer engine-level writers such as ``df.write_parquet()``
+or other DataFusion writers. These stream data directly to the destination and
+avoid buffering the entire dataset in Python.
+
 HTML Rendering
 --------------

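As a minimal, self-contained sketch of the two patterns this hunk recommends (stream with ``execute_stream()``, or let the engine write the output), assuming a local ``datafusion`` install; the input and output paths are hypothetical:

    # Sketch: stream or write from the engine instead of collect()-ing
    # everything into Python. File paths below are hypothetical.
    from datafusion import SessionContext

    ctx = SessionContext()
    df = ctx.read_parquet("example.parquet")  # lazy: only builds a plan

    # Streaming keeps one RecordBatch resident in Python at a time.
    total_rows = 0
    for batch in df.execute_stream():
        total_rows += batch.to_pyarrow().num_rows
    print(total_rows)

    # Engine-level writer: batches flow from Rust straight to disk.
    df.write_parquet("result.parquet")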
docs/source/user-guide/io/arrow.rst (25 additions & 7 deletions)
@@ -57,16 +57,34 @@ and returns a ``StructArray``. Common pyarrow sources you can use are:
 Exporting from DataFusion
 -------------------------

-DataFusion DataFrames implement ``__arrow_c_stream__`` PyCapsule interface, so any
-Python library that accepts these can import a DataFusion DataFrame directly.
-The exported stream yields record batches lazily using DataFusion's
-``execute_stream`` mechanism, allowing consumers to process results incrementally
-without buffering the entire dataset in memory. This streaming behavior helps
-avoid out-of-memory failures when working with large queries.
+DataFusion DataFrames implement ``__arrow_c_stream__`` so any Python library
+that accepts this interface can import a DataFusion ``DataFrame`` directly.

+``collect()`` or ``pa.table(df)`` will materialize every record batch in
+Python. For large results this can quickly exhaust memory. Instead, stream the
+output incrementally:
+
+.. ipython:: python
+
+    # Stream batches with DataFusion's native API
+    stream = df.execute_stream()
+    for batch in stream:
+        ...  # process each RecordBatch as it arrives
+
+.. ipython:: python
+
+    # Expose a C stream that PyArrow can consume lazily (pyarrow >= 14)
+    import pyarrow as pa
+    reader = pa.RecordBatchReader.from_stream(df)
+    for batch in reader:
+        ...  # process each batch without buffering the entire table
+
+If the goal is simply to persist results, prefer engine-level writers such as
+``df.write_parquet()``. These writers stream data from Rust directly to the
+destination and avoid Python-side memory growth.

 .. ipython:: python

     df = df.select((col("a") * lit(1.5)).alias("c"), lit("df").alias("d"))
-    pa.table(df)
+    pa.table(df)  # loads all batches into memory

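To illustrate the consumer side of ``__arrow_c_stream__`` described in this hunk, a small sketch assuming PyArrow 14 or newer (for ``RecordBatchReader.from_stream``); the inline data is made up for the example:

    # Sketch: consume a DataFusion DataFrame lazily over the Arrow C
    # stream (PyCapsule) interface. Requires pyarrow >= 14.
    import pyarrow as pa
    from datafusion import SessionContext

    ctx = SessionContext()
    df = ctx.from_pydict({"a": [1.0, 2.0, 3.0]})  # made-up demo data

    reader = pa.RecordBatchReader.from_stream(df)  # consumes __arrow_c_stream__
    rows = 0
    for batch in reader:  # batches are pulled from the engine on demand
        rows += batch.num_rows
    print(rows)

This is the same capsule that ``pa.table(df)`` consumes; the difference is only whether batches are accumulated into a Table or processed one at a time.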