
Commit 6b72f08

docs: clarify lazy evaluation and terminal operations in DataFrame documentation
1 parent 7ed0583 commit 6b72f08

File tree

2 files changed: +19 −13 lines

docs/source/user-guide/dataframe/index.rst

Lines changed: 14 additions & 9 deletions
@@ -25,8 +25,9 @@ The ``DataFrame`` class is the core abstraction in DataFusion that represents ta
 on that data. DataFrames provide a flexible API for transforming data through various operations such as
 filtering, projection, aggregation, joining, and more.
 
-A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when
-terminal operations like ``collect()``, ``show()``, or ``to_pandas()`` are called.
+A DataFrame represents a lazily evaluated logical plan. No computation occurs until you perform a
+terminal operation (such as ``collect()``, ``show()``, or ``to_pandas()``) or iterate over the
+``DataFrame``.
 
 Creating DataFrames
 -------------------
@@ -129,20 +130,25 @@ DataFusion's DataFrame API offers a wide range of operations:
 Terminal Operations
 -------------------
 
-To materialize the results of your DataFrame operations:
+To materialize the results of your DataFrame operations, call a terminal method or iterate over the
+``DataFrame`` to consume ``pyarrow.RecordBatch`` objects lazily:
 
 .. code-block:: python
 
+    # Iterate over the DataFrame to stream record batches
+    for batch in df:
+        ...  # process each batch as it is produced
+
     # Collect all data as PyArrow RecordBatches
     result_batches = df.collect()
-
+
     # Convert to various formats
     pandas_df = df.to_pandas()         # Pandas DataFrame
     polars_df = df.to_polars()         # Polars DataFrame
     arrow_table = df.to_arrow_table()  # PyArrow Table
     py_dict = df.to_pydict()           # Python dictionary
     py_list = df.to_pylist()           # Python list of dictionaries
-
+
     # Display results
     df.show()  # Print tabular format to console
@@ -154,10 +160,9 @@ PyArrow Streaming
 
 DataFusion DataFrames implement the ``__arrow_c_stream__`` protocol, enabling
 zero-copy streaming into libraries like `PyArrow <https://arrow.apache.org/>`_.
-Earlier versions eagerly converted the entire DataFrame when exporting to
-PyArrow, which could exhaust memory on large datasets. With streaming, batches
-are produced lazily so you can process arbitrarily large results without
-out-of-memory errors.
+Because DataFrames are lazily evaluated, batches are produced only as they are
+consumed so you can process arbitrarily large results without out-of-memory
+errors.
 
 .. code-block:: python
 
python/datafusion/dataframe.py

Lines changed: 5 additions & 4 deletions
@@ -291,10 +291,11 @@ def __init__(
 class DataFrame:
     """Two dimensional table representation of data.
 
-    DataFrame objects are iterable; iterating over a DataFrame yields
-    :class:`pyarrow.RecordBatch` instances lazily. Use
-    :py:meth:`to_stream` to obtain a :class:`~datafusion.record_batch.RecordBatchStream`
-    for explicit iteration over the results.
+    A :py:class:`DataFrame` represents a lazily evaluated plan. Operations build
+    up the plan without executing it, and results are only materialized during a
+    terminal operation (for example, :py:meth:`collect`, :py:meth:`show`, or
+    :py:meth:`to_pandas`) or when iterating over the DataFrame, which yields
+    :class:`pyarrow.RecordBatch` objects lazily.
 
     See :ref:`user_guide_concepts` in the online documentation for more information.
     """
