
Commit 6b72f08

docs: clarify lazy evaluation and terminal operations in DataFrame documentation
1 parent 7ed0583 commit 6b72f08

File tree

2 files changed: +19 −13 lines

docs/source/user-guide/dataframe/index.rst

Lines changed: 14 additions & 9 deletions
@@ -25,8 +25,9 @@ The ``DataFrame`` class is the core abstraction in DataFusion that represents ta
 on that data. DataFrames provide a flexible API for transforming data through various operations such as
 filtering, projection, aggregation, joining, and more.
 
-A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when
-terminal operations like ``collect()``, ``show()``, or ``to_pandas()`` are called.
+A DataFrame represents a lazily evaluated logical plan. No computation occurs until you perform a
+terminal operation (such as ``collect()``, ``show()``, or ``to_pandas()``) or iterate over the
+``DataFrame``.
 
 Creating DataFrames
 -------------------
@@ -129,20 +130,25 @@ DataFusion's DataFrame API offers a wide range of operations:
 Terminal Operations
 -------------------
 
-To materialize the results of your DataFrame operations:
+To materialize the results of your DataFrame operations, call a terminal method or iterate over the
+``DataFrame`` to consume ``pyarrow.RecordBatch`` objects lazily:
 
 .. code-block:: python
 
+    # Iterate over the DataFrame to stream record batches
+    for batch in df:
+        ...  # process each batch as it is produced
+
     # Collect all data as PyArrow RecordBatches
     result_batches = df.collect()
-
+
     # Convert to various formats
     pandas_df = df.to_pandas()         # Pandas DataFrame
     polars_df = df.to_polars()         # Polars DataFrame
     arrow_table = df.to_arrow_table()  # PyArrow Table
     py_dict = df.to_pydict()           # Python dictionary
     py_list = df.to_pylist()           # Python list of dictionaries
-
+
     # Display results
     df.show()  # Print tabular format to console
@@ -154,10 +160,9 @@ PyArrow Streaming
 
 DataFusion DataFrames implement the ``__arrow_c_stream__`` protocol, enabling
 zero-copy streaming into libraries like `PyArrow <https://arrow.apache.org/>`_.
-Earlier versions eagerly converted the entire DataFrame when exporting to
-PyArrow, which could exhaust memory on large datasets. With streaming, batches
-are produced lazily so you can process arbitrarily large results without
-out-of-memory errors.
+Because DataFrames are lazily evaluated, batches are produced only as they are
+consumed so you can process arbitrarily large results without out-of-memory
+errors.
 
 .. code-block:: python
 
python/datafusion/dataframe.py

Lines changed: 5 additions & 4 deletions
@@ -291,10 +291,11 @@ def __init__(
 class DataFrame:
     """Two dimensional table representation of data.
 
-    DataFrame objects are iterable; iterating over a DataFrame yields
-    :class:`pyarrow.RecordBatch` instances lazily. Use
-    :py:meth:`to_stream` to obtain a :class:`~datafusion.record_batch.RecordBatchStream`
-    for explicit iteration over the results.
+    A :py:class:`DataFrame` represents a lazily evaluated plan. Operations build
+    up the plan without executing it, and results are only materialized during a
+    terminal operation (for example, :py:meth:`collect`, :py:meth:`show`, or
+    :py:meth:`to_pandas`) or when iterating over the DataFrame, which yields
+    :class:`pyarrow.RecordBatch` objects lazily.
 
     See :ref:`user_guide_concepts` in the online documentation for more information.
     """
