Commit a63d083

revert branch UNPICK
1 parent ed4ad0b commit a63d083

15 files changed (+146, -415 lines)

AGENTS.md

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+# AGENTS Instructions
+
+This repository contains Python bindings for Rust's DataFusion.
+
+## Development workflow
+- Ensure git submodules are initialized: `git submodule update --init`.
+- Build the Rust extension before running tests:
+  - `uv run --no-project maturin develop --uv`
+- Run tests with pytest:
+  - `uv --no-project pytest .`
+
+## Linting and formatting
+- Use pre-commit for linting/formatting.
+- Run hooks for changed files before committing:
+  - `pre-commit run --files <files>`
+  - or `pre-commit run --all-files`
+- Hooks enforce:
+  - Python linting/formatting via Ruff
+  - Rust formatting via `cargo fmt`
+  - Rust linting via `cargo clippy`
+
+## Notes
+- The repository mixes Python and Rust; ensure changes build for both languages.
+- If adding new dependencies, update `pyproject.toml` and run `uv sync --dev --no-install-package datafusion`.
+
+## Helper Functions
+- `python/datafusion/io.py` offers global context readers:
+  - `read_parquet`
+  - `read_json`
+  - `read_csv`
+  - `read_avro`
+- `python/datafusion/user_defined.py` exports convenience creators for user-defined functions:
+  - `udf` (scalar)
+  - `udaf` (aggregate)
+  - `udwf` (window)
+  - `udtf` (table)
+- `python/datafusion/col.py` exposes the `Col` helper with `col` and `column` instances for building column expressions using attribute access.
+- `python/datafusion/catalog.py` provides Python-based catalog and schema providers.
+- `python/datafusion/object_store.py` exposes object store connectors: `AmazonS3`, `GoogleCloud`, `MicrosoftAzure`, `LocalFileSystem`, and `Http`.
+- `python/datafusion/unparser.py` converts logical plans back to SQL via the `Dialect` and `Unparser` classes.
+- `python/datafusion/dataframe_formatter.py` offers configurable HTML and string formatting for DataFrames (replaces the deprecated `html_formatter.py`).
+- `python/tests/generic.py` includes utilities for test data generation:
+  - `data`
+  - `data_with_nans`
+  - `data_datetime`
+  - `data_date32`
+  - `data_timedelta`
+  - `data_binary_other`
+  - `write_parquet`
+- `python/tests/conftest.py` defines reusable pytest fixtures:
+  - `ctx` creates a `SessionContext`.
+  - `database` registers a sample CSV dataset.
+- `src/dataframe.rs` provides the `collect_record_batches_to_display` helper to fetch the first non-empty record batch and flag if more are available.
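
For orientation, a minimal sketch of how the helpers listed in the new AGENTS.md fit together; the Parquet path and the "value" column are hypothetical, and it assumes the extension has been built with `maturin develop`:

    import pyarrow as pa
    import pyarrow.compute as pc

    from datafusion import col, udf
    from datafusion.io import read_parquet  # global-context reader from io.py

    # Hypothetical data file, used purely for illustration.
    df = read_parquet("data/example.parquet")

    # A scalar UDF built with the convenience creator from user_defined.py.
    double_it = udf(lambda arr: pc.multiply(arr, 2), [pa.int64()], pa.int64(), "stable")

    # "value" is an assumed column name in the example file.
    df.select(double_it(col("value")).alias("doubled")).show()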

Cargo.lock

Lines changed: 0 additions & 20 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 0 additions & 1 deletion
@@ -48,7 +48,6 @@ uuid = { version = "1.18", features = ["v4"] }
 mimalloc = { version = "0.1", optional = true, default-features = false, features = ["local_dynamic_tls"] }
 async-trait = "0.1.89"
 futures = "0.3"
-rayon = "1.10"
 object_store = { version = "0.12.3", features = ["aws", "gcp", "azure", "http"] }
 url = "2"
 log = "0.4.27"

benchmarks/collect_gil_bench.py

Lines changed: 0 additions & 24 deletions
This file was deleted.

docs/source/user-guide/configuration.rst

Lines changed: 0 additions & 21 deletions
@@ -47,26 +47,5 @@ a :py:class:`~datafusion.context.SessionConfig` and :py:class:`~datafusion.conte
     print(ctx)
 
 
-.. _target_partitions:
-
-Target partitions and threads
------------------------------
-
-The :py:meth:`~datafusion.context.SessionConfig.with_target_partitions` method
-controls how many partitions DataFusion uses when executing a query. Each
-partition is processed on its own thread, so this setting effectively limits
-the number of threads that will be scheduled.
-
-For most workloads a good starting value is the number of logical CPU cores on
-your machine. You can use :func:`os.cpu_count` to automatically configure this::
-
-    import os
-    config = SessionConfig().with_target_partitions(os.cpu_count())
-
-Choosing a value significantly higher than the available cores can lead to
-excessive context switching without performance gains, while a much lower value
-may underutilize the machine.
-
-
 You can read more about available :py:class:`~datafusion.context.SessionConfig` options in the `rust DataFusion Configuration guide <https://arrow.apache.org/datafusion/user-guide/configs.html>`_,
 and about :code:`RuntimeEnvBuilder` options in the rust `online API documentation <https://docs.rs/datafusion/latest/datafusion/execution/runtime_env/struct.RuntimeEnvBuilder.html>`_.
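
The deleted guidance amounts to a single configuration call; as a minimal sketch (assuming the `datafusion` package is installed), the pattern it described looks like:

    import os

    from datafusion import SessionConfig, SessionContext

    # Start from the logical CPU count, as the removed section recommended;
    # fall back to 1 when os.cpu_count() cannot determine it.
    config = SessionConfig().with_target_partitions(os.cpu_count() or 1)
    ctx = SessionContext(config)
    print(ctx)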

docs/source/user-guide/dataframe/collect-gil.md

Lines changed: 0 additions & 26 deletions
This file was deleted.

docs/source/user-guide/dataframe/index.rst

Lines changed: 9 additions & 32 deletions
@@ -25,10 +25,8 @@ The ``DataFrame`` class is the core abstraction in DataFusion that represents ta
 on that data. DataFrames provide a flexible API for transforming data through various operations such as
 filtering, projection, aggregation, joining, and more.
 
-A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when
-terminal operations like ``collect()``, ``show()``, or ``to_pandas()`` are called. ``collect()`` loads
-all record batches into Python memory; for large results you may want to stream data instead using
-``execute_stream()`` or ``__arrow_c_stream__()``.
+A DataFrame represents a logical plan that is lazily evaluated. The actual execution occurs only when
+terminal operations like ``collect()``, ``show()``, or ``to_pandas()`` are called.
 
 Creating DataFrames
 -------------------
@@ -130,47 +128,27 @@ DataFusion's DataFrame API offers a wide range of operations:
 
 Terminal Operations
 -------------------
-``collect()`` materializes every record batch in Python. While convenient, this
-eagerly loads the full result set into memory and can overwhelm the Python
-process for large queries. Alternatives that stream data from Rust avoid this
-memory growth:
+
+To materialize the results of your DataFrame operations:
 
 .. code-block:: python
 
-    # Collect all data as PyArrow RecordBatches (loads entire result set)
+    # Collect all data as PyArrow RecordBatches
     result_batches = df.collect()
-
-    # Stream batches using the native API
-    stream = df.execute_stream()
-    for batch in stream:
-        ...  # process each RecordBatch
-
-    # Stream via the Arrow C Data Interface
-    import pyarrow as pa
-    reader = pa.ipc.RecordBatchStreamReader._import_from_c(df.__arrow_c_stream__())
-    for batch in reader:
-        ...
-
-    # Convert to various formats (also load all data into memory)
+
+    # Convert to various formats
     pandas_df = df.to_pandas()  # Pandas DataFrame
     polars_df = df.to_polars()  # Polars DataFrame
    arrow_table = df.to_arrow_table()  # PyArrow Table
    py_dict = df.to_pydict()  # Python dictionary
    py_list = df.to_pylist()  # Python list of dictionaries
-
+
     # Display results
     df.show()  # Print tabular format to console
-
+
     # Count rows
     count = df.count()
 
-For large outputs, prefer engine-level writers such as ``df.write_parquet()``
-or other DataFusion writers. These stream data directly to the destination and
-avoid buffering the entire dataset in Python.
-
-For more on parallel record batch conversion and the Python GIL, see
-:doc:`collect-gil`.
-
 HTML Rendering
 --------------
 
@@ -229,4 +207,3 @@ For a complete list of available functions, see the :py:mod:`datafusion.function
    :maxdepth: 1
 
    rendering
-   collect-gil
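
The removed prose contrasts eager ``collect()`` with streaming execution; a minimal sketch of both patterns (a toy query, assuming a working `SessionContext`) is:

    from datafusion import SessionContext

    ctx = SessionContext()
    df = ctx.sql("SELECT 1 AS a")

    # Eager: collect() materializes every RecordBatch in Python memory.
    batches = df.collect()

    # Incremental: execute_stream() yields record batches one at a time,
    # keeping Python-side memory bounded for large results.
    for batch in df.execute_stream():
        ...  # process each batch as it arrives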

docs/source/user-guide/io/arrow.rst

Lines changed: 7 additions & 24 deletions
@@ -57,34 +57,17 @@ and returns a ``StructArray``. Common pyarrow sources you can use are:
 Exporting from DataFusion
 -------------------------
 
-DataFusion DataFrames implement ``__arrow_c_stream__`` so any Python library
-that accepts this interface can import a DataFusion ``DataFrame`` directly.
+DataFusion DataFrames implement ``__arrow_c_stream__`` PyCapsule interface, so any
+Python library that accepts these can import a DataFusion DataFrame directly.
 
-``collect()`` or ``pa.table(df)`` will materialize every record batch in
-Python. For large results this can quickly exhaust memory. Instead, stream the
-output incrementally:
+.. warning::
+    It is important to note that this will cause the DataFrame execution to happen, which may be
+    a time consuming task. That is, you will cause a
+    :py:func:`datafusion.dataframe.DataFrame.collect` operation call to occur.
 
-.. ipython:: python
-
-    # Stream batches with DataFusion's native API
-    stream = df.execute_stream()
-    for batch in stream:
-        ...  # process each RecordBatch as it arrives
-
-.. ipython:: python
-
-    # Expose a C stream that PyArrow can consume lazily
-    import pyarrow as pa
-    reader = pa.ipc.RecordBatchStreamReader._import_from_c(df.__arrow_c_stream__())
-    for batch in reader:
-        ...  # process each batch without buffering the entire table
-
-If the goal is simply to persist results, prefer engine-level writers such as
-``df.write_parquet()``. These writers stream data from Rust directly to the
-destination and avoid Python-side memory growth.
 
 .. ipython:: python
 
     df = df.select((col("a") * lit(1.5)).alias("c"), lit("df").alias("d"))
-    pa.table(df)  # loads all batches into memory
+    pa.table(df)
 
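
As a sketch of the retained export path (assuming `pyarrow` is installed; the column values are invented for illustration), importing a DataFusion DataFrame into PyArrow looks like:

    import pyarrow as pa

    from datafusion import SessionContext, col, lit

    ctx = SessionContext()
    df = ctx.from_pydict({"a": [1.0, 2.0, 3.0]})
    df = df.select((col("a") * lit(1.5)).alias("c"), lit("df").alias("d"))

    # pa.table() consumes the DataFrame via __arrow_c_stream__, which
    # triggers execution (effectively a collect()).
    table = pa.table(df)
    print(table)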

python/datafusion/context.py

Lines changed: 1 addition & 7 deletions
@@ -161,13 +161,7 @@ def with_batch_size(self, batch_size: int) -> SessionConfig:
     def with_target_partitions(self, target_partitions: int) -> SessionConfig:
         """Customize the number of target partitions for query execution.
 
-        Each partition is processed on its own thread, so this value controls
-        the degree of parallelism. A good starting point is the number of
-        logical CPU cores on your machine, for example
-        ``SessionConfig().with_target_partitions(os.cpu_count())``.
-
-        See the :ref:`configuration guide <target_partitions>` for more
-        discussion on choosing a value.
+        Increasing partitions can increase concurrency.
 
         Args:
             target_partitions: Number of target partitions.
