
Commit c229edc

Revert "UNPICK"
This reverts commit 2c87e18.
1 parent 2c87e18 commit c229edc

36 files changed: +1272 −609 lines changed

README.md

Lines changed: 7 additions & 0 deletions
@@ -187,16 +187,23 @@ See [examples](examples/README.md) for more information.
 
 ## How to install
 
+DataFusion works with any library exposing the Arrow PyCapsule interface. If you
+need `pyarrow`, install the optional extra.
+
 ### uv
 
 ```bash
 uv add datafusion
+# or with PyArrow support
+uv add "datafusion[pyarrow]"
 ```
 
 ### Pip
 
 ```bash
 pip install datafusion
+# or with PyArrow support
+pip install "datafusion[pyarrow]"
 # or
 python -m pip install datafusion
 ```
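
Since the base package no longer depends on `pyarrow`, a quick end-to-end smoke test needs nothing beyond `datafusion` itself. A minimal sketch, assuming only the plain install from the instructions above:

```python
from datafusion import SessionContext

# No pyarrow import anywhere in this snippet; the query runs entirely
# on the core package installed above.
ctx = SessionContext()
ctx.sql("SELECT 1 + 1 AS two").show()
```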

docs/mdbook/src/installation.md

Lines changed: 11 additions & 0 deletions
@@ -18,6 +18,13 @@
 
 DataFusion is easy to install, just like any other Python library.
 
+DataFusion works with any library exposing the Arrow PyCapsule interface. If
+you rely on `pyarrow`, install the optional extra:
+
+```bash
+uv pip install "datafusion[pyarrow]"
+```
+
 ## Using uv
 
 If you do not yet have a virtual environment, create one:
@@ -36,12 +43,16 @@ Or, to add to a project:
 
 ```bash
 uv add datafusion
+# or with PyArrow support
+uv add "datafusion[pyarrow]"
 ```
 
 ## Using pip
 
 ``` bash
 pip install datafusion
+# or with PyArrow support
+pip install "datafusion[pyarrow]"
 ```
 
 ## uv & JupyterLab setup

docs/source/conf.py

Lines changed: 4 additions & 0 deletions
@@ -72,6 +72,10 @@
 suppress_warnings = ["autoapi.python_import_resolution"]
 autoapi_python_class_content = "both"
 autoapi_keep_files = False  # set to True for debugging generated files
+autoapi_options = ["members", "undoc-members", "special-members"]
+autoapi_member_options = {
+    "special-members": "__iter__,__aiter__,__arrow_c_array__,__arrow_c_stream__"
+}
 
 
 def autoapi_skip_member_fn(app, what, name, obj, skip, options) -> bool:  # noqa: ARG001

docs/source/user-guide/common-operations/functions.rst

Lines changed: 4 additions & 0 deletions
@@ -109,6 +109,10 @@ Casting
 
 Casting expressions to different data types using :py:func:`~datafusion.functions.arrow_cast`
 
+DataFusion's :class:`~datafusion.types.DataType` can be constructed from any
+object implementing ``__arrow_c_schema__`` and passed to ``arrow_cast`` without
+requiring :mod:`pyarrow`.
+
 .. ipython:: python
 
     df.select(

docs/source/user-guide/data-sources.rst

Lines changed: 6 additions & 1 deletion
@@ -158,7 +158,12 @@ as Delta Lake. This will require a recent version of
     df = ctx.table("my_delta_table")
     df.show()
 
-On older versions of ``deltalake`` (prior to 0.22) you can use the
+Any Python object that implements the
+``__arrow_c_stream__`` protocol can be registered with
+``register_dataset``. This includes scanners from libraries such as
+``nanoarrow``, ``Polars``, or ``DuckDB``.
+
+On older versions of ``deltalake`` (prior to 0.22) you can use the
 `Arrow DataSet <https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html>`_
 interface to import to DataFusion, but this does not support features such as filter push down
 which can lead to a significant performance difference.
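
To make the ``register_dataset`` wording above concrete, here is a sketch of registering a Polars DataFrame, one of the ``__arrow_c_stream__`` sources the docs mention. It assumes the behavior introduced by this commit and a recent ``polars`` release that exposes the Arrow PyCapsule interface:

```python
import polars as pl
from datafusion import SessionContext

ctx = SessionContext()

# Any object exposing __arrow_c_stream__ can be handed to register_dataset;
# a polars DataFrame is one such source.
pl_df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
ctx.register_dataset("my_table", pl_df)

ctx.sql("SELECT a, b FROM my_table WHERE a > 1").show()
```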

docs/source/user-guide/dataframe/index.rst

Lines changed: 40 additions & 1 deletion
@@ -145,10 +145,49 @@ To materialize the results of your DataFrame operations:
 
     # Display results
     df.show()  # Print tabular format to console
-
+
     # Count rows
     count = df.count()
 
+PyArrow Streaming
+-----------------
+
+DataFusion DataFrames implement the ``__arrow_c_stream__`` protocol, enabling
+zero-copy streaming into libraries like `PyArrow <https://arrow.apache.org/>`_.
+Earlier versions eagerly converted the entire DataFrame when exporting to
+PyArrow, which could exhaust memory on large datasets. With streaming, batches
+are produced lazily so you can process arbitrarily large results without
+out-of-memory errors.
+
+.. code-block:: python
+
+    import pyarrow as pa
+
+    # Create a PyArrow RecordBatchReader without materializing all batches
+    reader = pa.RecordBatchReader._import_from_c_capsule(df.__arrow_c_stream__())
+    for batch in reader:
+        ...  # process each batch as it is produced
+
+DataFrames expose :py:meth:`~datafusion.DataFrame.to_stream`, which returns a
+``RecordBatchStream`` for lazily processing results without materializing them
+all at once:
+
+.. code-block:: python
+
+    stream = df.to_stream()
+    for batch in stream:
+        ...  # process each batch as it is produced
+
+DataFrames themselves are also iterable and delegate to ``to_stream()`` under
+the hood:
+
+.. code-block:: python
+
+    for batch in df:
+        ...  # process each batch as it is produced
+
+See :doc:`../io/arrow` for additional details on the Arrow interface.
+
 HTML Rendering
 --------------
 
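The hunk above uses the private ``_import_from_c_capsule`` constructor; recent PyArrow releases also expose a public entry point that negotiates ``__arrow_c_stream__`` directly. A hedged sketch, assuming a PyArrow version that provides ``RecordBatchReader.from_stream`` and a DataFusion ``df`` as in the docs above:

```python
import pyarrow as pa

# from_stream pulls batches lazily through the Arrow C stream interface,
# so nothing is materialized up front.
reader = pa.RecordBatchReader.from_stream(df)
for batch in reader:
    ...  # each item is a pyarrow.RecordBatch
```
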
docs/source/user-guide/introduction.rst

Lines changed: 6 additions & 1 deletion
@@ -26,11 +26,16 @@ DataFusion through various examples and highlight the most effective ways of usi
 Installation
 ------------
 
-DataFusion is a Python library and, as such, can be installed via pip from `PyPI <https://pypi.org/project/datafusion>`__.
+DataFusion is a Python library and, as such, can be installed via pip from
+`PyPI <https://pypi.org/project/datafusion>`__. DataFusion works with any
+library exposing the Arrow PyCapsule interface. If you need ``pyarrow``,
+install the optional extra.
 
 .. code-block:: shell
 
     pip install datafusion
+    # or with PyArrow support
+    pip install "datafusion[pyarrow]"
 
 You can verify the installation by running:
 
pyproject.toml

Lines changed: 5 additions & 2 deletions
@@ -38,19 +38,22 @@ classifiers = [
     "Programming Language :: Python :: 3.9",
     "Programming Language :: Python :: 3.10",
     "Programming Language :: Python :: 3.11",
-    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.12",
     "Programming Language :: Python :: 3.13",
     "Programming Language :: Python",
     "Programming Language :: Rust",
 ]
-dependencies = ["pyarrow>=11.0.0", "typing-extensions;python_version<'3.13'"]
+dependencies = ["typing-extensions;python_version<'3.13'"]
 dynamic = ["version"]
 
 [project.urls]
 homepage = "https://datafusion.apache.org/python"
 documentation = "https://datafusion.apache.org/python"
 repository = "https://github.com/apache/datafusion-python"
 
+[project.optional-dependencies]
+pyarrow = ["pyarrow>=11.0.0"]
+
 [tool.isort]
 profile = "black"

python/datafusion/catalog.py

Lines changed: 2 additions & 2 deletions
@@ -150,8 +150,8 @@ def __repr__(self) -> str:
         return self.table.__repr__()
 
     @staticmethod
-    def from_dataset(dataset: pa.dataset.Dataset) -> Table:
-        """Turn a pyarrow Dataset into a Table."""
+    def from_dataset(dataset: object) -> Table:
+        """Turn any ``__arrow_c_stream__`` source into a Table."""
         return Table(df_internal.catalog.RawTable.from_dataset(dataset))
 
     @property
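
With the relaxed signature above, ``Table.from_dataset`` accepts any ``__arrow_c_stream__`` source rather than only a ``pyarrow`` ``Dataset``. A sketch of one way to use it, assuming this commit's behavior and a hypothetical Parquet directory; the scanner's reader is what exposes ``__arrow_c_stream__``:

```python
import pyarrow.dataset as ds
from datafusion import SessionContext
from datafusion.catalog import Table

ctx = SessionContext()

# The path is hypothetical; the scanner's reader implements __arrow_c_stream__.
reader = ds.dataset("data/events/", format="parquet").scanner().to_reader()
table = Table.from_dataset(reader)

ctx.register_table("events", table)
ctx.sql("SELECT COUNT(*) FROM events").show()
```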

python/datafusion/context.py

Lines changed: 24 additions & 17 deletions
@@ -22,7 +22,8 @@
 import warnings
 from typing import TYPE_CHECKING, Any, Protocol
 
-import pyarrow as pa
+from datafusion.common import DataTypeMap
+from datafusion.types import ensure_pyarrow_type
 
 try:
     from warnings import deprecated  # Python 3.13+
@@ -45,6 +46,7 @@
 
     import pandas as pd
     import polars as pl
+    import pyarrow as pa
 
     from datafusion.plan import ExecutionPlan, LogicalPlan
 
@@ -550,7 +552,7 @@ def register_listing_table(
         self,
         name: str,
         path: str | pathlib.Path,
-        table_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
+        table_partition_cols: list[tuple[str, str | DataTypeMap | Any]] | None = None,
         file_extension: str = ".parquet",
         schema: pa.Schema | None = None,
         file_sort_order: list[list[Expr | SortExpr]] | None = None,
@@ -803,7 +805,7 @@ def register_parquet(
         self,
         name: str,
         path: str | pathlib.Path,
-        table_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
+        table_partition_cols: list[tuple[str, str | DataTypeMap | Any]] | None = None,
         parquet_pruning: bool = True,
         file_extension: str = ".parquet",
         skip_metadata: bool = True,
@@ -895,7 +897,7 @@ def register_json(
         schema: pa.Schema | None = None,
         schema_infer_max_records: int = 1000,
         file_extension: str = ".json",
-        table_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
+        table_partition_cols: list[tuple[str, str | DataTypeMap | Any]] | None = None,
         file_compression_type: str | None = None,
     ) -> None:
         """Register a JSON file as a table.
@@ -933,7 +935,7 @@ def register_avro(
         path: str | pathlib.Path,
         schema: pa.Schema | None = None,
         file_extension: str = ".avro",
-        table_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
+        table_partition_cols: list[tuple[str, str | DataTypeMap | Any]] | None = None,
     ) -> None:
         """Register an Avro file as a table.
 
@@ -954,12 +956,16 @@ def register_avro(
             name, str(path), schema, file_extension, table_partition_cols
         )
 
-    def register_dataset(self, name: str, dataset: pa.dataset.Dataset) -> None:
-        """Register a :py:class:`pa.dataset.Dataset` as a table.
+    def register_dataset(self, name: str, dataset: object) -> None:
+        """Register any ``__arrow_c_stream__`` source as a table.
+
+        Any Python object implementing the Arrow ``__arrow_c_stream__`` protocol
+        can be registered, including objects from libraries such as nanoarrow,
+        Polars, DuckDB, or :py:mod:`pyarrow.dataset`.
 
         Args:
             name: Name of the table to register.
-            dataset: PyArrow dataset.
+            dataset: Object exposing ``__arrow_c_stream__``.
         """
         self.ctx.register_dataset(name, dataset)
 
@@ -1009,7 +1015,7 @@ def read_json(
         schema: pa.Schema | None = None,
         schema_infer_max_records: int = 1000,
         file_extension: str = ".json",
-        table_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
+        table_partition_cols: list[tuple[str, str | DataTypeMap | Any]] | None = None,
         file_compression_type: str | None = None,
     ) -> DataFrame:
         """Read a line-delimited JSON data source.
@@ -1049,7 +1055,7 @@ def read_csv(
         delimiter: str = ",",
         schema_infer_max_records: int = 1000,
         file_extension: str = ".csv",
-        table_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
+        table_partition_cols: list[tuple[str, str | DataTypeMap | Any]] | None = None,
         file_compression_type: str | None = None,
     ) -> DataFrame:
         """Read a CSV data source.
@@ -1094,7 +1100,7 @@ def read_csv(
     def read_parquet(
         self,
         path: str | pathlib.Path,
-        table_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
+        table_partition_cols: list[tuple[str, str | DataTypeMap | Any]] | None = None,
         parquet_pruning: bool = True,
         file_extension: str = ".parquet",
         skip_metadata: bool = True,
@@ -1145,7 +1151,7 @@ def read_avro(
         self,
         path: str | pathlib.Path,
         schema: pa.Schema | None = None,
-        file_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
+        file_partition_cols: list[tuple[str, str | DataTypeMap | Any]] | None = None,
         file_extension: str = ".avro",
     ) -> DataFrame:
         """Create a :py:class:`DataFrame` for reading Avro data source.
@@ -1181,26 +1187,27 @@ def execute(self, plan: ExecutionPlan, partitions: int) -> RecordBatchStream:
 
     @staticmethod
     def _convert_table_partition_cols(
-        table_partition_cols: list[tuple[str, str | pa.DataType]],
-    ) -> list[tuple[str, pa.DataType]]:
+        table_partition_cols: list[tuple[str, str | DataTypeMap | Any]],
+    ) -> list[tuple[str, Any]]:
         warn = False
         converted_table_partition_cols = []
 
         for col, data_type in table_partition_cols:
             if isinstance(data_type, str):
                 warn = True
                 if data_type == "string":
-                    converted_data_type = pa.string()
+                    mapped = DataTypeMap.py_map_from_arrow_type_str("utf8")
                 elif data_type == "int":
-                    converted_data_type = pa.int32()
+                    mapped = DataTypeMap.py_map_from_arrow_type_str("int32")
                 else:
                     message = (
                         f"Unsupported literal data type '{data_type}' for partition "
                         "column. Supported types are 'string' and 'int'"
                     )
                     raise ValueError(message)
+                converted_data_type = ensure_pyarrow_type(mapped)
             else:
-                converted_data_type = data_type
+                converted_data_type = ensure_pyarrow_type(data_type)
 
             converted_table_partition_cols.append((col, converted_data_type))
 
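The last hunk keeps the ``"string"`` and ``"int"`` shorthands for partition columns while routing them through ``DataTypeMap`` and ``ensure_pyarrow_type``. A sketch of the call path it serves, assuming a hypothetical partitioned Parquet directory:

```python
from datafusion import SessionContext

ctx = SessionContext()

# "string" and "int" are the only literal shorthands accepted by
# _convert_table_partition_cols; anything else raises ValueError.
# The path below is hypothetical.
ctx.register_parquet(
    "events",
    "data/events/",
    table_partition_cols=[("year", "int"), ("country", "string")],
)
ctx.sql("SELECT country, COUNT(*) AS n FROM events GROUP BY country").show()
```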