Commit 2c87e18 ("UNPICK")

1 parent 3db159a, commit 2c87e18

36 files changed: +609 −1272 lines

README.md (0 additions, 7 deletions)

@@ -187,23 +187,16 @@ See [examples](examples/README.md) for more information.
 
 ## How to install
 
-DataFusion works with any library exposing the Arrow PyCapsule interface. If you
-need `pyarrow`, install the optional extra.
-
 ### uv
 
 ```bash
 uv add datafusion
-# or with PyArrow support
-uv add "datafusion[pyarrow]"
 ```
 
 ### Pip
 
 ```bash
 pip install datafusion
-# or with PyArrow support
-pip install "datafusion[pyarrow]"
 # or
 python -m pip install datafusion
 ```

docs/mdbook/src/installation.md (0 additions, 11 deletions)

@@ -18,13 +18,6 @@
 
 DataFusion is easy to install, just like any other Python library.
 
-DataFusion works with any library exposing the Arrow PyCapsule interface. If
-you rely on `pyarrow`, install the optional extra:
-
-```bash
-uv pip install "datafusion[pyarrow]"
-```
-
 ## Using uv
 
 If you do not yet have a virtual environment, create one:

@@ -43,16 +36,12 @@ Or, to add to a project:
 
 ```bash
 uv add datafusion
-# or with PyArrow support
-uv add "datafusion[pyarrow]"
 ```
 
 ## Using pip
 
 ``` bash
 pip install datafusion
-# or with PyArrow support
-pip install "datafusion[pyarrow]"
 ```
 
 ## uv & JupyterLab setup

docs/source/conf.py (0 additions, 4 deletions)

@@ -72,10 +72,6 @@
 suppress_warnings = ["autoapi.python_import_resolution"]
 autoapi_python_class_content = "both"
 autoapi_keep_files = False  # set to True for debugging generated files
-autoapi_options = ["members", "undoc-members", "special-members"]
-autoapi_member_options = {
-    "special-members": "__iter__,__aiter__,__arrow_c_array__,__arrow_c_stream__"
-}
 
 
 def autoapi_skip_member_fn(app, what, name, obj, skip, options) -> bool:  # noqa: ARG001

docs/source/user-guide/common-operations/functions.rst (0 additions, 4 deletions)

@@ -109,10 +109,6 @@ Casting
 
 Casting expressions to different data types using :py:func:`~datafusion.functions.arrow_cast`
 
-DataFusion's :class:`~datafusion.types.DataType` can be constructed from any
-object implementing ``__arrow_c_schema__`` and passed to ``arrow_cast`` without
-requiring :mod:`pyarrow`.
-
 .. ipython:: python
 
     df.select(

docs/source/user-guide/data-sources.rst (1 addition, 6 deletions)

@@ -158,12 +158,7 @@ as Delta Lake. This will require a recent version of
     df = ctx.table("my_delta_table")
     df.show()
 
-Any Python object that implements the
-``__arrow_c_stream__`` protocol can be registered with
-``register_dataset``. This includes scanners from libraries such as
-``nanoarrow``, ``Polars``, or ``DuckDB``.
-
-On older versions of ``deltalake`` (prior to 0.22) you can use the
+On older versions of ``deltalake`` (prior to 0.22) you can use the
 `Arrow DataSet <https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html>`_
 interface to import to DataFusion, but this does not support features such as filter push down
 which can lead to a significant performance difference.
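For reference, a minimal sketch of the older-`deltalake` path that the retained paragraph describes. It assumes `DeltaTable.to_pyarrow_dataset()` is available and uses a placeholder table path:

```python
from deltalake import DeltaTable

from datafusion import SessionContext

ctx = SessionContext()

# Export the Delta table through the PyArrow Dataset interface.
# Unlike the native integration in deltalake >= 0.22, this route does not
# support filter push down.
dt = DeltaTable("/path/to/delta_table")  # placeholder path
ctx.register_dataset("my_delta_table", dt.to_pyarrow_dataset())

df = ctx.table("my_delta_table")
df.show()
```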

docs/source/user-guide/dataframe/index.rst (1 addition, 40 deletions)

@@ -145,49 +145,10 @@ To materialize the results of your DataFrame operations:
 
     # Display results
     df.show()  # Print tabular format to console
-
+
     # Count rows
     count = df.count()
 
-PyArrow Streaming
------------------
-
-DataFusion DataFrames implement the ``__arrow_c_stream__`` protocol, enabling
-zero-copy streaming into libraries like `PyArrow <https://arrow.apache.org/>`_.
-Earlier versions eagerly converted the entire DataFrame when exporting to
-PyArrow, which could exhaust memory on large datasets. With streaming, batches
-are produced lazily so you can process arbitrarily large results without
-out-of-memory errors.
-
-.. code-block:: python
-
-    import pyarrow as pa
-
-    # Create a PyArrow RecordBatchReader without materializing all batches
-    reader = pa.RecordBatchReader._import_from_c_capsule(df.__arrow_c_stream__())
-    for batch in reader:
-        ...  # process each batch as it is produced
-
-DataFrames expose :py:meth:`~datafusion.DataFrame.to_stream`, which returns a
-``RecordBatchStream`` for lazily processing results without materializing them
-all at once:
-
-.. code-block:: python
-
-    stream = df.to_stream()
-    for batch in stream:
-        ...  # process each batch as it is produced
-
-DataFrames themselves are also iterable and delegate to ``to_stream()`` under
-the hood:
-
-.. code-block:: python
-
-    for batch in df:
-        ...  # process each batch as it is produced
-
-See :doc:`../io/arrow` for additional details on the Arrow interface.
-
 HTML Rendering
 --------------
 

docs/source/user-guide/introduction.rst (1 addition, 6 deletions)

@@ -26,16 +26,11 @@ DataFusion through various examples and highlight the most effective ways of using it.
 Installation
 ------------
 
-DataFusion is a Python library and, as such, can be installed via pip from
-`PyPI <https://pypi.org/project/datafusion>`__. DataFusion works with any
-library exposing the Arrow PyCapsule interface. If you need ``pyarrow``,
-install the optional extra.
+DataFusion is a Python library and, as such, can be installed via pip from `PyPI <https://pypi.org/project/datafusion>`__.
 
 .. code-block:: shell
 
     pip install datafusion
-    # or with PyArrow support
-    pip install "datafusion[pyarrow]"
 
 You can verify the installation by running:
 

pyproject.toml (2 additions, 5 deletions)

@@ -38,22 +38,19 @@ classifiers = [
     "Programming Language :: Python :: 3.9",
     "Programming Language :: Python :: 3.10",
     "Programming Language :: Python :: 3.11",
-    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.12",
     "Programming Language :: Python :: 3.13",
     "Programming Language :: Python",
     "Programming Language :: Rust",
 ]
-dependencies = ["typing-extensions;python_version<'3.13'"]
+dependencies = ["pyarrow>=11.0.0", "typing-extensions;python_version<'3.13'"]
 dynamic = ["version"]
 
 [project.urls]
 homepage = "https://datafusion.apache.org/python"
 documentation = "https://datafusion.apache.org/python"
 repository = "https://github.com/apache/datafusion-python"
 
-[project.optional-dependencies]
-pyarrow = ["pyarrow>=11.0.0"]
-
 [tool.isort]
 profile = "black"
 
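As a quick illustration of the dependency change above: with `pyarrow` moved from an optional extra into the required `dependencies` list, downstream code can import it unconditionally. A trivial sanity check:

```python
# pyarrow is now a hard dependency of datafusion, so no "[pyarrow]" extra
# is needed at install time
import pyarrow as pa

print(pa.__version__)  # the pin above requires 11.0.0 or newer
```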

python/datafusion/catalog.py (2 additions, 2 deletions)

@@ -150,8 +150,8 @@ def __repr__(self) -> str:
         return self.table.__repr__()
 
     @staticmethod
-    def from_dataset(dataset: object) -> Table:
-        """Turn any ``__arrow_c_stream__`` source into a Table."""
+    def from_dataset(dataset: pa.dataset.Dataset) -> Table:
+        """Turn a pyarrow Dataset into a Table."""
         return Table(df_internal.catalog.RawTable.from_dataset(dataset))
 
     @property
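A minimal usage sketch of the narrowed `from_dataset` signature above, assuming `Table` is importable from `datafusion.catalog` (the module shown in this diff) and using a placeholder data directory:

```python
import pyarrow.dataset as ds

from datafusion.catalog import Table

# Build a pyarrow Dataset from a directory of Parquet files (placeholder path)
dataset = ds.dataset("data/", format="parquet")

# After this commit, from_dataset expects a pyarrow Dataset rather than any
# object exposing __arrow_c_stream__
table = Table.from_dataset(dataset)
```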

python/datafusion/context.py (17 additions, 24 deletions)

@@ -22,8 +22,7 @@
 import warnings
 from typing import TYPE_CHECKING, Any, Protocol
 
-from datafusion.common import DataTypeMap
-from datafusion.types import ensure_pyarrow_type
+import pyarrow as pa
 
 try:
     from warnings import deprecated  # Python 3.13+

@@ -46,7 +45,6 @@
 
     import pandas as pd
     import polars as pl
-    import pyarrow as pa
 
     from datafusion.plan import ExecutionPlan, LogicalPlan
 

@@ -552,7 +550,7 @@ def register_listing_table(
         self,
         name: str,
         path: str | pathlib.Path,
-        table_partition_cols: list[tuple[str, str | DataTypeMap | Any]] | None = None,
+        table_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
         file_extension: str = ".parquet",
         schema: pa.Schema | None = None,
         file_sort_order: list[list[Expr | SortExpr]] | None = None,

@@ -805,7 +803,7 @@ def register_parquet(
         self,
         name: str,
         path: str | pathlib.Path,
-        table_partition_cols: list[tuple[str, str | DataTypeMap | Any]] | None = None,
+        table_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
         parquet_pruning: bool = True,
         file_extension: str = ".parquet",
         skip_metadata: bool = True,

@@ -897,7 +895,7 @@ def register_json(
         schema: pa.Schema | None = None,
         schema_infer_max_records: int = 1000,
         file_extension: str = ".json",
-        table_partition_cols: list[tuple[str, str | DataTypeMap | Any]] | None = None,
+        table_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
         file_compression_type: str | None = None,
     ) -> None:
         """Register a JSON file as a table.

@@ -935,7 +933,7 @@ def register_avro(
         path: str | pathlib.Path,
         schema: pa.Schema | None = None,
         file_extension: str = ".avro",
-        table_partition_cols: list[tuple[str, str | DataTypeMap | Any]] | None = None,
+        table_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
     ) -> None:
         """Register an Avro file as a table.
 

@@ -956,16 +954,12 @@ def register_avro(
             name, str(path), schema, file_extension, table_partition_cols
         )
 
-    def register_dataset(self, name: str, dataset: object) -> None:
-        """Register any ``__arrow_c_stream__`` source as a table.
-
-        Any Python object implementing the Arrow ``__arrow_c_stream__`` protocol
-        can be registered, including objects from libraries such as nanoarrow,
-        Polars, DuckDB, or :py:mod:`pyarrow.dataset`.
+    def register_dataset(self, name: str, dataset: pa.dataset.Dataset) -> None:
+        """Register a :py:class:`pa.dataset.Dataset` as a table.
 
         Args:
             name: Name of the table to register.
-            dataset: Object exposing ``__arrow_c_stream__``.
+            dataset: PyArrow dataset.
         """
         self.ctx.register_dataset(name, dataset)
 
@@ -1015,7 +1009,7 @@ def read_json(
         schema: pa.Schema | None = None,
         schema_infer_max_records: int = 1000,
         file_extension: str = ".json",
-        table_partition_cols: list[tuple[str, str | DataTypeMap | Any]] | None = None,
+        table_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
         file_compression_type: str | None = None,
     ) -> DataFrame:
         """Read a line-delimited JSON data source.

@@ -1055,7 +1049,7 @@ def read_csv(
         delimiter: str = ",",
         schema_infer_max_records: int = 1000,
         file_extension: str = ".csv",
-        table_partition_cols: list[tuple[str, str | DataTypeMap | Any]] | None = None,
+        table_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
         file_compression_type: str | None = None,
     ) -> DataFrame:
         """Read a CSV data source.

@@ -1100,7 +1094,7 @@ def read_csv(
     def read_parquet(
         self,
         path: str | pathlib.Path,
-        table_partition_cols: list[tuple[str, str | DataTypeMap | Any]] | None = None,
+        table_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
         parquet_pruning: bool = True,
         file_extension: str = ".parquet",
         skip_metadata: bool = True,

@@ -1151,7 +1145,7 @@ def read_avro(
         self,
         path: str | pathlib.Path,
         schema: pa.Schema | None = None,
-        file_partition_cols: list[tuple[str, str | DataTypeMap | Any]] | None = None,
+        file_partition_cols: list[tuple[str, str | pa.DataType]] | None = None,
         file_extension: str = ".avro",
     ) -> DataFrame:
         """Create a :py:class:`DataFrame` for reading Avro data source.

@@ -1187,27 +1181,26 @@ def execute(self, plan: ExecutionPlan, partitions: int) -> RecordBatchStream:
 
     @staticmethod
     def _convert_table_partition_cols(
-        table_partition_cols: list[tuple[str, str | DataTypeMap | Any]],
-    ) -> list[tuple[str, Any]]:
+        table_partition_cols: list[tuple[str, str | pa.DataType]],
+    ) -> list[tuple[str, pa.DataType]]:
         warn = False
         converted_table_partition_cols = []
 
         for col, data_type in table_partition_cols:
             if isinstance(data_type, str):
                 warn = True
                 if data_type == "string":
-                    mapped = DataTypeMap.py_map_from_arrow_type_str("utf8")
+                    converted_data_type = pa.string()
                 elif data_type == "int":
-                    mapped = DataTypeMap.py_map_from_arrow_type_str("int32")
+                    converted_data_type = pa.int32()
                 else:
                     message = (
                         f"Unsupported literal data type '{data_type}' for partition "
                         "column. Supported types are 'string' and 'int'"
                     )
                     raise ValueError(message)
-                converted_data_type = ensure_pyarrow_type(mapped)
             else:
-                converted_data_type = ensure_pyarrow_type(data_type)
+                converted_data_type = data_type
 
             converted_table_partition_cols.append((col, converted_data_type))
 
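To show what the conversion above accepts after this change, a brief sketch; the directory and column names are hypothetical:

```python
import pyarrow as pa

from datafusion import SessionContext

ctx = SessionContext()

# Preferred: pass pyarrow DataTypes for partition columns directly
df = ctx.read_parquet(
    "data/",  # hypothetical hive-partitioned directory
    table_partition_cols=[("year", pa.int32()), ("month", pa.string())],
)

# Plain strings are still accepted; _convert_table_partition_cols maps
# "string" -> pa.string() and "int" -> pa.int32(), and sets its warn flag
df_legacy = ctx.read_parquet(
    "data/",
    table_partition_cols=[("year", "int"), ("month", "string")],
)
```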
