Commit a7ac0d4

Add geoarrow dependency and document current capabilities
1 parent 9b36337 commit a7ac0d4

5 files changed: +328 -13 lines changed

mkdocs/docs/geospatial.md

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# Geospatial Types

PyIceberg supports Iceberg v3 geospatial primitive types: `geometry` and `geography`.

## Overview

Iceberg v3 introduces native support for spatial data types:

- **`geometry(C)`**: Represents geometric shapes in a coordinate reference system (CRS)
- **`geography(C, A)`**: Represents geographic shapes with CRS and calculation algorithm

Both types store values as WKB (Well-Known Binary) bytes.

## Requirements

- Iceberg format version 3 or higher
- `geoarrow-pyarrow` for full GeoArrow extension type support (optional: `pip install pyiceberg[geoarrow]`)

## Usage

### Declaring Columns

```python
from pyiceberg.schema import Schema
from pyiceberg.types import GeographyType, GeometryType, IntegerType, NestedField

# Schema with geometry and geography columns
schema = Schema(
    NestedField(1, "id", IntegerType(), required=True),
    NestedField(2, "location", GeometryType(), required=True),
    NestedField(3, "boundary", GeographyType(), required=False),
)
```

### Type Parameters

#### GeometryType

```python
# Default CRS (OGC:CRS84)
GeometryType()

# Custom CRS
GeometryType("EPSG:4326")
```

#### GeographyType

```python
# Default CRS (OGC:CRS84) and algorithm (spherical)
GeographyType()

# Custom CRS
GeographyType("EPSG:4326")

# Custom CRS and algorithm
GeographyType("EPSG:4326", "planar")
```

### String Type Syntax

Types can also be specified as strings in schema definitions:

```python
# Using string type names
NestedField(1, "point", "geometry", required=True)
NestedField(2, "region", "geography", required=True)

# With parameters
NestedField(3, "location", "geometry('EPSG:4326')", required=True)
NestedField(4, "boundary", "geography('EPSG:4326', 'planar')", required=True)
```

## Data Representation

Values are represented as WKB (Well-Known Binary) bytes at runtime:

```python
# Example: Point(0, 0) in WKB format
point_wkb = bytes.fromhex("0101000000000000000000000000000000000000")
```
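
To make the byte layout concrete, here is a minimal sketch that builds and reads a 2D point in WKB using only the standard library. The helper names `encode_wkb_point` and `decode_wkb_point` are hypothetical and are not part of PyIceberg; the sketch assumes the standard WKB point layout of a 1-byte byte-order flag, a 4-byte geometry type code, and two 8-byte floats.

```python
import struct


def encode_wkb_point(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (hypothetical helper)."""
    # Byte-order flag (1 = little-endian), geometry type (1 = Point),
    # then the x and y coordinates as 64-bit floats.
    return struct.pack("<BIdd", 1, 1, x, y)


def decode_wkb_point(wkb: bytes) -> tuple[float, float]:
    """Decode a 2D WKB point, honouring the byte-order flag (hypothetical helper)."""
    endian = "<" if wkb[0] == 1 else ">"
    (geometry_type,) = struct.unpack_from(f"{endian}I", wkb, 1)
    if geometry_type != 1:
        raise ValueError(f"Expected a Point (type 1), got type {geometry_type}")
    return struct.unpack_from(f"{endian}dd", wkb, 5)


point_wkb = encode_wkb_point(1.5, -2.0)
print(len(point_wkb), point_wkb.hex())
print(decode_wkb_point(point_wkb))
```

For anything beyond points (lines, polygons, Z/M coordinates), a dedicated library such as Shapely is the more practical choice.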

## Current Limitations

1. **WKB/WKT Conversion**: Converting between WKB bytes and WKT strings requires external libraries (like Shapely). PyIceberg does not include this conversion to avoid heavy dependencies.

2. **Spatial Predicates**: Spatial filtering (e.g., ST_Contains, ST_Intersects) is not yet supported for query pushdown.

3. **Bounds Metrics**: Geometry/geography columns do not currently contribute to data file bounds metrics.

4. **Without geoarrow-pyarrow**: When the `geoarrow-pyarrow` package is not installed, geometry and geography columns are stored as binary without GeoArrow extension type metadata. The Iceberg schema preserves type information, but other tools reading the Parquet files directly may not recognize them as spatial types. Install with `pip install pyiceberg[geoarrow]` for full GeoArrow support.

## Format Version

Geometry and geography types require Iceberg format version 3:

```python
from pyiceberg.table import TableProperties

# `catalog` is assumed to be an already-loaded catalog, and `schema`
# the schema declared in "Declaring Columns" above.
# Creating a v3 table
table = catalog.create_table(
    identifier="db.spatial_table",
    schema=schema,
    properties={
        TableProperties.FORMAT_VERSION: "3"
    }
)
```

Attempting to use these types with format version 1 or 2 will raise a validation error.

pyiceberg/io/pyarrow.py

Lines changed: 23 additions & 12 deletions
@@ -801,24 +801,35 @@ def visit_binary(self, _: BinaryType) -> pa.DataType:
         return pa.large_binary()
 
     def visit_geometry(self, geometry_type: GeometryType) -> pa.DataType:
-        """Convert geometry type to PyArrow binary.
+        """Convert geometry type to PyArrow type.
 
-        Note: PyArrow 21.0.0+ supports native GEOMETRY logical type from Arrow GEO.
-        For now, we use large_binary which stores WKB bytes.
-        Future enhancement: detect PyArrow version and use pa.geometry() when available.
+        When geoarrow-pyarrow is available, returns a GeoArrow WKB extension type
+        with CRS metadata. Otherwise, falls back to large_binary which stores WKB bytes.
         """
-        # TODO: When PyArrow 21.0.0+ is available, use pa.geometry() with CRS metadata
-        return pa.large_binary()
+        try:
+            import geoarrow.pyarrow as ga
+
+            return ga.wkb().with_crs(geometry_type.crs)
+        except ImportError:
+            return pa.large_binary()
 
     def visit_geography(self, geography_type: GeographyType) -> pa.DataType:
-        """Convert geography type to PyArrow binary.
+        """Convert geography type to PyArrow type.
 
-        Note: PyArrow 21.0.0+ supports native GEOGRAPHY logical type from Arrow GEO.
-        For now, we use large_binary which stores WKB bytes.
-        Future enhancement: detect PyArrow version and use pa.geography() when available.
+        When geoarrow-pyarrow is available, returns a GeoArrow WKB extension type
+        with CRS and edge type metadata. Otherwise, falls back to large_binary which stores WKB bytes.
         """
-        # TODO: When PyArrow 21.0.0+ is available, use pa.geography() with CRS and algorithm metadata
-        return pa.large_binary()
+        try:
+            import geoarrow.pyarrow as ga
+
+            wkb_type = ga.wkb().with_crs(geography_type.crs)
+            # Map Iceberg algorithm to GeoArrow edge type
+            if geography_type.algorithm == "spherical":
+                wkb_type = wkb_type.with_edge_type(ga.EdgeType.SPHERICAL)
+            # "planar" is the default edge type in GeoArrow, no need to set explicitly
+            return wkb_type
+        except ImportError:
+            return pa.large_binary()
 
 
 def _convert_scalar(value: Any, iceberg_type: IcebergType) -> pa.scalar:
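
In `visit_geography` above, only `"spherical"` is handled explicitly; any other algorithm value falls through to GeoArrow's default edge type. One possible way to make that mapping explicit is sketched below. This is not what the commit does: `EDGE_TYPE_BY_ALGORITHM` and `edge_type_for` are hypothetical names, and plain strings stand in for `ga.EdgeType` members so the sketch runs without `geoarrow-pyarrow` installed.

```python
from typing import Optional

# Hypothetical lookup table. Strings stand in for geoarrow's EdgeType members;
# None means "keep GeoArrow's default (planar) edge type".
EDGE_TYPE_BY_ALGORITHM: dict[str, Optional[str]] = {
    "spherical": "SPHERICAL",
    "planar": None,
}


def edge_type_for(algorithm: str) -> Optional[str]:
    """Return the edge type to set for an algorithm, or None to keep the default."""
    try:
        return EDGE_TYPE_BY_ALGORITHM[algorithm]
    except KeyError:
        # Fail loudly instead of silently treating unknown algorithms as planar.
        raise ValueError(f"Unsupported geography algorithm: {algorithm!r}") from None


print(edge_type_for("spherical"))
print(edge_type_for("planar"))
```

Whether an unknown algorithm should raise or fall back to the default is a design decision for the maintainers; the sketch only shows what the stricter variant could look like.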

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -96,6 +96,7 @@ hf = ["huggingface-hub>=0.24.0"]
 pyiceberg-core = ["pyiceberg-core>=0.5.1,<0.8.0"]
 datafusion = ["datafusion>=45,<49"]
 gcp-auth = ["google-auth>=2.4.0"]
+geoarrow = ["geoarrow-pyarrow>=0.2.0"]
 
 [dependency-groups]
 dev = [

tests/io/test_pyarrow.py

Lines changed: 104 additions & 0 deletions
@@ -98,6 +98,8 @@
     DoubleType,
     FixedType,
     FloatType,
+    GeographyType,
+    GeometryType,
     IntegerType,
     ListType,
     LongType,
@@ -596,6 +598,108 @@ def test_binary_type_to_pyarrow() -> None:
     assert visit(iceberg_type, _ConvertToArrowSchema()) == pa.large_binary()
 
 
+def test_geometry_type_to_pyarrow_without_geoarrow() -> None:
+    """Test geometry type falls back to large_binary when geoarrow is not available."""
+    import sys
+
+    iceberg_type = GeometryType()
+
+    # Remove geoarrow from sys.modules if present and block re-import
+    saved_modules = {}
+    for mod_name in list(sys.modules.keys()):
+        if mod_name.startswith("geoarrow"):
+            saved_modules[mod_name] = sys.modules.pop(mod_name)
+
+    import builtins
+
+    original_import = builtins.__import__
+
+    def mock_import(name: str, *args: Any, **kwargs: Any) -> Any:
+        if name.startswith("geoarrow"):
+            raise ImportError(f"No module named '{name}'")
+        return original_import(name, *args, **kwargs)
+
+    try:
+        builtins.__import__ = mock_import
+        result = visit(iceberg_type, _ConvertToArrowSchema())
+        assert result == pa.large_binary()
+    finally:
+        builtins.__import__ = original_import
+        sys.modules.update(saved_modules)
+
+
+def test_geography_type_to_pyarrow_without_geoarrow() -> None:
+    """Test geography type falls back to large_binary when geoarrow is not available."""
+    import sys
+
+    iceberg_type = GeographyType()
+
+    # Remove geoarrow from sys.modules if present and block re-import
+    saved_modules = {}
+    for mod_name in list(sys.modules.keys()):
+        if mod_name.startswith("geoarrow"):
+            saved_modules[mod_name] = sys.modules.pop(mod_name)
+
+    import builtins
+
+    original_import = builtins.__import__
+
+    def mock_import(name: str, *args: Any, **kwargs: Any) -> Any:
+        if name.startswith("geoarrow"):
+            raise ImportError(f"No module named '{name}'")
+        return original_import(name, *args, **kwargs)
+
+    try:
+        builtins.__import__ = mock_import
+        result = visit(iceberg_type, _ConvertToArrowSchema())
+        assert result == pa.large_binary()
+    finally:
+        builtins.__import__ = original_import
+        sys.modules.update(saved_modules)
+
+
+def test_geometry_type_to_pyarrow_with_geoarrow() -> None:
+    """Test geometry type uses geoarrow WKB extension type when available."""
+    pytest.importorskip("geoarrow.pyarrow")
+    import geoarrow.pyarrow as ga
+
+    # Test default CRS
+    iceberg_type = GeometryType()
+    result = visit(iceberg_type, _ConvertToArrowSchema())
+    expected = ga.wkb().with_crs("OGC:CRS84")
+    assert result == expected
+
+    # Test custom CRS
+    iceberg_type_custom = GeometryType("EPSG:4326")
+    result_custom = visit(iceberg_type_custom, _ConvertToArrowSchema())
+    expected_custom = ga.wkb().with_crs("EPSG:4326")
+    assert result_custom == expected_custom
+
+
+def test_geography_type_to_pyarrow_with_geoarrow() -> None:
+    """Test geography type uses geoarrow WKB extension type with edge type when available."""
+    pytest.importorskip("geoarrow.pyarrow")
+    import geoarrow.pyarrow as ga
+
+    # Test default (spherical algorithm)
+    iceberg_type = GeographyType()
+    result = visit(iceberg_type, _ConvertToArrowSchema())
+    expected = ga.wkb().with_crs("OGC:CRS84").with_edge_type(ga.EdgeType.SPHERICAL)
+    assert result == expected
+
+    # Test custom CRS with spherical
+    iceberg_type_custom = GeographyType("EPSG:4326", "spherical")
+    result_custom = visit(iceberg_type_custom, _ConvertToArrowSchema())
+    expected_custom = ga.wkb().with_crs("EPSG:4326").with_edge_type(ga.EdgeType.SPHERICAL)
+    assert result_custom == expected_custom
+
+    # Test planar algorithm (no edge type set, uses default)
+    iceberg_type_planar = GeographyType("OGC:CRS84", "planar")
+    result_planar = visit(iceberg_type_planar, _ConvertToArrowSchema())
+    expected_planar = ga.wkb().with_crs("OGC:CRS84")
+    assert result_planar == expected_planar
+
+
 def test_struct_type_to_pyarrow(table_schema_simple: Schema) -> None:
     expected = pa.struct(
         [
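
The two `without_geoarrow` tests above repeat the same hand-rolled `builtins.__import__` patch. A shorter alternative, sketched here under the assumption that the code under test performs the import lazily inside a `try`/`except ImportError`, is to place `None` entries in `sys.modules`, which makes Python raise `ModuleNotFoundError` for those names. `hide_modules` and `storage_kind` are hypothetical names; `storage_kind` is a stand-in for the visitor so the sketch runs without PyIceberg or PyArrow.

```python
import sys
from contextlib import contextmanager
from typing import Iterator
from unittest import mock


def storage_kind() -> str:
    """Stand-in for the visitor: report which storage type would be chosen."""
    try:
        import geoarrow.pyarrow  # noqa: F401

        return "geoarrow-wkb"
    except ImportError:
        return "large-binary"


@contextmanager
def hide_modules(*names: str) -> Iterator[None]:
    """Make `import <name>` fail for the given module names inside the block."""
    # A None entry in sys.modules causes the import system to raise
    # ModuleNotFoundError; patch.dict restores sys.modules on exit.
    with mock.patch.dict(sys.modules, {name: None for name in names}):
        yield


with hide_modules("geoarrow", "geoarrow.pyarrow"):
    print(storage_kind())
```

Because `mock.patch.dict` restores the original `sys.modules` contents on exit, no manual `finally` block is needed, and the helper can be shared by both fallback tests.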
