| 1 | +# RFC: Iceberg v3 Geospatial Primitive Types |
| 2 | + |
| 3 | +## Motivation |
| 4 | + |
| 5 | +Apache Iceberg v3 introduces native geospatial types (`geometry` and `geography`) to support spatial data workloads. These types enable: |
| 6 | + |
| 7 | +1. **Interoperability**: Consistent spatial data representation across Iceberg implementations |
| 8 | +2. **Query optimization**: A foundation for future spatial predicate pushdown |
| 9 | +3. **Standards compliance**: Alignment with OGC and ISO spatial data standards |
| 10 | + |
| 11 | +This RFC describes the design and implementation of these types in PyIceberg. |
| 12 | + |
| 13 | +## Scope |
| 14 | + |
| 15 | +**In scope:** |
| 16 | +- `geometry(C)` and `geography(C, A)` primitive type definitions |
| 17 | +- Type parsing and serialization (round-trip support) |
| 18 | +- Avro mapping (WKB bytes) |
| 19 | +- PyArrow/Parquet conversion (with version-aware fallback) |
| 20 | +- Format version enforcement (v3 required) |
| 21 | + |
| 22 | +**Out of scope (future work):** |
| 23 | +- Spatial predicate pushdown (e.g., ST_Contains, ST_Intersects) |
| 24 | +- WKB/WKT conversion (requires external dependencies) |
| 25 | +- Geometry/geography bounds metrics |
| 26 | +- Spatial indexing |
| 27 | + |
| 28 | +## Non-Goals |
| 29 | + |
| 30 | +- Adding heavy dependencies like Shapely, GEOS, or GeoPandas |
| 31 | +- Implementing spatial operations or computations |
| 32 | +- Supporting format versions < 3 |
| 33 | + |
| 34 | +## Design |
| 35 | + |
| 36 | +### Type Parameters |
| 37 | + |
| 38 | +**GeometryType:** |
| 39 | +- `crs` (string): Coordinate Reference System, defaults to `"OGC:CRS84"` |
| 40 | + |
| 41 | +**GeographyType:** |
| 42 | +- `crs` (string): Coordinate Reference System, defaults to `"OGC:CRS84"` |
| 43 | +- `algorithm` (string): Geographic algorithm, defaults to `"spherical"` |
| 44 | + |
| 45 | +### Type String Format |
| 46 | + |
| 47 | +```python |
| 48 | +# Default parameters |
| 49 | +"geometry" |
| 50 | +"geography" |
| 51 | + |
| 52 | +# With custom CRS |
| 53 | +"geometry('EPSG:4326')" |
| 54 | +"geography('EPSG:4326')" |
| 55 | + |
| 56 | +# With custom CRS and algorithm |
| 57 | +"geography('EPSG:4326', 'planar')" |
| 58 | +``` |
| 59 | + |
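To make the round-trip requirement concrete, here is a minimal sketch of parsing and re-serializing the type strings shown above. The names `GeoTypeSpec`, `parse_geo_type`, and `format_geo_type` are hypothetical and are not taken from the PyIceberg code base; the defaults come from the Type Parameters section, and the sketch assumes CRS and algorithm values never contain a single quote.

```python
import re
from typing import NamedTuple, Optional

DEFAULT_CRS = "OGC:CRS84"
DEFAULT_ALGORITHM = "spherical"

# Accepts: geometry, geometry('<crs>'), geography, geography('<crs>'),
# and geography('<crs>', '<algorithm>').
_GEO_TYPE = re.compile(
    r"^(?P<kind>geometry|geography)"
    r"(?:\(\s*'(?P<crs>[^']+)'\s*(?:,\s*'(?P<algorithm>[^']+)'\s*)?\))?$"
)


class GeoTypeSpec(NamedTuple):
    """Hypothetical parsed form of a geospatial type string."""

    kind: str
    crs: str
    algorithm: Optional[str]  # always None for geometry


def parse_geo_type(type_string: str) -> GeoTypeSpec:
    """Parse a type string, filling in defaults for omitted parameters."""
    match = _GEO_TYPE.match(type_string.strip())
    if match is None:
        raise ValueError(f"Not a geometry/geography type string: {type_string!r}")
    kind = match["kind"]
    if kind == "geometry" and match["algorithm"] is not None:
        raise ValueError("geometry does not take an algorithm parameter")
    crs = match["crs"] or DEFAULT_CRS
    algorithm = (match["algorithm"] or DEFAULT_ALGORITHM) if kind == "geography" else None
    return GeoTypeSpec(kind, crs, algorithm)


def format_geo_type(spec: GeoTypeSpec) -> str:
    """Serialize back to the shortest string that preserves all parameters."""
    if spec.kind == "geography" and spec.algorithm != DEFAULT_ALGORITHM:
        return f"geography('{spec.crs}', '{spec.algorithm}')"
    if spec.crs != DEFAULT_CRS:
        return f"{spec.kind}('{spec.crs}')"
    return spec.kind
```

With this sketch, each of the five strings listed above parses and formats back to itself, and omitted parameters resolve to the documented defaults.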
| 60 | +### Runtime Representation |
| 61 | + |
| 62 | +Values are stored as WKB (Well-Known Binary) bytes at runtime. This matches the Avro and Parquet physical representation per the Iceberg spec. |
| 63 | + |
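For readers unfamiliar with WKB, the following illustration shows what such a `bytes` value looks like for a single 2D point, using only the standard library. It is not part of the proposed implementation (WKB parsing is out of scope for this RFC), and the helper names `point_to_wkb` and `wkb_to_point` are hypothetical.

```python
import struct
from typing import Tuple

WKB_LITTLE_ENDIAN = 1  # byte-order flag: 1 = little-endian, 0 = big-endian
WKB_POINT = 1  # geometry type code for a 2D Point


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB: flag, type code, x, y."""
    return struct.pack("<BIdd", WKB_LITTLE_ENDIAN, WKB_POINT, x, y)


def wkb_to_point(wkb: bytes) -> Tuple[float, float]:
    """Decode a 2D point from WKB, honoring the byte-order flag."""
    endian = "<" if wkb[0] == WKB_LITTLE_ENDIAN else ">"
    (geom_type,) = struct.unpack_from(endian + "I", wkb, 1)
    if geom_type != WKB_POINT:
        raise ValueError(f"Expected a Point (type 1), got type {geom_type}")
    x, y = struct.unpack_from(endian + "dd", wkb, 5)
    return x, y
```

A point therefore occupies 21 bytes: one byte-order byte, a four-byte type code, and two eight-byte doubles. This is the opaque payload that a geometry or geography column carries.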
| 64 | +### JSON Single-Value Serialization |
| 65 | + |
| 66 | +Per the Iceberg spec, geometry/geography values should be serialized as WKT (Well-Known Text) strings in JSON. However, since we represent values as WKB bytes at runtime, conversion between WKB and WKT would require external dependencies. |
| 67 | + |
| 68 | +**Current behavior:** `NotImplementedError` is raised for JSON serialization/deserialization until a conversion strategy is established. |
| 69 | + |
| 70 | +### Avro Mapping |
| 71 | + |
| 72 | +Both geometry and geography types map to the Avro `bytes` type, consistent with `BinaryType` handling. |
| 73 | + |
| 74 | +### PyArrow/Parquet Mapping |
| 75 | + |
| 76 | +**With geoarrow-pyarrow installed:** |
| 77 | +- Geometry types convert to GeoArrow WKB extension type with CRS metadata |
| 78 | +- Geography types convert to GeoArrow WKB extension type with CRS and edge type metadata |
| 79 | +- Uses `geoarrow.pyarrow.wkb().with_crs()` and `.with_edge_type()` for full GeoArrow compatibility |
| 80 | + |
| 81 | +**Without geoarrow-pyarrow:** |
| 82 | +- Geometry and geography types fall back to `pa.large_binary()` |
| 83 | +- This provides WKB storage without GEO logical type metadata |
| 84 | + |
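The fallback described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `pyiceberg/io/pyarrow.py`: the function name `geo_arrow_type` is hypothetical, it assumes `with_crs()` accepts a CRS string, and it assumes only the `"spherical"` algorithm is mapped to a GeoArrow edge type (`ga.EdgeType.SPHERICAL`).

```python
try:
    import geoarrow.pyarrow as ga  # optional dependency
except ImportError:
    ga = None


def geo_arrow_type(crs: str, algorithm=None):
    """Return the Arrow type for a geometry (algorithm=None) or geography column."""
    # PyArrow is imported lazily so this sketch loads even without it installed.
    import pyarrow as pa

    if ga is None:
        # No geoarrow-pyarrow: plain WKB storage without GEO metadata.
        return pa.large_binary()

    arrow_type = ga.wkb().with_crs(crs)
    if algorithm == "spherical":
        # Assumption: only spherical edges are annotated in this sketch.
        arrow_type = arrow_type.with_edge_type(ga.EdgeType.SPHERICAL)
    return arrow_type
```

Catching `ImportError` once at module load keeps the per-column conversion free of repeated import attempts, and the same function serves both types because geometry simply omits the algorithm.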
| 85 | +## Compatibility |
| 86 | + |
| 87 | +### Format Version |
| 88 | + |
| 89 | +Geometry and geography types require Iceberg format version 3. Attempting to use them with format version 1 or 2 will raise a validation error via `Schema.check_format_version_compatibility()`. |
| 90 | + |
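The enforcement rule amounts to comparing each type's minimum format version with the table's format version. The sketch below shows that idea in isolation; `MINIMUM_FORMAT_VERSION` and `check_format_version` are hypothetical stand-ins and do not reproduce the signature of `Schema.check_format_version_compatibility()`.

```python
from typing import Iterable

# Hypothetical lookup: types not listed are assumed valid from version 1.
MINIMUM_FORMAT_VERSION = {"geometry": 3, "geography": 3}


def check_format_version(type_names: Iterable[str], format_version: int) -> None:
    """Raise ValueError if any type requires a newer table format version."""
    for name in type_names:
        required = MINIMUM_FORMAT_VERSION.get(name, 1)
        if format_version < required:
            raise ValueError(
                f"{name} requires format version {required}, "
                f"but the table uses format version {format_version}"
            )
```

For example, `check_format_version(["long", "geometry"], 2)` raises, while the same call with format version 3 returns without error.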
| 91 | +### geoarrow-pyarrow |
| 92 | + |
| 93 | +- **Optional dependency**: Install with `pip install pyiceberg[geoarrow]` |
| 94 | +- **Without geoarrow**: Geometry/geography stored as binary columns (WKB) |
| 95 | +- **With geoarrow**: Full GeoArrow extension type support with CRS/edge metadata |
| 96 | + |
| 97 | +### Breaking Changes |
| 98 | + |
| 99 | +None. These are new types that do not affect existing functionality. |
| 100 | + |
| 101 | +## Dependency/Versioning |
| 102 | + |
| 103 | +**Required:** |
| 104 | +- PyIceberg core (no new dependencies) |
| 105 | + |
| 106 | +**Optional for full functionality:** |
| 107 | +- PyArrow 21.0.0+ for native Parquet GEO logical types, plus geoarrow-pyarrow (`pip install pyiceberg[geoarrow]`) for GeoArrow extension types |
| 108 | + |
| 109 | +## Testing Strategy |
| 110 | + |
| 111 | +1. **Unit tests** (`test_types.py`): |
| 112 | + - Type creation with default/custom parameters |
| 113 | + - `__str__` and `__repr__` methods |
| 114 | +   - Type JSON serialization/deserialization round-trip (the type string, not single values) |
| 115 | + - Equality, hashing, and pickling |
| 116 | + - `minimum_format_version()` enforcement |
| 117 | + |
| 118 | +2. **Integration tests** (future): |
| 119 | + - End-to-end table creation with geometry/geography columns |
| 120 | + - Parquet file round-trip with PyArrow |
| 121 | + |
| 122 | +## Known Limitations |
| 123 | + |
| 124 | +1. **No WKB/WKT conversion**: JSON single-value serialization raises `NotImplementedError` |
| 125 | +2. **No bounds metrics**: Cannot extract bounds from WKB without parsing |
| 126 | +3. **No spatial predicates**: Query optimization for spatial filters not yet implemented |
| 127 | +4. **PyArrow < 21.0.0 or no geoarrow-pyarrow**: Falls back to a binary type (`pa.large_binary()`) without GEO metadata |
| 128 | +5. **Reverse conversion from Parquet**: Binary columns cannot be distinguished from geometry/geography without Iceberg schema metadata |
| 129 | + |
| 130 | +## File Locations |
| 131 | + |
| 132 | +| Component | File | |
| 133 | +|-----------|------| |
| 134 | +| Type definitions | `pyiceberg/types.py` | |
| 135 | +| Conversions | `pyiceberg/conversions.py` | |
| 136 | +| Schema visitors | `pyiceberg/schema.py` | |
| 137 | +| Avro conversion | `pyiceberg/utils/schema_conversion.py` | |
| 138 | +| PyArrow conversion | `pyiceberg/io/pyarrow.py` | |
| 139 | +| Unit tests | `tests/test_types.py` | |
| 140 | + |
| 141 | +## References |
| 142 | + |
| 143 | +- [Iceberg v3 Type Specification](https://iceberg.apache.org/spec/#schemas-and-data-types) |
| 144 | +- [Arrow GEO Proposal](https://arrow.apache.org/docs/format/GeoArrow.html) |
| 145 | +- [Arrow PR #45459](https://github.com/apache/arrow/pull/45459) |