Commit 0d1d7a1

Add RFC
1 parent a7ac0d4 commit 0d1d7a1

1 file changed: 145 additions, 0 deletions
@@ -0,0 +1,145 @@
# RFC: Iceberg v3 Geospatial Primitive Types

## Motivation

Apache Iceberg v3 introduces native geospatial types (`geometry` and `geography`) to support spatial data workloads. These types enable:

1. **Interoperability**: Consistent spatial data representation across Iceberg implementations
2. **Query optimization**: Future support for spatial predicate pushdown
3. **Standards compliance**: Alignment with OGC and ISO spatial data standards

This RFC describes the design and implementation of these types in PyIceberg.

## Scope

**In scope:**
- `geometry(C)` and `geography(C, A)` primitive type definitions
- Type parsing and serialization (round-trip support)
- Avro mapping (WKB bytes)
- PyArrow/Parquet conversion (with version-aware fallback)
- Format version enforcement (v3 required)

**Out of scope (future work):**
- Spatial predicate pushdown (e.g., ST_Contains, ST_Intersects)
- WKB/WKT conversion (requires external dependencies)
- Geometry/geography bounds metrics
- Spatial indexing

## Non-Goals

- Adding heavy dependencies like Shapely, GEOS, or GeoPandas
- Implementing spatial operations or computations
- Supporting format versions < 3

## Design

### Type Parameters

**GeometryType:**
- `crs` (string): Coordinate Reference System, defaults to `"OGC:CRS84"`

**GeographyType:**
- `crs` (string): Coordinate Reference System, defaults to `"OGC:CRS84"`
- `algorithm` (string): Geographic algorithm, defaults to `"spherical"`
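
The parameters and defaults listed here can be pictured as two small immutable records. The following is only a sketch: `GeometrySketch` and `GeographySketch` are hypothetical stand-ins, not the PyIceberg classes, and the use of frozen dataclasses is an assumption made to get equality and hashing for free.

```python
from dataclasses import dataclass

# Defaults taken from the parameter lists in this RFC.
DEFAULT_CRS = "OGC:CRS84"
DEFAULT_ALGORITHM = "spherical"


@dataclass(frozen=True)
class GeometrySketch:
    """Hypothetical stand-in for GeometryType (not the PyIceberg class)."""

    crs: str = DEFAULT_CRS


@dataclass(frozen=True)
class GeographySketch:
    """Hypothetical stand-in for GeographyType (not the PyIceberg class)."""

    crs: str = DEFAULT_CRS
    algorithm: str = DEFAULT_ALGORITHM


default_geometry = GeometrySketch()
custom_geography = GeographySketch(crs="EPSG:4326")

print(default_geometry.crs)        # OGC:CRS84
print(custom_geography.algorithm)  # spherical
```

Frozen dataclasses compare by value, so two instances with the same parameters are equal and hash identically.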

### Type String Format

```python
# Default parameters
"geometry"
"geography"

# With custom CRS
"geometry('EPSG:4326')"
"geography('EPSG:4326')"

# With custom CRS and algorithm
"geography('EPSG:4326', 'planar')"
```
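
To illustrate the round-trip requirement for these strings, here is a minimal parsing and formatting sketch using only the standard library. The names `parse_geo_type` and `format_geo_type` are hypothetical, and the choice to omit default parameters when formatting is an assumption based on the examples above, not a statement about PyIceberg's actual parser.

```python
import re
from typing import Optional, Tuple

DEFAULT_CRS = "OGC:CRS84"
DEFAULT_ALGORITHM = "spherical"

# Type name, optionally followed by ('crs') or ('crs', 'algorithm').
_GEO_TYPE = re.compile(
    r"^(geometry|geography)"
    r"(?:\(\s*'([^']+)'\s*(?:,\s*'([^']+)'\s*)?\))?$"
)


def parse_geo_type(type_str: str) -> Tuple[str, str, Optional[str]]:
    """Parse a type string into (name, crs, algorithm), filling in defaults."""
    match = _GEO_TYPE.match(type_str.strip())
    if match is None:
        raise ValueError(f"Not a geospatial type string: {type_str!r}")
    name, crs, algorithm = match.groups()
    if name == "geometry":
        if algorithm is not None:
            raise ValueError("geometry accepts a CRS only, not an algorithm")
        return name, crs or DEFAULT_CRS, None
    return name, crs or DEFAULT_CRS, algorithm or DEFAULT_ALGORITHM


def format_geo_type(name: str, crs: str, algorithm: Optional[str] = None) -> str:
    """Format (name, crs, algorithm) back into a string, omitting defaults."""
    if name == "geography" and algorithm != DEFAULT_ALGORITHM:
        return f"{name}('{crs}', '{algorithm}')"
    if crs != DEFAULT_CRS:
        return f"{name}('{crs}')"
    return name


for text in ("geometry", "geography('EPSG:4326')", "geography('EPSG:4326', 'planar')"):
    # Each string should survive a parse/format round trip unchanged.
    assert format_geo_type(*parse_geo_type(text)) == text
```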

### Runtime Representation

Values are stored as WKB (Well-Known Binary) bytes at runtime. This matches the Avro and Parquet physical representation per the Iceberg spec.
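
For readers unfamiliar with WKB, the sketch below builds and decodes the bytes for a single 2D point with the standard `struct` module. It is illustrative only: the helper names are hypothetical, only the point geometry type is handled, and PyIceberg itself treats the bytes as opaque.

```python
import struct
from typing import Tuple

WKB_LITTLE_ENDIAN = 1  # byte-order flag: 1 = little endian, 0 = big endian
WKB_POINT = 1          # geometry type code for a 2D point


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (hypothetical helper)."""
    # Layout: 1-byte order flag, 4-byte type code, two 8-byte doubles.
    return struct.pack("<BIdd", WKB_LITTLE_ENDIAN, WKB_POINT, x, y)


def wkb_to_point(wkb: bytes) -> Tuple[float, float]:
    """Decode a WKB point back to (x, y); other geometry types are rejected."""
    endian = "<" if wkb[0] == WKB_LITTLE_ENDIAN else ">"
    (geom_type,) = struct.unpack_from(endian + "I", wkb, 1)
    if geom_type != WKB_POINT:
        raise ValueError(f"Only points are handled in this sketch, got type {geom_type}")
    return struct.unpack_from(endian + "dd", wkb, 5)


value = point_to_wkb(4.9, 52.37)
print(len(value))           # 21
print(wkb_to_point(value))  # (4.9, 52.37)
```

Even this simplest case needs decoding to recover coordinates, which is why anything beyond storing the bytes is left to external libraries.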

### JSON Single-Value Serialization

Per the Iceberg spec, geometry/geography values should be serialized as WKT (Well-Known Text) strings in JSON. However, since we represent values as WKB bytes at runtime, conversion between WKB and WKT would require external dependencies.

**Current behavior:** `NotImplementedError` is raised for JSON serialization/deserialization until a conversion strategy is established.

### Avro Mapping

Both geometry and geography types map to the Avro `bytes` type, consistent with `BinaryType` handling.
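
As a rough picture of the resulting Avro field, the sketch below builds the field definition as a plain dictionary. The helper `geo_field_to_avro` is hypothetical and does not reflect the structure of `pyiceberg/utils/schema_conversion.py`; the `field-id` attribute and the `["null", ...]` union for optional fields follow the usual Iceberg Avro conventions.

```python
import json


def geo_field_to_avro(name: str, field_id: int, required: bool = True) -> dict:
    """Build an Avro field definition for a geometry/geography column (sketch)."""
    # Geometry and geography share the same physical type: WKB in Avro bytes.
    field = {"name": name, "field-id": field_id, "type": "bytes"}
    if not required:
        # Optional fields become a union with null and default to null.
        field["type"] = ["null", "bytes"]
        field["default"] = None
    return field


print(json.dumps(geo_field_to_avro("location", 1)))
print(json.dumps(geo_field_to_avro("boundary", 2, required=False)))
```

The CRS and algorithm parameters do not appear in the Avro type; they remain part of the Iceberg schema.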

### PyArrow/Parquet Mapping

**With geoarrow-pyarrow installed:**
- Geometry types convert to GeoArrow WKB extension type with CRS metadata
- Geography types convert to GeoArrow WKB extension type with CRS and edge type metadata
- Uses `geoarrow.pyarrow.wkb().with_crs()` and `.with_edge_type()` for full GeoArrow compatibility

**Without geoarrow-pyarrow:**
- Geometry and geography types fall back to `pa.large_binary()`
- This provides WKB storage without GEO logical type metadata
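
The fallback described above could look roughly like the following. This is a sketch under assumptions: `geoarrow_available` and `geo_to_arrow_type` are hypothetical names, the `spherical_edges` flag is an illustrative simplification of the `algorithm` parameter, and passing a CRS string directly to `with_crs()` and using `ga.EdgeType.SPHERICAL` assume a reasonably recent geoarrow-pyarrow release.

```python
import importlib.util


def geoarrow_available() -> bool:
    """Return True if the optional geoarrow-pyarrow package can be imported."""
    try:
        return importlib.util.find_spec("geoarrow.pyarrow") is not None
    except ModuleNotFoundError:
        # Raised when the parent package `geoarrow` itself is missing.
        return False


def geo_to_arrow_type(crs: str, spherical_edges: bool = False):
    """Pick an Arrow type for a geometry/geography column (hypothetical helper)."""
    import pyarrow as pa  # imported lazily so this module loads without PyArrow

    if not geoarrow_available():
        # Fallback: plain WKB bytes, no CRS or edge metadata attached.
        return pa.large_binary()

    import geoarrow.pyarrow as ga

    arrow_type = ga.wkb().with_crs(crs)
    if spherical_edges:
        # Geography columns additionally record how edges are interpreted.
        arrow_type = arrow_type.with_edge_type(ga.EdgeType.SPHERICAL)
    return arrow_type
```

Checking availability once and branching keeps the optional dependency out of the import path for users who do not need it.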

## Compatibility

### Format Version

Geometry and geography types require Iceberg format version 3. Using them with format version 1 or 2 raises a validation error via `Schema.check_format_version_compatibility()`.
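
The enforcement rule amounts to comparing each field's minimum format version against the table's format version. The sketch below shows the idea on a plain mapping of field names to type strings; `check_geo_format_version` is a hypothetical helper, and raising `ValueError` is an assumption, not the behavior of `Schema.check_format_version_compatibility()`.

```python
from typing import Dict

GEO_TYPE_NAMES = {"geometry", "geography"}
MIN_GEO_FORMAT_VERSION = 3


def check_geo_format_version(field_types: Dict[str, str], format_version: int) -> None:
    """Raise if a geospatial field is used below format version 3 (sketch)."""
    for field_name, type_str in field_types.items():
        # Strip any parameters, e.g. "geometry('EPSG:4326')" -> "geometry".
        base_type = type_str.split("(", 1)[0].strip()
        if base_type in GEO_TYPE_NAMES and format_version < MIN_GEO_FORMAT_VERSION:
            raise ValueError(
                f"Field {field_name!r} has type {base_type}, which requires "
                f"format version {MIN_GEO_FORMAT_VERSION}; got {format_version}"
            )


fields = {"id": "long", "location": "geometry('EPSG:4326')"}
check_geo_format_version(fields, format_version=3)  # passes silently

try:
    check_geo_format_version(fields, format_version=2)
except ValueError as exc:
    print(exc)
```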

### geoarrow-pyarrow

- **Optional dependency**: Install with `pip install pyiceberg[geoarrow]`
- **Without geoarrow**: Geometry/geography stored as binary columns (WKB)
- **With geoarrow**: Full GeoArrow extension type support with CRS/edge metadata

### Breaking Changes

None. These are new types that do not affect existing functionality.

## Dependency/Versioning

**Required:**
- PyIceberg core (no new dependencies)

**Optional for full functionality:**
- PyArrow 21.0.0+ for native Parquet GEO logical types
- geoarrow-pyarrow (`pip install pyiceberg[geoarrow]`) for GeoArrow extension types with CRS/edge metadata

## Testing Strategy

1. **Unit tests** (`test_types.py`):
   - Type creation with default/custom parameters
   - `__str__` and `__repr__` methods
   - JSON serialization/deserialization round-trip of the type definitions
   - Equality, hashing, and pickling
   - `minimum_format_version()` enforcement

2. **Integration tests** (future):
   - End-to-end table creation with geometry/geography columns
   - Parquet file round-trip with PyArrow
121+
122+
## Known Limitations
123+
124+
1. **No WKB/WKT conversion**: JSON single-value serialization raises `NotImplementedError`
125+
2. **No bounds metrics**: Cannot extract bounds from WKB without parsing
126+
3. **No spatial predicates**: Query optimization for spatial filters not yet implemented
127+
4. **PyArrow < 21.0.0**: Falls back to binary type without GEO metadata
128+
5. **Reverse conversion from Parquet**: Binary columns cannot be distinguished from geometry/geography without Iceberg schema metadata
129+
130+
## File Locations
131+
132+
| Component | File |
133+
|-----------|------|
134+
| Type definitions | `pyiceberg/types.py` |
135+
| Conversions | `pyiceberg/conversions.py` |
136+
| Schema visitors | `pyiceberg/schema.py` |
137+
| Avro conversion | `pyiceberg/utils/schema_conversion.py` |
138+
| PyArrow conversion | `pyiceberg/io/pyarrow.py` |
139+
| Unit tests | `tests/test_types.py` |
140+
141+
## References
142+
143+
- [Iceberg v3 Type Specification](https://iceberg.apache.org/spec/#schemas-and-data-types)
144+
- [Arrow GEO Proposal](https://arrow.apache.org/docs/format/GeoArrow.html)
145+
- [Arrow PR #45459](https://github.com/apache/arrow/pull/45459)
