Commit e9011e2
feat: auto-register ObjectStore and accept it in read/register methods
Closes #899. Users no longer need to call `register_object_store()`
before reading remote files. Two complementary mechanisms are provided:
**Auto-registration from URL scheme**
`try_register_url_store()` is called inside every `read_*` /
`register_*` method. It parses the path, detects the scheme (s3, gs,
az/abfss, http/https) and builds an appropriate `ObjectStore` from
environment variables. An existing registration is never overwritten, so
an explicit `register_object_store()` call still takes precedence.
Setting `AWS_SKIP_SIGNATURE=true` (or `1`) enables anonymous S3 access
and avoids EC2 IMDS credential-lookup timeouts when not running on AWS.
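The precedence rule above (explicit registration always wins, auto-registration only fills gaps) can be sketched in plain Python; the registry dict and the `build_store_from_env` callback here are illustrative stand-ins, not the actual internals:

```python
from urllib.parse import urlparse

# Stand-in for the session context's store registry: url prefix -> store.
registered = {}

def try_register_url_store(path, build_store_from_env):
    # Detect a remote scheme; local paths need no object store.
    parsed = urlparse(path)
    if parsed.scheme not in ("s3", "gs", "az", "abfss", "http", "https"):
        return None
    prefix = f"{parsed.scheme}://{parsed.netloc}"
    # Never overwrite: an explicit register_object_store() call wins.
    if prefix not in registered:
        registered[prefix] = build_store_from_env(parsed.scheme)
    return registered[prefix]

# An explicit registration survives the auto-registration path:
registered["s3://my-bucket"] = "explicit-store"
store = try_register_url_store(
    "s3://my-bucket/data.parquet", lambda s: f"{s}-store-from-env"
)
```

Here `store` resolves to `"explicit-store"`, while a first read from an unregistered `gs://` URL would build and cache a store from the environment.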
**`object_store` parameter on read/register methods**
All eight `read_*` / `register_*` methods (`read_parquet`,
`register_parquet`, `read_csv`, `register_csv`, `read_json`,
`register_json`, `read_avro`, `register_avro`) now accept an optional
`object_store` keyword argument. Passing a store instance registers it
for the URL immediately, with no separate call required:
```python
from datafusion.object_store import S3Store
store = S3Store("my-bucket", region="us-east-1", skip_signature=True)
df = ctx.read_parquet("s3://my-bucket/data.parquet", object_store=store)
```
**pyo3-object-store integration**
Replaced the hand-rolled `store.rs` Python classes with
`pyo3-object_store 0.9` (compatible with object_store 0.13), which
provides richer, actively maintained Python builders for every backend.
The `datafusion.object_store` module now exposes `S3Store`, `GCSStore`,
`AzureStore`, `HTTPStore`, `LocalStore`, `MemoryStore`, and `from_url`.
Legacy names (`AmazonS3`, `GoogleCloud`, `MicrosoftAzure`, `Http`,
`LocalFileSystem`) are kept as backward-compatible aliases.
`register_object_store(url, store)` now takes a full URL prefix and a
`PyObjectStore` instead of the old `(scheme, StorageContexts, host)`
triple, matching the pattern suggested in #899.
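The shape of the new registration API can be sketched as follows; the registry dict and the longest-prefix `resolve()` rule are hypothetical illustrations of the URL-prefix matching, not the actual DataFusion lookup code:

```python
# Stand-in registry: full URL prefix -> store object.
stores = {}

def register_object_store(url_prefix, store):
    # New shape: a full URL prefix plus a store object, replacing the
    # old (scheme, StorageContexts, host) triple.
    stores[url_prefix] = store

def resolve(path):
    # Hypothetical lookup rule: the longest registered prefix that
    # matches the path wins.
    matches = [p for p in stores if path.startswith(p)]
    return stores[max(matches, key=len)] if matches else None

register_object_store("s3://my-bucket/", "bucket-store")
```

With this registration, any path under `s3://my-bucket/` resolves to the store, while paths under other prefixes fall through to auto-registration.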
**Tests**
Added integration tests (`@pytest.mark.integration`):
- `test_read_http_csv` - reads CSV from GitHub raw HTTPS
- `test_read_https_parquet` - reads Parquet from Apache parquet-testing
- `test_read_s3_parquet_explicit` - passes `S3Store` via `object_store=`
- `test_read_s3_parquet_auto` - uses `AWS_SKIP_SIGNATURE=true` env var
10 files changed, +401 −68 lines