Commit e9011e2

feat: auto-register ObjectStore and accept it in read/register methods
Closes #899. Users no longer need to call `register_object_store()` before reading remote files. Two complementary mechanisms are provided.

**Auto-registration from URL scheme**

`try_register_url_store()` is called inside every `read_*` / `register_*` method. It parses the path, detects the scheme (s3, gs, az/abfss, http/https), and builds an appropriate `ObjectStore` from environment variables. An existing registration is never overwritten, so an explicit `register_object_store()` call still takes precedence. Anonymous S3 access is enabled via `AWS_SKIP_SIGNATURE=true/1`, which avoids EC2 IMDS timeouts when not running on AWS.

**`object_store` parameter on read/register methods**

All eight `read_*` / `register_*` methods (`read_parquet`, `register_parquet`, `read_csv`, `register_csv`, `read_json`, `register_json`, `read_avro`, `register_avro`) now accept an optional `object_store` keyword argument. Passing a store instance registers it for the URL immediately, with no separate call required:

```python
from datafusion.object_store import S3Store

store = S3Store("my-bucket", region="us-east-1", skip_signature=True)
df = ctx.read_parquet("s3://my-bucket/data.parquet", object_store=store)
```

**pyo3-object-store integration**

Replaced the hand-rolled `store.rs` Python classes with `pyo3-object_store 0.9` (compatible with object_store 0.13), which provides richer, actively maintained Python builders for every backend. The `datafusion.object_store` module now exposes `S3Store`, `GCSStore`, `AzureStore`, `HTTPStore`, `LocalStore`, `MemoryStore`, and `from_url`. Legacy names (`AmazonS3`, `GoogleCloud`, `MicrosoftAzure`, `Http`, `LocalFileSystem`) are kept as backward-compatible aliases.

`register_object_store(url, store)` now takes a full URL prefix and a `PyObjectStore` instead of the old `(scheme, StorageContexts, host)` triple, matching the pattern suggested in #899.
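The two rules above (build a store from the URL scheme, never overwrite an explicit registration) can be sketched in plain Python. This is a hedged illustration only: the real logic lives in Rust inside `try_register_url_store()`, and the `SCHEME_TO_STORE` mapping below is assumed from the schemes listed in this commit message.

```python
from urllib.parse import urlparse

# Assumed mapping from the schemes named above to the new builder classes.
SCHEME_TO_STORE = {
    "s3": "S3Store",
    "gs": "GCSStore",
    "az": "AzureStore",
    "abfss": "AzureStore",
    "http": "HTTPStore",
    "https": "HTTPStore",
}


class StoreRegistry:
    """Toy registry mirroring the precedence rule described above."""

    def __init__(self):
        self._stores = {}

    def register(self, prefix, store):
        # Explicit register_object_store(): always takes precedence.
        self._stores[prefix] = store

    def try_register_url_store(self, path):
        # Auto-registration: fill a gap, never overwrite an entry.
        parsed = urlparse(path)
        builder = SCHEME_TO_STORE.get(parsed.scheme)
        prefix = f"{parsed.scheme}://{parsed.netloc}/"
        if builder is not None:
            self._stores.setdefault(prefix, builder)
        return self._stores.get(prefix)


reg = StoreRegistry()
reg.register("s3://bucket/", "explicit S3Store")
print(reg.try_register_url_store("s3://bucket/data.parquet"))  # explicit S3Store
print(reg.try_register_url_store("gs://other/data.csv"))       # GCSStore
```

Using `setdefault` (first writer wins) is what makes an explicit `register_object_store()` call shadow any later auto-registration for the same prefix.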
**Tests**

Added integration tests (`@pytest.mark.integration`):

- `test_read_http_csv` - reads CSV from GitHub raw HTTPS
- `test_read_https_parquet` - reads Parquet from Apache parquet-testing
- `test_read_s3_parquet_explicit` - passes `S3Store` via `object_store=`
- `test_read_s3_parquet_auto` - uses `AWS_SKIP_SIGNATURE=true` env var
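The env-var switch exercised by `test_read_s3_parquet_auto` accepts `true` or `1`. A minimal sketch of that check (the exact parsing done in the Rust code is an assumption here):

```python
import os


def skip_signature_enabled(env=None):
    """Return True when AWS_SKIP_SIGNATURE is set to "true" or "1"."""
    env = os.environ if env is None else env
    return env.get("AWS_SKIP_SIGNATURE", "").strip().lower() in ("true", "1")


print(skip_signature_enabled({"AWS_SKIP_SIGNATURE": "true"}))  # True
print(skip_signature_enabled({"AWS_SKIP_SIGNATURE": "1"}))     # True
print(skip_signature_enabled({"AWS_SKIP_SIGNATURE": "no"}))    # False
print(skip_signature_enabled({}))                              # False
```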
1 parent be8dd9d commit e9011e2

File tree

10 files changed (+401, -68 lines)


Cargo.lock

Lines changed: 54 additions & 6 deletions
Generated file; diff not rendered by default.

Cargo.toml

Lines changed: 1 addition & 0 deletions
```diff
@@ -56,6 +56,7 @@ async-trait = "0.1.89"
 futures = "0.3"
 cstr = "0.2"
 object_store = { version = "0.13.1" }
+pyo3-object_store = { version = "0.9" }
 url = "2"
 log = "0.4.29"
 parking_lot = "0.12"
```

crates/core/Cargo.toml

Lines changed: 1 addition & 0 deletions
```diff
@@ -63,6 +63,7 @@ async-trait = { workspace = true }
 futures = { workspace = true }
 cstr = { workspace = true }
 object_store = { workspace = true, features = ["aws", "gcp", "azure", "http"] }
+pyo3-object_store = { workspace = true }
 url = { workspace = true }
 log = { workspace = true }
 parking_lot = { workspace = true }
```
