
Commit 8cc23f3

feat: auto-register ObjectStore and accept it in read/register methods
Closes #899. Users no longer need to call `register_object_store()` before reading remote files. Two complementary mechanisms are provided:

**Auto-registration from URL scheme**

`try_register_url_store()` is called inside every `read_*` / `register_*` method. It parses the path, detects the scheme (s3, gs, az/abfss, http/https), and builds an appropriate `ObjectStore` from environment variables. An existing registration is never overwritten, so an explicit `register_object_store()` call still takes precedence. Anonymous S3 access is enabled via `AWS_SKIP_SIGNATURE=true/1`, which avoids EC2 IMDS timeouts when not running on AWS.

**`object_store` parameter on read/register methods**

All eight `read_*` / `register_*` methods (`read_parquet`, `register_parquet`, `read_csv`, `register_csv`, `read_json`, `register_json`, `read_avro`, `register_avro`) now accept an optional `object_store` keyword argument. Passing a store instance registers it for the URL immediately, with no separate call required:

```python
from datafusion.object_store import S3Store

store = S3Store("my-bucket", region="us-east-1", skip_signature=True)
df = ctx.read_parquet("s3://my-bucket/data.parquet", object_store=store)
```

**pyo3-object-store integration**

Replaced the hand-rolled `store.rs` Python classes with `pyo3-object_store 0.9` (compatible with object_store 0.13), which provides richer, actively maintained Python builders for every backend. The `datafusion.object_store` module now exposes `S3Store`, `GCSStore`, `AzureStore`, `HTTPStore`, `LocalStore`, `MemoryStore`, and `from_url`. Legacy names (`AmazonS3`, `GoogleCloud`, `MicrosoftAzure`, `Http`, `LocalFileSystem`) are kept as backward-compatible aliases.

`register_object_store(url, store)` now takes a full URL prefix and a `PyObjectStore` instead of the old `(scheme, StorageContexts, host)` triple, matching the pattern suggested in #899.
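The scheme-detection step that `try_register_url_store()` performs can be sketched in plain Python. This is an illustrative sketch only: `detect_store_kind` and the mapping table below are assumptions for exposition, not the actual implementation.

```python
from urllib.parse import urlparse

# Scheme -> store-builder mapping mirroring the commit description.
# The dict and function names here are illustrative, not part of the API.
SUPPORTED_SCHEMES = {
    "s3": "S3Store",
    "gs": "GCSStore",
    "az": "AzureStore",
    "abfss": "AzureStore",
    "http": "HTTPStore",
    "https": "HTTPStore",
}

def detect_store_kind(path):
    """Return the store builder name for a URL, or None for local paths."""
    scheme = urlparse(path).scheme
    return SUPPORTED_SCHEMES.get(scheme)

print(detect_store_kind("s3://my-bucket/data.parquet"))  # S3Store
print(detect_store_kind("data/local.parquet"))           # None
```

Because an existing registration is never overwritten, such a lookup only runs when the URL prefix has no store registered yet.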
**Tests**

Added integration tests (`@pytest.mark.integration`):

- `test_read_http_csv`: reads CSV from GitHub raw HTTPS
- `test_read_https_parquet`: reads Parquet from Apache parquet-testing
- `test_read_s3_parquet_explicit`: passes `S3Store` via `object_store=`
- `test_read_s3_parquet_auto`: uses `AWS_SKIP_SIGNATURE=true` env var
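The `AWS_SKIP_SIGNATURE=true/1` convention exercised by `test_read_s3_parquet_auto` can be parsed as below; `skip_signature_enabled` is a hypothetical helper written for illustration, not code from this commit:

```python
import os

def skip_signature_enabled(env=None):
    """Treat AWS_SKIP_SIGNATURE set to "true" or "1" (case-insensitive) as enabled."""
    # Hypothetical helper; the real crate reads this via its own env handling.
    env = os.environ if env is None else env
    return env.get("AWS_SKIP_SIGNATURE", "").strip().lower() in ("true", "1")

print(skip_signature_enabled({"AWS_SKIP_SIGNATURE": "TRUE"}))  # True
print(skip_signature_enabled({"AWS_SKIP_SIGNATURE": "0"}))     # False
```

Enabling this flag skips request signing entirely, so no credential lookup (and no EC2 IMDS probe) is attempted for public buckets.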
1 parent be8dd9d commit 8cc23f3

76 files changed (+543, -148 lines)


Cargo.lock

Lines changed: 54 additions & 6 deletions

Cargo.toml

Lines changed: 1 addition & 0 deletions
```diff
@@ -56,6 +56,7 @@ async-trait = "0.1.89"
 futures = "0.3"
 cstr = "0.2"
 object_store = { version = "0.13.1" }
+pyo3-object_store = { version = "0.9" }
 url = "2"
 log = "0.4.29"
 parking_lot = "0.12"
```

benchmarks/db-benchmark/groupby-datafusion.py

Lines changed: 3 additions & 2 deletions
```diff
@@ -20,8 +20,10 @@
 import timeit
 from pathlib import Path
 
-import datafusion as df
 import pyarrow as pa
+from pyarrow import csv as pacsv
+
+import datafusion as df
 from datafusion import (
     RuntimeEnvBuilder,
     SessionConfig,
@@ -31,7 +33,6 @@
 from datafusion import (
     functions as f,
 )
-from pyarrow import csv as pacsv
 
 print("# groupby-datafusion.py", flush=True)
```

benchmarks/db-benchmark/join-datafusion.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -20,10 +20,11 @@
 import timeit
 from pathlib import Path
 
+from pyarrow import csv as pacsv
+
 import datafusion as df
 from datafusion import col
 from datafusion import functions as f
-from pyarrow import csv as pacsv
 
 print("# join-datafusion.py", flush=True)
```

benchmarks/max_cpu_usage.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -47,6 +47,7 @@
 import time
 
 import pyarrow as pa
+
 from datafusion import SessionConfig, SessionContext, col
 from datafusion import functions as f
```

conftest.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -17,9 +17,10 @@
 
 """Pytest configuration for doctest namespace injection."""
 
-import datafusion as dfn
 import numpy as np
 import pytest
+
+import datafusion as dfn
 from datafusion import col, lit
 from datafusion import functions as F
```

crates/core/Cargo.toml

Lines changed: 1 addition & 0 deletions
```diff
@@ -63,6 +63,7 @@ async-trait = { workspace = true }
 futures = { workspace = true }
 cstr = { workspace = true }
 object_store = { workspace = true, features = ["aws", "gcp", "azure", "http"] }
+pyo3-object_store = { workspace = true }
 url = { workspace = true }
 log = { workspace = true }
 parking_lot = { workspace = true }
```
