| 1 | +# AGENTS Instructions |
| 2 | + |
| 3 | +This repository contains Python bindings for Rust's DataFusion. |
| 4 | + |
| 5 | +## Development workflow |
| 6 | +- Ensure git submodules are initialized: `git submodule update --init`. |
| 7 | +- Build the Rust extension before running tests: |
| 8 | + - `uv run --no-project maturin develop --uv` |
| 9 | +- Run tests with pytest: |
| 10 | + - `uv run --no-project pytest .` |
| 11 | + |
| 12 | +## Linting and formatting |
| 13 | +- Use pre-commit for linting/formatting. |
| 14 | +- Run hooks for changed files before committing: |
| 15 | + - `pre-commit run --files <files>` |
| 16 | + - or `pre-commit run --all-files` |
| 17 | +- Hooks enforce: |
| 18 | + - Python linting/formatting via Ruff |
| 19 | + - Rust formatting via `cargo fmt` |
| 20 | + - Rust linting via `cargo clippy` |
| 21 | + |
| 22 | +## Notes |
| 23 | +- The repository mixes Python and Rust; ensure changes build for both languages. |
| 24 | +- If adding new dependencies, update `pyproject.toml` and run `uv sync --dev --no-install-package datafusion`. |
| 25 | + |
| 26 | +## Helper functions |
| 27 | +- `python/datafusion/io.py` offers global context readers: |
| 28 | + - `read_parquet` |
| 29 | + - `read_json` |
| 30 | + - `read_csv` |
| 31 | + - `read_avro` |
| 32 | +- `python/datafusion/user_defined.py` exports convenience creators for user-defined functions: |
| 33 | + - `udf` (scalar) |
| 34 | + - `udaf` (aggregate) |
| 35 | + - `udwf` (window) |
| 36 | + - `udtf` (table) |
| 37 | +- `python/datafusion/col.py` exposes the `Col` helper with `col` and `column` instances for building column expressions using attribute access. |
| 38 | +- `python/datafusion/catalog.py` provides Python-based catalog and schema providers. |
| 39 | +- `python/datafusion/object_store.py` exposes object store connectors: `AmazonS3`, `GoogleCloud`, `MicrosoftAzure`, `LocalFileSystem`, and `Http`. |
| 40 | +- `python/datafusion/unparser.py` converts logical plans back to SQL via the `Dialect` and `Unparser` classes. |
| 41 | +- `python/datafusion/dataframe_formatter.py` offers configurable HTML and string formatting for DataFrames (replaces the deprecated `html_formatter.py`). |
| 42 | +- `python/tests/generic.py` includes utilities for test data generation: |
| 43 | + - `data` |
| 44 | + - `data_with_nans` |
| 45 | + - `data_datetime` |
| 46 | + - `data_date32` |
| 47 | + - `data_timedelta` |
| 48 | + - `data_binary_other` |
| 49 | + - `write_parquet` |
| 50 | +- `python/tests/conftest.py` defines reusable pytest fixtures: |
| 51 | + - `ctx` creates a `SessionContext`. |
| 52 | + - `database` registers a sample CSV dataset. |
| 53 | +- `src/dataframe.rs` provides the `collect_record_batches_to_display` helper to fetch the first non-empty record batch and flag if more are available. |
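
The last bullet describes `collect_record_batches_to_display` only in prose, so here is a rough Python sketch of the behaviour it states: skip empty batches, return the first non-empty one, and report whether more data follows. This is not the Rust implementation in `src/dataframe.rs`. The function name `first_non_empty_batch`, the list-of-dicts stand-in for a record batch, and the choice to count only later non-empty batches as "more" are all assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical stand-in for an Arrow RecordBatch: a list of row dicts.
Batch = List[dict]


def first_non_empty_batch(batches: Iterable[Batch]) -> Tuple[Optional[Batch], bool]:
    """Return (first non-empty batch, has_more).

    has_more is True when at least one later batch also contains rows
    (an assumption about what "more are available" means).
    """
    it = iter(batches)
    for batch in it:
        if len(batch) > 0:
            # Scan the remainder of the stream, stopping at the first
            # later batch that has rows.
            has_more = any(len(later) > 0 for later in it)
            return batch, has_more
    # Every batch was empty, or there were no batches at all.
    return None, False
```

With the input `[[], [{"a": 1}], [{"a": 2}]]` the sketch returns the second batch together with `True`; a real implementation would likely also cap the rows or bytes it collects for display.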