Commit c8ef515

UNPICK added AGENTS.md

1 parent bf22c1d commit c8ef515

File tree

1 file changed: +72 −0 lines changed

AGENTS.md

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
# AGENTS Instructions

This repository contains Python bindings for Rust's DataFusion.
## Development workflow

- Ensure git submodules are initialized: `git submodule update --init`.
- Build the Rust extension before running tests (a quick smoke test follows this list):
  - `uv run --no-project maturin develop --uv`
- Run tests with pytest:
  - `uv run --no-project pytest .`
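Once the extension builds, a minimal smoke test confirms the bindings import and run. This is a sketch using the public `SessionContext` API; the query itself is only illustrative:

```python
from datafusion import SessionContext

# If the compiled extension is importable, a trivial query should succeed.
ctx = SessionContext()
df = ctx.sql("SELECT 1 AS one")
print(df.collect())
```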
## Linting and formatting

- Use pre-commit for linting/formatting.
- Run hooks for changed files before committing:
  - `pre-commit run --files <files>`
  - or `pre-commit run --all-files`
- Hooks enforce:
  - Python linting/formatting via Ruff
  - Rust formatting via `cargo fmt`
  - Rust linting via `cargo clippy`
## Notes

- The repository mixes Python and Rust; ensure changes build for both languages.
- If adding new dependencies, update `pyproject.toml` and run `uv sync --dev --no-install-package datafusion`.
## Refactoring opportunities

- Avoid using private or low-level APIs when a stable, public helper exists. For example,
  automated refactors should spot and replace uses like this:

```python
import pyarrow as pa

# `df` is any object exposing the Arrow C stream protocol,
# e.g. a DataFusion DataFrame.

# Before (uses a private/low-level PyArrow API)
reader = pa.RecordBatchReader._import_from_c_capsule(
    df.__arrow_c_stream__()
)

# After (uses the public API)
reader = pa.RecordBatchReader.from_stream(df)
```

Look for call chains that invoke `_import_from_c_capsule` with `__arrow_c_stream__()`
and prefer `from_stream(df)` instead. This improves readability and avoids
relying on private PyArrow internals that may change.
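One way an automated refactor could locate offending call sites is a small scan built on the standard-library `ast` module. This is a hypothetical sketch, not a hook that exists in the repository:

```python
import ast
import sys

# Hypothetical sketch: flag calls to the private
# RecordBatchReader._import_from_c_capsule helper so they can be
# rewritten to the public from_stream API.
source_path = sys.argv[1]
tree = ast.parse(open(source_path, encoding="utf-8").read())
for node in ast.walk(tree):
    if (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr == "_import_from_c_capsule"
    ):
        print(f"{source_path}:{node.lineno}: prefer RecordBatchReader.from_stream(...)")
```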
## Helper Functions

- `python/datafusion/io.py` offers global context readers (see the first sketch after this list):
  - `read_parquet`
  - `read_json`
  - `read_csv`
  - `read_avro`
- `python/datafusion/user_defined.py` exports convenience creators for user-defined functions (see the second sketch after this list):
  - `udf` (scalar)
  - `udaf` (aggregate)
  - `udwf` (window)
  - `udtf` (table)
- `python/datafusion/col.py` exposes the `Col` helper with `col` and `column` instances for building column expressions using attribute access.
- `python/datafusion/catalog.py` provides Python-based catalog and schema providers.
- `python/datafusion/object_store.py` exposes object store connectors: `AmazonS3`, `GoogleCloud`, `MicrosoftAzure`, `LocalFileSystem`, and `Http`.
- `python/datafusion/unparser.py` converts logical plans back to SQL via the `Dialect` and `Unparser` classes.
- `python/datafusion/dataframe_formatter.py` offers configurable HTML and string formatting for DataFrames (it replaces the deprecated `html_formatter.py`).
- `python/tests/generic.py` includes utilities for test data generation:
  - `data`
  - `data_with_nans`
  - `data_datetime`
  - `data_date32`
  - `data_timedelta`
  - `data_binary_other`
  - `write_parquet`
- `python/tests/conftest.py` defines reusable pytest fixtures:
  - `ctx` creates a `SessionContext`.
  - `database` registers a sample CSV dataset.
- `src/dataframe.rs` provides the `collect_record_batches_to_display` helper to fetch the first non-empty record batch and flag whether more are available.
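A minimal sketch of the global-context readers; the file path is illustrative:

```python
from datafusion.io import read_parquet

# The io helpers run against a shared global SessionContext,
# so no context object needs to be created first.
df = read_parquet("example.parquet")  # illustrative path
df.show()
```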

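And a sketch of the scalar `udf` creator together with the `col` expression helper; the `udf(func, input_types, return_type, volatility)` calling convention follows the datafusion-python docs, and the column name is illustrative:

```python
import pyarrow as pa
import pyarrow.compute as pc
from datafusion import SessionContext, col, udf

# Scalar UDFs receive pyarrow Arrays and return a pyarrow Array.
def double(arr: pa.Array) -> pa.Array:
    return pc.multiply(arr, 2)

double_udf = udf(double, [pa.int64()], pa.int64(), "immutable")

ctx = SessionContext()
df = ctx.from_pydict({"a": [1, 2, 3]})  # illustrative column name
df.select(double_udf(col("a"))).show()
```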