Commit 47fbf1e (parent 950afd9): UNPICK added AGENTS.md

AGENTS.md (97 additions, 0 deletions)
# AGENTS Instructions

This repository contains Python bindings for Rust's DataFusion.

## Development workflow

- Ensure git submodules are initialized: `git submodule update --init`.
- Build the Rust extension before running tests:
  - `uv run --no-project maturin develop --uv`
- Run tests with pytest:
  - `uv run --no-project pytest .`
## Linting and formatting

- Use pre-commit for linting/formatting.
- Run hooks for changed files before committing:
  - `pre-commit run --files <files>`
  - or `pre-commit run --all-files`
- Hooks enforce:
  - Python linting/formatting via Ruff
  - Rust formatting via `cargo fmt`
  - Rust linting via `cargo clippy`
- Ruff rules that frequently fail in this repo:
  - **Import sorting (`I001`)**: Keep import blocks sorted and grouped. Running `ruff check --select I --fix <files>` repairs the order.
  - **Type-checking guards (`TCH001`)**: Place imports that are only needed for typing (e.g., `AggregateUDF`, `ScalarUDF`, `TableFunction`, `WindowUDF`, `NullTreatment`, `DataFrame`) inside an `if TYPE_CHECKING:` block.
  - **Docstring spacing (`D202`, `D205`)**: Separate the docstring summary line from the rest of the docstring with exactly one blank line (`D205`), and leave no blank line between the closing triple quotes and the function body (`D202`).
  - **Ternary suggestions (`SIM108`)**: Prefer a single-line ternary expression over a multi-line `if`/`else` assignment when Ruff suggests one.
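A minimal sketch of code that satisfies these rules in one place (the function and the stdlib `Decimal` import are illustrative only, not names from this repo):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # TCH001: typing-only imports live under TYPE_CHECKING so they are
    # skipped at runtime (a stdlib type is used here purely for illustration).
    from decimal import Decimal


def to_float(value: Decimal, *, absolute: bool = False) -> float:
    """Convert a Decimal-like value to float.

    D205: exactly one blank line separates the summary above from this body,
    and D202: no blank line follows the closing quotes before the code.
    """
    result = float(value)
    # SIM108: a single-line ternary instead of a four-line if/else assignment.
    return abs(result) if absolute else result
```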
## Notes

- The repository mixes Python and Rust; ensure changes build for both languages.
- If adding new dependencies, update `pyproject.toml` and run `uv sync --dev --no-install-package datafusion`.
## Refactoring opportunities

- Avoid using private or low-level APIs when a stable, public helper exists. For example, automated refactors should spot and replace uses like the following:

```python
# Before (uses a private/low-level PyArrow API)
reader = pa.RecordBatchReader._import_from_c_capsule(
    df.__arrow_c_stream__()
)

# After (uses the public API)
reader = pa.RecordBatchReader.from_stream(df)
```

Look for call chains that invoke `_import_from_c_capsule` with `__arrow_c_stream__()` and prefer `from_stream(df)` instead. This improves readability and avoids relying on private PyArrow internals that may change.
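One way to automate that spot-and-replace could be a text-level rewrite. The helper below is a hypothetical sketch (its name and regex are illustrative, not part of this repo):

```python
import re

# Matches pa.RecordBatchReader._import_from_c_capsule(<name>.__arrow_c_stream__()),
# including the multi-line form, capturing the variable name passed to the chain.
_PRIVATE_READER_CALL = re.compile(
    r"pa\.RecordBatchReader\._import_from_c_capsule\(\s*"
    r"(\w+)\.__arrow_c_stream__\(\)\s*\)"
)


def rewrite_private_reader_calls(source: str) -> str:
    """Rewrite the private capsule-import chain to the public from_stream call."""
    return _PRIVATE_READER_CALL.sub(r"pa.RecordBatchReader.from_stream(\1)", source)
```

A real refactor would likely operate on the AST rather than raw text, but a pattern like this is enough to flag candidates during review.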
## Commenting guidance

Use comments intentionally. Prefer three kinds of comments depending on purpose:

- Implementation comments
  - Explain non-obvious choices and tricky implementations
  - Serve as breadcrumbs for future developers
- Documentation comments
  - Describe functions, classes, and modules
  - Act as public interface documentation
- Contextual comments
  - Document assumptions, preconditions, and non-obvious requirements

Keep comments concise and up-to-date; prefer clear code over comments when possible, and move long-form design notes into the repository docs or an appropriate design file.

## Helper Functions
- `python/datafusion/io.py` offers global context readers:
  - `read_parquet`
  - `read_json`
  - `read_csv`
  - `read_avro`
- `python/datafusion/user_defined.py` exports convenience creators for user-defined functions:
  - `udf` (scalar)
  - `udaf` (aggregate)
  - `udwf` (window)
  - `udtf` (table)
- `python/datafusion/col.py` exposes the `Col` helper with `col` and `column` instances for building column expressions using attribute access.
- `python/datafusion/catalog.py` provides Python-based catalog and schema providers.
- `python/datafusion/object_store.py` exposes object store connectors: `AmazonS3`, `GoogleCloud`, `MicrosoftAzure`, `LocalFileSystem`, and `Http`.
- `python/datafusion/unparser.py` converts logical plans back to SQL via the `Dialect` and `Unparser` classes.
- `python/datafusion/dataframe_formatter.py` offers configurable HTML and string formatting for DataFrames (replaces the deprecated `html_formatter.py`).
- `python/tests/generic.py` includes utilities for test data generation:
  - `data`
  - `data_with_nans`
  - `data_datetime`
  - `data_date32`
  - `data_timedelta`
  - `data_binary_other`
  - `write_parquet`
- `python/tests/conftest.py` defines reusable pytest fixtures:
  - `ctx` creates a `SessionContext`.
  - `database` registers a sample CSV dataset.
- `src/dataframe.rs` provides the `collect_record_batches_to_display` helper to fetch the first non-empty record batch and flag if more are available.
