Commit be8dd9d

timsaucer and Claude authored

Add AI skill to check current repository against upstream APIs (#1460)
* **Initial commit for skill to check upstream repo**
* **Add instructions on using the check-upstream skill**
* **Add FFI type coverage and implementation pattern to check-upstream skill.** Document the full FFI type pipeline (Rust PyO3 wrapper → Protocol type → Python wrapper → ABC base class → exports → example) and catalog which upstream datafusion-ffi types are supported, which have been evaluated as not needing direct exposure, and how to check for new gaps.
* **Update check-upstream skill to include FFI types as a checkable area.** Add "ffi types" to the argument-hint and description so users can invoke the skill with `/check-upstream ffi types`. Also add a pipeline verification step to ensure each supported FFI type has the full end-to-end chain (PyO3 wrapper, Protocol, Python wrapper with type hints, ABC, exports).
* **Move FFI Types section alongside other areas to check.** Section 7 (FFI Types) was incorrectly placed after the Output Format and Implementation Pattern sections. Move it to sit after Section 6 (SessionContext Methods), consistent with the other checkable areas.
* **Replace static FFI type list with dynamic discovery instruction.** The supported FFI types list would go stale as new types are added. Replace it with a grep instruction to discover them at check time, keeping only the "evaluated and not requiring exposure" list, which captures rationale not derivable from code.
* **Make Python API the source of truth for upstream coverage checks.** Functions exposed in Python (e.g., as aliases of other Rust bindings) were being falsely reported as missing because they lacked a dedicated #[pyfunction] in Rust. The user-facing API is the Python layer, so coverage should be measured there.
* **Add exclusion list for DataFrame methods already covered by Python API.** show_limit is covered by DataFrame.show() and with_param_values is covered by SessionContext.sql(param_values=...), so neither needs separate exposure.
* **Move skills to .ai/skills/ for tool-agnostic discoverability.** Moves the canonical skill definitions from .claude/skills/ to .ai/skills/ and replaces .claude/skills with a symlink, so Claude Code still discovers them while other AI agents can find them in a tool-neutral location.
* **Add AGENTS.md for tool-agnostic agent instructions with CLAUDE.md symlink.** AGENTS.md points agents to .ai/skills/ for skill discovery. CLAUDE.md symlinks to it so Claude Code picks it up as project instructions.
* **Make README upstream coverage section tool-agnostic.** Remove Claude Code references and update the skill path from .claude/skills/ to .ai/skills/ to match the new tool-neutral directory structure.
* **Add GitHub issue lookup step to check-upstream skill.** When gaps are identified, search open issues at apache/datafusion-python before reporting. Existing issues are linked in the report rather than duplicated.
* **Require Python test coverage in issues created by check-upstream skill**
* **Add license text**

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 0113a6e commit be8dd9d

File tree

5 files changed: +438 −0 lines changed

.ai/skills/check-upstream/SKILL.md

Lines changed: 382 additions & 0 deletions
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

---
name: check-upstream
description: Check if upstream Apache DataFusion features (functions, DataFrame ops, SessionContext methods, FFI types) are exposed in this Python project. Use when adding missing functions, auditing API coverage, or ensuring parity with upstream.
argument-hint: [area] (e.g., "scalar functions", "aggregate functions", "window functions", "dataframe", "session context", "ffi types", "all")
---

# Check Upstream DataFusion Feature Coverage

You are auditing the datafusion-python project to find features from the upstream Apache DataFusion Rust library that are **not yet exposed** in this Python binding project. Your goal is to identify gaps and, if asked, implement the missing bindings.

**IMPORTANT: The Python API is the source of truth for coverage.** A function or method is considered "exposed" if it exists in the Python API (e.g., `python/datafusion/functions.py`), even if there is no corresponding entry in the Rust bindings. Many upstream functions are aliases of other functions — the Python layer can expose these aliases by calling a different underlying Rust binding. Do NOT report a function as missing if it appears in the Python `__all__` list and has a working implementation, regardless of whether a matching `#[pyfunction]` exists in Rust.

## Areas to Check

The user may specify an area via `$ARGUMENTS`. If no area is specified or "all" is given, check all areas.
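The dispatch implied above can be sketched as follows. The area names come from the argument-hint in the frontmatter; the helper name is hypothetical:

```python
# Checkable areas, mirroring the argument-hint in the frontmatter.
AREAS = (
    "scalar functions", "aggregate functions", "window functions",
    "table functions", "dataframe", "session context", "ffi types",
)

def resolve_areas(argument: str) -> tuple[str, ...]:
    """Map a user-supplied $ARGUMENTS value to the areas to audit."""
    normalized = argument.strip().lower()
    if not normalized or normalized == "all":
        return AREAS  # no area given (or "all"): check everything
    return (normalized,) if normalized in AREAS else ()
```

For example, `resolve_areas("ffi types")` audits only FFI types, while an empty argument audits every area.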
### 1. Scalar Functions

**Upstream source of truth:**
- Rust docs: https://docs.rs/datafusion/latest/datafusion/functions/index.html
- User docs: https://datafusion.apache.org/user-guide/sql/scalar_functions.html

**Where they are exposed in this project:**
- Python API: `python/datafusion/functions.py` — each function wraps a call to `datafusion._internal.functions`
- Rust bindings: `crates/core/src/functions.rs``#[pyfunction]` definitions registered via `init_module()`

**How to check:**
1. Fetch the upstream scalar function documentation page.
2. Compare against the functions listed in `python/datafusion/functions.py` (check the `__all__` list and function definitions).
3. A function is covered if it exists in the Python API — it does NOT need a dedicated Rust `#[pyfunction]`. Many functions are aliases that reuse another function's Rust binding.
4. Report only functions that are missing from the Python `__all__` list and function definitions.
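Steps 2 and 4 can be mechanized with the standard-library `ast` module. This is a sketch: the upstream set below is a made-up sample, and in practice you would fetch the upstream docs and read `python/datafusion/functions.py` from disk:

```python
import ast

def exported_names(source: str) -> set[str]:
    """Collect the names assigned to __all__ in a module's source."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == "__all__":
                    return {ast.literal_eval(elt) for elt in node.value.elts}
    return set()

# Hypothetical sample: three upstream names vs. a tiny functions.py stub.
upstream = {"abs", "acos", "cot"}
functions_py = '__all__ = ["abs", "acos"]\n\ndef abs(expr): ...\n\ndef acos(expr): ...\n'
missing = sorted(upstream - exported_names(functions_py))
print(missing)  # -> ['cot']
```

Because the comparison runs against the Python `__all__` list, aliases backed by a shared Rust binding are correctly counted as covered.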

### 2. Aggregate Functions

**Upstream source of truth:**
- Rust docs: https://docs.rs/datafusion/latest/datafusion/functions_aggregate/index.html
- User docs: https://datafusion.apache.org/user-guide/sql/aggregate_functions.html

**Where they are exposed in this project:**
- Python API: `python/datafusion/functions.py` (aggregate functions are mixed in with scalar functions)
- Rust bindings: `crates/core/src/functions.rs`

**How to check:**
1. Fetch the upstream aggregate function documentation page.
2. Compare against aggregate functions in `python/datafusion/functions.py` (check the `__all__` list and function definitions).
3. A function is covered if it exists in the Python API, even if it aliases another function's Rust binding.
4. Report only functions missing from the Python API.
### 3. Window Functions

**Upstream source of truth:**
- Rust docs: https://docs.rs/datafusion/latest/datafusion/functions_window/index.html
- User docs: https://datafusion.apache.org/user-guide/sql/window_functions.html

**Where they are exposed in this project:**
- Python API: `python/datafusion/functions.py` (window functions like `rank`, `dense_rank`, `lag`, `lead`, etc.)
- Rust bindings: `crates/core/src/functions.rs`

**How to check:**
1. Fetch the upstream window function documentation page.
2. Compare against window functions in `python/datafusion/functions.py` (check the `__all__` list and function definitions).
3. A function is covered if it exists in the Python API, even if it aliases another function's Rust binding.
4. Report only functions missing from the Python API.
### 4. Table Functions

**Upstream source of truth:**
- Rust docs: https://docs.rs/datafusion/latest/datafusion/functions_table/index.html
- User docs: https://datafusion.apache.org/user-guide/sql/table_functions.html (if available)

**Where they are exposed in this project:**
- Python API: `python/datafusion/functions.py` and `python/datafusion/user_defined.py` (TableFunction/udtf)
- Rust bindings: `crates/core/src/functions.rs` and `crates/core/src/udtf.rs`

**How to check:**
1. Fetch the upstream table function documentation.
2. Compare against what's available in the Python API.
3. A function is covered if it exists in the Python API, even if it aliases another function's Rust binding.
4. Report only functions missing from the Python API.
### 5. DataFrame Operations

**Upstream source of truth:**
- Rust docs: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html

**Where they are exposed in this project:**
- Python API: `python/datafusion/dataframe.py` — the `DataFrame` class
- Rust bindings: `crates/core/src/dataframe.rs``PyDataFrame` with `#[pymethods]`

**Evaluated and not requiring separate Python exposure:**
- `show_limit` — already covered by `DataFrame.show()`, which provides the same functionality with a simpler API
- `with_param_values` — already covered by the `param_values` argument on `SessionContext.sql()`, which accomplishes the same thing more robustly

**How to check:**
1. Fetch the upstream DataFrame documentation page listing all methods.
2. Compare against methods in `python/datafusion/dataframe.py` — this is the source of truth for coverage.
3. The Rust bindings (`crates/core/src/dataframe.rs`) may be consulted for context, but a method is covered if it exists in the Python API.
4. Check against the "evaluated and not requiring exposure" list before flagging a gap.
5. Report only methods missing from the Python API.
### 6. SessionContext Methods

**Upstream source of truth:**
- Rust docs: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html

**Where they are exposed in this project:**
- Python API: `python/datafusion/context.py` — the `SessionContext` class
- Rust bindings: `crates/core/src/context.rs``PySessionContext` with `#[pymethods]`

**How to check:**
1. Fetch the upstream SessionContext documentation page listing all methods.
2. Compare against methods in `python/datafusion/context.py` — this is the source of truth for coverage.
3. The Rust bindings (`crates/core/src/context.rs`) may be consulted for context, but a method is covered if it exists in the Python API.
4. Report only methods missing from the Python API.
### 7. FFI Types (datafusion-ffi)

**Upstream source of truth:**
- Crate source: https://github.com/apache/datafusion/tree/main/datafusion/ffi/src
- Rust docs: https://docs.rs/datafusion-ffi/latest/datafusion_ffi/

**Where they are exposed in this project:**
- Rust bindings: various files under `crates/core/src/` and `crates/util/src/`
- FFI example: `examples/datafusion-ffi-example/src/`
- Dependency declared in the root `Cargo.toml` and `crates/core/Cargo.toml`

**Discovering currently supported FFI types:**
Grep for `use datafusion_ffi::` in `crates/core/src/` and `crates/util/src/` to find all FFI types currently imported and used.
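That discovery step can be sketched as a small scanner. It is run here against an inline sample rather than the real crates, and brace-grouped imports (`use datafusion_ffi::{a, b};`) would need extra handling:

```python
import re

# Matches `use datafusion_ffi::path::to::TypeName` and captures TypeName.
FFI_USE = re.compile(r"use datafusion_ffi::(?:\w+::)*(\w+)")

def ffi_types_used(rust_source: str) -> set[str]:
    """Names imported from datafusion_ffi in the given Rust source."""
    return set(FFI_USE.findall(rust_source))

sample = """
use datafusion_ffi::table_provider::FFI_TableProvider;
use datafusion_ffi::udf::FFI_ScalarUDF;
"""
print(sorted(ffi_types_used(sample)))  # -> ['FFI_ScalarUDF', 'FFI_TableProvider']
```

In a real audit you would apply this over every `.rs` file under `crates/core/src/` and `crates/util/src/`, then diff the result against the upstream crate's exports.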
**Evaluated and not requiring direct Python exposure:**
These upstream FFI types have been reviewed and do not need to be independently exposed to end users:
- `FFI_ExecutionPlan` — already used indirectly through table providers; no need for direct exposure
- `FFI_PhysicalExpr` / `FFI_PhysicalSortExpr` — internal physical planning types not expected to be needed by end users
- `FFI_RecordBatchStream` — one level deeper than `FFI_ExecutionPlan`, used internally when execution plans stream results
- `FFI_SessionRef` / `ForeignSession` — session sharing across FFI; Python manages sessions natively via SessionContext
- `FFI_SessionConfig` — Python can configure sessions natively without FFI
- `FFI_ConfigOptions` / `FFI_TableOptions` — internal configuration plumbing
- `FFI_PlanProperties` / `FFI_Boundedness` / `FFI_EmissionType` — read from existing plans, not user-facing
- `FFI_Partitioning` — supporting type for physical planning
- Supporting/utility types (`FFI_Option`, `FFI_Result`, `WrappedSchema`, `WrappedArray`, `FFI_ColumnarValue`, `FFI_Volatility`, `FFI_InsertOp`, `FFI_AccumulatorArgs`, `FFI_Accumulator`, `FFI_GroupsAccumulator`, `FFI_EmitTo`, `FFI_AggregateOrderSensitivity`, `FFI_PartitionEvaluator`, `FFI_PartitionEvaluatorArgs`, `FFI_Range`, `FFI_SortOptions`, `FFI_Distribution`, `FFI_ExprProperties`, `FFI_SortProperties`, `FFI_Interval`, `FFI_TableProviderFilterPushDown`, `FFI_TableType`) — used as building blocks within the types above, not independently exposed

**How to check:**
1. Discover currently supported types by grepping for `use datafusion_ffi::` in `crates/core/src/` and `crates/util/src/`, then compare against the upstream `datafusion-ffi` crate's `lib.rs` exports.
2. If new FFI types appear upstream, evaluate whether they represent a user-facing capability.
3. Check against the "evaluated and not requiring exposure" list before flagging a gap.
4. Report any genuinely new types that enable user-facing functionality.
5. For each currently supported FFI type, verify the full pipeline is present using the checklist from "Adding a New FFI Type":
   - Rust PyO3 wrapper with `from_pycapsule()` method
   - Python Protocol type (e.g., `ScalarUDFExportable`) for FFI objects
   - Python wrapper class with full type hints on all public methods
   - ABC base class (if the type can be user-implemented)
   - Registered in Rust `init_module()` and Python `__init__.py`
   - FFI example in `examples/datafusion-ffi-example/`
   - Type appears in union type hints where accepted
## Checking for Existing GitHub Issues

After identifying missing APIs, search the open issues at https://github.com/apache/datafusion-python/issues for each gap to see if an issue already exists requesting that API be exposed. Search using the function or method name as the query.

- If an existing issue is found, include a link to it in the report. Do NOT create a new issue.
- If no existing issue is found, note that no issue exists yet. If the user asks to create issues for missing APIs, each issue should specify that Python test coverage is required as part of the implementation.
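The lookup can go through GitHub's issue search API. This sketch only builds the query URL; actually fetching it needs network access (and a token for higher rate limits):

```python
from urllib.parse import urlencode

def issue_search_url(api_name: str) -> str:
    """Search-API URL for open apache/datafusion-python issues mentioning api_name."""
    query = f"repo:apache/datafusion-python is:issue is:open {api_name}"
    return "https://api.github.com/search/issues?" + urlencode({"q": query})

print(issue_search_url("array_append"))
```

Each gap in the report then either links the first matching issue or is marked "no existing issue".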
## Output Format

For each area checked, produce a report like:

```
## [Area Name] Coverage Report

### Currently Exposed (X functions/methods)
- list of what's already available

### Missing from Upstream (Y functions/methods)
- function_name — brief description of what it does (existing issue: #123)
- function_name — brief description of what it does (no existing issue)

### Notes
- Any relevant observations about partial implementations, naming differences, etc.
```
## Implementation Pattern

If the user asks you to implement missing features, follow these patterns:

### Adding a New Function (Scalar/Aggregate/Window)

**Step 1: Rust binding** in `crates/core/src/functions.rs`:
```rust
#[pyfunction]
#[pyo3(signature = (arg1, arg2))]
fn new_function_name(arg1: PyExpr, arg2: PyExpr) -> PyResult<PyExpr> {
    Ok(datafusion::functions::module::expr_fn::new_function_name(arg1.expr, arg2.expr).into())
}
```
Then register in `init_module()`:
```rust
m.add_wrapped(wrap_pyfunction!(new_function_name))?;
```

**Step 2: Python wrapper** in `python/datafusion/functions.py`:
```python
def new_function_name(arg1: Expr, arg2: Expr) -> Expr:
    """Description of what the function does.

    Args:
        arg1: Description of first argument.
        arg2: Description of second argument.

    Returns:
        Description of return value.
    """
    return Expr(f.new_function_name(arg1.expr, arg2.expr))
```
Add to the `__all__` list.
### Adding a New DataFrame Method

**Step 1: Rust binding** in `crates/core/src/dataframe.rs`:
```rust
#[pymethods]
impl PyDataFrame {
    fn new_method(&self, py: Python, param: PyExpr) -> PyDataFusionResult<Self> {
        let df = self.df.as_ref().clone().new_method(param.into())?;
        Ok(Self::new(df))
    }
}
```

**Step 2: Python wrapper** in `python/datafusion/dataframe.py`:
```python
def new_method(self, param: Expr) -> DataFrame:
    """Description of the method."""
    return DataFrame(self.df.new_method(param.expr))
```
### Adding a New SessionContext Method

**Step 1: Rust binding** in `crates/core/src/context.rs`:
```rust
#[pymethods]
impl PySessionContext {
    pub fn new_method(&self, py: Python, param: String) -> PyDataFusionResult<PyDataFrame> {
        let df = wait_for_future(py, self.ctx.new_method(&param))?;
        Ok(PyDataFrame::new(df))
    }
}
```

**Step 2: Python wrapper** in `python/datafusion/context.py`:
```python
def new_method(self, param: str) -> DataFrame:
    """Description of the method."""
    return DataFrame(self.ctx.new_method(param))
```
### Adding a New FFI Type

FFI types require a full pipeline from C struct through to a typed Python wrapper. Each layer must be present.

**Step 1: Rust PyO3 wrapper class** in a new or existing file under `crates/core/src/`:
```rust
use datafusion_ffi::new_type::FFI_NewType;

#[pyclass(from_py_object, frozen, name = "RawNewType", module = "datafusion.module_name", subclass)]
pub struct PyNewType {
    pub inner: Arc<dyn NewTypeTrait>,
}

#[pymethods]
impl PyNewType {
    #[staticmethod]
    fn from_pycapsule(obj: &Bound<'_, PyAny>) -> PyDataFusionResult<Self> {
        let capsule = obj
            .getattr("__datafusion_new_type__")?
            .call0()?
            .downcast::<PyCapsule>()?;
        let ffi_ptr = unsafe { capsule.reference::<FFI_NewType>() };
        let provider: Arc<dyn NewTypeTrait> = ffi_ptr.into();
        Ok(Self { inner: provider })
    }

    fn some_method(&self) -> PyResult<...> {
        // wrap inner trait method
    }
}
```
Register in the appropriate `init_module()`:
```rust
m.add_class::<PyNewType>()?;
```
**Step 2: Python Protocol type** in the appropriate Python module (e.g., `python/datafusion/catalog.py`):
```python
class NewTypeExportable(Protocol):
    """Type hint for objects providing a __datafusion_new_type__ PyCapsule."""

    def __datafusion_new_type__(self) -> object: ...
```

**Step 3: Python wrapper class** in the same module:
```python
class NewType:
    """Description of the type.

    This class wraps a DataFusion NewType, which can be created from a native
    Python implementation or imported from an FFI-compatible library.
    """

    def __init__(
        self,
        new_type: df_internal.module_name.RawNewType | NewTypeExportable,
    ) -> None:
        if isinstance(new_type, df_internal.module_name.RawNewType):
            self._raw = new_type
        else:
            self._raw = df_internal.module_name.RawNewType.from_pycapsule(new_type)

    def some_method(self) -> ReturnType:
        """Description of the method."""
        return self._raw.some_method()
```
**Step 4: ABC base class** (if users should be able to subclass and provide custom implementations in Python):
```python
from abc import ABC, abstractmethod

class NewTypeProvider(ABC):
    """Abstract base class for implementing a custom NewType in Python."""

    @abstractmethod
    def some_method(self) -> ReturnType:
        """Description of the method."""
        ...
```

**Step 5: Module exports** — add to the appropriate `__init__.py`:
- Add the wrapper class (`NewType`) to `python/datafusion/__init__.py`
- Add the ABC (`NewTypeProvider`) if applicable
- Add the Protocol type (`NewTypeExportable`) if it should be public

**Step 6: FFI example** — add an example implementation under `examples/datafusion-ffi-example/src/`:
```rust
// examples/datafusion-ffi-example/src/new_type.rs
use datafusion_ffi::new_type::FFI_NewType;
// ... example showing how an external Rust library exposes this type via PyCapsule
```

**Checklist for each FFI type:**
- [ ] Rust PyO3 wrapper with `from_pycapsule()` method
- [ ] Python Protocol type (e.g., `NewTypeExportable`) for FFI objects
- [ ] Python wrapper class with full type hints on all public methods
- [ ] ABC base class (if the type can be user-implemented)
- [ ] Registered in Rust `init_module()` and Python `__init__.py`
- [ ] FFI example in `examples/datafusion-ffi-example/`
- [ ] Type appears in union type hints where accepted (e.g., `Table | TableProviderExportable`)
## Important Notes

- The upstream DataFusion version used by this project is specified in `crates/core/Cargo.toml` — check the `datafusion` dependency version to ensure you're comparing against the right upstream version.
- Some upstream features may intentionally not be exposed (e.g., internal-only APIs). Use judgment about what's user-facing.
- When fetching upstream docs, prefer the published docs.rs documentation, as it matches the crate version.
- Function aliases (e.g., `array_append` / `list_append`) should both be exposed if upstream supports them.
- Check the `__all__` list in `functions.py` to see what's publicly exported versus merely defined.
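That exported-versus-defined check can be automated with the same `ast` approach used for coverage. A sketch against an inline stub (real usage would read `functions.py` from disk):

```python
import ast

def defined_but_not_exported(source: str) -> list[str]:
    """Public top-level functions missing from the module's __all__ list."""
    defined: set[str] = set()
    exported: set[str] = set()
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            defined.add(node.name)
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == "__all__":
                    exported = {ast.literal_eval(elt) for elt in node.value.elts}
    # Private helpers (leading underscore) are intentionally unexported.
    return sorted(name for name in defined - exported if not name.startswith("_"))

stub = '__all__ = ["sin"]\n\ndef sin(x): ...\n\ndef cos(x): ...\n\ndef _helper(x): ...\n'
print(defined_but_not_exported(stub))  # -> ['cos']
```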

.claude/skills

Lines changed: 1 addition & 0 deletions
../.ai/skills
