130 changes: 130 additions & 0 deletions SKILL.md
@@ -0,0 +1,130 @@
---
name: dataframely
description: A declarative, Polars-native data frame validation library. Use when implementing data processing logic in polars.
license: BSD-3-Clause
---

# Dataframely skill

`dataframely` provides `dy.Schema` and `dy.Collection` to document and enforce the structure of single or multiple
related data frames.

## `dy.Schema` example

A `dy.Schema` describes the structure of a single dataframe.

```python
class MyHouseSchema(dy.Schema):
    """A schema for a dataframe describing houses."""

    street = dy.String(primary_key=True)
    number = dy.UInt16(primary_key=True)
    # Number of rooms
    rooms = dy.UInt8()
    # Area in square meters
    area = dy.UInt16()
```

## `dy.Collection` example

A `dy.Collection` describes a set of related dataframes, each described by a `dy.Schema`. The dataframes in a collection
should share at least a subset of their primary key columns.

```python
class MyStreetSchema(dy.Schema):
    """A schema for a dataframe describing streets."""

    # Shared primary key component with MyHouseSchema
    street = dy.String(primary_key=True)
    city = dy.String()


class MyCollection(dy.Collection):
    """A collection of related dataframes."""

    houses: dy.LazyFrame[MyHouseSchema]
    streets: dy.LazyFrame[MyStreetSchema]
```

# Usage conventions

## Use clear interfaces

Structure data processing code with clear interfaces documented using `dataframely` type hints:

```python
def preprocess(raw: dy.LazyFrame[MyRawSchema]) -> dy.DataFrame[MyPreprocessedSchema]:
    # Internal dataframes do not require schemas
    df: pl.LazyFrame = ...
    return MyPreprocessedSchema.validate(df, cast=True)
```

Use schemas for all input, output, and intermediate dataframes. Schemas may be omitted for short-lived temporary
dataframes and private helper functions (prefixed with `_`).

## `filter` vs `validate`

Both `.validate` and `.filter` enforce the schema at runtime. Pass `cast=True` for safe type-casting.

- **`Schema.validate`** — raises on failure. Use when failures are unexpected (e.g. transforming already-validated
data).
- **`Schema.filter`** — returns valid rows plus a `FailureInfo` describing filtered-out rows. Use when failures are
possible and should be handled gracefully (e.g. logging and skipping invalid rows).

## Testing

Every data transformation must have unit tests. Test each branch of the transformation logic. Do not test properties
already guaranteed by the schema.

### Test structure

1. Create synthetic input data
2. Define the expected output
3. Execute the transformation
4. Compare using `assert_frame_equal` from `polars.testing` (or `diffly.testing` if installed)

```python
from polars.testing import assert_frame_equal


def test_grouped_sum():
    df = pl.DataFrame({
        "col1": [1, 2, 3],
        "col2": ["a", "a", "b"],
    }).pipe(MyInputSchema.validate, cast=True)

    expected = pl.DataFrame({
        "col1": ["a", "b"],
        "col2": [3, 3],
    })

    result = my_code(df)

    assert_frame_equal(result, expected)
```

### Generating synthetic input data

For complex schemas where only some columns are relevant to the test, use `dataframely`'s synthetic data generation:

```python
# Random data meeting all schema constraints
random_data = MyInputSchema.sample(num_rows=100)
```

Use fully random data for property tests where exact contents don't matter. Use overrides to pin specific columns while
randomly sampling the rest:

```python
random_data_with_overrides = MyHouseSchema.sample(
    num_rows=5,
    overrides={
        "street": ["Main St.", "Main St.", "Main St.", "Second St.", "Second St."],
    },
)
```

# Getting more information

`dataframely` relies on clear function signatures, type hints, and docstrings. If you need more information, check the
locally installed code.
1 change: 0 additions & 1 deletion docs/conf.py
@@ -22,7 +22,6 @@

_mod = importlib.import_module("dataframely")


project = "dataframely"
copyright = f"{datetime.date.today().year}, QuantCo, Inc"
author = "QuantCo, Inc."
71 changes: 71 additions & 0 deletions docs/guides/coding-agents.md
@@ -0,0 +1,71 @@
# Using `dataframely` with coding agents

Coding agents are particularly powerful when two criteria are met:

1. The agent has access to all required information and does not need to guess.
2. The results of the agent's work can be easily verified.

`dataframely` helps you fulfill these criteria.

To help your coding agent write good `dataframely` code, we provide a
`dataframely` [skill](https://raw.githubusercontent.com/Quantco/dataframely/refs/heads/main/SKILL.md)
following the [`agentskills.io` spec](https://agentskills.io/specification). You can install
it by placing it where your agent can find it. For example, if you are using `claude`:

```bash
mkdir -p .claude/skills/dataframely/
curl -o .claude/skills/dataframely/SKILL.md https://raw.githubusercontent.com/Quantco/dataframely/refs/heads/main/SKILL.md
```

or if you are using skills.sh:

```bash
npx skills add Quantco/dataframely
```

Refer to the documentation of your coding agent for instructions on how to add custom skills.

## Tell the agent about your data with `dataframely` schemas

`dataframely` schemas provide a clear format for documenting dataframe structure and contents, which helps coding
agents understand your code base. We recommend structuring your data processing code using clear interfaces that are
documented using
`dataframely` type hints. This streamlines your coding agent's ability to find the right schema at the right time.

For example:

```python
def preprocess(raw: dy.LazyFrame[MyRawSchema]) -> dy.DataFrame[MyPreprocessedSchema]:
...
```

gives a coding agent much more information than the schema-less alternative:

```python
def load_data(raw: pl.LazyFrame) -> pl.DataFrame:
...
```

This convention also makes your code more readable and maintainable for human developers.

If there is additional domain information that is not natively expressed through the structure of the schema,
we recommend documenting it directly on the schema column definitions. A common example is the semantic
meaning of enum values that encode conventions in the data:

```python
class HospitalStaySchema(dy.Schema):
    # Reason for admission to the hospital:
    # N = Emergency
    # V = Transfer from another hospital
    # ...
    admission_reason = dy.Enum(["N", "V", ...])
```

## Verifying results

`dataframely` supports you and your coding agent in writing unit tests for individual pieces of logic. One significant
bottleneck is the generation of appropriate test data. Check
out [our documentation on synthetic data generation](./features/data-generation.md) to see how `dataframely` can help
you generate realistic test data that meets the constraints of your schema. We recommend requiring your coding agent to
write tests using this functionality to verify its work.
1 change: 1 addition & 0 deletions docs/guides/index.md
@@ -7,6 +7,7 @@
quickstart
examples/index
features/index
coding-agents
development
migration/index
faq
26 changes: 26 additions & 0 deletions pixi.lock


1 change: 1 addition & 0 deletions pixi.toml
@@ -36,6 +36,7 @@ sphinx = ">=8.2"
sphinx-copybutton = "*"
sphinx-design = "*"
sphinx-toolbox = "*"

[feature.docs.tasks]
docs = { cmd = "rm -rf _build && find . -name _gen -type d -exec rm -rf \"{}\" + && sphinx-build -M html . _build --fail-on-warning", cwd = "docs", depends-on = "postinstall" }
readthedocs = { cmd = "rm -rf $READTHEDOCS_OUTPUT/html && cp -r docs/_build/html $READTHEDOCS_OUTPUT/html", depends-on = "docs" }