130 changes: 130 additions & 0 deletions SKILL.md
@@ -0,0 +1,130 @@
---
name: dataframely
description: A declarative, Polars-native data frame validation library. Use when implementing data processing logic in polars.
license: BSD-3-Clause
---

# Dataframely skill

`dataframely` provides `dy.Schema` and `dy.Collection` to document and enforce the structure of single or multiple
related data frames.

## `dy.Schema` example

A `dy.Schema` describes the structure of a single dataframe.

```python
class MyHouseSchema(dy.Schema):
    """A schema for a dataframe describing houses."""

    street = dy.String(primary_key=True)
    number = dy.UInt16(primary_key=True)
    # Number of rooms
    rooms = dy.UInt8()
    # Area in square meters
    area = dy.UInt16()
```

## `dy.Collection` example

A `dy.Collection` describes a set of related dataframes, each described by a `dy.Schema`. The dataframes in a collection
should share at least a subset of their primary key columns.

```python
class MyStreetSchema(dy.Schema):
    """A schema for a dataframe describing streets."""

    # Shared primary key component with MyHouseSchema
    street = dy.String(primary_key=True)
    city = dy.String()


class MyCollection(dy.Collection):
    """A collection of related dataframes."""

    houses: dy.LazyFrame[MyHouseSchema]
    streets: dy.LazyFrame[MyStreetSchema]
```

# Usage conventions

## Use clear interfaces

Structure data processing code with clear interfaces documented using `dataframely` type hints:

```python
def preprocess(raw: dy.LazyFrame[MyRawSchema]) -> dy.DataFrame[MyPreprocessedSchema]:
    # Internal dataframes do not require schemas
    df: pl.LazyFrame = ...
    return MyPreprocessedSchema.validate(df, cast=True)
```

Use schemas for all input, output, and intermediate dataframes. Schemas may be omitted for short-lived temporary
dataframes and private helper functions (prefixed with `_`).

## `filter` vs `validate`

Both `.validate` and `.filter` enforce the schema at runtime. Pass `cast=True` for safe type-casting.

- **`Schema.validate`** — raises on failure. Use when failures are unexpected (e.g. transforming already-validated
data).
- **`Schema.filter`** — returns valid rows plus a `FailureInfo` describing filtered-out rows. Use when failures are
possible and should be handled gracefully (e.g. logging and skipping invalid rows).

## Testing

Every data transformation must have unit tests. Test each branch of the transformation logic. Do not test properties
already guaranteed by the schema.

### Test structure

1. Create synthetic input data
2. Define the expected output
3. Execute the transformation
4. Compare using `assert_frame_equal` from `polars.testing` (or `diffly.testing` if installed)

```python
from polars.testing import assert_frame_equal


def test_grouped_sum():
    df = pl.DataFrame({
        "col1": [1, 2, 3],
        "col2": ["a", "a", "b"],
    }).pipe(MyInputSchema.validate, cast=True)

    expected = pl.DataFrame({
        "col1": ["a", "b"],
        "col2": [3, 3],
    })

    result = my_code(df)

    assert_frame_equal(result, expected)
```

### Generating synthetic input data

For complex schemas where only some columns are relevant to the test, use `dataframely`'s synthetic data generation:

```python
# Random data meeting all schema constraints
random_data = MyInputSchema.sample(num_rows=100)
```

Use fully random data for property tests where exact contents don't matter. Use overrides to pin specific columns while
randomly sampling the rest:

```python
random_data_with_overrides = MyHouseSchema.sample(
    num_rows=5,
    overrides={
        "street": ["Main St.", "Main St.", "Main St.", "Second St.", "Second St."],
    },
)
```

# Getting more information

`dataframely` relies on clear function signatures, type hints, and docstrings. If you need more information, check the
locally installed code.
1 change: 0 additions & 1 deletion docs/conf.py
@@ -22,7 +22,6 @@

_mod = importlib.import_module("dataframely")


project = "dataframely"
copyright = f"{datetime.date.today().year}, QuantCo, Inc"
author = "QuantCo, Inc."
71 changes: 71 additions & 0 deletions docs/guides/coding-agents.md
@@ -0,0 +1,71 @@
# Using `dataframely` with coding agents

Coding agents are particularly powerful when two criteria are met:

1. The agent has access to all required information and does not need to guess.
2. The results of the agent's work can be easily verified.

`dataframely` helps you fulfill these criteria.

To help your coding agent write good `dataframely` code, we provide a
`dataframely` [skill](https://raw.githubusercontent.com/Quantco/dataframely/refs/heads/main/SKILL.md)
following the [`agentskills.io` spec](https://agentskills.io/specification). You can install
it by placing it where your agent can find it. For example, if you are using `claude`:

```bash
mkdir -p .claude/skills/dataframely/
curl -o .claude/skills/dataframely/SKILL.md https://raw.githubusercontent.com/Quantco/dataframely/refs/heads/main/SKILL.md
```

or if you are using skills.sh:

```bash
npx skills add Quantco/dataframely
```

Refer to the documentation of your coding agent for instructions on how to add custom skills.

## Tell the agent about your data with `dataframely` schemas

`dataframely` schemas provide a clear format for documenting dataframe structure and contents, which helps coding
agents understand your code base. We recommend structuring your data processing code using clear interfaces that are
documented using
`dataframely` type hints. This streamlines your coding agent's ability to find the right schema at the right time.

For example:

```python
def preprocess(raw: dy.LazyFrame[MyRawSchema]) -> dy.DataFrame[MyPreprocessedSchema]:
...
```

gives a coding agent much more information than the schema-less alternative:

```python
def load_data(raw: pl.LazyFrame) -> pl.DataFrame:
...
```

This convention also makes your code more readable and maintainable for human developers.

If there is additional domain information that is not natively expressed through the structure of the schema,
we recommend documenting it directly on the schema column definitions. A common example is the semantic
meaning of enum values that encode conventions in the data:

```python
class HospitalStaySchema(dy.Schema):
    # Reason for admission to the hospital:
    # N = Emergency
    # V = Transfer from another hospital
    # ...
    admission_reason = dy.Enum(["N", "V", ...])
```

## Verifying results

`dataframely` supports you and your coding agent in writing unit tests for individual pieces of logic. One significant
bottleneck is the generation of appropriate test data. Check
out [our documentation on synthetic data generation](./features/data-generation.md) to see how `dataframely` can help
you generate realistic test data that meets the constraints of your schema. We recommend requiring your coding agent to
write tests using this functionality to verify its work.
1 change: 1 addition & 0 deletions docs/guides/index.md
@@ -7,6 +7,7 @@
quickstart
examples/index
features/index
coding-agents
development
migration/index
faq
26 changes: 26 additions & 0 deletions pixi.lock


1 change: 1 addition & 0 deletions pixi.toml
@@ -36,6 +36,7 @@ sphinx = ">=8.2"
sphinx-copybutton = "*"
sphinx-design = "*"
sphinx-toolbox = "*"

[feature.docs.tasks]
docs = { cmd = "rm -rf _build && find . -name _gen -type d -exec rm -rf \"{}\" + && sphinx-build -M html . _build --fail-on-warning", cwd = "docs", depends-on = "postinstall" }
readthedocs = { cmd = "rm -rf $READTHEDOCS_OUTPUT/html && cp -r docs/_build/html $READTHEDOCS_OUTPUT/html", depends-on = "docs" }