Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
80 changes: 80 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What this repo is

A Principal Data Engineer interview exercise: a small dbt + DuckDB warehouse inherited from a contractor. The active task is in `DATA-123.md` (Q1 revenue reconciliation against Stripe captures is off). Read it first.

`DATA-123.md` is **rendered from `DATA-123.md.tmpl` by `setup/generate.py`** — re-running `make setup` overwrites it. Don't edit it directly if you want notes to persist; put notes elsewhere.

## Commands

All commands go through the Makefile, which wraps `uv run`. Don't invoke `dbt` directly unless you have a reason — the Make targets pin the right working directory and env.

```bash
make setup # rebuild warehouse.duckdb from raw/*.csv, then dbt full-refresh
make seed # regenerate raw/*.csv from setup/generate.py, then full-refresh
make run # dbt run (incremental)
make full # dbt run --full-refresh
make test # dbt test (singular tests in tests/ + schema tests in *.yml)
make lint # sqlfluff lint models/
make sql Q="select count(*) from main_orders_dw.order_fact" # read-only one-shot query
make clean # rm warehouse.duckdb (next `make setup` rebuilds)
```

Run a single dbt model or test:

```bash
uv run dbt run --select order_fact
uv run dbt run --select +order_fact # model + upstream
uv run dbt test --select order_fact # all tests on a model
uv run dbt test --select test_name:order_fact_revenue_reconciliation
```

`profiles.yml` lives in the repo root (not `~/.dbt/`); dbt picks it up via project-local config. Target is `dbt_duckdb` → `./warehouse.duckdb`.

## Architecture

### Layering (enforced by convention, not tooling)

`base/` → `staging/` → (`lookup/` | `dw/`) → `reporting/`

Materializations are set in `dbt_project.yml` per layer:
- `base/`, `staging/` → views
- `lookup/` → tables
- `dw/` → **incremental** (the only persisted heavy layer)
- `reporting/` → views

Each subject area (`orders/`, `merchants/`) gets its own schema per layer (e.g. `main_orders_dw`, `main_merchants_lookup`). When writing `make sql Q=...` queries, use the fully-qualified `<schema>.<table>` name.

### Incremental pattern

`models/orders/dw/order_fact.sql` and `order_line_fact.sql` are incremental on `ordered_at`, gated by the `get_incremental_value(incr_col)` macro in `macros/get_incremental_value.sql`. The macro is a DuckDB-flavored shim of Extend's internal macro of the same name — it returns `max(incr_col)` from the existing relation, or `'1900-01-01'` on first build. The incremental `WHERE` clause uses this to filter new rows.

**Watch-out:** `order_fact.sql` currently filters incrementally on `ordered_at >= get_incremental_value('updated_at_dwh')` — note the column mismatch. If you change incremental logic, verify the watermark column matches the filter column.

### Order grain & shipment fan-out

`order_fact` is **one row per order**, but the underlying join goes through shipments → shipment_line_items → line_items, which fans out. The model deduplicates with `QUALIFY row_number() OVER (PARTITION BY order_id ORDER BY shipped_at) = 1` at the bottom. Any change touching the join keys or grain needs to keep that invariant.

`revenue` on `order_fact` is computed as `sum(quantity_shipped * unit_price)` aggregated to `(order, shipment)` then collapsed to first shipment by `shipped_at`. This is the suspected source of the DATA-123 discrepancy — orders with multiple shipments will lose revenue from non-first shipments.

### Out-of-scope by prior design

Per `docs/designs/2024-Q3-orders-redesign.md`:
- **Refunds are intentionally not modeled.** `raw/` contains `refunds_*.csv` files but they're not wired into any model. If refund logic is needed, it belongs on a separate `refund_fact`, not on `order_fact`.
- **Merchants are current-state only** (`lkp_merchants`). No Type-2 history. Tier-at-time-of-order needs a separate snapshot.
Comment on lines +66 to +67
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update the refunds documentation to reflect new modeling.

Lines 66-67 state that "Refunds are intentionally not modeled" and cite the 2024-Q3 design as justification. However, this PR introduces comprehensive refunds modeling via refund_fact, order_fact refund aggregates, and order_line_fact refund attribution. This documentation is now contradictory and will mislead future developers.

📝 Proposed fix to update the documentation
-Per `docs/designs/2024-Q3-orders-redesign.md`:
-- **Refunds are intentionally not modeled.** `raw/` contains `refunds_*.csv` files but they're not wired into any model. If refund logic is needed, it belongs on a separate `refund_fact`, not on `order_fact`.
-- **Merchants are current-state only** (`lkp_merchants`). No Type-2 history. Tier-at-time-of-order needs a separate snapshot.
+Per design decisions:
+- **Refunds are modeled on `refund_fact`** (see `docs/designs/2026-Q2-refunds-modeling.md`). Refund aggregates (`refund_total`, `net_revenue`) appear on `order_fact` and `order_line_fact`, but refund event details live on the separate `refund_fact` table.
+- **Merchants are current-state only** (`lkp_merchants`). No Type-2 history. Tier-at-time-of-order needs a separate snapshot.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@CLAUDE.md` around lines 66 - 67, The docs currently state "Refunds are
intentionally not modeled" but the PR adds comprehensive refund modeling; update
CLAUDE.md to remove or replace that bullet and explicitly document the new
refund design: state that refunds are modeled via refund_fact, that order-level
refund aggregates are stored on order_fact, and that order_line_fact contains
refund attribution; also note which raw/refunds_*.csv inputs are consumed and
any relevant transformation or join keys used for refund_fact so future
developers aren't misled.


Read the design doc before proposing structural changes — its "out of scope" section reflects deliberate decisions, not gaps.

## Conventions (from CONTRIBUTING.md)

- SQL: 4-space indent, **leading commas**, lowercase identifiers, uppercase keywords. `make lint` enforces this via sqlfluff (config in `.sqlfluff`, DuckDB dialect, dbt templater).
- Every fact gets `unique` + `not_null` on its PK.
- Prefer row-level invariants over aggregate reconciliation tests — aggregates tell you something is wrong, not where.
- `severity: warn` is for noisy upstream conditions, **not** for known-broken state. The singular test `tests/order_fact_revenue_reconciliation.sql` is currently `severity='warn'` — that's a smell to investigate, not a precedent to copy.

## Seed data

`raw/*.csv` is **committed**. `setup/generate.py` is deterministic (`SEED = 20260517`) so regenerating produces identical files. Date range is 2024-11-01 → 2026-05-01 (18 months); 5k merchants, 500 products, 10k orders. Refund CSVs exist in `raw/` but are unused (see above).
Loading