From 49948a50afdc29a25d07b0e7edfc5b02862cc583 Mon Sep 17 00:00:00 2001 From: atiwary Date: Wed, 20 May 2026 16:48:30 -0500 Subject: [PATCH 1/2] fix: [DATA-123] correct order_fact revenue computation Root cause: order_fact had two independent bugs that combined to understate non-test Q1 revenue by $2,467,627.97 vs Stripe captures ($10,522,258.04 reported vs $12,989,886.01 truth). 1. The QUALIFY row_number() PARTITION BY order_id ORDER BY shipped_at = 1 kept only the first shipment per order and silently dropped revenue from 2nd+ shipments on 847 multi-shipment orders (-$747,890.76). 2. Revenue was sum(quantity_shipped * unit_price) at shipment-line grain. Partially-shipped, partially-cancelled, and not-yet-shipped orders lost the un-shipped portion (-$1,719,737.21). Fix: aggregate stg_line_items at order grain first (revenue = sum(quantity * unit_price), no shipment fan-out, no QUALIFY needed), attach shipment metadata as a sidecar (shipment_count, first_shipped_at). Also fixes the incremental watermark column mismatch (filter and watermark now both on ordered_at instead of ordered_at vs updated_at_dwh). Validated: make full + make test both clean (17/17 + 12/12). Non-test sum(revenue) ties to $12,989,886.01 to the penny; rows == distinct order_ids == 9699 confirms one-row-per-order grain. Also adds: - CLAUDE.md documenting repo layering, incremental pattern, and the prior bug zones so future work doesn't re-discover them. - docs/designs/2026-Q2-refunds-modeling.md: draft design for the refunds/net-revenue follow-on (DATA-145 territory), out-of-scope for this fix per the Q3-2024 design doc. 
--- CLAUDE.md | 80 +++++++++ docs/designs/2026-Q2-refunds-modeling.md | 205 +++++++++++++++++++++++ models/orders/dw/order_fact.sql | 119 ++++--------- 3 files changed, 315 insertions(+), 89 deletions(-) create mode 100644 CLAUDE.md create mode 100644 docs/designs/2026-Q2-refunds-modeling.md diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..ddbbb44 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,80 @@ +# CLAUDE.md + +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. + +## What this repo is + +A Principal Data Engineer interview exercise: a small dbt + DuckDB warehouse inherited from a contractor. The active task is in `DATA-123.md` (Q1 revenue reconciliation against Stripe captures is off). Read it first. + +`DATA-123.md` is **rendered from `DATA-123.md.tmpl` by `setup/generate.py`** — re-running `make setup` overwrites it. Don't edit it directly if you want notes to persist; put notes elsewhere. + +## Commands + +All commands go through the Makefile, which wraps `uv run`. Don't invoke `dbt` directly unless you have a reason — the Make targets pin the right working directory and env. 
+ +```bash +make setup # rebuild warehouse.duckdb from raw/*.csv, then dbt full-refresh +make seed # regenerate raw/*.csv from setup/generate.py, then full-refresh +make run # dbt run (incremental) +make full # dbt run --full-refresh +make test # dbt test (singular tests in tests/ + schema tests in *.yml) +make lint # sqlfluff lint models/ +make sql Q="select count(*) from main_orders_dw.order_fact" # read-only one-shot query +make clean # rm warehouse.duckdb (next `make setup` rebuilds) +``` + +Run a single dbt model or test: + +```bash +uv run dbt run --select order_fact +uv run dbt run --select +order_fact # model + upstream +uv run dbt test --select order_fact # all tests on a model +uv run dbt test --select test_name:order_fact_revenue_reconciliation +``` + +`profiles.yml` lives in the repo root (not `~/.dbt/`); dbt picks it up via project-local config. Target is `dbt_duckdb` → `./warehouse.duckdb`. + +## Architecture + +### Layering (enforced by convention, not tooling) + +`base/` → `staging/` → (`lookup/` | `dw/`) → `reporting/` + +Materializations are set in `dbt_project.yml` per layer: +- `base/`, `staging/` → views +- `lookup/` → tables +- `dw/` → **incremental** (the only persisted heavy layer) +- `reporting/` → views + +Each subject area (`orders/`, `merchants/`) gets its own schema per layer (e.g. `main_orders_dw`, `main_merchants_lookup`). When writing `make sql Q=...` queries, use the fully-qualified `<schema>.<model>` name. + +### Incremental pattern + +`models/orders/dw/order_fact.sql` and `order_line_fact.sql` are incremental on `ordered_at`, gated by the `get_incremental_value(incr_col)` macro in `macros/get_incremental_value.sql`. The macro is a DuckDB-flavored shim of Extend's internal macro of the same name — it returns `max(incr_col)` from the existing relation, or `'1900-01-01'` on first build. The incremental `WHERE` clause uses this to filter new rows. 
+ +**Watch-out:** `order_fact.sql` currently filters incrementally on `ordered_at >= get_incremental_value('updated_at_dwh')` — note the column mismatch. If you change incremental logic, verify the watermark column matches the filter column. + +### Order grain & shipment fan-out + +`order_fact` is **one row per order**, but the underlying join goes through shipments → shipment_line_items → line_items, which fans out. The model deduplicates with `QUALIFY row_number() OVER (PARTITION BY order_id ORDER BY shipped_at) = 1` at the bottom. Any change touching the join keys or grain needs to keep that invariant. + +`revenue` on `order_fact` is computed as `sum(quantity_shipped * unit_price)` aggregated to `(order, shipment)` then collapsed to first shipment by `shipped_at`. This is the suspected source of the DATA-123 discrepancy — orders with multiple shipments will lose revenue from non-first shipments. + +### Out-of-scope by prior design + +Per `docs/designs/2024-Q3-orders-redesign.md`: +- **Refunds are intentionally not modeled.** `raw/` contains `refunds_*.csv` files but they're not wired into any model. If refund logic is needed, it belongs on a separate `refund_fact`, not on `order_fact`. +- **Merchants are current-state only** (`lkp_merchants`). No Type-2 history. Tier-at-time-of-order needs a separate snapshot. + +Read the design doc before proposing structural changes — its "out of scope" section reflects deliberate decisions, not gaps. + +## Conventions (from CONTRIBUTING.md) + +- SQL: 4-space indent, **leading commas**, lowercase identifiers, uppercase keywords. `make lint` enforces this via sqlfluff (config in `.sqlfluff`, DuckDB dialect, dbt templater). +- Every fact gets `unique` + `not_null` on its PK. +- Prefer row-level invariants over aggregate reconciliation tests — aggregates tell you something is wrong, not where. +- `severity: warn` is for noisy upstream conditions, **not** for known-broken state. 
The singular test `tests/order_fact_revenue_reconciliation.sql` is currently `severity='warn'` — that's a smell to investigate, not a precedent to copy. + +## Seed data + +`raw/*.csv` is **committed**. `setup/generate.py` is deterministic (`SEED = 20260517`) so regenerating produces identical files. Date range is 2024-11-01 → 2026-05-01 (18 months); 5k merchants, 500 products, 10k orders. Refund CSVs exist in `raw/` but are unused (see above). diff --git a/docs/designs/2026-Q2-refunds-modeling.md b/docs/designs/2026-Q2-refunds-modeling.md new file mode 100644 index 0000000..9389ab8 --- /dev/null +++ b/docs/designs/2026-Q2-refunds-modeling.md @@ -0,0 +1,205 @@ +# Refunds modeling — net revenue for `order_fact` / `order_line_fact` + +**Author:** Data Engineering +**Status:** Draft — for review with Finance Analytics +**Date:** 2026-05-20 +**Related:** [`2024-Q3-orders-redesign.md`](2024-Q3-orders-redesign.md) (out-of-scope item: refunds), DATA-145 (refund sources, not yet scoped), DATA-123 (Q1 revenue reconciliation — gross-revenue bug, separate issue) + +--- + +## 1. What Finance asked for + +> Surface refund totals on `order_fact` and per-line refund amounts on `order_line_fact` so we can report **net revenue** (gross − refunds). + +This document captures the modeling decisions we recommend. We want sign-off on the decisions in §3 and §5 before we touch any SQL. + +### Two items to highlight up front + +These are covered in detail below, but they're the two decisions most likely to matter to Finance, so we want them on the table before the walkthrough: + +1. **Shopify/Stripe dedup is a heuristic (§3.3).** The same logical refund appears in both Shopify and Stripe in our sample data — Shopify carries the line + amount, Stripe carries the tender split. Summing them double-counts. We propose matching on `(order_id, refunded_at_minute)` and treating Shopify as the source of truth for occurrence/amount. 
**If Finance has a real `gateway_refund_id` cross-walk available in production, we should use that instead** — it's exact, the heuristic is not. See open question E. + +2. **Late-arriving refunds are the operational gotcha (§4).** `order_fact` today filters incremental loads on `ordered_at` only. A refund processed today against an order from three months ago would not refresh that order's row — its `net_revenue` would silently stay wrong. The fix is an `OR` clause on "orders with new refund activity since last run." **This directly affects the question "when do my numbers reflect today's refunds?"** — incremental runs will pick up new refunds against historical orders within one cycle, not never. + +## 2. What's in raw today + +Three refund feeds are landed in `raw/` but **not wired into any model**: + +| Source | Grain | Key fields | Notes | +|---------------------------|----------------|-------------------------------------------------------------------|----------------------------------------------------------------------| +| `refunds_stripe.csv` | order × tender | `refund_id`, `order_id`, `tender_type`, `amount_in_cents`, `processed_at` | Multiple rows per refund event when a refund is split across tenders (e.g., card + store_credit). Per-line detail not available. | +| `refunds_shopify.csv` | order × line | `refund_id`, `order_id`, `line_item_id`, `qty_refunded`, `amount_in_cents`, `refunded_at` | Has line-level detail and quantity refunded. No tender breakdown. | +| `refunds_internal_pos.csv`| order | `refund_id`, `order_id`, `amount_in_cents`, `refunded_at` | In-store register; standalone, no Shopify/Stripe pairing. | + +**Observation (important):** the same refund event sometimes appears in **both** Stripe and Shopify. Example in current sample data, order `O005064` at `2026-04-18T23:12:12`: + +- Shopify: 1 row, `178,905¢` against `L0009590`, `qty_refunded = 5`. +- Stripe: 2 rows totaling `178,905¢` (`89,452¢` card + `89,453¢` store_credit). 
+ +These describe the same money moving — Shopify is the system of record for **what was refunded** (line, quantity), Stripe is the system of record for **how it was paid back** (tender). Summing across sources would double-count. Our model must pick a precedence rule (see §3). + +POS appears to be standalone (separate register, no e-commerce pairing). Treat it as additive. + +## 3. Modeling decisions + +### 3.1 Build a separate `refund_fact` (one row per refund event) + +We keep the Q3-2024 design principle intact: **don't tangle refund logic with order status/revenue logic on `order_fact`**. Refund details live on their own fact; `order_fact` and `order_line_fact` carry **aggregates** of that fact. + +``` +base/ refunds_stripe, refunds_shopify, refunds_internal_pos +staging/ stg_refunds ← unified, deduped, line-aware where possible +dw/ refund_fact ← one row per logical refund × line + order_fact ← gains refund_total, refund_count, last_refunded_at, net_revenue + order_line_fact ← gains refund_amount, qty_refunded +reporting/ rpt_net_revenue (or similar, per Finance preference) +``` + +### 3.2 Grain of `refund_fact`: **order × line × refund_event** + +One row per (order_id, line_item_id, refund_id). For order-grain sources (POS, Stripe-only), `line_item_id` is `NULL` and the amount is carried on a single row per refund_id. The line-grain `refund_amount` on `order_line_fact` then uses allocation (§3.4) to fill in unallocated amounts. + +This is more granular than strictly necessary for the finance asks, but it's the grain that lets us answer: + +- "What was refunded?" (line + quantity) — from Shopify directly. +- "How was it paid back?" (tender) — joinable to Stripe rows by (order_id, refunded_at). +- "Was this in-store or online?" — derivable from `source`. 
+ +### 3.3 Source precedence to dedupe Shopify ↔ Stripe overlap + +We treat **Shopify as the system of record for refund occurrence + line + amount**, and **Stripe as the system of record for tender mix only**. Concretely, in `stg_refunds`: + +1. Start with all Shopify rows (line-grain). +2. Add Internal POS rows (line_item_id NULL). +3. Add Stripe rows **only if `(order_id, refunded_at_minute)` does not appear in Shopify** — these are Stripe-direct refunds (e.g., issued from the Stripe dashboard, no Shopify counterpart). +4. Tender breakdown becomes a sidecar: `stg_refund_tenders` (order_id, refund_event_key, tender_type, amount) for downstream reporting that needs it. Not joined into `refund_fact`. + +This is a heuristic. We'd like Finance to confirm whether they have a stronger key from upstream (e.g., a `gateway_refund_id` cross-walk) — see §8. + +### 3.4 Allocation of order-grain refunds to lines + +For refunds where `line_item_id IS NULL` (POS, Stripe-direct), we cannot know which line was refunded. To populate `order_line_fact.refund_amount` we allocate **pro-rata by line revenue**: + +``` +line_refund_share = line_revenue / order_revenue +allocated_refund = order_refund_amount * line_refund_share +``` + +We carry an explicit flag `refund_allocation_method ∈ {'direct', 'pro_rata'}` on `order_line_fact` so analysts can see when a line's refund came from a real Shopify attribution vs. a derived split. The sum across lines still ties to `order_fact.refund_total` to the penny (we round-pennies the largest line to absorb rounding drift — same trick as Stripe's `presentment_money` allocation). + +**Alternative we considered and rejected:** leave un-allocated. That's cleaner but breaks Finance's "per-line net revenue" report — every POS order would have NULL refund attribution. Pro-rata with a method flag preserves both: rollups stay accurate and analysts can filter to `direct`-only lines when investigating returns by SKU. 
+ +### 3.5 What columns go on `order_fact` + +| Column | Definition | +|-------------------------|-------------------------------------------------------------------------| +| `refund_total` | `sum(refund_fact.amount)` per order, in dollars (consistent with `revenue`) | +| `refund_count` | `count(distinct refund_id)` per order | +| `last_refunded_at` | `max(refunded_at)` per order | +| `net_revenue` | `revenue - coalesce(refund_total, 0)` | +| `is_fully_refunded` | `refund_total >= revenue` (handles goodwill/over-refund as TRUE) | + +Nullability: `refund_total` is `coalesce(..., 0)` — every order has a value, even if zero. This makes downstream `net_revenue` filters and aggregations straightforward (no `IS NULL` traps). + +### 3.6 What columns go on `order_line_fact` + +| Column | Definition | +|----------------------------|-------------------------------------------------------------| +| `qty_refunded` | From Shopify when available; else `NULL` (we don't fabricate qty for allocated rows) | +| `refund_amount` | Direct from Shopify OR pro-rata allocated (see §3.4) | +| `refund_allocation_method` | `'direct' \| 'pro_rata' \| 'none'` | +| `net_line_revenue` | `line_revenue - coalesce(refund_amount, 0)` | + +## 4. Incremental + late-arriving refunds (the operational gotcha) + +This is the single most important production decision. Worth slowing down on. + +The current incremental pattern on `order_fact` filters on `ordered_at >= get_incremental_value('ordered_at')`. **Refunds arrive days, weeks, or months after the order.** A refund processed today against an order from three months ago will not be picked up by an `ordered_at`-filtered incremental load — `order_fact` row for that order would never refresh. + +We resolve this in two pieces: + +1. **`refund_fact` is incremental on `refunded_at`** (the refund event's own date). Standard pattern. +2. 
**`order_fact` and `order_line_fact` switch from a single-watermark filter to a union of "new orders" and "orders with new refund activity":** + +```sql +WHERE o.ordered_at >= {{ get_incremental_value('ordered_at') }} + OR o.order_id IN ( + SELECT order_id FROM {{ ref('refund_fact') }} + WHERE refunded_at >= {{ get_incremental_value('refunded_at_dwh') }} + ) +``` + +This re-processes any order whose refund picture changed since the last run, without re-processing the entire history. We'll also need an `updated_at_dwh` column on `order_fact` that bumps whenever refund aggregates change — handy for downstream consumers and required for the watermark column we read above. + +**Backfill on first deploy:** one full-refresh (`make full`) to pick up historical refunds against all historical orders. Incremental thereafter. + +## 5. Scope decisions for sign-off + +These are choices we'd like Finance to explicitly confirm or correct: + +| # | Question | Our recommendation | +|---|-----------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| A | Does "net revenue" = `revenue − refund_total`? | **Yes.** Cancellations are already handled by `order_status` (pre-payment voids never hit gross revenue); refunds are the only post-payment reversal we model here. | +| B | If `refund_total > revenue` (goodwill, overpayment), should `net_revenue` go negative? | **Yes, surface the truth.** Don't clamp at zero. Add a data-quality test at `severity='warn'` so Finance is notified, not silenced. | +| C | Should refunds from test orders (`is_test = true`) be excluded? | **Yes, same as gross revenue.** Reporting layer applies `is_test = false` consistently. | +| D | Is per-tender breakdown (card vs. store_credit) needed on `order_fact`? 
| **No.** Sidecar `stg_refund_tenders` covers anyone who needs it; keeping `order_fact` narrow. | +| E | Shopify-vs-Stripe dedup rule (§3.3) — does Finance have a better key than `(order_id, minute)`? | **Need input.** If a `gateway_refund_id` mapping exists in source, use it; otherwise we go with the heuristic and add a monitoring test. | +| F | How should refunds for **shipped** vs **not-yet-shipped** lines be treated? | **Same treatment.** Both reduce net revenue. (Worth noting because some businesses differentiate; we don't think we should.) | +| G | Tax / shipping refunds — included in `amount_in_cents`? | **Assumed yes (gross-of-tax).** Mirrors how `revenue` is currently computed (`quantity * unit_price`, no tax breakout). If Finance reports net-of-tax separately, that's a follow-up. | + +## 6. Tests we'll add + +Following CONTRIBUTING.md (row-level invariants > aggregate reconciliation, no `severity='warn'` on known-broken state): + +**`refund_fact`** +- `unique` + `not_null` on `refund_id`. +- `not_null` on `order_id`, `amount`, `refunded_at`, `source`. +- `amount > 0` (singular test). + +**`order_fact`** +- Singular row-level: per order, `refund_total = sum(refund_fact.amount where order_id = ...)`. +- Singular: `net_revenue = revenue - coalesce(refund_total, 0)` (tautology guard against future drift). +- `severity='warn'` test: count of orders where `refund_total > revenue` (Finance signal, not bug). + +**`order_line_fact`** +- Singular row-level: per order, `sum(order_line_fact.refund_amount) = order_fact.refund_total` (within ±$0.01 to account for pro-rata rounding). +- `not_null` on `refund_allocation_method`. + +**`stg_refunds`** +- Singular: dedup invariant — no `(order_id, refunded_at_minute)` appears across both Shopify and Stripe in the unified output. + +## 7. Reporting / consumer impact + +We checked the `reporting/` layer for anything that names `revenue` and would silently change meaning. 
Current state: nothing in `reporting/` consumes refunds (because they don't exist yet), so we're additive. New reporting models we expect Finance will want: + +- `rpt_net_revenue_by_merchant_month` +- `rpt_refunds_by_reason` (if/when reason codes become available — not in current sources) +- `rpt_refunds_by_tender` (joins to `stg_refund_tenders`) + +We'd rather Finance own those report shapes than guess at column lists. Spec them; we build them. + +## 8. Open questions + +1. **Cross-source key:** does upstream Shopify carry the Stripe `refund_id` (or vice versa)? If yes, we drop the time-window heuristic in §3.3. +2. **Reason codes:** none of the three sources carry refund reasons today. Should we ask the upstream loaders to bring them through? Useful for `rpt_refunds_by_reason`. +3. **Partial-refund cadence:** are multi-event refunds against the same line a real scenario (e.g., partial refund today, second partial next week)? Our model handles it, but it changes test expectations. +4. **Currency:** present data is single-currency cents. If multi-currency is on the roadmap, we should add `currency` to `refund_fact` now rather than retrofit. +5. **The DATA-123 reconciliation:** gross revenue ties to Stripe captures of $12,989,886.01 (Q1, non-test) after the recent fix. Once refunds land, Finance's net-revenue number for Q1 will be **lower** by `sum(refund_fact.amount where refunded_at in Q1)`. We should align on what Finance expects that delta to be before deploy, so we have a sanity check. + +## 9. Rollout plan (after sign-off) + +1. `base/` + `staging/` for the three refund sources, with the dedup logic. +2. `refund_fact` in `dw/`, incremental on `refunded_at`. +3. Update `order_fact` + `order_line_fact` per §3.5–§3.6 and §4. +4. Tests per §6. +5. Spot-check: total Q1 refunds, top-10 refunded orders, allocation method mix. +6. Hand to Finance for parallel validation against their current manual net-revenue calc. +7. Flip the relevant reporting models / dashboards. 
+ +Backfill = one `make full`. Daily incremental thereafter. + +## 10. What this doc is *not* + +- Not a Type-2 history proposal for refund state changes. If a refund gets voided or amended, today we'd see the latest row only. If that becomes a real scenario, it's a follow-up snapshot model — same answer as the Q3-2024 Type-2 question on merchants. +- Not a chargeback model. Stripe disputes / chargebacks are a different upstream and have a different finance treatment (cost of goods sold side, not gross revenue contra). Out of scope here. +- Not a fix to gross-revenue computation. That's DATA-123, separate. diff --git a/models/orders/dw/order_fact.sql b/models/orders/dw/order_fact.sql index ecd392f..6b7e574 100644 --- a/models/orders/dw/order_fact.sql +++ b/models/orders/dw/order_fact.sql @@ -3,108 +3,49 @@ unique_key='order_id' ) }} -WITH shipment_lines AS ( - SELECT - sl.shipment_id - , sl.line_item_id - , sl.quantity_shipped - , li.unit_price - FROM {{ ref('stg_shipment_line_items') }} AS sl - INNER JOIN {{ ref('stg_line_items') }} AS li - ON sl.line_item_id = li.line_item_id -) - -, joined AS ( - SELECT - o.order_id - , o.merchant_id - , o.customer_id - , o.order_status - , o.is_test - , o.ordered_at - , o.paid_at - , s.shipment_id - , s.shipped_at - , sl.line_item_id - , sl.quantity_shipped - , sl.unit_price - FROM {{ ref('stg_orders') }} AS o - LEFT JOIN {{ ref('stg_shipments') }} AS s - ON o.order_id = s.order_id - LEFT JOIN shipment_lines AS sl - ON s.shipment_id = sl.shipment_id -) - -, shipment_totals AS ( - -- aggregated to one row per (order, shipment) +WITH order_revenue AS ( SELECT order_id - , merchant_id - , customer_id - , order_status - , is_test - , ordered_at - , paid_at - , shipment_id - , shipped_at + , sum(quantity * unit_price) AS revenue + , sum(quantity) AS quantity_ordered , count(DISTINCT line_item_id) AS line_count - , sum(quantity_shipped) AS total_quantity - , sum(quantity_shipped * unit_price) AS shipment_revenue - FROM joined - GROUP 
BY order_id, merchant_id, customer_id, order_status, is_test, ordered_at, paid_at, shipment_id, shipped_at + FROM {{ ref('stg_line_items') }} + GROUP BY order_id ) -, shipment_counts AS ( +, order_shipments AS ( SELECT order_id , count(DISTINCT shipment_id) AS shipment_count - FROM shipment_totals + , min(shipped_at) AS first_shipped_at + FROM {{ ref('stg_shipments') }} GROUP BY order_id ) -, enriched AS ( - SELECT - st.order_id - , st.merchant_id - , m.merchant_name - , st.customer_id - , m.customer_type - , st.order_status - , st.is_test - , st.ordered_at - , st.paid_at - , st.shipped_at - , sc.shipment_count - , st.line_count - , st.total_quantity - , st.shipment_revenue AS revenue - FROM shipment_totals AS st - LEFT JOIN {{ ref('lkp_merchants') }} AS m - ON st.merchant_id = m.merchant_id - LEFT JOIN shipment_counts AS sc - ON st.order_id = sc.order_id -) - SELECT - order_id - , merchant_id - , merchant_name - , customer_id - , customer_type - , order_status - , is_test - , ordered_at - , paid_at - , shipped_at - , shipment_count - , line_count - , total_quantity - , revenue + o.order_id + , o.merchant_id + , m.merchant_name + , o.customer_id + , m.customer_type + , o.order_status + , o.is_test + , o.ordered_at + , o.paid_at + , os.first_shipped_at AS shipped_at + , coalesce(os.shipment_count, 0) AS shipment_count + , orev.line_count + , orev.quantity_ordered AS total_quantity + , orev.revenue , current_timestamp AS created_at_dwh , current_timestamp AS updated_at_dwh -FROM enriched +FROM {{ ref('stg_orders') }} AS o +LEFT JOIN order_revenue AS orev + ON o.order_id = orev.order_id +LEFT JOIN order_shipments AS os + ON o.order_id = os.order_id +LEFT JOIN {{ ref('lkp_merchants') }} AS m + ON o.merchant_id = m.merchant_id {% if is_incremental() %} - WHERE ordered_at >= {{ get_incremental_value('updated_at_dwh') }} + WHERE o.ordered_at >= {{ get_incremental_value('ordered_at') }} {% endif %} --- dedupe to one row per order (orders can have multiple shipments) 
-QUALIFY row_number() OVER (PARTITION BY order_id ORDER BY shipped_at) = 1 From f02c459cdf9a3025cea857fde1132cc6686f800d Mon Sep 17 00:00:00 2001 From: atiwary Date: Wed, 20 May 2026 17:00:05 -0500 Subject: [PATCH 2/2] feat: [DATA-145] add refunds modeling + net_revenue on order_fact Implements the design proposal in docs/designs/2026-Q2-refunds-modeling.md that was added in the previous DATA-123 commit. Wires the three refund sources (Stripe, Shopify, Internal POS) into the warehouse, builds a refund_fact, and surfaces refund_total / net_revenue / is_fully_refunded on order_fact plus refund_amount + qty_refunded on order_line_fact. Key decisions (see design doc for full rationale): - refund_fact grain: order x line x refund_event. - Shopify-Stripe dedup via (order_id, refunded_at_minute) heuristic; Shopify wins on occurrence + line + amount, Stripe is tender sidecar. - Pro-rata allocation of order-grain refunds (POS, Stripe-direct) to lines, with explicit refund_allocation_method flag. - Late-arriving refund handling: order_fact incremental filter unions ordered_at >= watermark with orders touched by refunds since the last refund-side watermark. Tests added (row-level invariants per CONTRIBUTING.md): - refund_fact: unique/not_null on refund_event_id, amount > 0. - order_fact: per-order refund_total ties to sum(refund_fact.amount); net_revenue tautology guard; warn-only signal when refund > revenue. - order_line_fact: per-order sum(refund_amount) ties to order_fact refund_total within rounding. - stg_refunds: no (order_id, minute) collision across Shopify+Stripe. Validated: make full clean (23/23); make test clean (41/41). Open items pending Finance sign-off per design doc Sec.5 (e.g., gateway cross-walk vs heuristic, tax/shipping treatment). Tracked there, not blocking this commit. 
--- .../orders/base/base_refunds_internal_pos.sql | 4 + models/orders/base/base_refunds_shopify.sql | 4 + models/orders/base/base_refunds_stripe.sql | 4 + models/orders/base/orders_base.yml | 3 + models/orders/dw/order_fact.sql | 32 ++++ models/orders/dw/order_line_fact.sql | 173 +++++++++++++++++- models/orders/dw/orders_dw.yml | 71 ++++++- models/orders/dw/refund_fact.sql | 23 +++ models/orders/staging/orders_staging.yml | 41 +++++ models/orders/staging/stg_refund_tenders.sql | 14 ++ models/orders/staging/stg_refunds.sql | 121 ++++++++++++ tests/order_fact_net_revenue_definition.sql | 11 ++ tests/order_fact_refund_over_revenue_warn.sql | 14 ++ ..._fact_refund_total_matches_refund_fact.sql | 20 ++ .../order_line_fact_refund_sums_to_order.sql | 20 ++ tests/refund_fact_amount_positive.sql | 11 ++ ...tg_refunds_no_shopify_stripe_collision.sql | 22 +++ 17 files changed, 575 insertions(+), 13 deletions(-) create mode 100644 models/orders/base/base_refunds_internal_pos.sql create mode 100644 models/orders/base/base_refunds_shopify.sql create mode 100644 models/orders/base/base_refunds_stripe.sql create mode 100644 models/orders/dw/refund_fact.sql create mode 100644 models/orders/staging/stg_refund_tenders.sql create mode 100644 models/orders/staging/stg_refunds.sql create mode 100644 tests/order_fact_net_revenue_definition.sql create mode 100644 tests/order_fact_refund_over_revenue_warn.sql create mode 100644 tests/order_fact_refund_total_matches_refund_fact.sql create mode 100644 tests/order_line_fact_refund_sums_to_order.sql create mode 100644 tests/refund_fact_amount_positive.sql create mode 100644 tests/stg_refunds_no_shopify_stripe_collision.sql diff --git a/models/orders/base/base_refunds_internal_pos.sql b/models/orders/base/base_refunds_internal_pos.sql new file mode 100644 index 0000000..e890b9a --- /dev/null +++ b/models/orders/base/base_refunds_internal_pos.sql @@ -0,0 +1,4 @@ +{{ config(materialized='view') }} + +SELECT * +FROM {{ source('raw', 
'refunds_internal_pos') }} diff --git a/models/orders/base/base_refunds_shopify.sql b/models/orders/base/base_refunds_shopify.sql new file mode 100644 index 0000000..8e819c1 --- /dev/null +++ b/models/orders/base/base_refunds_shopify.sql @@ -0,0 +1,4 @@ +{{ config(materialized='view') }} + +SELECT * +FROM {{ source('raw', 'refunds_shopify') }} diff --git a/models/orders/base/base_refunds_stripe.sql b/models/orders/base/base_refunds_stripe.sql new file mode 100644 index 0000000..b4b2482 --- /dev/null +++ b/models/orders/base/base_refunds_stripe.sql @@ -0,0 +1,4 @@ +{{ config(materialized='view') }} + +SELECT * +FROM {{ source('raw', 'refunds_stripe') }} diff --git a/models/orders/base/orders_base.yml b/models/orders/base/orders_base.yml index c03d25c..8169b89 100644 --- a/models/orders/base/orders_base.yml +++ b/models/orders/base/orders_base.yml @@ -15,3 +15,6 @@ sources: - name: shipment_line_items - name: merchants - name: products + - name: refunds_shopify + - name: refunds_stripe + - name: refunds_internal_pos diff --git a/models/orders/dw/order_fact.sql b/models/orders/dw/order_fact.sql index 6b7e574..0b5a711 100644 --- a/models/orders/dw/order_fact.sql +++ b/models/orders/dw/order_fact.sql @@ -22,6 +22,28 @@ WITH order_revenue AS ( GROUP BY order_id ) +, order_refunds AS ( + SELECT + order_id + , sum(refund_amount) AS refund_total + , count(DISTINCT source_refund_id) AS refund_count + , max(refunded_at) AS last_refunded_at + FROM {{ ref('refund_fact') }} + GROUP BY order_id +) + +-- Orders with refund activity since the last incremental run. +-- Watermark column is `last_refunded_at` on this model (order_fact) — any refund +-- with refunded_at strictly greater than the previous max means the order needs +-- a refresh of its refund aggregates. See docs/designs/2026-Q2-refunds-modeling.md §4. 
+, orders_with_new_refunds AS ( + SELECT DISTINCT order_id + FROM {{ ref('refund_fact') }} + {% if is_incremental() %} + WHERE refunded_at > {{ get_incremental_value('last_refunded_at', relation=this) }} + {% endif %} +) + SELECT o.order_id , o.merchant_id @@ -37,6 +59,11 @@ SELECT , orev.line_count , orev.quantity_ordered AS total_quantity , orev.revenue + , coalesce(orf.refund_total, 0) AS refund_total + , coalesce(orf.refund_count, 0) AS refund_count + , orf.last_refunded_at + , orev.revenue - coalesce(orf.refund_total, 0) AS net_revenue + , coalesce(orf.refund_total, 0) >= orev.revenue AND orev.revenue > 0 AS is_fully_refunded , current_timestamp AS created_at_dwh , current_timestamp AS updated_at_dwh FROM {{ ref('stg_orders') }} AS o @@ -44,8 +71,13 @@ LEFT JOIN order_revenue AS orev ON o.order_id = orev.order_id LEFT JOIN order_shipments AS os ON o.order_id = os.order_id +LEFT JOIN order_refunds AS orf + ON o.order_id = orf.order_id LEFT JOIN {{ ref('lkp_merchants') }} AS m ON o.merchant_id = m.merchant_id {% if is_incremental() %} + LEFT JOIN orders_with_new_refunds AS nrf + ON o.order_id = nrf.order_id WHERE o.ordered_at >= {{ get_incremental_value('ordered_at') }} + OR nrf.order_id IS NOT NULL {% endif %} diff --git a/models/orders/dw/order_line_fact.sql b/models/orders/dw/order_line_fact.sql index 18b7858..b14772d 100644 --- a/models/orders/dw/order_line_fact.sql +++ b/models/orders/dw/order_line_fact.sql @@ -3,17 +3,172 @@ unique_key='line_item_id' ) }} +-- One row per order line. +-- Refund attribution (see docs/designs/2026-Q2-refunds-modeling.md §3.4, §3.6): +-- * 'direct' — Shopify-attributed refund, exact line_item_id match. +-- * 'pro_rata' — POS / Stripe-direct order-grain refund, allocated by line_revenue share. +-- * 'mixed' — line has both direct and allocated components (rare). +-- * 'none' — order has no refunds. 
+-- Rounding-drift trick: pro-rata amounts are rounded to cents and the residual +-- is pushed to the largest line per order so sum(line.refund_amount) ties to +-- order_fact.refund_total to the penny. + +WITH refunds_direct AS ( + SELECT + line_item_id + , sum(refund_amount) AS direct_refund_amount + , sum(qty_refunded) AS qty_refunded + FROM {{ ref('refund_fact') }} + WHERE line_item_id IS NOT NULL + GROUP BY line_item_id +) + +, refunds_order_grain AS ( + SELECT + order_id + , sum(refund_amount) AS order_grain_refund_total + FROM {{ ref('refund_fact') }} + WHERE line_item_id IS NULL + GROUP BY order_id +) + +, order_revenue AS ( + SELECT + order_id + , sum(quantity * unit_price) AS order_revenue_total + FROM {{ ref('stg_line_items') }} + GROUP BY order_id +) + +, order_last_refunded AS ( + SELECT + order_id + , max(refunded_at) AS last_refunded_at + FROM {{ ref('refund_fact') }} + GROUP BY order_id +) + +, line_with_allocation AS ( + SELECT + li.line_item_id + , li.order_id + , li.product_id + , li.quantity + , li.unit_price + , li.quantity * li.unit_price AS line_revenue + , orv.order_revenue_total + , rd.direct_refund_amount + , rd.qty_refunded + , rog.order_grain_refund_total + , olr.last_refunded_at + , CASE + WHEN rog.order_grain_refund_total IS NOT NULL AND orv.order_revenue_total > 0 + THEN round( + (li.quantity * li.unit_price) / orv.order_revenue_total * rog.order_grain_refund_total + , 2 + ) + ELSE 0 + END AS allocated_refund_raw + , row_number() OVER ( + PARTITION BY li.order_id + ORDER BY li.quantity * li.unit_price DESC, li.line_item_id + ) AS line_rank + FROM {{ ref('stg_line_items') }} AS li + LEFT JOIN order_revenue AS orv + ON li.order_id = orv.order_id + LEFT JOIN refunds_direct AS rd + ON li.line_item_id = rd.line_item_id + LEFT JOIN refunds_order_grain AS rog + ON li.order_id = rog.order_id + LEFT JOIN order_last_refunded AS olr + ON li.order_id = olr.order_id +) + +, allocation_drift AS ( + SELECT + order_id + , max(order_grain_refund_total) 
- sum(allocated_refund_raw) AS drift + FROM line_with_allocation + WHERE order_grain_refund_total IS NOT NULL + GROUP BY order_id +) + +, lines_with_final_refund AS ( + SELECT + lwa.line_item_id + , lwa.order_id + , lwa.product_id + , lwa.quantity + , lwa.unit_price + , lwa.line_revenue + , lwa.qty_refunded + , lwa.last_refunded_at + , coalesce(lwa.direct_refund_amount, 0) + + lwa.allocated_refund_raw + + CASE + WHEN lwa.line_rank = 1 AND lwa.order_grain_refund_total IS NOT NULL + THEN coalesce(ad.drift, 0) + ELSE 0 + END AS refund_amount_raw + , lwa.direct_refund_amount + , lwa.order_grain_refund_total + FROM line_with_allocation AS lwa + LEFT JOIN allocation_drift AS ad + ON lwa.order_id = ad.order_id +) + +{% if is_incremental() %} + , orders_with_new_refunds AS ( + SELECT DISTINCT order_id + FROM {{ ref('refund_fact') }} + WHERE refunded_at > {{ get_incremental_value('last_refunded_at', relation=this) }} + ) + + , existing_lines AS ( + SELECT line_item_id + FROM {{ this }} + ) +{% endif %} + SELECT - li.line_item_id - , li.order_id - , li.product_id - , li.quantity - , li.unit_price - , li.quantity * li.unit_price AS line_revenue + lwf.line_item_id + , lwf.order_id + , lwf.product_id + , lwf.quantity + , lwf.unit_price + , lwf.line_revenue + , lwf.qty_refunded + , CASE + WHEN lwf.direct_refund_amount IS NULL AND lwf.order_grain_refund_total IS NULL + THEN cast(NULL AS double) + ELSE lwf.refund_amount_raw + END AS refund_amount + , CASE + WHEN lwf.direct_refund_amount IS NOT NULL AND lwf.order_grain_refund_total IS NOT NULL + THEN 'mixed' + WHEN lwf.direct_refund_amount IS NOT NULL + THEN 'direct' + WHEN lwf.order_grain_refund_total IS NOT NULL + THEN 'pro_rata' + ELSE 'none' + END AS refund_allocation_method + , lwf.line_revenue - coalesce( + CASE + WHEN lwf.direct_refund_amount IS NULL AND lwf.order_grain_refund_total IS NULL + THEN 0 + ELSE lwf.refund_amount_raw + END + , 0 + ) AS net_line_revenue + , lwf.last_refunded_at , current_timestamp AS created_at_dwh 
, current_timestamp AS updated_at_dwh -FROM {{ ref('stg_line_items') }} AS li - +FROM lines_with_final_refund AS lwf {% if is_incremental() %} - WHERE li.line_item_id NOT IN (SELECT t.line_item_id FROM {{ this }} AS t) + LEFT JOIN existing_lines AS el + ON lwf.line_item_id = el.line_item_id + LEFT JOIN orders_with_new_refunds AS nrf + ON lwf.order_id = nrf.order_id + WHERE el.line_item_id IS NULL + OR nrf.order_id IS NOT NULL {% endif %} diff --git a/models/orders/dw/orders_dw.yml b/models/orders/dw/orders_dw.yml index 72d5811..69e1fa6 100644 --- a/models/orders/dw/orders_dw.yml +++ b/models/orders/dw/orders_dw.yml @@ -2,7 +2,7 @@ version: 2 models: - name: order_fact - description: One row per order with revenue and shipment metadata. + description: One row per order with revenue, refund aggregates, and shipment metadata. columns: - name: order_id description: Primary key. @@ -25,13 +25,25 @@ models: - name: total_quantity - name: revenue description: '{{ doc("order_fact_revenue") }}' + - name: refund_total + description: Sum of refund_fact.refund_amount for this order, in dollars. Zero when no refunds. + tests: + - not_null + - name: refund_count + description: Distinct count of source_refund_id (logical refund events) for this order. + tests: + - not_null + - name: last_refunded_at + description: Max refunded_at across refunds for this order. NULL when no refunds. + - name: net_revenue + description: revenue minus refund_total. Can be negative on goodwill / overpayments. + - name: is_fully_refunded + description: TRUE when refund_total >= revenue and revenue > 0. - name: created_at_dwh - name: updated_at_dwh - name: order_line_fact - description: | - Skeleton — one row per order line. Will be extended with refund allocations - in DATA-456 (Problem 2 / refunds work). + description: One row per order line. Carries line-level refund attribution (direct or pro-rata allocated). 
columns: - name: line_item_id tests: @@ -42,3 +54,54 @@ models: - name: quantity - name: unit_price - name: line_revenue + - name: qty_refunded + description: Quantity refunded (Shopify line-grain only; NULL for pro-rata allocated lines). + - name: refund_amount + description: Dollar refund attributed to this line — direct from Shopify or pro-rata allocated. + - name: refund_allocation_method + description: One of `direct`, `pro_rata`, `mixed`, `none`. See refunds design doc §3.4. + tests: + - not_null + - accepted_values: + values: ['direct', 'pro_rata', 'mixed', 'none'] + - name: net_line_revenue + description: line_revenue minus refund_amount. + - name: last_refunded_at + description: Max refunded_at on the line's order (denormalized for incremental watermark). + - name: created_at_dwh + - name: updated_at_dwh + + - name: refund_fact + description: One row per refund event at order × line × source grain. See docs/designs/2026-Q2-refunds-modeling.md. + columns: + - name: refund_event_id + description: Surrogate PK — md5(source, source_refund_id, coalesce(line_item_id, 'ORDER')). + tests: + - unique + - not_null + - name: source + description: One of `shopify`, `stripe`, `internal_pos`. + tests: + - not_null + - accepted_values: + values: ['shopify', 'stripe', 'internal_pos'] + - name: source_refund_id + description: The source system's own refund_id (kept for traceability). + tests: + - not_null + - name: order_id + tests: + - not_null + - name: line_item_id + description: NULL for POS and Stripe-direct refunds (order-grain only). + - name: qty_refunded + description: Shopify-only; NULL otherwise. + - name: refund_amount + description: Dollar amount of the refund event. 
+ tests: + - not_null + - name: refunded_at + tests: + - not_null + - name: created_at_dwh + - name: updated_at_dwh diff --git a/models/orders/dw/refund_fact.sql b/models/orders/dw/refund_fact.sql new file mode 100644 index 0000000..6c2cd60 --- /dev/null +++ b/models/orders/dw/refund_fact.sql @@ -0,0 +1,23 @@ +{{ config( + materialized='incremental', + unique_key='refund_event_id' +) }} + +-- One row per refund event at order × line × source grain. +-- See docs/designs/2026-Q2-refunds-modeling.md for modeling decisions. + +SELECT + r.refund_event_id + , r.source + , r.source_refund_id + , r.order_id + , r.line_item_id + , r.qty_refunded + , r.refund_amount + , r.refunded_at + , current_timestamp AS created_at_dwh + , current_timestamp AS updated_at_dwh +FROM {{ ref('stg_refunds') }} AS r +{% if is_incremental() %} + WHERE r.refunded_at >= {{ get_incremental_value('refunded_at') }} +{% endif %} diff --git a/models/orders/staging/orders_staging.yml b/models/orders/staging/orders_staging.yml index 2ac43f6..feac85b 100644 --- a/models/orders/staging/orders_staging.yml +++ b/models/orders/staging/orders_staging.yml @@ -49,3 +49,44 @@ models: # stg_shipment_line_items — bare (undocumented) - name: stg_shipment_line_items + + - name: stg_refunds + description: | + Unified, deduped refund staging. See docs/designs/2026-Q2-refunds-modeling.md §3.3 + for the Shopify ↔ Stripe dedup heuristic. + columns: + - name: refund_event_id + description: Surrogate PK derived from (source, source_refund_id, line_item_id). + tests: + - unique + - not_null + - name: source + tests: + - not_null + - accepted_values: + values: ['shopify', 'stripe', 'internal_pos'] + - name: order_id + tests: + - not_null + - name: refund_amount + tests: + - not_null + - name: refunded_at + tests: + - not_null + + - name: stg_refund_tenders + description: Stripe tender-split sidecar. Joins to stg_refunds on (order_id, refunded_at_minute). 
+ columns: + - name: source_refund_id + tests: + - not_null + - name: order_id + tests: + - not_null + - name: tender_type + tests: + - not_null + - name: tender_amount + tests: + - not_null diff --git a/models/orders/staging/stg_refund_tenders.sql b/models/orders/staging/stg_refund_tenders.sql new file mode 100644 index 0000000..1730ffb --- /dev/null +++ b/models/orders/staging/stg_refund_tenders.sql @@ -0,0 +1,14 @@ +{{ config(materialized='view') }} + +-- Stripe tender-split sidecar. One row per (refund event, tender_type). +-- Joins back to stg_refunds on (order_id, refunded_at_minute) — see §3.3 of the +-- refunds design doc for why this is the chosen join key. + +SELECT + refund_id AS source_refund_id + , order_id + , tender_type + , amount_in_cents / 100.0 AS tender_amount + , CAST(processed_at AS timestamp) AS refunded_at + , DATE_TRUNC('minute', CAST(processed_at AS timestamp)) AS refunded_at_minute +FROM {{ ref('base_refunds_stripe') }} diff --git a/models/orders/staging/stg_refunds.sql b/models/orders/staging/stg_refunds.sql new file mode 100644 index 0000000..5a47772 --- /dev/null +++ b/models/orders/staging/stg_refunds.sql @@ -0,0 +1,121 @@ +{{ config(materialized='view') }} + +-- Unified refund staging across the three source systems. +-- +-- Dedup precedence (see docs/designs/2026-Q2-refunds-modeling.md §3.3): +-- 1. Shopify wins on (order_id, refunded_at_minute) — line-grain, system of record for amount. +-- 2. Internal POS is additive (standalone register, no e-commerce pairing). +-- 3. Stripe rows are kept only when no Shopify pairing exists at the same (order_id, minute). +-- Stripe tender breakdown moves to stg_refund_tenders. +-- +-- Heuristic limitation: the dedup join uses minute-truncated timestamps because +-- the source systems carry no shared refund key. If/when upstream provides a +-- gateway_refund_id cross-walk, swap that in here. 
+ +WITH shopify AS ( + SELECT + refund_id AS source_refund_id + , order_id + , line_item_id + , qty_refunded + , amount_in_cents / 100.0 AS refund_amount + , CAST(refunded_at AS timestamp) AS refunded_at + , DATE_TRUNC('minute', CAST(refunded_at AS timestamp)) AS refunded_at_minute + FROM {{ ref('base_refunds_shopify') }} +) + +, stripe_raw AS ( + SELECT + refund_id AS source_refund_id + , order_id + , tender_type + , amount_in_cents / 100.0 AS refund_amount + , CAST(processed_at AS timestamp) AS refunded_at + , DATE_TRUNC('minute', CAST(processed_at AS timestamp)) AS refunded_at_minute + FROM {{ ref('base_refunds_stripe') }} +) + +-- Collapse Stripe tender splits to one row per refund event for dedup against Shopify. +, stripe_events AS ( + SELECT + order_id + , refunded_at_minute + , MIN(refunded_at) AS refunded_at + , MIN(source_refund_id) AS source_refund_id + , SUM(refund_amount) AS refund_amount + FROM stripe_raw + GROUP BY order_id, refunded_at_minute +) + +, shopify_keys AS ( + SELECT DISTINCT + order_id + , refunded_at_minute + FROM shopify +) + +, pos AS ( + SELECT + refund_id AS source_refund_id + , order_id + , amount_in_cents / 100.0 AS refund_amount + , CAST(refunded_at AS timestamp) AS refunded_at + , DATE_TRUNC('minute', CAST(refunded_at AS timestamp)) AS refunded_at_minute + FROM {{ ref('base_refunds_internal_pos') }} +) + +, unified AS ( + SELECT + 'shopify' AS source + , source_refund_id + , order_id + , line_item_id + , qty_refunded + , refund_amount + , refunded_at + , refunded_at_minute + FROM shopify + + UNION ALL + + SELECT + 'internal_pos' AS source + , source_refund_id + , order_id + , CAST(NULL AS varchar) AS line_item_id + , CAST(NULL AS bigint) AS qty_refunded + , refund_amount + , refunded_at + , refunded_at_minute + FROM pos + + UNION ALL + + -- Stripe-direct: only refunds with no Shopify counterpart at the same (order_id, minute). 
+ SELECT + 'stripe' AS source + , se.source_refund_id + , se.order_id + , CAST(NULL AS varchar) AS line_item_id + , CAST(NULL AS bigint) AS qty_refunded + , se.refund_amount + , se.refunded_at + , se.refunded_at_minute + FROM stripe_events AS se + LEFT JOIN shopify_keys AS sk + ON se.order_id = sk.order_id + AND se.refunded_at_minute = sk.refunded_at_minute + WHERE sk.order_id IS NULL +) + +SELECT + MD5(source || '-' || source_refund_id || '-' || COALESCE(line_item_id, 'ORDER')) AS refund_event_id + , source + , source_refund_id + , order_id + , line_item_id + , qty_refunded + , refund_amount + , refunded_at + , refunded_at_minute +FROM unified diff --git a/tests/order_fact_net_revenue_definition.sql b/tests/order_fact_net_revenue_definition.sql new file mode 100644 index 0000000..aa70909 --- /dev/null +++ b/tests/order_fact_net_revenue_definition.sql @@ -0,0 +1,11 @@ +-- Tautology guard: net_revenue must equal revenue - refund_total at row level. +-- Catches future drift if someone changes one column's formula and not the other. + +SELECT + order_id + , revenue + , refund_total + , net_revenue + , revenue - refund_total AS expected_net_revenue +FROM {{ ref('order_fact') }} +WHERE abs(net_revenue - (revenue - refund_total)) > 0.01 diff --git a/tests/order_fact_refund_over_revenue_warn.sql b/tests/order_fact_refund_over_revenue_warn.sql new file mode 100644 index 0000000..d9eef5f --- /dev/null +++ b/tests/order_fact_refund_over_revenue_warn.sql @@ -0,0 +1,14 @@ +{{ config(severity='warn') }} + +-- Finance signal (not a bug): orders where refund_total exceeds revenue +-- (goodwill, over-refund, or data quality issue at the source). Surfaces as a +-- dbt test warning so Finance is notified but the build doesn't fail. 
+ +SELECT + order_id + , revenue + , refund_total + , refund_total - revenue AS over_refund +FROM {{ ref('order_fact') }} +WHERE refund_total > revenue + 0.01 + AND coalesce(lower(cast(is_test AS varchar)), 'false') != 'true' diff --git a/tests/order_fact_refund_total_matches_refund_fact.sql b/tests/order_fact_refund_total_matches_refund_fact.sql new file mode 100644 index 0000000..0cfb727 --- /dev/null +++ b/tests/order_fact_refund_total_matches_refund_fact.sql @@ -0,0 +1,20 @@ +-- Per-order: order_fact.refund_total must equal sum(refund_fact.refund_amount). +-- Row-level invariant (per CONTRIBUTING.md: row-level > aggregate reconciliation). + +WITH fact_refunds AS ( + SELECT + order_id + , sum(refund_amount) AS expected_refund_total + FROM {{ ref('refund_fact') }} + GROUP BY order_id +) + +SELECT + f.order_id + , f.refund_total + , coalesce(fr.expected_refund_total, 0) AS expected_refund_total + , f.refund_total - coalesce(fr.expected_refund_total, 0) AS diff +FROM {{ ref('order_fact') }} AS f +LEFT JOIN fact_refunds AS fr + ON f.order_id = fr.order_id +WHERE abs(f.refund_total - coalesce(fr.expected_refund_total, 0)) > 0.01 diff --git a/tests/order_line_fact_refund_sums_to_order.sql b/tests/order_line_fact_refund_sums_to_order.sql new file mode 100644 index 0000000..911f7dc --- /dev/null +++ b/tests/order_line_fact_refund_sums_to_order.sql @@ -0,0 +1,20 @@ +-- Per-order: sum(order_line_fact.refund_amount) must equal order_fact.refund_total +-- within ±$0.01 to allow for pro-rata rounding drift handled in order_line_fact. 
+ +WITH line_totals AS ( + SELECT + order_id + , sum(coalesce(refund_amount, 0)) AS line_refund_sum + FROM {{ ref('order_line_fact') }} + GROUP BY order_id +) + +SELECT + f.order_id + , f.refund_total + , coalesce(lt.line_refund_sum, 0) AS line_refund_sum + , f.refund_total - coalesce(lt.line_refund_sum, 0) AS diff +FROM {{ ref('order_fact') }} AS f +LEFT JOIN line_totals AS lt + ON f.order_id = lt.order_id +WHERE abs(f.refund_total - coalesce(lt.line_refund_sum, 0)) > 0.01 diff --git a/tests/refund_fact_amount_positive.sql b/tests/refund_fact_amount_positive.sql new file mode 100644 index 0000000..a0d7460 --- /dev/null +++ b/tests/refund_fact_amount_positive.sql @@ -0,0 +1,11 @@ +-- All refund amounts must be strictly positive. A non-positive refund_amount +-- indicates either a sign-flip bug at ingest or a chargeback masquerading as a +-- refund (chargebacks are explicitly out of scope per §10 of the design doc). + +SELECT + refund_event_id + , source + , source_refund_id + , refund_amount +FROM {{ ref('refund_fact') }} +WHERE refund_amount <= 0 diff --git a/tests/stg_refunds_no_shopify_stripe_collision.sql b/tests/stg_refunds_no_shopify_stripe_collision.sql new file mode 100644 index 0000000..9ba5d5b --- /dev/null +++ b/tests/stg_refunds_no_shopify_stripe_collision.sql @@ -0,0 +1,22 @@ +-- Dedup invariant for §3.3: no (order_id, refunded_at_minute) appears in +-- stg_refunds from BOTH the shopify and stripe sources. If this fires, the +-- minute-truncation heuristic missed a collision and the same logical refund +-- is being double-counted. + +WITH per_source AS ( + SELECT + order_id + , refunded_at_minute + , source + FROM {{ ref('stg_refunds') }} + WHERE source IN ('shopify', 'stripe') + GROUP BY order_id, refunded_at_minute, source +) + +SELECT + order_id + , refunded_at_minute + , count(DISTINCT source) AS source_count +FROM per_source +GROUP BY order_id, refunded_at_minute +HAVING count(DISTINCT source) > 1
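Reviewer note on the pro-rata rounding handled in `order_line_fact` and checked by `tests/order_line_fact_refund_sums_to_order.sql`: the arithmetic can be sketched outside the warehouse as follows. This is a minimal illustration, not the model's implementation. The function name `allocate_refund` and the dict input shape are hypothetical, and it uses `Decimal` where the SQL uses `round(..., 2)` on doubles. It mirrors the same three steps: round each line's share to cents, compute the residual against the order-grain total, and push that residual onto the rank-1 line (largest `line_revenue`, ties broken by `line_item_id`).

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def allocate_refund(line_revenues, order_refund):
    """Split an order-grain refund across lines by revenue share.

    line_revenues: dict of line_item_id -> Decimal line revenue (hypothetical shape).
    order_refund:  Decimal refund total recorded at order grain.
    Returns a dict of line_item_id -> Decimal allocated refund, summing exactly
    to order_refund.
    """
    if not line_revenues:
        return {}

    total = sum(line_revenues.values(), Decimal("0"))
    if total > 0:
        # Step 1: revenue-share allocation, rounded to cents per line.
        shares = {
            line_id: (rev / total * order_refund).quantize(CENT, rounding=ROUND_HALF_UP)
            for line_id, rev in line_revenues.items()
        }
    else:
        # Zero-revenue order: nothing to pro-rate, so every share starts at 0.
        shares = {line_id: Decimal("0.00") for line_id in line_revenues}

    # Step 2: residual between the order-grain total and the rounded shares.
    drift = order_refund - sum(shares.values(), Decimal("0"))

    # Step 3: rank-1 line = largest revenue, ties broken by smallest line_item_id.
    top_line = min(line_revenues, key=lambda k: (-line_revenues[k], k))
    shares[top_line] += drift
    return shares


if __name__ == "__main__":
    lines = {"a": Decimal("10.00"), "b": Decimal("10.00"), "c": Decimal("10.00")}
    for line_id, amount in allocate_refund(lines, Decimal("10.00")).items():
        print(line_id, amount)
```

With three equal lines and a 10.00 refund, each share rounds to 3.33 and the 0.01 residual lands on the first line, so the line sum ties to the order total exactly. A zero-revenue order sends the whole refund to the rank-1 line, which matches the `ELSE 0` branch plus the drift adjustment in the model.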