diff --git a/docs/presentations/refunds-net-revenue-finance-readout.md b/docs/presentations/refunds-net-revenue-finance-readout.md new file mode 100644 index 0000000..ec24de9 --- /dev/null +++ b/docs/presentations/refunds-net-revenue-finance-readout.md @@ -0,0 +1,307 @@ +# Refunds & Net Revenue — Finance Readout + +Paste-ready outline for Google Slides. One slide per `---`-separated block. +For each slide: title goes in the slide title field, body goes in the body, +and the **Speaker notes:** block goes in the notes pane. + +Audience: Jamie + finance stakeholders. Lean outcome-focused, not jargon-heavy. + +--- + +## Slide 1 — Title + +**Refunds & Net Revenue** +A design readout for Finance + +Presented by: Data Engineering +Date: [fill in] + +**Speaker notes:** +This readout covers two things — what we fixed in the Q1 revenue numbers, and +how we'll bring refunds into the warehouse so finance can pull net revenue and +reconcile to Stripe settlement. ~10 slides. We need 2 decisions from finance by +the end (3 are already decided as of this draft). + +--- + +## Slide 2 — TL;DR + +- **Q1 revenue is fixed.** `order_fact.revenue` now ties to Stripe captures exactly: **$12,989,886.01**. +- **Refunds aren't in the warehouse yet.** Net revenue and Stripe-settlement reconciliation are blocked until we land that work. +- **2 decisions still open** for finance (5 total — 3 already locked in). + +**Speaker notes:** +Three things on this slide. The first is closed — DATA-123 is fixed and a PR +is open for review. The second is the work we're scoping today. The third is +the ask: we need finance to weigh in on a few design questions before we can +build cleanly. The five questions are on slide 9. + +--- + +## Slide 3 — Q1 revenue gap: found and fixed (DATA-123) + +**Before:** +- `order_fact.revenue` (non-test) = **$10,522,258.04** +- Stripe captures (target) = **$12,989,886.01** +- Gap: **$2,467,627.97** + +**Root cause:** revenue was computed from shipments, not orders. 
Two interacting bugs: +- ~$1.55M lost — orders without shipments yet were silently dropped +- ~$0.92M lost — orders with multiple shipments only counted the first one + +**After fix:** $12,989,886.01 — matches Stripe to the penny. +PR open (PR #5). Reconciliation test now runs at `error` severity (build fails loudly if this ever drifts again). + +**Speaker notes:** +The Q1 number was off by ~$2.5M. Root cause: the model was building revenue +from the wrong table — shipments instead of line items. The design doc actually +said revenue should come from line items; the contractor's code didn't follow +the spec. Fixed by rewriting the model to match the documented definition. + +We also bumped the reconciliation test from "warning" to "error" — so if this +class of bug ever returns, dbt runs fail loudly instead of just logging a warning. + +--- + +## Slide 4 — What's next: net revenue + Stripe settlement recon + +Finance needs: +- `refund_amount` and `net_revenue` columns on `order_fact` +- Per-line refund amounts on `order_line_fact` +- A reconciliation column that matches Stripe settlement reports + +What we have to build with: **three refund sources in raw** — `refunds_stripe`, `refunds_shopify`, `refunds_internal_pos`. + +Each source records refunds from a different perspective. That's where the design problem lives. + +**Speaker notes:** +This is the work we're scoping. Two goals: produce net revenue numbers finance +can use day-to-day, and let Jamie tie a column directly to the Stripe settlement +report. We have three refund sources to merge. Calling that out upfront because +it's the meat of the problem — it's not "load one table and add a column," it's +"reconcile three sources of the same events." + +--- + +## Slide 5 — What the refund data looks like + +| Source | Has line_item_id? | Tender split? 
| Refund grain in source | +|---|---|---|---| +| Shopify | **Yes** | No | One row per (refund, line) with qty refunded | +| Stripe | No | **Yes** (card, store_credit) | One row per (refund, tender_type) | +| Internal POS | No | No | One row per refund | + +**Same refund event can appear in multiple sources for the same order.** +Example: a card refund processed through Shopify shows in both Shopify (line-level) AND Stripe (payment-level). + +**Speaker notes:** +Each source records refunds from its own perspective. Shopify is the +order-management view — it knows which product line and how many units were +returned. Stripe is the payment-processor view — it knows how the money flowed +back, split by tender type (card vs. store credit). Internal POS is a catch-all +for non-Shopify, non-Stripe events. + +The kicker is that the same refund often appears in two sources. That's the +central design problem we have to solve. + +--- + +## Slide 6 — Watch-outs: 4 traps in the raw data + +We inspected all 12 refund rows currently in raw. **Every grain trap that'll bite us at production scale already shows up in this small sample.** + +| # | Trap | Concrete example from raw data | +|---|---|---| +| 1 | **Cross-source double-count** | O005064: refunded once, recorded in BOTH Shopify ($1,789.05) AND Stripe ($894.52 card + $894.53 store_credit). Naive union = $3,578.10 (2x the real refund). | +| 2 | **Within-Stripe tender split** | One refund event becomes multiple Stripe rows when split between card and store credit (same `processed_at`). Even within Stripe, `refund_id` ≠ refund event. | +| 3 | **Sub-line partial-qty refunds** | O000015: a line had qty 3, customer returned qty 1. Refund grain is *finer* than line grain. Fine for dollar sums; matters if anyone ever wants "qty refunded per line." | +| 4 | **Order-grain refunds, no line detail** | O000286 (Stripe only) and O009009 (POS only): full refund recorded, multi-line orders, source doesn't say which line. 
We have to allocate. | + +**Speaker notes:** +This is the most important slide. The data we have today is tiny — 12 raw rows +— but it already contains every grain trap we'll hit at scale. Walking through +each: + +[Trap 1] When a card refund flows through Shopify, both Shopify and Stripe +record it independently. If we naively union the sources, we double-count. + +[Trap 2] Stripe splits a single refund into multiple rows when it's part-card +part-store-credit. Same event, two rows. + +[Trap 3] Customer returns 1 of 3 units on a line. Refund amount is fine; if +finance ever wants "quantity refunded per line" detail, that lives at a finer +grain (a separate model). + +[Trap 4] Stripe-only and POS-only refunds give us no line attribution at all. +For per-line refund amounts, we have to allocate. + +These four traps are the design pressure. The model below is shaped around +handling all of them correctly. + +--- + +## Slide 7 — Proposed model + +**Two new tables**, separate from `order_fact` (per the prior 2024-Q3 design doc — refund logic shouldn't live on the order grain, or status/revenue reasoning gets tangled again). + +- **`refund_fact`** — one row per *real* refund event (de-duped across sources). + - Columns: `order_id`, `refunded_at`, `refund_amount`, `source_system`, `has_line_attribution`, **`stripe_settled_amount`** (this is the Stripe-recon column) + +- **`refund_line_fact`** — one row per (refund event, line item). + - Direct `line_item_id` from Shopify when present. + - Pro-rata allocation by line revenue when source doesn't carry line detail. 
← *Decision #4, confirmed* + +**Then denormalized columns added to existing facts:** +- `order_fact.cash_refund_amount` — card refunds + POS refunds (what reduces revenue) +- `order_fact.store_credit_issued` — separate, tracked as deferred liability ← *Decision #3* +- `order_fact.net_revenue` = `revenue − cash_refund_amount` (store credit excluded until redeemed) +- `order_line_fact.cash_refund_amount`, `order_line_fact.net_line_revenue` + +**Invariant:** `cash_refund_amount + store_credit_issued = total_refund_amount` (per order). + +**Speaker notes:** +Two new tables. The reason we don't just bolt all this onto order_fact is that +the prior design doc was specific: keep refund logic off the order grain. The +contractor warned us that mixing refund logic with status/revenue logic on a +single table creates the kind of bug DATA-123 was — and we'd be re-creating +that problem. + +So: refund *logic* (dedup, attribution, allocation, tender breakdown) lives in +the new tables. The order tables only get the *numbers* added as denormalized +columns. That way finance still gets `net_revenue` directly on `order_fact` +without paying the design cost. + +Worth highlighting the split: `cash_refund_amount` and `store_credit_issued` +are tracked separately on order_fact because they mean different things to +finance. Cash refunds reduce revenue immediately. Store credit is a deferred +liability — revenue stays whole until the credit is redeemed, at which point +it offsets a future order. We don't have store-credit-redemption data today, +so we just isolate the liability column for now and flag the offset modeling +as a follow-up. + +--- + +## Slide 8 — What finance can do with this + +| Question finance wants to answer | Query | +|---|---| +| Q1 net revenue? | `sum(net_revenue)` from `order_fact` where `ordered_at` in Q1 | +| Net revenue by week? 
| Same, group by week of `ordered_at` ← *Decision #5: refunds attribute back to order date* | +| Tie out to Stripe settlement for May? | `sum(stripe_settled_amount)` from `refund_fact` where `refunded_at` in May | +| Outstanding store-credit liability? | `sum(store_credit_issued)` from `order_fact` | +| Why did this order's net revenue drop? | `refund_fact` rows for that order | +| Which line on this order got refunded? | `refund_line_fact` rows for that order | + +**Net revenue definition** (per Decision #3): `revenue − cash_refund_amount`. Store credit does **not** reduce net revenue at issuance — it stays as a deferred liability until redeemed. + +**Built-in guardrail:** dbt test asserts `sum(order_line_fact.cash_refund_amount per order) = order_fact.cash_refund_amount`. Stops the next DATA-123-style drift before it ships. + +**Speaker notes:** +This is the day-to-day use case slide. If Jamie can answer all six of these +without a custom one-off query, the model is doing its job. + +Two things worth flagging: (1) net revenue by date uses the *order* date, not +the refund date — that's Decision #5. It means a Q2 refund of a Q1 order shows +up in Q1, so Q1's net revenue can still move after Q1 closes. That's the +accrual-accounting view, which matches how finance usually thinks. (2) The +reconciliation column for Stripe settlement is on `refund_fact`, not +`order_fact` — keeps the order table narrow. + +The dbt guardrail at the bottom is the lesson from DATA-123: build the +reconciliation test up front, run at error severity, so we catch drift the +moment it happens. + +--- + +## Slide 9 — Decisions from finance + +Three locked in. Two open. + +| # | Question | Status | Direction | +|---|---|---|---| +| 1 | Same refund in Shopify AND Stripe — which is canonical? | **Open** | Default: Shopify (has line-level detail) | +| 2 | Stripe settlement scope — card only, or all tenders? 
| **Open** | Default: card only (matches what Stripe actually settles) | +| 3 | Store credit treatment — refund or deferred liability? | **Decided** | **Deferred liability.** Doesn't reduce net revenue until redeemed ✓ | +| 4 | Allocate per-line refunds when source has no `line_item_id` — how? | **Decided** | Pro-rata by line revenue ✓ | +| 5 | "Q1 net revenue" — refunds bucket by order date or refund date? | **Decided** | **Order date** ✓ | + +Slack draft ready for Jamie on #1 and #2. + +**Speaker notes:** +Decision summary. Three locked in: +- #3: store credit is a deferred liability, not a revenue reversal. Net revenue + only moves when actual cash moves (card + POS refunds). Store credit gets its + own column, tracked as a liability until redeemed. +- #4: pro-rata allocation for per-line refunds when the source doesn't carry + line detail. Preserves the order = sum-of-lines invariant. +- #5: net revenue by date attributes refunds back to the original order's + date. Means a later refund restates the original order's period (a Q2 refund + of a Q1 order lowers Q1 net revenue), which is the accrual view finance + typically wants. + +Two still open — dedup precedence and Stripe settlement scope. Slack draft is +ready for Jamie. Defaults in the table are what we'd build if no input, but +we'd rather get her read explicitly. 
+ +--- + +## Slide 10 — Timeline & next steps + +**Done** +- DATA-123 (Q1 revenue fix) — PR #5 open for review + +**Pending finance** +- Decisions #1 and #2 (Slack draft ready to send) + +**After decisions land** +- Short design doc → review → build: + - `base_refunds_*` (one per source) + - `stg_refunds` (unioned, canonical schema) + - `refund_fact` (de-duped refund events) + - `refund_line_fact` (per-line allocations) + - `refund_amount`, `net_revenue` columns on `order_fact` and `order_line_fact` + - Reconciliation tests at `error` severity +- Estimate: ~2 days build + 1 day review + +**Follow-ups (not in this scope)** +- **Store credit redemption modeling.** When a customer redeems store credit on a future order, that's when net revenue should reduce. Needs a store-credit-redemption data feed we don't have yet — `store_credit_issued` is the liability holder until then. +- Stripe settlement-report ingestion (automate the recon vs manual) +- Per-line *qty* refund detail (only if finance ever needs it) + +**Speaker notes:** +DATA-123 is closed pending merge. Refunds work is blocked on Jamie's input +but can kick off the day after we hear back. Build is ~2 days, review another +day, so end-to-end ~3 working days from green-light. + +Three follow-ups are explicitly out of scope. Calling them out so they don't +surprise anyone later — store credit redemption matters most: today +`store_credit_issued` just accumulates as a liability column. Once we get a +redemption feed, we can close that loop and reduce net revenue at redemption +time. Until then, finance can still see total outstanding store-credit +liability per merchant by summing the column. + +--- + +## Appendix — Source data for reference + +12 raw refund rows total across the three sources. Sample below for completeness; happy to walk through any specific row. 
+ +``` +refunds_stripe (5 rows): + STR000002 O000286 card $1,925.22 2025-03-09 + STR000005 O005064 card $894.52 2026-04-18 + STR000006 O005064 store_credit $894.53 2026-04-18 ← same event as STR000005 + STR000008 O007544 card $951.42 2025-11-21 + STR000009 O007544 store_credit $951.43 2025-11-21 ← same event as STR000008 + +refunds_shopify (3 rows): + SHF000001 O000015 L0000029 qty=1 $367.70 2026-02-01 + SHF000004 O005064 L0009590 qty=5 $1,789.05 2026-04-18 ← same event as STR000005+006 + SHF000007 O007544 L0014230 qty=5 $1,902.85 2025-11-21 ← same event as STR000008+009 + +refunds_internal_pos (1 row): + POS000003 O009009 $1,212.11 2026-05-11 +``` + +**Speaker notes:** +Backup slide if anyone wants to walk the actual rows. The arrows mark the +cross-source duplicates that drive Decision #1. diff --git a/models/orders/base/base_refunds_internal_pos.sql b/models/orders/base/base_refunds_internal_pos.sql new file mode 100644 index 0000000..e890b9a --- /dev/null +++ b/models/orders/base/base_refunds_internal_pos.sql @@ -0,0 +1,4 @@ +{{ config(materialized='view') }} + +SELECT * +FROM {{ source('raw', 'refunds_internal_pos') }} diff --git a/models/orders/base/base_refunds_shopify.sql b/models/orders/base/base_refunds_shopify.sql new file mode 100644 index 0000000..8e819c1 --- /dev/null +++ b/models/orders/base/base_refunds_shopify.sql @@ -0,0 +1,4 @@ +{{ config(materialized='view') }} + +SELECT * +FROM {{ source('raw', 'refunds_shopify') }} diff --git a/models/orders/base/base_refunds_stripe.sql b/models/orders/base/base_refunds_stripe.sql new file mode 100644 index 0000000..b4b2482 --- /dev/null +++ b/models/orders/base/base_refunds_stripe.sql @@ -0,0 +1,4 @@ +{{ config(materialized='view') }} + +SELECT * +FROM {{ source('raw', 'refunds_stripe') }} diff --git a/models/orders/base/orders_base.yml b/models/orders/base/orders_base.yml index c03d25c..8169b89 100644 --- a/models/orders/base/orders_base.yml +++ b/models/orders/base/orders_base.yml @@ -15,3 +15,6 @@ sources: 
- name: shipment_line_items - name: merchants - name: products + - name: refunds_shopify + - name: refunds_stripe + - name: refunds_internal_pos diff --git a/models/orders/dw/order_fact.sql b/models/orders/dw/order_fact.sql index ecd392f..360710f 100644 --- a/models/orders/dw/order_fact.sql +++ b/models/orders/dw/order_fact.sql @@ -3,86 +3,73 @@ unique_key='order_id' ) }} -WITH shipment_lines AS ( - SELECT - sl.shipment_id - , sl.line_item_id - , sl.quantity_shipped - , li.unit_price - FROM {{ ref('stg_shipment_line_items') }} AS sl - INNER JOIN {{ ref('stg_line_items') }} AS li - ON sl.line_item_id = li.line_item_id -) +-- Note on refund columns: refunds usually post AFTER ordered_at, so in +-- incremental mode the refund_* / net_revenue columns on already-loaded +-- rows can go stale. Run `make full` periodically (or switch to a refund- +-- aware incremental strategy) to keep the denormalized refund totals in +-- sync. The canonical refund data always lives in refund_fact. -, joined AS ( +WITH order_revenue AS ( + -- Revenue at the order grain comes from line items, per the + -- `order_fact_revenue` doc: sum(quantity * unit_price), gross of + -- refunds. Includes orders that have not shipped yet. 
SELECT - o.order_id - , o.merchant_id - , o.customer_id - , o.order_status - , o.is_test - , o.ordered_at - , o.paid_at - , s.shipment_id - , s.shipped_at - , sl.line_item_id - , sl.quantity_shipped - , sl.unit_price - FROM {{ ref('stg_orders') }} AS o - LEFT JOIN {{ ref('stg_shipments') }} AS s - ON o.order_id = s.order_id - LEFT JOIN shipment_lines AS sl - ON s.shipment_id = sl.shipment_id + order_id + , count(1) AS line_count + , sum(quantity) AS total_quantity + , sum(quantity * unit_price) AS revenue + FROM {{ ref('stg_line_items') }} + GROUP BY order_id ) -, shipment_totals AS ( - -- aggregated to one row per (order, shipment) +, order_shipments AS ( SELECT order_id - , merchant_id - , customer_id - , order_status - , is_test - , ordered_at - , paid_at - , shipment_id - , shipped_at - , count(DISTINCT line_item_id) AS line_count - , sum(quantity_shipped) AS total_quantity - , sum(quantity_shipped * unit_price) AS shipment_revenue - FROM joined - GROUP BY order_id, merchant_id, customer_id, order_status, is_test, ordered_at, paid_at, shipment_id, shipped_at + , count(DISTINCT shipment_id) AS shipment_count + , min(shipped_at) AS shipped_at + FROM {{ ref('stg_shipments') }} + GROUP BY order_id ) -, shipment_counts AS ( +, order_refunds AS ( SELECT order_id - , count(DISTINCT shipment_id) AS shipment_count - FROM shipment_totals + , sum(refund_amount) AS refund_amount + , sum(cash_refund_amount) AS cash_refund_amount + , sum(store_credit_amount) AS store_credit_issued + FROM {{ ref('refund_fact') }} GROUP BY order_id ) , enriched AS ( SELECT - st.order_id - , st.merchant_id + o.order_id + , o.merchant_id , m.merchant_name - , st.customer_id + , o.customer_id , m.customer_type - , st.order_status - , st.is_test - , st.ordered_at - , st.paid_at - , st.shipped_at - , sc.shipment_count - , st.line_count - , st.total_quantity - , st.shipment_revenue AS revenue - FROM shipment_totals AS st + , o.order_status + , o.is_test + , o.ordered_at + , o.paid_at + , 
s.shipped_at + , coalesce(s.shipment_count, 0) AS shipment_count + , r.line_count + , r.total_quantity + , r.revenue + , coalesce(rf.refund_amount, 0) AS refund_amount + , coalesce(rf.cash_refund_amount, 0) AS cash_refund_amount + , coalesce(rf.store_credit_issued, 0) AS store_credit_issued + , r.revenue - coalesce(rf.cash_refund_amount, 0) AS net_revenue + FROM {{ ref('stg_orders') }} AS o + LEFT JOIN order_revenue AS r + ON o.order_id = r.order_id + LEFT JOIN order_shipments AS s + ON o.order_id = s.order_id + LEFT JOIN order_refunds AS rf + ON o.order_id = rf.order_id LEFT JOIN {{ ref('lkp_merchants') }} AS m - ON st.merchant_id = m.merchant_id - LEFT JOIN shipment_counts AS sc - ON st.order_id = sc.order_id + ON o.merchant_id = m.merchant_id ) SELECT @@ -100,11 +87,13 @@ SELECT , line_count , total_quantity , revenue + , refund_amount + , cash_refund_amount + , store_credit_issued + , net_revenue , current_timestamp AS created_at_dwh , current_timestamp AS updated_at_dwh FROM enriched {% if is_incremental() %} WHERE ordered_at >= {{ get_incremental_value('updated_at_dwh') }} {% endif %} --- dedupe to one row per order (orders can have multiple shipments) -QUALIFY row_number() OVER (PARTITION BY order_id ORDER BY shipped_at) = 1 diff --git a/models/orders/dw/order_line_fact.sql b/models/orders/dw/order_line_fact.sql index 18b7858..29e80a1 100644 --- a/models/orders/dw/order_line_fact.sql +++ b/models/orders/dw/order_line_fact.sql @@ -3,6 +3,19 @@ unique_key='line_item_id' ) }} +-- See order_fact for the same caveat: refund columns on already-loaded rows +-- go stale under the current incremental strategy. Run `make full` to keep +-- per-line refund totals in sync. Canonical refund data lives in +-- refund_line_fact. 
+ +WITH line_refunds AS ( + SELECT + line_item_id + , sum(cash_refund_amount) AS cash_refund_amount + FROM {{ ref('refund_line_fact') }} + GROUP BY line_item_id +) + SELECT li.line_item_id , li.order_id @@ -10,9 +23,13 @@ SELECT , li.quantity , li.unit_price , li.quantity * li.unit_price AS line_revenue + , coalesce(lr.cash_refund_amount, 0) AS cash_refund_amount + , li.quantity * li.unit_price - coalesce(lr.cash_refund_amount, 0) AS net_line_revenue , current_timestamp AS created_at_dwh , current_timestamp AS updated_at_dwh FROM {{ ref('stg_line_items') }} AS li +LEFT JOIN line_refunds AS lr + ON li.line_item_id = lr.line_item_id {% if is_incremental() %} WHERE li.line_item_id NOT IN (SELECT t.line_item_id FROM {{ this }} AS t) diff --git a/models/orders/dw/orders_dw.yml b/models/orders/dw/orders_dw.yml index 72d5811..9309c56 100644 --- a/models/orders/dw/orders_dw.yml +++ b/models/orders/dw/orders_dw.yml @@ -2,7 +2,7 @@ version: 2 models: - name: order_fact - description: One row per order with revenue and shipment metadata. + description: One row per order with revenue, refund, and shipment metadata. columns: - name: order_id description: Primary key. @@ -25,13 +25,31 @@ models: - name: total_quantity - name: revenue description: '{{ doc("order_fact_revenue") }}' + - name: refund_amount + description: | + Total refunded for this order in dollars, deduped across Shopify, + Stripe, and internal POS. Includes all tenders (card, store credit, + POS). For per-event detail see refund_fact. + - name: cash_refund_amount + description: | + Portion of `refund_amount` that reduces revenue — excludes store + credit (Decision #3: store credit is a deferred liability). Equals + `refund_amount - store_credit_issued`. + - name: store_credit_issued + description: | + Portion of `refund_amount` issued as store credit. Tracked as a + deferred liability — does NOT reduce net_revenue at issuance. 
+ Will offset future-order revenue when redemption data lands + (follow-up; see refunds presentation). + - name: net_revenue + description: | + `revenue - cash_refund_amount`. Bucket by `ordered_at` for time- + series views (Decision #5: refunds attribute back to order date). - name: created_at_dwh - name: updated_at_dwh - name: order_line_fact - description: | - Skeleton — one row per order line. Will be extended with refund allocations - in DATA-456 (Problem 2 / refunds work). + description: One row per order line with line revenue, line-level refund, and net. columns: - name: line_item_id tests: @@ -42,3 +60,85 @@ models: - name: quantity - name: unit_price - name: line_revenue + - name: cash_refund_amount + description: | + Portion of cash refunds (not store credit) attributed to this + line. Direct from Shopify when source had `line_item_id`; pro-rata + by line revenue otherwise (Decision #4). + - name: net_line_revenue + description: '`line_revenue - cash_refund_amount`.' + + - name: refund_fact + description: | + One row per canonical refund event, deduped across Shopify, Stripe, + and internal POS. Event identity = (order_id, refunded_at). + columns: + - name: refund_event_id + description: Primary key. md5(order_id || refunded_at). + tests: + - unique + - not_null + - name: order_id + description: FK to order_fact. + tests: + - not_null + - name: refunded_at + description: | + Event timestamp. For Stripe rows this is `processed_at`; for + Shopify and POS, `refunded_at`. Used as the join key for cross- + source dedup. + - name: refund_amount + description: | + Canonical total refunded for this event in dollars. Precedence + when same event recorded in multiple sources (Decision #1): + shopify > stripe > internal_pos. + - name: canonical_source + description: Which source provided `refund_amount`. + - name: source_systems + description: Comma-separated list of all sources that recorded this event. 
+ - name: has_line_attribution + description: TRUE when Shopify recorded line-level detail for this event. + - name: stripe_settled_amount + description: | + Sum of Stripe rows where `tender_type='card'` for this event. + Use this column for Stripe settlement reconciliation + (Decision #2 — card only). + - name: store_credit_amount + description: | + Sum of Stripe rows where `tender_type='store_credit'` for this + event. Aggregates up to `order_fact.store_credit_issued`. + - name: cash_refund_amount + description: | + `refund_amount - store_credit_amount`. The portion that reduces + net revenue. + - name: created_at_dwh + - name: updated_at_dwh + + - name: refund_line_fact + description: | + One row per (refund_event, line_item). Direct from Shopify when source + had `line_item_id`; pro-rata by line revenue otherwise (Decision #4). + columns: + - name: refund_event_id + description: FK to refund_fact. + tests: + - not_null + - name: order_id + tests: + - not_null + - name: line_item_id + description: FK to order_line_fact. + tests: + - not_null + - name: refunded_at + - name: canonical_source + - name: refund_amount + description: Allocated refund amount for this (event, line). + - name: cash_refund_amount + description: | + Line's share of the event's cash refund. Scales by the event-level + cash-vs-store-credit ratio. + - name: store_credit_amount + description: Line's share of the event's store credit refund. + - name: created_at_dwh + - name: updated_at_dwh diff --git a/models/orders/dw/refund_fact.sql b/models/orders/dw/refund_fact.sql new file mode 100644 index 0000000..b4f92be --- /dev/null +++ b/models/orders/dw/refund_fact.sql @@ -0,0 +1,77 @@ +{{ config( + materialized='table' +) }} + +-- One row per CANONICAL refund event, deduped across the three raw sources. +-- An "event" is identified by (order_id, refunded_at) — Stripe tender-type +-- splits and cross-source duplicates collapse into a single row. 
+-- +-- Canonical refund_amount precedence (Decision #1): shopify > stripe > pos. +-- - Shopify wins when present because it carries line-level attribution. +-- - Within Stripe, rows are summed across tender_type for the event. +-- +-- stripe_settled_amount = sum of Stripe rows where tender_type='card' +-- (matches what Stripe actually settles — Decision #2). +-- +-- store_credit_amount = sum of Stripe rows where tender_type='store_credit'. +-- Per Decision #3, store credit is a deferred liability, so it's tracked +-- separately and excluded from cash_refund_amount. +-- +-- cash_refund_amount = refund_amount − store_credit_amount. +-- This is the portion that reduces net revenue. + +WITH per_source_event AS ( + SELECT + order_id + , refunded_at + , source_system + , sum(refund_amount) AS source_amount + FROM {{ ref('stg_refunds') }} + GROUP BY order_id, refunded_at, source_system +) + +, stripe_tender AS ( + SELECT + order_id + , refunded_at + , sum(CASE WHEN tender_type = 'card' THEN refund_amount ELSE 0 END) AS stripe_card_amount + , sum(CASE WHEN tender_type = 'store_credit' THEN refund_amount ELSE 0 END) AS stripe_store_credit_amount + FROM {{ ref('stg_refunds') }} + WHERE source_system = 'stripe' + GROUP BY order_id, refunded_at +) + +, events AS ( + SELECT + order_id + , refunded_at + , max(CASE WHEN source_system = 'shopify' THEN source_amount END) AS shopify_amount + , max(CASE WHEN source_system = 'stripe' THEN source_amount END) AS stripe_amount + , max(CASE WHEN source_system = 'internal_pos' THEN source_amount END) AS pos_amount + , string_agg(DISTINCT source_system, ',' ORDER BY source_system) AS source_systems + FROM per_source_event + GROUP BY order_id, refunded_at +) + +SELECT + md5(e.order_id || '|' || cast(e.refunded_at AS varchar)) AS refund_event_id + , e.order_id + , e.refunded_at + , coalesce(e.shopify_amount, e.stripe_amount, e.pos_amount) AS refund_amount + , CASE + WHEN e.shopify_amount IS NOT NULL THEN 'shopify' + WHEN e.stripe_amount IS 
NOT NULL THEN 'stripe' + ELSE 'internal_pos' + END AS canonical_source + , e.source_systems + , (e.shopify_amount IS NOT NULL) AS has_line_attribution + , coalesce(st.stripe_card_amount, 0) AS stripe_settled_amount + , coalesce(st.stripe_store_credit_amount, 0) AS store_credit_amount + , coalesce(e.shopify_amount, e.stripe_amount, e.pos_amount) + - coalesce(st.stripe_store_credit_amount, 0) AS cash_refund_amount + , current_timestamp AS created_at_dwh + , current_timestamp AS updated_at_dwh +FROM events AS e +LEFT JOIN stripe_tender AS st + ON e.order_id = st.order_id + AND e.refunded_at = st.refunded_at diff --git a/models/orders/dw/refund_line_fact.sql b/models/orders/dw/refund_line_fact.sql new file mode 100644 index 0000000..7aca067 --- /dev/null +++ b/models/orders/dw/refund_line_fact.sql @@ -0,0 +1,83 @@ +{{ config( + materialized='table' +) }} + +-- One row per (refund_event, line_item). +-- +-- Two allocation paths (Decision #4): +-- - Direct line attribution from Shopify rows when the event has a +-- line_item_id (`has_line_attribution = TRUE` on refund_fact). +-- - Pro-rata by line_revenue (= quantity * unit_price) across the order's +-- lines when the source didn't tell us which line was refunded. +-- +-- Per-line tender breakdown (cash vs store_credit) is allocated by scaling +-- the line's share against the EVENT-level cash/store_credit split. We don't +-- know which line a Stripe card-vs-store-credit dollar belongs to, so we +-- spread the tender ratio evenly across all lines in the event. + +WITH refund_events AS ( + SELECT * FROM {{ ref('refund_fact') }} +) + +, shopify_line_attributions AS ( + -- Direct line attribution. Sum in case Shopify has multiple rows for + -- the same (event, line). 
+ SELECT + re.refund_event_id + , re.order_id + , sr.line_item_id + , sum(sr.refund_amount) AS line_refund_amount + FROM refund_events AS re + INNER JOIN {{ ref('stg_refunds') }} AS sr + ON re.order_id = sr.order_id + AND re.refunded_at = sr.refunded_at + WHERE sr.source_system = 'shopify' + GROUP BY re.refund_event_id, re.order_id, sr.line_item_id +) + +, order_line_revenue AS ( + SELECT + order_id + , line_item_id + , quantity * unit_price AS line_revenue + FROM {{ ref('stg_line_items') }} +) + +, prorata_allocations AS ( + -- For events WITHOUT line attribution, allocate the event's refund + -- across the order's lines pro-rata by line revenue. + SELECT + re.refund_event_id + , re.order_id + , olr.line_item_id + , re.refund_amount * olr.line_revenue + / nullif(sum(olr.line_revenue) OVER (PARTITION BY re.refund_event_id), 0) + AS line_refund_amount + FROM refund_events AS re + INNER JOIN order_line_revenue AS olr + ON re.order_id = olr.order_id + WHERE NOT re.has_line_attribution +) + +, line_allocations AS ( + SELECT * FROM shopify_line_attributions + UNION ALL + SELECT * FROM prorata_allocations +) + +SELECT + la.refund_event_id + , la.order_id + , la.line_item_id + , re.refunded_at + , re.canonical_source + , la.line_refund_amount AS refund_amount + , la.line_refund_amount * re.cash_refund_amount + / nullif(re.refund_amount, 0) AS cash_refund_amount + , la.line_refund_amount * re.store_credit_amount + / nullif(re.refund_amount, 0) AS store_credit_amount + , current_timestamp AS created_at_dwh + , current_timestamp AS updated_at_dwh +FROM line_allocations AS la +INNER JOIN refund_events AS re + ON la.refund_event_id = re.refund_event_id diff --git a/models/orders/staging/stg_refunds.sql b/models/orders/staging/stg_refunds.sql new file mode 100644 index 0000000..7cf2bf9 --- /dev/null +++ b/models/orders/staging/stg_refunds.sql @@ -0,0 +1,42 @@ +{{ config(materialized='view') }} + +-- Unions the three refund sources into a canonical schema. 
+-- One row per *source* refund record (NOT yet deduped across sources). +-- Source dedup + canonical refund-event grain lands in refund_fact. + +SELECT + refund_id + , 'shopify' AS source_system + , order_id + , line_item_id + , qty_refunded + , CAST(NULL AS varchar) AS tender_type + , amount_in_cents / 100.0 AS refund_amount + , CAST(refunded_at AS timestamp) AS refunded_at +FROM {{ ref('base_refunds_shopify') }} + +UNION ALL + +SELECT + refund_id + , 'stripe' AS source_system + , order_id + , CAST(NULL AS varchar) AS line_item_id + , CAST(NULL AS integer) AS qty_refunded + , tender_type + , amount_in_cents / 100.0 AS refund_amount + , CAST(processed_at AS timestamp) AS refunded_at +FROM {{ ref('base_refunds_stripe') }} + +UNION ALL + +SELECT + refund_id + , 'internal_pos' AS source_system + , order_id + , CAST(NULL AS varchar) AS line_item_id + , CAST(NULL AS integer) AS qty_refunded + , CAST(NULL AS varchar) AS tender_type + , amount_in_cents / 100.0 AS refund_amount + , CAST(refunded_at AS timestamp) AS refunded_at +FROM {{ ref('base_refunds_internal_pos') }} diff --git a/tests/order_fact_revenue_reconciliation.sql b/tests/order_fact_revenue_reconciliation.sql index 5c6952d..7bc5dc3 100644 --- a/tests/order_fact_revenue_reconciliation.sql +++ b/tests/order_fact_revenue_reconciliation.sql @@ -1,4 +1,4 @@ -{{ config(severity='warn') }} +{{ config(severity='error') }} -- Reconciles total order_fact.revenue against summed line items for non-test orders. -- Returns rows when the discrepancy exceeds $1 — that is, when something is broken. diff --git a/tests/refund_line_to_order_reconciliation.sql b/tests/refund_line_to_order_reconciliation.sql new file mode 100644 index 0000000..15134f3 --- /dev/null +++ b/tests/refund_line_to_order_reconciliation.sql @@ -0,0 +1,31 @@ +{{ config(severity='error') }} + +-- Reconciles per-order sum of order_line_fact.cash_refund_amount against +-- order_fact.cash_refund_amount. 
Same invariant DATA-123 broke for revenue +-- (sum-of-lines ≠ order total), now defended for cash refunds. +-- Returns rows when any order's discrepancy exceeds $0.01. + +WITH line_totals AS ( + SELECT + order_id + , sum(cash_refund_amount) AS line_cash_refund_total + FROM {{ ref('order_line_fact') }} + GROUP BY order_id +) + +, order_totals AS ( + SELECT + order_id + , cash_refund_amount AS order_cash_refund + FROM {{ ref('order_fact') }} +) + +SELECT + o.order_id + , o.order_cash_refund + , coalesce(l.line_cash_refund_total, 0) AS line_cash_refund_total + , o.order_cash_refund - coalesce(l.line_cash_refund_total, 0) AS discrepancy +FROM order_totals AS o +LEFT JOIN line_totals AS l + ON o.order_id = l.order_id +WHERE abs(o.order_cash_refund - coalesce(l.line_cash_refund_total, 0)) > 0.01
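For reviewers who want to sanity-check the dedup and allocation rules without running dbt, here is a minimal Python sketch of the same logic. It is an illustration only, not the models in this diff. The refund rows are copied from the appendix sample. The function names (`dedupe_events`, `allocate_pro_rata`) and the line revenues used for O000286 are hypothetical. Events are keyed on `(order_id, refunded_at)` and use the default precedence shopify > stripe > internal_pos, which assumes Decision #1 lands on the default.

```python
from collections import defaultdict
from decimal import Decimal

# (source_system, order_id, tender_type, amount, refunded_at), from the appendix sample.
ROWS = [
    ("stripe", "O000286", "card", Decimal("1925.22"), "2025-03-09"),
    ("stripe", "O005064", "card", Decimal("894.52"), "2026-04-18"),
    ("stripe", "O005064", "store_credit", Decimal("894.53"), "2026-04-18"),
    ("stripe", "O007544", "card", Decimal("951.42"), "2025-11-21"),
    ("stripe", "O007544", "store_credit", Decimal("951.43"), "2025-11-21"),
    ("shopify", "O000015", None, Decimal("367.70"), "2026-02-01"),
    ("shopify", "O005064", None, Decimal("1789.05"), "2026-04-18"),
    ("shopify", "O007544", None, Decimal("1902.85"), "2025-11-21"),
    ("internal_pos", "O009009", None, Decimal("1212.11"), "2026-05-11"),
]

# Assumed default for Decision #1: first source present wins.
PRECEDENCE = ("shopify", "stripe", "internal_pos")


def dedupe_events(rows):
    """Collapse source rows into one event per (order_id, refunded_at)."""
    per_source = defaultdict(lambda: defaultdict(Decimal))  # Decimal() == 0
    card = defaultdict(Decimal)
    store_credit = defaultdict(Decimal)
    for source, order_id, tender, amount, refunded_at in rows:
        key = (order_id, refunded_at)
        per_source[key][source] += amount
        if source == "stripe" and tender == "card":
            card[key] += amount
        elif source == "stripe" and tender == "store_credit":
            store_credit[key] += amount

    events = {}
    for key, amounts in per_source.items():
        canonical = next(s for s in PRECEDENCE if s in amounts)
        refund_amount = amounts[canonical]
        events[key] = {
            "canonical_source": canonical,
            "refund_amount": refund_amount,
            "stripe_settled_amount": card[key],
            "store_credit_amount": store_credit[key],
            "cash_refund_amount": refund_amount - store_credit[key],
        }
    return events


def allocate_pro_rata(refund_amount, line_revenues):
    """Spread an order-grain refund across lines by share of line revenue."""
    total = sum(line_revenues.values())
    if total == 0:
        return {}  # nothing to allocate against
    return {line: refund_amount * rev / total for line, rev in line_revenues.items()}


events = dedupe_events(ROWS)
naive_total = sum(row[3] for row in ROWS)
deduped_total = sum(e["refund_amount"] for e in events.values())
print("naive union:", naive_total, "deduped:", deduped_total)

# Hypothetical line revenues for O000286 (Stripe only, so no line detail).
o286 = events[("O000286", "2025-03-09")]
lines = allocate_pro_rata(
    o286["refund_amount"],
    {"line_a": Decimal("1500.00"), "line_b": Decimal("500.00")},
)
print(lines)
```

The gap between the two totals is exactly the Shopify amounts for O005064 and O007544, the two events recorded in both Shopify and Stripe. The allocated lines sum back to the event total, which is the invariant the `refund_line_to_order_reconciliation` test defends at the order grain.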