fix(order_fact): compute revenue from line items (DATA-123)#5
fix(order_fact): compute revenue from line items (DATA-123)#5aliciatoshima wants to merge 3 commits into
Conversation
`order_fact.revenue` was summing `quantity_shipped * unit_price` over the first shipment per order, which under-reported Q1 revenue by ~$2.47M vs Stripe captures. Two interacting bugs: - INNER JOIN to `stg_line_items` inside the shipments CTE meant orders with no shipments yet contributed NULL revenue (~$1.55M lost). - `QUALIFY row_number() ... = 1` kept only the first shipment per order, silently dropping revenue from later shipments (~$0.92M lost). Per `orders_dw.md` (`order_fact_revenue` doc), revenue is the sum of `quantity * unit_price` across an order's line items, gross of refunds. Restructured the model to compute revenue/line_count/total_quantity from `stg_line_items` at the order grain, and shipment_count/shipped_at from `stg_shipments` separately. The dedup `QUALIFY` is no longer needed. Also bumps `tests/order_fact_revenue_reconciliation.sql` from `warn` to `error` (per CONTRIBUTING.md: `severity: warn` is for noisy upstream conditions, not for "we know this is broken"). After fix: `order_fact` non-test revenue = \$12,989,886.01, matches the Stripe captures total exactly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughThis PR refactors the Warning Review ran into problems🔥 ProblemsErrors were encountered while retrieving linked issues. Errors (1)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
models/orders/dw/order_fact.sql (1)
71-72:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftIncremental filtering still misses late changes on existing orders.
Line 72 only reloads rows whose
ordered_atis newer than the last model run. If a shipment is added later, or line items are corrected for an older order, that order will never be recomputed, soshipment_count,shipped_at, andrevenuecan drift again after the initial load.This should key incremental processing off changed
order_ids from the contributing sources instead ofordered_at. At minimum, this rollout needs a full refresh or targeted backfill so historical rows pick up the new logic.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/orders/dw/order_fact.sql` around lines 71 - 72, The incremental WHERE currently uses ordered_at (get_incremental_value('updated_at_dwh')) which misses later changes to existing orders; change the incremental logic in the is_incremental() branch so rows are selected by changed order_id from upstream sources (e.g., compare orders.updated_at, shipments.updated_at, line_items.updated_at) instead of ordered_at — for example, replace the single ordered_at filter with a predicate that reloads order_id values whose latest source updated_at >= get_incremental_value('updated_at_dwh') (or perform a full_refresh/targeted backfill if you cannot derive changed order_ids), so computed fields like shipment_count, shipped_at, and revenue are recomputed when any contributing source changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@models/orders/dw/order_fact.sql`:
- Around line 71-72: The incremental WHERE currently uses ordered_at
(get_incremental_value('updated_at_dwh')) which misses later changes to existing
orders; change the incremental logic in the is_incremental() branch so rows are
selected by changed order_id from upstream sources (e.g., compare
orders.updated_at, shipments.updated_at, line_items.updated_at) instead of
ordered_at — for example, replace the single ordered_at filter with a predicate
that reloads order_id values whose latest source updated_at >=
get_incremental_value('updated_at_dwh') (or perform a full_refresh/targeted
backfill if you cannot derive changed order_ids), so computed fields like
shipment_count, shipped_at, and revenue are recomputed when any contributing
source changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 97bdd27c-eeb8-4369-bf7b-e6bb3dadc206
📒 Files selected for processing (2)
models/orders/dw/order_fact.sqltests/order_fact_revenue_reconciliation.sql
Paste-ready Google Slides outline for the upcoming refunds-into-warehouse work. Covers the DATA-123 fix recap, the 4 grain traps spotted in raw refund data, the proposed `refund_fact` / `refund_line_fact` structure plus denormalized columns on `order_fact` / `order_line_fact`, and the status of the 5 design decisions (3 locked: store-credit-as-deferred- liability, pro-rata per-line allocation, order-date attribution; 2 open pending Jamie: cross-source dedup precedence and Stripe settlement scope). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…der facts
Brings the three raw refund sources (Shopify, Stripe, internal POS) into the
warehouse as two new facts plus denormalized columns on order_fact /
order_line_fact, per the finance readout deck.
New models:
- base_refunds_{shopify,stripe,internal_pos}.sql
- stg_refunds.sql (unioned canonical schema; cents -> dollars)
- refund_fact.sql (one row per CANONICAL refund event, identified by
(order_id, refunded_at); de-dupes cross-source duplicates and Stripe
tender-type splits in a single pass)
- refund_line_fact.sql (one row per (event, line); direct line_item_id
from Shopify when present, pro-rata by line_revenue otherwise)
Updated order_fact / order_line_fact:
- order_fact gains refund_amount, cash_refund_amount, store_credit_issued,
net_revenue. net_revenue = revenue - cash_refund_amount (store credit is
a deferred liability per Decision #3, not a revenue reversal).
- order_line_fact gains cash_refund_amount and net_line_revenue.
Design decisions baked into the model (per the deck, finance-approved):
- #1 Cross-source dedup precedence: shopify > stripe > internal_pos
(Shopify wins because it carries line attribution).
- #2 Stripe settlement scope: stripe_settled_amount = sum of Stripe rows
with tender_type='card' for the event (matches Stripe settlement reports).
- #3 Store credit treatment: deferred liability. Excluded from
cash_refund_amount and net_revenue; tracked separately as
store_credit_issued.
- #4 Per-line allocation when source lacks line_item_id: pro-rata by line
revenue.
- #5 Date attribution for net revenue by date: bucket by ordered_at (the
refund attributes back to its order's date for a stable accrual view).
Tests added:
- unique + not_null on refund_event_id (refund_fact PK).
- not_null on refund_event_id, order_id, line_item_id (refund_line_fact).
- refund_line_to_order_reconciliation (severity=error): per-order
sum(order_line_fact.cash_refund_amount) must equal
order_fact.cash_refund_amount. Same invariant DATA-123 broke for revenue,
now guarded for cash refunds.
Verification (non-test orders):
- Revenue: $12,989,886.01 (DATA-123 still ties to Stripe captures)
- Refund total: $7,196.93
- Cash refunds: $5,350.97
- Store credit issued: $1,845.96
- Net revenue: $12,984,535.04
All 23 models build clean, all 19 tests pass.
Known follow-ups (out of scope, called out in deck):
- Store-credit redemption modeling once we get a redemption feed (today
store_credit_issued just accumulates as a liability column).
- Refund-aware incremental strategy for order_fact / order_line_fact (the
refund_* columns on already-loaded rows go stale between full refreshes;
refund_fact remains canonical).
- Stripe settlement-report ingestion to automate the recon.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Fixes the Q1 revenue reconciliation gap reported in DATA-123.
order_fact.revenuewas under-reporting by ~$2.47M versus the Stripe captures total.Root cause was two interacting bugs in
models/orders/dw/order_fact.sql:stg_line_itemsinside the shipment CTE meant orders that hadn't shipped yet contributedNULLrevenue. ~$1.55M lost across 1,374 orders.QUALIFY row_number() … = 1kept only the first shipment per order, silently dropping revenue from later shipments. ~$0.92M lost.Per
models/orders/dw/orders_dw.md(theorder_fact_revenuedoc), revenue issum(quantity * unit_price)across an order's line items, gross of refunds. Restructured the model to compute revenue / line_count / total_quantity fromstg_line_itemsat the order grain, and shipment_count / shipped_at fromstg_shipmentsseparately. TheQUALIFYdedup is no longer needed.Also bumps the existing reconciliation test from
severity='warn'toerror, perCONTRIBUTING.md: "severity: warn is for noisy upstream conditions, not for 'we know this is broken.'"Reconciliation
order_factnon-test revenueTest plan
make full— clean rebuild succeeds (17/17)make test— all 12 tests pass, including the now-errorreconciliation testsqlfluff lint models/orders/dw/order_fact.sql— cleansum(revenue) where !is_testinorder_factmatchessum(quantity * unit_price)fromstg_line_itemsfor non-test ordersFollow-ups (not in this PR)
daily_revenue.sql(and any other consumers oforder_fact) auto-corrects since it selects fromorder_fact.updated_at_dwh = current_timestamp, which means the watermark advances by wall-clock rather than by data time. Worth a separate ticket — out of scope here.tests/order_fact_revenue_reconciliation.sql(lowercase keywords) left alone to keep PR scope tight.🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests