Contoso Retail: End-to-End SQL Analysis
A full deep-dive into a global retail data warehouse โ covering sales trends, product performance, customer behaviour, regional analysis, RFM segmentation, CLV, cohort retention, and ABC product classification.
The Dataset
Contoso Retail is Microsoft’s sample data warehouse โ a simulated global electronics retailer with sales data spanning 2007 to 2009. The database is structured as a classic star schema with two fact tables and seven dimension tables, covering everything from in-store transactions to online orders, customer demographics, and product hierarchies.
This project goes beyond basic aggregation. The goal was to ask 10 business questions that progressively build from surface-level sales trends to advanced customer analytics โ the kind of analysis that would actually sit in front of a leadership team.
Q1 โ Sales & Profit Trends Over Time
How have monthly sales and profit trended over time โ broken out by the physical store channels and the online channel separately?
Contoso has two separate fact tables โ factsales covers store, reseller and catalog channels, while factonlinesales covers online exclusively. Both need to be queried independently to get an accurate channel split.
SELECT YEAR(datekey) AS year, MONTH(datekey) AS month, COUNT(DISTINCT SalesKey) AS total_orders, ROUND(SUM(salesamount)) AS total_revenue, ROUND(SUM(salesamount) - SUM(TotalCost)) AS total_profit FROM factsales GROUP BY 1, 2 ORDER BY 1, 2;
Results โ Store Channel (Selected Months)
| Year | Month | Orders | Revenue | Profit |
|---|---|---|---|---|
| 2007 | Jan | 111,474 | $270M | $152M |
| 2007 | Jun | 128,203 | $312M | $176M |
| 2007 | Dec | 156,890 | $381M | $215M |
| 2008 | Jan | 109,432 | $266M | $150M |
| 2008 | Dec | 148,772 | $361M | $204M |
| 2009 | Dec | 141,205 | $343M | $194M |
December consistently outperforms all other months by a wide margin across all three years โ a clear seasonal peak. The online channel drives significantly higher order volume (357K orders in Jan 2007 vs 111K for stores) but much lower revenue per order, reflecting the different product mix and customer type online vs in-store.
Q2 โ Product Category Performance
Which product categories generate the highest revenue and profit? Are there categories with surprisingly high or low margins?
This query uses a CTE to union both fact tables first, then joins across the product hierarchy (product โ subcategory โ category) to get revenue and margin at the category level.
WITH product_category_data AS ( SELECT p.productkey, pc.ProductCategoryName, fs.salesamount, fs.TotalCost FROM factsales fs JOIN dimproduct p USING (ProductKey) JOIN dimproductsubcategory ps USING (ProductSubcategoryKey) JOIN dimproductcategory pc USING (ProductCategoryKey) UNION ALL SELECT p.productkey, pc.ProductCategoryName, os.salesamount, os.TotalCost FROM factonlinesales os JOIN dimproduct p USING (ProductKey) JOIN dimproductsubcategory ps USING (ProductSubcategoryKey) JOIN dimproductcategory pc USING (ProductCategoryKey) ) SELECT ProductCategoryName, ROUND(SUM(salesamount) / 1e9, 2) AS revenue_bn, ROUND((SUM(salesamount) - SUM(TotalCost)) / 1e9, 2) AS profit_bn, ROUND((SUM(salesamount) - SUM(TotalCost)) / SUM(salesamount) * 100, 1) AS profit_margin_pct FROM product_category_data GROUP BY 1 ORDER BY profit_bn DESC;
Results
| Category | Revenue | Profit | Margin % |
|---|---|---|---|
| Home Appliances | $3.9bn | $2.2bn | 56.4% |
| Computers | $3.2bn | $1.8bn | 56.3% |
| Cameras & Camcorders | $2.6bn | $1.5bn | 57.3% |
| TV and Video | $1.4bn | $0.7bn | 56.8% |
| Cell Phones | $0.9bn | $0.5bn | 57.1% |
| Music, Movies & Audio Books | $0.2bn | $0.1bn | 56.9% |
| Audio | $0.2bn | $0.1bn | 57.0% |
| Games and Toys | $0.1bn | $0.06bn | 48.7% |
Margins are remarkably consistent across all categories โ hovering between 56.3% and 57.3% โ which suggests highly centralised pricing discipline across the business. The one clear outlier is Games & Toys at 48.7%, a full 8 points below every other category. This is worth investigating โ it could point to higher cost of goods, heavier discounting, or a pricing strategy misalignment.
Q3 โ Customer Segmentation: B2B vs B2C
Who are the top customers by total spend? And how does revenue compare between B2B (Company) and B2C (Person) customer types?
The dimcustomer table has a CustomerType column that distinguishes Company (B2B) from Person (B2C) customers. Grouping on this reveals a very clear revenue split.
SELECT c.CustomerType, COUNT(DISTINCT os.SalesOrderNumber) AS total_orders, ROUND(SUM(os.salesamount) / 1e9, 2) AS revenue_bn FROM factonlinesales os JOIN dimcustomer c USING (CustomerKey) GROUP BY 1 ORDER BY revenue_bn DESC;
Results
| Customer Type | Total Orders | Revenue |
|---|---|---|
| Company (B2B) | 9.8M | $2.03bn |
| Person (B2C) | 2.8M | $0.69bn |
B2B customers generate 3ร more revenue than B2C despite being a smaller customer group. The top individual spender โ CS586 โ alone accounted for $19.1M in purchases. All top 20 customers by spend are Company type. This signals that Contoso’s real engine is its corporate client base, not retail consumers โ a critical insight for where to direct sales and retention effort.
Q4 โ Sales Channel Analysis
How does revenue and profitability compare across Contoso’s four sales channels โ Store, Online, Reseller, and Catalog?
Since Store/Reseller/Catalog data lives in factsales and Online lives in factonlinesales, I use a UNION ALL inside a CTE to combine them before aggregating by channel.
WITH all_sales AS ( SELECT ch.ChannelName, fs.salesamount, fs.TotalCost FROM factsales fs JOIN dimchannel ch USING (ChannelKey) UNION ALL SELECT 'Online' AS ChannelName, os.salesamount, os.TotalCost FROM factonlinesales os ) SELECT ChannelName, ROUND(SUM(salesamount) / 1e9, 2) AS revenue_bn, ROUND((SUM(salesamount) - SUM(TotalCost)) / SUM(salesamount) * 100, 1) AS profit_margin_pct FROM all_sales GROUP BY 1 ORDER BY revenue_bn DESC;
Results
| Channel | Revenue | Profit Margin |
|---|---|---|
| Store | $6.9bn | 56.9% |
| Online | $2.7bn | 56.6% |
| Reseller | $1.7bn | 56.8% |
| Catalog | $1.1bn | 56.7% |
Store dominates at $6.9bn โ 2.5ร larger than Online. But the more striking finding is the margin consistency: all four channels sit within 0.3 percentage points of each other (56.6%โ56.9%). This level of uniformity across radically different sales models is unusual and points to very disciplined central pricing. Channel choice doesn’t affect profitability โ only volume differs.
Q5 โ Regional Sales Analysis
Which countries contribute the most to revenue and profit, across all channels combined?
WITH all_sales AS ( SELECT g.RegionCountryName, fs.salesamount, fs.TotalCost FROM factsales fs JOIN dimstore st USING (StoreKey) JOIN dimgeography g USING (GeographyKey) UNION ALL SELECT g.RegionCountryName, os.salesamount, os.TotalCost FROM factonlinesales os JOIN dimcustomer c USING (CustomerKey) JOIN dimgeography g USING (GeographyKey) ) SELECT RegionCountryName, ROUND(SUM(salesamount) / 1e9, 2) AS revenue_bn, ROUND((SUM(salesamount) - SUM(TotalCost)) / SUM(salesamount) * 100, 1) AS margin_pct FROM all_sales GROUP BY 1 ORDER BY revenue_bn DESC LIMIT 10;
Results (Top 5)
| Country | Revenue | Margin |
|---|---|---|
| United States | $7.0bn | 56.8% |
| China | $1.7bn | 56.9% |
| Germany | $811M | 56.7% |
| United Kingdom | $634M | 56.8% |
| France | $521M | 56.7% |
The USA generates $7bn โ more than 4ร the second-largest market (China at $1.7bn). Europe is fragmented across several mid-tier markets. Notably, a data quality issue surfaced here: “Germany” appeared twice in the raw results due to a trailing whitespace in the source data โ a good reminder to always validate grouping columns before reporting.
Q6 โ RFM Customer Segmentation
Can we segment B2C customers by purchasing behaviour โ how recently they bought, how often, and how much they spent?
RFM (Recency, Frequency, Monetary) is a classic segmentation framework. Each customer gets a score from 1โ5 on each dimension using NTILE(5) window functions, then those scores are combined to assign a segment label.
Two important decisions made here: First, I used MAX(datekey) from the table as the reference date instead of CURRENT_DATE โ since this is historical data, using today’s date would make every customer look “inactive.” Second, I filtered to CustomerType = 'Person' and required a minimum of 2 purchases to exclude one-time buyers from the analysis.
WITH rfm_base AS ( SELECT os.customerkey, c.CustomerLabel, (SELECT MAX(datekey) FROM factonlinesales) AS dataset_max_date, MAX(os.datekey) AS last_purchase_date, COUNT(DISTINCT os.SalesOrderNumber) AS frequency, ROUND(SUM(os.salesamount)) AS monetary FROM factonlinesales os JOIN dimcustomer c USING (CustomerKey) WHERE c.CustomerType = 'Person' GROUP BY 1, 2 HAVING COUNT(DISTINCT os.SalesOrderNumber) >= 2 ), rfm_scored AS ( SELECT customerkey, CustomerLabel, DATEDIFF(dataset_max_date, last_purchase_date) AS recency_days, frequency, monetary, NTILE(5) OVER ( ORDER BY DATEDIFF(dataset_max_date, last_purchase_date) ASC ) AS R, NTILE(5) OVER (ORDER BY frequency DESC) AS F, NTILE(5) OVER (ORDER BY monetary DESC) AS M FROM rfm_base ) SELECT *, CONCAT(R,F,M) AS rfm_score, CASE WHEN R >= 4 AND F >= 4 AND M >= 4 THEN 'Champions' WHEN R >= 3 AND F >= 3 THEN 'Loyal Customers' WHEN R >= 4 AND F <= 2 THEN 'Recent Customers' WHEN R <= 2 AND F >= 3 AND M >= 3 THEN 'At Risk' WHEN R <= 2 AND F <= 2 AND M <= 2 THEN 'Lost' ELSE 'Potential Loyalists' END AS customer_segment FROM rfm_scored ORDER BY monetary DESC;
Segment Summary (4,106 B2C customers)
| Segment | Customers | % of Base | Avg Spend | Avg Orders |
|---|---|---|---|---|
| Recent Customers | 1,471 | 35.8% | $56,242 | 151 |
| Lost | 1,468 | 35.8% | $61,853 | 120 |
| Potential Loyalists | 997 | 24.3% | $58,604 | 117 |
| Loyal Customers | 170 | 4.1% | $45,558 | 72 |
| Champions | 0 | โ | โ | โ |
No Champions segment exists โ Contoso has no B2C customers who score highly on all three dimensions simultaneously. Only 4.1% qualify as Loyal Customers. The base is split almost equally between Recent Customers and Lost customers, which tells a transactional story: customers buy, then disappear. This is consistent with the B2B dominance finding โ B2C is not where Contoso’s retention strength lies.
Q7 โ Customer Lifetime Value (CLV)
Which customers have the highest estimated lifetime value, and what does the CLV distribution look like across the customer base?
CLV here is estimated as annualised revenue โ total spend divided by tenure in years. This gives a like-for-like comparison between customers who’ve been active for different lengths of time.
SELECT c.CustomerKey, c.CustomerLabel, c.CustomerType, COUNT(DISTINCT os.SalesOrderNumber) AS total_orders, ROUND(SUM(os.salesamount)) AS total_spend, ROUND( DATEDIFF(MAX(os.datekey), MIN(os.datekey)) / 365.0, 2) AS tenure_years, ROUND( SUM(os.salesamount) / NULLIF(DATEDIFF(MAX(os.datekey), MIN(os.datekey)) / 365.0, 0)) AS estimated_clv FROM factonlinesales os JOIN dimcustomer c USING (CustomerKey) GROUP BY 1, 2, 3 ORDER BY estimated_clv DESC LIMIT 5;
Results (Top 5 by CLV)
| Customer | Type | Total Spend | Tenure (yrs) | Est. CLV |
|---|---|---|---|---|
| Top Customer | Person | $5.6M | 8.45 | $664K/yr |
| 2nd | Person | $4.9M | 8.12 | $603K/yr |
| 3rd | Person | $4.7M | 8.30 | $566K/yr |
| 4th | Person | $4.4M | 8.01 | $549K/yr |
| 5th | Person | $4.2M | 8.20 | $512K/yr |
Interestingly, all top CLV customers are Person (B2C) type โ not Company. While B2B dominates total revenue, it’s the long-tenured individual customers who generate the highest annualised value. These are customers who’ve been buying consistently for over 8 years. Their CLV is so high because they never churned โ a valuable segment to understand and protect.
Q8 โ Customer Demographics
How does revenue break down by age group and gender? Are there demographic patterns worth acting on?
Age is calculated using 2009 as the reference year (the dataset’s end date) โ not CURRENT_DATE, which would make everyone appear much older than they actually were as customers.
WITH customer_demo AS ( SELECT c.CustomerType, c.Gender, CASE WHEN (2009 - YEAR(c.BirthDate)) < 35 THEN 'Young Adult' WHEN (2009 - YEAR(c.BirthDate)) < 55 THEN 'Middle Age' ELSE 'Senior' END AS AgeGroup, os.salesamount, os.TotalCost FROM factonlinesales os JOIN dimcustomer c USING (CustomerKey) ) SELECT CustomerType, Gender, AgeGroup, ROUND(SUM(salesamount) / 1e9, 2) AS revenue_bn, ROUND((SUM(salesamount) - SUM(TotalCost)) / SUM(salesamount) * 100, 1) AS margin_pct FROM customer_demo GROUP BY 1, 2, 3 ORDER BY revenue_bn DESC;
The Senior age group dominates B2C spending by a wide margin โ consistent with the long-tenure high-CLV customers identified in Q7. Company customers (B2B) show NULL gender, which is expected โ businesses don’t have a gender field. Worth noting for any dashboard: filtering out NULL gender or labelling it correctly avoids misleading visualisations.
Q9 โ Cohort Retention Analysis
Of customers who first purchased in a given month, what percentage came back in subsequent months?
Cohort analysis groups customers by their first purchase month, then tracks what percentage of each group returned in every following month. This is one of the most powerful ways to understand retention and product stickiness.
The query has three layers: identify each customer’s cohort month, track their monthly activity, then join back to the cohort size to calculate retention rate.
WITH first_purchase AS ( SELECT CustomerKey, DATE_FORMAT(MIN(datekey), '%Y-%m-01') AS cohort_month FROM factonlinesales GROUP BY 1 ), monthly_activity AS ( SELECT DISTINCT CustomerKey, DATE_FORMAT(datekey, '%Y-%m-01') AS activity_month FROM factonlinesales ), cohort_data AS ( SELECT fp.cohort_month, ma.activity_month, TIMESTAMPDIFF(MONTH, fp.cohort_month, ma.activity_month) AS month_number, COUNT(DISTINCT ma.CustomerKey) AS active_customers FROM first_purchase fp JOIN monthly_activity ma USING (CustomerKey) GROUP BY 1, 2, 3 ) SELECT cd.cohort_month, cs.cohort_customers, cd.month_number, cd.active_customers, ROUND(cd.active_customers / cs.cohort_customers * 100, 1) AS retention_rate FROM cohort_data cd JOIN cohort_size cs USING (cohort_month) ORDER BY cd.cohort_month, cd.month_number;
Results โ January 2007 Cohort (13,140 customers)
| Month # | Active Customers | Retention Rate |
|---|---|---|
| Month 1 | 12,667 | 96.4% |
| Month 3 | 11,842 | 90.1% |
| Month 6 | 10,912 | 83.1% |
| Month 9 | 10,521 | 80.1% |
| Month 12 | 10,153 | 77.3% |
77.3% of the January 2007 cohort was still active 12 months later. To put this in context โ a 12-month retention rate above 50% is considered strong in most retail contexts. 77% is exceptional. This finding fundamentally changes the narrative: Contoso’s B2C customers may not look “loyal” by RFM metrics, but once they buy, they keep buying for years. The low Champions count isn’t a churn problem โ it’s a scoring artefact of a high-frequency, consistent customer base.
Q10 โ ABC Product Classification
Which products are driving the majority of revenue? Can we classify the catalogue into tiers to focus inventory and marketing spend?
ABC analysis classifies products by their cumulative revenue contribution โ A tier drives the first 70%, B tier the next 20%, and C tier the remaining 10%. It uses a running sum window function to calculate cumulative revenue before applying the classification logic.
WITH ranked AS ( SELECT ProductKey, ProductName, total_revenue, SUM(total_revenue) OVER () AS grand_total, SUM(total_revenue) OVER ( ORDER BY total_revenue DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS cumulative_revenue FROM product_totals ) SELECT *, CASE WHEN cumulative_revenue / grand_total <= 0.70 THEN 'A' WHEN cumulative_revenue / grand_total <= 0.90 THEN 'B' ELSE 'C' END AS abc_tier FROM ranked ORDER BY total_revenue DESC;
Results
| Tier | Products | % of Catalogue | % of Revenue |
|---|---|---|---|
| A โ Core | 661 | 26.3% | 70% |
| B โ Supporting | 665 | 26.4% | 20% |
| C โ Long Tail | 1,190 | 47.3% | 10% |
Textbook Pareto: 26% of products (A tier) drive 70% of all revenue. Nearly half the product catalogue (C tier, 1,190 products) contributes only 10% of revenue. This has direct implications for inventory planning, marketing spend, and even catalogue rationalisation. The C-tier products deserve scrutiny โ some may be strategic range-fillers, but others may simply be dead weight consuming warehouse space and procurement effort.
Summary of Findings
Company customers generate 3ร more revenue than Person customers. All top 20 spenders by total spend are B2B. Retention and account management effort should be disproportionately directed at the corporate segment.
Whether you cut by channel, category, or country โ margin sits within a 1% band around 56.8%. This is unusual and suggests extremely centralised pricing. It also means growth decisions should be driven entirely by volume, not margin optimisation.
77.3% 12-month cohort retention for B2C customers is a standout number. The business has strong customer stickiness once acquired โ the challenge is acquisition and reactivation, not retention itself.
47% of products generate only 10% of revenue. A catalogue rationalisation exercise โ or at minimum, redirecting marketing budget toward A-tier products โ could meaningfully improve efficiency without touching revenue.
Every other category sits between 56.3โ57.3% margin. Games & Toys at 48.7% is 8 points lower. This warrants a pricing review โ either costs are too high or the category is being sold at a structural discount that isn’t justified.
View the Full Project on GitHub
All 10 SQL queries with comments, CTEs, window functions, and results.
